Memory profiler causes repeated OOMKills in production - invisible to all Python-level monitoring (v4.1-v4.4) #16491

@mikkelam

Description

Tracer Version(s)

4.4.0

Python Version(s)

3.12.12

Pip Version(s)

uv 0.9

Bug Report

The ddtrace memory profiler causes sudden, catastrophic OOMKills in our production Kubernetes pods. The allocations happen outside the normal Python allocator, making them invisible to tracemalloc, gc stats, and RSS tracking within the application, which made the issue extremely difficult to diagnose.
My team spent far too long debugging repeated OOMKills before identifying the memory profiler as the root cause. Ironically, ddtrace is exactly the tool we would normally reach for to diagnose such issues.
Environment

  • ddtrace: 4.4.0 (latest)
  • Python: 3.12
  • Framework: FastAPI + uvicorn (2 workers)
  • Workload: REST API with SSE (Server-Sent Events) endpoints via sse-starlette
  • Infrastructure: Kubernetes, 4Gi memory limit

Behavior
Pods start at ~309MB RSS per worker (~618MB combined). RSS slowly creeps to 345MB per worker, while all Python-level metrics (GC tracked objects, tracemalloc, gc collections) remain completely stable and healthy. Then, between two 5-second monitoring samples, memory jumps past the 4Gi limit in under 2 minutes and the pod is OOMKilled.
This happened consistently, every 3-10 minutes, across multiple pods and restarts.

What made this so hard to find
None of the following could detect the memory growth:

  • tracemalloc (10 frames deep, always-on)
  • gc.get_stats() / gc.get_objects() sampled every 5 seconds
  • Per-request RSS delta tracking
  • GC tracked object counts (stable at ~424k)

Every metric showed a perfectly healthy application right up until the kernel killed it.
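To illustrate why none of these tools could see the growth: memory obtained through the native allocator bypasses the Python allocator that tracemalloc instruments, so RSS climbs while traced memory stays flat. The sketch below simulates a native allocation with a raw libc `malloc` via ctypes (an illustrative stand-in, not the profiler's actual allocation path) and compares the two views:

```python
import ctypes
import resource
import tracemalloc

tracemalloc.start(10)  # 10 frames deep, as in our always-on setup

def snapshot():
    traced, _ = tracemalloc.get_traced_memory()
    # ru_maxrss is KiB on Linux, bytes on macOS; either way it grows with RSS.
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return traced, rss

before_traced, before_rss = snapshot()

# Allocate 64 MiB outside the Python allocator, as native profiler code would.
libc = ctypes.CDLL(None)
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]
size = 64 * 1024 * 1024
buf = libc.malloc(size)
ctypes.memset(buf, 0, size)  # touch the pages so RSS actually grows

after_traced, after_rss = snapshot()

# tracemalloc sees almost nothing; the RSS counter jumps by ~64 MiB.
print(f"traced delta: {after_traced - before_traced} bytes")
print(f"rss delta:    {after_rss - before_rss}")
libc.free(buf)
```

The same blind spot applies to `gc` statistics and tracked-object counts: no Python object is ever created, so there is nothing for the collector to see.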

Resolution
Setting DD_PROFILING_MEMORY_ENABLED=false immediately resolved the issue. Pods have been stable for hours with profiling otherwise enabled.
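For anyone hitting the same crash, a minimal sketch of the workaround, assuming the flag is read when ddtrace initializes: keep the rest of the profiler running and disable only the memory sampler. In Kubernetes this belongs in the pod spec's `env:` block; setting it in application code only works before ddtrace is imported.

```python
import os

# Keep CPU/wall profiling, disable only the native memory sampler.
# Must be in the process environment before ddtrace is imported.
os.environ.setdefault("DD_PROFILING_ENABLED", "true")
os.environ.setdefault("DD_PROFILING_MEMORY_ENABLED", "false")

# With the environment in place, start the profiler as usual, e.g.:
#   import ddtrace.profiling.auto
```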

Request
This is noted as a known issue in the v4.4.0 release notes (https://github.com/DataDog/dd-trace-py/releases/tag/v4.4.0), but:

  • There is no tracking issue for it
  • It affects four minor versions (v4.1 through v4.4)
  • The warning is easy to miss in the release notes - people do not always read this. We didn't.
  • There is no indication of severity, timeline, or fix ETA

For a bug that causes unrecoverable production crashes invisible to standard monitoring, this needs a proper tracking issue with a priority fix.

This is unacceptable for a telemetry library. An observability tool should never be the cause of production outages, especially ones invisible to the very monitoring it sits alongside.
