Memory profiler causes repeated OOMKills in production - invisible to all Python-level monitoring (v4.1-v4.4) #16491

@mikkelam

Description

Tracer Version(s)

4.4.0

Python Version(s)

3.12.12

Pip Version(s)

uv 0.9

Bug Report

The ddtrace memory profiler causes sudden, catastrophic OOMKills in our production Kubernetes pods. The allocations happen outside the normal Python allocator, making them invisible to tracemalloc, gc stats, and RSS tracking within the application, which made the issue extremely difficult to diagnose.
My team spent far too long debugging repeated OOMKills before identifying the memory profiler as the root cause. Ironically, ddtrace is exactly the tool we would normally reach for to diagnose such issues.
Environment

  • ddtrace: 4.4.0 (latest)
  • Python: 3.12
  • Framework: FastAPI + uvicorn (2 workers)
  • Workload: REST API with SSE (Server-Sent Events) endpoints via sse-starlette
  • Infrastructure: Kubernetes, 4Gi memory limit

Behavior
Pods start at ~309MB RSS per worker (~618MB combined). RSS slowly creeps to 345MB per worker, while all Python-level metrics (GC tracked objects, tracemalloc, gc collections) remain completely stable and healthy. Then, between two 5-second monitoring samples, memory jumps past the 4Gi limit in under 2 minutes and the pod is OOMKilled.
This happened consistently, every 3-10 minutes, across multiple pods and restarts.

What made this so hard to find
None of the following could detect the memory growth:

  • tracemalloc (10 frames deep, always-on)
  • gc.get_stats() / gc.get_objects() sampled every 5 seconds
  • Per-request RSS delta tracking
  • GC tracked object counts (stable at ~424k)

Every metric showed a perfectly healthy application right up until the kernel killed it.
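To illustrate why none of these tools could see the growth: memory obtained through the native allocator bypasses the Python allocator that tracemalloc instruments, so RSS climbs while traced memory stays flat. The sketch below simulates a native allocation with a raw libc `malloc` via ctypes (an illustrative stand-in, not the profiler's actual allocation path) and compares the two views:

```python
import ctypes
import resource
import tracemalloc

tracemalloc.start(10)  # 10 frames deep, as in our always-on setup

def snapshot():
    traced, _ = tracemalloc.get_traced_memory()
    # ru_maxrss is KiB on Linux, bytes on macOS; either way it grows with RSS.
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return traced, rss

before_traced, before_rss = snapshot()

# Allocate 64 MiB outside the Python allocator, as native profiler code would.
libc = ctypes.CDLL(None)
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]
size = 64 * 1024 * 1024
buf = libc.malloc(size)
ctypes.memset(buf, 0, size)  # touch the pages so RSS actually grows

after_traced, after_rss = snapshot()

# tracemalloc sees almost nothing; the RSS counter jumps by ~64 MiB.
print(f"traced delta: {after_traced - before_traced} bytes")
print(f"rss delta:    {after_rss - before_rss}")
libc.free(buf)
```

The same blind spot applies to `gc` statistics and tracked-object counts: no Python object is ever created, so there is nothing for the collector to see.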

Resolution
Setting DD_PROFILING_MEMORY_ENABLED=false immediately resolved the issue. Pods have been stable for hours with profiling otherwise enabled.
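For anyone hitting the same crash, a minimal sketch of the workaround, assuming the flag is read when ddtrace initializes: keep the rest of the profiler running and disable only the memory sampler. In Kubernetes this belongs in the pod spec's `env:` block; setting it in application code only works before ddtrace is imported.

```python
import os

# Keep CPU/wall profiling, disable only the native memory sampler.
# Must be in the process environment before ddtrace is imported.
os.environ.setdefault("DD_PROFILING_ENABLED", "true")
os.environ.setdefault("DD_PROFILING_MEMORY_ENABLED", "false")

# With the environment in place, start the profiler as usual, e.g.:
#   import ddtrace.profiling.auto
```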

Request
This is noted as a known issue in the v4.4.0 release notes (https://github.com/DataDog/dd-trace-py/releases/tag/v4.4.0), but:

  • There is no tracking issue for it
  • It affects four minor versions (v4.1 through v4.4)
  • The warning is easy to miss in the release notes - people do not always read this. We didn't.
  • There is no indication of severity, timeline, or fix ETA

For a bug that causes unrecoverable production crashes invisible to standard monitoring, this needs a proper tracking issue with a priority fix.

This is unacceptable for a telemetry library. An observability tool should never be the cause of production outages, especially ones invisible to the very monitoring it sits alongside.
