Description
- Tracer Version(s): 4.4.0
- Python Version(s): 3.12.12
- Pip Version(s): uv 0.9
Bug Report
The ddtrace memory profiler causes sudden, catastrophic OOMKills in our production Kubernetes pods. The allocations happen outside the normal Python allocator, making them invisible to tracemalloc, GC stats, and in-process RSS tracking, which made the issue extremely difficult to diagnose.
My team spent far too long debugging the repeated OOMKills before identifying the memory profiler as the root cause. ddtrace is exactly the tool we would normally reach for to diagnose an issue like this.
Environment
- ddtrace: 4.4.0 (latest)
- Python: 3.12
- Framework: FastAPI + uvicorn (2 workers)
- Workload: REST API with SSE (Server-Sent Events) endpoints via sse-starlette
- Infrastructure: Kubernetes, 4Gi memory limit
Behavior
Pods start at ~309MB RSS per worker (~618MB combined). RSS slowly creeps to ~345MB per worker, with all Python-level metrics (GC-tracked object counts, tracemalloc, GC collections) completely stable and healthy. Then RSS jumps to 4Gi+ in under two minutes and the pod is OOMKilled; the spike begins between two consecutive 5-second monitoring samples.
This happened consistently, every 3-10 minutes, across multiple pods and restarts.
What made this so hard to find
None of the following could detect the memory growth:
- tracemalloc (10 frames deep, always-on)
- gc.get_stats() / gc.get_objects() sampled every 5 seconds
- Per-request RSS delta tracking
- GC tracked object counts (stable at ~424k)
Every metric showed a perfectly healthy application right up until the kernel killed it.
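To make it concrete, here is a simplified sketch of the kind of in-process monitoring we were sampling every 5 seconds. Names and structure are illustrative, not our actual code; it reads RSS from `/proc/self/status`, so it is Linux-only (which matches the Kubernetes environment). Every one of these numbers stayed flat while the pod climbed to its 4Gi limit.

```python
import gc
import tracemalloc

def snapshot_metrics():
    """Collect the Python-level metrics that all looked healthy."""
    current, peak = tracemalloc.get_traced_memory()
    # Linux-specific: kernel-reported RSS for this process, in KB.
    with open("/proc/self/status") as f:
        rss_kb = next(
            int(line.split()[1]) for line in f if line.startswith("VmRSS:")
        )
    return {
        "tracemalloc_current_mb": current / 1e6,
        "tracemalloc_peak_mb": peak / 1e6,
        "gc_tracked_objects": len(gc.get_objects()),
        "gc_collections": [s["collections"] for s in gc.get_stats()],
        "rss_mb": rss_kb / 1024,
    }

tracemalloc.start(10)  # 10 frames deep, always-on
metrics = snapshot_metrics()
```

The gap this illustrates: the profiler's native allocations show up only in `rss_mb` (the kernel's view), never in the tracemalloc or GC numbers, and in our case the final spike outran a 5-second sampling interval anyway.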
Resolution
Setting DD_PROFILING_MEMORY_ENABLED=false immediately resolved the issue. Pods have been stable for hours with the rest of the profiler still enabled.
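For anyone hitting the same crash, this is roughly how we applied the workaround in our Deployment spec (a hypothetical excerpt, field names aside from the two ddtrace variables are standard Kubernetes):

```yaml
# Keep CPU/wall profiling on, but disable the memory profiler
# component until this bug is fixed.
env:
  - name: DD_PROFILING_ENABLED
    value: "true"
  - name: DD_PROFILING_MEMORY_ENABLED
    value: "false"
```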
Request
This is noted as a known issue in the v4.4.0 release notes (https://github.com/DataDog/dd-trace-py/releases/tag/v4.4.0), but:
- There is no tracking issue for it
- It affects four minor versions (v4.1 through v4.4)
- The warning is easy to miss in the release notes; not everyone reads them, and we didn't
- There is no indication of severity, timeline, or fix ETA
For a bug that causes unrecoverable production crashes invisible to standard monitoring, this needs a proper tracking issue and a prioritized fix.
This is unacceptable for a telemetry tool. An observability library should never be the cause of production outages, least of all outages invisible to the very monitoring it sits alongside.