Summary
Async endpoints with instrument=True (default) trigger a Rust panic:
assertion failed: monarch_hyperactor::telemetry::PySpan is unsendable,
but sent to another thread
left: ThreadId(257)
right: ThreadId(258)
The SpanWrapper in rust_span_tracing.py:29 creates a PySpan (Rust, marked
unsendable) at endpoint entry. When the async endpoint crosses an await, Python's
event loop may resume on a different thread. PySpan.__exit__ runs on the new thread,
causing the panic.
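The failure mode described above can be reproduced in miniature with the standard library alone. The sketch below does not use Monarch or PySpan. `ThreadBoundSpan`, `suspend`, `endpoint_body`, and `drive` are hypothetical names invented for illustration, and a Python `RuntimeError` stands in for the Rust panic. It assumes only what the summary states: the span is tied to the thread that created it, and the coroutine may be resumed on a different thread after an await.

```python
import threading
import types


class ThreadBoundSpan:
    """Hypothetical stand-in for an unsendable object: only valid on its creating thread."""

    def __init__(self, name):
        self.name = name
        self._owner = threading.get_ident()  # remember the creating thread

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Mimic the unsendable check: refuse to run on any other thread.
        if threading.get_ident() != self._owner:
            raise RuntimeError(f"span {self.name!r} exited on a different thread")
        return False


@types.coroutine
def suspend():
    # Hand control back to whoever drives the coroutine, like an await that suspends.
    yield


async def endpoint_body():
    # The span is opened before the await and closed after it.
    with ThreadBoundSpan("start_training"):
        await suspend()
    return "done"


def drive(coro, resume_on_other_thread):
    """Run the coroutine to its suspension point here, then finish it here or on a new thread."""
    coro.send(None)  # the span is now open on the calling thread
    outcome = {}

    def resume():
        try:
            coro.send(None)
        except StopIteration as stop:
            outcome["result"] = stop.value
        except RuntimeError as err:
            outcome["error"] = str(err)

    if resume_on_other_thread:
        worker = threading.Thread(target=resume)
        worker.start()
        worker.join()
    else:
        resume()
    return outcome
```

Calling `drive(endpoint_body(), False)` returns `{"result": "done"}`, while `drive(endpoint_body(), True)` returns a dict with an `"error"` entry, because the span's `__exit__` runs on the worker thread. This mirrors why the real crash appears only when the resume happens to land on another thread.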
Repro
In train_distributed_k8s.py (https://github.com/HosseinKaviani-H/torchft/blob/main/examples/monarch/train_distributed_k8s.py), any @endpoint method that is async and long-running
(e.g., start_training, start_replica) will eventually hit this. The crash is
non-deterministic: it depends on thread scheduling.
Stack trace
thread '<unnamed>' panicked at pyo3-0.26.0/src/impl_/pyclass.rs:1081:9
9: <monarch_hyperactor::telemetry::PySpan>::__pymethod_exit__
Note
torchmonarch 0.4.1 fixed the Python dispatch path (uses TRACER.start_as_current_span
instead of PySpan directly), but RustTracer.start_span() still creates PySpan
via SpanWrapper internally (rust_span_tracing.py:29):
self._span: PySpan | None = PySpan(name, actor_id)
Workaround
Use @endpoint(instrument=False) on all endpoints (lines 170, 272, 296 in
train_distributed_k8s.py). Note that this disables all endpoint telemetry.