PySpan is unsendable panic when async endpoints cross await boundaries #3437

@HosseinKaviani-H

Summary

Async endpoints with instrument=True (default) trigger a Rust panic:

  assertion failed: monarch_hyperactor::telemetry::PySpan is unsendable,
  but sent to another thread
    left: ThreadId(257)
   right: ThreadId(258)

The SpanWrapper in rust_span_tracing.py:29 creates a PySpan (a Rust class marked
unsendable) at endpoint entry. PyO3 checks at runtime that an unsendable object is
only used on the thread that created it. When the async endpoint crosses an await,
the coroutine may be resumed on a different thread. PySpan.__exit__ then runs on the
new thread and fails that check, causing the panic.
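
The mechanism can be sketched in pure Python without Monarch. This is a minimal illustration under stated assumptions, not the actual implementation: ThreadBoundSpan is a hypothetical stand-in for PySpan, a RuntimeError stands in for the Rust panic, and the coroutine is polled by hand to imitate a runtime that resumes it on another thread.

```python
import threading


class ThreadBoundSpan:
    """Hypothetical stand-in for an unsendable class such as PySpan."""

    def __init__(self, name):
        self.name = name
        self._owner = threading.get_ident()  # thread that created the span

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Imitates the runtime thread check; the real check panics in Rust.
        if threading.get_ident() != self._owner:
            raise RuntimeError(
                f"{type(self).__name__} is unsendable, but sent to another thread"
            )
        return False


class Suspend:
    """Awaitable that suspends the coroutine exactly once."""

    def __await__(self):
        yield


async def start_training():
    # The span is entered before the await and exited after it.
    with ThreadBoundSpan("start_training"):
        await Suspend()
    return "done"


def poll(coro):
    """Advance the coroutine by one step on the calling thread."""
    try:
        coro.send(None)
        return "suspended"
    except StopIteration as stop:
        return stop.value
    except RuntimeError as err:
        return err


def poll_on_new_thread(coro):
    """Advance the coroutine by one step on a freshly started thread."""
    box = []
    worker = threading.Thread(target=lambda: box.append(poll(coro)))
    worker.start()
    worker.join()
    return box[0]


# Both steps on one thread: the span is exited where it was created.
same = start_training()
same_thread = [poll(same), poll(same)]

# Second step on another thread: __exit__ runs off the owning thread.
moved = start_training()
cross_thread = [poll(moved), poll_on_new_thread(moved)]
```

In this sketch the first coroutine finishes normally, while the second fails in __exit__ only because the step after the await ran on a different thread. That matches the shape of the reported stack trace, where the panic originates in PySpan's exit method.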

Repro

In train_distributed_k8s.py (https://github.com/HosseinKaviani-H/torchft/blob/main/examples/monarch/train_distributed_k8s.py),
any @endpoint method that is async and long-running (e.g., start_training,
start_replica) will eventually hit this. The crash is non-deterministic: it depends
on thread scheduling.

Stack trace

  thread '<unnamed>' panicked at pyo3-0.26.0/src/impl_/pyclass.rs:1081:9
  9: <monarch_hyperactor::telemetry::PySpan>::__pymethod_exit__

Note

torchmonarch 0.4.1 fixed the Python dispatch path (uses TRACER.start_as_current_span
instead of PySpan directly), but RustTracer.start_span() still creates PySpan
via SpanWrapper internally (rust_span_tracing.py:29):

  self._span: PySpan | None = PySpan(name, actor_id)

Workaround

Set @endpoint(instrument=False) on all endpoints (lines 170, 272, 296 in
train_distributed_k8s.py). This avoids the panic but disables all endpoint telemetry.
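
Why the flag avoids the crash can be sketched as follows. This is an illustration under assumptions, not Monarch's code: the endpoint decorator below is a hypothetical stand-in that only models the instrument flag, ThreadBoundSpan is a hypothetical stand-in for PySpan, and a RuntimeError stands in for the Rust panic.

```python
import functools
import threading


class ThreadBoundSpan:
    """Hypothetical stand-in for an unsendable span object."""

    def __init__(self, name):
        self.name = name
        self._owner = threading.get_ident()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if threading.get_ident() != self._owner:
            raise RuntimeError("span is unsendable, but sent to another thread")
        return False


def endpoint(instrument=True):
    """Hypothetical decorator that only models the `instrument` flag."""

    def decorate(fn):
        if not instrument:
            return fn  # no span is created, so nothing is tied to a thread

        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            with ThreadBoundSpan(fn.__name__):
                return await fn(*args, **kwargs)

        return wrapper

    return decorate


class Suspend:
    """Awaitable that suspends the coroutine exactly once."""

    def __await__(self):
        yield


@endpoint(instrument=False)
async def start_training():
    await Suspend()
    return "trained"


def resume_on_other_thread(coro):
    """Poll once on this thread, then finish the coroutine on a new thread."""
    coro.send(None)
    outcome = []

    def finish():
        try:
            coro.send(None)
        except StopIteration as stop:
            outcome.append(stop.value)
        except RuntimeError as err:
            outcome.append(err)

    worker = threading.Thread(target=finish)
    worker.start()
    worker.join()
    return outcome[0]


result = resume_on_other_thread(start_training())
```

With instrument=False the stand-in decorator returns the function unchanged, so no thread-bound object exists when the coroutine is resumed elsewhere. The cost is the one noted above: no span is recorded for the call.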

Labels

bug