When await actor_mesh.endpoint.call(...) is running and one process in the mesh
is killed (SEGFAULT, KILL_PROC), call() does not raise. The surviving actors are
stuck in blocking operations (NCCL collectives) and never return. __supervise__
fires correctly for the dead actor, but the call() on the caller side stays blocked.
Repro
Using train_distributed_k8s.py (https://github.com/HosseinKaviani-H/torchft/blob/main/examples/monarch/train_distributed_k8s.py)
- Spawn 8 TrainingActors on a ProcMesh (lines 253-256)
- Call await training_actors.start_training.call(lighthouse_address) (line 266)
- Kill one process (e.g., SEGFAULT via FailureActor)
- __supervise__ fires on the owning ReplicaActor (line 216), confirmed in logs
- call() on line 266 never raises; the 7 surviving processes are stuck in NCCL
Expected
call() should raise when the supervision system detects a dead process in the mesh.
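To make the expected semantics concrete without depending on Monarch internals, here is a minimal asyncio-only sketch. The names call_or_fail, blocked_call, and MeshFailure are hypothetical and are not part of the Monarch API. The sketch only shows a pending call being raced against a failure signal, so that the caller gets an exception instead of hanging.

```python
import asyncio


class MeshFailure(RuntimeError):
    """Hypothetical error standing in for a supervision failure."""


async def call_or_fail(call_coro, failure: asyncio.Event):
    """Await call_coro, but raise MeshFailure if `failure` is set first."""
    call_task = asyncio.ensure_future(call_coro)
    fail_task = asyncio.ensure_future(failure.wait())
    done, _ = await asyncio.wait(
        {call_task, fail_task}, return_when=asyncio.FIRST_COMPLETED
    )
    if call_task in done:
        # The call finished on its own; drop the failure watcher.
        fail_task.cancel()
        return call_task.result()
    # The failure signal won the race; abandon the blocked call and raise.
    call_task.cancel()
    raise MeshFailure("a process in the mesh died while the call was pending")


async def blocked_call():
    # Stands in for surviving actors stuck in a collective: never returns.
    await asyncio.Event().wait()


async def demo() -> str:
    failure = asyncio.Event()
    # Simulate the supervision system reporting a dead process shortly after.
    asyncio.get_running_loop().call_later(0.05, failure.set)
    try:
        await call_or_fail(blocked_call(), failure)
    except MeshFailure:
        return "raised"
    return "returned"
```

In this sketch the caller is unblocked even though the underlying operation never completes, which is the behavior requested for call().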
Workaround
We manually stop the ProcMesh from __supervise__ to force call() to raise (lines 226-230):
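if self._trainers_proc_mesh is not None and self._loop is not None:
    # Clear the reference first so the mesh is only stopped once.
    pm = self._trainers_proc_mesh
    self._trainers_proc_mesh = None
    # Schedule pm.stop() on the main event loop; this thread has no loop.
    self._loop.call_soon_threadsafe(self._loop.create_task, pm.stop())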
This requires the workaround for Issue 2 below, since __supervise__ runs on a
thread with no event loop.
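The cross-thread scheduling step can be sketched in isolation with plain asyncio. This is an illustration under stated assumptions, not the Monarch implementation: stop_mesh is a hypothetical stand-in for pm.stop(), and supervise_from_thread is a hypothetical stand-in for a __supervise__ callback running on a thread without an event loop. It uses asyncio.run_coroutine_threadsafe, a standard-library alternative to call_soon_threadsafe(loop.create_task, ...) that returns a concurrent.futures.Future, so the calling thread can observe completion or errors.

```python
import asyncio
import threading


async def stop_mesh(stopped: list) -> None:
    # Hypothetical stand-in for pm.stop(); records that it ran on the loop.
    await asyncio.sleep(0)
    stopped.append("stopped")


def supervise_from_thread(loop: asyncio.AbstractEventLoop, stopped: list) -> None:
    # Runs on a plain thread with no event loop of its own.
    # Hand the coroutine to the main loop in a thread-safe way.
    future = asyncio.run_coroutine_threadsafe(stop_mesh(stopped), loop)
    # Optional: wait for completion so exceptions from the coroutine surface here.
    future.result(timeout=5)


async def main() -> list:
    stopped: list = []
    loop = asyncio.get_running_loop()
    t = threading.Thread(target=supervise_from_thread, args=(loop, stopped))
    t.start()
    # Join the thread in an executor so the event loop keeps running meanwhile.
    await loop.run_in_executor(None, t.join)
    return stopped
```

Whether blocking on future.result() is acceptable inside __supervise__ is not established here. If it is not, the future can be ignored or given a done-callback instead.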