call() blocks indefinitely when a child process in the mesh is killed #3435

@HosseinKaviani-H

When await actor_mesh.endpoint.call(...) is running and one process in the mesh
is killed (SEGFAULT, KILL_PROC), call() does not raise. The surviving actors are
stuck in blocking operations (NCCL collectives) and never return. __supervise__
fires correctly for the dead actor, but the call() on the caller side stays blocked.

Repro

Using train_distributed_k8s.py (https://github.com/HosseinKaviani-H/torchft/blob/main/examples/monarch/train_distributed_k8s.py)

  1. Spawn 8 TrainingActors on a ProcMesh (lines 253-256)
  2. Call await training_actors.start_training.call(lighthouse_address) (line 266)
  3. Kill one process (e.g., SEGFAULT via FailureActor)
  4. __supervise__ fires on the owning ReplicaActor (line 216) — confirmed in logs
  5. call() on line 266 never raises — the 7 surviving processes are stuck in NCCL

Expected

call() should raise when the supervision system detects a dead process in the mesh.
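
As a rough illustration of the expected behavior, the sketch below races a pending call against a failure signal using only the standard library's asyncio. It is not Monarch code: `MeshFailure`, `call_or_fail`, and the `failure_event` are hypothetical names invented for this example, and the "stuck call" is a stand-in for survivors blocked in a collective.

```python
import asyncio


class MeshFailure(RuntimeError):
    """Hypothetical error type used only in this sketch."""


async def call_or_fail(call_coro, failure_event: asyncio.Event):
    """Await call_coro, but raise MeshFailure as soon as failure_event is set."""
    call_task = asyncio.ensure_future(call_coro)
    failure_task = asyncio.ensure_future(failure_event.wait())
    try:
        done, _ = await asyncio.wait(
            {call_task, failure_task}, return_when=asyncio.FIRST_COMPLETED
        )
        if call_task in done:
            # Normal completion: return the value or re-raise the call's own error.
            return call_task.result()
        raise MeshFailure("a process in the mesh died while call() was pending")
    finally:
        # Cancel whichever side is still pending so no task is leaked.
        for task in (call_task, failure_task):
            if not task.done():
                task.cancel()


async def demo() -> str:
    failure = asyncio.Event()

    async def stuck_call():
        # Stand-in for survivors blocked in a collective: never completes.
        await asyncio.Event().wait()

    # Stand-in for the supervision callback firing shortly after the call starts.
    asyncio.get_running_loop().call_later(0.01, failure.set)
    try:
        await call_or_fail(stuck_call(), failure)
    except MeshFailure as exc:
        return f"raised: {exc}"
    return "returned"


result = asyncio.run(demo())
```

This models only the caller side: the awaiting code gets an exception instead of hanging. It does not by itself unblock actors that are still stuck inside a blocking operation.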

Workaround

We manually stop the ProcMesh from __supervise__ to force call() to raise (lines 226-230):

  if self._trainers_proc_mesh is not None and self._loop is not None:
      pm = self._trainers_proc_mesh
      self._trainers_proc_mesh = None
      # __supervise__ runs off the event-loop thread, so hand pm.stop() to the loop.
      self._loop.call_soon_threadsafe(self._loop.create_task, pm.stop())

This requires the workaround for Issue 2 below, since __supervise__ runs on a
thread with no event loop.
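
For readers unfamiliar with the threading constraint, here is a minimal standard-library sketch of scheduling a coroutine onto an event loop from a thread that has none. It uses `asyncio.run_coroutine_threadsafe`, a close relative of the `call_soon_threadsafe(loop.create_task, ...)` pattern above. `stop_mesh` and `supervise_from_thread` are hypothetical stand-ins, not Monarch APIs.

```python
import asyncio
import threading


async def stop_mesh(log: list) -> str:
    """Hypothetical stand-in for pm.stop(); records that it ran on the loop."""
    log.append("stopped")
    return "stopped"


def supervise_from_thread(loop: asyncio.AbstractEventLoop, log: list) -> None:
    """Runs on a plain thread with no running event loop."""
    # asyncio.create_task() would raise RuntimeError here because this thread
    # has no running loop, so the coroutine is submitted to the loop instead.
    future = asyncio.run_coroutine_threadsafe(stop_mesh(log), loop)
    # Block this thread (not the loop) until the coroutine finishes.
    log.append(future.result(timeout=5))


async def main() -> list:
    log: list = []
    loop = asyncio.get_running_loop()
    worker = threading.Thread(target=supervise_from_thread, args=(loop, log))
    worker.start()
    # Join in an executor so the loop stays free to run stop_mesh().
    await loop.run_in_executor(None, worker.join)
    return log


log = asyncio.run(main())
```

Unlike the fire-and-forget form, `run_coroutine_threadsafe` returns a `concurrent.futures.Future`, so the calling thread can wait for completion or observe an exception raised by the coroutine.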
