call() blocks indefinitely when a child process in the mesh is killed #3435

When `await actor_mesh.endpoint.call(...)` is in flight and one process in the mesh
is killed (SEGFAULT, KILL_PROC), `call()` does not raise. The surviving actors are
stuck in blocking operations (NCCL collectives) and never return. `__supervise__`
fires correctly for the dead actor, but `call()` on the caller side stays blocked.
                                                                                                                                   
  ## Repro                                                                                                                         
                                                                                                                                   
Using `train_distributed_k8s.py` (https://github.com/HosseinKaviani-H/torchft/blob/main/examples/monarch/train_distributed_k8s.py):
                                                      
1. Spawn 8 TrainingActors on a ProcMesh (lines 253-256).
2. Call `await training_actors.start_training.call(lighthouse_address)` (line 266).
3. Kill one process (e.g., SEGFAULT via FailureActor).
4. `__supervise__` fires on the owning ReplicaActor (line 216); confirmed in logs.
5. `call()` on line 266 never raises; the 7 surviving processes are stuck in NCCL.
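
To make the hang in step 5 observable in an automated repro rather than by watching a stuck job, the call can be bounded with a deadline. The following is a minimal sketch using only `asyncio`; `stuck_call` and `call_with_deadline` are hypothetical names, and `stuck_call` merely stands in for the real `start_training.call(...)`, which is not reproduced here.

```python
import asyncio


async def stuck_call() -> None:
    """Stand-in for a mesh call whose surviving actors never return."""
    await asyncio.Event().wait()  # the event is never set, so this blocks forever


async def call_with_deadline(awaitable, timeout_s: float) -> str:
    """Report whether the awaited call finished within timeout_s seconds."""
    try:
        await asyncio.wait_for(awaitable, timeout=timeout_s)
        return "completed"
    except asyncio.TimeoutError:
        # wait_for cancels the inner awaitable before raising.
        return "hung"


result = asyncio.run(call_with_deadline(stuck_call(), timeout_s=0.1))
print(result)
```

A timeout only detects the hang; it does not distinguish a dead process from a slow step, which is why the supervision signal is the better trigger.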
                                                                                                                                   
  ## Expected                                                                                                                      
                                                                                                                                   
  `call()` should raise when the supervision system detects a dead process in the mesh.                                            
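
The expected semantics can be illustrated as a race between the pending call and a failure signal. This is a sketch in plain `asyncio`, not Monarch's implementation: `MeshFailure`, `call_or_fail`, and the `process_died` event are hypothetical stand-ins for whatever the supervision system would use internally.

```python
import asyncio


class MeshFailure(RuntimeError):
    """Hypothetical error for a process dying while a call is pending."""


async def call_or_fail(call_coro, process_died: asyncio.Event):
    """Await call_coro, but raise MeshFailure if process_died is set first."""
    call_task = asyncio.ensure_future(call_coro)
    died_task = asyncio.ensure_future(process_died.wait())
    done, pending = await asyncio.wait(
        {call_task, died_task}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop waiting on whichever side lost the race
    if call_task in done:
        return call_task.result()
    raise MeshFailure("a process in the mesh died while call() was pending")


async def demo() -> str:
    process_died = asyncio.Event()
    # Simulate supervision detecting the dead process shortly after the call starts.
    asyncio.get_running_loop().call_later(0.05, process_died.set)
    never_returns = asyncio.Event().wait()  # survivors stuck in a collective
    try:
        await call_or_fail(never_returns, process_died)
        return "returned"
    except MeshFailure:
        return "raised"


outcome = asyncio.run(demo())
print(outcome)
```

Note that cancelling the caller-side wait does not unblock the surviving processes themselves; they would still need to be stopped or have their collectives aborted.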
   
  ## Workaround                                                                                                                    
                                                      
We manually stop the ProcMesh from `__supervise__` to force `call()` to raise (lines 226-230):
                                                      
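    # Clear the reference so a second __supervise__ invocation does not stop the mesh again.
    if self._trainers_proc_mesh is not None and self._loop is not None:
        pm = self._trainers_proc_mesh
        self._trainers_proc_mesh = None
        # Schedule pm.stop() on the saved event loop from this non-loop thread.
        self._loop.call_soon_threadsafe(self._loop.create_task, pm.stop())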
                                                                                                                                   
This requires the workaround for Issue 2 below, since `__supervise__` runs on a
thread with no event loop.
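
The scheduling pattern in the workaround can be shown in isolation. This is a self-contained sketch of handing a coroutine to an event loop from a thread that has none; `stop_mesh` and `supervise_callback` are hypothetical stand-ins for `pm.stop()` and `__supervise__`, and the loop reference is captured ahead of time, as `self._loop` is in the snippet above.

```python
import asyncio
import threading


async def stop_mesh(stopped: asyncio.Event) -> None:
    """Stand-in for pm.stop(); sets an event so the demo can observe that it ran."""
    stopped.set()


async def main() -> bool:
    loop = asyncio.get_running_loop()  # captured while on the loop thread
    stopped = asyncio.Event()

    def supervise_callback() -> None:
        # Runs on a plain thread with no running loop, so the coroutine is
        # handed to the captured loop instead of being awaited here.
        loop.call_soon_threadsafe(loop.create_task, stop_mesh(stopped))

    thread = threading.Thread(target=supervise_callback)
    thread.start()
    await asyncio.wait_for(stopped.wait(), timeout=1.0)
    thread.join()
    return stopped.is_set()


ran = asyncio.run(main())
print(ran)
```

`asyncio.run_coroutine_threadsafe(coro, loop)` is a standard-library alternative that also returns a `concurrent.futures.Future`, which is useful if the calling thread needs the result or the exception.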
