Skip in-container sshd setup for single-node container runs #224

Merged
atnair-amd merged 1 commit into main from atnair/orch-singlenode-setup-sshd-guard
Jun 15, 2026
@atnair-amd

Collaborator

Summary

ContainerOrchestrator.setup_sshd() ran a fixed command list ending in
/usr/sbin/sshd -p2224 and asserted every step exited 0 — for every
container run, regardless of node count (cvs/tests/conftest.py calls it
unconditionally). The in-container sshd exists solely so MPI (mpirun's
plm_rsh_args -p 2224, see BaremetalOrchestrator.get_mpi_command) can reach
peer ranks on other nodes. A single-node run execs directly via docker exec
and never distributes over MPI, so the sshd setup is dead weight on one host.
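The split between the two launch paths can be sketched roughly as follows. This is an illustration only: `build_launch_command` and its parameters are hypothetical names, and the real `BaremetalOrchestrator.get_mpi_command` may assemble its arguments differently. The only details taken from the description above are the `docker exec` path for one host and `plm_rsh_args -p 2224` for several.

```python
SSHD_PORT = 2224  # in-container sshd port named in the PR description


def build_launch_command(hosts, container, workload):
    """Hypothetical helper: pick the launch path based on host count.

    hosts     -- list of node names
    container -- container name or id on the local host
    workload  -- command to run, as a list of argv strings
    """
    if len(hosts) <= 1:
        # Single node: exec straight into the container, no MPI and no sshd.
        return ["docker", "exec", container, *workload]

    # Multinode: mpirun reaches peer ranks over ssh, so it must be pointed
    # at the port the in-container sshd listens on.
    return [
        "mpirun",
        "--mca", "plm_rsh_args", f"-p {SSHD_PORT}",
        "--host", ",".join(hosts),
        "-np", str(len(hosts)),
        *workload,
    ]
```

Only the multinode branch mentions port 2224 at all, which is why an sshd listening on that port has no consumer on a single host.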

Consequences before this change:

  • A single-node run on a minimal image without /usr/sbin/sshd failed the whole
    orch fixture (pytest.fail("Failed to setup sshd in container")) and never
    ran the workload; the only symptom was the generic "SSH setup command
    failed" message.
  • Single-node, no-SSH suites on sshd-equipped images still paid for a pointless
    step that can fail on unrelated edge cases (e.g. empty ~/.ssh mount).

Change

  • Guard setup_sshd() to return True early when len(self.hosts) <= 1,
    placed after the container_id precondition. Host count lives on the
    orchestrator, so the decision belongs there; the fixture stays unchanged and
    any future caller inherits the behavior.
  • Multinode runs are unaffected — they still set up and validate sshd.
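A minimal sketch of the guard described above, not the merged diff. The class name, the `RuntimeError` used for the precondition, the `runtime.exec(container_id, cmd)` signature, and the one-item command list are all assumptions made for illustration; only the ordering (precondition first, then the `len(self.hosts) <= 1` early return) comes from the description.

```python
class ContainerOrchestratorSketch:
    """Hypothetical stand-in for ContainerOrchestrator."""

    def __init__(self, hosts, container_id, runtime):
        self.hosts = hosts
        self.container_id = container_id
        # Assumed interface: runtime.exec(container_id, cmd) -> exit code
        self.runtime = runtime

    def setup_sshd(self):
        # Precondition comes first, so it still applies to single-node runs.
        # The exception type here is an assumption.
        if not self.container_id:
            raise RuntimeError("container_id is not set")

        # Guard: a single-node run never distributes over MPI, so nothing
        # would ever connect to the in-container sshd.
        if len(self.hosts) <= 1:
            return True

        # Multinode path: run each setup command and require exit code 0.
        # The real command list is longer; only its last step is shown.
        commands = ["/usr/sbin/sshd -p2224"]
        return all(
            self.runtime.exec(self.container_id, cmd) == 0 for cmd in commands
        )


class RecordingRuntime:
    """Tiny fake runtime that records the commands it is asked to run."""

    def __init__(self, exit_code=0):
        self.calls = []
        self.exit_code = exit_code

    def exec(self, container_id, cmd):
        self.calls.append(cmd)
        return self.exit_code
```

With the fake runtime, a one-host orchestrator returns `True` without recording any call, while a two-host orchestrator records the sshd command.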

Tests

  • cvs/core/orchestrators/unittests/test_container.py: single-node skips (no
    runtime.exec), container_id precondition still raises even single-node,
    multinode still attempts setup.

Out of scope

  • Multinode + no-sshd image still fails (correctly — MPI needs sshd); its error
    message could be clearer, tracked separately.
  • No new config key; behavior keys purely off node count.

Gate

make test (mirrors CI .github/workflows/ci.yml, Python 3.10): make fmt
clean (213 files unchanged), Ran 370 tests ... OK, 42 CLI tests passed.

ContainerOrchestrator.setup_sshd() ran a fixed command list ending in
`/usr/sbin/sshd -p2224` and asserted every step succeeded, for every
container run regardless of node count. The orch pytest fixture calls it
unconditionally, so a single-node run on a minimal image with no
/usr/sbin/sshd failed the whole fixture with a generic "SSH setup command
failed" message and never ran the workload.

The in-container sshd exists only so MPI (mpirun's plm_rsh_args -p 2224)
can reach peer ranks on other nodes. A single-node run execs directly via
docker exec and never distributes over MPI, so the sshd setup is dead
weight there. Guard setup_sshd() to return True early when len(self.hosts)
<= 1, after the container_id precondition. The host count lives on the
orchestrator, so the decision belongs there; multinode runs are unchanged.
@atnair-amd atnair-amd self-assigned this Jun 12, 2026
@atnair-amd atnair-amd requested a review from cijohnson June 12, 2026 16:54
@atnair-amd atnair-amd merged commit a0d4780 into main Jun 15, 2026
2 checks passed