Ichristo/cvs monitor helios jumphost#212

Open
cijohnson wants to merge 6 commits into main from ichristo/cvs-monitor-helios-jumphost

@cijohnson

Collaborator

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

cijohnson and others added 6 commits June 3, 2026 16:26
Implements a new persistent worker mode that maintains long-lived SSH
processes across multiple operations, improving performance for workloads
with repeated SSH commands to the same hosts.

Key Features:
- PersistentPsshSharder: Long-lived worker processes with process pooling
- Worker state management via WorkerTable/WorkerState interfaces
- Configurable via CVS_PERSISTENT_SHARDS environment variable
- Intelligent host chunking with max_workers limit enforcement
- Backwards compatible - defaults to existing transient mode

Components Added:
- cvs/lib/parallel/persistent_pssh_sharder.py: Core persistent worker implementation
- Enhanced ParallelConfig with persistent_shards option
- SharderInterface abstraction for pluggable sharding strategies
- Comprehensive test coverage including performance integration tests

Host Chunking Improvements:
- Simplified chunk_hosts algorithm for better maintainability
- Respects max_workers configuration to prevent resource exhaustion
- Ensures contiguous host distribution across workers
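
The chunking rules listed above (contiguous distribution, capped at max_workers) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the name `chunk_hosts` comes from the commit message, while the signature, the `hosts_per_worker` parameter, and the balancing policy are assumptions.

```python
from typing import List, Sequence


def chunk_hosts(
    hosts: Sequence[str], hosts_per_worker: int, max_workers: int
) -> List[List[str]]:
    """Split hosts into contiguous chunks, never more than max_workers chunks.

    Hypothetical signature; the real chunk_hosts may differ.
    """
    if hosts_per_worker < 1 or max_workers < 1:
        raise ValueError("hosts_per_worker and max_workers must be >= 1")
    if not hosts:
        return []
    # Ceiling division: workers needed at the requested density.
    wanted = -(-len(hosts) // hosts_per_worker)
    # Enforce the cap so a large host list cannot spawn unbounded workers.
    n_chunks = min(wanted, max_workers)
    # Spread the remainder over the first chunks so sizes differ by at most 1.
    base, extra = divmod(len(hosts), n_chunks)
    chunks: List[List[str]] = []
    start = 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(list(hosts[start:start + size]))
        start += size
    return chunks
```

When the cap applies, chunks grow beyond `hosts_per_worker` rather than spawning extra workers, and each worker still receives a contiguous slice of the original host order.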

This architecture eliminates connection setup overhead on repeated
operations, which is where CVS monitor operations and other SSH-intensive
workloads are expected to gain performance.

Signed-off-by: Ignatious Johnson <ichristo@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
registry

Jump host (bastion) support for parallel SSH: ParallelConfig/Pssh/
MultiProcessPssh and the persistent sharder now accept jump_host/user/
password/pkey/port and tunnel target connections via parallel-ssh proxy
parameters (new JumpHostManager helper).
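
As a rough sketch of the mapping described above: parallel-ssh's `ParallelSSHClient` takes `proxy_host`, `proxy_port`, `proxy_user`, `proxy_password`, and `proxy_pkey` keyword arguments. The `JumpHostSettings` class and `proxy_kwargs` function below are hypothetical names used only for illustration; they are not the `JumpHostManager` added in this PR, and the field names are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class JumpHostSettings:
    """Hypothetical container; field names are assumptions, not ParallelConfig's."""

    host: Optional[str] = None
    user: Optional[str] = None
    password: Optional[str] = None
    pkey: Optional[str] = None
    port: int = 22


def proxy_kwargs(jump: JumpHostSettings) -> Dict[str, Any]:
    """Translate jump-host settings into parallel-ssh proxy_* keyword arguments."""
    if not jump.host:
        return {}  # no jump host configured: connect directly
    kwargs: Dict[str, Any] = {"proxy_host": jump.host, "proxy_port": jump.port}
    optional = {
        "proxy_user": jump.user,
        "proxy_password": jump.password,
        "proxy_pkey": jump.pkey,
    }
    # Only pass credentials that were actually set.
    kwargs.update({k: v for k, v in optional.items() if v is not None})
    return kwargs
```

The result could then be splatted into the client constructor, for example `ParallelSSHClient(hosts, user=..., **proxy_kwargs(jump))`, so the direct and jump-host paths share one call site.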

Because this introduced several new CVS_JUMP_* variables, centralize all
supported environment variables in a single registry (cvs/lib/env_vars.py):
name, default, type, and description declared once, read via get(). Refactor
config.from_env and the CLUSTER_FILE / CVS_EXTENSION_PKG_NAMES call sites to
use it, and add a `cvs env` command (listing + masked current values + quick
table) that renders the registry so docs can't drift. `cvs env` is ordered
after `exec` in help output.
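
A registry of this shape might look roughly like the sketch below. It is not the contents of cvs/lib/env_vars.py: the `EnvVar` dataclass, the `secret` flag, the `render` helper, the `EXAMPLE_SECRET` entry, and all defaults and descriptions are assumptions made for illustration. Only the variable names `CVS_PERSISTENT_SHARDS` and `CLUSTER_FILE` and the `get()` accessor are taken from the commit messages.

```python
import os
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class EnvVar:
    name: str
    default: Any
    type: Callable[[str], Any]  # converts the raw string from the environment
    description: str
    secret: bool = False  # assumption: marks values to mask in listings


def _to_bool(raw: str) -> bool:
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Illustrative entries only; defaults and descriptions are assumed.
REGISTRY: Dict[str, EnvVar] = {
    var.name: var
    for var in (
        EnvVar("CVS_PERSISTENT_SHARDS", False, _to_bool, "Use persistent SSH workers"),
        EnvVar("CLUSTER_FILE", None, str, "Path to the cluster file"),
        EnvVar("EXAMPLE_SECRET", None, str, "Hypothetical secret value", secret=True),
    )
}


def get(name: str) -> Any:
    """Read a registered variable, applying its declared type and default."""
    var = REGISTRY[name]  # KeyError for names that were never registered
    raw = os.environ.get(name)
    return var.default if raw is None else var.type(raw)


def render() -> str:
    """Build a plain-text table of the registry with secrets masked."""
    lines = []
    for var in REGISTRY.values():
        value = get(var.name)
        shown = "****" if var.secret and value is not None else repr(value)
        lines.append(f"{var.name:<24} {shown:<16} {var.description}")
    return "\n".join(lines)
```

Because the listing is generated from the same table that `get()` reads, a variable cannot be documented without being registered, which is the drift-prevention property the commit message describes.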

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Ignatious Johnson <ichristo@amd.com>
unittests

Establish a unittest test suite for the cluster-mon backend ahead of
the SSH migration, capturing the behavior the migration must preserve.

- backend/run_all_unittests.py: backend-rooted unittest discovery
- app/unittests/testing.py: FakeSshManager test double for the SSH-manager API
- per-module unittests/ packages under app/core, app/collectors, app/api
- collector contract tests pinning exec_async -> parsed-output (incl. ERROR/ABORT)
- api logs tests (grep validation + /search filtering)
- SSH-manager contract test against current Pssh (parallel-ssh + probe mocked),
  including event-loop non-blocking assertion
- cluster-mon Makefile (ut + docker-build); repo Makefile ut delegates to it
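
To illustrate the collector contract tests listed above, here is a minimal sketch of a test double and one test. The class name `FakeSshManager` and the `exec_async` method come from the commit message; the return shape (a dict of host to output string), the `ERROR:`/`ABORT:` prefix convention, and the toy `collect_uptime` collector are all assumptions, not the cluster-mon code.

```python
import asyncio
import unittest
from typing import Dict, List


class FakeSshManager:
    """Hypothetical test double: returns canned per-host output, records commands."""

    def __init__(self, canned: Dict[str, str]):
        self.canned = canned
        self.commands: List[str] = []

    async def exec_async(self, cmd: str) -> Dict[str, str]:
        self.commands.append(cmd)
        return dict(self.canned)


async def collect_uptime(ssh) -> Dict[str, dict]:
    """Toy collector: parse uptime seconds, pass failure markers through."""
    raw = await ssh.exec_async("cat /proc/uptime")
    parsed: Dict[str, dict] = {}
    for host, out in raw.items():
        if out.startswith(("ERROR", "ABORT")):
            # Keep only the marker so callers can branch on status.
            parsed[host] = {"status": out.split(":", 1)[0]}
        else:
            parsed[host] = {"status": "OK", "uptime_s": float(out.split()[0])}
    return parsed


class CollectorContractTest(unittest.TestCase):
    def test_parses_ok_and_abort(self):
        fake = FakeSshManager({"n1": "12.5 40.0", "n2": "ABORT: host unreachable"})
        result = asyncio.run(collect_uptime(fake))
        self.assertEqual(result["n1"], {"status": "OK", "uptime_s": 12.5})
        self.assertEqual(result["n2"], {"status": "ABORT"})
        self.assertEqual(fake.commands, ["cat /proc/uptime"])
```

Pinning the exec_async-to-parsed-output mapping this way means the SSH backend can be swapped later while the same assertions keep guarding collector behavior.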

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Ignatious Johnson <ichristo@amd.com>
Phase 1 (adapter + TDD parity gate):
- Add app/core/cluster_ssh_manager.py wrapping MultiProcessPssh. Preserves
  cluster-mon's API (exec_async/exec/exec_cmd_list, get_*_hosts,
  refresh_host_reachability, recreate_client, destroy_clients and the
  host_list/reachable_hosts/unreachable_hosts attrs). Implements the parity
  behaviors not provided by the lib: ABORT-merge for pre-probe-unreachable
  hosts, all-ABORT short-circuit when nothing is reachable, and an exec_async
  offload point (asyncio.to_thread + lazy asyncio.Lock). Direct path runs the
  TCP pre-probe; jump-host path uses libssh2 proxy_* and skips it.
- Refactor test_ssh_manager_contract.py into SshManagerContractMixin so the
  same 10 assertions run against both the legacy Pssh and the new adapter,
  proving parity. 51 tests green.
- Makefile ut target installs cvs (--no-deps, editable) so the adapter import
  resolves without clobbering backend's parallel-ssh pin.
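
The three parity behaviors named in Phase 1 (ABORT-merge, all-ABORT short-circuit, and the exec_async offload point) can be sketched as below. This is an illustration under stated assumptions, not the ClusterSshManager in this PR: the class name `AdapterSketch`, the `run_blocking` callable standing in for the library call, the ABORT marker text, and the dict return shape are all hypothetical. `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
from typing import Callable, Dict, Iterable, List, Optional

ABORT = "ABORT: host unreachable"  # assumed marker text


class AdapterSketch:
    def __init__(
        self,
        hosts: Iterable[str],
        reachable: Iterable[str],
        run_blocking: Callable[[List[str], str], Dict[str, str]],
    ):
        reachable_set = set(reachable)
        self.host_list = list(hosts)
        self.reachable_hosts = [h for h in self.host_list if h in reachable_set]
        self.unreachable_hosts = [h for h in self.host_list if h not in reachable_set]
        self._run_blocking = run_blocking  # stand-in for the blocking library call
        self._lock: Optional[asyncio.Lock] = None  # created lazily, see exec_async

    def exec(self, cmd: str) -> Dict[str, str]:
        aborted = {h: ABORT for h in self.unreachable_hosts}
        if not self.reachable_hosts:
            # All-ABORT short-circuit: never call the library with no hosts.
            return aborted
        results = self._run_blocking(self.reachable_hosts, cmd)
        # ABORT-merge: callers see one entry per configured host.
        return {**results, **aborted}

    async def exec_async(self, cmd: str) -> Dict[str, str]:
        # Lazy creation keeps the lock from being built outside a running loop.
        if self._lock is None:
            self._lock = asyncio.Lock()
        async with self._lock:
            # Offload the blocking call so the event loop stays responsive.
            return await asyncio.to_thread(self.exec, cmd)
```

The lock serializes concurrent exec_async calls against the single underlying client, while the thread offload is what the event-loop non-blocking assertion in the contract test would exercise.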

Phase 2 (rewire main.py):
- Construct ClusterSshManager for both direct and jump-host branches in
  lifespan startup and reload_configuration; drop the Pssh/JumpHostPssh
  imports and the dead max_parallel handling. refresh/recreate/destroy call
  sites unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>
Refactor the image to a multi-stage build that builds the cvs wheel from
the repo-root build context, so `docker build` / `docker compose build`
works with no separate host wheel step. Move the build context to the repo
root, add a root .dockerignore to keep the context lean, ignore stray
build/ artifacts, and extend `make clean` to remove venv, build, and cache
artifacts. The docker-build target auto-detects daemon permissions and
retries with sudo.

Co-authored-by: Cursor <cursoragent@cursor.com>
The migration to ClusterSshManager (backed by cvs.lib.parallel) is complete and
main.py no longer constructs the legacy Pssh/JumpHostPssh classes. Delete those
two modules, drop the legacy Pssh arm of the SSH-manager contract test (keeping
the ClusterSshManager contract), and update README/testing docstrings to point
at cluster_ssh_manager.

Co-authored-by: Cursor <cursoragent@cursor.com>