Design: Flexible Rank Assignments for Cylinders #635
DLWoodruff
started this conversation in
Ideas
Flexible Rank Assignments for Cylinders
Status: Design Document (Draft)
Date: 2026-03-16
Branch: flexible_rank_assignments
Motivation
Currently, WheelSpinner enforces that every cylinder (hub and all spokes) receives the same number of MPI ranks and the same scenario
distribution. This is wasteful when different cylinders have different
computational requirements. For example:
from many ranks.
fewer iterations, so it could use fewer ranks (each handling more
scenarios).
ranks.
Allowing different rank counts per cylinder would let users allocate
MPI resources more efficiently, potentially reducing total wall-clock
time for the same number of processors.
User Interface
The user would specify a target ratio for each spoke relative to the
hub. The hub always serves as the reference (ratio 1.0). Example
With 14 total ranks and ratios hub:1.0, lagrangian:0.5, xhat:0.25,
the system would allocate ranks proportionally: 8 for the hub, 4 for the
Lagrangian spoke, 2 for the xhat spoke.
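The proportional allocation above can be sketched as follows. This is a minimal illustration, not the planned implementation: the function name allocate_ranks and the dict-based interface are hypothetical, and it assumes leftover ranks go to the cylinders with the largest fractional remainders (one way to realize the "round" option).

```python
import math


def allocate_ranks(total_ranks, ratios):
    """Split total_ranks across cylinders in proportion to ratios.

    ratios maps a cylinder name to its target ratio (hub = 1.0).
    Hypothetical helper: largest-remainder rounding keeps the counts
    summing to total_ranks.
    """
    weight = sum(ratios.values())
    exact = {name: total_ranks * r / weight for name, r in ratios.items()}
    counts = {name: math.floor(x) for name, x in exact.items()}
    # Hand leftover ranks to the cylinders with the largest remainders.
    leftover = total_ranks - sum(counts.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    if min(counts.values()) < 1:
        raise ValueError("ratios leave at least one cylinder with no ranks")
    return counts


counts = allocate_ranks(14, {"hub": 1.0, "lagrangian": 0.5, "xhat": 0.25})
```

With 14 ranks the ratios divide evenly, so no rounding is needed. With a total such as 15 the helper has to round, which is where a warning could be issued.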
Open questions on the interface:
Config, or specified in the spoke dict as a numeric field?
--spoke-ranks option that takes explicit counts instead of ratios? Ratios are more portable across different total
rank counts, but explicit counts give precise control.
count? Options: (a) round and warn, (b) error, (c) adjust ratios
to fit.
Current Architecture
This section describes the pieces that would need to change.
Rank Partitioning (spin_the_wheel.py)
WheelSpinner requires n_proc % n_spcomms == 0 and creates two communicators via MPI_Comm_split:
strata_comm groups rank i from each cylinder together (the inter-cylinder communication channel).
cylinder_comm groups all ranks within one cylinder (the intra-cylinder channel).
With equal rank counts, there is a clean 1-to-1 correspondence: strata_comm rank 0 is always the hub, rank 1 is always spoke 1, etc.
Scenario Distribution (spbase.py)
_calculate_scenario_ranks() distributes scenarios across self.n_proc ranks (the cylinder's rank count) using contiguous blocks. All cylinders currently have the same n_proc, so rank i in every cylinder gets the same scenarios.
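To make the equal-split partitioning concrete, here is a small sketch of how split colors could be derived from a global rank. It assumes ranks are interleaved across cylinders (integer division for the strata color, modulo for the cylinder color). The function name is hypothetical and the formulas are an assumption for illustration; they may not match spin_the_wheel.py exactly.

```python
def equal_split_colors(global_rank, n_proc, n_spcomms):
    """Return (strata_color, cylinder_color) for one global rank.

    Assumed layout: consecutive global ranks cycle through the
    cylinders, so cylinder 0 (the hub) owns ranks 0, n_spcomms, ...
    """
    if n_proc % n_spcomms != 0:
        raise ValueError("n_proc must be a multiple of n_spcomms")
    strata_color = global_rank // n_spcomms    # one group per rank index i
    cylinder_color = global_rank % n_spcomms   # one group per cylinder
    return strata_color, cylinder_color


# 6 ranks, 3 cylinders: ranks 0..2 form strata group 0, ranks 3..5 group 1.
colors = [equal_split_colors(r, 6, 3) for r in range(6)]
```

Under this layout the lowest global rank in each strata group belongs to cylinder 0, which is why strata_comm rank 0 is the hub. The divisibility check is the constraint the flexible design would remove.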
Buffer System (spwindow.py, spcommunicator.py)
Each rank creates an MPI RMA window with per-field buffers. Buffer sizes for "local" fields (NONANT, DUALS, BEST_XHAT, RECENT_XHATS, CROSS_SCENARIO_COST) are proportional to the rank's local scenario count. Buffer sizes for "global" fields (NONANT_LOWER_BOUNDS, NONANT_UPPER_BOUNDS) are proportional to the total nonant count and are the same on every rank.
On receive, _validate_recv_field() checks that the remote buffer size matches the local expectation. This check assumes both sides have the same local scenario count.
Communication Patterns
All inter-cylinder communication uses one-sided MPI (RMA) through
SPWindow. The communication graph is:
Hub sends to spokes:
NONANT (local-sized): current nonant values for PH iterates
DUALS (local-sized): W values (dual weights)
SHUTDOWN, BEST_OBJECTIVE_BOUNDS (scalars on rank 0 only)
Spokes send to hub:
OBJECTIVE_INNER_BOUND, OBJECTIVE_OUTER_BOUND (scalars)
BEST_XHAT (local-sized): best feasible solution found
RECENT_XHATS (local-sized, circular buffer)
Spoke-to-spoke (via RMA windows, not point-to-point):
BEST_XHAT and RECENT_XHATS from an InnerBoundSpoke
NONANT_LOWER_BOUNDS and NONANT_UPPER_BOUNDS (global-sized, already works)
NONANT and CROSS_SCENARIO_COST from a source
The local-sized fields are the ones affected by unequal rank counts.
Design: Multi-Rank Mapping
The core idea is that when a rank in one cylinder needs data from
another cylinder with a different rank count, it reads from multiple
remote ranks and assembles the result (or reads from one remote rank
and extracts a subset).
Overlap Maps
At startup, each rank computes a static mapping for each peer cylinder:
which remote ranks have scenarios that overlap with its own local
scenarios, and at what offsets within those remote buffers.
The existing scen_names_to_ranks(n_proc) function already computes scenario-to-rank mappings. We would call it once per cylinder's rank
count to get each cylinder's distribution, then compute pairwise
overlaps.
For example, with a 4-rank hub and a 2-rank spoke, both handling 10
scenarios:
======== ================== ==================
         Hub (4 ranks)      Spoke (2 ranks)
======== ================== ==================
Rank 0   scen0, scen1       scen0--scen4
Rank 1   scen2, scen3       scen5--scen9
Rank 2   scen4, scen5
Rank 3   scen6--scen9
======== ================== ==================
Hub rank 0 needs to read from spoke rank 0 (which has scen0--scen4),
extracting only the portion for scen0--scen1. Hub rank 2 straddles the
spoke's rank boundary: it reads scen4 from spoke rank 0 and scen5 from
spoke rank 1.
The overlap map for hub rank 0 reading from the spoke would be
For hub rank 2
For hub rank 3
These maps are computed once at startup and reused for every
communication call. The data structure could be
Note: offsets are in nonant units (not scenario units) because
different scenarios could have different numbers of nonants in
multi-stage problems (though for two-stage problems they are uniform).
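A minimal sketch of the overlap computation for the 4-rank hub and 2-rank spoke example. The name OverlapSegment comes from the implementation plan; its fields, the compute_overlap helper, and the choice of 3 nonants per scenario are assumptions made for illustration. Offsets are in nonant units, as noted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverlapSegment:
    remote_rank: int    # rank within the peer cylinder
    remote_offset: int  # start in the remote local-sized buffer (nonants)
    local_offset: int   # start in the assembled local buffer (nonants)
    count: int          # number of nonant values in the segment


def compute_overlap(local_scens, remote_dist, nonants_per_scen):
    """Build the segments one rank needs from a peer cylinder.

    local_scens: this rank's scenario names, in buffer order.
    remote_dist: per remote rank, its scenario names in buffer order.
    """
    # Locate every remote scenario: name -> (rank, offset in that buffer).
    where = {}
    for rank, scens in enumerate(remote_dist):
        offset = 0
        for name in scens:
            where[name] = (rank, offset)
            offset += nonants_per_scen[name]

    segments = []
    local_offset = 0
    for name in local_scens:
        rank, remote_offset = where[name]
        n = nonants_per_scen[name]
        last = segments[-1] if segments else None
        if (last is not None and last.remote_rank == rank
                and last.remote_offset + last.count == remote_offset):
            # Contiguous on the same remote rank: extend the segment.
            segments[-1] = OverlapSegment(
                rank, last.remote_offset, last.local_offset, last.count + n)
        else:
            segments.append(OverlapSegment(rank, remote_offset, local_offset, n))
        local_offset += n
    return segments


names = [f"scen{i}" for i in range(10)]
hub = [names[0:2], names[2:4], names[4:6], names[6:10]]
spoke = [names[0:5], names[5:10]]
nonants = {name: 3 for name in names}  # assumed uniform (two-stage)
hub_rank0 = compute_overlap(hub[0], spoke, nonants)
hub_rank2 = compute_overlap(hub[2], spoke, nonants)
hub_rank3 = compute_overlap(hub[3], spoke, nonants)
```

Under these assumptions hub rank 0 gets a single segment from spoke rank 0, hub rank 2 gets one segment from each spoke rank, and hub rank 3 gets a single segment from spoke rank 1 that starts after scen5.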
Window Topology Options
Option A: Single global window on MPI_COMM_WORLD
Every rank publishes its buffers in one shared window. Readers address
remote ranks by their global rank. The strata_buffer_layouts (currently exchanged via strata_comm.allgather) would instead be exchanged on MPI_COMM_WORLD or a dedicated intercommunicator.
Pros:
MPI_Win_create call.
Cons:
MPI_Win_lock granularity is per-window; concurrent accesses to different cylinders' data may contend.
communicate with most other ranks.
Option B: Per-cylinder-pair intercommunicators with separate windows
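For each pair of cylinders that need to communicate, create an MPI_Intercomm and a window on it. For example, hub-lagrangian and hub-xhat would each get their own window. Note that MPI_Win_create requires an intracommunicator, so each pair would in practice need a merged intracommunicator (e.g., via MPI_Intercomm_merge) to host its window.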
Pros:
Cons:
With spoke-to-spoke communication, this could be O(n^2) in the
worst case (though in practice most spokes only talk to the hub
plus at most one other spoke).
Option C: Keep strata_comm but make it asymmetric
Create strata_comm groupings where ranks from different-sized cylinders are grouped together. Ranks without a counterpart in a
smaller cylinder would be grouped with the "nearest" rank in that
cylinder.
For example, with 4 hub ranks and 2 spoke ranks, strata groups would
be
Pros:
Cons:
issues with MPI (a rank can only be in one communicator of a given
color).
MPI_Comm_split because a rank can only appear once per split. Would need to use MPI_Comm_create_group or manual point-to-point addressing instead.
Option D: Replace strata_comm with direct RMA addressing
Abandon strata_comm entirely. Use MPI_COMM_WORLD for the window, and have each rank directly address remote ranks by their global rank. The overlap maps provide the addressing information.
This is essentially Option A but with the explicit framing that strata_comm is removed rather than modified.
Pros:
Cons:
SPCommunicator and SPWindow to work without strata_comm.
strata_comm.allgather) would need to use a different mechanism (e.g., fullcomm.allgather followed by filtering).
Recommendation: Option D is the cleanest long-term solution.
Option A is equivalent but Option D better describes the intent. The window is on fullcomm (or a subset), and each rank knows which global ranks to read from via the overlap maps.
Multi-Source Read Assembly
The current get_receive_buffer() reads one contiguous buffer from one remote rank. With multi-rank mapping, it would need to:
MPI_Get for that segment from the appropriate remote rank.
buffer.
Schematically
This requires SPWindow.get() to support partial reads (offset + count within a field), which is straightforward with MPI_Get displacement parameters.
For the common case where both cylinders have the same rank count,
the overlap map has exactly one segment per peer rank and the offsets
are identity — so the behavior degenerates to the current code.
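The assembly loop can be sketched without MPI by simulating remote buffers as in-memory lists. Everything here is a stand-in: Segment, make_fake_get, and assemble are hypothetical names, and in real code each partial_get call would be an MPI_Get with a target displacement. The bounds check illustrates segment-level validation (the requested segment must fit within the remote buffer).

```python
from typing import NamedTuple


class Segment(NamedTuple):
    remote_rank: int
    remote_offset: int
    local_offset: int
    count: int


def make_fake_get(remote_buffers):
    """Return a partial_get(rank, offset, count) over in-memory lists."""
    def partial_get(rank, offset, count):
        buf = remote_buffers[rank]
        if offset < 0 or offset + count > len(buf):
            raise ValueError("segment does not fit in the remote buffer")
        return list(buf[offset:offset + count])
    return partial_get


def assemble(segments, local_len, partial_get):
    """Fill a local buffer from one partial read per segment."""
    out = [0.0] * local_len
    for seg in segments:
        chunk = partial_get(seg.remote_rank, seg.remote_offset, seg.count)
        out[seg.local_offset:seg.local_offset + seg.count] = chunk
    return out


# Two spoke ranks, 5 scenarios each, 3 nonants per scenario (assumed).
spoke_buffers = {
    0: [float(i) for i in range(15)],
    1: [100.0 + i for i in range(15)],
}
# Hub rank 2 holds scen4 (end of spoke rank 0) and scen5 (start of rank 1).
hub_rank2_map = [Segment(0, 12, 0, 3), Segment(1, 0, 3, 3)]
assembled = assemble(hub_rank2_map, 6, make_fake_get(spoke_buffers))
```

When both cylinders have the same rank count the map holds one segment with zero offsets, and the loop reduces to a single full-buffer read.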
Write-ID Coherence
The current system uses a write_id integer appended to each buffer. When a sender writes new data, it increments write_id. The receiver checks whether the write_id has changed since the last read to determine if the data is "new."
With multi-source reads, a receiver reads from multiple remote ranks
in the same cylinder. These ranks may not have updated their buffers
at exactly the same time (the system is asynchronous). This raises
the question: should we require all ranks in a cylinder to have the
same write_id before accepting the data?
The answer depends on the field. Some fields carry coupled data that
must come from the same iteration to be mathematically valid, while
others are independent candidates that tolerate staleness.
Per-field coherence requirements
Fields requiring strict coherence:
NONANT (xbar values from the hub) and DUALS (W values from the hub) are coupled: they define the Lagrangian subproblem at a
specific PH iteration. If a spoke assembles xbar from iteration k
for some scenarios and iteration k+1 for others, the dual function
is evaluated at an inconsistent dual point and the resulting bound
is invalid.
In practice this is manageable because the hub is synchronous PH:
all hub ranks complete the same iteration before writing their
buffers, so their write_id values naturally agree after each PH iteration. The receiver just needs to verify they match before
accepting the assembled data.
Fields tolerating staleness (relaxed coherence):
BEST_XHAT: a feasible solution candidate. Even if assembled from slightly different spoke iterations, it is still a valid candidate; the receiver evaluates it fresh.
RECENT_XHATS: candidate points for FWPH column generation. Same reasoning as BEST_XHAT.
NONANT_LOWER_BOUNDS, NONANT_UPPER_BOUNDS: bound tightening is monotonic, so staleness only means slightly looser bounds. These fields are already global-sized and unaffected by rank asymmetry.
Scalar fields (OBJECTIVE_INNER_BOUND, OBJECTIVE_OUTER_BOUND): already fine.
Coherence options
Option 1: Per-field strict check
For fields that require coherence (NONANT, DUALS): read all write_id values first (cheap: one int per segment), check they match, then read the data. If they don't match, retry later.
For fields that tolerate staleness (BEST_XHAT, etc.): accept data from each source rank independently. Track write_id per source rank.
Pros: correct semantics for each field type. No unnecessary stalls
for fields that don't need coherence.
Cons: two code paths for multi-source reads (strict and relaxed).
Option 2: Cylinder-wide iteration counter (synchronized)
Each cylinder maintains a shared iteration counter (via cylinder_comm.Allreduce after each update). Receivers check this counter for fields requiring coherence.
Pros: clean coherence semantics.
Cons: adds synchronization within the cylinder, which is exactly
what the async design tries to avoid. Only acceptable for
synchronous algorithms (PH, not APH).
Option 3: Accept staleness everywhere (fully relaxed)
Accept data from each source rank independently for all fields.
Pros: simplest implementation, no stalls.
Cons: Lagrangian bounds may be invalid when assembled from mixed
iterations.
Recommendation: Option 1 (per-field strict check). The hub is synchronous PH, so its ranks naturally have matching write_id values after each iteration, which makes the strict check almost free (just verify, rarely retry). For spoke-to-hub fields like BEST_XHAT, use the relaxed path. This gives correct Lagrangian bounds without
sacrificing async performance where it isn't needed.
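The two code paths of the per-field check can be sketched as below. This is an illustration under stated assumptions, not the planned API: try_read, STRICT_FIELDS, and the reader callables are hypothetical, and remote reads are simulated with dictionaries rather than RMA calls.

```python
STRICT_FIELDS = {"NONANT", "DUALS"}  # coupled fields from the hub


def try_read(field, source_ranks, read_write_id, read_data, last_seen):
    """Return {source_rank: data} for accepted data, or None.

    read_write_id(rank) and read_data(rank) stand in for remote reads.
    last_seen maps source rank to the last accepted write_id and is
    updated in place when data is accepted.
    """
    ids = {r: read_write_id(r) for r in source_ranks}
    if field in STRICT_FIELDS:
        if len(set(ids.values())) != 1:
            return None  # sources are at mixed iterations: retry later
        if all(ids[r] == last_seen.get(r) for r in source_ranks):
            return None  # consistent, but nothing new since last read
        data = {r: read_data(r) for r in source_ranks}
        last_seen.update(ids)
        return data
    # Relaxed path: accept each source whose write_id has advanced.
    data = {}
    for r in source_ranks:
        if ids[r] != last_seen.get(r):
            data[r] = read_data(r)
            last_seen[r] = ids[r]
    return data or None


write_ids = {0: 5, 1: 4}            # source rank 1 lags by one write
buffers = {0: [1.0], 1: [2.0]}
seen = {}
mixed = try_read("NONANT", [0, 1], write_ids.get, buffers.get, seen)
```

Here the strict read is rejected because the two sources disagree, and seen stays empty. One design point the sketch leaves out: a write could land between reading the ids and reading the data, so an implementation might re-check the ids after the data read.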
Impact on Existing Components
spin_the_wheel.py
n_proc % n_spcomms == 0 check.
MPI_Comm_split with a rank assignment algorithm that respects per-cylinder rank counts.
cylinder_comm creation still uses MPI_Comm_split (all ranks in the same cylinder get the same color).
strata_comm is either removed (Option D) or replaced with a global window communicator.
communicator_list[strata_rank] indexing must change because strata_rank no longer has a fixed meaning. Each rank needs to know its cylinder index directly.
spbase.py
_calculate_scenario_ranks() already works with any n_proc. No change needed; each cylinder just calls it with its own rank count.
spwindow.py
FieldLengths stays the same (it's per-rank, based on local scenarios).
SPWindow must support partial get() calls (offset + count).
cylinders (not just strata_comm).
expected size may differ from the sender's buffer size, and that's OK as long as the requested segment fits within the sender's buffer.
spcommunicator.py
register_receive_fields() must build overlap maps instead of assuming 1-to-1 rank correspondence.
get_receive_buffer() must support multi-source assembly.
put_send_buffer() is unchanged (each rank writes its own data).
_validate_recv_field() must be relaxed or replaced with segment-level validation (check that each requested segment fits within the remote buffer, not that the total sizes match).
synchronize parameter in get_receive_buffer() uses cylinder_comm.Barrier() and Allreduce; this still works within a cylinder but the cross-cylinder sync semantics change.
hub.py and spoke.py
send_nonants(), update_nonants(), and similar methods currently iterate over local scenarios and pack/unpack linearly. The packing is unchanged (each rank packs its own data). The unpacking on the receiver side must use the overlap map to correctly place data from multi-source reads.
cfg_vanilla.py and config.py
shared_options() may need to carry the rank ratio information so that SPCommunicator can compute overlap maps at init time.
generic_cylinders.py
WheelSpinner via the hub/spoke dicts.
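One way each rank could learn its cylinder index directly is sketched below. It assumes global ranks are assigned to cylinders in contiguous blocks, hub first; that layout and the function name cylinder_of are assumptions for illustration, not a decided design.

```python
from bisect import bisect_right
from itertools import accumulate


def cylinder_of(global_rank, rank_counts):
    """Map a global rank to (cylinder_index, rank_within_cylinder).

    rank_counts lists the number of ranks per cylinder, hub first.
    Assumed layout: each cylinder owns a contiguous block of ranks.
    """
    ends = list(accumulate(rank_counts))  # exclusive end of each block
    if not 0 <= global_rank < ends[-1]:
        raise ValueError("global_rank outside the allocated ranks")
    cylinder = bisect_right(ends, global_rank)
    start = ends[cylinder] - rank_counts[cylinder]
    return cylinder, global_rank - start


# 14 ranks split 8 / 4 / 2 between hub, lagrangian, and xhat cylinders.
layout = [cylinder_of(r, [8, 4, 2]) for r in range(14)]
```

The cylinder index could serve as the color for the MPI_Comm_split call that creates cylinder_comm, and as the replacement for strata_rank when indexing communicator_list.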
Phased Implementation Plan
Phase 1: Infrastructure
counts and scenario lists, produce OverlapSegment lists).
SPWindow.get().
WheelSpinner.
WheelSpinner to support unequal cylinder sizes.
Phase 2: Communication layer
strata_comm-based buffer layout exchange with fullcomm-based exchange.
get_receive_buffer() using overlap maps.
_validate_recv_field() to check per-segment instead of total.
ranks.
Phase 3: Integration and testing
(if any).
problem?
Phase 4: Spoke-to-spoke communication
BEST_XHAT from an InnerBoundSpoke with different rank count).
Phase 5: APH support
write-ID coherence model.
Backward Compatibility
When all rank ratios are 1.0 (the default), the system must behave
identically to the current implementation. The overlap maps degenerate
to single-segment identity mappings, and multi-source reads become
single-source reads. This should be verified by running the full
existing test suite with the new code and default ratios.
Open Questions
Should the minimum rank count for any cylinder be 1? Some spoke
types may require at least 2 ranks for internal collective
operations.
How should all_scenario_names be handled? Currently it is the same list for all cylinders. With different rank counts, each
cylinder still handles all scenarios — just distributed differently.
But should we also support cylinders that handle a subset of
scenarios? (That would be a much larger change and is out of scope
for this design.)
What is the interaction with proper bundles? Bundles change the
scenario structure, so rank ratios would apply to the bundled
scenario count, not the original count.
For the FWPH spoke-to-spoke case, the FWPH spoke reads RECENT_XHATS, which is a circular buffer of multiple xhat solutions. Each entry is local-sized. The multi-source read
would need to be applied to each entry in the circular buffer.
Is this feasible, or should FWPH be restricted to equal rank
counts?
Memory overhead: with Option D (global window on fullcomm), the
window includes buffers from all ranks in all cylinders. For
large problems with many ranks, this could be significant. Should
we provide a way to limit which ranks participate in the window?