Align Enhanced SIM-ONE Pipeline With 7‑Epoch Comprehensive Dataset Workflow#40
Walkthrough

Updates span documentation, training defaults, and core model internals. Training epochs/time and dataset paths are revised. Config broadens model/training hyperparameters and adds a new field. The core transformer gains a new Enhanced model, RoPE attention, and a refactored MoE. Ancillary modules adjust typing and governance defaults. Launch/preflight/train scripts align to the new paths and epochs.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Shell as launch_simone_enhanced.sh
    participant Preflight as enhanced_preflight.py
    participant Trainer as enhanced_train.py
    participant Config as PrioritaryConfig
    participant Model as EnhancedSIMONEModel
    participant Data as mvlm_comprehensive_dataset
    User->>Shell: Run launch script
    Shell->>Preflight: --data_dir ./mvlm_training_dataset_complete/mvlm_comprehensive_dataset
    Preflight-->>Shell: OK/Diagnostics
    Shell->>Trainer: Start training (epochs=7, lr=3e-4, batch=12)
    Trainer->>Config: Load hyperparameters (updated dims/epochs/lr)
    Trainer->>Data: Load dataset
    Trainer->>Model: Initialize (RoPE, MoE, Governance)
    loop For each epoch/batch
        Trainer->>Model: forward(input_ids, masks, policy_guidance, use_cache)
        Model->>Model: RoPE Q/K, governance attention, MoE FFN
        Model-->>Trainer: logits, governance outputs
        Trainer->>Trainer: loss/backprop/opt (grad accum)
    end
    Trainer-->>User: Final checkpoint ./models/simone_enhanced
```
```mermaid
sequenceDiagram
    autonumber
    participant App as Inference App
    participant Model as EnhancedSIMONEModel
    participant Attn as EnhancedGovernanceAttention
    participant MoE as MoELayer
    participant Gov as GovernanceAggregator
    App->>Model: generate(input_ids, use_cache=True, prophetic_state)
    Model->>Model: _precompute_prophetic_modulations
    loop steps until max_length/EOS
        Model->>Attn: forward(Q,K,V, masks, prophetic_state, cache)
        Attn-->>Model: context, policy/memory/trace
        Model->>MoE: feedforward(x)
        MoE-->>Model: transformed x (+load-balance penalty tracked)
        Model->>Gov: aggregate per-layer governance
        Model-->>App: next token (sampling: temp/top_k/top_p adaptive)
    end
    Model->>Model: disable_cache()
```
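For readers unfamiliar with the rotation step named in the diagrams ("RoPE Q/K"), here is a minimal NumPy sketch of rotary position embeddings (the function name and split-half pairing are illustrative, not the repo's API): channel pairs are rotated by position-dependent angles, so query/key dot products depend only on relative offsets.

```python
import numpy as np

def rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate channel pairs of x (seq, dim) by position-dependent angles."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)        # (half,) per-pair frequencies
    angles = positions[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Standard 2D rotation applied to each (x1, x2) channel pair
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.default_rng(0).standard_normal((4, 8))
q_rot = rope(q, np.arange(4))
# Rotations preserve per-position norms
assert np.allclose(np.linalg.norm(q_rot, axis=-1), np.linalg.norm(q, axis=-1))
```

The relative-position property is what makes RoPE compatible with KV caching: shifting both positions by the same offset leaves attention scores unchanged.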
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~70 minutes
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
def _precompute_prophetic_modulations(
    self,
    prophetic_state: Optional['PropheticSingularityState'],
    seq_len: int,
    device: torch.device,
    dtype: torch.dtype,
    past_length: int = 0
) -> Optional[
    Tuple[
        PropheticSingularityState,
        List[torch.Tensor],
        PropheticSingularityState
    ]
]:
    """
    Pre-compute all layer modulations once for efficiency.
    Avoids repeated computation in each layer.
    """
    if prophetic_state is None:
        return None

    # Check cache validity
    total_len = seq_len + past_length
    cache_key = (id(prophetic_state), total_len, past_length, device, dtype)
    if (self._precomputed_prophetic_cache is not None and
            self._precomputed_prophetic_cache[0] == cache_key):
        return self._precomputed_prophetic_cache[1]

    # Align state to total sequence length (past + current tokens)
    aligned_total = prophetic_state.align_to_length(total_len).to(
        device=device,
        dtype=dtype
    )

    start_idx = max(total_len - seq_len, 0)

    # Slice helper to preserve normalization metadata
    def _slice_state(
        state: PropheticSingularityState,
        start: int
    ) -> PropheticSingularityState:
        if start == 0 and seq_len == total_len:
            return state

        return PropheticSingularityState(
            intensity=state.intensity[..., start:],
```
[P1] Import removed for runtime use of PropheticSingularityState
The new `_precompute_prophetic_modulations` helper creates and returns `PropheticSingularityState` instances, but the module now only imports that type inside a `TYPE_CHECKING` guard. When `prophetic_state` is passed at runtime, the first call to `_slice_state` will raise `NameError: PropheticSingularityState is not defined`, breaking any training or generation run that enables prophetic governance. The fix is to import the class unconditionally (or reference it via the object passed in) so the name exists at execution time.
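The failure mode is easy to reproduce in isolation: `typing.TYPE_CHECKING` is `False` when a program actually runs, so a name imported only under that guard is undefined at call time. A minimal standalone sketch, using `decimal.Decimal` as a stand-in for the guarded class:

```python
import typing

# typing.TYPE_CHECKING is False at runtime, so this import never executes.
if typing.TYPE_CHECKING:
    from decimal import Decimal  # stand-in for PropheticSingularityState

def make_value(x):
    # References the guarded name at runtime, just like _slice_state does
    return Decimal(x)

try:
    make_value("1.5")
    raised = False
except NameError:
    raised = True

assert typing.TYPE_CHECKING is False
assert raised  # the guarded import left the name undefined
```

Static type checkers still see the guarded import, which is why this class of bug passes mypy/pyright and only surfaces when the code path actually runs.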
```python
# Apply load balancing penalty during training
if self.training and self.load_balancing_weight > 0:
    # Encourage balanced expert usage
    expert_probs = F.softmax(router_logits, dim=-1)
    expert_usage_batch = expert_probs.mean(dim=0)

    # Update running statistics
    self.expert_usage = 0.9 * self.expert_usage + 0.1 * expert_usage_batch
    self.total_tokens += num_tokens

    # Add load balancing loss (encourages uniform distribution)
    target_usage = 1.0 / self.num_experts
    load_balance_loss = ((self.expert_usage - target_usage) ** 2).sum()
    router_logits = router_logits - self.load_balancing_weight * load_balance_loss

# Get top-k experts for each token
top_k_logits, top_k_indices = torch.topk(
    router_logits, self.num_experts_per_token, dim=-1
)

# Softmax over selected experts
top_k_weights = F.softmax(top_k_logits, dim=-1)

# OPTIMIZED: Vectorized expert processing
output = torch.zeros_like(x_flat)

# Create routing tensors for efficient batching
for expert_idx in range(self.num_experts):
    # Find all tokens and positions where this expert is selected
    expert_positions = (top_k_indices == expert_idx)

    if expert_positions.any():
        # Get token indices and k-positions for this expert
        token_indices, k_positions = expert_positions.nonzero(as_tuple=True)

        if len(token_indices) > 0:
            # Batch process all tokens for this expert
            expert_tokens = x_flat[token_indices]
            expert_output = self.experts[expert_idx](expert_tokens)

            # Get corresponding weights
            expert_weights = top_k_weights[token_indices, k_positions].unsqueeze(-1)

            # Accumulate weighted outputs
            output.index_add_(0, token_indices, expert_weights * expert_output)

return output.view(batch_size, seq_len, dim)


def get_load_balancing_loss(self) -> torch.Tensor:
    if self._load_balancing_loss is None:
        return self.router.weight.new_zeros(())
    return self._load_balancing_loss


def get_last_assignment_counts(self) -> Optional[torch.Tensor]:
    """Return the most recent per-expert token counts if available."""
    return self._last_assignment_counts

    """Get the current load balancing loss for regularization."""
    if self.total_tokens > 0:
        target_usage = 1.0 / self.num_experts
        return ((self.expert_usage - target_usage) ** 2).sum()
    return torch.tensor(0.0, device=self.expert_usage.device)
```
[P1] Load-balancing loss no longer backpropagates in MoELayer
The revised MoE layer computes `load_balance_loss` from running buffers (`self.expert_usage`) and simply subtracts that scalar from the router logits, then `get_load_balancing_loss` returns a detached tensor built from the same buffers. Because the buffers are not part of the forward computation, the returned loss has `requires_grad=False` and cannot provide gradients to the router, contradicting the existing tests that expect a differentiable auxiliary loss and allowing expert collapse during training. The method should compute a loss based on the current batch logits/probabilities so it participates in autograd.
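For contrast, a differentiable auxiliary loss can be sketched as follows. This is the Switch-Transformer-style formulation computed from the current batch, with illustrative names rather than the repo's code: the hard dispatch fractions carry no gradient, but the mean router probabilities do, so the product penalizes imbalance while remaining differentiable with respect to the router.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, num_experts_per_token: int) -> torch.Tensor:
    """Switch-style aux loss over a batch of router logits (tokens, experts)."""
    num_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)                     # (tokens, experts)
    top_k = router_logits.topk(num_experts_per_token, dim=-1).indices
    dispatch = F.one_hot(top_k, num_experts).sum(dim=1).float()  # hard 0/1 routing, no grad
    fraction_routed = dispatch.mean(dim=0)                       # per-expert dispatch fraction
    mean_prob = probs.mean(dim=0)                                # gradient flows through here
    return num_experts * torch.sum(fraction_routed * mean_prob)

logits = torch.randn(32, 4, requires_grad=True)
loss = load_balancing_loss(logits, num_experts_per_token=2)
loss.backward()
assert loss.requires_grad
assert logits.grad is not None  # gradients reach the router inputs
```

The loss is minimized when routing is uniform, so adding it (scaled by a small weight) to the task loss discourages expert collapse without detaching the router from autograd.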
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (10)
enhanced_preflight.py (1)
12-104: Usage string still points to the old dataset root

We bumped the `--data_dir` default to `./mvlm_training_dataset_complete/mvlm_comprehensive_dataset`, but the Usage block at the top still instructs people to pass the old root directory. That mismatch will send them to the wrong path unless they read the argparse defaults. Please update the example to mirror the new default. Suggested edit:

```diff
-  --data_dir ./mvlm_training_dataset_complete \
+  --data_dir ./mvlm_training_dataset_complete/mvlm_comprehensive_dataset \
```

train_all_models.py (1)
151-153: Bug: cwd + script_path results in a duplicated path

You set `cwd` to the script's parent ("SIM-ONE Training") but still pass a path including that parent. This resolves to "SIM-ONE Training/SIM-ONE Training/enhanced_train.py" and will fail. Either pass only the basename or run from the repo root.

```diff
-        cmd = ['python3', script_path] + args
+        script_abspath = str(Path(script_path).resolve())
+        cmd = ['python3', script_abspath] + args
@@
-            env=env,
-            cwd=Path(script_path).parent if '/' in script_path else '.'
+            env=env,
+            cwd='.'
```

Also applies to: 175-176
agents.md (1)
96-103: Align import names with exported symbols
Replace `AdvancedBPETokenizer` with `BiblicalBPETokenizer` and `ComprehensiveTrainingLoss` with `ComprehensiveBiblicalLoss` to match the actual classes in `prioritary_mvlm`.

```diff
-from prioritary_mvlm import EnhancedPrioritaryTrainer, AdvancedBPETokenizer
-from prioritary_mvlm.advanced_losses import ComprehensiveTrainingLoss
+from prioritary_mvlm import EnhancedPrioritaryTrainer, BiblicalBPETokenizer
+from prioritary_mvlm.advanced_losses import ComprehensiveBiblicalLoss
```

SIM-ONE Training/simone_transformer/attention_cache.py (2)
116-126: Causal-mask check is a no-op (always true) → cache-key collisions
`if attention_mask.shape[-2:] == attention_mask.shape[-2:]:` is tautologically true, so all masks are treated as "causal", collapsing keys and returning wrong cached patterns. Apply this fix to always hash the actual mask contents (simple and correct):

```diff
-        # For causal masks, just use shape since they're deterministic
-        if attention_mask.shape[-2:] == attention_mask.shape[-2:]:  # Square mask
-            return f"causal_{attention_mask.shape[-1]}"
-
-        # For custom masks, hash the pattern
-        return hashlib.md5(attention_mask.detach().cpu().numpy().tobytes()).hexdigest()[:8]
+        # Hash the mask content to avoid collisions (works for causal/custom masks)
+        tensor = attention_mask.detach().contiguous().to(device="cpu")
+        return hashlib.md5(tensor.numpy().tobytes()).hexdigest()[:8]
```
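The content-hash approach is easy to check in isolation: two masks with the same shape but different patterns must yield different keys, while re-hashing identical contents is stable. A small sketch with a NumPy stand-in for the tensor (`mask_key` is illustrative, not the repo's helper):

```python
import hashlib
import numpy as np

def mask_key(mask: np.ndarray) -> str:
    # Hash the actual bytes so same-shape masks with different patterns differ
    return hashlib.md5(np.ascontiguousarray(mask).tobytes()).hexdigest()[:8]

causal = np.tril(np.ones((4, 4), dtype=np.float32))
custom = np.ones((4, 4), dtype=np.float32)  # same shape, different pattern

assert mask_key(causal) == mask_key(causal.copy())  # deterministic across copies
assert mask_key(causal) != mask_key(custom)         # no shape-only collision
```

The shape-only scheme being replaced would have returned the same `causal_4` key for both of these masks, which is exactly the collision the review describes.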
92-114: Signature concatenation mixes 2D and 1D tensors → runtime error
`torch.cat(signatures)` will fail because `policy_sig`/`memory_sig` are 2D (2×F) while `kingdom_sig` is 1D (S). Flatten first.

```diff
-        if policy_logits is not None:
-            # Use mean and std as signature
-            policy_sig = torch.stack([
-                policy_logits.mean(dim=(0, 1)),
-                policy_logits.std(dim=(0, 1))
-            ])
-            signatures.append(policy_sig)
+        if policy_logits is not None:
+            policy_sig = torch.stack(
+                [policy_logits.mean(dim=(0, 1)), policy_logits.std(dim=(0, 1))]
+            ).reshape(-1)
+            signatures.append(policy_sig)
@@
-        if memory_signals is not None:
-            memory_sig = torch.stack([
-                memory_signals.mean(dim=(0, 1)),
-                memory_signals.std(dim=(0, 1))
-            ])
-            signatures.append(memory_sig)
+        if memory_signals is not None:
+            memory_sig = torch.stack(
+                [memory_signals.mean(dim=(0, 1)), memory_signals.std(dim=(0, 1))]
+            ).reshape(-1)
+            signatures.append(memory_sig)
@@
-        if prophetic_state is not None:
-            # Use kingdom flow as signature
-            kingdom_sig = prophetic_state.kingdom_flow.mean(dim=0)
-            signatures.append(kingdom_sig)
+        if prophetic_state is not None:
+            kingdom_sig = prophetic_state.kingdom_flow.mean(dim=0).reshape(-1)
+            signatures.append(kingdom_sig)
@@
-        if signatures:
-            return torch.cat(signatures)
+        if signatures:
+            return torch.cat(signatures).float()
```

SIM-ONE Training/simone_transformer/shared_governance.py (3)
158-163: Don’t create layers inside forward; register once (policy guidance proj)Allocating
Allocating `nn.Linear` per call leaks parameters and breaks optimization.

```diff
 class PolicyHead(nn.Module):
@@
     def __init__(self, governance_dim: int, hidden_dim: int, num_heads: int):
         super().__init__()
@@
         self.pattern_controllers = nn.ModuleList([
             nn.Linear(governance_dim, 1) for _ in range(num_heads)
         ])
+        # Project external guidance once; infer in_features lazily
+        self.guidance_proj = nn.LazyLinear(self.governance_dim)
@@
-        if policy_guidance is not None:
-            # Integrate external guidance (project to governance dim)
-            guidance_proj = nn.Linear(policy_guidance.size(-1), self.governance_dim).to(shared_features.device)
-            policy_input = shared_features + guidance_proj(policy_guidance)
+        if policy_guidance is not None:
+            # Integrate external guidance (project to governance dim)
+            policy_input = shared_features + self.guidance_proj(policy_guidance)
         else:
             policy_input = shared_features
```
236-241: Same issue: dynamic layer creation in memory pathRegister a single projection layer; use LazyLinear to infer input size.
class MemoryHead(nn.Module): @@ self.memory_to_weights = nn.Linear(governance_dim, num_heads) + self.context_proj = nn.LazyLinear(self.governance_dim) @@ - if memory_context is not None: + if memory_context is not None: # Project memory context to governance dimension if needed - if memory_context.size(-1) != self.governance_dim: - context_proj = nn.Linear(memory_context.size(-1), self.governance_dim).to(shared_features.device) - memory_context = context_proj(memory_context) + if memory_context.size(-1) != self.governance_dim: + memory_context = self.context_proj(memory_context)
48-51: Add a precondition that governance_dim is divisible by 4
Raise a `ValueError` if `self.governance_dim % 4 != 0` before instantiating `nn.MultiheadAttention`, so the clearer message fires ahead of PyTorch's "embed_dim must be divisible by num_heads" error.

```diff
+        if self.governance_dim % 4 != 0:
+            raise ValueError(f"governance_dim ({self.governance_dim}) must be divisible by 4")
         self.governance_coordination = nn.MultiheadAttention(
             self.governance_dim, num_heads=4, batch_first=True
         )
```

claude.md (1)
10-12: Inconsistent total duration vs per-model times

The header says "5–7 hours total," but Enhanced SIM-ONE alone is ~6–7 hours and GPT-2 is 2–3 hours. Adjust the header to ~8–10 hours for both models.

```diff
-**Duration**: 5-7 hours total training time
+**Duration**: ~8–10 hours total for both models (2–3h GPT‑2 + 6–7h Enhanced SIM‑ONE)
```

H200_SETUP_README.md (1)
139-145: Documented TORCH_COMPILE flag is unused

The `TORCH_COMPILE` env var (H200_SETUP_README.md:140) isn't referenced anywhere in the codebase. Either implement its handling (e.g., conditionally wrap your `torch.compile(…)` calls in `if os.getenv("TORCH_COMPILE")`) or remove it from the README to avoid confusing users.
🧹 Nitpick comments (18)
READY_FOR_H200.md (1)
53-75: Align training time messaging

Step 2 now promises ~6‑7 hours for the run, but "Important Notes" item 4 still claims 6‑9 hours for all three models. That contradiction will confuse folks trying to budget GPU time, especially since `train_all_models.py` now drives only the enhanced run. Please reconcile the wording (e.g., replace the note with the updated duration or clarify the scenario). Apply this diff to keep the guidance consistent:

```diff
-4. **6-9 hours total training time** for all three models
+4. **~6-7 hours total training time** for the enhanced SIM-ONE run
```

agents.md (3)
39-41: Fix inconsistent total training time across docs.

Enhanced SIM‑ONE is stated as ~6–7 hours (7 epochs) here, but "Total Training: 5–7 hours for both models" below contradicts that. Recommend correcting the total to ~8–10 hours (2–3h + 6–7h).

```diff
-## Performance Expectations
-**Total Training**: 5-7 hours for both models
+## Performance Expectations
+**Total Training**: ~8–10 hours for both models
```

Also applies to: 145-149
101-103: Update dataset path example to match new default.

The example still shows "../mvlm_training_dataset_complete". Defaults now point to the comprehensive subset path.

```diff
-# Dataset path from SIM-ONE Training directory
-data_dir = "../mvlm_training_dataset_complete"
+# Dataset path (repo root)
+data_dir = "./mvlm_training_dataset_complete/mvlm_comprehensive_dataset"
```
75-76: Avoid committing datasets; use .gitignore or LFS.

The PR mentions thousands of modified files under the dataset. Recommend excluding large data from Git. Add to .gitignore (outside this file):

```diff
+# Datasets
+/mvlm_training_dataset_complete/
+!/mvlm_training_dataset_complete/.keep
```

Or track via Git LFS if required. Do you want a small PR to add ignore rules and move data handling to preflight?
TWO_MODEL_SETUP_FINAL.md (2)
63-68: Correct total training duration.

Enhanced: ~6–7h (7 epochs) is fine, but "Total: ~5–7 hours for both models" is inconsistent with MVLM‑GPT2 (2–3h). Suggest ~8–10h.

```diff
-**Total**: ~5-7 hours for both models
+**Total**: ~8–10 hours for both models
```
46-50: Doc says "Train Both Models," but the orchestrator currently runs only Enhanced.

train_all_models.py contains a single model entry (SIM‑ONE‑Enhanced). Either add MVLM‑GPT2 to the orchestrator or update the docs to reflect single‑model training. Would you like a patch that adds MVLM‑GPT2 to self.models (with script, args, and outputs), or shall we reword this section to "Train Enhanced SIM‑ONE"?
launch_simone_enhanced.sh (1)
21-22: Enable unbuffered Python for real‑time logs.

`-u` improves tailing in screen sessions.

```diff
-python3 enhanced_preflight.py --data_dir ./mvlm_training_dataset_complete/mvlm_comprehensive_dataset && \
-python3 train_all_models.py'
+python3 -u enhanced_preflight.py --data_dir ./mvlm_training_dataset_complete/mvlm_comprehensive_dataset && \
+python3 -u train_all_models.py'
```

train_all_models.py (1)
365-369: Avoid blocking for input in unattended runs.

On failure, `input()` will hang in screen. Prefer a non‑interactive flag or default to abort/continue.

```diff
-                response = input(f"\n⚠️  {model['name']} failed. Continue with remaining models? [y/N]: ")
-                if response.lower() != 'y':
-                    self.logger.info("🛑 Training stopped by user")
-                    break
+                if os.environ.get("SIMONE_CONTINUE_ON_ERROR", "0") != "1":
+                    self.logger.info("🛑 Training stopped (set SIMONE_CONTINUE_ON_ERROR=1 to continue automatically)")
+                    break
```

SIM-ONE Training/enhanced_train.py (2)
41-49: Pass-through of training knobs looks fine; consider adding a seed for reproducibility.

Optional: add `--seed` and set deterministic seeds.

```diff
 parser.add_argument("--quiet", action="store_true", help="Reduce logging output")
+parser.add_argument("--seed", type=int, default=42, help="Random seed")
@@
-    config = create_enhanced_config(args)
+    config = create_enhanced_config(args)
+    try:
+        import torch, random, numpy as np
+        torch.manual_seed(args.seed)
+        random.seed(args.seed)
+        np.random.seed(args.seed)
+        if torch.cuda.is_available():
+            torch.cuda.manual_seed_all(args.seed)
+    except Exception:
+        pass
```
115-121: Early data-dir validation is good; consider also ensuring output_dir exists.

The trainer may create it, but creating it here avoids surprises.

```diff
-    # Create enhanced configuration
+    # Create enhanced configuration
     config = create_enhanced_config(args)
+    Path(args.output_dir).mkdir(parents=True, exist_ok=True)
```

SIM-ONE Training/simone_transformer/attention_cache.py (2)
65-76: Minor: ensure contiguous CPU buffer before hashing

Small robustness tweak to avoid surprises with non-contiguous tensors.

```diff
-        gov_hash = hashlib.md5(
-            governance_signature.detach().cpu().numpy().tobytes()
-        ).hexdigest()[:8]
+        buf = governance_signature.detach().contiguous().cpu().numpy().tobytes()
+        gov_hash = hashlib.md5(buf).hexdigest()[:8]
@@
-        proph_hash = hashlib.md5(
-            prophetic_signature.detach().cpu().numpy().tobytes()
-        ).hexdigest()[:8]
+        buf = prophetic_signature.detach().contiguous().cpu().numpy().tobytes()
+        proph_hash = hashlib.md5(buf).hexdigest()[:8]
```
229-254: Optional: periodic cleanup to control memory

Consider auto-cleaning expired entries on a schedule or based on access count thresholds.
SIM-ONE Training/simone_transformer/shared_governance.py (3)
29-33: Default governance_dim now equals hidden_dim → compute/memory jump

This doubles shared/governance parameters vs the prior ½ default. Confirm this is intended, or revert to `hidden_dim // 2` to keep the footprint lower.

```diff
-    def __init__(self, hidden_dim: int, governance_dim: int = None, num_heads: int = 8):
+    def __init__(self, hidden_dim: int, governance_dim: int | None = None, num_heads: int = 8):
@@
-        self.governance_dim = governance_dim or hidden_dim
+        self.governance_dim = governance_dim or (hidden_dim // 2)
```
85-90: Prophetic modulation math: confirm intended scale

`shared_features * (1 + kingdom*0.1)` can amplify by up to ~1.1×. If stronger gating was intended, expose a config weight.
329-345: Entropy computation uses natural log; confirm base/scale

If you need bits, use log2; if nats are fine, keep as-is. Also clamp weights to avoid log(0).

```diff
-        attention_entropy = -torch.sum(
-            attention_weights * torch.log(attention_weights + 1e-8),
+        w = attention_weights.clamp_min(1e-8)
+        attention_entropy = -torch.sum(
+            w * torch.log(w),
             dim=-1
         ).mean(dim=1)
```

H200_SETUP_README.md (2)
73-78: Python version: bump the minimum to 3.9+ for PyTorch 2.x wheels

PyTorch 2.x dropped official py3.8 support over the 2024–2025 window; recommending 3.9+ avoids install friction.

```diff
-- **Python**: 3.8+
+- **Python**: 3.9+
```
182-184: Tooling nit: iotop is disk I/O, not network

Use `iftop`/`nload` for network monitoring.

```diff
-# Network usage (if applicable)
-iotop
+# Network usage (if applicable)
+# iftop or nload (install if missing)
+iftop
+# or
+nload
```

SIM-ONE Training/prioritary_mvlm/config.py (1)
384-407: Config expansions look coherent; update the docstring to include the new fields

Add `learning_rate`, `num_epochs`, `warmup_steps`, `weight_decay`, `gradient_accumulation_steps`, `dataloader_workers`, and the `lambda_*` weights to the Attributes section.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- H200_SETUP_README.md (4 hunks)
- README.md (3 hunks)
- READY_FOR_H200.md (2 hunks)
- SIM-ONE Training/enhanced_train.py (3 hunks)
- SIM-ONE Training/prioritary_mvlm/config.py (1 hunk)
- SIM-ONE Training/simone_transformer/attention_cache.py (6 hunks)
- SIM-ONE Training/simone_transformer/enhanced_model.py (1 hunk)
- SIM-ONE Training/simone_transformer/modern_layers.py (1 hunk)
- SIM-ONE Training/simone_transformer/rope_attention.py (1 hunk)
- SIM-ONE Training/simone_transformer/shared_governance.py (2 hunks)
- TWO_MODEL_SETUP_FINAL.md (1 hunk)
- agents.md (1 hunk)
- claude.md (2 hunks)
- enhanced_preflight.py (1 hunk)
- launch_simone_enhanced.sh (1 hunk)
- train_all_models.py (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (6)
SIM-ONE Training/enhanced_train.py (1)
SIM-ONE Training/prioritary_mvlm/config.py (1)
default(325-356)
SIM-ONE Training/simone_transformer/shared_governance.py (1)
SIM-ONE Training/prioritary_mvlm/config.py (1)
PropheticSingularityState(10-356)
SIM-ONE Training/simone_transformer/rope_attention.py (3)
SIM-ONE Training/prioritary_mvlm/config.py (10)
PropheticSingularityState (10-356), device (63-64), dtype (67-68), to (78-103), align_to_length (106-124), compute_policy_mask (133-140), compute_memory_decay (142-147), kingdom_flow (71-75), compute_trace_envelope (149-160), summary (162-178)

SIM-ONE Training/simone_transformer/attention_cache.py (3)
CachedAttentionMixin (256-319), _try_get_cached_attention (270-284), _cache_attention_pattern (286-303)

SIM-ONE Training/simone_transformer/shared_governance.py (1)

SharedGovernanceBackbone (18-124)
SIM-ONE Training/simone_transformer/attention_cache.py (1)
SIM-ONE Training/prioritary_mvlm/config.py (1)
PropheticSingularityState(10-356)
SIM-ONE Training/prioritary_mvlm/config.py (1)
simone_training/data/tokenizers/base_tokenizer.py (1)
vocab_size(40-42)
SIM-ONE Training/simone_transformer/enhanced_model.py (3)
SIM-ONE Training/prioritary_mvlm/config.py (9)
PropheticSingularityState (10-356), align_to_length (106-124), layer_modulation (127-131), kingdom_flow (71-75), device (63-64), dtype (67-68), to (78-103), summary (162-178), step_statistics (180-191)

SIM-ONE Training/simone_transformer/rope_attention.py (7)

EnhancedGovernanceAttention (88-348), create_causal_mask (596-613), forward (40-61), forward (169-348), forward (375-430), forward (458-507), forward (528-593)

SIM-ONE Training/simone_transformer/modern_layers.py (10)

RMSNorm (13-27), SwiGLU (30-59), GatedResidualConnection (275-294), BiblicalAttentionBias (354-409), apply_weight_init (430-448), MoELayer (85-190), forward (24-27), forward (46-52), forward (54-59), forward (78-82)
🪛 Ruff (0.13.1)
SIM-ONE Training/simone_transformer/rope_attention.py
140-140: Unused method argument: aligned_state
(ARG002)
262-262: Local variable decay_gate is assigned to but never used
Remove assignment to unused variable decay_gate
(F841)
378-378: Unused method argument: attention_scores
(ARG002)
394-394: Unpacked variable batch_size is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
394-394: Unpacked variable hidden_dim is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
414-414: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
439-439: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
461-461: Unused method argument: attention_scores
(ARG002)
477-477: Unpacked variable batch_size is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
477-477: Unpacked variable hidden_dim is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
532-532: Unused method argument: attention_output
(ARG002)
546-546: Unpacked variable batch_size is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
SIM-ONE Training/simone_transformer/enhanced_model.py
157-157: Local variable biblical_bias is assigned to but never used
Remove assignment to unused variable biblical_bias
(F841)
211-211: Unused method argument: hidden_states
(ARG002)
212-212: Unused method argument: layer_idx
(ARG002)
449-451: Avoid specifying long messages outside the exception class
(TRY003)
582-582: Unused method argument: pad_token_id
(ARG002)
601-601: Local variable batch_size is assigned to but never used
Remove assignment to unused variable batch_size
(F841)
🔇 Additional comments (10)
train_all_models.py (2)
31-43: Paths and epochs look consistent with PR goals.

Args now target the comprehensive dataset, 7 epochs, and the new log path. Good alignment.
Please confirm enhanced_train.py’s CLI still matches these flags.
88-108: Confirm `expandable_segments:True` support on the target H200 environment

Run this in your H200 PyTorch image (where `torch` is installed) and verify no parsing errors or warnings; if unsupported, remove `expandable_segments:True` to avoid silent ignores.

```shell
python3 - <<'PY'
import os, torch
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:512,expandable_segments:True'
# Trigger a CUDA allocation to surface any parsing errors
torch.zeros(1, device='cuda')
print("OK – torch version:", torch.__version__)
PY
```
63-65: Defaults aligned to new workflow.

Data/output paths, batch size 12, and 7 epochs match the PR objectives.
Also applies to: 80-82, 98-99, 101-102
agents.md (1)
173-177: Generalize CUDA version guidance.
In agents.md (lines 173–177), change:

```diff
-- **PyTorch**: 2.0+ with CUDA 12.1 support
+- **PyTorch**: 2.0+ with CUDA 12.x support
```

Please manually verify which CUDA 12.x minor versions are supported on your H200 hosts.
SIM-ONE Training/simone_transformer/attention_cache.py (2)
261-269: LGTM: clean caching toggle and initialization

The mixin cleanly hides caching behind a flag; constructor usage is sound.
299-304: Correct: cache only in eval mode

Avoids training-time interference. Good.
H200_SETUP_README.md (1)
60-61: Training time update is consistent

Docs now reflect 7 epochs and ~6–7 hours. Good.
claude.md (2)
30-33: Doc consistency with config (32K vocab, modern stack) is good

Matches PrioritaryConfig and architecture updates.
161-165: Command updates align with new defaults

Data path and 7 epochs reflect the PR objectives.
SIM-ONE Training/prioritary_mvlm/config.py (1)
9-357: PropheticSingularityState API matches usage across modules

Alignment, `.to()`, and mask builders are consistent with shared_governance and attention_cache references.
```shell
python3 enhanced_preflight.py --data_dir ./mvlm_training_dataset_complete/mvlm_comprehensive_dataset || true; \
python3 train_all_models.py'
```

Don't swallow preflight failures.

`|| true` causes training to proceed even if preflight fails. Fail fast to avoid wasted GPU time.

```diff
-python3 enhanced_preflight.py --data_dir ./mvlm_training_dataset_complete/mvlm_comprehensive_dataset || true; \
+python3 enhanced_preflight.py --data_dir ./mvlm_training_dataset_complete/mvlm_comprehensive_dataset && \
 python3 train_all_models.py'
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```shell
python3 enhanced_preflight.py --data_dir ./mvlm_training_dataset_complete/mvlm_comprehensive_dataset && \
python3 train_all_models.py'
```
🤖 Prompt for AI Agents
In launch_simone_enhanced.sh around lines 21-22, the preflight command is masked
by "|| true" so failures are ignored and training starts regardless; remove the
"|| true" and ensure the script exits on non-zero status (e.g., enable "set -e"
near the top or explicitly test the preflight exit code and exit non-zero) so a
failed enhanced_preflight.py aborts and prevents training from running.
```python
if TYPE_CHECKING:
    from prioritary_mvlm.config import PropheticSingularityState
```
Import PropheticSingularityState at runtime
We instantiate `PropheticSingularityState` later in `_precompute_prophetic_modulations` (see Line 394). Because the import currently sits under `if TYPE_CHECKING:`, the symbol is undefined at runtime and a `NameError` is raised as soon as `_slice_state()` executes—this happens whenever prophecy modulation is requested, which is a core path in training/inference. Please move the import out of the `TYPE_CHECKING` guard (or perform a local import within the helper) so the class is available during execution.
```diff
-from typing import Optional, Dict, List, Tuple, TYPE_CHECKING
-
-if TYPE_CHECKING:
-    from prioritary_mvlm.config import PropheticSingularityState
+from typing import Optional, Dict, List, Tuple, TYPE_CHECKING
+
+from prioritary_mvlm.config import PropheticSingularityState
```

📝 Committable suggestion

```python
from typing import Optional, Dict, List, Tuple, TYPE_CHECKING

from prioritary_mvlm.config import PropheticSingularityState
```
🤖 Prompt for AI Agents
In SIM-ONE Training/simone_transformer/enhanced_model.py around lines 14-16, the
import of PropheticSingularityState is currently guarded by if TYPE_CHECKING,
which makes the symbol unavailable at runtime and causes a NameError when
_slice_state/_precompute_prophetic_modulations runs. Fix this by moving from
prioritary_mvlm.config import PropheticSingularityState out of the TYPE_CHECKING
block (so it is imported at module import time), or perform a local import of
PropheticSingularityState inside the helper that instantiates it (e.g., at the
top of _precompute_prophetic_modulations or immediately before use) so the class
is defined at runtime.
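The failure mode is easy to reproduce with any class. A minimal sketch, using fractions.Fraction as a stand-in for PropheticSingularityState:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers, never at runtime.
    from fractions import Fraction  # stand-in for PropheticSingularityState

def build_state():
    # NameError at call time: Fraction was never imported at runtime.
    return Fraction(1, 2)

try:
    build_state()
    raised = False
except NameError:
    raised = True

# One of the fixes suggested above: a local import inside the helper
# that actually instantiates the class.
def build_state_fixed():
    from fractions import Fraction
    return Fraction(1, 2)
```

The guarded name exists only for type checkers, so the first call fails with NameError, while the local-import variant works because the import runs when the helper executes.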
```python
if self.training and self.use_gradient_checkpointing:
    # Use gradient checkpointing to trade compute for memory
    x, gov_outputs, _ = torch.utils.checkpoint.checkpoint(
        layer,
        x,
        input_ids,
        attention_mask,
        memory_context,
        policy_guidance,
        output_governance,
        aligned_state,
        precomputed_modulation,
        None,
        False,
        use_reentrant=False  # Use new checkpointing API
    )
else:
```
Gradient checkpointing path returns non-Tensor outputs
torch.utils.checkpoint.checkpoint only supports functions that return a Tensor or a tuple of Tensors. EnhancedSIMONEBlock.forward returns (Tensor, Dict, Optional[Tuple[Tensor, Tensor]]), so once use_gradient_checkpointing is enabled this call raises TypeError: CheckpointFunction only accepts output that is Tensor or tuple of Tensors, halting training immediately. You'll need to restructure this branch (e.g., wrap the layer so the checkpointed function returns only the tensor outputs, and handle the governance dict outside that checkpointed call) to keep gradient checkpointing usable.
🤖 Prompt for AI Agents
In SIM-ONE Training/simone_transformer/enhanced_model.py around lines 517 to
533, the gradient-checkpointing branch calls torch.utils.checkpoint.checkpoint
on EnhancedSIMONEBlock.forward, which returns (Tensor, Dict,
Optional[Tuple[Tensor, Tensor]]), but checkpoint only accepts a Tensor or a tuple
of Tensors. Wrap the call so the checkpointed function returns only tensor
outputs (e.g., a small wrapper that calls layer and returns x and any
other tensors as a tuple), call checkpoint on that wrapper, then after
checkpointing reconstruct or separately compute the governance dict and optional
tensors outside the checkpointed call (or run a lightweight non-checkpointed
call for the dict), ensuring the original API (x, gov_outputs, ...) is preserved
and use_reentrant remains False.
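The wrapper pattern can be sketched with a toy block standing in for EnhancedSIMONEBlock (ToyBlock, GOV_KEYS, and tensors_only are illustrative names, not code from this PR):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class ToyBlock(nn.Module):
    """Stand-in for EnhancedSIMONEBlock: returns (Tensor, dict, optional cache)."""
    def __init__(self, dim: int = 8):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, x):
        h = self.lin(x)
        gov = {"policy": h.mean(dim=-1), "trace": h.norm(dim=-1)}
        return h, gov, None  # the dict and None are what checkpoint() rejects

block = ToyBlock()
GOV_KEYS = ("policy", "trace")  # fixed ordering so the dict can be rebuilt

def tensors_only(x):
    # The checkpointed callable returns only a tuple of Tensors.
    h, gov, _cache = block(x)
    return (h,) + tuple(gov[k] for k in GOV_KEYS)

x = torch.randn(2, 4, 8, requires_grad=True)
h, *gov_vals = checkpoint(tensors_only, x, use_reentrant=False)
gov_outputs = dict(zip(GOV_KEYS, gov_vals))  # dict rebuilt outside the checkpoint

h.sum().backward()  # gradients flow through the recomputed block
```

Flattening the governance dict into a fixed-order tensor tuple keeps the checkpoint contract intact while preserving the original (x, gov_outputs, ...) API for callers.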
```python
if self.training and self.load_balancing_weight > 0:
    # Encourage balanced expert usage
    expert_probs = F.softmax(router_logits, dim=-1)
    expert_usage_batch = expert_probs.mean(dim=0)

    # Update running statistics
    self.expert_usage = 0.9 * self.expert_usage + 0.1 * expert_usage_batch
    self.total_tokens += num_tokens

    # Add load balancing loss (encourages uniform distribution)
    target_usage = 1.0 / self.num_experts
    load_balance_loss = ((self.expert_usage - target_usage) ** 2).sum()
    router_logits = router_logits - self.load_balancing_weight * load_balance_loss

# Get top-k experts for each token
top_k_logits, top_k_indices = torch.topk(
    router_logits, self.num_experts_per_token, dim=-1
)

# Softmax over selected experts
top_k_weights = F.softmax(top_k_logits, dim=-1)

# OPTIMIZED: Vectorized expert processing
output = torch.zeros_like(x_flat)

# Create routing tensors for efficient batching
for expert_idx in range(self.num_experts):
    # Find all tokens and positions where this expert is selected
    expert_positions = (top_k_indices == expert_idx)

    if expert_positions.any():
        # Get token indices and k-positions for this expert
        token_indices, k_positions = expert_positions.nonzero(as_tuple=True)

        if len(token_indices) > 0:
            # Batch process all tokens for this expert
            expert_tokens = x_flat[token_indices]
            expert_output = self.experts[expert_idx](expert_tokens)

            # Get corresponding weights
            expert_weights = top_k_weights[token_indices, k_positions].unsqueeze(-1)

            # Accumulate weighted outputs
            output.index_add_(0, token_indices, expert_weights * expert_output)
```
Load-balancing penalty is currently a no-op
Subtracting a scalar load_balance_loss from every router logit just shifts all logits by the same constant, so the softmax is unchanged and no balancing happens. Worse, the running stats are updated without torch.no_grad(), so they track autograd history. We need a per-expert correction (or to add the penalty to the training loss) and a detached stats update.
Apply this diff to make the penalty effective and keep the buffers gradient-free:
```diff
-        if self.training and self.load_balancing_weight > 0:
-            # Encourage balanced expert usage
-            expert_probs = F.softmax(router_logits, dim=-1)
-            expert_usage_batch = expert_probs.mean(dim=0)
-
-            # Update running statistics
-            self.expert_usage = 0.9 * self.expert_usage + 0.1 * expert_usage_batch
-            self.total_tokens += num_tokens
-
-            # Add load balancing loss (encourages uniform distribution)
-            target_usage = 1.0 / self.num_experts
-            load_balance_loss = ((self.expert_usage - target_usage) ** 2).sum()
-            router_logits = router_logits - self.load_balancing_weight * load_balance_loss
+        if self.training and self.load_balancing_weight > 0:
+            expert_probs = F.softmax(router_logits, dim=-1)
+            expert_usage_batch = expert_probs.mean(dim=0)
+            target_usage = 1.0 / self.num_experts
+
+            # Encourage balanced expert usage per expert (broadcast)
+            router_logits = router_logits - self.load_balancing_weight * (
+                expert_usage_batch - target_usage
+            )
+
+            # Maintain running statistics without tracking grads
+            with torch.no_grad():
+                self.expert_usage.mul_(0.9).add_(0.1 * expert_usage_batch)
+                self.total_tokens += float(num_tokens)
```
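The scalar-shift invariance behind this finding can be checked with plain Python. A minimal sketch with hypothetical numbers (softmax here is a local helper, not the torch API):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a plain list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

router_logits = [2.0, -1.0, 0.5, 3.0]   # hypothetical per-expert logits
scalar_penalty = 1.234                   # a scalar load_balance_loss value

# Subtracting the same scalar from every logit leaves softmax unchanged.
shifted = [v - scalar_penalty for v in router_logits]
unchanged = all(
    abs(a - b) < 1e-12
    for a, b in zip(softmax(router_logits), softmax(shifted))
)

# A per-expert correction does change the routing distribution.
usage = [0.7, 0.1, 0.1, 0.1]             # hypothetical running expert usage
target = 1.0 / len(router_logits)
corrected = [v - 0.5 * (u - target) for v, u in zip(router_logits, usage)]
changed = softmax(corrected) != softmax(router_logits)
```

This is why the penalty must vary per expert (or be added to the training loss): a uniform shift cancels out inside the softmax normalization.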
@coderabbitai generate docstrings

✅ Actions performed
Docstrings generation started.

Docstrings generation was requested by @dansasser in #40 (comment). The following files were modified:

- `SIM-ONE Training/enhanced_train.py`
- `SIM-ONE Training/simone_transformer/attention_cache.py`
- `SIM-ONE Training/simone_transformer/enhanced_model.py`
- `SIM-ONE Training/simone_transformer/modern_layers.py`
- `SIM-ONE Training/simone_transformer/rope_attention.py`
- `SIM-ONE Training/simone_transformer/shared_governance.py`
- `enhanced_preflight.py`
- `train_all_models.py`

Note: Generated docstrings for this pull request at #41
Summary
Retuned the enhanced stack to match the README's architecture (restored RoPE attention, shared governance backbone, caching, and modern layers) so inference/training run cleanly again: SIM-ONE Training/simone_transformer/{enhanced_model.py, rope_attention.py, attention_cache.py, shared_governance.py, modern_layers.py, simone_model.py, __init__.py}.
Updated training defaults (config + CLI + orchestrator) to target mvlm_training_dataset_complete/mvlm_comprehensive_dataset, run 7 epochs, and emit to models/simone_enhanced: SIM-ONE Training/prioritary_mvlm/config.py, SIM-ONE Training/enhanced_train.py, train_all_models.py, enhanced_preflight.py, launch_simone_enhanced.sh.
Refreshed documentation and helper scripts so instructions/monitoring align with the new workflow and hands-off launch path: README.md, H200_SETUP_README.md, READY_FOR_H200.md, TWO_MODEL_SETUP_FINAL.md, agents.md, claude.md.
Testing
./pytorch-env/bin/python enhanced_preflight.py --data_dir ./mvlm_training_dataset_complete/mvlm_comprehensive_dataset
Notes
git status currently shows thousands of modified dataset files under mvlm_training_dataset_complete/mvlm_comprehensive_dataset/**. Please confirm you intended to include those before merging.