Align Enhanced SIM-ONE Pipeline With 7‑Epoch Comprehensive Dataset Workflow #40

Open

dansasser wants to merge 1 commit into main from fetures/codex-edits

Conversation

@dansasser (Owner) commented Sep 28, 2025

Summary

  • Retuned the enhanced stack to match the README’s architecture (restored RoPE attention, shared governance backbone, caching, and modern layers) so inference/training run cleanly again: SIM-ONE Training/simone_transformer/{enhanced_model.py,rope_attention.py,attention_cache.py,shared_governance.py,modern_layers.py,simone_model.py,__init__.py}.
  • Updated training defaults (config + CLI + orchestrator) to target mvlm_training_dataset_complete/mvlm_comprehensive_dataset, run 7 epochs, and emit to models/simone_enhanced: SIM-ONE Training/prioritary_mvlm/config.py, SIM-ONE Training/enhanced_train.py, train_all_models.py, enhanced_preflight.py, launch_simone_enhanced.sh.
  • Refreshed documentation and helper scripts so instructions/monitoring align with the new workflow and hands-off launch path: README.md, H200_SETUP_README.md, READY_FOR_H200.md, TWO_MODEL_SETUP_FINAL.md, agents.md, claude.md.
Testing

./pytorch-env/bin/python enhanced_preflight.py --data_dir ./mvlm_training_dataset_complete/mvlm_comprehensive_dataset
Notes

git status currently shows thousands of modified dataset files under mvlm_training_dataset_complete/mvlm_comprehensive_dataset/**. Please confirm you intended to include those before merging.

Summary by CodeRabbit

  • New Features
    • Introduced an Enhanced SIM‑ONE model with RoPE attention, improved feedforward/MoE, richer governance signals, and adaptive generation controls.
  • Documentation
    • Updated guides with a hands-off launch flow, monitoring option, revised commands, and new hyperparameters. Training time now ~6–7 hours (7 epochs); expectations section refreshed.
  • Refactor
    • Modernized attention and MoE internals for better efficiency and load balancing.
  • Chores
    • Updated defaults: comprehensive dataset path, output/log locations, batch size 12, epochs 7, learning rate 3e‑4. Preflight and launch scripts aligned.

coderabbitai Bot commented Sep 28, 2025

Walkthrough

Updates span documentation, training defaults, and core model internals. Training epochs/time and dataset paths are revised. Config broadens model/training hyperparameters and adds a new field. Core transformer gains new Enhanced model, RoPE attention, and refactored MoE. Ancillary modules adjust typing and governance defaults. Launch/preflight/train scripts align to new paths and epochs.

Changes

  • Documentation durations & commands (H200_SETUP_README.md, README.md, READY_FOR_H200.md, TWO_MODEL_SETUP_FINAL.md, agents.md, claude.md)
    Updated training time to ~6–7 hours (7 epochs), revised example commands, dataset paths, and added/adjusted hyperparameters where shown.
  • Training orchestration & defaults (SIM-ONE Training/enhanced_train.py, SIM-ONE Training/prioritary_mvlm/config.py, enhanced_preflight.py, launch_simone_enhanced.sh, train_all_models.py)
    Switched default data/output paths, increased batch size and epochs (3→7), raised learning rate, expanded config dims/schedule, added lambda_energy to PrioritaryConfig, updated log path, and preflight default dataset.
  • Transformer architecture additions (SIM-ONE Training/simone_transformer/enhanced_model.py)
    Added EnhancedSIMONEModel, EnhancedSIMONEBlock, GovernanceEnhancer/Aggregator, caching, precomputed modulations, generation with adaptive sampling, utilities, and test harness.
  • Attention subsystem, RoPE & governance (SIM-ONE Training/simone_transformer/rope_attention.py)
    Introduced RotaryPositionalEmbedding and EnhancedGovernanceAttention with policy/memory/trace components, causal masking, cache support, and governance-aware biases.
  • MoE refactor (SIM-ONE Training/simone_transformer/modern_layers.py)
    Rewrote MoELayer to batched per-expert processing with load balancing; removed capacity_factor and ensure_assignment params; added usage tracking and penalty computation.
  • Typing and shared governance tweaks (SIM-ONE Training/simone_transformer/attention_cache.py, SIM-ONE Training/simone_transformer/shared_governance.py)
    Adopted postponed annotations and TYPE_CHECKING imports; adjusted default governance_dim to hidden_dim; replaced direct type refs with forward strings.
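As a quick aid for reviewers, the rotary scheme behind the RotaryPositionalEmbedding named above can be sketched minimally. This is an illustrative stand-in, not the repository's implementation; the real module may differ in channel-pairing convention, caching, and batch/head handling:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Minimal RoPE sketch: rotate channel pairs by position-dependent angles.
    x: (seq_len, dim) with even dim. Rotation preserves per-token norms."""
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair rotation frequencies, geometrically spaced as in the RoPE paper
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied to each (x1_i, x2_i) pair
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Position 0 is left unchanged (all angles are zero), which is a convenient sanity check when validating any RoPE variant.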

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Shell as launch_simone_enhanced.sh
  participant Preflight as enhanced_preflight.py
  participant Trainer as enhanced_train.py
  participant Config as PrioritaryConfig
  participant Model as EnhancedSIMONEModel
  participant Data as mvlm_comprehensive_dataset

  User->>Shell: Run launch script
  Shell->>Preflight: --data_dir ./mvlm_training_dataset_complete/mvlm_comprehensive_dataset
  Preflight-->>Shell: OK/Diagnostics
  Shell->>Trainer: Start training (epochs=7, lr=3e-4, batch=12)
  Trainer->>Config: Load hyperparameters (updated dims/epochs/lr)
  Trainer->>Data: Load dataset
  Trainer->>Model: Initialize (RoPE, MoE, Governance)
  loop For each epoch/batch
    Trainer->>Model: forward(input_ids, masks, policy_guidance, use_cache)
    Model->>Model: RoPE Q/K, governance attention, MoE FFN
    Model-->>Trainer: logits, governance outputs
    Trainer->>Trainer: loss/backprop/opt (grad accum)
  end
  Trainer-->>User: Final checkpoint ./models/simone_enhanced
sequenceDiagram
  autonumber
  participant App as Inference App
  participant Model as EnhancedSIMONEModel
  participant Attn as EnhancedGovernanceAttention
  participant MoE as MoELayer
  participant Gov as GovernanceAggregator

  App->>Model: generate(input_ids, use_cache=True, prophetic_state)
  Model->>Model: _precompute_prophetic_modulations
  loop steps until max_length/EOS
    Model->>Attn: forward(Q,K,V, masks, prophetic_state, cache)
    Attn-->>Model: context, policy/memory/trace
    Model->>MoE: feedforward(x)
    MoE-->>Model: transformed x (+load-balance penalty tracked)
    Model->>Gov: aggregate per-layer governance
    Model-->>App: next token (sampling: temp/top_k/top_p adaptive)
  end
  Model->>Model: disable_cache()

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

Possibly related PRs

Suggested labels

codex

Poem

I twitch my ears at epochs seven,
RoPE winds stars across token-heaven.
MoE burrows, experts blend,
Governance trails to journey’s end.
Logs in the warren, checkpoints amassed—
A carrot for patience, trained at last! 🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Title Check ⚠️ Warning: The title emphasizes aligning the Enhanced SIM-ONE pipeline with a 7-epoch comprehensive dataset workflow, which accurately describes the training and documentation updates but omits the substantial model architecture changes such as restoring RoPE attention, the shared governance backbone, and modern layer implementations. As a result, it does not fully capture the most significant code changes in this pull request. Resolution: update the title to reflect both the restored and retuned Enhanced SIM-ONE architecture (e.g., RoPE attention and governance enhancements) and the shift to a 7-epoch comprehensive dataset workflow.
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 62.22%, below the required 80.00% threshold. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (1 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate Docstrings
  • 🧪 Generate unit tests
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch fetures/codex-edits



chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +350 to +395
def _precompute_prophetic_modulations(
    self,
    prophetic_state: Optional['PropheticSingularityState'],
    seq_len: int,
    device: torch.device,
    dtype: torch.dtype,
    past_length: int = 0
) -> Optional[
    Tuple[
        PropheticSingularityState,
        List[torch.Tensor],
        PropheticSingularityState
    ]
]:
    """
    Pre-compute all layer modulations once for efficiency.
    Avoids repeated computation in each layer.
    """
    if prophetic_state is None:
        return None

    # Check cache validity
    total_len = seq_len + past_length
    cache_key = (id(prophetic_state), total_len, past_length, device, dtype)
    if (self._precomputed_prophetic_cache is not None and
            self._precomputed_prophetic_cache[0] == cache_key):
        return self._precomputed_prophetic_cache[1]

    # Align state to total sequence length (past + current tokens)
    aligned_total = prophetic_state.align_to_length(total_len).to(
        device=device,
        dtype=dtype
    )

    start_idx = max(total_len - seq_len, 0)

    # Slice helper to preserve normalization metadata
    def _slice_state(
        state: PropheticSingularityState,
        start: int
    ) -> PropheticSingularityState:
        if start == 0 and seq_len == total_len:
            return state

        return PropheticSingularityState(
            intensity=state.intensity[..., start:],

[P1] Import removed for runtime use of PropheticSingularityState

The new _precompute_prophetic_modulations helper creates and returns PropheticSingularityState instances, but the module now only imports that type inside a TYPE_CHECKING guard. When prophetic_state is passed at runtime, the first call to _slice_state will raise NameError: PropheticSingularityState is not defined, breaking any training or generation run that enables prophetic governance. The fix is to import the class unconditionally (or reference it via the object passed in) so the name exists at execution time.
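The failure mode is easy to reproduce in isolation. This sketch (module and class names are hypothetical stand-ins for the repo's imports) shows why a TYPE_CHECKING-guarded import leaves the name undefined when the function actually runs:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers (mypy/pyright), never at runtime.
    from prophetic import PropheticSingularityState  # hypothetical module

def slice_state(intensity):
    # The guarded import above was skipped when this module executed,
    # so this reference raises NameError on the first real call.
    return PropheticSingularityState(intensity=intensity)

def demo():
    try:
        slice_state([1.0, 2.0])
        return "ok"
    except NameError:
        return "NameError"

# Fix: import the class unconditionally at module top level, or construct
# via type(prophetic_state) on the object that was passed in.
```

The type checker is satisfied either way; only the runtime import placement differs.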


Comment on lines +132 to +185
        # Apply load balancing penalty during training
        if self.training and self.load_balancing_weight > 0:
            # Encourage balanced expert usage
            expert_probs = F.softmax(router_logits, dim=-1)
            expert_usage_batch = expert_probs.mean(dim=0)

            # Update running statistics
            self.expert_usage = 0.9 * self.expert_usage + 0.1 * expert_usage_batch
            self.total_tokens += num_tokens

            # Add load balancing loss (encourages uniform distribution)
            target_usage = 1.0 / self.num_experts
            load_balance_loss = ((self.expert_usage - target_usage) ** 2).sum()
            router_logits = router_logits - self.load_balancing_weight * load_balance_loss

        # Get top-k experts for each token
        top_k_logits, top_k_indices = torch.topk(
            router_logits, self.num_experts_per_token, dim=-1
        )

        # Softmax over selected experts
        top_k_weights = F.softmax(top_k_logits, dim=-1)

        # OPTIMIZED: Vectorized expert processing
        output = torch.zeros_like(x_flat)

        # Create routing tensors for efficient batching
        for expert_idx in range(self.num_experts):
            # Find all tokens and positions where this expert is selected
            expert_positions = (top_k_indices == expert_idx)

            if expert_positions.any():
                # Get token indices and k-positions for this expert
                token_indices, k_positions = expert_positions.nonzero(as_tuple=True)

                if len(token_indices) > 0:
                    # Batch process all tokens for this expert
                    expert_tokens = x_flat[token_indices]
                    expert_output = self.experts[expert_idx](expert_tokens)

                    # Get corresponding weights
                    expert_weights = top_k_weights[token_indices, k_positions].unsqueeze(-1)

                    # Accumulate weighted outputs
                    output.index_add_(0, token_indices, expert_weights * expert_output)

        return output.view(batch_size, seq_len, dim)

    def get_load_balancing_loss(self) -> torch.Tensor:
        if self._load_balancing_loss is None:
            return self.router.weight.new_zeros(())
        return self._load_balancing_loss

    def get_last_assignment_counts(self) -> Optional[torch.Tensor]:
        """Return the most recent per-expert token counts if available."""
        return self._last_assignment_counts

        """Get the current load balancing loss for regularization."""
        if self.total_tokens > 0:
            target_usage = 1.0 / self.num_experts
            return ((self.expert_usage - target_usage) ** 2).sum()
        return torch.tensor(0.0, device=self.expert_usage.device)

[P1] Load-balancing loss no longer backpropagates in MoELayer

The revised MoE layer computes load_balance_loss from running buffers (self.expert_usage) and simply subtracts that scalar from the router logits, then get_load_balancing_loss returns a detached tensor built from the same buffers. Because the buffers are not part of the forward computation, the returned loss has requires_grad=False and cannot provide gradients to the router, contradicting the existing tests that expect a differentiable auxiliary loss and allowing expert collapse during training. The method should compute a loss based on the current batch logits/probabilities so it participates in autograd.
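A common fix is a Switch-Transformer-style auxiliary loss computed from the current batch's router outputs, so it stays in the autograd graph. A minimal sketch, with function name and shapes chosen for illustration rather than taken from the repo:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """Auxiliary loss from current-batch logits (tokens x experts), so
    gradients reach the router, unlike a loss built from running buffers."""
    num_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)
    _, top_idx = torch.topk(router_logits, top_k, dim=-1)
    # Hard assignment fraction per expert (no gradient flows through indices)...
    assign_frac = F.one_hot(top_idx, num_experts).float().sum(dim=1).mean(dim=0)
    # ...paired with the mean soft routing probability (carries the gradient).
    mean_prob = probs.mean(dim=0)
    return num_experts * torch.sum(assign_frac * mean_prob)
```

Because `mean_prob` depends differentiably on the logits, `loss.backward()` produces nonzero router gradients, which is exactly what the existing tests expect.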


coderabbitai Bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (10)
enhanced_preflight.py (1)

12-104: Usage string still points to the old dataset root

We bumped the --data_dir default to ./mvlm_training_dataset_complete/mvlm_comprehensive_dataset, but the Usage block at the top still instructs people to pass the old root directory. That mismatch will send them to the wrong path unless they read the argparse defaults. Please update the example to mirror the new default.

Suggested edit:

-    --data_dir ./mvlm_training_dataset_complete \
+    --data_dir ./mvlm_training_dataset_complete/mvlm_comprehensive_dataset \
train_all_models.py (1)

151-153: Bug: cwd + script_path results in duplicated path.

You set cwd to the script’s parent (“SIM-ONE Training”) but still pass a path including that parent. This resolves to “SIM-ONE Training/SIM-ONE Training/enhanced_train.py” and will fail. Either pass only the basename or run from repo root.

-        cmd = ['python3', script_path] + args
+        script_abspath = str(Path(script_path).resolve())
+        cmd = ['python3', script_abspath] + args
@@
-                    env=env,
-                    cwd=Path(script_path).parent if '/' in script_path else '.'
+                    env=env,
+                    cwd='.'

Also applies to: 175-176

agents.md (1)

96-103: Align import names with exported symbols
Replace AdvancedBPETokenizer with BiblicalBPETokenizer and ComprehensiveTrainingLoss with ComprehensiveBiblicalLoss to match the actual classes in prioritary_mvlm.

-from prioritary_mvlm import EnhancedPrioritaryTrainer, AdvancedBPETokenizer
-from prioritary_mvlm.advanced_losses import ComprehensiveTrainingLoss
+from prioritary_mvlm import EnhancedPrioritaryTrainer, BiblicalBPETokenizer
+from prioritary_mvlm.advanced_losses import ComprehensiveBiblicalLoss
SIM-ONE Training/simone_transformer/attention_cache.py (2)

116-126: Causal-mask check is a no-op (always true) → cache-key collisions

if attention_mask.shape[-2:] == attention_mask.shape[-2:]: is tautologically true, so all masks are treated as “causal”, collapsing keys and returning wrong cached patterns.

Apply this fix to always hash the actual mask contents (simple and correct):

-        # For causal masks, just use shape since they're deterministic
-        if attention_mask.shape[-2:] == attention_mask.shape[-2:]:  # Square mask
-            return f"causal_{attention_mask.shape[-1]}"
-        
-        # For custom masks, hash the pattern
-        return hashlib.md5(attention_mask.detach().cpu().numpy().tobytes()).hexdigest()[:8]
+        # Hash the mask content to avoid collisions (works for causal/custom masks)
+        tensor = attention_mask.detach().contiguous().to(device="cpu")
+        return hashlib.md5(tensor.numpy().tobytes()).hexdigest()[:8]

92-114: Signature concatenation mixes 2D and 1D tensors → runtime error

torch.cat(signatures) will fail because policy_sig/memory_sig are 2D (2×F) while kingdom_sig is 1D (S). Flatten first.

-        if policy_logits is not None:
-            # Use mean and std as signature
-            policy_sig = torch.stack([
-                policy_logits.mean(dim=(0, 1)),
-                policy_logits.std(dim=(0, 1))
-            ])
-            signatures.append(policy_sig)
+        if policy_logits is not None:
+            policy_sig = torch.stack(
+                [policy_logits.mean(dim=(0, 1)), policy_logits.std(dim=(0, 1))]
+            ).reshape(-1)
+            signatures.append(policy_sig)
@@
-        if memory_signals is not None:
-            memory_sig = torch.stack([
-                memory_signals.mean(dim=(0, 1)),
-                memory_signals.std(dim=(0, 1))
-            ])
-            signatures.append(memory_sig)
+        if memory_signals is not None:
+            memory_sig = torch.stack(
+                [memory_signals.mean(dim=(0, 1)), memory_signals.std(dim=(0, 1))]
+            ).reshape(-1)
+            signatures.append(memory_sig)
@@
-        if prophetic_state is not None:
-            # Use kingdom flow as signature
-            kingdom_sig = prophetic_state.kingdom_flow.mean(dim=0)
-            signatures.append(kingdom_sig)
+        if prophetic_state is not None:
+            kingdom_sig = prophetic_state.kingdom_flow.mean(dim=0).reshape(-1)
+            signatures.append(kingdom_sig)
@@
-        if signatures:
-            return torch.cat(signatures)
+        if signatures:
+            return torch.cat(signatures).float()
SIM-ONE Training/simone_transformer/shared_governance.py (3)

158-163: Don’t create layers inside forward; register once (policy guidance proj)

Allocating nn.Linear per call leaks params and breaks optimization.

 class PolicyHead(nn.Module):
@@
     def __init__(self, governance_dim: int, hidden_dim: int, num_heads: int):
         super().__init__()
@@
         self.pattern_controllers = nn.ModuleList([
             nn.Linear(governance_dim, 1) for _ in range(num_heads)
         ])
+        # Project external guidance once; infer in_features lazily
+        self.guidance_proj = nn.LazyLinear(self.governance_dim)
@@
-        if policy_guidance is not None:
-            # Integrate external guidance (project to governance dim)
-            guidance_proj = nn.Linear(policy_guidance.size(-1), self.governance_dim).to(shared_features.device)
-            policy_input = shared_features + guidance_proj(policy_guidance)
+        if policy_guidance is not None:
+            # Integrate external guidance (project to governance dim)
+            policy_input = shared_features + self.guidance_proj(policy_guidance)
         else:
             policy_input = shared_features

236-241: Same issue: dynamic layer creation in memory path

Register a single projection layer; use LazyLinear to infer input size.

 class MemoryHead(nn.Module):
@@
         self.memory_to_weights = nn.Linear(governance_dim, num_heads)
+        self.context_proj = nn.LazyLinear(self.governance_dim)
@@
-        if memory_context is not None:
+        if memory_context is not None:
             # Project memory context to governance dimension if needed
-            if memory_context.size(-1) != self.governance_dim:
-                context_proj = nn.Linear(memory_context.size(-1), self.governance_dim).to(shared_features.device)
-                memory_context = context_proj(memory_context)
+            if memory_context.size(-1) != self.governance_dim:
+                memory_context = self.context_proj(memory_context)

48-51: Add a precondition that governance_dim is divisible by 4
Raise a ValueError if self.governance_dim % 4 != 0 before instantiating nn.MultiheadAttention to prevent PyTorch’s “embed_dim must be divisible by num_heads” error.

         self.governance_coordination = nn.MultiheadAttention(
             self.governance_dim, num_heads=4, batch_first=True
         )
+        if self.governance_dim % 4 != 0:
+            raise ValueError(f"governance_dim ({self.governance_dim}) must be divisible by 4")
claude.md (1)

10-12: Inconsistent total duration vs per-model times

Header says “5–7 hours total,” but Enhanced SIM-ONE alone is ~6–7 hours and GPT-2 is 2–3 hours. Adjust header to ~8–10 hours for both models.

-**Duration**: 5-7 hours total training time
+**Duration**: ~8–10 hours total for both models (2–3h GPT‑2 + 6–7h Enhanced SIM‑ONE)
H200_SETUP_README.md (1)

139-145: Documented TORCH_COMPILE flag is unused
The TORCH_COMPILE env var (H200_SETUP_README.md:140) isn’t referenced anywhere in the codebase. Either implement its handling (e.g. conditionally wrap your torch.compile(…) calls in if os.getenv("TORCH_COMPILE")) or remove it from the README to avoid confusing users.
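If the flag is kept, one minimal way to honor it (helper name is illustrative, not existing code):

```python
import os

def maybe_compile(model):
    """Compile only when the documented TORCH_COMPILE env var is truthy;
    otherwise return the model unchanged."""
    if os.getenv("TORCH_COMPILE", "0").lower() in ("1", "true", "yes"):
        import torch  # imported lazily so the flag check itself needs no torch
        return torch.compile(model)
    return model
```

Wrapping the model once at startup keeps the README's documented env var meaningful without touching the training loop.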

🧹 Nitpick comments (18)
READY_FOR_H200.md (1)

53-75: Align training time messaging

Step 2 now promises ~6‑7 hours for the run, but “Important Notes” item 4 still claims 6‑9 hours for all three models. That contradiction will confuse folks trying to budget GPU time—especially since train_all_models.py now drives only the enhanced run. Please reconcile the wording (e.g., replace the note with the updated duration or clarify the scenario).

Apply this diff to keep the guidance consistent:

-4. **6-9 hours total training time** for all three models
+4. **~6-7 hours total training time** for the enhanced SIM-ONE run
agents.md (3)

39-41: Fix inconsistent total training time across docs.

Enhanced SIM‑ONE is stated as ~6–7 hours (7 epochs) here, but “Total Training: 5–7 hours for both models” below contradicts that. Recommend correcting total to ~8–10 hours (2–3h + 6–7h).

-## Performance Expectations
-**Total Training**: 5-7 hours for both models
+## Performance Expectations
+**Total Training**: ~8–10 hours for both models

Also applies to: 145-149


101-103: Update dataset path example to match new default.

The example still shows “../mvlm_training_dataset_complete”. Defaults now point to the comprehensive subset path.

-# Dataset path from SIM-ONE Training directory
-data_dir = "../mvlm_training_dataset_complete"
+# Dataset path (repo root)
+data_dir = "./mvlm_training_dataset_complete/mvlm_comprehensive_dataset"

75-76: Avoid committing datasets; use .gitignore or LFS.

PR mentions thousands of modified files under the dataset. Recommend excluding large data from Git.

Add to .gitignore (outside this file):

+# Datasets
+/mvlm_training_dataset_complete/
+!/mvlm_training_dataset_complete/.keep

Or track via Git LFS if required. Do you want a small PR to add ignore rules and move data handling to preflight?

TWO_MODEL_SETUP_FINAL.md (2)

63-68: Correct total training duration.

Enhanced: ~6–7h (7 epochs) is fine, but “Total: ~5–7 hours for both models” is inconsistent with MVLM‑GPT2 (2–3h). Suggest ~8–10h.

-**Total**: ~5-7 hours for both models
+**Total**: ~8–10 hours for both models

46-50: Doc says “Train Both Models,” but orchestrator currently runs only Enhanced.

train_all_models.py contains a single model entry (SIM‑ONE‑Enhanced). Either add MVLM‑GPT2 to the orchestrator or update docs to reflect single‑model training.

Would you like a patch that adds MVLM‑GPT2 to self.models (with script, args, and outputs), or shall we reword this section to “Train Enhanced SIM‑ONE”?

launch_simone_enhanced.sh (1)

21-22: Enable unbuffered Python for real‑time logs.

-u improves tailing in screen sessions.

-python3 enhanced_preflight.py --data_dir ./mvlm_training_dataset_complete/mvlm_comprehensive_dataset && \
-python3 train_all_models.py'
+python3 -u enhanced_preflight.py --data_dir ./mvlm_training_dataset_complete/mvlm_comprehensive_dataset && \
+python3 -u train_all_models.py'
train_all_models.py (1)

365-369: Avoid blocking for input in unattended runs.

On failure, input() will hang in screen. Prefer a non‑interactive flag or default to abort/continue.

-                response = input(f"\n⚠️  {model['name']} failed. Continue with remaining models? [y/N]: ")
-                if response.lower() != 'y':
-                    self.logger.info("🛑 Training stopped by user")
-                    break
+                if os.environ.get("SIMONE_CONTINUE_ON_ERROR", "0") != "1":
+                    self.logger.info("🛑 Training stopped (set SIMONE_CONTINUE_ON_ERROR=1 to continue automatically)")
+                    break
SIM-ONE Training/enhanced_train.py (2)

41-49: Pass-through of training knobs looks fine; consider adding seed for reproducibility.

Optional: add --seed and set deterministic seeds.

 parser.add_argument("--quiet", action="store_true", help="Reduce logging output")
+parser.add_argument("--seed", type=int, default=42, help="Random seed")
@@
-    config = create_enhanced_config(args)
+    config = create_enhanced_config(args)
+    try:
+        import torch, random, numpy as np
+        torch.manual_seed(args.seed)
+        random.seed(args.seed)
+        np.random.seed(args.seed)
+        if torch.cuda.is_available():
+            torch.cuda.manual_seed_all(args.seed)
+    except Exception:
+        pass

115-121: Early data-dir validation is good; consider also ensuring output_dir exists.

Trainer may create it, but creating here avoids surprises.

-    # Create enhanced configuration
+    # Create enhanced configuration
     config = create_enhanced_config(args)
+    Path(args.output_dir).mkdir(parents=True, exist_ok=True)
SIM-ONE Training/simone_transformer/attention_cache.py (2)

65-76: Minor: ensure contiguous CPU buffer before hashing

Small robustness tweak to avoid surprises with non-contiguous tensors.

-            gov_hash = hashlib.md5(
-                governance_signature.detach().cpu().numpy().tobytes()
-            ).hexdigest()[:8]
+            buf = governance_signature.detach().contiguous().cpu().numpy().tobytes()
+            gov_hash = hashlib.md5(buf).hexdigest()[:8]
@@
-            proph_hash = hashlib.md5(
-                prophetic_signature.detach().cpu().numpy().tobytes()
-            ).hexdigest()[:8]
+            buf = prophetic_signature.detach().contiguous().cpu().numpy().tobytes()
+            proph_hash = hashlib.md5(buf).hexdigest()[:8]

229-254: Optional: periodic cleanup to control memory

Consider auto-cleaning expired entries on a schedule or based on access count thresholds.

SIM-ONE Training/simone_transformer/shared_governance.py (3)

29-33: Default governance_dim now equals hidden_dim → compute/memory jump

This doubles shared/governance params vs prior ½ default. Confirm intended, or revert to hidden_dim // 2 to keep footprint lower.

-    def __init__(self, hidden_dim: int, governance_dim: int = None, num_heads: int = 8):
+    def __init__(self, hidden_dim: int, governance_dim: int | None = None, num_heads: int = 8):
@@
-        self.governance_dim = governance_dim or hidden_dim
+        self.governance_dim = governance_dim or (hidden_dim // 2)

85-90: Prophetic modulation math: confirm intended scale

shared_features * (1 + kingdom*0.1) can amplify by up to ~1.1×. If stronger gating was intended, expose a config weight.
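If a knob is wanted, the gating reduces to this shape (a pure-Python stand-in with a hypothetical parameter name; the real code operates on tensors):

```python
def prophetic_modulation(shared_features, kingdom_flow, weight=0.1):
    """shared * (1 + weight * kingdom), with `weight` as the proposed config
    field replacing the hard-coded 0.1."""
    return [f * (1.0 + weight * k) for f, k in zip(shared_features, kingdom_flow)]
```

With the default weight the amplification stays within ~1.1x; exposing `weight` lets stronger gating be configured without code changes.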


329-345: Entropy computation uses natural log; confirm base/scale

If you need bits, use log2; if nats are fine, keep as-is. Also clamp weights to avoid log(0).

-            attention_entropy = -torch.sum(
-                attention_weights * torch.log(attention_weights + 1e-8),
+            w = attention_weights.clamp_min(1e-8)
+            attention_entropy = -torch.sum(
+                w * torch.log(w),
                 dim=-1
             ).mean(dim=1)
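For reference, the two conventions differ only in log base; a pure-Python sketch over a single attention row (illustrative, not the repo's tensor code):

```python
import math

def attention_entropy(weights, bits=False):
    """Entropy of one attention distribution: nats with ln, bits with log2.
    Weights are clamped to 1e-8 first, mirroring the clamp_min suggestion."""
    log_fn = math.log2 if bits else math.log
    clamped = [max(w, 1e-8) for w in weights]
    return -sum(w * log_fn(w) for w in clamped)
```

A uniform distribution over 4 positions gives ln(4) nats or exactly 2 bits, a handy check for whichever base the governance signals are meant to use.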
H200_SETUP_README.md (2)

73-78: Python version: bump minimum to 3.9+ for PyTorch 2.x wheels

PyTorch 2.x dropped official py3.8 over the 2024–2025 window; recommending 3.9+ avoids install friction.

-- **Python**: 3.8+
+- **Python**: 3.9+

182-184: Tooling nit: iotop is disk I/O, not network

Use iftop/nload for network monitoring.

-# Network usage (if applicable)  
-iotop
+# Network usage (if applicable)
+# iftop or nload (install if missing)
+iftop
+# or
+nload
SIM-ONE Training/prioritary_mvlm/config.py (1)

384-407: Config expansions look coherent; update docstring to include new fields

Add learning_rate, num_epochs, warmup_steps, weight_decay, gradient_accumulation_steps, dataloader_workers, and lambda_* to the Attributes section.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 36ff199 and 3ed2450.

📒 Files selected for processing (16)
  • H200_SETUP_README.md (4 hunks)
  • README.md (3 hunks)
  • READY_FOR_H200.md (2 hunks)
  • SIM-ONE Training/enhanced_train.py (3 hunks)
  • SIM-ONE Training/prioritary_mvlm/config.py (1 hunks)
  • SIM-ONE Training/simone_transformer/attention_cache.py (6 hunks)
  • SIM-ONE Training/simone_transformer/enhanced_model.py (1 hunks)
  • SIM-ONE Training/simone_transformer/modern_layers.py (1 hunks)
  • SIM-ONE Training/simone_transformer/rope_attention.py (1 hunks)
  • SIM-ONE Training/simone_transformer/shared_governance.py (2 hunks)
  • TWO_MODEL_SETUP_FINAL.md (1 hunks)
  • agents.md (1 hunks)
  • claude.md (2 hunks)
  • enhanced_preflight.py (1 hunks)
  • launch_simone_enhanced.sh (1 hunks)
  • train_all_models.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
SIM-ONE Training/enhanced_train.py (1)
SIM-ONE Training/prioritary_mvlm/config.py (1)
  • default (325-356)
SIM-ONE Training/simone_transformer/shared_governance.py (1)
SIM-ONE Training/prioritary_mvlm/config.py (1)
  • PropheticSingularityState (10-356)
SIM-ONE Training/simone_transformer/rope_attention.py (3)
SIM-ONE Training/prioritary_mvlm/config.py (10)
  • PropheticSingularityState (10-356)
  • device (63-64)
  • dtype (67-68)
  • to (78-103)
  • align_to_length (106-124)
  • compute_policy_mask (133-140)
  • compute_memory_decay (142-147)
  • kingdom_flow (71-75)
  • compute_trace_envelope (149-160)
  • summary (162-178)
SIM-ONE Training/simone_transformer/attention_cache.py (3)
  • CachedAttentionMixin (256-319)
  • _try_get_cached_attention (270-284)
  • _cache_attention_pattern (286-303)
SIM-ONE Training/simone_transformer/shared_governance.py (1)
  • SharedGovernanceBackbone (18-124)
SIM-ONE Training/simone_transformer/attention_cache.py (1)
SIM-ONE Training/prioritary_mvlm/config.py (1)
  • PropheticSingularityState (10-356)
SIM-ONE Training/prioritary_mvlm/config.py (1)
simone_training/data/tokenizers/base_tokenizer.py (1)
  • vocab_size (40-42)
SIM-ONE Training/simone_transformer/enhanced_model.py (3)
SIM-ONE Training/prioritary_mvlm/config.py (9)
  • PropheticSingularityState (10-356)
  • align_to_length (106-124)
  • layer_modulation (127-131)
  • kingdom_flow (71-75)
  • device (63-64)
  • dtype (67-68)
  • to (78-103)
  • summary (162-178)
  • step_statistics (180-191)
SIM-ONE Training/simone_transformer/rope_attention.py (7)
  • EnhancedGovernanceAttention (88-348)
  • create_causal_mask (596-613)
  • forward (40-61)
  • forward (169-348)
  • forward (375-430)
  • forward (458-507)
  • forward (528-593)
SIM-ONE Training/simone_transformer/modern_layers.py (10)
  • RMSNorm (13-27)
  • SwiGLU (30-59)
  • GatedResidualConnection (275-294)
  • BiblicalAttentionBias (354-409)
  • apply_weight_init (430-448)
  • MoELayer (85-190)
  • forward (24-27)
  • forward (46-52)
  • forward (54-59)
  • forward (78-82)
🪛 Ruff (0.13.1)
SIM-ONE Training/simone_transformer/rope_attention.py

  • 140: Unused method argument `aligned_state` (ARG002)
  • 262: Local variable `decay_gate` is assigned to but never used; remove the assignment (F841)
  • 378: Unused method argument `attention_scores` (ARG002)
  • 394: Unpacked variables `batch_size` and `hidden_dim` are never used; prefix them with an underscore (RUF059)
  • 414: Loop control variable `i` not used within loop body; rename to `_i` (B007)
  • 439: PEP 484 prohibits implicit Optional; convert to `Optional[T]` (RUF013)
  • 461: Unused method argument `attention_scores` (ARG002)
  • 477: Unpacked variables `batch_size` and `hidden_dim` are never used; prefix them with an underscore (RUF059)
  • 532: Unused method argument `attention_output` (ARG002)
  • 546: Unpacked variable `batch_size` is never used; prefix it with an underscore (RUF059)

SIM-ONE Training/simone_transformer/enhanced_model.py

  • 157: Local variable `biblical_bias` is assigned to but never used; remove the assignment (F841)
  • 211: Unused method argument `hidden_states` (ARG002)
  • 212: Unused method argument `layer_idx` (ARG002)
  • 449-451: Avoid specifying long messages outside the exception class (TRY003)
  • 582: Unused method argument `pad_token_id` (ARG002)
  • 601: Local variable `batch_size` is assigned to but never used; remove the assignment (F841)

🔇 Additional comments (10)
train_all_models.py (2)

31-43: Paths and epochs look consistent with PR goals.

Args now target the comprehensive dataset, 7 epochs, and new log path. Good alignment.

Please confirm enhanced_train.py’s CLI still matches these flags.


88-108: Confirm expandable_segments:True support on target H200 environment
Run this in your H200 PyTorch image (where torch is installed) and verify no parsing errors or warnings; if unsupported, remove expandable_segments:True to avoid silent ignores.

python3 - <<'PY'
import os, torch
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:512,expandable_segments:True'
# Trigger a CUDA allocation to surface any parsing errors
torch.zeros(1, device='cuda')
print("OK – torch version:", torch.__version__)
PY
SIM-ONE Training/enhanced_train.py (1)

63-65: Defaults aligned to new workflow.

Data/output paths, batch size 12, and 7 epochs match the PR objectives.

Also applies to: 80-82, 98-99, 101-102

agents.md (1)

173-177: Generalize CUDA version guidance.

  • agents.md (lines 173–177): change
- **PyTorch**: 2.0+ with CUDA 12.1 support
+ **PyTorch**: 2.0+ with CUDA 12.x support

Please manually verify which CUDA 12.x minor versions are supported on your H200 hosts.

SIM-ONE Training/simone_transformer/attention_cache.py (2)

261-269: LGTM: clean caching toggle and initialization

The mixin cleanly hides caching behind a flag; constructor usage is sound.


299-304: Correct: cache only in eval mode

Avoids training-time interference. Good.
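
The eval-only gating follows the standard `nn.Module.training` flag. A minimal sketch of the pattern (the layer, its cache, and the doubling computation are illustrative stand-ins, not the actual `CachedAttentionMixin` API):

```python
import torch.nn as nn


class CachingLayer(nn.Module):
    """Toy layer that caches results only when not training."""

    def __init__(self):
        super().__init__()
        self.cache = {}

    def forward(self, key, value):
        # Only consult/populate the cache in eval mode, so training-time
        # effects (dropout, weight updates) never leak into cached entries.
        if not self.training and key in self.cache:
            return self.cache[key]
        result = value * 2  # stand-in for the real attention computation
        if not self.training:
            self.cache[key] = result
        return result


layer = CachingLayer()
layer.train()
layer(("a",), 1)
print(len(layer.cache))  # 0: nothing cached while training
layer.eval()
layer(("a",), 1)
print(len(layer.cache))  # 1: cached in eval mode
```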

H200_SETUP_README.md (1)

60-61: Training time update is consistent

Docs now reflect 7 epochs and ~6–7 hours. Good.

claude.md (2)

30-33: Doc consistency with config (32K vocab, modern stack) is good

Matches PrioritaryConfig and architecture updates.


161-165: Command updates align with new defaults

Data path and 7 epochs reflect PR objectives.

SIM-ONE Training/prioritary_mvlm/config.py (1)

9-357: PropheticSingularityState API matches usage across modules

Alignment, .to(), and mask builders are consistent with shared_governance and attention_cache references.

Comment thread launch_simone_enhanced.sh
Comment on lines +21 to 22
python3 enhanced_preflight.py --data_dir ./mvlm_training_dataset_complete/mvlm_comprehensive_dataset || true; \
python3 train_all_models.py'

⚠️ Potential issue | 🟠 Major

Don’t swallow preflight failures.

`|| true` causes training to proceed even if preflight fails. Fail fast to avoid wasted GPU time.

-python3 enhanced_preflight.py --data_dir ./mvlm_training_dataset_complete/mvlm_comprehensive_dataset || true; \
+python3 enhanced_preflight.py --data_dir ./mvlm_training_dataset_complete/mvlm_comprehensive_dataset && \
🤖 Prompt for AI Agents
In launch_simone_enhanced.sh around lines 21-22, the preflight command is masked
by "|| true" so failures are ignored and training starts regardless; remove the
"|| true" and ensure the script exits on non-zero status (e.g., enable "set -e"
near the top or explicitly test the preflight exit code and exit non-zero) so a
failed enhanced_preflight.py aborts and prevents training from running.

Comment on lines +14 to +16
if TYPE_CHECKING:
from prioritary_mvlm.config import PropheticSingularityState


⚠️ Potential issue | 🔴 Critical

Import PropheticSingularityState at runtime

We instantiate PropheticSingularityState later in _precompute_prophetic_modulations (see Line 394). Because the import currently sits under if TYPE_CHECKING, the symbol is undefined at runtime and a NameError is raised as soon as _slice_state() executes—this happens whenever prophecy modulation is requested, which is a core path in training/inference. Please move the import out of the TYPE_CHECKING guard (or perform a local import within the helper) so the class is available during execution.

-from typing import Optional, Dict, List, Tuple, TYPE_CHECKING
-
-if TYPE_CHECKING:
-    from prioritary_mvlm.config import PropheticSingularityState
+from typing import Optional, Dict, List, Tuple, TYPE_CHECKING
+
+from prioritary_mvlm.config import PropheticSingularityState
🤖 Prompt for AI Agents
In SIM-ONE Training/simone_transformer/enhanced_model.py around lines 14-16, the
import of PropheticSingularityState is currently guarded by if TYPE_CHECKING
which makes the symbol unavailable at runtime and causes a NameError when
_slice_state/_precompute_prophetic_modulations runs; fix this by moving the from
prioritary_mvlm.config import PropheticSingularityState out of the TYPE_CHECKING
block (so it is imported at module import time) OR perform a local import of
PropheticSingularityState inside the helper that instantiates it (e.g., at the
top of _precompute_prophetic_modulations or immediately before use) so the class
is defined at runtime.
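
The failure mode is easy to reproduce in isolation. In this minimal sketch the guarded name is a stand-in (`GuardedState` aliases a stdlib class, since the real `prioritary_mvlm` package is not assumed here):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by static type checkers (mypy, pyright); never at runtime.
    from collections import OrderedDict as GuardedState


def make_state():
    # NameError at runtime: the guarded import was never actually executed,
    # so GuardedState is unbound the moment we try to instantiate it.
    return GuardedState()


try:
    make_state()
    ok = True
except NameError:
    ok = False
print("runtime import succeeded:", ok)  # False
```

Type checkers accept this module, which is why the bug only surfaces when the instantiating code path actually runs.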

Comment on lines +517 to +533
if self.training and self.use_gradient_checkpointing:
# Use gradient checkpointing to trade compute for memory
x, gov_outputs, _ = torch.utils.checkpoint.checkpoint(
layer,
x,
input_ids,
attention_mask,
memory_context,
policy_guidance,
output_governance,
aligned_state,
precomputed_modulation,
None,
False,
use_reentrant=False # Use new checkpointing API
)
else:

⚠️ Potential issue | 🔴 Critical

Gradient checkpointing path returns non-Tensor outputs

torch.utils.checkpoint.checkpoint only supports functions that return a Tensor or a tuple of Tensors. EnhancedSIMONEBlock.forward returns (Tensor, Dict, Optional[Tuple[Tensor, Tensor]]), so once use_gradient_checkpointing is enabled this call raises TypeError: CheckpointFunction only accepts output that is Tensor or tuple of Tensors, halting training immediately. You’ll need to restructure this branch—e.g., wrap the layer so the checkpointed function returns only the tensor outputs, and handle the governance dict outside that checkpointed call—to keep gradient checkpointing usable.

🤖 Prompt for AI Agents
In SIM-ONE Training/simone_transformer/enhanced_model.py around lines 517 to
533, the gradient-checkpointing branch calls torch.utils.checkpoint.checkpoint
on EnhancedSIMONEBlock.forward which returns (Tensor, Dict,
Optional[Tuple[Tensor, Tensor]]), but checkpoint only accepts Tensor or tuple of
Tensors; wrap the call so the checkpointed function returns only the tensor
outputs (e.g., a small wrapper that calls layer and returns x tensor and any
other tensors as a tuple), call checkpoint on that wrapper, then after
checkpointing reconstruct or separately compute the governance dict and optional
tensors outside the checkpointed call (or run a lightweight non-checkpointed
call for the dict), ensuring the original API (x, gov_outputs, ...) is preserved
and use_reentrant remains false.
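
One way to restructure the branch is sketched below with a toy module; the block, its outputs, and the side-channel dict are illustrative stand-ins, not the actual `EnhancedSIMONEBlock` API. The checkpointed closure returns only the hidden-state tensor, while the non-tensor dict is captured outside checkpoint's return path:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class ToyBlock(nn.Module):
    """Stand-in for a layer that returns (Tensor, dict); dicts break checkpoint."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        out = self.proj(x)
        stats = {"mean_activation": out.mean().item()}  # non-tensor side output
        return out, stats


block = ToyBlock(8)
x = torch.randn(4, 8, requires_grad=True)

# Side channel for non-tensor outputs. With use_reentrant=False the closure is
# re-run during backward recomputation, so we overwrite (not append) entries.
side = {}


def tensor_only(inp):
    out, stats = block(inp)
    side.update(stats)  # capture the dict outside checkpoint's return path
    return out          # checkpoint sees only a Tensor


y = checkpoint(tensor_only, x, use_reentrant=False)
y.sum().backward()  # recomputation succeeds because only tensors were returned

print(sorted(side))        # ['mean_activation']
print(x.grad is not None)  # True
```

The same shape works for the real block: return `x` (and any KV tensors) from the wrapper, and route the governance dict through a side structure keyed by layer index.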

Comment on lines +133 to +177
if self.training and self.load_balancing_weight > 0:
# Encourage balanced expert usage
expert_probs = F.softmax(router_logits, dim=-1)
expert_usage_batch = expert_probs.mean(dim=0)

# Update running statistics
self.expert_usage = 0.9 * self.expert_usage + 0.1 * expert_usage_batch
self.total_tokens += num_tokens

# Add load balancing loss (encourages uniform distribution)
target_usage = 1.0 / self.num_experts
load_balance_loss = ((self.expert_usage - target_usage) ** 2).sum()
router_logits = router_logits - self.load_balancing_weight * load_balance_loss

# Get top-k experts for each token
top_k_logits, top_k_indices = torch.topk(
router_logits, self.num_experts_per_token, dim=-1
)

# Softmax over selected experts
top_k_weights = F.softmax(top_k_logits, dim=-1)

# OPTIMIZED: Vectorized expert processing
output = torch.zeros_like(x_flat)

# Create routing tensors for efficient batching
for expert_idx in range(self.num_experts):
# Find all tokens and positions where this expert is selected
expert_positions = (top_k_indices == expert_idx)

if expert_positions.any():
# Get token indices and k-positions for this expert
token_indices, k_positions = expert_positions.nonzero(as_tuple=True)

if len(token_indices) > 0:
# Batch process all tokens for this expert
expert_tokens = x_flat[token_indices]
expert_output = self.experts[expert_idx](expert_tokens)

# Get corresponding weights
expert_weights = top_k_weights[token_indices, k_positions].unsqueeze(-1)

# Accumulate weighted outputs
output.index_add_(0, token_indices, expert_weights * expert_output)


⚠️ Potential issue | 🟠 Major

Load‑balancing penalty is currently a no-op

Subtracting a scalar load_balance_loss from every router logit just shifts all logits by the same constant, so the softmax is unchanged and no balancing happens. Worse, the running stats are updated without no_grad, so they track autograd history. We need a per-expert correction (or add the penalty to the training loss) and detach the stats update.

Apply this diff to make the penalty effective and keep the buffers gradient-free:

-        if self.training and self.load_balancing_weight > 0:
-            # Encourage balanced expert usage
-            expert_probs = F.softmax(router_logits, dim=-1)
-            expert_usage_batch = expert_probs.mean(dim=0)
-            
-            # Update running statistics
-            self.expert_usage = 0.9 * self.expert_usage + 0.1 * expert_usage_batch
-            self.total_tokens += num_tokens
-            
-            # Add load balancing loss (encourages uniform distribution)
-            target_usage = 1.0 / self.num_experts
-            load_balance_loss = ((self.expert_usage - target_usage) ** 2).sum()
-            router_logits = router_logits - self.load_balancing_weight * load_balance_loss
+        if self.training and self.load_balancing_weight > 0:
+            expert_probs = F.softmax(router_logits, dim=-1)
+            expert_usage_batch = expert_probs.mean(dim=0)
+            target_usage = 1.0 / self.num_experts
+
+            # Encourage balanced expert usage per expert (broadcast)
+            router_logits = router_logits - self.load_balancing_weight * (
+                expert_usage_batch - target_usage
+            )
+
+            # Maintain running statistics without tracking grads
+            with torch.no_grad():
+                self.expert_usage.mul_(0.9).add_(0.1 * expert_usage_batch)
+                self.total_tokens += float(num_tokens)
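
The no-op can be verified directly: softmax is invariant to adding the same constant to every logit, so a scalar penalty cannot change the routing distribution, while a per-expert correction can. A pure-Python sketch, independent of the actual `MoELayer`:

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, -1.0, 0.5, 3.0]
scalar_penalty = 0.7  # mimics `load_balancing_weight * load_balance_loss`

# Subtracting a scalar from every logit leaves the distribution unchanged.
shifted = softmax([v - scalar_penalty for v in logits])
assert all(abs(a - b) < 1e-12 for a, b in zip(softmax(logits), shifted))

# Subtracting a per-expert correction actually moves probability mass.
per_expert = [0.7, 0.0, 0.0, 0.0]  # e.g. penalize an over-used expert 0
corrected = softmax([v - p for v, p in zip(logits, per_expert)])
assert corrected[0] < softmax(logits)[0]  # expert 0's routing weight drops
print("scalar shift is a no-op; per-expert shift rebalances")
```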

@dansasser
Owner Author

@coderabbitai generate docstrings

@coderabbitai

coderabbitai Bot commented Sep 28, 2025

✅ Actions performed

Docstrings generation started.

coderabbitai Bot added a commit that referenced this pull request Sep 28, 2025
Docstrings generation was requested by @dansasser.

* #40 (comment)

The following files were modified:

* `SIM-ONE Training/enhanced_train.py`
* `SIM-ONE Training/simone_transformer/attention_cache.py`
* `SIM-ONE Training/simone_transformer/enhanced_model.py`
* `SIM-ONE Training/simone_transformer/modern_layers.py`
* `SIM-ONE Training/simone_transformer/rope_attention.py`
* `SIM-ONE Training/simone_transformer/shared_governance.py`
* `enhanced_preflight.py`
* `train_all_models.py`
@coderabbitai

coderabbitai Bot commented Sep 28, 2025

Note

Generated docstrings for this pull request at #41
