A reinforcement learning agent that plays Balatro autonomously, combining deep domain knowledge with PPO (Proximal Policy Optimization) to master the game's complex scoring mechanics, economy management, and joker synergy systems.
Goal: Achieve consistent Ante 8 clears (Phase 1), then push toward naneinf — the score ceiling where 64-bit floats overflow and the game displays "naneinf" (~1.80e308).
- How It Works
- Architecture
- Scoring Engine
- Joker Schema Database
- Project Structure
- Prerequisites
- Installation
- Usage
- Training Phases
- Key Design Decisions
- Acknowledgments
- License
## How It Works

Balatron connects to a live instance of Balatro through the BalatroBot mod, which exposes the full game state and accepts action commands via a JSON-RPC 2.0 HTTP API on `127.0.0.1:12346`. No screen capture or computer vision is needed — the agent reads structured game data directly.
The agent observes the game state (hand cards, jokers, economy, blind targets, deck composition, shop contents), encodes it into an 814-dimensional vector, and uses a PPO neural network to select actions. A sophisticated heuristic layer validates and enhances the network's decisions with Balatro-specific domain knowledge — optimal hand selection, joker ordering, consumable usage, pack evaluation, and economy management.
```
Balatro Game
      |
      |  JSON-RPC 2.0 (BalatroBot mod)
      v
Game State Encoder (814-dim vector)
      |
      v
PPO Neural Network (shared trunk + 3 policy heads)
      |
      v
Action Mask (domain knowledge biasing)
      |
      v
Heuristic Validator (hand eval, scoring math, economy guards)
      |
      v
API Action Command --> Balatro Game
```
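The transport is plain JSON-RPC 2.0 over HTTP. As a minimal sketch, a request body of the kind the agent sends could be built like this (the method and parameter names below are hypothetical placeholders, not BalatroBot's actual API; see its documentation for the real method set):

```python
import json

def make_rpc_request(method: str, params: dict, req_id: int) -> str:
    """Build a JSON-RPC 2.0 request body like those sent to 127.0.0.1:12346."""
    return json.dumps({
        "jsonrpc": "2.0",   # protocol version field is mandatory in JSON-RPC 2.0
        "method": method,
        "params": params,
        "id": req_id,
    })

# Hypothetical action command, for illustration only.
body = make_rpc_request("play_hand", {"cards": [0, 2, 4]}, req_id=1)
```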
## Architecture

Balatron uses a hybrid architecture that combines neural network learning with hard-coded Balatro expertise:
| Layer | Responsibility | Examples |
|---|---|---|
| Neural Network | Strategic decisions under uncertainty | When to reroll, which jokers to prioritize, when to leave the shop, blind skip timing |
| Action Mask | Logit biasing to guide exploration | Boost scoring jokers in shop, suppress bad jokers, penalize unaffordable items |
| Heuristic Guards | Validate and override bad decisions | Block selling your only joker, prevent buying when it tanks interest, force-buy Blueprint/Brainstorm, auto-buy Hermit, Soul card priority in packs |
| Scoring Engine | Full Balatro math for hand evaluation | Compute exact scores with joker effects, retriggers, editions, enhancements, boss debuffs |
| Strategic Advisor | Optimal hand/discard selection | plan_optimal_action() — evaluates all possible plays against blind target with draw probability |
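To illustrate the guard layer, here is a simplified sketch of two such rules (the function names and thresholds are illustrative, not the project's actual code):

```python
def guard_sell_joker(n_jokers_owned: int) -> bool:
    """Block selling a joker when it is our only one (would zero out scoring power)."""
    return n_jokers_owned > 1

def guard_purchase(money: int, cost: int, reserve: int = 25) -> bool:
    """Simplified interest guard: prefer to keep $25 banked (the max interest tier).
    The real guard also has force-buy exceptions (e.g. Blueprint/Brainstorm)."""
    return money - cost >= reserve
```

In the real pipeline such guards run after the network's choice and can veto or override it.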
```
Input (814) --> Shared Trunk (814 -> 768 -> 768 -> 512)
                      |
                      |--> Play Head  (512 -> 256 -> 45)  -- SELECTING_HAND state
                      |--> Shop Head  (512 -> 256 -> 45)  -- SHOP state
                      |--> Blind Head (512 -> 128 -> 45)  -- BLIND_SELECT state
                      |--> Value Head (512 -> 256 -> 1)   -- all states
```
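In PyTorch, the trunk-and-heads layout above can be sketched roughly as follows (layer sizes are taken from the diagram; the class and method names are illustrative, not the actual `network.py` code):

```python
import torch
import torch.nn as nn

class BalatronNet(nn.Module):
    """Sketch of the shared-trunk / phase-head layout. Sizes follow the
    diagram; everything else here is an illustrative assumption."""

    def __init__(self, obs_dim: int = 814, n_actions: int = 45):
        super().__init__()

        def block(n_in: int, n_out: int) -> nn.Sequential:
            # LayerNorm + SiLU, as described, for training stability
            return nn.Sequential(nn.Linear(n_in, n_out), nn.LayerNorm(n_out), nn.SiLU())

        self.trunk = nn.Sequential(block(obs_dim, 768), block(768, 768), block(768, 512))
        self.play_head = nn.Sequential(block(512, 256), nn.Linear(256, n_actions))
        self.shop_head = nn.Sequential(block(512, 256), nn.Linear(256, n_actions))
        self.blind_head = nn.Sequential(block(512, 128), nn.Linear(128, n_actions))
        self.value_head = nn.Sequential(block(512, 256), nn.Linear(256, 1))

    def forward(self, obs: torch.Tensor, phase: str):
        z = self.trunk(obs)
        head = {"SELECTING_HAND": self.play_head,
                "SHOP": self.shop_head,
                "BLIND_SELECT": self.blind_head}[phase]
        return head(z), self.value_head(z)  # (action logits, state value)
```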
- 3 policy heads specialized for different game phases — the shop head never sees hand-play decisions and vice versa
- Shared trunk learns representations useful across all phases (joker synergies, economy state, deck composition)
- Layer normalization + SiLU activation for training stability
- Action-conditioned target sampling — after selecting an action type, target logits are masked so only valid targets can be chosen (e.g., `buy_pack` can only target pack slots, not joker slots)
The game state is encoded into an 814-dimensional float vector with careful normalization:
| Section | Dimensions | Contents |
|---|---|---|
| Game Meta | 42 | Ante, round, money (log-scaled), hands/discards left, reroll cost, chips/target (log), boss blind info, blind statuses, joker slots |
| Hand Levels | 79 | 13 poker hand types x (level, chips, mult) + play counts, all normalized |
| Deck Composition | 61 | 52 rank x suit counts + 9 enhancement/seal counters |
| Vouchers | 32 | Binary flags for each voucher owned |
| Joker Slots | 270 | 5 slots x 54-field fingerprint: tier weight, effect flags, values (log-scaled), triggers, edition, modifiers, runtime scaling value |
| Hand Cards | 96 | 12 slots x 8 fields: rank, suit, enhancement, seal, edition, debuffed, base chips, is_face |
| Consumables | 12 | 2 slots x 6 fields: type, key hash, cost, value estimate, needs targeting, is negative |
| Shop Contents | 162 | 3 joker fingerprints + 2 vouchers + 2 packs, all with cost/affordability |
| Pack Cards | 10 | 5 slots x 2 fields for opened booster pack contents |
| Boss Blind | 10 | One-hot category + suit debuff encoding |
| Scoring Context | 40 | Projected scores, risk assessment, hand type features |
Design principle: Jokers are encoded as property fingerprints (what the joker does) rather than joker IDs. This means the network generalizes across jokers with similar effects — it doesn't need to memorize 150 individual joker behaviors, it learns that "x2 mult on face cards" is valuable regardless of which joker provides it.
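A toy version of this fingerprint idea (the field layout below is invented for illustration; the real encoder uses a 54-field schema per joker):

```python
import math

def encode_joker(joker: dict, n_fields: int = 54) -> list[float]:
    """Encode a joker by its properties rather than its identity, so that
    'x2 mult on face cards' looks the same regardless of which joker provides it."""
    vec = [0.0] * n_fields
    vec[0] = joker.get("tier_weight", 0.0)                          # competitive value
    vec[1] = 1.0 if joker.get("xmult") else 0.0                     # effect flag
    vec[2] = math.log1p(joker.get("xmult_value", 0.0))              # log-scaled value
    vec[3] = 1.0 if "face_card" in joker.get("triggers", []) else 0.0  # trigger flag
    return vec

photograph = {"tier_weight": 0.7, "xmult": True, "xmult_value": 2.0,
              "triggers": ["face_card"]}
fp = encode_joker(photograph)
```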
Actions are represented as a 14-element tensor: `[action_type(1), card_selection(12), target(1)]`
14 action types:
| Index | Action | Used In |
|---|---|---|
| 0 | Play hand | SELECTING_HAND |
| 1 | Discard | SELECTING_HAND |
| 2 | Buy shop card (joker/planet/tarot) | SHOP |
| 3 | Buy voucher | SHOP |
| 4 | Buy pack | SHOP |
| 5 | Sell joker | SHOP |
| 6 | Sell consumable | SHOP |
| 7 | Reroll shop | SHOP |
| 8 | Use consumable | SELECTING_HAND, SHOP |
| 9 | Select blind | BLIND_SELECT |
| 10 | Skip blind | BLIND_SELECT |
| 11 | Select pack card | PACK_OPENED |
| 12 | Skip pack | PACK_OPENED |
| 13 | End shop | SHOP |
19 target slots map to different game objects depending on the action type:
```
[0-2]   Shop joker slots    (for buy_joker)
[3-4]   Shop voucher slots  (for buy_voucher)
[5-6]   Shop pack slots     (for buy_pack)
[7-11]  Owned joker slots   (for sell_joker)
[12-13] Consumable slots    (for sell/use_consumable)
[14-18] Pack card slots     (for select_pack_card)
```
Action-conditioned targeting: After the network selects an action type, the target logits are masked to only allow valid targets for that action. This eliminates the "mismatched target" problem where independent sampling could pair buy_pack with a joker sell target.
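A minimal sketch of this conditioning, using the slot ranges from the table above (pure Python for clarity; the real implementation operates on tensors):

```python
# Valid target slot ranges per action type, matching the mapping above.
TARGET_SLOTS = {
    "buy_joker": range(0, 3),
    "buy_voucher": range(3, 5),
    "buy_pack": range(5, 7),
    "sell_joker": range(7, 12),
    "use_consumable": range(12, 14),
    "select_pack_card": range(14, 19),
}

def mask_target_logits(logits: list[float], action: str) -> list[float]:
    """Set logits of invalid target slots to -inf so sampling can never
    pair an action with a mismatched target."""
    valid = set(TARGET_SLOTS[action])
    return [x if i in valid else float("-inf") for i, x in enumerate(logits)]
```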
Rewards are carefully structured to align with Balatro's exponential scoring:
| Reward | Value | Description |
|---|---|---|
| Game Win (Ante 8 clear) | +10.0 | Phase 1 primary goal |
| Game Loss | -5.0 + 0.3/ante | Harsh base penalty, softened by progress |
| Naneinf Achievement | +50.0 | Phase 2 ultimate goal |
| Ante Cleared | +3.0 + 0.5/ante | Scales with difficulty |
| Blind Cleared | +1.0 (+1.5 boss) | Per-blind progress signal |
| Score vs Target | +0.5 * log10(ratio) | Log-scaled overkill bonus |
| Money Gain | +0.02/dollar | Economy awareness |
| Money Spent | -0.01/dollar | Light spending penalty (spending is necessary) |
| Scaling Growth | +0.05/log-unit | Encourages scaling joker investment |
All score-based rewards use log10 scaling because Balatro scores grow exponentially — the agent learns to push the exponent higher.
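For example, the score-vs-target bonus from the table above works out as:

```python
import math

def overkill_bonus(score: float, target: float, coeff: float = 0.5) -> float:
    """Log-scaled overkill reward: +0.5 * log10(score / target)."""
    return coeff * math.log10(score / target)

# Beating a 10,000-chip blind with 1,000,000 chips (100x overkill) earns +1.0,
# while merely matching the target earns 0.0: the agent is paid per order of magnitude.
```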
## Scoring Engine

The scoring engine (`hand_eval.py`) implements the full Balatro scoring formula:

```
Score = (hand_chips + card_chips + joker_chips) x (hand_mult + joker_mult) x joker_xmult
```
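As a quick worked example of the formula (the card values here are made up for illustration):

```python
def balatro_score(hand_chips, card_chips, joker_chips,
                  hand_mult, joker_mult, joker_xmult):
    """The scoring formula above, as plain arithmetic."""
    return (hand_chips + card_chips + joker_chips) * (hand_mult + joker_mult) * joker_xmult

# A level-1 Flush (35 chips, 4 mult), 38 chips of played cards,
# a +4 mult joker, and a single x2 joker:
#   (35 + 38 + 0) * (4 + 4) * 2 = 1168
score = balatro_score(35, 38, 0, 4, 4, 2.0)
```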
Features implemented:
- Complete hand classification (High Card through Flush Five)
- All 150 joker effects — chips, mult, xmult, scaling, conditional triggers
- Retrigger system — Hanging Chad (first card +2), Sock and Buskin (face cards +1), Hack (2/3/4/5 +1), Seltzer (all +1), Dusk (last hand), Red Seal (+1)
- Card enhancements: Bonus (+30 chips), Mult (+4 mult), Wild (any suit), Glass (x1.5, may shatter), Steel (x1.5 held), Stone (50 chips, no rank/suit), Gold (+3 money), Lucky (chance mult/money)
- Card editions: Foil (+50 chips), Holographic (+10 mult), Polychrome (x1.5)
- Boss blind debuffs: suit debuffs (Club, Goad, Window, Head), face debuff (Plant), rank debuffs
- Blueprint/Brainstorm copy chain resolution
- Joker ordering optimization (chips -> mult -> xmult left-to-right)
- Draw probability calculation for discard decisions
- Deck composition tracking for suit synergy awareness
- Death tarot and Hanged Man suit-aware targeting
Strategic advisor (`plan_optimal_action`) evaluates every possible play and discard against:
- Current blind target score
- Remaining hands/discards
- Draw probabilities for completing better hands
- Joker synergies with specific suits/ranks
- Boss blind debuffs on specific suits
- Chase target viability — only discards when the chase target can realistically beat the blind
- Baseline scoring — uses actual expected hand value (planet levels + joker effects) for multi-hand projections
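The draw-probability piece can be sketched with a standard hypergeometric calculation (a simplified stand-in for the project's actual implementation):

```python
from math import comb

def draw_probability(outs: int, deck_size: int, draws: int) -> float:
    """P(drawing at least one of `outs` useful cards in `draws` draws from a
    deck of `deck_size`), via the hypergeometric complement: 1 - P(zero hits)."""
    if draws >= deck_size:
        return 1.0 if outs > 0 else 0.0
    return 1.0 - comb(deck_size - outs, draws) / comb(deck_size, draws)

# e.g. 4 remaining Hearts in a 44-card remainder, discarding 3: ~25% to hit one.
p = draw_probability(outs=4, deck_size=44, draws=3)
```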
## Joker Schema Database

All 150 base-game jokers are encoded in `data/jokers.py` as structured schemas:
```python
"Photograph": make_joker(
    name="Photograph",
    xmult=True,
    xmult_value=2.0,
    triggers=["face_card"],
    score_effect=["xmult"],
    scoring_timing="during_card",
)
```

Each schema captures:
- Effect type: chip, mult, xmult, economy, scaling, retrigger, copy, game_param
- Trigger conditions: any_hand, specific_hand_type, face_card, specific_suit, specific_rank, scoring_card, periodic, per_dollar_held, per_joker_owned, etc.
- Scoring timing: `during_card` (fires per scored card) vs `after_cards` (fires once per hand)
- Scaling behavior: flat addition, multiplication, start value, increment, decay
- Per-card instance: whether the effect fires once or per qualifying card
- Effect probability: for chance-based jokers (Bloodstone 1/3, Lucky Card 1/5, etc.)
- Tier weights: competitive value rating for shop decisions
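For chance-based effects, what matters for evaluation is the expected multiplier, not the peak one. A sketch (the numbers below are generic, not taken from the schema database):

```python
def expected_xmult(xmult: float, p: float) -> float:
    """Expected multiplier of a chance-based xmult effect: it fires with
    probability p, otherwise contributes a neutral x1."""
    return p * xmult + (1.0 - p) * 1.0

# A x2.0 effect that fires 1/3 of the time is worth ~x1.33 on average.
avg = expected_xmult(2.0, 1.0 / 3.0)
```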
## Project Structure

```
balatron/
|-- agent/
|   |-- network.py          # Neural network (shared trunk + 3 heads + value)
|   |-- ppo.py              # PPO trainer, rollout buffer, GAE
|
|-- environment/
|   |-- game_state.py       # 814-dim state vector encoder
|   |-- action_space.py     # Action masks, target mapping, joker evaluation
|   |-- hand_eval.py        # Full scoring engine, hand classifier, strategic advisor
|   |-- reward.py           # Reward shaping (log-scaled, multi-tier)
|
|-- data/
|   |-- jokers.py           # 150 joker schemas + tier weights
|
|-- training/
|   |-- train.py            # Main training loop, episode management, auto-play heuristics
|
|-- recorder.py             # Automated win recording via ffmpeg gdigrab
|
|-- scripts/
|   |-- sim_bloodstone.py   # Simulation utilities
|
|-- tests/
|   |-- test_scoring.py     # Scoring engine validation
|
|-- NOTES.md                # Architecture decisions, state vector layout, design rationale
|-- LICENSE                 # MIT License
|-- README.md               # This file
```
## Prerequisites

- Python 3.12+
- PyTorch with CUDA support (GPU strongly recommended for training)
- Balatro (Steam version)
- Steamodded (>= 0.9.8) — Balatro mod loader
- BalatroBot mod (v1.4.1+) — JSON-RPC API for game control
Reference hardware used during development:

- AMD Ryzen 9 9800X3D
- NVIDIA RTX 5070 Ti
- Training runs at ~1500 steps/hour with continuous Balatro gameplay
## Installation

```
git clone https://github.com/jarmstrong158/Balatron.git
cd balatron
pip install torch numpy
```

Follow the BalatroBot installation guide. The mod requires Steamodded as a dependency.
The mod should be installed to:

```
%APPDATA%\Balatro\Mods\balatrobot\
```

Then install the `balatrobot` Python package:

```
pip install balatrobot
```

## Usage

Training requires two terminals running simultaneously:
Terminal 1 — Start BalatroBot server + Balatro game:
```
uvx balatrobot serve --fast
```

This launches the Balatro game with the BalatroBot mod injected and starts the JSON-RPC API server.
Terminal 2 — Start training:
```
cd balatron
python -u -m training.train --total-timesteps 1500000 --device cuda
```

Training progress is printed to the console:
```
Runs: 50 (lifetime: 500) | Wins: 3 (lifetime: 12)
Avg reward: 8.42 | Best: 22.1 | Avg ante: 5.3
Win rate: 6.0% (lifetime: 2.4%)
Steps: 125000 / 1500000 (8.3%)
```
To resume training from a saved checkpoint:

```
python -u -m training.train --total-timesteps 1500000 --device cuda --checkpoint checkpoints/balatron_phase1_final.pt
```

Checkpoints are saved automatically during training.
Wins are automatically recorded via ffmpeg screen capture. Use `--no-record` to disable recording (reduces CPU/disk overhead):

```
python -u -m training.train --total-timesteps 1500000 --device cuda --checkpoint checkpoints/balatron_phase1_final.pt --no-record
```

## Training Phases

### Phase 1

Goal: Reliably clear Ante 8 on White Stake (base difficulty).
The agent learns:
- Basic hand selection (play the highest-scoring hand)
- Joker evaluation (which jokers improve scoring power)
- Economy management (interest tiers, spending discipline)
- Shop decisions (when to buy, reroll, or leave)
- Consumable usage (planet cards to upgrade hand levels, tarots for deck manipulation)
- Blind selection (when to skip for tags vs. accept for money)
### Phase 2

Goal: Push scores to naneinf (~1.80e308) using transfer learning on Phase 1 weights.
The agent will learn:
- Infinite scaling combos (scaling jokers that compound over time)
- Deck thinning strategies (reduce deck to increase consistency)
- Suit-focused builds (concentrate on one suit for flush synergies)
- Boss blind manipulation (skip or prepare for specific boss effects)
- Long-game economy (accumulate wealth for later antes)
## Key Design Decisions

Why PPO:

- Variable action space — Balatro's valid actions change dramatically between game phases. PPO handles continuous/discrete mixed spaces better than DQN.
- Long horizon — A full Balatro run is 100+ decisions across 8+ antes. PPO's advantage estimation handles long-horizon credit assignment.
- Training stability — PPO's clipped objective prevents catastrophic policy updates, critical when the reward signal is sparse (most feedback comes at game end).
Pure RL would require millions of games to learn basic Balatro math from scratch. The hybrid approach:
- Heuristics handle what's computable — exact scoring, hand classification, joker ordering
- NN handles what requires judgment — shop strategy, risk assessment, build direction
- Action masks bridge the gap — soft biases that guide exploration without hard-blocking learning
Encoding jokers by their properties (chips, mult, xmult, triggers, timing) rather than as one-hot IDs means:
- The network generalizes across similar jokers automatically
- New jokers (mods) work without retraining if their properties are encoded
- The state space stays manageable (54 fields per joker vs. 150-dim one-hot)
## Acknowledgments

This project is built on top of BalatroBot, the JSON-RPC 2.0 API mod that makes programmatic Balatro interaction possible. Without this foundational work, none of this project would exist.
BalatroBot Authors:
- S1M0N38 (primary author)
- stirby
- phughesion
- besteon
- giewev
BalatroBot is licensed under the MIT License.
Balatro is created by LocalThunk. This project is a fan-made AI research project and is not affiliated with or endorsed by LocalThunk or Playstack.
- PyTorch — Neural network framework
- Steamodded — Balatro mod loader
- Claude Code — AI-assisted development
## License

This project is licensed under the MIT License — see LICENSE for details.
The BalatroBot mod (used as a dependency, not included in this repo) is separately licensed under MIT by its respective authors.