
Releases: meta-pytorch/tritonparse

TritonParse v0.4.4 Release 🎉

23 Apr 03:32


TritonParse Release Notes v0.4.4 (23 commits)

  • Date range: 2026-04-09 — 2026-04-22
  • Scope: Feature release — new compat_builder module for automated Triton/LLVM compatibility mapping, PyTorch bisection support, AI-powered diff root cause analysis, CLP archive viewer support, and reproducer correctness fixes.

Highlights

  • 🏗️ New compat_builder Module: Brand-new package (~2,085 lines across 8 modules) that automates generating commits.csv files for LLVM bumps in Triton. Uses a state-machine-driven workflow (CompatBuilder) with 7 phases, git bisect–based compatibility probing (build → import → smoke test), AI-powered fix generation via ClaudeCodeClient, CSV management with metadata headers, and a full CLI with --resume, --verify, and --status modes. Includes 200+ tests covering all pure-logic paths. Integrated into the main tritonparse CLI as the compat-build subcommand.

  • 🔍 PyTorch Bisection Support (#377): Extends the bisect module (~1,030 new lines) to bisect PyTorch commits in addition to Triton/LLVM. New TorchBisector class drives git bisect over a PyTorch repo using user-provided test scripts. Includes build infrastructure scripts for CUDA, cuSparseLt, and Magma installation, plus a prepare_build_pytorch.sh that sets up the PyTorch build environment. Accessible via tritonparse bisect --target torch.

  • 🤖 AI-Powered Diff Root Cause Analysis: Adds Phase 2 AI analysis to tritonparse diff --ai. Deterministic diff results from Phase 1 (metadata, IR stats, source mappings, tensor values) are formatted as structured markdown and sent to an LLM, which returns root cause explanations as DiffNote objects. Architecture includes a Triton-expert system prompt, priority-ordered context builder, and three response parsing strategies (JSON, structured markdown, raw text fallback). Supports both single-kernel and trace-level analysis with significance thresholds.

  • 📦 CLP Archive Support in Log Viewer (#382): The web viewer can now load and parse CLP (Compressed Log Processor) archives directly, completing the pipeline started in #326 where structured logging gained CLP output support. Updates DataSourceSelector, WelcomeScreen, and dataLoader.ts to handle CLP file selection and decompression via clp-ffi-js.

  • 🔧 OVERRIDE_TTIR Constexpr Interleaving Fix (#384): Fixes a TypeError that broke all triton-mpp analyze subcommands (ncu, barrier-analysis, plot-sm-occupancy) when kernel signatures interleave constexpr and non-constexpr parameters. The OVERRIDE_TTIR reproducer branch was removing constexpr args from positional lists, shifting remaining args into wrong positions. Fix passes all non-constexpr args as keyword args, eliminating position-dependent binding entirely.

  • 📝 Documentation Overhaul: Moves all GitHub Wiki pages into a version-controlled docs/ directory (~5,000 lines) with automatic wiki sync via GitHub Actions. Updates API signatures, adds documentation for diff, bisect, and compat-build subcommands, fixes outdated environment variable references, and corrects test commands.

Changes by Area

🏗️ New compat_builder Module

  • State Machine (state.py): CompatBuildPhase 7-phase enum (INITIALIZING → COMPLETED/FAILED), CompatBuildState dataclass with JSON serialization, CompatStateManager for persistence. 218 lines + 251 lines of tests.
  • Core Builder (builder.py, PR2-01): CompatBuilder orchestrator driving the initialize → find_next_incompatible → record_pair → fix_incompatibility loop. 773 lines + 634 lines of tests.
  • CSV Manager (csv_manager.py, PR2-02): CSVManager and BumpBlock for reading, validating, and writing single-bump CSV files with metadata headers. 261 lines + 413 lines of tests.
  • AI Fixer (ai_fixer.py, PR3-01): AI-powered compatibility fixing following a two-phase (deterministic context + AI) pattern. System prompt encoding LLVM API change patterns, structured context builder, AICompatFixer orchestrator. 442 lines + 346 lines of tests.
  • CLI (cli.py, PR3-02): Four modes — default build, --resume, --verify, --status. AI control flags (--ai/--no-ai, --ai-model) and worktree management. 364 lines + 225 lines of tests.
  • CLI Integration (PR3-03): compat-build subparser wired into the main tritonparse CLI.
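
The persistence layer described above can be pictured as an enum plus a dataclass with JSON round-tripping. A minimal sketch — only INITIALIZING, COMPLETED, and FAILED are named in the notes, so the intermediate phase and the fields here are illustrative, not the actual tritonparse internals:

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum


class CompatBuildPhase(Enum):
    # Only INITIALIZING, COMPLETED, and FAILED appear in the release notes;
    # PROBING is an illustrative placeholder for the intermediate phases.
    INITIALIZING = "initializing"
    PROBING = "probing"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class CompatBuildState:
    phase: CompatBuildPhase = CompatBuildPhase.INITIALIZING
    pairs_recorded: list = field(default_factory=list)

    def to_json(self) -> str:
        d = asdict(self)
        d["phase"] = self.phase.value  # enums are not directly JSON-serializable
        return json.dumps(d)

    @classmethod
    def from_json(cls, text: str) -> "CompatBuildState":
        d = json.loads(text)
        d["phase"] = CompatBuildPhase(d["phase"])
        return cls(**d)
```

Round-tripping through JSON like this is what makes --resume possible: the state file survives process restarts.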

🔍 Bisect Enhancements

  • PyTorch Bisection (#377): New TorchBisector class (142 lines), shell scripts for CUDA/cuSparseLt/Magma installation and PyTorch builds (~644 lines), CLI extension with --target torch. 130 lines of tests.
  • Torch Bisect Script Fixes (#383): Setup CUDA_HOME, install cuSparseLt libraries, install CI requirements across all bisect build scripts.
  • LLVM Path Comment Fix: Corrected misleading comments in bisect scripts about .llvm-project/ vs llvm-project/ directory layout.

🤖 AI & Diff

  • AI Root Cause Analysis for Diff: diff/fb/ai/ module with system prompt, context builder, and AIDiffAnalyzer orchestrator. --ai flag for both single-kernel and trace-level diff modes. 390 lines + tests (moved to tests/fb/diff/).
  • AI Diff Test Relocation: Moved fb-only AI diff tests from tests/cpu/diff/ to tests/fb/diff/ to prevent ModuleNotFoundError on GitHub CI.

🔧 Reproducer Fixes

  • OVERRIDE_TTIR Constexpr Fix (#384): Pass non-constexpr args as keyword args in the override branch, preventing TypeError when constexprs are interleaved with positional args. 123 lines of new tests.
  • num_warps_base Extraction: Extract original num_warps from TTGIR ttg.num-warps module attribute during parse phase, storing it as metadata["num_warps_base"]. Fixes warp-specialized kernels reporting inflated warp counts to the reproducer and viewer.
  • Per-Hash Tensor Blob Saving (#380): Tensor blob saving counter changed from global to per-compilation-hash. Each autotuned config saves exactly one set of blobs instead of only the first winner. Benchmark (autotune timing) launches are now always skipped.
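
The constexpr-interleaving bug and its keyword-binding fix can be demonstrated with a plain Python function standing in for a @triton.jit kernel (names here are illustrative, not the tritonparse internals):

```python
def kernel(x_ptr, BLOCK, y_ptr, n):
    # Stand-in for a kernel whose signature interleaves a constexpr
    # (BLOCK) between regular parameters.
    return {"x_ptr": x_ptr, "BLOCK": BLOCK, "y_ptr": y_ptr, "n": n}


names = ["x_ptr", "BLOCK", "y_ptr", "n"]
values = ["X", 64, "Y", 128]
constexpr = {"BLOCK"}

# Buggy approach: strip constexpr values but keep positional binding.
# The remaining values shift left, "Y" lands in BLOCK's slot, and the
# explicit BLOCK=64 keyword then collides with it -> TypeError.
positional = [v for nm, v in zip(names, values) if nm not in constexpr]
try:
    kernel(*positional, BLOCK=64)
except TypeError as e:
    shifted_error = e

# Fixed approach: bind every non-constexpr argument by keyword, so
# positions no longer matter regardless of where constexprs sit.
kwargs = {nm: v for nm, v in zip(names, values) if nm not in constexpr}
bound = kernel(BLOCK=64, **kwargs)
```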

🌐 Website & Viewer

  • CLP Archive Loading (#382): clp-ffi-js integration for decompressing and parsing CLP archives in the browser-based log viewer.
  • ESLint 10 Upgrade (#378): ESLint v9 → v10, react-hooks canary channel, React 19.2.5, Vite 8.0.7, TypeScript-ESLint 8.58.1.
  • ESLint 10 Lint Fixes (#379): Comprehensive fixes across 10 files for new lint rules — lazy state initialization, useCallback wrapping, extracted utility modules, error cause chaining.
  • Vite Security Bump (#381): Vite 8.0.7 → 8.0.8 (dependabot).

⚙️ Internal Improvements

  • TRITONPARSE_FB_MODE Env Var: Override is_fbcode() detection with TRITONPARSE_FB_MODE=0 (OSS) or =1 (fbcode). Fixes ImportError when running inside fbsource without Meta-internal dependencies.
  • Torch as Hard Dependency: Removed TORCH_INSTALLED conditional flag and 12 guard branches in structured_logging.py. Torch was already a de facto hard dependency.
  • FileCheck Binary Detection: Check package root, AMD backend, and NVIDIA backend paths (not just AMD), matching Triton's own _filecheck.py convention.
  • importlib.resources for Procedure Checks: Fix default_procedure_checks.json loading in PAR archives by switching from Path(__file__).parent to importlib.resources.files().
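
The TRITONPARSE_FB_MODE override above amounts to a three-way branch in front of the automatic check. A minimal sketch — the precedence and default shown here are assumptions, not the exact tritonparse logic:

```python
import os


def is_fbcode(_autodetect=lambda: False) -> bool:
    # Environment-variable escape hatch over automatic detection.
    # "0" forces OSS mode, "1" forces fbcode mode, anything else
    # (including unset) falls through to autodetection.
    override = os.environ.get("TRITONPARSE_FB_MODE")
    if override == "0":
        return False
    if override == "1":
        return True
    return _autodetect()
```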

📝 Documentation & CI

  • Wiki → docs/ Migration: 10 wiki pages (5,000+ lines) moved into version-controlled docs/ directory with automatic sync via GitHub Actions.
  • Wiki Sync Regex Fix (#390): Escape literal ) in sed extended regex to fix sync-wiki.yml workflow.

Compatibility Notes

  • torch is now a hard dependency: The TORCH_INSTALLED guard has been removed. Environments without PyTorch installed will fail at import time rather than silently degrading.
  • TRITONPARSE_FB_MODE env var: New escape hatch for users running inside fbsource without full Meta-internal dependencies — set TRITONPARSE_FB_MODE=0 to force OSS mode.
  • No other breaking changes to the public API.

TritonParse v0.4.3 Release 🎉

08 Apr 23:24


  • Date range: 2026-04-01 — 2026-04-08
  • Scope: Bug-fix release — OVERRIDE_TTIR reproducer rewrite with stub kernel generation, warp-specialized kernel num_warps fix, Manifold upload scoping to fbcode MAST environments, OSS atexit cleanup fix, and _json_compat extensions.

Highlights

  • 🔧 OVERRIDE_TTIR Reproducer Rewrite (#376): Complete rewrite of the OVERRIDE_TTIR reproducer mode. The previous implementation was broken — it skipped defining the kernel function (causing NameError), only worked for autotuned kernels, and discarded constexpr values. The new approach generates a stub triton.jit function (same name and params, pass body) wrapped with triton.autotune carrying captured constexpr values, compile params, and ir_override pointing to the captured TTGIR. This eliminates the need to copy kernel source code and its transitive dependencies.

  • 🐛 Warp-Specialized Kernel Reproducer Fix: Fixed ptxas "Insufficient registers" failure when reproducing warp-specialized kernels. The Triton compiler overwrites metadata["num_warps"] with the post-expansion count (ttg.total-num-warps), causing the reproducer to double-inflate the warp count. The fix extracts the original ttg.num-warps from TTGIR module attributes instead.

  • 🔒 Manifold Upload Scoping: Manifold upload is now only enabled by default in fbcode MAST environments (detected via torch.version.git_version and MAST_HPC_JOB_NAME), preventing ModuleNotFoundError in OSS environments during atexit cleanup.

Changes by Area

🔧 Reproducer Enhancements

  • OVERRIDE_TTIR Stub Generation (#376): New stub_generator.py (~137 lines) generates stub Triton functions and extracts constexpr values. Rewritten _replace_kernel_import for OVERRIDE_TTIR generates stub + autotune config. _replace_kernel_invocation filters constexpr/compile params (autotune provides them). Captured IR files saved from compilation event's file_content to captured_irs/. Uses lru_cache on extract_params_from_source to avoid redundant AST parses.
  • Warp-Specialized num_warps Fix: At reproducer generation time, extracts original ttg.num-warps from TTGIR module attributes instead of the inflated metadata["num_warps"]. The post-expansion count is preserved as total_num_warps for informational purposes.
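
Pulling the original warp count out of the TTGIR module header boils down to a small regex over the module attributes. A sketch — the header shape shown is approximate, and care is needed not to match ttg.total-num-warps:

```python
import re

# Approximate TTGIR module header; the real text may differ slightly.
ttgir = 'module attributes {"ttg.num-warps" = 4 : i32, "ttg.total-num-warps" = 8 : i32} {'


def extract_num_warps_base(ir: str):
    # "ttg." must be immediately followed by "num-warps", so the
    # ttg.total-num-warps attribute can never match this pattern.
    m = re.search(r'"?ttg\.num-warps"?\s*=\s*(\d+)', ir)
    return int(m.group(1)) if m else None
```

Here the pre-expansion count (4) is recovered even though the compiler-reported metadata would say 8.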

JSON Compatibility Layer

  • _json_compat.py Extensions: Added load(f) and dump(obj, f) file-based convenience wrappers that delegate to the existing loads()/dumps().
  • CUTracer Migration: All 14 CUTracer production Python files migrated from stdlib json to tritonparse._json_compat, providing a free 3-10x JSON performance upgrade via orjson with graceful degradation.
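
The file-based wrappers follow the stdlib json shape: read or write the file, then delegate to the string-based functions. A sketch of that delegation pattern, with stdlib json standing in for the orjson-backed loads()/dumps():

```python
import io
import json


def loads(s):
    return json.loads(s)


def dumps(obj, **kw):
    return json.dumps(obj, **kw)


def load(f):
    # File-based wrapper delegating to loads(), mirroring json.load().
    return loads(f.read())


def dump(obj, f, **kw):
    # File-based wrapper delegating to dumps(), mirroring json.dump().
    f.write(dumps(obj, **kw))
```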

🔒 Manifold Upload & OSS Fixes

  • Scoped Default (#374, 3337a0c): TRITONPARSE_TRACE_MANIFOLD now defaults to "0" (OFF) and is only auto-enabled when running in fbcode and in a MAST environment. The env var override still works in all environments.
  • OSS atexit Fix (#374): Gated the Manifold upload path in _cleanup() behind is_fbcode() to prevent ModuleNotFoundError: No module named 'tritonparse.fb' during atexit in OSS environments.

🏗️ Infrastructure & CI

  • Packaging Workaround (#370): Added explicit pip install packaging in CI setup to work around PyTorch nightly (2.12.0.dev20260405+) missing dependency on packaging module.
  • Pin Node.js in CI: Pinned Node.js version in GitHub Actions CI workflows for reproducible builds.
  • Website Dependencies: Upgraded website dependencies and fixed Vite 8 / ESLint compatibility. Bumped vite from 8.0.3 to 8.0.5 (security fix).
  • Internal Repo Re-sync (#375): Cleaned up Claude Code configuration files that were incorrectly synced to the OSS repository.

Compatibility Notes

  • No breaking changes: This is a bug-fix release with no API or behavior changes for existing users.
  • Manifold upload default changed: TRITONPARSE_TRACE_MANIFOLD now defaults to OFF in non-fbcode environments. Users who relied on the previous default of ON in OSS should explicitly set TRITONPARSE_TRACE_MANIFOLD=1.
  • OVERRIDE_TTIR reproducer: The reproducer output format for OVERRIDE_TTIR mode has changed (stub kernel + autotune wrapper instead of source copy), but the generated reproducers are functionally equivalent and more reliable.

Upgrade Guidance

  1. Standard upgrade:

    pip install --upgrade tritonparse
  2. Warp-specialized kernel reproducers: Previously failing reproducers for warp-specialized kernels should now work correctly without manual intervention.

TritonParse v0.4.2 Release 🎉

01 Apr 02:33


TritonParse Release Notes v0.4.2 (45 commits)

  • Date range: 2026-02-27 — 2026-03-30
  • Scope: Feature release — New ai module for LLM-powered analysis, whole-trace --trace diff mode with multi-strategy kernel matching, FileCheck-based procedure detection replacing hardcoded BlockPingpong, orjson performance optimization with free-threading fallback, torch trace kernel attribution, JSON schema validation, and kernel-run-level tensor blob save controls.

Highlights

  • 🤖 New ai Module: LLM client abstraction layer with LLMClient ABC, ClaudeCodeClient for Claude Code CLI integration, MockClient for testing, and output parsers (extract_json, extract_code_block, extract_diff_patch). Foundation for AI-powered analysis features.

  • 🔬 Whole-Trace Diff (--trace mode): Compare all kernels across two trace files with a single command. Multi-strategy KernelMatcher engine matches kernels by hash → name → source similarity → fuzzy name → config similarity. TraceDiffEngine orchestrates matching, per-pair diffing, and summary generation. Autotuning-aware: distinguishes truly absent kernels from unpaired autotuning compilations.

  • 📋 FileCheck-Based Procedure Detection: Complete rewrite of IR analysis from hardcoded Python pattern matching to a JSON-driven, FileCheck-based system. Procedure definitions are declarative with configurable pattern checks and display attributes. Replaces old BlockPingpongCategory with three configurable procedure configs (Small/Medium/Large). Tile size attributes (M, N, K, bits) now displayed.

  • ⚡ orjson Performance + Free-Threading Fallback: New _json_compat.py compatibility layer uses orjson for performance and falls back to stdlib json for CPython 3.14 free-threading builds. All 21 modules migrated. orjson>=3.9 and rich>=13.0 are now default dependencies.

  • 🔍 Torch Trace Kernel Attribution: New torch trace log parser extracts kernel_source_path → CompileInfo mappings from inductor's output code events, enabling kernel-to-compilation-frame attribution when pt_info is missing. Wired through parse pipeline and CLI via --torch-trace-dir.

  • ✅ JSON Schema Validation: New tritonparse/validation/ module with JSON schemas for compilation, launch, launch_diff, and ir_analysis event types. Lightweight validator checks types, required fields, enums, numeric constraints, and $ref resolution.

  • 🎛️ Kernel-Run-Level Tensor Blob Controls: New TRITONPARSE_TENSOR_SAVE_SKIP_RUNS and TRITONPARSE_TENSOR_SAVE_MAX_RUNS environment variables (and Python API) for fine-grained control over which kernel runs get tensor blob snapshots.

Changes by Area

🤖 New ai Module

A new tritonparse/ai/ module (~1,400 lines) providing LLM client abstractions:

  • LLM Client ABC (PR-1): LLMClient abstract base with chat() and chat_stream() interfaces; Message, Response, ToolCall dataclasses; MockClient for testing
  • ClaudeCodeClient (PR-2): Production client wrapping Claude Code CLI with temp file shell escaping, session resumption, model selection, retry logic, JSON/stream-JSON parsing
  • Output Parsers (PR-3): extract_json(), extract_code_block(), extract_diff_patch() fallback parsers for LLM text responses; format_messages(), truncate_context() utilities
  • Error Diagnostics: Improved error handling extracts actual error from stdout JSON "result" field instead of just stderr

🔬 Whole-Trace Diff (--trace mode)

A complete trace-level comparison system (~3,400 lines) with layered architecture:

  • Data Types: MatchMethod enum (HASH/NAME/SOURCE/FUZZY_NAME/CONFIG), KernelMatchResult, TraceDiffResult, TraceDiffSummary, TraceStats, DtypeMismatch
  • KernelMatcher (~505 lines): Three-phase group-aware matching engine:
    • Phase 0: Hash-based exact matching (highest priority, cross-name capable)
    • Phase 1: Group-level matching by exact name → source similarity (threshold 0.75) → fuzzy name (threshold 0.7)
    • Phase 2: Within-group config pairing by (num_stages, num_warps, shared memory) similarity
    • Bounded sampling (_MAX_GROUP_SAMPLES=5) for performance on large traces
  • TraceDiffEngine (~355 lines): Orchestrator computing trace stats → kernel matching → per-pair DiffEngine → summary generation
  • Output: TraceSummaryFormatter for human-readable output; extended ConsolidatedDiffWriter with add_trace_diff()
  • CLI: New --trace flag requiring exactly 2 input files
  • Dtype Mismatch Detection: Surfaces dtype mismatches in tensor value comparison when argument names don't overlap
  • Test Reorganization: Monolithic test_diff.py split into 7 focused files: test_cli.py, test_diff_engine.py, test_fixtures.py, test_kernel_matcher.py, test_tensor_value.py, test_trace_diff.py, test_trace_output.py

CLI Usage:

# Compare all kernels across two trace files
tritonparseoss diff trace_a.ndjson trace_b.ndjson --trace

# With tensor value analysis
tritonparseoss diff trace_a.ndjson trace_b.ndjson --trace --tensor-values --atol 1e-5 --rtol 1e-3
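
Phase 1's source-similarity step can be approximated with stdlib difflib; only the 0.75 threshold comes from the notes above — the metric actually used inside KernelMatcher may differ:

```python
import difflib


def source_similarity(src_a: str, src_b: str) -> float:
    # Ratio in [0, 1]; difflib stands in for whatever similarity
    # measure the real matcher computes over kernel source.
    return difflib.SequenceMatcher(None, src_a, src_b).ratio()


def match_by_source(a: str, b: str, threshold: float = 0.75) -> bool:
    return source_similarity(a, b) >= threshold
```

Kernels that fail this check would fall through to the fuzzy-name (0.7) and config-similarity phases.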

📋 FileCheck-Based Procedure Detection

Complete rewrite of IR analysis (~2,200 lines) from hardcoded Python to JSON-driven FileCheck:

  • FileCheck Integration: Auto-discovers FileCheck binary from Triton's bundled version, FILECHECK_PATH env var, or system PATH
  • JSON Configuration (default_procedure_checks.json): Declarative procedure definitions with pattern_checks (FileCheck patterns) and display_attributes (configurable extraction rules)
  • Attribute Extraction: Multiple sources (module_attrs, ir_content, computed) with rules (regex, count, dot_shape, tile_size_bits, pp_clusters)
  • Tile Size Display: New tile_m, tile_n, tile_k, tile_size_bits attributes
  • BlockPingpong Migration: Old BlockPingpongCategory enum and ~254 lines of hardcoded Python replaced by three JSON-configured procedures (Small/Medium/Large)
  • Website UI: Collapsible/foldable sections per procedure in IRAnalysis page
  • Streamlined Workflow: Procedure detection integrated into main tritonparse parse pipeline

orjson Performance + Free-Threading Fallback

  • _json_compat.py (new): Unified JSON compatibility layer — orjson when available, stdlib json fallback
    • loads() accepts str | bytes | bytearray | memoryview
    • dumps() returns str with indent and sort_keys support
    • Non-string key coercion in fallback path (replicates orjson's OPT_NON_STR_KEYS)
  • Global Migration (#362): All 21 modules migrated from import json to from tritonparse._json_compat import loads, dumps, JSONDecodeError
  • Free-Threading Support (#365): Automatic stdlib json fallback for CPython 3.14 free-threading builds where orjson is unavailable
  • Default Dependencies (#366): orjson>=3.9 and rich>=13.0 added to pyproject.toml dependencies (previously zero dependencies)
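
The compatibility layer's shape is a try/except import with matching signatures on both paths. A simplified sketch — the real _json_compat also coerces non-string keys in the fallback (replicating orjson's OPT_NON_STR_KEYS), which is omitted here:

```python
try:
    import orjson  # fast path when available

    def loads(s):
        if isinstance(s, memoryview):
            s = bytes(s)
        return orjson.loads(s)

    def dumps(obj, indent=None, sort_keys=False) -> str:
        opts = 0
        if indent:
            opts |= orjson.OPT_INDENT_2
        if sort_keys:
            opts |= orjson.OPT_SORT_KEYS
        return orjson.dumps(obj, option=opts).decode()

except ImportError:
    # Stdlib fallback, e.g. on CPython 3.14 free-threading builds
    # where orjson is unavailable.
    import json

    def loads(s):
        if isinstance(s, (bytes, bytearray, memoryview)):
            s = bytes(s).decode()
        return json.loads(s)

    def dumps(obj, indent=None, sort_keys=False) -> str:
        return json.dumps(obj, indent=indent, sort_keys=sort_keys)
```

Callers import loads/dumps from the one module and never branch on which backend is active.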

🔍 Torch Trace Kernel Attribution

  • Torch Trace Parser (#353): New tritonparse/parse/torch_trace_parser.py (~212 lines) parsing inductor's glog-formatted torch trace logs to extract kernel_source_path → CompileInfo mappings from inductor_output_code events
  • Trace Processor Integration (#354): _build_kernel_attribution_map() and _apply_kernel_attribution() enrich compilation events with pt_info when missing (~126 lines)
  • CLI & Pipeline Wiring (#355): New --torch-trace-dir flag with auto-discovery of torch trace files from the same parent directory

JSON Schema Validation

  • Schema Files (#356): Four JSON schemas for compilation, launch, launch_diff, ir_analysis event types
  • Lightweight Validator (json_validator.py, ~287 lines): Validates required fields, types, enums, numeric constraints (min/max/exclusive), additionalProperties, array items, and $ref resolution
  • validate_trace_file(): Full NDJSON trace file validation with max_errors cap
  • Schema Loader: importlib.resources for PAR compatibility, lazy loading with caching
  • Test Suite: Comprehensive tests (~652 lines) covering all validation scenarios
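
A lightweight validator of this kind is essentially a loop over schema entries. A much-reduced sketch covering just required fields, primitive types, and enums (no $ref, numeric bounds, or array items):

```python
def validate(event: dict, schema: dict) -> list:
    """Return a list of error strings; empty means the event passed."""
    errors = []
    for name in schema.get("required", []):
        if name not in event:
            errors.append(f"missing required field: {name}")
    type_map = {"string": str, "integer": int, "object": dict, "array": list}
    for name, spec in schema.get("properties", {}).items():
        if name not in event:
            continue
        expected = type_map.get(spec.get("type"))
        if expected and not isinstance(event[name], expected):
            errors.append(f"{name}: expected {spec['type']}")
        if "enum" in spec and event[name] not in spec["enum"]:
            errors.append(f"{name}: not in enum {spec['enum']}")
    return errors
```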

🎛️ Tensor Blob Save Controls

  • Skip/Max Runs Gating: New environment variables for fine-grained control:
    • TRITONPARSE_TENSOR_SAVE_SKIP_RUNS: Skip tensor blob saving for the first N kernel runs (default: 0)
    • TRITONPARSE_TENSOR_SAVE_MAX_RUNS: Save tensor blobs for at most N kernel runs after skipping (default: 0 = unlimited)
  • Python API: TritonParseManager(tensor_save_skip_runs=N, tensor_save_max_runs=M) and init(tensor_save_skip_runs=N, tensor_save_max_runs=M)
  • Autotune-Aware: Benchmark launches during autotune are excluded from run counting
  • GPU Tests: End-to-end validation of skip/max runs gating
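
The skip/max gating reduces to a run counter that autotune benchmark launches never advance. A sketch of that logic — the class name and exact counting rules here are illustrative:

```python
class TensorSaveGate:
    """Skip blob saving for the first `skip_runs` kernel runs, then save
    for at most `max_runs` more (0 = unlimited). Illustrative sketch."""

    def __init__(self, skip_runs: int = 0, max_runs: int = 0):
        self.skip_runs = skip_runs
        self.max_runs = max_runs
        self.seen = 0

    def should_save(self, is_autotune_benchmark: bool = False) -> bool:
        if is_autotune_benchmark:
            return False  # benchmark launches are excluded from counting
        run_index = self.seen  # zero-based index of this run
        self.seen += 1
        if run_index < self.skip_runs:
            return False
        if self.max_runs and run_index >= self.skip_runs + self.max_runs:
            return False
        return True
```

So skip_runs=1, max_runs=2 saves blobs for the second and third counted runs only.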

🔧 Reproducer Enhancements

  • CUDA Graph Capture Error (#359): Clear RuntimeError when reproducing kernels launched during CUDA graph capture, explaining that argument extraction was skipped
  • Kernel Name Fallback: Reproducer/info now falls back to matching by kernel name when compilation hash is missing (Inductor kernels where JIT hook didn't fire)

🔧 Bisect Enhancements

  • --triton-repo Flag: Controls culprit commit URL prefix — oai (triton-lang/triton, default) or meta (facebookexperimental/triton); state persisted and restored on resume
  • Rich as Default Dependency (#366): rich>=13.0 moved from optional to default, simplifying bisect UI code

🏗️ Infrastructure & CI

  • GitHub Actions Update (#357): All actions updated to latest versions; Python test matrix changed from 3.11 to 3.13
  • MAST Compatibility: Handle both numeric and string state formats in MAST CLI JSON output
  • Internal Test Reorganization (#358): test_mast_compat.py moved to tests/fb/ for ...

TritonParse v0.4.1 Release 🎉

25 Feb 18:56


  • Date range: 2026-01-22 — 2026-02-24
  • Scope: Feature release — New diff CLI subcommand for kernel compilation comparison with tensor value analysis, autotune analysis visualization, profile-aware launch tracing, enhanced reproducer support, bisect auto-setup, and multi-format trace compression support.

Highlights

  • 📊 Autotune Analysis: End-to-end autotune session tracking with frontend visualization. Automatically detects autotune sessions, tracks benchmark vs winner launches, displays configuration comparison tables, and shows winner run count statistics.

  • 🔬 New diff CLI Subcommand (Beta): Complete kernel compilation diff system for comparing two compilation events. Supports metadata analysis, source mapping comparison, IR statistics diff, and tensor value comparison with configurable tolerances (--tensor-values, --atol, --rtol). Output can be appended in-place or written to new files. Note: This feature is in beta — APIs and output formats may change in future releases.

  • ⚡ Profile-Aware Launch Tracing: Transparent integration with torch.profiler via TRITON_TRACE_LAUNCH_WITHIN_PROFILING=1. Monkey-patches torch.profiler.schedule to trace launches only during the profiler's RECORD phase.

  • 🗜️ Multi-Format Compression: Added CLP (Compressed Log Processor) support alongside existing gzip. Trace compression is now disabled by default (TRITON_TRACE_COMPRESSION=none). Magic number detection for transparent decompression.

  • 🔧 Bisect Auto-Setup: New --auto-env-setup flag for --llvm-only bisect mode. Automatically clones/updates Triton and LLVM repositories, creates conda environments.

  • 📦 TMA Kernel Support: TensorDescriptor capture and reconstruction for TMA (Tensor Memory Accelerator) kernel reproducers.

Changes by Area

📊 Autotune Analysis


🔬 New diff CLI Subcommand (Beta)

A complete kernel compilation comparison system (~1500 lines) with layered architecture:

  • Data Types (D1): CompilationDiffResult, DiffNote, DiffSummary, IRStats, IRStatsDiff, MetadataDiff, TensorArgDiff, TensorValueDiff
  • Event Matching (D2): match_events_by_index(), match_events_by_kernel(), find_launch_for_compilation()
  • Diff Engine (D3): Main DiffEngine class orchestrating all analyzers
  • Metadata Analyzer (D4): Compares compilation metadata (num_warps, num_stages, etc.)
  • Sourcemap Analyzer (D5): Compares source mappings between IRs
  • Summary Generator (D6): Generates human-readable diff summaries
  • Output Module (D7): ConsolidatedDiffWriter, append_diff_to_file(), format_summary()
  • CLI Entry Point (D8): tritonparseoss diff command with --events, --kernel, --tensor-values flags
  • Tensor Value Analyzer: Numeric tensor comparison with blob mode (full element-wise) and stats mode (min/max/mean/std fallback)
  • Unit Tests: Phase 1 test coverage for core modules

CLI Usage:

# Compare compilations 0 and 1 in single file
tritonparseoss diff trace.ndjson --events 0,1

# Compare with tensor value analysis
tritonparseoss diff trace.ndjson --tensor-values --atol 1e-5 --rtol 1e-3

# List available compilations
tritonparseoss diff trace.ndjson --list

# Filter by kernel name
tritonparseoss diff trace.ndjson --kernel matmul --events 0,1
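
The tensor value analyzer's two modes map onto a tolerance check and a summary fallback. A sketch over plain Python lists rather than real tensor blobs (helper names here are illustrative):

```python
def values_close(a, b, atol=1e-5, rtol=1e-3) -> bool:
    # Element-wise closeness in the usual |a - b| <= atol + rtol * |b|
    # sense, mirroring what --atol / --rtol control.
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


def stats_mode(values) -> dict:
    # Fallback summary when full element blobs aren't available.
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return {"min": min(values), "max": max(values), "mean": mean, "std": var ** 0.5}
```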

Profile-Aware Launch Tracing

  • New environment variable TRITON_TRACE_LAUNCH_WITHIN_PROFILING=1
  • patch_profiler_schedule(): Monkey-patches torch.profiler.schedule
  • enable_launch_tracing() / disable_launch_tracing() API
  • Mutually exclusive with TRITON_TRACE_LAUNCH (validated at init)
  • Unit tests for all three scenarios: no flag, trace all, profile-aware

🗜️ Compression Module

  • Magic number detection: detect_compression() for gzip/zstd/none
  • Transparent reading: open_compressed_file() context manager
  • CLP format support (#326): TRITON_TRACE_COMPRESSION="clp" for Compressed Log Processor format
  • Default change: Compression disabled by default (was gzip)
  • API functions: is_gzip_file(), is_zstd_file(), iter_lines()
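
Magic-number sniffing only needs the first few bytes: gzip streams start with 1f 8b and zstd frames with 28 b5 2f fd. A sketch of the idea — the real detect_compression() signature may differ:

```python
GZIP_MAGIC = b"\x1f\x8b"
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # little-endian 0xFD2FB528


def detect_compression(head: bytes) -> str:
    """Classify a trace file from its leading bytes."""
    head = bytes(head[:4])
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(ZSTD_MAGIC):
        return "zstd"
    return "none"
```

Because plain NDJSON starts with printable characters, uncompressed traces fall through to "none" and are read as-is.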

🔧 Bisect Enhancements

  • EnvironmentManager (#329-#332):
    • Auto-clone Triton and LLVM repositories from GitHub
    • Create/verify conda environments
    • --auto-env-setup CLI flag for --llvm-only mode
    • Status checking and diagnostics
    • Unit tests for all scenarios

📦 Reproducer Enhancements

  • TensorDescriptor support (#344): Captures base, shape, strides, block_shape, padding for TMA kernels
  • preserve_autotune mode (#328): Preserve autotune configs in reproducer scripts
  • Robustness improvements: Complex kernel handling, function reference detection in call arguments (#348)
  • Verbose args print placeholder (#347): Placeholder for verbose argument printing
  • WS kernel fix (#349): Correct num_warps handling for Warp Specialization kernels
  • Better logging (#346): Improved logging when black/isort unavailable

🌐 Website UI Improvements

  • KernelOverview page: New component for autotune analysis visualization (870 lines)
  • WebSocket ArrayBuffer handling (#340): Direct trace ArrayBuffer via iframe messaging
  • URL normalization (#324): Manifold Explorer and tritonparse URL handling
  • Click-to-highlight tip: Added in CodeComparisonView
  • Title navigation fix: TritonParse title returns to home

🏗️ Infrastructure & API

  • SASS parsing refactor: extract_sass_pc_mappings() for PC-offset-keyed source mapping (for CUTracer integration)
  • Rank-less file support (#341, #342): --rank none for parsing files without rank suffix
  • Launch without compilation (#336): Support launch events when compilation was cached
  • log_dir parameter (#337): TritonParseManager(log_dir=...) API
  • Auto-switch log file: When rank becomes available during execution
  • Error message improvements (#339): Better diagnostics and bug fixes
  • Meta copyright headers: Added to all scripts
  • Dependabot prefix: [dependabot] prefix to PR titles
  • Negative line support (#319): prettify_ndjson handles negative line numbers

Compatibility Notes

  • Default Change: Trace compression is now disabled by default. Set TRITON_TRACE_COMPRESSION="gzip" to restore v0.4.0 behavior.
  • New Feature (Beta): The diff subcommand is additive and doesn't affect existing workflows. It is in beta — APIs and output formats may change.
  • New Feature: Autotune analysis events are automatically generated; frontend displays when available.
  • Mutual Exclusivity: TRITON_TRACE_LAUNCH and TRITON_TRACE_LAUNCH_WITHIN_PROFILING cannot both be set.

Upgrade Guidance

  1. Use diff for kernel comparison:

    # Basic diff
    tritonparseoss diff trace.ndjson --events 0,1
    
    # With tensor value comparison
    tritonparseoss diff trace.ndjson --tensor-values --kernel matmul
  2. Enable profile-aware tracing:

    TRITON_TRACE_LAUNCH_WITHIN_PROFILING=1 python train.py
  3. Use CLP compression (if available):

    TRITON_TRACE_COMPRESSION="clp" python train.py
  4. Bisect with auto-setup:

    tritonparseoss bisect --llvm-only --auto-env-setup \
        --triton-dir ~/oss-triton \
        --good-llvm abc123 --bad-llvm def456 \
        --test-script test.py
  5. TMA kernel reproducers: Now work automatically when TensorDescriptor arguments are present.

TritonParse v0.4.0 Release 🎉

22 Jan 18:07


TritonParse Release Notes v0.4.0 (115 commits)

  • Date range: 2025-12-26 — 2026-01-21
  • Scope: Major feature release — New bisect CLI subcommand for automated Triton/LLVM regression bisection, SASS source mapping support, BlockPingpong IR analysis, advanced filter syntax, and significant infrastructure improvements.

Highlights

  • 🔍 New bisect CLI Subcommand: Complete regression bisection system for Triton and LLVM. Automatically find culprit commits with git bisect integration, LLVM bump detection, commit pair testing, and Rich TUI real-time progress display. Supports resumable workflows and multiple operation modes.
  • 📊 SASS Source Mapping: Full SASS (NVIDIA assembly) source mapping support with fuzzy matching. Enables bidirectional mapping between SASS and other IR types (TTIR, TTGIR, PTX) in the website UI.
  • 🔬 BlockPingpong Detection: New IR analysis capability to detect and categorize block pingpong scheduling patterns in TTGIR, with color-coded visualization in the website UI.
  • 📦 Standalone Reproducer: New --embed-context flag embeds JSON context directly into generated Python scripts, creating fully self-contained single-file reproducers for easy sharing and bug reports.
  • 🎛️ Advanced Filter Syntax: Enhanced --args-list filtering with support for nested properties (C_ptr.dtype), array indexing (C_ptr.shape[0]), and list matching (C_ptr.shape=[3024, 10752]).
  • 🏗️ Infrastructure Modernization: Parse module refactored into a dedicated subdirectory, unified logging system, centralized SVG icons, test directory restructuring, and ESLint integration for the website.
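
The advanced filter syntax resolves dotted paths with optional array indices against a kernel argument record. A sketch of the parsing — the function name and argument layout are illustrative, not the tritonparse internals:

```python
import re


def resolve_path(obj, expr: str):
    """Resolve a filter path such as 'C_ptr.dtype' or 'C_ptr.shape[0]'
    against nested dicts/lists."""
    for part in expr.split("."):
        m = re.fullmatch(r"(\w+)\[(\d+)\]", part)
        if m:  # array indexing, e.g. shape[0]
            obj = obj[m.group(1)][int(m.group(2))]
        else:  # nested property access
            obj = obj[part]
    return obj


# Hypothetical captured argument record.
arg = {"C_ptr": {"dtype": "fp16", "shape": [3024, 10752]}}
```

A list match like C_ptr.shape=[3024, 10752] then reduces to resolving the path and comparing the whole resolved value.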

Changes by Area

🔍 New bisect CLI Subcommand

A complete regression bisection system spanning ~6,000 lines of code across 55+ PRs, organized in 7 architectural layers.

  • Operation modes (PR-43 ~ PR-52):

    • tritonparseoss bisect --good <commit> --bad <commit> - Triton-only bisect
    • --llvm-only - Direct LLVM commit bisection
    • --pair-test - Test (Triton, LLVM) commit pairs from CSV
    • --commits-csv - Full 4-phase workflow (Triton bisect → LLVM bump detection → pair test → LLVM bisect)
    • --resume / --status - Resume interrupted bisect or check status
  • Core bisector architecture (PR-15 ~ PR-21):

    • BaseBisector - Abstract base class with template method pattern
    • TritonBisector - Triton commit bisection with automatic build and test
    • LLVMBisector - LLVM commit bisection with Triton rebuild
    • Commit validation and correct bisect range detection
  • Commit detection and pair testing (PR-22 ~ PR-27):

    • CommitDetector - Automatically detects LLVM version bump commits
    • LLVMBumpInfo - Captures old/new LLVM hash information
    • PairTester - CSV-driven (Triton, LLVM) commit pair testing
    • LLVM range filtering for efficient pair selection
  • State management (PR-28 ~ PR-31):

    • BisectPhase enum: TRITON_BISECT, TYPE_CHECK, PAIR_TEST, LLVM_BISECT, COMPLETED, FAILED
    • BisectState dataclass with JSON serialization
    • StateManager for persistent state with auto-resume support
    • Automatic state file discovery (find_latest_state())
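The serialization pattern behind BisectState can be sketched with the standard library. The phase names below come from the release notes; the concrete fields and file layout are illustrative assumptions:

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from enum import Enum
from pathlib import Path


class BisectPhase(Enum):
    TRITON_BISECT = "triton_bisect"
    TYPE_CHECK = "type_check"
    PAIR_TEST = "pair_test"
    LLVM_BISECT = "llvm_bisect"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class BisectState:
    phase: BisectPhase = BisectPhase.TRITON_BISECT
    good_commit: str = ""
    bad_commit: str = ""
    tested_commits: list = field(default_factory=list)

    def to_json(self) -> str:
        d = asdict(self)
        d["phase"] = self.phase.value  # Enum members are not JSON-serializable directly
        return json.dumps(d)

    @classmethod
    def from_json(cls, text: str) -> "BisectState":
        d = json.loads(text)
        d["phase"] = BisectPhase(d["phase"])
        return cls(**d)


# Round-trip through a state file, as an auto-resume workflow would.
state = BisectState(phase=BisectPhase.PAIR_TEST, good_commit="abc123", bad_commit="def456")
state_file = Path(tempfile.mkdtemp()) / "bisect_state.json"
state_file.write_text(state.to_json())
resumed = BisectState.from_json(state_file.read_text())
print(resumed.phase is BisectPhase.PAIR_TEST)  # True
```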
  • Rich TUI interface (PR-32 ~ PR-42):

    • BisectUI - Split-screen layout with progress and output panels
    • Real-time progress updates with phase, commit, and step information
    • Graceful fallback to plain text when Rich unavailable
    • print_final_summary() - Beautiful summary with GitHub links
  • Shell scripts (PR-06 ~ PR-13):

    • bisect_triton.sh - Triton build and test script for git bisect
    • bisect_llvm.sh - LLVM + Triton build with COMPAT_MODE support
    • test_commit_pairs.sh - Sequential pair testing with CSV support
    • scripts/__init__.py - Script path utilities
  • Execution infrastructure (PR-01 ~ PR-05, PR-14):

    • ShellExecutor - Blocking and streaming command execution
    • CommandResult dataclass with duration tracking
    • BisectLogger - Dual logging (file + TUI callback)
    • run_git_bisect_sequence() - Complete git bisect workflow
    • uv package manager support via config.py (PR-54)
    • Clean build environment before each bisect step (PR-55)
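A minimal sketch of the blocking-executor pattern described above, using only the standard library; the real ShellExecutor also supports streaming output and TUI logging callbacks:

```python
import subprocess
import time
from dataclasses import dataclass


@dataclass
class CommandResult:
    command: str
    returncode: int
    stdout: str
    stderr: str
    duration_s: float  # wall-clock duration tracking


def run_blocking(command: str, timeout: float = 600.0) -> CommandResult:
    """Blocking execution with duration tracking (streaming mode omitted)."""
    start = time.monotonic()
    proc = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return CommandResult(
        command=command,
        returncode=proc.returncode,
        stdout=proc.stdout,
        stderr=proc.stderr,
        duration_s=time.monotonic() - start,
    )


result = run_blocking("echo bisect-step")
print(result.returncode, result.stdout.strip())
```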
  • Unit tests (Test-PR-01 ~ Test-PR-03):

    • Tests for state.py, commit_detector.py, pair_tester.py
    • Tests for executor.py and logger.py (Layer 0)

📊 SASS Source Mapping Support

  • Fuzzy matching for SASS (commit 762844e):

    • New extract_sass_mappings() function in ir_parser.py
    • ignore_column parameter for fuzzy matching (SASS lacks column info)
    • Automatic fuzzy matching when source or target IR is "sass"
    • SASS comment line mapping (//## File "/path", line N)
    • Skip .nv_debug_ptx_txt debug file references
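The SASS comment-line format quoted above can be parsed with a small regex. This is an illustrative sketch, not the actual extract_sass_mappings() implementation; note the mapping carries no column information, which is why fuzzy matching ignores columns:

```python
import re

# Matches nvdisasm line-info comments of the form: //## File "/path/to/file.py", line 42
SASS_LINE_RE = re.compile(r'//## File "(?P<path>[^"]+)", line (?P<line>\d+)')


def extract_sass_line_mappings(sass_text: str) -> list[tuple[str, int]]:
    mappings = []
    for raw in sass_text.splitlines():
        m = SASS_LINE_RE.search(raw)
        if not m:
            continue
        path = m.group("path")
        if ".nv_debug_ptx_txt" in path:
            continue  # skip synthetic PTX debug-file references
        mappings.append((path, int(m.group("line"))))
    return mappings


sample = """\
        //## File "/home/user/kernel.py", line 17
        IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28]
        //## File "foo.nv_debug_ptx_txt", line 3
"""
print(extract_sass_line_mappings(sample))  # [('/home/user/kernel.py', 17)]
```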
  • Website UI integration (#249):

    • SASS code panel support in IR Code View
    • Bidirectional highlighting between SASS and other IRs
    • Updated default trace with SASS code (commit 1b2d6a9)

🔬 BlockPingpong Detection

  • IR analysis enhancement (commits 50deca4, fe3092f, 0426510, 2dc0eac):
    • New BlockPingpong pattern detection in ir_analysis.py (~257 lines)
    • Automatic categorization of ping-pong scheduling patterns
    • Pattern matching descriptions for each category
    • Color-coded visualization in website UI
    • Dedicated Pingpong section in IR Analysis interface

📦 Reproducer Enhancements

  • Standalone reproducer (#252):

    • New --embed-context CLI flag (default: False)
    • Embeds JSON context directly into Python script
    • Creates fully self-contained single-file reproducer
    • Ideal for sharing, bug reports, and archiving
  • Compile params support (#295):

    • Pass compile parameters to kernel invocation
    • Fixes issue #277
  • Improved identification (#293, #294):

    • line_index added to reproducer filename
    • Metadata comments in generated scripts
  • Bug fixes:

    • Fix reproducer for inductor-generated Triton kernels (commit 430510c)
    • Fix isort reordering issue (#254)

🎛️ Advanced Filter Syntax

  • Nested property filtering (commit 3ee5df5):
    • Dot notation: C_ptr.dtype=torch.bfloat16
    • Array indexing: C_ptr.shape[0]=3024
    • List matching: C_ptr.shape=[3024, 10752]
    • Unified nested dict unwrapping across all value sources
    • Filter kernel launches by tensor metadata (shape, dtype, stride)
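A sketch of how such filter expressions can be resolved against nested launch-argument metadata. Only the supported syntax comes from the release notes; the parsing details below are illustrative:

```python
import ast
import re

# Tokenizes "C_ptr.shape[0]" into name segments and numeric [index] accessors.
TOKEN_RE = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def resolve(args: dict, path: str):
    """Walk dot notation and [index] accessors over a nested args dict."""
    value = args
    for name, index in TOKEN_RE.findall(path):
        value = value[name] if name else value[int(index)]
    return value


def matches(args: dict, expr: str) -> bool:
    path, _, raw = expr.partition("=")
    try:
        expected = ast.literal_eval(raw)  # lists/numbers like [3024, 10752]
    except (ValueError, SyntaxError):
        expected = raw  # bare strings like torch.bfloat16
    return resolve(args, path) == expected


launch_args = {"C_ptr": {"dtype": "torch.bfloat16", "shape": [3024, 10752]}}
print(matches(launch_args, "C_ptr.shape[0]=3024"))         # True
print(matches(launch_args, "C_ptr.dtype=torch.bfloat16"))  # True
print(matches(launch_args, "C_ptr.shape=[3024, 10752]"))   # True
```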

🌐 Website UI Improvements

  • Code panel enhancements:

    • Vertical resize capability for IR Code View panels (#253)
    • Horizontal scroll tip banner (#250)
    • Long kernel name overflow fix (#246)
    • Index prefix added to kernel selector (#279)
  • Code quality improvements (#228 ~ #235):

    • ESLint added to CI workflow (#235)
    • 8 PRs fixing React hooks, TypeScript, and lint errors
    • Fix Python source line highlight clearing (#236)
    • Display Python source line numbers from original file offset (#239)
  • Infrastructure:

    • Dependabot configuration for npm dependencies (#264)
    • Runtime accessibility test in CI (#265)
    • SVG icons centralized using @heroicons/react (#241)
    • Remote URL button fix (commit 7df4b87)
    • Compile-time flag for internal wiki link (commit 03c71e5)

🏗️ Infrastructure & Code Quality

  • Module reorganization:

    • Parse module refactored into tritonparse/parse/ subdirectory (#240)
    • Unified logger modules to tp_logger.py (#242)
    • Hierarchical sub-loggers under "tritonparse" namespace
  • Test infrastructure:

    • Test directory restructuring: tests/cpu/ and tests/gpu/
    • Extract GPU TensorBlob, complex kernels, reproducer E2E tests
    • Extract GPU structured logging + context manager tests
    • Extract CPU tests to dedicated directory
    • CI workflow updated for new test structure
  • Code formatting:

    • Align OSS formatting with internal pyfmt config (#256)
    • Black 25.11.0 style applied (commit b2d12f9)
  • Bug fixes:

    • Kernel selector overflow fix (commit b5c72b8)
    • Substring matching bug in call graph dependency filtering (commit 0ec75af)
    • PAR compatibility in function_extractor (commit 48551a2)
    • ast.unparse() for proper indentation in reproducer extraction (commit 1d8a33d)
    • --kernel-import help message fix (commit 18cf9d8)
    • source_repo_dir support for mapping production file paths (commit a952d99)
    • BisectLogger unique logger names per instance (#251)

📚 Documentation

  • Simplified CHANGELOG.md with links to GitHub releases (#226)
  • Website version bumped to 0.3.2 with dependency updates (#238)

Compatibility notes

  • New Feature: The bisect subcommand is an additive feature that doesn't affect existing workflows.
  • SASS Support: To use SASS source mapping, traces must include SASS IR (enable via enable_sass_dump=True or TRITONPARSE_DUMP_SASS=1).
  • Filter Syntax: The new advanced filter syntax is backward compatible; existing filter expressions continue to work.
  • Test Directory: Tests have been reorganized into tests/cpu/ and tests/gpu/ subdirectories.

Upgrade guidance

  1. Use bisect for regression hunting:

    # Basic Triton bisect
    tritonparseoss bisect --triton-dir /path/to/triton \
        --test-script test.py --good v2.0.0 --bad HEAD
    
    # Full workflow with LLVM bump detection
    tritonparseoss bisect --triton-dir /path/to/triton \
        --test-script test.py --good v2.0.0 --bad HEAD \
        --commits-csv pairs.csv
    
    # Resume interrupted bisect
    tritonparseoss bisect --resume
    
    # Check status
    tritonparseoss bisect --status
  2. Generate standalone reproducers:

    tritonparseoss reproduce trace.ndjson --kernel matmul --embed-context
  3. Use advanced filtering:

    tritonparseoss info trace.ndjson --args-list "C_ptr.shape[0]=302...

TritonParse v0.3.2 Release 🎉

26 Dec 18:12


TritonParse Release Notes v0.3.2 (34 commits)

  • Date range: 2025-11-05 — 2025-12-22
  • Scope: Major feature release - New info CLI subcommand, multi-file call graph analysis for reproducers, unified 0-based indexing, IR extraction tools, and infrastructure improvements.

Highlights

  • 📊 New info CLI Subcommand: Query kernel information from NDJSON trace files without manual parsing. List all kernels with launch counts, view launches for specific kernels, and get fuzzy matching suggestions for kernel names.
  • 🔍 Multi-File Call Graph Analyzer: Advanced AST-based analysis that automatically extracts all transitively-called functions across multiple Python files. Enables self-contained kernel reproducers with all dependencies included.
  • 🎯 Unified 0-Based Indexing: All launch indices throughout the codebase (CLI, website, internal APIs) now use consistent 0-based indexing following Python conventions.
  • ⚡ Enhanced Reproducer: New --kernel and --launch-id arguments eliminate manual line number lookup. AST-based dependency extraction, autotune disabler, and code formatting for generated scripts.
  • 🛠️ IR Extraction Tool: New command-line tool to extract Triton IRs (TTIR, TTGIR, LLIR, PTX) from trace logs with flexible output organization.
  • 🔐 PyPI Trusted Publishing: Migrated from API token authentication to OIDC-based Trusted Publishing for improved security and attestations.

Changes by area

📊 New info CLI Subcommand

  • Core query layer (PR #208):
    • New tritonparse/info/ module for kernel information queries
    • KernelSummary and LaunchInfo dataclasses for structured results
    • list_kernels(): List all kernels with launch counts
    • find_launch_index_by_kernel(): Find line index for a kernel's N-th launch
  • CLI interface (PR #210):
    • tritonparseoss info <trace.ndjson> - List all kernels with launch counts
    • tritonparseoss info <trace.ndjson> --kernel <name> - List launches for specific kernel
    • Auto-parsing: Automatically detects and parses raw logs
    • Fuzzy matching suggestions when kernel not found
    • Performance optimization using launch_diff events when available
  • Additional filtering (commit 8134195):
    • Added --args-list filtering to info command
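Conceptually, the info query layer reduces to counting launch events per kernel in the NDJSON stream. The event field names below are illustrative assumptions, not the exact trace schema:

```python
import json
from collections import Counter


def list_kernels(ndjson_text: str) -> Counter:
    """Count launches per kernel name from NDJSON trace lines.

    The event_type/kernel_name fields here are illustrative, not the real schema.
    """
    counts = Counter()
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("event_type") == "launch":
            counts[event.get("kernel_name", "<unknown>")] += 1
    return counts


trace = "\n".join(
    json.dumps(e)
    for e in [
        {"event_type": "compilation", "kernel_name": "matmul_kernel"},
        {"event_type": "launch", "kernel_name": "matmul_kernel"},
        {"event_type": "launch", "kernel_name": "matmul_kernel"},
        {"event_type": "launch", "kernel_name": "softmax_kernel"},
    ]
)
print(list_kernels(trace))  # Counter({'matmul_kernel': 2, 'softmax_kernel': 1})
```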

🔍 Multi-File Call Graph Analyzer

  • Three-phase implementation (PR #206 Phase 1-3):
    • Phase 1 - ImportResolver: Multi-file call graph analysis foundation
    • Phase 2 - ImportParser: AST-based import statement parsing
    • Phase 3 - MultiFileCallGraphAnalyzer: Complete multi-file traversal with BFS
  • Key features:
    • Automatic extraction of transitively-called functions across Python files
    • Per-file code root tracking (fbcode, Python projects, Git repositories)
    • Graceful fallback for files outside detected roots
    • Integrated into reproducer for self-contained script generation
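The BFS traversal at the heart of the analyzer can be sketched for the single-file case; the real MultiFileCallGraphAnalyzer additionally resolves imports across files and tracks per-file code roots:

```python
import ast
from collections import deque


def transitive_callees(source: str, entry: str) -> set[str]:
    """BFS over a call graph built from one module's AST.

    Single-file simplification: only direct-name calls to module-level
    functions are followed.
    """
    tree = ast.parse(source)
    defs = {n.name: n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    calls = {
        name: {
            node.func.id
            for node in ast.walk(fn)
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
        }
        for name, fn in defs.items()
    }
    seen, queue = {entry}, deque([entry])
    while queue:
        for callee in calls.get(queue.popleft(), ()):
            if callee in defs and callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


src = """
def helper_a(): return 1
def helper_b(): return helper_a()
def unused(): return 0
def kernel(): return helper_b()
"""
print(sorted(transitive_callees(src, "kernel")))  # ['helper_a', 'helper_b', 'kernel']
```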

🎯 Unified 0-Based Indexing

  • Breaking change (PR #211):
    • All launch indices now use 0-based indexing
    • Affects: trace processor, website components (KernelOverview, DiffViewer, StackDiffViewer, ArgumentViewer)
    • CLI --line argument changed to 0-based (PR #205)
  • Rationale:
    • Consistency with Python conventions
    • Alignment with existing info and reproduce commands
    • Simpler code without +1/-1 conversions

Reproducer Enhancements

  • Kernel name lookup (PR #209):
    • New --kernel argument to specify kernel by name instead of line number
    • New --launch-id argument (0-based) to select specific launch
    • Mutual exclusivity with --line argument
    • Example: tritonparseoss reproduce trace.ndjson --kernel matmul_kernel --launch-id 2
  • AST-based dependency extraction (commit 8ad24f6):
    • Automatic extraction of dependent helper functions
    • Call graph analysis for transitive dependencies
    • Self-contained reproducers without manual function hunting
  • Autotune disabler (commit 28486fc):
    • Automatically disable Triton's autotune decorator in generated scripts
    • New utils.py module with _disable_triton_autotune() function
    • Works with both IMPORT and COPY kernel import modes
  • Code formatting (commit 311e016):
    • Generated reproducers are now properly formatted
  • Bug fixes:
    • Fix FileNotFoundError with absolute path templates (commit 458e6e9)
    • Fix kernel signature parsing for return type annotations (commit d24ae1d)
    • Support for Triton dtype parameters (commit 86fa46b)
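One way to realize the autotune disabler is an AST pass that strips the decorator from kernel definitions; the actual _disable_triton_autotune() may work differently (for example, by patching at runtime):

```python
import ast


def strip_autotune(source: str) -> str:
    """Remove @triton.autotune(...) decorators from function definitions.

    Illustrative sketch only; tritonparse's implementation may differ.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            node.decorator_list = [
                d for d in node.decorator_list if "autotune" not in ast.unparse(d)
            ]
    return ast.unparse(tree)


src = """
@triton.autotune(configs=[], key=["M"])
@triton.jit
def matmul_kernel(a_ptr, b_ptr):
    pass
"""
print(strip_autotune(src))  # keeps @triton.jit, drops @triton.autotune(...)
```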

🛠️ IR Extraction Tool

  • New tool (PR #202):
    • tritonparse/tools/extract_irs.py for extracting Triton IRs from trace logs
    • Supports TTIR, TTGIR, LLIR, PTX, and other IR formats
    • Flexible output: flat or by-kernel directory structure
    • Comprehensive documentation in tritonparse/tools/readme.md
  • Logger fix:
    • Fixed NameError: 'logger' is not defined in generated reproducers
    • Added proper logging initialization to templates

🔐 Infrastructure & CI/CD

  • PyPI Trusted Publishing (PR #219):
    • Migrated from API token to OIDC authentication
    • Enabled package attestations for provenance
    • No secrets management required
  • On-Demand Nightly Publishing (PR #216):
    • Flexible PyPI publishing workflow
  • Website build CI (PR #224):
    • Added CI test for website builds
    • Updated frontend dependencies
  • Usage tracking (commit 89913ff):
    • Extended usage_report_logger to track all subcommands and API calls
    • Entry function detection via call stack traversal
    • Added skip_logger parameter to prevent duplicate logging

🔧 Bug Fixes & Improvements

  • CUDA Graph capture fix (PR #197):
    • Fixed crash during CUDA graph capture in tensor argument extraction
    • Detects capture mode and skips problematic operations
    • Fixes compatibility with triton.testing.do_bench_cudagraph
  • Gzip support (PR #207):
    • Added gzip support for load_ndjson() function
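A gzip-aware loader of this kind can be sketched by sniffing the gzip magic bytes; whether the real load_ndjson() sniffs bytes or checks the file extension is not specified here:

```python
import gzip
import json
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"


def load_ndjson(path: str) -> list[dict]:
    """Load NDJSON from plain or gzip-compressed files.

    Sniffs the gzip magic bytes rather than trusting the file extension.
    """
    raw = Path(path).read_bytes()
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return [json.loads(line) for line in raw.decode().splitlines() if line.strip()]


tmp = Path(tempfile.mkdtemp()) / "trace.ndjson.gz"
tmp.write_bytes(gzip.compress(b'{"event_type": "launch"}\n'))
print(load_ndjson(str(tmp)))  # [{'event_type': 'launch'}]
```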
  • Compilation metadata (PR #198):
    • Sort compilation metadata attributes alphabetically
  • Import formatting (commit 86a2229):
    • Format imports following Python style guide
  • Debug message (commit a205e50):
    • Added message for debugging when BlockPingpong exits early

📚 Documentation

  • Wiki pages (PR #223):
    • Added new wiki pages to documentation table
  • Dependency cleanup (PR #225):
    • Removed unnecessary npm overrides for prismjs and dompurify

Compatibility notes

  • Breaking Change: All launch indices are now 0-based. Website displays and CLI arguments have been updated. If you have scripts relying on 1-based line numbers from --line, update them to use 0-based indices.
  • New Features: The info subcommand and --kernel/--launch-id reproducer options are additive and don't break existing workflows.
  • Reproducer: Generated scripts now include autotune disabler and dependent functions automatically. Templates have been updated with proper logger initialization.

Upgrade guidance

  1. Update index references: Change any 1-based line number references to 0-based indices.
  2. Use info command: Replace manual trace file inspection with tritonparseoss info <trace.ndjson> to list kernels.
  3. Use kernel name lookup: Instead of --line N, use --kernel <name> --launch-id <id> for more intuitive reproducer generation.
  4. Extract IRs: Use new python -m tritonparse.tools.extract_irs <trace.ndjson> for IR extraction tasks.

TritonParse v0.3.1 Release 🎉

06 Nov 03:13


TritonParse Release Notes (last 24 commits)

  • Date range: 2025-10-14 — 2025-11-03
  • Scope: IR Analysis enhancements (beta), Reproducer template extensions, code viewer improvements, bug fixes.

Highlights

  • 📊 IR Analysis (Beta): New analysis capabilities for visualizing Software Pipelining (SWP), BufferOps statistics, and loop schedules in Triton IR.
  • 🏷️ Variable Location Tracking: Complete location alias tracking system for mapping IR locations back to source code with frontend visualization.
  • 🔧 TritonBench Template: New reproducer template for easy TritonBench integration and kernel benchmarking.
  • 🎨 Code Viewer Enhancements: Full Python source extraction, function highlighting, and performance optimizations.
  • 🔄 Reproducer Refactoring: AST-based function extraction eliminates code duplication and simplifies template maintenance.

Changes by area

📊 IR Analysis (Beta)

  • Software Pipelining (SWP) visualization (PR #189):
    • Analyzes inner scf.for loops and identifies prologue, loop_body, and epilogue stages
    • Tracks tt.load and tt.dot operations through TTIR → TTGIR → Python source mappings
    • Frontend displays simplified source code with SWP stage information
    • Limitations: Does not support Warp Specialization or Blackwell operators yet
  • BufferOps backend information (PR #181):
    • Statistical analysis of buffer operations (tt.load/store, amdgpu.buffer_load/store, global_load/store) at TTGIR and AMDGCN levels
    • Useful for AMD GPU backend optimization analysis
  • Web frontend IR Analysis page (PR #184):
    • New dedicated page at /ir-analysis route with integrated display for loop schedules and BufferOps statistics

🏷️ Variable Location Tracking

Complete three-part implementation (PR #186, #187, #188):

  • Fixed #loc storage key conflict in IR parser
  • Added location alias parsing support in ir_parser.py and trace_processor.py
  • Frontend visualization with CSS styling and interactive location display in Code Viewer

🔄 Reproducer System

  • TritonBench template support (commit 3493ac8):
    • New template: tritonparse/reproducer/templates/tritonbench.py
    • CLI option: --template tritonbench for TritonBench-compatible reproducers
    • Integrates with TritonBench's BenchmarkOperator and benchmark harness
  • AST-based refactoring (PR #178):
    • New module: tritonparse/reproducer/function_extractor.py using Python AST
    • Simplified example.py template from ~370 lines to ~20 lines
  • Bug fixes:
    • Fixed 1-based to 0-based line number conversion (PR #185)
    • Corrected output key typo: repo_* → repro_* (PR #175)
    • CUDA device normalization to cuda:0 format (PR #177)

📝 Callsite Location Support

  • TTIR/TTGIR callsite location (PR #190):
    • Extended IR parser to extract callsite location information
    • Better debugging with call graph information and test coverage

💻 Code Viewer & Frontend

  • Full Python source extraction (commit 2976887):
    • Enhanced structured_logging.py to extract complete Python source files
  • Full file display with function highlighting (commit 220d5a4):
    • CodeViewer now supports displaying entire source files with function-level highlighting
  • CodeComparisonView performance optimization (commit c17e584):
    • Significant rendering performance improvements for large files
    • Reduced re-renders and improved memory efficiency

🌐 Website & Maintenance

  • Dependency updates (PR #179): Added automation script website/scripts/update_deps.sh
  • Copyright updates (PR #183): Updated copyright headers across source files

Compatibility notes

  • No breaking changes: All updates are backward compatible with v0.3.0.
  • IR Analysis (Beta): New optional feature accessible through web UI.
  • TritonBench template: Optional, does not impact existing reproducer generation.

Upgrade guidance

  1. Using IR Analysis (Beta):

    • Open web UI and navigate to IR Analysis page after parsing
    • View SWP stage information (prologue/loop_body/epilogue) and BufferOps statistics
    • Note: Beta feature with some limitations on advanced pipelining patterns
  2. Generating TritonBench reproducers:

    tritonparseoss reproduce trace.ndjson.gz --line <N> --template tritonbench --out-dir <output>
  3. Code viewer enhancements: Automatically enabled with full source display and function highlighting

TritonParse v0.3.0 Release 🎉

15 Oct 03:34


TritonParse Release Notes (last 44 commits)

  • Date range: 2025-09-19 — 2025-10-14
  • Scope: Major feature release - Reproducer system, tensor storage, SASS support, enhanced context manager, CLI improvements.

Highlights

  • 🔄 Reproducer System (Complete): Full-featured standalone kernel script generation with template support, tensor reconstruction, and multiple import modes. Extract any traced kernel into a self-contained Python script for debugging, testing, and sharing.
  • 💾 TensorBlobManager: Production-ready content-addressed tensor storage with automatic compression, deduplication, quota management, and efficient disk usage. Enables high-fidelity kernel reproduction with actual tensor data.
  • 🔧 SASS Disassembly Support: Optional NVIDIA SASS disassembly during compilation tracing for low-level debugging and performance analysis. Toggle via enable_sass_dump parameter or TRITONPARSE_DUMP_SASS environment variable.
  • 🎯 Enhanced Context Manager: Configurable TritonParseManager context manager with support for trace launch control, inductor compilation splitting, and flexible parsing parameters.
  • ⚡ CLI Modernization: Refactored to subcommand structure (tritonparseoss parse, tritonparseoss reproduce) with unified entry point and improved argument handling.
  • 📊 Auto-enable Inductor Launch Tracing: Automatic detection and tracing of PyTorch Inductor-compiled kernels without manual configuration.
  • 🌐 Website Improvements: Light mode color scheme, improved stack display in Launch Analysis, and better file diff navigation.

Changes by area

🔄 Reproducer System

  • Complete reproducer infrastructure (PR #117-127):
    • CLI subcommand structure: tritonparse reproduce <ndjson_file> [options]
    • NDJSON ingestion layer with IR preservation
    • Context bundle system for kernel metadata and parameters
    • Standardized output paths: repro_output/<kernel_name>/repro_<timestamp>.py
    • Template support with placeholder system for custom generation
    • Example templates for tensor loading and kernel invocation
    • Dynamic import generation for kernel dependencies
    • Kernel signature parsing and integration
    • Kernel invocation snippet generation with grid/block configuration
  • Kernel import modes (PR #165, #166):
    • --kernel-import direct: Import kernel from source file
    • --kernel-import override-ttir: Override and inject TTIR for advanced debugging
    • Flexible kernel loading strategies for different debugging workflows
  • Enhanced tensor handling (PR #141):
    • Improved tensor metadata logging (shape, dtype, stride, storage offset, device)
    • Better tensor reconstruction quality in generated reproducers
    • Support for non-contiguous tensors (commit 12f1d1b)
  • Extensible placeholder system (PR #149):
    • Refactored placeholder replacement with class-based design
    • Support for: {{KERNEL_IMPORT_PLACEHOLDER}}, {{KERNEL_INVOCATION_PLACEHOLDER}}, {{KERNEL_SYSPATH_PLACEHOLDER}}, {{JSON_FILE_NAME_PLACEHOLDER}}
    • Easy extension for future template needs
  • Documentation: Comprehensive reproducer section in README (PR #161) and Usage Guide in Wiki
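The class-based placeholder design can be sketched as follows. The placeholder tokens are the ones listed above; the render callables and context keys are illustrative:

```python
class Placeholder:
    """One named template placeholder with a pluggable renderer."""

    def __init__(self, token: str, render):
        self.token = token
        self.render = render  # callable(context) -> str

    def apply(self, template: str, context: dict) -> str:
        return template.replace(self.token, self.render(context))


# Two of the tokens from the release notes, with hypothetical renderers.
PLACEHOLDERS = [
    Placeholder("{{KERNEL_IMPORT_PLACEHOLDER}}",
                lambda ctx: f"from {ctx['module']} import {ctx['kernel']}"),
    Placeholder("{{JSON_FILE_NAME_PLACEHOLDER}}",
                lambda ctx: repr(ctx["json_file"])),
]


def render_template(template: str, context: dict) -> str:
    for ph in PLACEHOLDERS:
        template = ph.apply(template, context)
    return template


template = "{{KERNEL_IMPORT_PLACEHOLDER}}\ncontext_file = {{JSON_FILE_NAME_PLACEHOLDER}}\n"
print(render_template(template, {"module": "kernels", "kernel": "matmul_kernel",
                                 "json_file": "repro_context.json"}))
```

Adding a new placeholder is then a matter of appending another Placeholder instance, which is the extensibility the refactor was after.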

💾 TensorBlobManager & Storage

  • Production-ready blob storage (PR #156):
    • Content-addressed storage using BLAKE2b hashing
    • Automatic gzip compression for large tensors (>1MB)
    • Two-level directory structure (xx/hash.bin.gz) to avoid filesystem limits
    • Automatic deduplication: identical tensors stored only once
    • Storage quota enforcement (default: 100GB)
    • Per-tensor size limit (default: 10GB) to prevent OOM
    • Real-time statistics: saved count, dedup hits, compression ratio
    • Graceful degradation with warning logs when quota exceeded
  • Compression support (PR #157):
    • Configurable compression level (default: 4)
    • Atomic writes using temporary files + rename for safety
    • Hash verification for data integrity
  • Comprehensive testing (PR #162):
    • Unit tests for compression, deduplication, quota management
    • Edge case handling and cleanup verification
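The storage mechanics described above (BLAKE2b content addressing, two-level directories, gzip for large payloads, atomic temp-file writes) can be sketched with the standard library; quota enforcement and statistics are omitted, and this is not the actual TensorBlobManager API:

```python
import gzip
import hashlib
import tempfile
from pathlib import Path

COMPRESS_THRESHOLD = 1 << 20  # gzip payloads larger than 1 MB


def save_blob(root: Path, payload: bytes) -> Path:
    """Content-addressed save: BLAKE2b key, two-level dirs, free deduplication."""
    digest = hashlib.blake2b(payload).hexdigest()
    subdir = root / digest[:2]               # xx/hash layout avoids huge directories
    subdir.mkdir(parents=True, exist_ok=True)
    compress = len(payload) > COMPRESS_THRESHOLD
    dest = subdir / (digest + (".bin.gz" if compress else ".bin"))
    if dest.exists():
        return dest                          # dedup hit: identical content, same path
    tmp = dest.with_suffix(dest.suffix + ".tmp")
    tmp.write_bytes(gzip.compress(payload, compresslevel=4) if compress else payload)
    tmp.rename(dest)                         # atomic publish via temp file + rename
    return dest


root = Path(tempfile.mkdtemp())
a = save_blob(root, b"tensor-bytes")
b = save_blob(root, b"tensor-bytes")         # second save reuses the first blob
print(a == b)  # True
```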

🔧 SASS Disassembly

  • SASS extraction support (PR #137):
    • New tool: tritonparse/tools/disasm.py for CUBIN disassembly
    • Integration into structured logging behind opt-in flag
    • Uses nvdisasm -c -gp -g -gi for detailed disassembly
    • Parses output to find function blocks with preserved labels and source mapping
  • Configuration:
    • Environment variable: TRITONPARSE_DUMP_SASS=1
    • API parameter: enable_sass_dump=True in structured_logging.init()
    • API parameter takes precedence over environment variable
  • Robustness:
    • Error handling for subprocess failures, missing nvdisasm, and generic exceptions
    • Writes marker messages instead of failing the trace
    • Requires NVIDIA CUDA Binary Utilities (nvdisasm)
  • CUDA testing (PR #138):
    • Strengthened tests to validate SASS extraction and persistence
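The precedence rule above (API parameter wins over the environment variable) reduces to a small resolver; the function below is illustrative, not the actual structured_logging code:

```python
import os


def resolve_sass_dump(enable_sass_dump=None) -> bool:
    """API parameter wins over TRITONPARSE_DUMP_SASS when both are given."""
    if enable_sass_dump is not None:
        return bool(enable_sass_dump)
    return os.environ.get("TRITONPARSE_DUMP_SASS", "0") == "1"


os.environ["TRITONPARSE_DUMP_SASS"] = "1"
print(resolve_sass_dump())        # True (env var only)
print(resolve_sass_dump(False))   # False (explicit parameter overrides env)
```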

🎯 Context Manager & API

  • Enhanced context manager (PR #144, #159):
    • Added __init__ method with configurable parameters:
      • enable_trace_launch: Control trace launch logging
      • split_inductor_compilations: Control inductor compilation splitting
      • **parse_kwargs: Additional arguments for unified_parse
    • Updated __exit__ to pass parameters through to parsing pipeline
    • More flexible for different use cases and workflows
  • Split inductor compilations control:
    • Parameter threading through: unified_parse() → oss_run() → parse_logs() → parse_single_file()
    • Renamed from split_by_frame_id_and_compile_id to split_inductor_compilations for clarity
    • Default True: splits by frame_id, frame_compile_id, attempt_id, compiled_autograd_id
    • When False: groups all inductor compilations together
    • Follows tlparse's convention
  • Unit tests (commit a5338ce):
    • Tests for enhanced context manager behavior
    • Validation of split inductor compilation modes

CLI & Entry Points

  • Subcommand structure (PR #117):
    • Refactored from single-command to modern subcommand architecture
    • tritonparse parse <source> [options] - Run structured log parser
    • tritonparse reproduce <ndjson_file> [options] - Generate reproducers
    • Breaking change: old python run.py <source> no longer works
    • Extract parser flags into tritonparse.utils._add_parse_args()
    • Remove unified_parse_from_cli (programmatic unified_parse() remains)
  • Unified entry point (PR #133):
    • Added proper CLI entry point in package configuration
    • Unified argument handling across commands
  • CLI entry point fix (PR #154):
    • Fixed ModuleNotFoundError for tritonparse CLI entry point
    • Improved package installation and command availability

📊 Logging & Tracing

  • Auto-enable Inductor Launch Tracing (PR #142):
    • Automatically detect and trace PyTorch Inductor-compiled kernels
    • No manual configuration required for Inductor workflows
    • Seamless integration with existing tracing infrastructure
  • Kernel source path output (commit 03bc1e1):
    • Output kernel_src_path in trace metadata for better debugging
  • NDJSON prettifier improvements (PR #135):
    • Renamed and inverted flag to default-filter IRs
    • More intuitive filtering behavior
  • Debug flag deprecation (PR #132):
    • Removed unused debugging flags
    • Cleaner configuration surface

🌐 Website & UI

  • Upgraded to Tailwind CSS v4 (commit 6c42d8a):
    • Migrated from PostCSS plugin to @tailwindcss/vite for improved performance
    • Updated CSS import syntax from @tailwind directives to @import "tailwindcss"
    • Removed tailwind.config.js and postcss.config.js (now CSS-based configuration)
    • Updated shadow class naming to v4 convention (shadow → shadow-sm)
    • Cleaned up global CSS to prevent interference with Tailwind utility classes
  • Upgraded all frontend dependencies:
    • Vite: 6.3.5 → 7.1.10
    • React ecosystem: Updated to latest versions (React 19+)
    • TypeScript: 5.7.2 → 5.7.3
    • Added @types/node for Node.js type definitions
    • Fixed dompurify security vulnerability (3.1.7 → 3.3.0) via npm overrides
  • Light mode color scheme (PR #139):
    • Updated index.css to support only light mode
    • Consistent, professional appearance
  • Improved stack display (PR #151):
    • Better stack trace visualization in Launch Analysis
    • Clearer debugging information
  • Documentation cleanup (PR #172):
    • Removed redundant docs directory and screenshots
    • Streamlined repository structure

🔧 Bug Fixes & Maintenance

  • General bug fixes (PR #153):
    • Multiple stability and reliability improvements
    • Better error handling throughout codebase
  • Deserialization fix (commit d4d7a20):
    • Fixed unhandled types in deserialization
    • More robust data loading
  • README improvements (PR #158, #164):
    • Refactored and cleaned up README
    • Fixed command typos in reproducer generation examples
    • Clearer installation and usage instructions
  • Test cleanup (PR #160):
    • Removed deprecated test for triton_kernels Tensor functionality
    • Updated test suite for current codebase

Compatibility notes

  • Breaking Change: CLI now uses subcommand structure. Old usage python run.py <source> must be updated to tritonparse parse <source> or python run.py parse <source>.
  • New Dependencies: SASS disassembly requires NVIDIA CUDA Binary Utilities (nvdisasm). This is optional and only needed if enable_sass_dump=True.
  • Storage: TensorBlobManager introduces new blob storage directory structure. Default quota is 100GB; configure via TensorBlobManager initialization if needed.
  • Context Manager API: Enhanced with new parameters. Fully backward compatible with sensible defaults.

...


TritonParse v0.2.3 Release 🎉

19 Sep 20:26


TritonParse Release Notes (last 15 commits)

  • Date range: 2025-09-13 — 2025-09-18
  • Scope: Website UI/UX, core library, CI/CD & packaging, documentation & testing.

Highlights

  • Website File Diff tooling: Introduced a new Diff Comparison view and File Diff page, preserved diff sessions across navigation, integrated Monaco editor, added preview mode, and shipped a round of UI polish with a URL redirect fix for File Diff navigation.
  • Kernel Overview: Added a tiled kernel view toggle to improve dense overviews.
  • Core: Added lazy-import support for Triton repo triton_kernels custom types, attribution check for torch._utils_internal, and safer file mapping cleanup in the log parser.
  • CI/Packaging: Refactored dependencies in pyproject.toml, removed a legacy Triton install script, and updated GitHub Actions workflows.
  • Docs & tests: Improved README guidance; added tests and example outputs; minor UI bug fix in CopyCodeButton SVG attributes.

Changes by area

  • Website UI/UX

    • Introduce DiffComparisonView and FileDiffView; maintain diff session state; integrate Monaco editor; preview mode; UI polish and navigation fixes.
    • Add tiled kernel view toggle in KernelOverview.
  • Core library

    • Lazy-import support for triton_kernels custom types; extend tensor handling in tests.
    • Add attribution check for torch._utils_internal.
    • Refactor file mapping cleanup in parse_logs.
  • CI/CD & packaging

    • Refactor dependencies in pyproject.toml; remove .ci/install-triton-pip.sh.
    • Update GitHub Actions workflows; add helper for triton_kernels in CI.
  • Docs & testing

    • Clarify tool purpose and installation in README.md.
    • Add tests and sample outputs; small UI component fixes.

Compatibility notes

  • No breaking changes expected. triton_kernels support is optional via lazy import.

Upgrade guidance

  • Reinstall website dependencies if developing the UI to pick up the Monaco editor.

TritonParse v0.2.0 Release 🎉

11 Sep 23:08


TritonParse Release Notes (last 27 commits)

  • Date range: 2025-07-25 — 2025-09-11
  • Scope: Core library, website UI/UX, performance & scalability, CI/CD & packaging, documentation & maintenance.

Highlights

  • PyPI package: TritonParse is now on PyPI and can be installed with pip install tritonparse!
  • Website usability: Drag-and-drop to open logs; one-click copy in code viewers; sticky, compact kernel selector; footer shows app version, localized build date, and Git short SHA; tensor arguments in Launch Analysis now display concise summaries with expandable details.
  • Large-file parsing: Streaming NDJSON parsing and robust gzip handling significantly reduce memory usage and improve stability for files >100 MB.
  • Core & integrations: Persist Inductor kernel config into inductor_metadata and pass to JIT hooks; ensure Inductor path invokes jit_post_compile_hook; new init_with_env for environment-based initialization; move compilation timings into metadata for automatic frontend rendering.
  • Releases & versioning: Adopt setuptools-scm dynamic versioning; add Nightly PyPI publishing; enable stable publishing on tag push; fix nightly version potentially being older than stable; correct packaging license metadata.
  • CI stability: Ubuntu 24.04 compatibility; improved CUDA/cuDNN setup and detection; parallelize jobs; add parallel CI for pip-installed Triton; better error visibility in install scripts; upgrade libstdc++.

Changes by area

  • Core library

    • Save Inductor kernel params to inductor_metadata and forward to JIT hooks.
    • Manually invoke jit_post_compile_hook in the Inductor Triton compile path.
    • Add init_with_env that reads TRITON_TRACE_FOLDER and TRITON_TRACE_LAUNCH.
    • Move compilation times into metadata so the frontend auto-renders them.
    • Use cached source in compile listener for stability.
    • Refactor source-mapping pipeline into modular units for maintainability.
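
The environment-based initialization above can be sketched roughly as follows. Only the variable names TRITON_TRACE_FOLDER and TRITON_TRACE_LAUNCH come from these notes; the function body is illustrative, not the actual tritonparse implementation.

```python
import os

def init_with_env():
    """Hypothetical sketch of environment-driven tracing setup.

    Reads TRITON_TRACE_FOLDER (where traces go) and TRITON_TRACE_LAUNCH
    (whether launch tracing is enabled). Returns None when tracing is off.
    """
    trace_folder = os.environ.get("TRITON_TRACE_FOLDER")
    if trace_folder is None:
        return None  # tracing disabled when no folder is configured
    trace_launch = os.environ.get("TRITON_TRACE_LAUNCH", "0").lower() in ("1", "true")
    return {"folder": trace_folder, "enable_launch_trace": trace_launch}
```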
  • Website UI/UX

    • Drag-and-drop to open supported log files.
    • Copy button in code viewer panels.
    • Sticky, collapsible, compact kernel selector in Kernel Overview; vertically resizable compilation stack trace.
    • Launch Analysis: tensor args show concise summaries with expandable details.
    • Footer displays version, localized build date, and Git short SHA.
    • Streaming NDJSON parsing and improved error handling for large logs.
  • Performance & scalability

    • Use streaming path for files >100 MB to reduce memory peaks and improve robustness.
  • CI/CD & packaging

    • Enable setuptools-scm and nightly PyPI publishing.
    • Publish stable releases on tag push; improve version computation and tag detection.
    • Fix nightly version possibly lagging behind stable; add clear error on missing tags.
    • Add parallel CI for pip-installed Triton; recommend pip installation in docs.
    • Improve Ubuntu 24.04 setup, CUDA/cuDNN handling, and job parallelism.
    • Increase error visibility in install scripts and upgrade libstdc++.
    • Define lower bounds for prerequisites in pyproject.toml.
  • Docs & maintenance

    • Move repository to meta-pytorch org; update links and guidance; add AI assistant context.
    • Update/restore CONTRIBUTING docs to avoid breaking downstream consumers.
  • Testing

    • Preserve test outputs when TEST_KEEP_OUTPUT=1 to aid debugging.
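
The TEST_KEEP_OUTPUT=1 behavior follows a common pattern, sketched below with a hypothetical helper (the name and structure are illustrative, not the actual test harness): clean up the output directory by default, but keep it for inspection when the variable is set.

```python
import os
import shutil
import tempfile

def run_with_output_dir(test_fn):
    """Run test_fn with a fresh output directory.

    Hypothetical sketch of the TEST_KEEP_OUTPUT pattern: the directory
    is removed after the test unless TEST_KEEP_OUTPUT=1 is set, in which
    case it is preserved to aid debugging.
    """
    out_dir = tempfile.mkdtemp(prefix="tritonparse_test_")
    try:
        return test_fn(out_dir)
    finally:
        if os.environ.get("TEST_KEEP_OUTPUT") == "1":
            print(f"kept test output at {out_dir}")
        else:
            shutil.rmtree(out_dir, ignore_errors=True)
```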

Compatibility notes

  • Versioning & publishing: setuptools-scm with tag-based stable releases and nightly dev versions. Ensure PYPI_API_TOKEN is configured in CI if publishing is intended.
  • Data format: compilation timings moved under metadata; update any downstream scripts that referenced the old location.
  • Build metadata: footer shows localized build date and Git short SHA; restart dev server to refresh these values.
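
For downstream scripts hit by the timing relocation, a tolerant accessor is one way to stay compatible with both old and new traces. The key names below are illustrative only, not the actual trace schema:

```python
def get_compilation_times(event):
    """Hypothetical accessor for compilation timings.

    Prefers the new location under event["metadata"]; falls back to a
    legacy top-level key for traces written before the move. Key names
    ("times") are placeholders, not the real schema.
    """
    metadata = event.get("metadata") or {}
    if "times" in metadata:
        return metadata["times"]
    return event.get("times")  # legacy pre-move location, may be None
```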

Upgrade guidance

  • Prefer Triton from PyPI (≥ 3.4.0) and adhere to the lower bounds declared in pyproject.toml.
  • For deterministic build metadata in the website, set BUILD_DATE and GIT_COMMIT_SHA_SHORT in the environment when running dev/build.