Releases: meta-pytorch/tritonparse
TritonParse v0.4.4 Release 🎉
TritonParse Release Notes v0.4.4 (23 commits)
- Date range: 2026-04-09 — 2026-04-22
- Scope: Feature release — new `compat_builder` module for automated Triton/LLVM compatibility mapping, PyTorch bisection support, AI-powered diff root cause analysis, CLP archive viewer support, and reproducer correctness fixes.
Highlights
- 🏗️ New `compat_builder` Module: Brand-new package (~2,085 lines across 8 modules) that automates generating `commits.csv` files for LLVM bumps in Triton. Uses a state-machine-driven workflow (`CompatBuilder`) with 7 phases, git bisect–based compatibility probing (build → import → smoke test), AI-powered fix generation via `ClaudeCodeClient`, CSV management with metadata headers, and a full CLI with `--resume`, `--verify`, and `--status` modes. Includes 200+ tests covering all pure-logic paths. Integrated into the main `tritonparse` CLI as the `compat-build` subcommand.
- 🔍 PyTorch Bisection Support (#377): Extends the bisect module (~1,030 new lines) to bisect PyTorch commits in addition to Triton/LLVM. The new `TorchBisector` class drives `git bisect` over a PyTorch repo using user-provided test scripts. Includes build infrastructure scripts for CUDA, cuSparseLt, and Magma installation, plus a `prepare_build_pytorch.sh` that sets up the PyTorch build environment. Accessible via `tritonparse bisect --target torch`.
- 🤖 AI-Powered Diff Root Cause Analysis: Adds Phase 2 AI analysis to `tritonparse diff --ai`. Deterministic diff results from Phase 1 (metadata, IR stats, source mappings, tensor values) are formatted as structured markdown and sent to an LLM, which returns root cause explanations as `DiffNote` objects. The architecture includes a Triton-expert system prompt, a priority-ordered context builder, and three response parsing strategies (JSON, structured markdown, raw text fallback). Supports both single-kernel and trace-level analysis with significance thresholds.
- 📦 CLP Archive Support in Log Viewer (#382): The web viewer can now load and parse CLP (Compressed Log Processor) archives directly, completing the pipeline started in #326 where structured logging gained CLP output support. Updates `DataSourceSelector`, `WelcomeScreen`, and `dataLoader.ts` to handle CLP file selection and decompression via `clp-ffi-js`.
- 🔧 OVERRIDE_TTIR Constexpr Interleaving Fix (#384): Fixes a `TypeError` that broke all `triton-mpp analyze` subcommands (ncu, barrier-analysis, plot-sm-occupancy) when kernel signatures interleave constexpr and non-constexpr parameters. The OVERRIDE_TTIR reproducer branch was removing constexpr args from positional lists, shifting the remaining args into the wrong positions. The fix passes all non-constexpr args as keyword args, eliminating position-dependent binding entirely.
- 📝 Documentation Overhaul: Moves all GitHub Wiki pages into a version-controlled `docs/` directory (~5,000 lines) with automatic wiki sync via GitHub Actions. Updates API signatures, adds documentation for the `diff`, `bisect`, and `compat-build` subcommands, fixes outdated environment variable references, and corrects test commands.
Changes by Area
🏗️ New compat_builder Module
- State Machine (`state.py`): `CompatBuildPhase` 7-phase enum (INITIALIZING → COMPLETED/FAILED), `CompatBuildState` dataclass with JSON serialization, `CompatStateManager` for persistence. 218 lines + 251 lines of tests.
- Core Builder (`builder.py`, PR2-01): `CompatBuilder` orchestrator driving the initialize → find_next_incompatible → record_pair → fix_incompatibility loop. 773 lines + 634 lines of tests.
- CSV Manager (`csv_manager.py`, PR2-02): `CSVManager` and `BumpBlock` for reading, validating, and writing single-bump CSV files with metadata headers. 261 lines + 413 lines of tests.
- AI Fixer (`ai_fixer.py`, PR3-01): AI-powered compatibility fixing following a two-phase (deterministic context + AI) pattern. System prompt encoding LLVM API change patterns, structured context builder, `AICompatFixer` orchestrator. 442 lines + 346 lines of tests.
- CLI (`cli.py`, PR3-02): Four modes — default build, `--resume`, `--verify`, `--status`. AI control flags (`--ai`/`--no-ai`, `--ai-model`) and worktree management. 364 lines + 225 lines of tests.
- CLI Integration (PR3-03): `compat-build` subparser wired into the main `tritonparse` CLI.
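The phase-driven workflow above can be sketched as a small state machine. This is a hedged illustration: only the INITIALIZING/COMPLETED/FAILED endpoints are named in these notes, so the intermediate phase names below are hypothetical placeholders, not `compat_builder`'s actual enum members.

```python
# Minimal sketch of a 7-phase-style build state machine. PROBING and FIXING
# are hypothetical stand-ins for the real intermediate phases.
from enum import Enum

class CompatBuildPhase(Enum):
    INITIALIZING = "initializing"
    PROBING = "probing"        # hypothetical intermediate phase
    FIXING = "fixing"          # hypothetical intermediate phase
    COMPLETED = "completed"
    FAILED = "failed"

ORDER = [CompatBuildPhase.INITIALIZING, CompatBuildPhase.PROBING,
         CompatBuildPhase.FIXING, CompatBuildPhase.COMPLETED]

def advance(phase: CompatBuildPhase) -> CompatBuildPhase:
    """Move to the next phase; terminal states stay put."""
    if phase in (CompatBuildPhase.COMPLETED, CompatBuildPhase.FAILED):
        return phase
    return ORDER[ORDER.index(phase) + 1]

assert advance(CompatBuildPhase.INITIALIZING) is CompatBuildPhase.PROBING
assert advance(CompatBuildPhase.COMPLETED) is CompatBuildPhase.COMPLETED
```

Persisting such a phase enum inside a JSON-serializable state object is what makes `--resume` and `--status` style modes cheap to support.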
🔍 Bisect Enhancements
- PyTorch Bisection (#377): New `TorchBisector` class (142 lines), shell scripts for CUDA/cuSparseLt/Magma installation and PyTorch builds (~644 lines), CLI extension with `--target torch`. 130 lines of tests.
- Torch Bisect Script Fixes (#383): Set up `CUDA_HOME`, install cuSparseLt libraries, and install CI requirements across all bisect build scripts.
- LLVM Path Comment Fix: Corrected misleading comments in bisect scripts about the `.llvm-project/` vs `llvm-project/` directory layout.
🤖 AI & Diff
- AI Root Cause Analysis for Diff: `diff/fb/ai/` module with system prompt, context builder, and `AIDiffAnalyzer` orchestrator. `--ai` flag for both single-kernel and trace-level diff modes. 390 lines + tests (moved to `tests/fb/diff/`).
- AI Diff Test Relocation: Moved fb-only AI diff tests from `tests/cpu/diff/` to `tests/fb/diff/` to prevent `ModuleNotFoundError` on GitHub CI.
🔧 Reproducer Fixes
- OVERRIDE_TTIR Constexpr Fix (#384): Pass non-constexpr args as keyword args in the override branch, preventing a `TypeError` when constexprs are interleaved with positional args. 123 lines of new tests.
- `num_warps_base` Extraction: Extract the original `num_warps` from the TTGIR `ttg.num-warps` module attribute during the parse phase, storing it as `metadata["num_warps_base"]`. Fixes warp-specialized kernels reporting inflated warp counts to the reproducer and viewer.
- Per-Hash Tensor Blob Saving (#380): The tensor blob saving counter changed from global to per-compilation-hash, so each autotuned config saves exactly one set of blobs instead of only the first winner. Benchmark (autotune timing) launches are now always skipped.
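The positional-shift bug behind the constexpr fix is easy to reproduce in plain Python. The sketch below is illustrative: `kernel` and its parameter names are hypothetical, with `BLOCK` standing in for a `tl.constexpr` parameter that the reproducer's override branch used to strip from the call.

```python
# Hypothetical kernel signature with a constexpr (BLOCK) interleaved between
# two ordinary pointer args.
def kernel(x_ptr, BLOCK, y_ptr):
    return {"x_ptr": x_ptr, "BLOCK": BLOCK, "y_ptr": y_ptr}

args = {"x_ptr": "X", "BLOCK": 128, "y_ptr": "Y"}
constexpr_names = {"BLOCK"}

# Buggy path: drop constexprs from the positional list. "Y" shifts into
# BLOCK's slot and y_ptr is left unbound -> TypeError.
positional = [v for k, v in args.items() if k not in constexpr_names]
try:
    kernel(*positional)
    raised = False
except TypeError:
    raised = True

# Fixed path: pass every non-constexpr arg by keyword, so positions are
# irrelevant; the constexpr value is supplied separately.
result = kernel(BLOCK=args["BLOCK"],
                **{k: v for k, v in args.items() if k not in constexpr_names})
assert raised and result["y_ptr"] == "Y"
```

Keyword binding makes the call robust to any interleaving of constexpr and non-constexpr parameters, which is exactly why it removes the whole class of position-dependent failures.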
🌐 Website & Viewer
- CLP Archive Loading (#382): `clp-ffi-js` integration for decompressing and parsing CLP archives in the browser-based log viewer.
- ESLint 10 Upgrade (#378): ESLint v9 → v10, react-hooks canary channel, React 19.2.5, Vite 8.0.7, TypeScript-ESLint 8.58.1.
- ESLint 10 Lint Fixes (#379): Comprehensive fixes across 10 files for new lint rules — lazy state initialization, `useCallback` wrapping, extracted utility modules, error cause chaining.
- Vite Security Bump (#381): Vite 8.0.7 → 8.0.8 (dependabot).
⚙️ Internal Improvements
- `TRITONPARSE_FB_MODE` Env Var: Override `is_fbcode()` detection with `TRITONPARSE_FB_MODE=0` (OSS) or `=1` (fbcode). Fixes an `ImportError` when running inside fbsource without Meta-internal dependencies.
- Torch as Hard Dependency: Removed the `TORCH_INSTALLED` conditional flag and 12 guard branches in `structured_logging.py`. Torch was already a de facto hard dependency.
- FileCheck Binary Detection: Check the package root, AMD backend, and NVIDIA backend paths (not just AMD), matching Triton's own `_filecheck.py` convention.
- `importlib.resources` for Procedure Checks: Fix `default_procedure_checks.json` loading in PAR archives by switching from `Path(__file__).parent` to `importlib.resources.files()`.
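The env-var escape hatch follows a common pattern: an explicit `"0"`/`"1"` override wins, anything else falls back to auto-detection. A minimal sketch, assuming a stand-in `_auto_detect` (the real `is_fbcode()` internals are not shown in these notes):

```python
import os

def is_fbcode(_auto_detect=lambda: False):
    """Env-var override first, auto-detection as the fallback.

    _auto_detect is a hypothetical stand-in for the real detection logic.
    """
    override = os.environ.get("TRITONPARSE_FB_MODE")
    if override == "0":
        return False
    if override == "1":
        return True
    return _auto_detect()

os.environ["TRITONPARSE_FB_MODE"] = "0"
assert is_fbcode() is False   # forced OSS mode, regardless of auto-detection
```

Setting `TRITONPARSE_FB_MODE=0` is the documented way to force OSS mode when running inside fbsource without the Meta-internal dependencies.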
📝 Documentation & CI
- Wiki → `docs/` Migration: 10 wiki pages (5,000+ lines) moved into the version-controlled `docs/` directory with automatic sync via GitHub Actions.
- Wiki Sync Regex Fix (#390): Escape the literal `)` in a sed extended regex to fix the `sync-wiki.yml` workflow.
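The class of bug fixed in #390 is the same in any regex engine: an unescaped literal `)` is parsed as a group terminator. The demo below uses Python's `re` rather than the actual `sync-wiki.yml` sed expression, and the pattern is illustrative only:

```python
import re

broken = False
try:
    re.compile(r"Release (v[0-9.]+))")   # stray ")" -> unbalanced parenthesis
except re.error:
    broken = True

# Escaping the literal parentheses makes the pattern valid.
m = re.search(r"Release \(v([0-9.]+)\)", "Release (v0.4.4)")
assert broken and m.group(1) == "0.4.4"
```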
Compatibility Notes
- `torch` is now a hard dependency: The `TORCH_INSTALLED` guard has been removed. Environments without PyTorch installed will fail at import time rather than silently degrading.
- `TRITONPARSE_FB_MODE` env var: New escape hatch for users running inside fbsource without full Meta-internal dependencies — set `TRITONPARSE_FB_MODE=0` to force OSS mode.
- No other breaking changes to the public API.
TritonParse v0.4.3 Release 🎉
- Date range: 2026-04-01 — 2026-04-08
- Scope: Bug-fix release - OVERRIDE_TTIR reproducer rewrite with stub kernel generation, warp-specialized kernel `num_warps` fix, Manifold upload scoping to fbcode MAST environments, an OSS atexit cleanup fix, and `_json_compat` extensions.
Highlights
- 🔧 OVERRIDE_TTIR Reproducer Rewrite (#376): Complete rewrite of the OVERRIDE_TTIR reproducer mode. The previous implementation was broken — it skipped defining the kernel function (causing a `NameError`), only worked for autotuned kernels, and discarded constexpr values. The new approach generates a stub `triton.jit` function (same name and params, `pass` body) wrapped with `triton.autotune` carrying the captured constexpr values, compile params, and `ir_override` pointing to the captured TTGIR. This eliminates the need to copy kernel source code and its transitive dependencies.
- 🐛 Warp-Specialized Kernel Reproducer Fix: Fixed a `ptxas` "Insufficient registers" failure when reproducing warp-specialized kernels. The Triton compiler overwrites `metadata["num_warps"]` with the post-expansion count (`ttg.total-num-warps`), causing the reproducer to double-inflate the warp count. The fix extracts the original `ttg.num-warps` from TTGIR module attributes instead.
- 🔒 Manifold Upload Scoping: Manifold upload is now only enabled by default in fbcode MAST environments (detected via `torch.version.git_version` and `MAST_HPC_JOB_NAME`), preventing a `ModuleNotFoundError` in OSS environments during atexit cleanup.
Changes by Area
🔧 Reproducer Enhancements
- OVERRIDE_TTIR Stub Generation (#376): New `stub_generator.py` (~137 lines) generates stub Triton functions and extracts constexpr values. The rewritten `_replace_kernel_import` for OVERRIDE_TTIR generates the stub + autotune config; `_replace_kernel_invocation` filters out constexpr/compile params (autotune provides them). Captured IR files are saved from the compilation event's `file_content` to `captured_irs/`. Uses `lru_cache` on `extract_params_from_source` to avoid redundant AST parses.
- Warp-Specialized num_warps Fix: At reproducer generation time, extracts the original `ttg.num-warps` from TTGIR module attributes instead of the inflated `metadata["num_warps"]`. The post-expansion count is preserved as `total_num_warps` for informational purposes.
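Pulling the warp counts out of TTGIR module attributes can be sketched with a small regex. The attribute names come from the notes above; the TTGIR snippet and the `parse_num_warps` helper are illustrative, not TritonParse's actual parser:

```python
import re

# Hypothetical TTGIR module header carrying both warp-count attributes.
TTGIR_HEADER = (
    'module attributes {"ttg.num-warps" = 4 : i32, '
    '"ttg.total-num-warps" = 8 : i32} {'
)

def parse_num_warps(ttgir: str) -> dict:
    """Extract the original and post-expansion warp counts, when present."""
    out = {}
    for key, name in [("num_warps_base", "ttg.num-warps"),
                      ("total_num_warps", "ttg.total-num-warps")]:
        m = re.search(rf'"{re.escape(name)}"\s*=\s*(\d+)', ttgir)
        if m:
            out[key] = int(m.group(1))
    return out

assert parse_num_warps(TTGIR_HEADER) == {"num_warps_base": 4,
                                         "total_num_warps": 8}
```

Using the pre-expansion value (4 here) for the reproducer avoids the double-inflation that tripped `ptxas` on warp-specialized kernels.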
⚡ JSON Compatibility Layer
- `_json_compat.py` Extensions: Added `load(f)` and `dump(obj, f)` file-based convenience wrappers delegating to the existing `loads()`/`dumps()` with file I/O wrapping.
- CUTracer Migration: All 14 CUTracer production Python files migrated from stdlib `json` to `tritonparse._json_compat`, providing a free 3-10x JSON performance upgrade via orjson with graceful degradation.
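The file-based wrappers are thin delegations, mirroring the stdlib `json` signatures. A minimal sketch (the stand-in `loads`/`dumps` here use stdlib `json`; the real `_json_compat.py` dispatches to orjson when available):

```python
import io
import json

def loads(s):                 # stand-in for _json_compat's existing loads()
    return json.loads(s)

def dumps(obj) -> str:        # stand-in for _json_compat's existing dumps()
    return json.dumps(obj)

def load(f):
    """Read a whole file object and delegate to loads()."""
    return loads(f.read())

def dump(obj, f):
    """Serialize with dumps() and write to a file object."""
    f.write(dumps(obj))

buf = io.StringIO()
dump({"a": 1}, buf)
buf.seek(0)
assert load(buf) == {"a": 1}
```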
🔒 Manifold Upload & OSS Fixes
- Scoped Default (#374, 3337a0c): `TRITONPARSE_TRACE_MANIFOLD` now defaults to `"0"` (OFF) and is only auto-enabled when running in fbcode and in a MAST environment. The env var override still works in all environments.
- OSS atexit Fix (#374): Gated the Manifold upload path in `_cleanup()` behind `is_fbcode()` to prevent `ModuleNotFoundError: No module named 'tritonparse.fb'` during atexit in OSS environments.
🏗️ Infrastructure & CI
- Packaging Workaround (#370): Added an explicit `pip install packaging` step in CI setup to work around a PyTorch nightly (2.12.0.dev20260405+) missing dependency on the `packaging` module.
- Pin Node.js in CI: Pinned the Node.js version in GitHub Actions CI workflows for reproducible builds.
- Website Dependencies: Upgraded website dependencies and fixed Vite 8 / ESLint compatibility. Bumped vite from 8.0.3 to 8.0.5 (security fix).
- Internal Repo Re-sync (#375): Cleaned up Claude Code configuration files that were incorrectly synced to the OSS repository.
Compatibility Notes
- No breaking API changes: This is a bug-fix release; the only user-visible behavior changes are the defaults noted below.
- Manifold upload default changed: `TRITONPARSE_TRACE_MANIFOLD` now defaults to OFF in non-fbcode environments. Users who relied on the previous ON default in OSS should explicitly set `TRITONPARSE_TRACE_MANIFOLD=1`.
- OVERRIDE_TTIR reproducer: The reproducer output format for OVERRIDE_TTIR mode has changed (stub kernel + autotune wrapper instead of a source copy), but the generated reproducers are functionally equivalent and more reliable.
Upgrade Guidance
- Standard upgrade: `pip install --upgrade tritonparse`
- Warp-specialized kernel reproducers: Previously failing reproducers for warp-specialized kernels should now work correctly without manual intervention.
TritonParse v0.4.2 Release 🎉
TritonParse Release Notes v0.4.2 (45 commits)
- Date range: 2026-02-27 — 2026-03-30
- Scope: Feature release - New `ai` module for LLM-powered analysis, a whole-trace `--trace` diff mode with multi-strategy kernel matching, FileCheck-based procedure detection replacing hardcoded BlockPingpong, orjson performance optimization with a free-threading fallback, torch trace kernel attribution, JSON schema validation, and kernel-run-level tensor blob save controls.
Highlights
- 🤖 New `ai` Module: LLM client abstraction layer with an `LLMClient` ABC, `ClaudeCodeClient` for Claude Code CLI integration, `MockClient` for testing, and output parsers (`extract_json`, `extract_code_block`, `extract_diff_patch`). Foundation for AI-powered analysis features.
- 🔬 Whole-Trace Diff (`--trace` mode): Compare all kernels across two trace files with a single command. A multi-strategy `KernelMatcher` engine matches kernels by hash → name → source similarity → fuzzy name → config similarity. `TraceDiffEngine` orchestrates matching, per-pair diffing, and summary generation. Autotuning-aware: distinguishes truly absent kernels from unpaired autotuning compilations.
- 📋 FileCheck-Based Procedure Detection: Complete rewrite of IR analysis from hardcoded Python pattern matching to a JSON-driven, FileCheck-based system. Procedure definitions are declarative with configurable pattern checks and display attributes. Replaces the old `BlockPingpongCategory` with three configurable procedure configs (Small/Medium/Large). Tile size attributes (M, N, K, bits) are now displayed.
- ⚡ orjson Performance + Free-Threading Fallback: New `_json_compat.py` compatibility layer uses orjson for performance and falls back to stdlib json for CPython 3.14 free-threading builds. All 21 modules migrated. `orjson>=3.9` and `rich>=13.0` are now default dependencies.
- 🔍 Torch Trace Kernel Attribution: New torch trace log parser extracts `kernel_source_path → CompileInfo` mappings from inductor's output code events, enabling kernel-to-compilation-frame attribution when `pt_info` is missing. Wired through the parse pipeline and CLI via `--torch-trace-dir`.
- ✅ JSON Schema Validation: New `tritonparse/validation/` module with JSON schemas for the `compilation`, `launch`, `launch_diff`, and `ir_analysis` event types. A lightweight validator checks types, required fields, enums, numeric constraints, and `$ref` resolution.
- 🎛️ Kernel-Run-Level Tensor Blob Controls: New `TRITONPARSE_TENSOR_SAVE_SKIP_RUNS` and `TRITONPARSE_TENSOR_SAVE_MAX_RUNS` environment variables (and Python API) for fine-grained control over which kernel runs get tensor blob snapshots.
Changes by Area
🤖 New ai Module
A new `tritonparse/ai/` module (~1,400 lines) providing LLM client abstractions:
- LLM Client ABC (PR-1): `LLMClient` abstract base with `chat()` and `chat_stream()` interfaces; `Message`, `Response`, `ToolCall` dataclasses; `MockClient` for testing
- ClaudeCodeClient (PR-2): Production client wrapping the Claude Code CLI with temp-file shell escaping, session resumption, model selection, retry logic, and JSON/stream-JSON parsing
- Output Parsers (PR-3): `extract_json()`, `extract_code_block()`, `extract_diff_patch()` fallback parsers for LLM text responses; `format_messages()`, `truncate_context()` utilities
- Error Diagnostics: Improved error handling extracts the actual error from the stdout JSON `"result"` field instead of just stderr
🔬 Whole-Trace Diff (--trace mode)
A complete trace-level comparison system (~3,400 lines) with a layered architecture:
- Data Types: `MatchMethod` enum (HASH/NAME/SOURCE/FUZZY_NAME/CONFIG), `KernelMatchResult`, `TraceDiffResult`, `TraceDiffSummary`, `TraceStats`, `DtypeMismatch`
- KernelMatcher (~505 lines): Three-phase group-aware matching engine:
  - Phase 0: Hash-based exact matching (highest priority, cross-name capable)
  - Phase 1: Group-level matching by exact name → source similarity (threshold 0.75) → fuzzy name (threshold 0.7)
  - Phase 2: Within-group config pairing by (num_stages, num_warps, shared memory) similarity
  - Bounded sampling (`_MAX_GROUP_SAMPLES=5`) for performance on large traces
- TraceDiffEngine (~355 lines): Orchestrator computing trace stats → kernel matching → per-pair DiffEngine → summary generation
- Output: `TraceSummaryFormatter` for human-readable output; extended `ConsolidatedDiffWriter` with `add_trace_diff()`
- CLI: New `--trace` flag requiring exactly 2 input files
- Dtype Mismatch Detection: Surfaces dtype mismatches in tensor value comparison when argument names don't overlap
- Test Reorganization: The monolithic `test_diff.py` split into 7 focused files: `test_cli.py`, `test_diff_engine.py`, `test_fixtures.py`, `test_kernel_matcher.py`, `test_tensor_value.py`, `test_trace_diff.py`, `test_trace_output.py`
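The fuzzy-name phase above can be sketched with a standard similarity metric. Here `difflib.SequenceMatcher` stands in for whatever metric `KernelMatcher` actually uses; only the 0.75/0.7 thresholds are quoted from these notes:

```python
import difflib

def best_match(name, candidates, threshold=0.7):
    """Return the most similar candidate name, or None below the threshold."""
    scored = [(difflib.SequenceMatcher(None, name, c).ratio(), c)
              for c in candidates]
    score, cand = max(scored)
    return cand if score >= threshold else None

# A renamed kernel still pairs with its counterpart in the other trace...
assert best_match("matmul_kernel_v2",
                  ["matmul_kernel", "softmax_kernel"]) == "matmul_kernel"
# ...while an unrelated name is reported as truly unmatched.
assert best_match("attention", ["matmul_kernel", "softmax_kernel"]) is None
```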
CLI Usage:

```shell
# Compare all kernels across two trace files
tritonparseoss diff trace_a.ndjson trace_b.ndjson --trace

# With tensor value analysis
tritonparseoss diff trace_a.ndjson trace_b.ndjson --trace --tensor-values --atol 1e-5 --rtol 1e-3
```

📋 FileCheck-Based Procedure Detection
Complete rewrite of IR analysis (~2,200 lines) from hardcoded Python to JSON-driven FileCheck:
- FileCheck Integration: Auto-discovers the FileCheck binary from Triton's bundled version, the `FILECHECK_PATH` env var, or the system PATH
- JSON Configuration (`default_procedure_checks.json`): Declarative procedure definitions with `pattern_checks` (FileCheck patterns) and `display_attributes` (configurable extraction rules)
- Attribute Extraction: Multiple sources (`module_attrs`, `ir_content`, `computed`) with rules (`regex`, `count`, `dot_shape`, `tile_size_bits`, `pp_clusters`)
- Tile Size Display: New tile_m, tile_n, tile_k, tile_size_bits attributes
- BlockPingpong Migration: The old `BlockPingpongCategory` enum and ~254 lines of hardcoded Python replaced by three JSON-configured procedures (Small/Medium/Large)
- Website UI: Collapsible/foldable sections per procedure on the IRAnalysis page
- Streamlined Workflow: Procedure detection integrated into the main tritonparse parse pipeline
⚡ orjson Performance + Free-Threading Fallback
- `_json_compat.py` (new): Unified JSON compatibility layer — orjson when available, stdlib json fallback
  - `loads()` accepts `str | bytes | bytearray | memoryview`
  - `dumps()` returns `str` with `indent` and `sort_keys` support
  - Non-string key coercion in the fallback path (replicates orjson's `OPT_NON_STR_KEYS`)
- Global Migration (#362): All 21 modules migrated from `import json` to `from tritonparse._json_compat import loads, dumps, JSONDecodeError`
- Free-Threading Support (#365): Automatic stdlib json fallback for CPython 3.14 free-threading builds where orjson is unavailable
- Default Dependencies (#366): `orjson>=3.9` and `rich>=13.0` added to `pyproject.toml` dependencies (previously zero dependencies)
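The import-time dispatch behind the compatibility layer can be sketched in a few lines. This shows only the try/fallback shape; the real `_json_compat.py` additionally handles `indent`, `sort_keys`, and non-string keys:

```python
try:
    import orjson

    def loads(s):
        return orjson.loads(s)

    def dumps(obj) -> str:
        # orjson returns bytes; normalize to str for drop-in compatibility.
        return orjson.dumps(obj).decode("utf-8")

except ImportError:  # e.g. free-threading builds without orjson wheels
    import json

    def loads(s):
        return json.loads(s)

    def dumps(obj) -> str:
        return json.dumps(obj)

assert loads(dumps({"k": [1, 2]})) == {"k": [1, 2]}
```

Because both branches expose identical `loads`/`dumps` signatures, callers never need to know which backend was selected.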
🔍 Torch Trace Kernel Attribution
- Torch Trace Parser (#353): New `tritonparse/parse/torch_trace_parser.py` (~212 lines) parsing inductor's glog-formatted torch trace logs to extract `kernel_source_path → CompileInfo` mappings from `inductor_output_code` events
- Trace Processor Integration (#354): `_build_kernel_attribution_map()` and `_apply_kernel_attribution()` enrich compilation events with `pt_info` when missing (~126 lines)
- CLI & Pipeline Wiring (#355): New `--torch-trace-dir` flag with auto-discovery of torch trace files from the same parent directory
✅ JSON Schema Validation
- Schema Files (#356): Four JSON schemas for the `compilation`, `launch`, `launch_diff`, and `ir_analysis` event types
- Lightweight Validator (`json_validator.py`, ~287 lines): Validates required fields, types, enums, numeric constraints (min/max/exclusive), `additionalProperties`, array items, and `$ref` resolution
- `validate_trace_file()`: Full NDJSON trace file validation with a `max_errors` cap
- Schema Loader: `importlib.resources` for PAR compatibility, lazy loading with caching
- Test Suite: Comprehensive tests (~652 lines) covering all validation scenarios
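The flavor of dependency-free checking such a validator performs can be sketched as follows. The schema shape and field names here are hypothetical simplifications; the real `json_validator.py` additionally covers enums, numeric bounds, `additionalProperties`, and `$ref`:

```python
def validate_event(event, schema):
    """Collect error strings for missing required fields and type mismatches."""
    errors = []
    for field in schema.get("required", []):
        if field not in event:
            errors.append(f"missing required field: {field}")
    for field, expected in schema.get("types", {}).items():
        if field in event and not isinstance(event[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors

# Hypothetical mini-schema for a launch event.
launch_schema = {"required": ["event_type", "name"], "types": {"grid": list}}

assert validate_event({"event_type": "launch", "name": "k", "grid": [1]},
                      launch_schema) == []
assert validate_event({"event_type": "launch", "grid": 3}, launch_schema) == [
    "missing required field: name", "grid: expected list",
]
```

Collecting errors rather than raising on the first one is what makes a `max_errors`-capped whole-file NDJSON validation pass practical.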
🎛️ Tensor Blob Save Controls
- Skip/Max Runs Gating: New environment variables for fine-grained control:
  - `TRITONPARSE_TENSOR_SAVE_SKIP_RUNS`: Skip tensor blob saving for the first N kernel runs (default: 0)
  - `TRITONPARSE_TENSOR_SAVE_MAX_RUNS`: Save tensor blobs for at most N kernel runs after skipping (default: 0 = unlimited)
- Python API: `TritonParseManager(tensor_save_skip_runs=N, tensor_save_max_runs=M)` and `init(tensor_save_skip_runs=N, tensor_save_max_runs=M)`
- Autotune-Aware: Benchmark launches during autotune are excluded from run counting
- GPU Tests: End-to-end validation of skip/max runs gating
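The gating semantics described above reduce to a small predicate. A minimal sketch (the function name and signature are illustrative, not TritonParse's internal API):

```python
def should_save(run_index, skip=0, max_runs=0, is_benchmark=False):
    """Decide whether a kernel run gets a tensor blob snapshot.

    skip: drop the first `skip` counted runs.
    max_runs: after skipping, save at most this many runs (0 = unlimited).
    is_benchmark: autotune timing launches bypass the counter entirely.
    """
    if is_benchmark:
        return False
    if run_index < skip:
        return False
    if max_runs and run_index >= skip + max_runs:
        return False
    return True

# skip=2, max_runs=3: runs 0-1 skipped, runs 2-4 saved, run 5 not saved.
saved = [i for i in range(6) if should_save(i, skip=2, max_runs=3)]
assert saved == [2, 3, 4]
```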
🔧 Reproducer Enhancements
- CUDA Graph Capture Error (#359): Raise a clear `RuntimeError` when reproducing kernels launched during CUDA graph capture, explaining that argument extraction was skipped
- Kernel Name Fallback: Reproducer/info now falls back to matching by kernel name when the compilation hash is missing (Inductor kernels where the JIT hook didn't fire)
🔧 Bisect Enhancements
- `--triton-repo` Flag: Controls the culprit commit URL prefix — `oai` (triton-lang/triton, default) or `meta` (facebookexperimental/triton); state is persisted and restored on resume
- Rich as Default Dependency (#366): `rich>=13.0` moved from optional to default, simplifying bisect UI code
🏗️ Infrastructure & CI
TritonParse v0.4.1 Release 🎉
- Date range: 2026-01-22 — 2026-02-24
- Scope: Feature release - New `diff` CLI subcommand for kernel compilation comparison with tensor value analysis, autotune analysis visualization, profile-aware launch tracing, enhanced reproducer support, bisect auto-setup, and multi-format trace compression support.
Highlights
- 📊 Autotune Analysis: End-to-end autotune session tracking with frontend visualization. Automatically detects autotune sessions, tracks benchmark vs winner launches, displays configuration comparison tables, and shows winner run count statistics.
- 🔬 New `diff` CLI Subcommand (Beta): Complete kernel compilation diff system for comparing two compilation events. Supports metadata analysis, source mapping comparison, IR statistics diff, and tensor value comparison with configurable tolerances (`--tensor-values`, `--atol`, `--rtol`). Output can be appended in-place or written to new files. Note: This feature is in beta — APIs and output formats may change in future releases.
- ⚡ Profile-Aware Launch Tracing: Transparent integration with `torch.profiler` via `TRITON_TRACE_LAUNCH_WITHIN_PROFILING=1`. Monkey-patches `torch.profiler.schedule` to trace launches only during the profiler's RECORD phase.
- 🗜️ Multi-Format Compression: Added CLP (Compressed Log Processor) support alongside existing gzip. Trace compression is now disabled by default (`TRITON_TRACE_COMPRESSION=none`). Magic number detection enables transparent decompression.
- 🔧 Bisect Auto-Setup: New `--auto-env-setup` flag for `--llvm-only` bisect mode. Automatically clones/updates the Triton and LLVM repositories and creates conda environments.
- 📦 TMA Kernel Support: TensorDescriptor capture and reconstruction for TMA (Tensor Memory Accelerator) kernel reproducers.
Changes by Area
📊 Autotune Analysis
🔬 New diff CLI Subcommand (Beta)
A complete kernel compilation comparison system (~1,500 lines) with a layered architecture:
- Data Types (D1): `CompilationDiffResult`, `DiffNote`, `DiffSummary`, `IRStats`, `IRStatsDiff`, `MetadataDiff`, `TensorArgDiff`, `TensorValueDiff`
- Event Matching (D2): `match_events_by_index()`, `match_events_by_kernel()`, `find_launch_for_compilation()`
- Diff Engine (D3): Main `DiffEngine` class orchestrating all analyzers
- Metadata Analyzer (D4): Compares compilation metadata (num_warps, num_stages, etc.)
- Sourcemap Analyzer (D5): Compares source mappings between IRs
- Summary Generator (D6): Generates human-readable diff summaries
- Output Module (D7): `ConsolidatedDiffWriter`, `append_diff_to_file()`, `format_summary()`
- CLI Entry Point (D8): `tritonparseoss diff` command with `--events`, `--kernel`, `--tensor-values` flags
- Tensor Value Analyzer: Numeric tensor comparison with blob mode (full element-wise) and stats mode (min/max/mean/std fallback)
- Unit Tests: Phase 1 test coverage for core modules
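Tolerance-based comparison in the spirit of the Tensor Value Analyzer uses the standard closeness rule `|a - b| <= atol + rtol * |b|` (the same shape `torch.allclose` and `math.isclose` use). A minimal sketch with plain lists standing in for tensor blobs:

```python
def values_close(a, b, atol=1e-5, rtol=1e-3):
    """Element-wise closeness check: |x - y| <= atol + rtol * |y|."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

baseline = [1.0, 2.0, 3.0]
assert values_close([1.0, 2.0005, 3.0], baseline)    # within tolerance
assert not values_close([1.0, 2.5, 3.0], baseline)   # 0.5 off: mismatch
```

The `--atol`/`--rtol` flags feed exactly these two knobs: absolute tolerance dominates near zero, relative tolerance dominates for large magnitudes.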
CLI Usage:

```shell
# Compare compilations 0 and 1 in a single file
tritonparseoss diff trace.ndjson --events 0,1

# Compare with tensor value analysis
tritonparseoss diff trace.ndjson --tensor-values --atol 1e-5 --rtol 1e-3

# List available compilations
tritonparseoss diff trace.ndjson --list

# Filter by kernel name
tritonparseoss diff trace.ndjson --kernel matmul --events 0,1
```

⚡ Profile-Aware Launch Tracing
- New environment variable `TRITON_TRACE_LAUNCH_WITHIN_PROFILING=1`
- `patch_profiler_schedule()`: Monkey-patches `torch.profiler.schedule`
- `enable_launch_tracing()` / `disable_launch_tracing()` API
- Mutually exclusive with `TRITON_TRACE_LAUNCH` (validated at init)
- Unit tests for all three scenarios: no flag, trace all, profile-aware
🗜️ Compression Module
- Magic number detection: `detect_compression()` for gzip/zstd/none
- Transparent reading: `open_compressed_file()` context manager
- CLP format support (#326): `TRITON_TRACE_COMPRESSION="clp"` for the Compressed Log Processor format
- Default change: Compression disabled by default (was gzip)
- API functions: `is_gzip_file()`, `is_zstd_file()`, `iter_lines()`
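Magic-number detection is a prefix check: gzip streams start with `1f 8b` and zstd frames with `28 b5 2f fd`. A minimal sketch (the function name matches the notes above, but its actual signature in tritonparse may differ):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"

def detect_compression(first_bytes):
    """Classify a byte prefix as 'gzip', 'zstd', or 'none'."""
    if first_bytes.startswith(GZIP_MAGIC):
        return "gzip"
    if first_bytes.startswith(ZSTD_MAGIC):
        return "zstd"
    return "none"

assert detect_compression(gzip.compress(b"{}")) == "gzip"
assert detect_compression(b'{"event_type": "launch"}') == "none"
```

Sniffing the prefix instead of trusting the file extension is what makes decompression transparent to readers.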
🔧 Bisect Enhancements
- EnvironmentManager (#329-#332):
  - Auto-clone Triton and LLVM repositories from GitHub
  - Create/verify conda environments
  - `--auto-env-setup` CLI flag for `--llvm-only` mode
  - Status checking and diagnostics
  - Unit tests for all scenarios
📦 Reproducer Enhancements
- TensorDescriptor support (#344): Captures `base`, `shape`, `strides`, `block_shape`, `padding` for TMA kernels
- preserve_autotune mode (#328): Preserve autotune configs in reproducer scripts
- Robustness improvements: Complex kernel handling, function reference detection in call arguments (#348)
- Verbose args print placeholder (#347): Placeholder for verbose argument printing
- WS kernel fix (#349): Correct `num_warps` handling for Warp Specialization kernels
- Better logging (#346): Improved logging when black/isort are unavailable
🌐 Website UI Improvements
- KernelOverview page: New component for autotune analysis visualization (870 lines)
- WebSocket ArrayBuffer handling (#340): Direct trace ArrayBuffer via iframe messaging
- URL normalization (#324): Manifold Explorer and tritonparse URL handling
- Click-to-highlight tip: Added in CodeComparisonView
- Title navigation fix: TritonParse title returns to home
🏗️ Infrastructure & API
- SASS parsing refactor: `extract_sass_pc_mappings()` for PC-offset-keyed source mapping (for CUTracer integration)
- Rank-less file support (#341, #342): `--rank none` for parsing files without a rank suffix
- Launch without compilation (#336): Support launch events when the compilation was cached
- log_dir parameter (#337): `TritonParseManager(log_dir=...)` API
- Auto-switch log file: When the rank becomes available during execution
- Error message improvements (#339): Better diagnostics and bug fixes
- Meta copyright headers: Added to all scripts
- Dependabot prefix: `[dependabot]` prefix added to PR titles
- Negative line support (#319): `prettify_ndjson` handles negative line numbers
Compatibility Notes
- Default Change: Trace compression is now disabled by default. Set `TRITON_TRACE_COMPRESSION="gzip"` to restore v0.4.0 behavior.
- New Feature (Beta): The `diff` subcommand is additive and doesn't affect existing workflows. It is in beta — APIs and output formats may change.
- New Feature: Autotune analysis events are automatically generated; the frontend displays them when available.
- Mutual Exclusivity: `TRITON_TRACE_LAUNCH` and `TRITON_TRACE_LAUNCH_WITHIN_PROFILING` cannot both be set.
Upgrade Guidance
- Use diff for kernel comparison:

  ```shell
  # Basic diff
  tritonparseoss diff trace.ndjson --events 0,1

  # With tensor value comparison
  tritonparseoss diff trace.ndjson --tensor-values --kernel matmul
  ```

- Enable profile-aware tracing:

  ```shell
  TRITON_TRACE_LAUNCH_WITHIN_PROFILING=1 python train.py
  ```

- Use CLP compression (if available):

  ```shell
  TRITON_TRACE_COMPRESSION="clp" python train.py
  ```

- Bisect with auto-setup:

  ```shell
  tritonparseoss bisect --llvm-only --auto-env-setup \
    --triton-dir ~/oss-triton \
    --good-llvm abc123 --bad-llvm def456 \
    --test-script test.py
  ```

- TMA kernel reproducers: Now work automatically when TensorDescriptor arguments are present.
TritonParse v0.4.0 Release 🎉
TritonParse Release Notes v0.4.0 (115 commits)
- Date range: 2025-12-26 — 2026-01-21
- Scope: Major feature release - New `bisect` CLI subcommand for automated Triton/LLVM regression bisection, SASS source mapping support, BlockPingpong IR analysis, advanced filter syntax, and significant infrastructure improvements.
Highlights
- 🔍 New `bisect` CLI Subcommand: Complete regression bisection system for Triton and LLVM. Automatically finds culprit commits with `git bisect` integration, LLVM bump detection, commit pair testing, and a Rich TUI real-time progress display. Supports resumable workflows and multiple operation modes.
- 📊 SASS Source Mapping: Full SASS (NVIDIA assembly) source mapping support with fuzzy matching. Enables bidirectional mapping between SASS and other IR types (TTIR, TTGIR, PTX) in the website UI.
- 🔬 BlockPingpong Detection: New IR analysis capability to detect and categorize block pingpong scheduling patterns in TTGIR, with color-coded visualization in the website UI.
- 📦 Standalone Reproducer: New `--embed-context` flag embeds JSON context directly into generated Python scripts, creating fully self-contained single-file reproducers for easy sharing and bug reports.
- 🎛️ Advanced Filter Syntax: Enhanced `--args-list` filtering with support for nested properties (`C_ptr.dtype`), array indexing (`C_ptr.shape[0]`), and list matching (`C_ptr.shape=[3024, 10752]`).
- 🏗️ Infrastructure Modernization: Parse module refactored into a dedicated subdirectory, unified logging system, centralized SVG icons, test directory restructuring, and ESLint integration for the website.
Changes by Area
🔍 New bisect CLI Subcommand
A complete regression bisection system spanning ~6,000+ lines of code across 55+ PRs, organized in 7 architectural layers.
- Operation modes (PR-43 ~ PR-52):
  - `tritonparseoss bisect --good <commit> --bad <commit>` - Triton-only bisect
  - `--llvm-only` - Direct LLVM commit bisection
  - `--pair-test` - Test (Triton, LLVM) commit pairs from CSV
  - `--commits-csv` - Full 4-phase workflow (Triton bisect → LLVM bump detection → pair test → LLVM bisect)
  - `--resume` / `--status` - Resume an interrupted bisect or check status
- Core bisector architecture (PR-15 ~ PR-21):
  - `BaseBisector` - Abstract base class with the template method pattern
  - `TritonBisector` - Triton commit bisection with automatic build and test
  - `LLVMBisector` - LLVM commit bisection with Triton rebuild
  - Commit validation and correct bisect range detection
- Commit detection and pair testing (PR-22 ~ PR-27):
  - `CommitDetector` - Automatically detects LLVM version bump commits
  - `LLVMBumpInfo` - Captures old/new LLVM hash information
  - `PairTester` - CSV-driven (Triton, LLVM) commit pair testing
  - LLVM range filtering for efficient pair selection
- State management (PR-28 ~ PR-31):
  - `BisectPhase` enum: `TRITON_BISECT`, `TYPE_CHECK`, `PAIR_TEST`, `LLVM_BISECT`, `COMPLETED`, `FAILED`
  - `BisectState` dataclass with JSON serialization
  - `StateManager` for persistent state with auto-resume support
  - Automatic state file discovery (`find_latest_state()`)
- Rich TUI interface (PR-32 ~ PR-42):
  - `BisectUI` - Split-screen layout with progress and output panels
  - Real-time progress updates with phase, commit, and step information
  - Graceful fallback to plain text when Rich is unavailable
  - `print_final_summary()` - Summary with GitHub links
- Shell scripts (PR-06 ~ PR-13):
  - `bisect_triton.sh` - Triton build and test script for git bisect
  - `bisect_llvm.sh` - LLVM + Triton build with COMPAT_MODE support
  - `test_commit_pairs.sh` - Sequential pair testing with CSV support
  - `scripts/__init__.py` - Script path utilities
- Execution infrastructure (PR-01 ~ PR-05, PR-14):
  - `ShellExecutor` - Blocking and streaming command execution
  - `CommandResult` dataclass with duration tracking
  - `BisectLogger` - Dual logging (file + TUI callback)
  - `run_git_bisect_sequence()` - Complete git bisect workflow
  - `uv` package manager support via `config.py` (PR-54)
  - Clean build environment before each bisect step (PR-55)
- Unit tests (Test-PR-01 ~ Test-PR-03):
  - Tests for `state.py`, `commit_detector.py`, `pair_tester.py`
  - Tests for `executor.py` and `logger.py` (Layer 0)
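The JSON-serialized state that enables `--resume` can be sketched with a dataclass round-trip. This is a hedged illustration in the spirit of `BisectState`/`StateManager`; the field names below are illustrative, not the real schema:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class BisectState:
    phase: str = "TRITON_BISECT"
    good: str = ""
    bad: str = ""
    tested: int = 0

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, s: str) -> "BisectState":
        return cls(**json.loads(s))

# Round-trip: a resumed run reconstructs exactly the state that was saved.
saved = BisectState(phase="LLVM_BISECT", good="abc123", bad="def456", tested=7)
assert BisectState.from_json(saved.to_json()) == saved
```

Writing this blob to disk after every step is what lets an interrupted multi-hour bisect pick up where it left off.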
📊 SASS Source Mapping Support
- Fuzzy matching for SASS (commit 762844e):
  - New `extract_sass_mappings()` function in `ir_parser.py`
  - `ignore_column` parameter for fuzzy matching (SASS lacks column info)
  - Automatic fuzzy matching when source or target IR is "sass"
  - SASS comment line mapping (`//## File "/path", line N`)
  - Skip `.nv_debug_ptx_txt` debug file references
- Website UI integration (#249):
  - SASS code panel support in IR Code View
  - Bidirectional highlighting between SASS and other IRs
  - Updated default trace with SASS code (commit 1b2d6a9)
🔬 BlockPingpong Detection
- IR analysis enhancement (commits 50deca4, fe3092f, 0426510, 2dc0eac):
  - New BlockPingpong pattern detection in `ir_analysis.py` (~257 lines)
  - Automatic categorization of ping-pong scheduling patterns
  - Pattern matching descriptions for each category
  - Color-coded visualization in the website UI
  - Dedicated Pingpong section in the IR Analysis interface
📦 Reproducer Enhancements
- Standalone reproducer (#252):
  - New `--embed-context` CLI flag (default: False)
  - Embeds JSON context directly into the Python script
  - Creates a fully self-contained single-file reproducer
  - Ideal for sharing, bug reports, and archiving
- Compile params support (#295):
  - Pass compile parameters to kernel invocation
  - Fixes issue #277
- Improved identification (#293, #294):
  - `line_index` added to the reproducer filename
  - Metadata comments in generated scripts
- Bug fixes:
🎛️ Advanced Filter Syntax
- Nested property filtering (commit 3ee5df5):
  - Dot notation: `C_ptr.dtype=torch.bfloat16`
  - Array indexing: `C_ptr.shape[0]=3024`
  - List matching: `C_ptr.shape=[3024, 10752]`
  - Unified nested dict unwrapping across all value sources
  - Filter kernel launches by tensor metadata (shape, dtype, stride)
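As an illustration of how such filter expressions can be evaluated against launch argument metadata, here is a minimal self-contained sketch; the helper names and parsing rules are assumptions for exposition, not TritonParse's actual implementation:

```python
import re
from typing import Any

# Matches trailing array indices, e.g. "shape[0]" -> index 0.
_INDEX_RE = re.compile(r"\[(\d+)\]")


def lookup(args: dict, path: str) -> Any:
    """Resolve a dotted path like 'C_ptr.shape[0]' in nested metadata."""
    value: Any = args
    for part in path.split("."):
        name = part.split("[", 1)[0]
        value = value[name]
        for idx in _INDEX_RE.findall(part):
            value = value[int(idx)]
    return value


def matches(args: dict, expr: str) -> bool:
    """Evaluate one 'path=value' filter expression against launch args."""
    path, _, raw = expr.partition("=")
    expected: Any = raw
    if raw.startswith("["):  # list matching, e.g. shape=[3024, 10752]
        expected = [int(x) for x in raw.strip("[]").split(",")]
    elif raw.isdigit():
        expected = int(raw)
    return lookup(args, path) == expected
```

With metadata like `{"C_ptr": {"dtype": "torch.bfloat16", "shape": [3024, 10752]}}`, all three example expressions above would match.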
🌐 Website UI Improvements
- Code panel enhancements:
- Infrastructure:
🏗️ Infrastructure & Code Quality
- Module reorganization:
- Test infrastructure:
  - Test directory restructuring: `tests/cpu/` and `tests/gpu/`
  - Extract GPU TensorBlob, complex kernels, reproducer E2E tests
  - Extract GPU structured logging + context manager tests
  - Extract CPU tests to dedicated directory
  - CI workflow updated for new test structure
- Code formatting:
- Bug fixes:
  - Kernel selector overflow fix (commit b5c72b8)
  - Substring matching bug in call graph dependency filtering (commit 0ec75af)
  - PAR compatibility in function_extractor (commit 48551a2)
  - `ast.unparse()` for proper indentation in reproducer extraction (commit 1d8a33d)
  - `--kernel-import` help message fix (commit 18cf9d8)
  - `source_repo_dir` support for mapping production file paths (commit a952d99)
  - BisectLogger unique logger names per instance (#251)
📚 Documentation
- Simplified CHANGELOG.md with links to GitHub releases (#226)
- Website version bumped to 0.3.2 with dependency updates (#238)
Compatibility notes
- New Feature: The `bisect` subcommand is an additive feature that doesn't affect existing workflows.
- SASS Support: To use SASS source mapping, traces must include SASS IR (enable via `enable_sass_dump=True` or `TRITONPARSE_DUMP_SASS=1`).
- Filter Syntax: The new advanced filter syntax is backward compatible; existing filter expressions continue to work.
- Test Directory: Tests have been reorganized into `tests/cpu/` and `tests/gpu/` subdirectories.
Upgrade guidance
- Use bisect for regression hunting:

  ```bash
  # Basic Triton bisect
  tritonparseoss bisect --triton-dir /path/to/triton \
      --test-script test.py --good v2.0.0 --bad HEAD

  # Full workflow with LLVM bump detection
  tritonparseoss bisect --triton-dir /path/to/triton \
      --test-script test.py --good v2.0.0 --bad HEAD \
      --commits-csv pairs.csv

  # Resume interrupted bisect
  tritonparseoss bisect --resume

  # Check status
  tritonparseoss bisect --status
  ```

- Generate standalone reproducers:

  ```bash
  tritonparseoss reproduce trace.ndjson --kernel matmul --embed-context
  ```

- Use advanced filtering:

  ```bash
  tritonparseoss info trace.ndjson --args-list "C_ptr.shape[0]=302...
  ```
TritonParse v0.3.2 Release 🎉
TritonParse Release Notes v0.3.2 (34 commits)
- Date range: 2025-11-05 — 2025-12-22
- Scope: Major feature release - new `info` CLI subcommand, multi-file call graph analysis for reproducers, unified 0-based indexing, IR extraction tools, and infrastructure improvements.
Highlights
- 📊 New `info` CLI Subcommand: Query kernel information from NDJSON trace files without manual parsing. List all kernels with launch counts, view launches for specific kernels, and get fuzzy matching suggestions for kernel names.
- 🔍 Multi-File Call Graph Analyzer: Advanced AST-based analysis that automatically extracts all transitively-called functions across multiple Python files. Enables self-contained kernel reproducers with all dependencies included.
- 🎯 Unified 0-Based Indexing: All launch indices throughout the codebase (CLI, website, internal APIs) now use consistent 0-based indexing, following Python conventions.
- ⚡ Enhanced Reproducer: New `--kernel` and `--launch-id` arguments eliminate manual line number lookup. AST-based dependency extraction, autotune disabler, and code formatting for generated scripts.
- 🛠️ IR Extraction Tool: New command-line tool to extract Triton IRs (TTIR, TTGIR, LLIR, PTX) from trace logs with flexible output organization.
- 🔐 PyPI Trusted Publishing: Migrated from API token authentication to OIDC-based Trusted Publishing for improved security and attestations.
Changes by area
📊 New info CLI Subcommand
- Core query layer (PR #208):
  - New `tritonparse/info/` module for kernel information queries
  - `KernelSummary` and `LaunchInfo` dataclasses for structured results
  - `list_kernels()`: List all kernels with launch counts
  - `find_launch_index_by_kernel()`: Find the line index for a kernel's N-th launch
- CLI interface (PR #210):
  - `tritonparseoss info <trace.ndjson>` - List all kernels with launch counts
  - `tritonparseoss info <trace.ndjson> --kernel <name>` - List launches for a specific kernel
  - Auto-parsing: automatically detects and parses raw logs
  - Fuzzy matching suggestions when a kernel is not found
  - Performance optimization using `launch_diff` events when available
- Additional filtering (commit 8134195):
  - Added `--args-list` filtering to the info command
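Conceptually, the query layer scans launch events in the NDJSON trace and aggregates per-kernel counts. A simplified, self-contained approximation — the event field names here are assumptions about the trace schema, not the actual `tritonparse/info/` code:

```python
import json
from collections import Counter
from typing import Iterable


def list_kernels(lines: Iterable[str]) -> dict:
    """Count launches per kernel name from NDJSON trace lines."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the trace
        event = json.loads(line)
        if event.get("event_type") == "launch":
            counts[event.get("kernel_name", "<unknown>")] += 1
    return dict(counts)
```

The real implementation additionally returns structured dataclasses and can short-circuit via `launch_diff` events instead of counting raw launches.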
🔍 Multi-File Call Graph Analyzer
- Three-phase implementation (PR #206 Phase 1-3):
- Phase 1 - ImportResolver: Multi-file call graph analysis foundation
- Phase 2 - ImportParser: AST-based import statement parsing
- Phase 3 - MultiFileCallGraphAnalyzer: Complete multi-file traversal with BFS
- Key features:
- Automatic extraction of transitively-called functions across Python files
- Per-file code root tracking (fbcode, Python projects, Git repositories)
- Graceful fallback for files outside detected roots
- Integrated into reproducer for self-contained script generation
🎯 Unified 0-Based Indexing
- Breaking change (PR #211):
- All launch indices now use 0-based indexing
- Affects: trace processor, website components (KernelOverview, DiffViewer, StackDiffViewer, ArgumentViewer)
- CLI `--line` argument changed to 0-based (PR #205)
- Rationale:
- Consistency with Python conventions
- Alignment with existing `info` and `reproduce` commands
- Simpler code without +1/-1 conversions
⚡ Reproducer Enhancements
- Kernel name lookup (PR #209):
  - New `--kernel` argument to specify a kernel by name instead of line number
  - New `--launch-id` argument (0-based) to select a specific launch
  - Mutually exclusive with the `--line` argument
  - Example: `tritonparseoss reproduce trace.ndjson --kernel matmul_kernel --launch-id 2`
- AST-based dependency extraction (commit 8ad24f6):
  - Automatic extraction of dependent helper functions
  - Call graph analysis for transitive dependencies
  - Self-contained reproducers without manual function hunting
- Autotune disabler (commit 28486fc):
  - Automatically disables Triton's autotune decorator in generated scripts
  - New `utils.py` module with a `_disable_triton_autotune()` function
  - Works with both IMPORT and COPY kernel import modes
- Code formatting (commit 311e016):
  - Generated reproducers are now properly formatted
- Bug fixes:
🛠️ IR Extraction Tool
- New tool (PR #202):
  - `tritonparse/tools/extract_irs.py` for extracting Triton IRs from trace logs
  - Supports TTIR, TTGIR, LLIR, PTX, and other IR formats
  - Flexible output: flat or by-kernel directory structure
  - Comprehensive documentation in `tritonparse/tools/readme.md`
- Logger fix:
  - Fixed `NameError: 'logger' is not defined` in generated reproducers
  - Added proper logging initialization to templates
🔐 Infrastructure & CI/CD
- PyPI Trusted Publishing (PR #219):
- Migrated from API token to OIDC authentication
- Enabled package attestations for provenance
- No secrets management required
- On-Demand Nightly Publishing (PR #216):
- Flexible PyPI publishing workflow
- Website build CI (PR #224):
- Added CI test for website builds
- Updated frontend dependencies
- Usage tracking (commit 89913ff):
- Extended usage_report_logger to track all subcommands and API calls
- Entry function detection via call stack traversal
- Added `skip_logger` parameter to prevent duplicate logging
🔧 Bug Fixes & Improvements
- CUDA Graph capture fix (PR #197):
- Fixed crash during CUDA graph capture in tensor argument extraction
- Detects capture mode and skips problematic operations
- Fixes compatibility with `triton.testing.do_bench_cudagraph`
- Gzip support (PR #207):
  - Added gzip support for the `load_ndjson()` function
- Compilation metadata (PR #198):
- Sort compilation metadata attributes alphabetically
- Import formatting (commit 86a2229):
- Format imports following Python style guide
- Debug message (commit a205e50):
- Added message for debugging when BlockPingpong exits early
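The gzip support mentioned above typically amounts to sniffing the gzip magic bytes before decoding. A minimal sketch of that pattern — not the actual `load_ndjson()` implementation:

```python
import gzip
import json
from pathlib import Path


def load_ndjson(path: str) -> list:
    """Load NDJSON records, transparently handling gzip-compressed files."""
    raw = Path(path).read_bytes()
    if raw[:2] == b"\x1f\x8b":  # gzip magic number
        raw = gzip.decompress(raw)
    return [json.loads(line) for line in raw.splitlines() if line.strip()]
```

Checking the magic bytes rather than the file extension means `trace.ndjson.gz` and a gzipped file with a plain `.ndjson` name both load correctly.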
📚 Documentation
- Wiki pages (PR #223):
- Added new wiki pages to documentation table
- Dependency cleanup (PR #225):
- Removed unnecessary npm overrides for prismjs and dompurify
Compatibility notes
- Breaking Change: All launch indices are now 0-based. Website displays and CLI arguments have been updated. If you have scripts relying on 1-based line numbers from `--line`, update them to use 0-based indices.
- New Features: The `info` subcommand and the `--kernel`/`--launch-id` reproducer options are additive and don't break existing workflows.
- Reproducer: Generated scripts now include the autotune disabler and dependent functions automatically. Templates have been updated with proper logger initialization.
Upgrade guidance
- Update index references: Change any 1-based line number references to 0-based indices.
- Use the info command: Replace manual trace file inspection with `tritonparseoss info <trace.ndjson>` to list kernels.
- Use kernel name lookup: Instead of `--line N`, use `--kernel <name> --launch-id <id>` for more intuitive reproducer generation.
- Extract IRs: Use the new `python -m tritonparse.tools.extract_irs <trace.ndjson>` for IR extraction tasks.
TritonParse v0.3.1 Release 🎉
TritonParse Release Notes (last 24 commits)
- Date range: 2025-10-14 — 2025-11-03
- Scope: IR Analysis enhancements (beta), Reproducer template extensions, code viewer improvements, bug fixes.
Highlights
- 📊 IR Analysis (Beta): New analysis capabilities for visualizing Software Pipelining (SWP), BufferOps statistics, and loop schedules in Triton IR. Note: This is a beta feature.
- 🏷️ Variable Location Tracking: Complete location alias tracking system for mapping IR locations back to source code with frontend visualization.
- 🔧 TritonBench Template: New reproducer template for easy TritonBench integration and kernel benchmarking.
- 🎨 Code Viewer Enhancements: Full Python source extraction, function highlighting, and performance optimizations.
- 🔄 Reproducer Refactoring: AST-based function extraction eliminates code duplication and simplifies template maintenance.
Changes by area
📊 IR Analysis (Beta)
- Software Pipelining (SWP) visualization (PR #189):
  - Analyzes inner `scf.for` loops and identifies prologue, loop_body, and epilogue stages
  - Tracks `tt.load` and `tt.dot` operations through TTIR → TTGIR → Python source mappings
  - Frontend displays simplified source code with SWP stage information
  - Limitations: does not support Warp Specialization or Blackwell operators yet
- BufferOps backend information (PR #181):
- Statistical analysis of buffer operations (tt.load/store, amdgpu.buffer_load/store, global_load/store) at TTGIR and AMDGCN levels
- Useful for AMD GPU backend optimization analysis
- Web frontend IR Analysis page (PR #184):
  - New dedicated page at the `/ir-analysis` route with integrated display for loop schedules and BufferOps statistics
🏷️ Variable Location Tracking
Complete three-part implementation (PR #186, #187, #188):
- Fixed #loc storage key conflict in IR parser
- Added location alias parsing support in `ir_parser.py` and `trace_processor.py`
- Frontend visualization with CSS styling and interactive location display in the Code Viewer
🔄 Reproducer System
- TritonBench template support (commit 3493ac8):
  - New template: `tritonparse/reproducer/templates/tritonbench.py`
  - CLI option: `--template tritonbench` for TritonBench-compatible reproducers
  - Integrates with TritonBench's `BenchmarkOperator` and benchmark harness
- AST-based refactoring (PR #178):
  - New module: `tritonparse/reproducer/function_extractor.py` using the Python AST
  - Simplified the `example.py` template from ~370 lines to ~20 lines
- Bug fixes:
📝 Callsite Location Support
- TTIR/TTGIR callsite location (PR #190):
- Extended IR parser to extract callsite location information
- Better debugging with call graph information and test coverage
💻 Code Viewer & Frontend
- Full Python source extraction (commit 2976887):
  - Enhanced `structured_logging.py` to extract complete Python source files
- Full file display with function highlighting (commit 220d5a4):
- CodeViewer now supports displaying entire source files with function-level highlighting
- CodeComparisonView performance optimization (commit c17e584):
- Significant rendering performance improvements for large files
- Reduced re-renders and improved memory efficiency
🌐 Website & Maintenance
- Dependency updates (PR #179): Added automation script `website/scripts/update_deps.sh`
- Copyright updates (PR #183): Updated copyright headers across source files
Compatibility notes
- No breaking changes: All updates are backward compatible with v0.3.0.
- IR Analysis (Beta): New optional feature accessible through web UI.
- TritonBench template: Optional, does not impact existing reproducer generation.
Upgrade guidance
- Using IR Analysis (Beta):
- Open web UI and navigate to IR Analysis page after parsing
- View SWP stage information (prologue/loop_body/epilogue) and BufferOps statistics
- Note: Beta feature with some limitations on advanced pipelining patterns
- Generating TritonBench reproducers:

  ```bash
  tritonparseoss reproduce trace.ndjson.gz --line <N> --template tritonbench --out-dir <output>
  ```
- Code viewer enhancements: Automatically enabled with full source display and function highlighting.
TritonParse v0.3.0 Release 🎉
TritonParse Release Notes (last 44 commits)
- Date range: 2025-09-19 — 2025-10-14
- Scope: Major feature release - Reproducer system, tensor storage, SASS support, enhanced context manager, CLI improvements.
Highlights
- 🔄 Reproducer System (Complete): Full-featured standalone kernel script generation with template support, tensor reconstruction, and multiple import modes. Extract any traced kernel into a self-contained Python script for debugging, testing, and sharing.
- 💾 TensorBlobManager: Production-ready content-addressed tensor storage with automatic compression, deduplication, quota management, and efficient disk usage. Enables high-fidelity kernel reproduction with actual tensor data.
- 🔧 SASS Disassembly Support: Optional NVIDIA SASS disassembly during compilation tracing for low-level debugging and performance analysis. Toggle via the `enable_sass_dump` parameter or the `TRITONPARSE_DUMP_SASS` environment variable.
- 🎯 Enhanced Context Manager: Configurable `TritonParseManager` context manager with support for trace launch control, inductor compilation splitting, and flexible parsing parameters.
- ⚡ CLI Modernization: Refactored to a subcommand structure (`tritonparseoss parse`, `tritonparseoss reproduce`) with a unified entry point and improved argument handling.
- 📊 Auto-enable Inductor Launch Tracing: Automatic detection and tracing of PyTorch Inductor-compiled kernels without manual configuration.
- 🌐 Website Improvements: Light mode color scheme, improved stack display in Launch Analysis, and better file diff navigation.
Changes by area
🔄 Reproducer System
- Complete reproducer infrastructure (PR #117-127):
  - CLI subcommand structure: `tritonparse reproduce <ndjson_file> [options]`
  - NDJSON ingestion layer with IR preservation
  - Context bundle system for kernel metadata and parameters
  - Standardized output paths: `repro_output/<kernel_name>/repro_<timestamp>.py`
  - Template support with a placeholder system for custom generation
  - Example templates for tensor loading and kernel invocation
  - Dynamic import generation for kernel dependencies
  - Kernel signature parsing and integration
  - Kernel invocation snippet generation with grid/block configuration
- Kernel import modes (PR #165, #166):
  - `--kernel-import direct`: Import kernel from source file
  - `--kernel-import override-ttir`: Override and inject TTIR for advanced debugging
  - Flexible kernel loading strategies for different debugging workflows
- Enhanced tensor handling (PR #141):
  - Improved tensor metadata logging (shape, dtype, stride, storage offset, device)
  - Better tensor reconstruction quality in generated reproducers
  - Support for non-contiguous tensors (commit 12f1d1b)
- Extensible placeholder system (PR #149):
  - Refactored placeholder replacement with a class-based design
  - Support for `{{KERNEL_IMPORT_PLACEHOLDER}}`, `{{KERNEL_INVOCATION_PLACEHOLDER}}`, `{{KERNEL_SYSPATH_PLACEHOLDER}}`, and `{{JSON_FILE_NAME_PLACEHOLDER}}`
  - Easy extension for future template needs
- Documentation: Comprehensive reproducer section in the README (PR #161) and a Usage Guide in the Wiki
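The placeholder system can be pictured as class-based token substitution over a template; the class names below are illustrative, not the actual reproducer internals:

```python
class Placeholder:
    """One named template token and a zero-arg callable that renders it."""

    def __init__(self, name, render):
        self.token = "{{" + name + "}}"
        self.render = render

    def apply(self, template: str) -> str:
        return template.replace(self.token, self.render())


def fill_template(template: str, placeholders: list) -> str:
    """Apply each placeholder in turn; unknown text is left untouched."""
    for ph in placeholders:
        template = ph.apply(template)
    return template
```

New template needs are then covered by adding another `Placeholder` rather than editing the substitution loop, which is the extensibility the release notes describe.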
💾 TensorBlobManager & Storage
- Production-ready blob storage (PR #156):
  - Content-addressed storage using BLAKE2b hashing
  - Automatic gzip compression for large tensors (>1MB)
  - Two-level directory structure (`xx/hash.bin.gz`) to avoid filesystem limits
  - Automatic deduplication: identical tensors are stored only once
  - Storage quota enforcement (default: 100GB)
  - Per-tensor size limit (default: 10GB) to prevent OOM
  - Real-time statistics: saved count, dedup hits, compression ratio
  - Graceful degradation with warning logs when the quota is exceeded
- Compression support (PR #157):
  - Configurable compression level (default: 4)
  - Atomic writes using temporary files + rename for safety
  - Hash verification for data integrity
- Comprehensive testing (PR #162):
  - Unit tests for compression, deduplication, and quota management
  - Edge case handling and cleanup verification
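The storage scheme described above (BLAKE2b addressing, two-level directories, gzip over a size threshold, dedup, atomic rename) can be sketched as follows; the class name, digest size, and omitted quota logic are simplified assumptions:

```python
import gzip
import hashlib
from pathlib import Path

COMPRESS_THRESHOLD = 1 << 20  # gzip blobs larger than 1 MB


class BlobStore:
    """Minimal content-addressed store: two-level dirs, gzip, dedup."""

    def __init__(self, root: Path):
        self.root = root
        self.dedup_hits = 0

    def save(self, data: bytes) -> Path:
        digest = hashlib.blake2b(data, digest_size=20).hexdigest()
        compress = len(data) > COMPRESS_THRESHOLD
        suffix = ".bin.gz" if compress else ".bin"
        # Two-level layout (xx/hash.bin[.gz]) keeps directories small.
        path = self.root / digest[:2] / (digest + suffix)
        if path.exists():  # identical content already stored
            self.dedup_hits += 1
            return path
        path.parent.mkdir(parents=True, exist_ok=True)
        payload = gzip.compress(data, compresslevel=4) if compress else data
        # Atomic write: temp file then rename, so readers never see partials.
        tmp = path.with_suffix(path.suffix + ".tmp")
        tmp.write_bytes(payload)
        tmp.rename(path)
        return path
```

Because the path is derived from the content hash, saving the same tensor twice hits the `path.exists()` branch and costs no extra disk.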
🔧 SASS Disassembly
- SASS extraction support (PR #137):
  - New tool: `tritonparse/tools/disasm.py` for CUBIN disassembly
  - Integration into structured logging behind an opt-in flag
  - Uses `nvdisasm -c -gp -g -gi` for detailed disassembly
  - Parses output to find function blocks with preserved labels and source mapping
- Configuration:
  - Environment variable: `TRITONPARSE_DUMP_SASS=1`
  - API parameter: `enable_sass_dump=True` in `structured_logging.init()`
  - The API parameter takes precedence over the environment variable
- Robustness:
- Error handling for subprocess failures, missing nvdisasm, and generic exceptions
- Writes marker messages instead of failing the trace
- Requires NVIDIA CUDA Binary Utilities (nvdisasm)
- CUDA testing (PR #138):
- Strengthened tests to validate SASS extraction and persistence
🎯 Context Manager & API
- Enhanced context manager (PR #144, #159):
  - Added an `__init__` method with configurable parameters:
    - `enable_trace_launch`: Control trace launch logging
    - `split_inductor_compilations`: Control inductor compilation splitting
    - `**parse_kwargs`: Additional arguments for `unified_parse`
  - Updated `__exit__` to pass parameters through to the parsing pipeline
  - More flexible for different use cases and workflows
- Split inductor compilations control:
  - Parameter threading: `unified_parse()` → `oss_run()` → `parse_logs()` → `parse_single_file()`
  - Renamed from `split_by_frame_id_and_compile_id` to `split_inductor_compilations` for clarity
  - Default `True`: splits by frame_id, frame_compile_id, attempt_id, compiled_autograd_id
  - When `False`: groups all inductor compilations together, following tlparse's convention
- Unit tests (commit a5338ce):
- Tests for enhanced context manager behavior
- Validation of split inductor compilation modes
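The configurable context-manager pattern can be illustrated with a minimal standalone sketch; this is not the real `TritonParseManager`, only the parameter-threading idea (tracing and parsing are stubbed out):

```python
class TraceManager:
    """Sketch: store options in __init__, forward them on __exit__."""

    def __init__(self, enable_trace_launch=False,
                 split_inductor_compilations=True, **parse_kwargs):
        self.enable_trace_launch = enable_trace_launch
        self.split_inductor_compilations = split_inductor_compilations
        self.parse_kwargs = parse_kwargs
        self.parsed_with = None  # records what would reach the parser

    def __enter__(self):
        # Real code would initialize structured logging here.
        return self

    def __exit__(self, exc_type, exc, tb):
        # On exit, forward the stored options to the parsing pipeline.
        self.parsed_with = {
            "split_inductor_compilations": self.split_inductor_compilations,
            **self.parse_kwargs,
        }
        return False  # never swallow exceptions
```

The point of the design is that any extra keyword argument given at construction reaches `unified_parse` unchanged, without the manager needing to know about it.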
⚡ CLI & Entry Points
- Subcommand structure (PR #117):
  - Refactored from a single-command to a modern subcommand architecture
  - `tritonparse parse <source> [options]` - Run the structured log parser
  - `tritonparse reproduce <ndjson_file> [options]` - Generate reproducers
  - Breaking change: the old `python run.py <source>` invocation no longer works
  - Extracted parser flags into `tritonparse.utils._add_parse_args()`
  - Removed `unified_parse_from_cli` (the programmatic `unified_parse()` remains)
- Unified entry point (PR #133):
- Added proper CLI entry point in package configuration
- Unified argument handling across commands
- CLI entry point fix (PR #154):
  - Fixed `ModuleNotFoundError` for the tritonparse CLI entry point
  - Improved package installation and command availability
📊 Logging & Tracing
- Auto-enable Inductor Launch Tracing (PR #142):
- Automatically detect and trace PyTorch Inductor-compiled kernels
- No manual configuration required for Inductor workflows
- Seamless integration with existing tracing infrastructure
- Kernel source path output (commit 03bc1e1):
  - Output `kernel_src_path` in trace metadata for better debugging
- NDJSON prettifier improvements (PR #135):
- Renamed and inverted flag to default-filter IRs
- More intuitive filtering behavior
- Debug flag deprecation (PR #132):
- Removed unused debugging flags
- Cleaner configuration surface
🌐 Website & UI
- Upgraded to Tailwind CSS v4 (commit 6c42d8a):
  - Migrated from the PostCSS plugin to `@tailwindcss/vite` for improved performance
  - Updated CSS import syntax from `@tailwind` directives to `@import "tailwindcss"`
  - Removed `tailwind.config.js` and `postcss.config.js` (now CSS-based configuration)
  - Updated `shadow` class naming to the v4 convention (`shadow` → `shadow-sm`)
  - Cleaned up global CSS to prevent interference with Tailwind utility classes
- Upgraded all frontend dependencies:
  - Vite: 6.3.5 → 7.1.10
  - React ecosystem: updated to latest versions (React 19+)
  - TypeScript: 5.7.2 → 5.7.3
  - Added `@types/node` for Node.js type definitions
  - Fixed a dompurify security vulnerability (3.1.7 → 3.3.0) via npm overrides
- Light mode color scheme (PR #139):
  - Updated `index.css` to support only light mode
  - Consistent, professional appearance
- Improved stack display (PR #151):
- Better stack trace visualization in Launch Analysis
- Clearer debugging information
- Documentation cleanup (PR #172):
- Removed redundant docs directory and screenshots
- Streamlined repository structure
🔧 Bug Fixes & Maintenance
- General bug fixes (PR #153):
- Multiple stability and reliability improvements
- Better error handling throughout codebase
- Deserialization fix (commit d4d7a20):
- Fixed unhandled types in deserialization
- More robust data loading
- README improvements (PR #158, #164):
- Refactored and cleaned up README
- Fixed command typos in reproducer generation examples
- Clearer installation and usage instructions
- Test cleanup (PR #160):
- Removed deprecated test for triton_kernels Tensor functionality
- Updated test suite for current codebase
Compatibility notes
- Breaking Change: The CLI now uses a subcommand structure. The old usage `python run.py <source>` must be updated to `tritonparse parse <source>` or `python run.py parse <source>`.
- New Dependencies: SASS disassembly requires NVIDIA CUDA Binary Utilities (`nvdisasm`). This is optional and only needed if `enable_sass_dump=True`.
- Storage: TensorBlobManager introduces a new blob storage directory structure. The default quota is 100GB; configure it via `TensorBlobManager` initialization if needed.
- Context Manager API: Enhanced with new parameters. Fully backward compatible with sensible defaults.
...
TritonParse v0.2.3 Release 🎉
TritonParse Release Notes (last 15 commits)
- Date range: 2025-09-13 — 2025-09-18
- Scope: Website UI/UX, core library, CI/CD & packaging, documentation & testing.
Highlights
- Website File Diff tooling: Introduced a new Diff Comparison view and File Diff page, preserved diff sessions across navigation, integrated Monaco editor, added preview mode, and shipped a round of UI polish with a URL redirect fix for File Diff navigation.
- Kernel Overview: Added a tiled kernel view toggle to improve dense overviews.
- Core: Added lazy-import support for Triton repo `triton_kernels` custom types, an attribution check for `torch._utils_internal`, and safer file mapping cleanup in the log parser.
- CI/Packaging: Refactored dependencies in `pyproject.toml`, removed a legacy Triton install script, and updated GitHub Actions workflows.
- Docs & tests: Improved README guidance; added tests and example outputs; minor UI bug fix in `CopyCodeButton` SVG attributes.
Changes by area
- Website UI/UX
  - Introduce `DiffComparisonView` and `FileDiffView`; maintain diff session state; integrate Monaco editor; preview mode; UI polish and navigation fixes.
  - Add tiled kernel view toggle in `KernelOverview`.
- Core library
  - Lazy-import support for `triton_kernels` custom types; extend tensor handling in tests.
  - Add attribution check for `torch._utils_internal`.
  - Refactor file mapping cleanup in `parse_logs`.
- CI/CD & packaging
  - Refactor dependencies in `pyproject.toml`; remove `.ci/install-triton-pip.sh`.
  - Update GitHub Actions workflows; add helper for `triton_kernels` in CI.
- Docs & testing
  - Clarify tool purpose and installation in `README.md`.
  - Add tests and sample outputs; small UI component fixes.
Compatibility notes
- No breaking changes expected. `triton_kernels` support is optional via lazy import.
Upgrade guidance
- Reinstall website dependencies if developing the UI to pick up the Monaco editor.
TritonParse v0.2.0 Release 🎉
TritonParse Release Notes (last 27 commits)
- Date range: 2025-07-25 — 2025-09-11
- Scope: Core library, website UI/UX, performance & scalability, CI/CD & packaging, documentation & maintenance.
Highlights
- PyPI package: TritonParse has been added to PyPI and can be installed with `pip install tritonparse`!
- Website usability: Drag-and-drop to open logs; one-click copy in code viewers; sticky, compact kernel selector; footer shows app version, localized build date, and Git short SHA; tensor arguments in Launch Analysis now display concise summaries with expandable details.
- Large-file parsing: Streaming NDJSON parsing and robust gzip handling significantly reduce memory usage and improve stability for files >100 MB.
- Core & integrations: Persist Inductor kernel config into `inductor_metadata` and pass it to JIT hooks; ensure the Inductor path invokes `jit_post_compile_hook`; new `init_with_env` for environment-based initialization; move compilation timing `times` into `metadata` for automatic frontend rendering.
- Releases & versioning: Adopt setuptools-scm dynamic versioning; add nightly PyPI publishing; enable stable publishing on tag push; fix nightly version potentially being older than stable; correct packaging license metadata.
- CI stability: Ubuntu 24.04 compatibility; improved CUDA/cuDNN setup and detection; parallelize jobs; add parallel CI for pip-installed Triton; better error visibility in install scripts; upgrade libstdc++.
Changes by area
- Core library
  - Save Inductor kernel params to `inductor_metadata` and forward them to JIT hooks.
  - Manually invoke `jit_post_compile_hook` in the Inductor Triton compile path.
  - Add `init_with_env`, which reads `TRITON_TRACE_FOLDER` and `TRITON_TRACE_LAUNCH`.
  - Move compilation `times` into `metadata` so the frontend auto-renders it.
  - Use cached source in the compile listener for stability.
  - Refactor the source-mapping pipeline into modular units for maintainability.
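The `init_with_env` idea reduces to reading those two environment variables and deriving the tracing configuration; a hedged sketch of the pattern (the return shape and the "unset means disabled" behavior are assumptions):

```python
import os


def init_with_env():
    """Initialize tracing from environment variables (sketch of the idea)."""
    folder = os.environ.get("TRITON_TRACE_FOLDER")
    if folder is None:
        return None  # tracing not requested
    trace_launch = os.environ.get("TRITON_TRACE_LAUNCH", "0") == "1"
    return {"trace_folder": folder, "enable_trace_launch": trace_launch}
```

Driving initialization from the environment lets tracing be enabled on an unmodified workload, e.g. `TRITON_TRACE_FOLDER=/tmp/traces python train.py`.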
- Website UI/UX
  - Drag-and-drop to open supported log files.
  - Copy button in code viewer panels.
  - Sticky/collapsible/compact kernel selector in Kernel Overview; vertically resizable compilation stack trace.
  - Launch Analysis: tensor args show concise summaries with expandable details.
  - Footer displays version, localized build date, and Git short SHA.
  - Streaming NDJSON parsing and improved error handling for large logs.
- Performance & scalability
  - Use a streaming path for files >100 MB to reduce memory peaks and improve robustness.
- CI/CD & packaging
  - Enable setuptools-scm and nightly PyPI publishing.
  - Publish stable releases on tag push; improve version computation and tag detection.
  - Fix nightly version possibly lagging behind stable; add a clear error on missing tags.
  - Add parallel CI for pip-installed Triton; recommend pip installation in docs.
  - Improve Ubuntu 24.04 setup, CUDA/cuDNN handling, and job parallelism.
  - Increase error visibility in install scripts and upgrade libstdc++.
  - Define lower bounds for prerequisites in `pyproject.toml`.
- Docs & maintenance
  - Move repository to the `meta-pytorch` org; update links and guidance; add AI assistant context.
  - Update/restore CONTRIBUTING docs to avoid breaking downstream consumers.
- Testing
  - Preserve test outputs when `TEST_KEEP_OUTPUT=1` to aid debugging.
Compatibility notes
- Versioning & publishing: setuptools-scm with tag-based stable releases and nightly dev versions. Ensure `PYPI_API_TOKEN` is configured in CI if publishing is intended.
- Data format: compilation timing `times` moved under `metadata`; update any downstream scripts that referenced the old location.
- Build metadata: the footer shows localized build date and Git short SHA; restart the dev server to refresh these values.
Upgrade guidance
- Prefer Triton from PyPI (≥ 3.4.0) and adhere to the lower bounds declared in `pyproject.toml`.
- For deterministic build metadata in the website, set `BUILD_DATE` and `GIT_COMMIT_SHA_SHORT` in the environment when running dev/build.