Releases: meta-pytorch/tritonparse
TritonParse v0.4.4 Release 🎉
TritonParse Release Notes v0.4.4 (23 commits)
- Date range: 2026-04-09 — 2026-04-22
- Scope: Feature release — new `compat_builder` module for automated Triton/LLVM compatibility mapping, PyTorch bisection support, AI-powered diff root cause analysis, CLP archive viewer support, and reproducer correctness fixes.
Highlights
- 🏗️ New `compat_builder` Module: Brand-new package (~2,085 lines across 8 modules) that automates generating `commits.csv` files for LLVM bumps in Triton. Uses a state-machine-driven workflow (`CompatBuilder`) with 7 phases, git bisect–based compatibility probing (build → import → smoke test), AI-powered fix generation via `ClaudeCodeClient`, CSV management with metadata headers, and a full CLI with `--resume`, `--verify`, and `--status` modes. Includes 200+ tests covering all pure-logic paths. Integrated into the main `tritonparse` CLI as the `compat-build` subcommand.
- 🔍 PyTorch Bisection Support (#377): Extends the bisect module (~1,030 new lines) to bisect PyTorch commits in addition to Triton/LLVM. The new `TorchBisector` class drives `git bisect` over a PyTorch repo using user-provided test scripts. Includes build infrastructure scripts for CUDA, cuSparseLt, and Magma installation, plus a `prepare_build_pytorch.sh` that sets up the PyTorch build environment. Accessible via `tritonparse bisect --target torch`.
- 🤖 AI-Powered Diff Root Cause Analysis: Adds Phase 2 AI analysis to `tritonparse diff --ai`. Deterministic diff results from Phase 1 (metadata, IR stats, source mappings, tensor values) are formatted as structured markdown and sent to an LLM, which returns root cause explanations as `DiffNote` objects. The architecture includes a Triton-expert system prompt, a priority-ordered context builder, and three response parsing strategies (JSON, structured markdown, raw text fallback). Supports both single-kernel and trace-level analysis with significance thresholds.
- 📦 CLP Archive Support in Log Viewer (#382): The web viewer can now load and parse CLP (Compressed Log Processor) archives directly, completing the pipeline started in #326 where structured logging gained CLP output support. Updates `DataSourceSelector`, `WelcomeScreen`, and `dataLoader.ts` to handle CLP file selection and decompression via `clp-ffi-js`.
- 🔧 OVERRIDE_TTIR Constexpr Interleaving Fix (#384): Fixes a `TypeError` that broke all `triton-mpp analyze` subcommands (ncu, barrier-analysis, plot-sm-occupancy) when kernel signatures interleave constexpr and non-constexpr parameters. The OVERRIDE_TTIR reproducer branch was removing constexpr args from positional lists, shifting the remaining args into the wrong positions. The fix passes all non-constexpr args as keyword args, eliminating position-dependent binding entirely.
- 📝 Documentation Overhaul: Moves all GitHub Wiki pages into a version-controlled `docs/` directory (~5,000 lines) with automatic wiki sync via GitHub Actions. Updates API signatures, adds documentation for the `diff`, `bisect`, and `compat-build` subcommands, fixes outdated environment variable references, and corrects test commands.
Changes by Area
🏗️ New compat_builder Module
- State Machine (`state.py`): `CompatBuildPhase` 7-phase enum (INITIALIZING → COMPLETED/FAILED), `CompatBuildState` dataclass with JSON serialization, `CompatStateManager` for persistence. 218 lines + 251 lines of tests.
- Core Builder (`builder.py`, PR2-01): `CompatBuilder` orchestrator driving the initialize → find_next_incompatible → record_pair → fix_incompatibility loop. 773 lines + 634 lines of tests.
- CSV Manager (`csv_manager.py`, PR2-02): `CSVManager` and `BumpBlock` for reading, validating, and writing single-bump CSV files with metadata headers. 261 lines + 413 lines of tests.
- AI Fixer (`ai_fixer.py`, PR3-01): AI-powered compatibility fixing following a two-phase (deterministic context + AI) pattern. System prompt encoding LLVM API change patterns, structured context builder, `AICompatFixer` orchestrator. 442 lines + 346 lines of tests.
- CLI (`cli.py`, PR3-02): Four modes — default build, `--resume`, `--verify`, `--status`. AI control flags (`--ai`/`--no-ai`, `--ai-model`) and worktree management. 364 lines + 225 lines of tests.
- CLI Integration (PR3-03): `compat-build` subparser wired into the main `tritonparse` CLI.
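The phase-driven workflow above can be sketched as a small state machine. This is a hedged illustration: only the INITIALIZING/COMPLETED/FAILED endpoints are named in these notes, so the intermediate phase names below are hypothetical placeholders, not `compat_builder`'s actual enum members.

```python
# Minimal sketch of a 7-phase-style build state machine. PROBING and FIXING
# are hypothetical stand-ins for the real intermediate phases.
from enum import Enum

class CompatBuildPhase(Enum):
    INITIALIZING = "initializing"
    PROBING = "probing"        # hypothetical intermediate phase
    FIXING = "fixing"          # hypothetical intermediate phase
    COMPLETED = "completed"
    FAILED = "failed"

ORDER = [CompatBuildPhase.INITIALIZING, CompatBuildPhase.PROBING,
         CompatBuildPhase.FIXING, CompatBuildPhase.COMPLETED]

def advance(phase: CompatBuildPhase) -> CompatBuildPhase:
    """Move to the next phase; terminal states stay put."""
    if phase in (CompatBuildPhase.COMPLETED, CompatBuildPhase.FAILED):
        return phase
    return ORDER[ORDER.index(phase) + 1]

assert advance(CompatBuildPhase.INITIALIZING) is CompatBuildPhase.PROBING
assert advance(CompatBuildPhase.COMPLETED) is CompatBuildPhase.COMPLETED
```

Persisting such a phase enum inside a JSON-serializable state object is what makes `--resume` and `--status` style modes cheap to support.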
🔍 Bisect Enhancements
- PyTorch Bisection (#377): New `TorchBisector` class (142 lines), shell scripts for CUDA/cuSparseLt/Magma installation and PyTorch builds (~644 lines), CLI extension with `--target torch`. 130 lines of tests.
- Torch Bisect Script Fixes (#383): Set up `CUDA_HOME`, install cuSparseLt libraries, and install CI requirements across all bisect build scripts.
- LLVM Path Comment Fix: Corrected misleading comments in bisect scripts about the `.llvm-project/` vs `llvm-project/` directory layout.
🤖 AI & Diff
- AI Root Cause Analysis for Diff: `diff/fb/ai/` module with system prompt, context builder, and `AIDiffAnalyzer` orchestrator. `--ai` flag for both single-kernel and trace-level diff modes. 390 lines + tests (moved to `tests/fb/diff/`).
- AI Diff Test Relocation: Moved fb-only AI diff tests from `tests/cpu/diff/` to `tests/fb/diff/` to prevent `ModuleNotFoundError` on GitHub CI.
🔧 Reproducer Fixes
- OVERRIDE_TTIR Constexpr Fix (#384): Pass non-constexpr args as keyword args in the override branch, preventing a `TypeError` when constexprs are interleaved with positional args. 123 lines of new tests.
- `num_warps_base` Extraction: Extract the original `num_warps` from the TTGIR `ttg.num-warps` module attribute during the parse phase, storing it as `metadata["num_warps_base"]`. Fixes warp-specialized kernels reporting inflated warp counts to the reproducer and viewer.
- Per-Hash Tensor Blob Saving (#380): The tensor blob saving counter changed from global to per-compilation-hash, so each autotuned config saves exactly one set of blobs instead of only the first winner. Benchmark (autotune timing) launches are now always skipped.
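The positional-shift bug behind the constexpr fix is easy to reproduce in plain Python. The sketch below is illustrative: `kernel` and its parameter names are hypothetical, with `BLOCK` standing in for a `tl.constexpr` parameter that the reproducer's override branch used to strip from the call.

```python
# Hypothetical kernel signature with a constexpr (BLOCK) interleaved between
# two ordinary pointer args.
def kernel(x_ptr, BLOCK, y_ptr):
    return {"x_ptr": x_ptr, "BLOCK": BLOCK, "y_ptr": y_ptr}

args = {"x_ptr": "X", "BLOCK": 128, "y_ptr": "Y"}
constexpr_names = {"BLOCK"}

# Buggy path: drop constexprs from the positional list. "Y" shifts into
# BLOCK's slot and y_ptr is left unbound -> TypeError.
positional = [v for k, v in args.items() if k not in constexpr_names]
try:
    kernel(*positional)
    raised = False
except TypeError:
    raised = True

# Fixed path: pass every non-constexpr arg by keyword, so positions are
# irrelevant; the constexpr value is supplied separately.
result = kernel(BLOCK=args["BLOCK"],
                **{k: v for k, v in args.items() if k not in constexpr_names})
assert raised and result["y_ptr"] == "Y"
```

Keyword binding makes the call robust to any interleaving of constexpr and non-constexpr parameters, which is exactly why it removes the whole class of position-dependent failures.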
🌐 Website & Viewer
- CLP Archive Loading (#382): `clp-ffi-js` integration for decompressing and parsing CLP archives in the browser-based log viewer.
- ESLint 10 Upgrade (#378): ESLint v9 → v10, react-hooks canary channel, React 19.2.5, Vite 8.0.7, TypeScript-ESLint 8.58.1.
- ESLint 10 Lint Fixes (#379): Comprehensive fixes across 10 files for new lint rules — lazy state initialization, `useCallback` wrapping, extracted utility modules, error cause chaining.
- Vite Security Bump (#381): Vite 8.0.7 → 8.0.8 (dependabot).
⚙️ Internal Improvements
- `TRITONPARSE_FB_MODE` Env Var: Override `is_fbcode()` detection with `TRITONPARSE_FB_MODE=0` (OSS) or `=1` (fbcode). Fixes an `ImportError` when running inside fbsource without Meta-internal dependencies.
- Torch as Hard Dependency: Removed the `TORCH_INSTALLED` conditional flag and 12 guard branches in `structured_logging.py`. Torch was already a de facto hard dependency.
- FileCheck Binary Detection: Check the package root, AMD backend, and NVIDIA backend paths (not just AMD), matching Triton's own `_filecheck.py` convention.
- `importlib.resources` for Procedure Checks: Fix `default_procedure_checks.json` loading in PAR archives by switching from `Path(__file__).parent` to `importlib.resources.files()`.
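The env-var escape hatch follows a common pattern: an explicit `"0"`/`"1"` override wins, anything else falls back to auto-detection. A minimal sketch, assuming a stand-in `_auto_detect` (the real `is_fbcode()` internals are not shown in these notes):

```python
import os

def is_fbcode(_auto_detect=lambda: False):
    """Env-var override first, auto-detection as the fallback.

    _auto_detect is a hypothetical stand-in for the real detection logic.
    """
    override = os.environ.get("TRITONPARSE_FB_MODE")
    if override == "0":
        return False
    if override == "1":
        return True
    return _auto_detect()

os.environ["TRITONPARSE_FB_MODE"] = "0"
assert is_fbcode() is False   # forced OSS mode, regardless of auto-detection
```

Setting `TRITONPARSE_FB_MODE=0` is the documented way to force OSS mode when running inside fbsource without the Meta-internal dependencies.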
📝 Documentation & CI
- Wiki → `docs/` Migration: 10 wiki pages (5,000+ lines) moved into the version-controlled `docs/` directory with automatic sync via GitHub Actions.
- Wiki Sync Regex Fix (#390): Escape the literal `)` in a sed extended regex to fix the `sync-wiki.yml` workflow.
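The class of bug fixed in #390 is the same in any regex engine: an unescaped literal `)` is parsed as a group terminator. The demo below uses Python's `re` rather than the actual `sync-wiki.yml` sed expression, and the pattern is illustrative only:

```python
import re

broken = False
try:
    re.compile(r"Release (v[0-9.]+))")   # stray ")" -> unbalanced parenthesis
except re.error:
    broken = True

# Escaping the literal parentheses makes the pattern valid.
m = re.search(r"Release \(v([0-9.]+)\)", "Release (v0.4.4)")
assert broken and m.group(1) == "0.4.4"
```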
Compatibility Notes
- `torch` is now a hard dependency: The `TORCH_INSTALLED` guard has been removed. Environments without PyTorch installed will fail at import time rather than silently degrading.
- `TRITONPARSE_FB_MODE` env var: New escape hatch for users running inside fbsource without full Meta-internal dependencies — set `TRITONPARSE_FB_MODE=0` to force OSS mode.
- No other breaking changes to the public API.
TritonParse v0.4.3 Release 🎉
- Date range: 2026-04-01 — 2026-04-08
- Scope: Bug-fix release - OVERRIDE_TTIR reproducer rewrite with stub kernel generation, warp-specialized kernel `num_warps` fix, Manifold upload scoping to fbcode MAST environments, an OSS atexit cleanup fix, and `_json_compat` extensions.
Highlights
- 🔧 OVERRIDE_TTIR Reproducer Rewrite (#376): Complete rewrite of the OVERRIDE_TTIR reproducer mode. The previous implementation was broken — it skipped defining the kernel function (causing a `NameError`), only worked for autotuned kernels, and discarded constexpr values. The new approach generates a stub `triton.jit` function (same name and params, `pass` body) wrapped with `triton.autotune` carrying the captured constexpr values, compile params, and `ir_override` pointing to the captured TTGIR. This eliminates the need to copy kernel source code and its transitive dependencies.
- 🐛 Warp-Specialized Kernel Reproducer Fix: Fixed a `ptxas` "Insufficient registers" failure when reproducing warp-specialized kernels. The Triton compiler overwrites `metadata["num_warps"]` with the post-expansion count (`ttg.total-num-warps`), causing the reproducer to double-inflate the warp count. The fix extracts the original `ttg.num-warps` from TTGIR module attributes instead.
- 🔒 Manifold Upload Scoping: Manifold upload is now only enabled by default in fbcode MAST environments (detected via `torch.version.git_version` and `MAST_HPC_JOB_NAME`), preventing a `ModuleNotFoundError` in OSS environments during atexit cleanup.
Changes by Area
🔧 Reproducer Enhancements
- OVERRIDE_TTIR Stub Generation (#376): New `stub_generator.py` (~137 lines) generates stub Triton functions and extracts constexpr values. The rewritten `_replace_kernel_import` for OVERRIDE_TTIR generates the stub + autotune config; `_replace_kernel_invocation` filters out constexpr/compile params (autotune provides them). Captured IR files are saved from the compilation event's `file_content` to `captured_irs/`. Uses `lru_cache` on `extract_params_from_source` to avoid redundant AST parses.
- Warp-Specialized num_warps Fix: At reproducer generation time, extracts the original `ttg.num-warps` from TTGIR module attributes instead of the inflated `metadata["num_warps"]`. The post-expansion count is preserved as `total_num_warps` for informational purposes.
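Pulling the warp counts out of TTGIR module attributes can be sketched with a small regex. The attribute names come from the notes above; the TTGIR snippet and the `parse_num_warps` helper are illustrative, not TritonParse's actual parser:

```python
import re

# Hypothetical TTGIR module header carrying both warp-count attributes.
TTGIR_HEADER = (
    'module attributes {"ttg.num-warps" = 4 : i32, '
    '"ttg.total-num-warps" = 8 : i32} {'
)

def parse_num_warps(ttgir: str) -> dict:
    """Extract the original and post-expansion warp counts, when present."""
    out = {}
    for key, name in [("num_warps_base", "ttg.num-warps"),
                      ("total_num_warps", "ttg.total-num-warps")]:
        m = re.search(rf'"{re.escape(name)}"\s*=\s*(\d+)', ttgir)
        if m:
            out[key] = int(m.group(1))
    return out

assert parse_num_warps(TTGIR_HEADER) == {"num_warps_base": 4,
                                         "total_num_warps": 8}
```

Using the pre-expansion value (4 here) for the reproducer avoids the double-inflation that tripped `ptxas` on warp-specialized kernels.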
⚡ JSON Compatibility Layer
- `_json_compat.py` Extensions: Added `load(f)` and `dump(obj, f)` file-based convenience wrappers delegating to the existing `loads()`/`dumps()` with file I/O wrapping.
- CUTracer Migration: All 14 CUTracer production Python files migrated from stdlib `json` to `tritonparse._json_compat`, providing a free 3-10x JSON performance upgrade via orjson with graceful degradation.
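The file-based wrappers are thin delegations, mirroring the stdlib `json` signatures. A minimal sketch (the stand-in `loads`/`dumps` here use stdlib `json`; the real `_json_compat.py` dispatches to orjson when available):

```python
import io
import json

def loads(s):                 # stand-in for _json_compat's existing loads()
    return json.loads(s)

def dumps(obj) -> str:        # stand-in for _json_compat's existing dumps()
    return json.dumps(obj)

def load(f):
    """Read a whole file object and delegate to loads()."""
    return loads(f.read())

def dump(obj, f):
    """Serialize with dumps() and write to a file object."""
    f.write(dumps(obj))

buf = io.StringIO()
dump({"a": 1}, buf)
buf.seek(0)
assert load(buf) == {"a": 1}
```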
🔒 Manifold Upload & OSS Fixes
- Scoped Default (#374, 3337a0c): `TRITONPARSE_TRACE_MANIFOLD` now defaults to `"0"` (OFF) and is only auto-enabled when running in fbcode and in a MAST environment. The env var override still works in all environments.
- OSS atexit Fix (#374): Gated the Manifold upload path in `_cleanup()` behind `is_fbcode()` to prevent `ModuleNotFoundError: No module named 'tritonparse.fb'` during atexit in OSS environments.
🏗️ Infrastructure & CI
- Packaging Workaround (#370): Added an explicit `pip install packaging` step in CI setup to work around a PyTorch nightly (2.12.0.dev20260405+) missing dependency on the `packaging` module.
- Pin Node.js in CI: Pinned the Node.js version in GitHub Actions CI workflows for reproducible builds.
- Website Dependencies: Upgraded website dependencies and fixed Vite 8 / ESLint compatibility. Bumped vite from 8.0.3 to 8.0.5 (security fix).
- Internal Repo Re-sync (#375): Cleaned up Claude Code configuration files that were incorrectly synced to the OSS repository.
Compatibility Notes
- No breaking API changes: This is a bug-fix release; the only user-visible behavior changes are the defaults noted below.
- Manifold upload default changed: `TRITONPARSE_TRACE_MANIFOLD` now defaults to OFF in non-fbcode environments. Users who relied on the previous ON default in OSS should explicitly set `TRITONPARSE_TRACE_MANIFOLD=1`.
- OVERRIDE_TTIR reproducer: The reproducer output format for OVERRIDE_TTIR mode has changed (stub kernel + autotune wrapper instead of a source copy), but the generated reproducers are functionally equivalent and more reliable.
Upgrade Guidance
- Standard upgrade: `pip install --upgrade tritonparse`
- Warp-specialized kernel reproducers: Previously failing reproducers for warp-specialized kernels should now work correctly without manual intervention.
TritonParse v0.4.2 Release 🎉
TritonParse Release Notes v0.4.2 (45 commits)
- Date range: 2026-02-27 — 2026-03-30
- Scope: Feature release - New `ai` module for LLM-powered analysis, a whole-trace `--trace` diff mode with multi-strategy kernel matching, FileCheck-based procedure detection replacing hardcoded BlockPingpong, orjson performance optimization with a free-threading fallback, torch trace kernel attribution, JSON schema validation, and kernel-run-level tensor blob save controls.
Highlights
- 🤖 New `ai` Module: LLM client abstraction layer with an `LLMClient` ABC, `ClaudeCodeClient` for Claude Code CLI integration, `MockClient` for testing, and output parsers (`extract_json`, `extract_code_block`, `extract_diff_patch`). Foundation for AI-powered analysis features.
- 🔬 Whole-Trace Diff (`--trace` mode): Compare all kernels across two trace files with a single command. A multi-strategy `KernelMatcher` engine matches kernels by hash → name → source similarity → fuzzy name → config similarity. `TraceDiffEngine` orchestrates matching, per-pair diffing, and summary generation. Autotuning-aware: distinguishes truly absent kernels from unpaired autotuning compilations.
- 📋 FileCheck-Based Procedure Detection: Complete rewrite of IR analysis from hardcoded Python pattern matching to a JSON-driven, FileCheck-based system. Procedure definitions are declarative with configurable pattern checks and display attributes. Replaces the old `BlockPingpongCategory` with three configurable procedure configs (Small/Medium/Large). Tile size attributes (M, N, K, bits) are now displayed.
- ⚡ orjson Performance + Free-Threading Fallback: New `_json_compat.py` compatibility layer uses orjson for performance and falls back to stdlib json for CPython 3.14 free-threading builds. All 21 modules migrated. `orjson>=3.9` and `rich>=13.0` are now default dependencies.
- 🔍 Torch Trace Kernel Attribution: New torch trace log parser extracts `kernel_source_path → CompileInfo` mappings from inductor's output code events, enabling kernel-to-compilation-frame attribution when `pt_info` is missing. Wired through the parse pipeline and CLI via `--torch-trace-dir`.
- ✅ JSON Schema Validation: New `tritonparse/validation/` module with JSON schemas for the `compilation`, `launch`, `launch_diff`, and `ir_analysis` event types. A lightweight validator checks types, required fields, enums, numeric constraints, and `$ref` resolution.
- 🎛️ Kernel-Run-Level Tensor Blob Controls: New `TRITONPARSE_TENSOR_SAVE_SKIP_RUNS` and `TRITONPARSE_TENSOR_SAVE_MAX_RUNS` environment variables (and Python API) for fine-grained control over which kernel runs get tensor blob snapshots.
Changes by Area
🤖 New ai Module
A new `tritonparse/ai/` module (~1,400 lines) providing LLM client abstractions:
- LLM Client ABC (PR-1): `LLMClient` abstract base with `chat()` and `chat_stream()` interfaces; `Message`, `Response`, `ToolCall` dataclasses; `MockClient` for testing
- ClaudeCodeClient (PR-2): Production client wrapping the Claude Code CLI with temp-file shell escaping, session resumption, model selection, retry logic, and JSON/stream-JSON parsing
- Output Parsers (PR-3): `extract_json()`, `extract_code_block()`, `extract_diff_patch()` fallback parsers for LLM text responses; `format_messages()`, `truncate_context()` utilities
- Error Diagnostics: Improved error handling extracts the actual error from the stdout JSON `"result"` field instead of just stderr
🔬 Whole-Trace Diff (--trace mode)
A complete trace-level comparison system (~3,400 lines) with a layered architecture:
- Data Types: `MatchMethod` enum (HASH/NAME/SOURCE/FUZZY_NAME/CONFIG), `KernelMatchResult`, `TraceDiffResult`, `TraceDiffSummary`, `TraceStats`, `DtypeMismatch`
- KernelMatcher (~505 lines): Three-phase group-aware matching engine:
  - Phase 0: Hash-based exact matching (highest priority, cross-name capable)
  - Phase 1: Group-level matching by exact name → source similarity (threshold 0.75) → fuzzy name (threshold 0.7)
  - Phase 2: Within-group config pairing by (num_stages, num_warps, shared memory) similarity
  - Bounded sampling (`_MAX_GROUP_SAMPLES=5`) for performance on large traces
- TraceDiffEngine (~355 lines): Orchestrator computing trace stats → kernel matching → per-pair DiffEngine → summary generation
- Output: `TraceSummaryFormatter` for human-readable output; extended `ConsolidatedDiffWriter` with `add_trace_diff()`
- CLI: New `--trace` flag requiring exactly 2 input files
- Dtype Mismatch Detection: Surfaces dtype mismatches in tensor value comparison when argument names don't overlap
- Test Reorganization: The monolithic `test_diff.py` split into 7 focused files: `test_cli.py`, `test_diff_engine.py`, `test_fixtures.py`, `test_kernel_matcher.py`, `test_tensor_value.py`, `test_trace_diff.py`, `test_trace_output.py`
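The fuzzy-name phase above can be sketched with a standard similarity metric. Here `difflib.SequenceMatcher` stands in for whatever metric `KernelMatcher` actually uses; only the 0.75/0.7 thresholds are quoted from these notes:

```python
import difflib

def best_match(name, candidates, threshold=0.7):
    """Return the most similar candidate name, or None below the threshold."""
    scored = [(difflib.SequenceMatcher(None, name, c).ratio(), c)
              for c in candidates]
    score, cand = max(scored)
    return cand if score >= threshold else None

# A renamed kernel still pairs with its counterpart in the other trace...
assert best_match("matmul_kernel_v2",
                  ["matmul_kernel", "softmax_kernel"]) == "matmul_kernel"
# ...while an unrelated name is reported as truly unmatched.
assert best_match("attention", ["matmul_kernel", "softmax_kernel"]) is None
```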
CLI Usage:

```shell
# Compare all kernels across two trace files
tritonparseoss diff trace_a.ndjson trace_b.ndjson --trace

# With tensor value analysis
tritonparseoss diff trace_a.ndjson trace_b.ndjson --trace --tensor-values --atol 1e-5 --rtol 1e-3
```

📋 FileCheck-Based Procedure Detection
Complete rewrite of IR analysis (~2,200 lines) from hardcoded Python to JSON-driven FileCheck:
- FileCheck Integration: Auto-discovers the FileCheck binary from Triton's bundled version, the `FILECHECK_PATH` env var, or the system PATH
- JSON Configuration (`default_procedure_checks.json`): Declarative procedure definitions with `pattern_checks` (FileCheck patterns) and `display_attributes` (configurable extraction rules)
- Attribute Extraction: Multiple sources (`module_attrs`, `ir_content`, `computed`) with rules (`regex`, `count`, `dot_shape`, `tile_size_bits`, `pp_clusters`)
- Tile Size Display: New tile_m, tile_n, tile_k, tile_size_bits attributes
- BlockPingpong Migration: The old `BlockPingpongCategory` enum and ~254 lines of hardcoded Python replaced by three JSON-configured procedures (Small/Medium/Large)
- Website UI: Collapsible/foldable sections per procedure on the IRAnalysis page
- Streamlined Workflow: Procedure detection integrated into the main tritonparse parse pipeline
⚡ orjson Performance + Free-Threading Fallback
- `_json_compat.py` (new): Unified JSON compatibility layer — orjson when available, stdlib json fallback
  - `loads()` accepts `str | bytes | bytearray | memoryview`
  - `dumps()` returns `str` with `indent` and `sort_keys` support
  - Non-string key coercion in the fallback path (replicates orjson's `OPT_NON_STR_KEYS`)
- Global Migration (#362): All 21 modules migrated from `import json` to `from tritonparse._json_compat import loads, dumps, JSONDecodeError`
- Free-Threading Support (#365): Automatic stdlib json fallback for CPython 3.14 free-threading builds where orjson is unavailable
- Default Dependencies (#366): `orjson>=3.9` and `rich>=13.0` added to `pyproject.toml` dependencies (previously zero dependencies)
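The import-time dispatch behind the compatibility layer can be sketched in a few lines. This shows only the try/fallback shape; the real `_json_compat.py` additionally handles `indent`, `sort_keys`, and non-string keys:

```python
try:
    import orjson

    def loads(s):
        return orjson.loads(s)

    def dumps(obj) -> str:
        # orjson returns bytes; normalize to str for drop-in compatibility.
        return orjson.dumps(obj).decode("utf-8")

except ImportError:  # e.g. free-threading builds without orjson wheels
    import json

    def loads(s):
        return json.loads(s)

    def dumps(obj) -> str:
        return json.dumps(obj)

assert loads(dumps({"k": [1, 2]})) == {"k": [1, 2]}
```

Because both branches expose identical `loads`/`dumps` signatures, callers never need to know which backend was selected.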
🔍 Torch Trace Kernel Attribution
- Torch Trace Parser (#353): New `tritonparse/parse/torch_trace_parser.py` (~212 lines) parsing inductor's glog-formatted torch trace logs to extract `kernel_source_path → CompileInfo` mappings from `inductor_output_code` events
- Trace Processor Integration (#354): `_build_kernel_attribution_map()` and `_apply_kernel_attribution()` enrich compilation events with `pt_info` when missing (~126 lines)
- CLI & Pipeline Wiring (#355): New `--torch-trace-dir` flag with auto-discovery of torch trace files from the same parent directory
✅ JSON Schema Validation
- Schema Files (#356): Four JSON schemas for the `compilation`, `launch`, `launch_diff`, and `ir_analysis` event types
- Lightweight Validator (`json_validator.py`, ~287 lines): Validates required fields, types, enums, numeric constraints (min/max/exclusive), `additionalProperties`, array items, and `$ref` resolution
- `validate_trace_file()`: Full NDJSON trace file validation with a `max_errors` cap
- Schema Loader: `importlib.resources` for PAR compatibility, lazy loading with caching
- Test Suite: Comprehensive tests (~652 lines) covering all validation scenarios
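The flavor of dependency-free checking such a validator performs can be sketched as follows. The schema shape and field names here are hypothetical simplifications; the real `json_validator.py` additionally covers enums, numeric bounds, `additionalProperties`, and `$ref`:

```python
def validate_event(event, schema):
    """Collect error strings for missing required fields and type mismatches."""
    errors = []
    for field in schema.get("required", []):
        if field not in event:
            errors.append(f"missing required field: {field}")
    for field, expected in schema.get("types", {}).items():
        if field in event and not isinstance(event[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors

# Hypothetical mini-schema for a launch event.
launch_schema = {"required": ["event_type", "name"], "types": {"grid": list}}

assert validate_event({"event_type": "launch", "name": "k", "grid": [1]},
                      launch_schema) == []
assert validate_event({"event_type": "launch", "grid": 3}, launch_schema) == [
    "missing required field: name", "grid: expected list",
]
```

Collecting errors rather than raising on the first one is what makes a `max_errors`-capped whole-file NDJSON validation pass practical.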
🎛️ Tensor Blob Save Controls
- Skip/Max Runs Gating: New environment variables for fine-grained control:
  - `TRITONPARSE_TENSOR_SAVE_SKIP_RUNS`: Skip tensor blob saving for the first N kernel runs (default: 0)
  - `TRITONPARSE_TENSOR_SAVE_MAX_RUNS`: Save tensor blobs for at most N kernel runs after skipping (default: 0 = unlimited)
- Python API: `TritonParseManager(tensor_save_skip_runs=N, tensor_save_max_runs=M)` and `init(tensor_save_skip_runs=N, tensor_save_max_runs=M)`
- Autotune-Aware: Benchmark launches during autotune are excluded from run counting
- GPU Tests: End-to-end validation of skip/max runs gating
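The gating semantics described above reduce to a small predicate. A minimal sketch (the function name and signature are illustrative, not TritonParse's internal API):

```python
def should_save(run_index, skip=0, max_runs=0, is_benchmark=False):
    """Decide whether a kernel run gets a tensor blob snapshot.

    skip: drop the first `skip` counted runs.
    max_runs: after skipping, save at most this many runs (0 = unlimited).
    is_benchmark: autotune timing launches bypass the counter entirely.
    """
    if is_benchmark:
        return False
    if run_index < skip:
        return False
    if max_runs and run_index >= skip + max_runs:
        return False
    return True

# skip=2, max_runs=3: runs 0-1 skipped, runs 2-4 saved, run 5 not saved.
saved = [i for i in range(6) if should_save(i, skip=2, max_runs=3)]
assert saved == [2, 3, 4]
```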
🔧 Reproducer Enhancements
- CUDA Graph Capture Error (#359): Raise a clear `RuntimeError` when reproducing kernels launched during CUDA graph capture, explaining that argument extraction was skipped
- Kernel Name Fallback: Reproducer/info now falls back to matching by kernel name when the compilation hash is missing (Inductor kernels where the JIT hook didn't fire)
🔧 Bisect Enhancements
- `--triton-repo` Flag: Controls the culprit commit URL prefix — `oai` (triton-lang/triton, default) or `meta` (facebookexperimental/triton); state is persisted and restored on resume
- Rich as Default Dependency (#366): `rich>=13.0` moved from optional to default, simplifying bisect UI code
🏗️ Infrastructure & CI
TritonParse v0.4.1 Release 🎉
- Date range: 2026-01-22 — 2026-02-24
- Scope: Feature release - New `diff` CLI subcommand for kernel compilation comparison with tensor value analysis, autotune analysis visualization, profile-aware launch tracing, enhanced reproducer support, bisect auto-setup, and multi-format trace compression support.
Highlights
- 📊 Autotune Analysis: End-to-end autotune session tracking with frontend visualization. Automatically detects autotune sessions, tracks benchmark vs winner launches, displays configuration comparison tables, and shows winner run count statistics.
- 🔬 New `diff` CLI Subcommand (Beta): Complete kernel compilation diff system for comparing two compilation events. Supports metadata analysis, source mapping comparison, IR statistics diff, and tensor value comparison with configurable tolerances (`--tensor-values`, `--atol`, `--rtol`). Output can be appended in-place or written to new files. Note: This feature is in beta — APIs and output formats may change in future releases.
- ⚡ Profile-Aware Launch Tracing: Transparent integration with `torch.profiler` via `TRITON_TRACE_LAUNCH_WITHIN_PROFILING=1`. Monkey-patches `torch.profiler.schedule` to trace launches only during the profiler's RECORD phase.
- 🗜️ Multi-Format Compression: Added CLP (Compressed Log Processor) support alongside existing gzip. Trace compression is now disabled by default (`TRITON_TRACE_COMPRESSION=none`). Magic number detection enables transparent decompression.
- 🔧 Bisect Auto-Setup: New `--auto-env-setup` flag for `--llvm-only` bisect mode. Automatically clones/updates the Triton and LLVM repositories and creates conda environments.
- 📦 TMA Kernel Support: TensorDescriptor capture and reconstruction for TMA (Tensor Memory Accelerator) kernel reproducers.
Changes by Area
📊 Autotune Analysis
🔬 New diff CLI Subcommand (Beta)
A complete kernel compilation comparison system (~1,500 lines) with a layered architecture:
- Data Types (D1): `CompilationDiffResult`, `DiffNote`, `DiffSummary`, `IRStats`, `IRStatsDiff`, `MetadataDiff`, `TensorArgDiff`, `TensorValueDiff`
- Event Matching (D2): `match_events_by_index()`, `match_events_by_kernel()`, `find_launch_for_compilation()`
- Diff Engine (D3): Main `DiffEngine` class orchestrating all analyzers
- Metadata Analyzer (D4): Compares compilation metadata (num_warps, num_stages, etc.)
- Sourcemap Analyzer (D5): Compares source mappings between IRs
- Summary Generator (D6): Generates human-readable diff summaries
- Output Module (D7): `ConsolidatedDiffWriter`, `append_diff_to_file()`, `format_summary()`
- CLI Entry Point (D8): `tritonparseoss diff` command with `--events`, `--kernel`, `--tensor-values` flags
- Tensor Value Analyzer: Numeric tensor comparison with blob mode (full element-wise) and stats mode (min/max/mean/std fallback)
- Unit Tests: Phase 1 test coverage for core modules
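Tolerance-based comparison in the spirit of the Tensor Value Analyzer uses the standard closeness rule `|a - b| <= atol + rtol * |b|` (the same shape `torch.allclose` and `math.isclose` use). A minimal sketch with plain lists standing in for tensor blobs:

```python
def values_close(a, b, atol=1e-5, rtol=1e-3):
    """Element-wise closeness check: |x - y| <= atol + rtol * |y|."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

baseline = [1.0, 2.0, 3.0]
assert values_close([1.0, 2.0005, 3.0], baseline)    # within tolerance
assert not values_close([1.0, 2.5, 3.0], baseline)   # 0.5 off: mismatch
```

The `--atol`/`--rtol` flags feed exactly these two knobs: absolute tolerance dominates near zero, relative tolerance dominates for large magnitudes.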
CLI Usage:

```shell
# Compare compilations 0 and 1 in a single file
tritonparseoss diff trace.ndjson --events 0,1

# Compare with tensor value analysis
tritonparseoss diff trace.ndjson --tensor-values --atol 1e-5 --rtol 1e-3

# List available compilations
tritonparseoss diff trace.ndjson --list

# Filter by kernel name
tritonparseoss diff trace.ndjson --kernel matmul --events 0,1
```

⚡ Profile-Aware Launch Tracing
- New environment variable `TRITON_TRACE_LAUNCH_WITHIN_PROFILING=1`
- `patch_profiler_schedule()`: Monkey-patches `torch.profiler.schedule`
- `enable_launch_tracing()` / `disable_launch_tracing()` API
- Mutually exclusive with `TRITON_TRACE_LAUNCH` (validated at init)
- Unit tests for all three scenarios: no flag, trace all, profile-aware
🗜️ Compression Module
- Magic number detection: `detect_compression()` for gzip/zstd/none
- Transparent reading: `open_compressed_file()` context manager
- CLP format support (#326): `TRITON_TRACE_COMPRESSION="clp"` for the Compressed Log Processor format
- Default change: Compression disabled by default (was gzip)
- API functions: `is_gzip_file()`, `is_zstd_file()`, `iter_lines()`
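Magic-number detection is a prefix check: gzip streams start with `1f 8b` and zstd frames with `28 b5 2f fd`. A minimal sketch (the function name matches the notes above, but its actual signature in tritonparse may differ):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"

def detect_compression(first_bytes):
    """Classify a byte prefix as 'gzip', 'zstd', or 'none'."""
    if first_bytes.startswith(GZIP_MAGIC):
        return "gzip"
    if first_bytes.startswith(ZSTD_MAGIC):
        return "zstd"
    return "none"

assert detect_compression(gzip.compress(b"{}")) == "gzip"
assert detect_compression(b'{"event_type": "launch"}') == "none"
```

Sniffing the prefix instead of trusting the file extension is what makes decompression transparent to readers.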
🔧 Bisect Enhancements
- EnvironmentManager (#329-#332):
  - Auto-clone Triton and LLVM repositories from GitHub
  - Create/verify conda environments
  - `--auto-env-setup` CLI flag for `--llvm-only` mode
  - Status checking and diagnostics
  - Unit tests for all scenarios
📦 Reproducer Enhancements
- TensorDescriptor support (#344): Captures `base`, `shape`, `strides`, `block_shape`, `padding` for TMA kernels
- preserve_autotune mode (#328): Preserve autotune configs in reproducer scripts
- Robustness improvements: Complex kernel handling, function reference detection in call arguments (#348)
- Verbose args print placeholder (#347): Placeholder for verbose argument printing
- WS kernel fix (#349): Correct `num_warps` handling for Warp Specialization kernels
- Better logging (#346): Improved logging when black/isort are unavailable
🌐 Website UI Improvements
- KernelOverview page: New component for autotune analysis visualization (870 lines)
- WebSocket ArrayBuffer handling (#340): Direct trace ArrayBuffer via iframe messaging
- URL normalization (#324): Manifold Explorer and tritonparse URL handling
- Click-to-highlight tip: Added in CodeComparisonView
- Title navigation fix: TritonParse title returns to home
🏗️ Infrastructure & API
- SASS parsing refactor: `extract_sass_pc_mappings()` for PC-offset-keyed source mapping (for CUTracer integration)
- Rank-less file support (#341, #342): `--rank none` for parsing files without a rank suffix
- Launch without compilation (#336): Support launch events when the compilation was cached
- log_dir parameter (#337): `TritonParseManager(log_dir=...)` API
- Auto-switch log file: When the rank becomes available during execution
- Error message improvements (#339): Better diagnostics and bug fixes
- Meta copyright headers: Added to all scripts
- Dependabot prefix: `[dependabot]` prefix added to PR titles
- Negative line support (#319): `prettify_ndjson` handles negative line numbers
Compatibility Notes
- Default Change: Trace compression is now disabled by default. Set `TRITON_TRACE_COMPRESSION="gzip"` to restore v0.4.0 behavior.
- New Feature (Beta): The `diff` subcommand is additive and doesn't affect existing workflows. It is in beta — APIs and output formats may change.
- New Feature: Autotune analysis events are automatically generated; the frontend displays them when available.
- Mutual Exclusivity: `TRITON_TRACE_LAUNCH` and `TRITON_TRACE_LAUNCH_WITHIN_PROFILING` cannot both be set.
Upgrade Guidance
- Use diff for kernel comparison:

  ```shell
  # Basic diff
  tritonparseoss diff trace.ndjson --events 0,1

  # With tensor value comparison
  tritonparseoss diff trace.ndjson --tensor-values --kernel matmul
  ```

- Enable profile-aware tracing:

  ```shell
  TRITON_TRACE_LAUNCH_WITHIN_PROFILING=1 python train.py
  ```

- Use CLP compression (if available):

  ```shell
  TRITON_TRACE_COMPRESSION="clp" python train.py
  ```

- Bisect with auto-setup:

  ```shell
  tritonparseoss bisect --llvm-only --auto-env-setup \
    --triton-dir ~/oss-triton \
    --good-llvm abc123 --bad-llvm def456 \
    --test-script test.py
  ```

- TMA kernel reproducers: Now work automatically when TensorDescriptor arguments are present.
TritonParse v0.4.0 Release 🎉
TritonParse Release Notes v0.4.0 (115 commits)
- Date range: 2025-12-26 — 2026-01-21
- Scope: Major feature release - New `bisect` CLI subcommand for automated Triton/LLVM regression bisection, SASS source mapping support, BlockPingpong IR analysis, advanced filter syntax, and significant infrastructure improvements.
Highlights
- 🔍 New `bisect` CLI Subcommand: Complete regression bisection system for Triton and LLVM. Automatically finds culprit commits with `git bisect` integration, LLVM bump detection, commit pair testing, and a Rich TUI real-time progress display. Supports resumable workflows and multiple operation modes.
- 📊 SASS Source Mapping: Full SASS (NVIDIA assembly) source mapping support with fuzzy matching. Enables bidirectional mapping between SASS and other IR types (TTIR, TTGIR, PTX) in the website UI.
- 🔬 BlockPingpong Detection: New IR analysis capability to detect and categorize block pingpong scheduling patterns in TTGIR, with color-coded visualization in the website UI.
- 📦 Standalone Reproducer: New `--embed-context` flag embeds JSON context directly into generated Python scripts, creating fully self-contained single-file reproducers for easy sharing and bug reports.
- 🎛️ Advanced Filter Syntax: Enhanced `--args-list` filtering with support for nested properties (`C_ptr.dtype`), array indexing (`C_ptr.shape[0]`), and list matching (`C_ptr.shape=[3024, 10752]`).
- 🏗️ Infrastructure Modernization: Parse module refactored into a dedicated subdirectory, unified logging system, centralized SVG icons, test directory restructuring, and ESLint integration for the website.
Changes by Area
🔍 New bisect CLI Subcommand
A complete regression bisection system spanning ~6,000+ lines of code across 55+ PRs, organized in 7 architectural layers.
- Operation modes (PR-43 ~ PR-52):
  - `tritonparseoss bisect --good <commit> --bad <commit>` - Triton-only bisect
  - `--llvm-only` - Direct LLVM commit bisection
  - `--pair-test` - Test (Triton, LLVM) commit pairs from CSV
  - `--commits-csv` - Full 4-phase workflow (Triton bisect → LLVM bump detection → pair test → LLVM bisect)
  - `--resume` / `--status` - Resume an interrupted bisect or check status
- Core bisector architecture (PR-15 ~ PR-21):
  - `BaseBisector` - Abstract base class with the template method pattern
  - `TritonBisector` - Triton commit bisection with automatic build and test
  - `LLVMBisector` - LLVM commit bisection with Triton rebuild
  - Commit validation and correct bisect range detection
- Commit detection and pair testing (PR-22 ~ PR-27):
  - `CommitDetector` - Automatically detects LLVM version bump commits
  - `LLVMBumpInfo` - Captures old/new LLVM hash information
  - `PairTester` - CSV-driven (Triton, LLVM) commit pair testing
  - LLVM range filtering for efficient pair selection
- State management (PR-28 ~ PR-31):
  - `BisectPhase` enum: `TRITON_BISECT`, `TYPE_CHECK`, `PAIR_TEST`, `LLVM_BISECT`, `COMPLETED`, `FAILED`
  - `BisectState` dataclass with JSON serialization
  - `StateManager` for persistent state with auto-resume support
  - Automatic state file discovery (`find_latest_state()`)
- Rich TUI interface (PR-32 ~ PR-42):
  - `BisectUI` - Split-screen layout with progress and output panels
  - Real-time progress updates with phase, commit, and step information
  - Graceful fallback to plain text when Rich is unavailable
  - `print_final_summary()` - Summary with GitHub links
- Shell scripts (PR-06 ~ PR-13):
  - `bisect_triton.sh` - Triton build and test script for git bisect
  - `bisect_llvm.sh` - LLVM + Triton build with COMPAT_MODE support
  - `test_commit_pairs.sh` - Sequential pair testing with CSV support
  - `scripts/__init__.py` - Script path utilities
- Execution infrastructure (PR-01 ~ PR-05, PR-14):
  - `ShellExecutor` - Blocking and streaming command execution
  - `CommandResult` dataclass with duration tracking
  - `BisectLogger` - Dual logging (file + TUI callback)
  - `run_git_bisect_sequence()` - Complete git bisect workflow
  - `uv` package manager support via `config.py` (PR-54)
  - Clean build environment before each bisect step (PR-55)
- Unit tests (Test-PR-01 ~ Test-PR-03):
  - Tests for `state.py`, `commit_detector.py`, `pair_tester.py`
  - Tests for `executor.py` and `logger.py` (Layer 0)
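The JSON-serialized state that enables `--resume` can be sketched with a dataclass round-trip. This is a hedged illustration in the spirit of `BisectState`/`StateManager`; the field names below are illustrative, not the real schema:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class BisectState:
    phase: str = "TRITON_BISECT"
    good: str = ""
    bad: str = ""
    tested: int = 0

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, s: str) -> "BisectState":
        return cls(**json.loads(s))

# Round-trip: a resumed run reconstructs exactly the state that was saved.
saved = BisectState(phase="LLVM_BISECT", good="abc123", bad="def456", tested=7)
assert BisectState.from_json(saved.to_json()) == saved
```

Writing this blob to disk after every step is what lets an interrupted multi-hour bisect pick up where it left off.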
📊 SASS Source Mapping Support
- Fuzzy matching for SASS (commit 762844e):
  - New `extract_sass_mappings()` function in `ir_parser.py`
  - `ignore_column` parameter for fuzzy matching (SASS lacks column info)
  - Automatic fuzzy matching when source or target IR is "sass"
  - SASS comment line mapping (`//## File "/path", line N`)
  - Skip `.nv_debug_ptx_txt` debug file references
- Website UI integration (#249):
  - SASS code panel support in IR Code View
  - Bidirectional highlighting between SASS and other IRs
  - Updated default trace with SASS code (commit 1b2d6a9)
🔬 BlockPingpong Detection
- IR analysis enhancement (commits 50deca4, fe3092f, 0426510, 2dc0eac):
  - New BlockPingpong pattern detection in `ir_analysis.py` (~257 lines)
  - Automatic categorization of ping-pong scheduling patterns
  - Pattern matching descriptions for each category
  - Color-coded visualization in the website UI
  - Dedicated Pingpong section in the IR Analysis interface
📦 Reproducer Enhancements
- Standalone reproducer (#252):
  - New `--embed-context` CLI flag (default: False)
  - Embeds JSON context directly into the Python script
  - Creates a fully self-contained single-file reproducer
  - Ideal for sharing, bug reports, and archiving
- Compile params support (#295):
  - Pass compile parameters to kernel invocation
  - Fixes issue #277
- Improved identification (#293, #294):
  - `line_index` added to the reproducer filename
  - Metadata comments in generated scripts
- Bug fixes:
🎛️ Advanced Filter Syntax
- Nested property filtering (commit 3ee5df5):
  - Dot notation: `C_ptr.dtype=torch.bfloat16`
  - Array indexing: `C_ptr.shape[0]=3024`
  - List matching: `C_ptr.shape=[3024, 10752]`
  - Unified nested dict unwrapping across all value sources
  - Filter kernel launches by tensor metadata (shape, dtype, stride)
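As an illustration of how such filter expressions can be evaluated against launch argument metadata, here is a minimal self-contained sketch; the helper names and parsing rules are assumptions for exposition, not TritonParse's actual implementation:

```python
import re
from typing import Any

# Matches trailing array indices, e.g. "shape[0]" -> index 0.
_INDEX_RE = re.compile(r"\[(\d+)\]")


def lookup(args: dict, path: str) -> Any:
    """Resolve a dotted path like 'C_ptr.shape[0]' in nested metadata."""
    value: Any = args
    for part in path.split("."):
        name = part.split("[", 1)[0]
        value = value[name]
        for idx in _INDEX_RE.findall(part):
            value = value[int(idx)]
    return value


def matches(args: dict, expr: str) -> bool:
    """Evaluate one 'path=value' filter expression against launch args."""
    path, _, raw = expr.partition("=")
    expected: Any = raw
    if raw.startswith("["):  # list matching, e.g. shape=[3024, 10752]
        expected = [int(x) for x in raw.strip("[]").split(",")]
    elif raw.isdigit():
        expected = int(raw)
    return lookup(args, path) == expected
```

With metadata like `{"C_ptr": {"dtype": "torch.bfloat16", "shape": [3024, 10752]}}`, all three example expressions above would match.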
🌐 Website UI Improvements
- Code panel enhancements:
- Infrastructure:
🏗️ Infrastructure & Code Quality
- Module reorganization:
- Test infrastructure:
  - Test directory restructuring: `tests/cpu/` and `tests/gpu/`
  - Extract GPU TensorBlob, complex kernels, reproducer E2E tests
  - Extract GPU structured logging + context manager tests
  - Extract CPU tests to dedicated directory
  - CI workflow updated for new test structure
- Code formatting:
- Bug fixes:
  - Kernel selector overflow fix (commit b5c72b8)
  - Substring matching bug in call graph dependency filtering (commit 0ec75af)
  - PAR compatibility in function_extractor (commit 48551a2)
  - `ast.unparse()` for proper indentation in reproducer extraction (commit 1d8a33d)
  - `--kernel-import` help message fix (commit 18cf9d8)
  - `source_repo_dir` support for mapping production file paths (commit a952d99)
  - BisectLogger unique logger names per instance (#251)
📚 Documentation
- Simplified CHANGELOG.md with links to GitHub releases (#226)
- Website version bumped to 0.3.2 with dependency updates (#238)
Compatibility notes
- New Feature: The `bisect` subcommand is an additive feature that doesn't affect existing workflows.
- SASS Support: To use SASS source mapping, traces must include SASS IR (enable via `enable_sass_dump=True` or `TRITONPARSE_DUMP_SASS=1`).
- Filter Syntax: The new advanced filter syntax is backward compatible; existing filter expressions continue to work.
- Test Directory: Tests have been reorganized into `tests/cpu/` and `tests/gpu/` subdirectories.
Upgrade guidance
- Use bisect for regression hunting:

  ```bash
  # Basic Triton bisect
  tritonparseoss bisect --triton-dir /path/to/triton \
      --test-script test.py --good v2.0.0 --bad HEAD

  # Full workflow with LLVM bump detection
  tritonparseoss bisect --triton-dir /path/to/triton \
      --test-script test.py --good v2.0.0 --bad HEAD \
      --commits-csv pairs.csv

  # Resume interrupted bisect
  tritonparseoss bisect --resume

  # Check status
  tritonparseoss bisect --status
  ```

- Generate standalone reproducers:

  ```bash
  tritonparseoss reproduce trace.ndjson --kernel matmul --embed-context
  ```

- Use advanced filtering:

  ```bash
  tritonparseoss info trace.ndjson --args-list "C_ptr.shape[0]=302...
  ```
TritonParse v0.3.2 Release 🎉
TritonParse Release Notes v0.3.2 (34 commits)
- Date range: 2025-11-05 — 2025-12-22
- Scope: Major feature release - new `info` CLI subcommand, multi-file call graph analysis for reproducers, unified 0-based indexing, IR extraction tools, and infrastructure improvements.
Highlights
- 📊 New `info` CLI Subcommand: Query kernel information from NDJSON trace files without manual parsing. List all kernels with launch counts, view launches for specific kernels, and get fuzzy matching suggestions for kernel names.
- 🔍 Multi-File Call Graph Analyzer: Advanced AST-based analysis that automatically extracts all transitively-called functions across multiple Python files. Enables self-contained kernel reproducers with all dependencies included.
- 🎯 Unified 0-Based Indexing: All launch indices throughout the codebase (CLI, website, internal APIs) now use consistent 0-based indexing, following Python conventions.
- ⚡ Enhanced Reproducer: New `--kernel` and `--launch-id` arguments eliminate manual line number lookup. AST-based dependency extraction, autotune disabler, and code formatting for generated scripts.
- 🛠️ IR Extraction Tool: New command-line tool to extract Triton IRs (TTIR, TTGIR, LLIR, PTX) from trace logs with flexible output organization.
- 🔐 PyPI Trusted Publishing: Migrated from API token authentication to OIDC-based Trusted Publishing for improved security and attestations.
Changes by area
📊 New info CLI Subcommand
- Core query layer (PR #208):
  - New `tritonparse/info/` module for kernel information queries
  - `KernelSummary` and `LaunchInfo` dataclasses for structured results
  - `list_kernels()`: List all kernels with launch counts
  - `find_launch_index_by_kernel()`: Find the line index for a kernel's N-th launch
- CLI interface (PR #210):
  - `tritonparseoss info <trace.ndjson>` - List all kernels with launch counts
  - `tritonparseoss info <trace.ndjson> --kernel <name>` - List launches for a specific kernel
  - Auto-parsing: automatically detects and parses raw logs
  - Fuzzy matching suggestions when a kernel is not found
  - Performance optimization using `launch_diff` events when available
- Additional filtering (commit 8134195):
  - Added `--args-list` filtering to the info command
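Conceptually, the query layer scans launch events in the NDJSON trace and aggregates per-kernel counts. A simplified, self-contained approximation — the event field names here are assumptions about the trace schema, not the actual `tritonparse/info/` code:

```python
import json
from collections import Counter
from typing import Iterable


def list_kernels(lines: Iterable[str]) -> dict:
    """Count launches per kernel name from NDJSON trace lines."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the trace
        event = json.loads(line)
        if event.get("event_type") == "launch":
            counts[event.get("kernel_name", "<unknown>")] += 1
    return dict(counts)
```

The real implementation additionally returns structured dataclasses and can short-circuit via `launch_diff` events instead of counting raw launches.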
🔍 Multi-File Call Graph Analyzer
- Three-phase implementation (PR #206 Phase 1-3):
- Phase 1 - ImportResolver: Multi-file call graph analysis foundation
- Phase 2 - ImportParser: AST-based import statement parsing
- Phase 3 - MultiFileCallGraphAnalyzer: Complete multi-file traversal with BFS
- Key features:
- Automatic extraction of transitively-called functions across Python files
- Per-file code root tracking (fbcode, Python projects, Git repositories)
- Graceful fallback for files outside detected roots
- Integrated into reproducer for self-contained script generation
🎯 Unified 0-Based Indexing
- Breaking change (PR #211):
- All launch indices now use 0-based indexing
- Affects: trace processor, website components (KernelOverview, DiffViewer, StackDiffViewer, ArgumentViewer)
- CLI `--line` argument changed to 0-based (PR #205)
- Rationale:
- Consistency with Python conventions
- Alignment with existing `info` and `reproduce` commands
- Simpler code without +1/-1 conversions
⚡ Reproducer Enhancements
- Kernel name lookup (PR #209):
  - New `--kernel` argument to specify a kernel by name instead of line number
  - New `--launch-id` argument (0-based) to select a specific launch
  - Mutually exclusive with the `--line` argument
  - Example: `tritonparseoss reproduce trace.ndjson --kernel matmul_kernel --launch-id 2`
- AST-based dependency extraction (commit 8ad24f6):
  - Automatic extraction of dependent helper functions
  - Call graph analysis for transitive dependencies
  - Self-contained reproducers without manual function hunting
- Autotune disabler (commit 28486fc):
  - Automatically disables Triton's autotune decorator in generated scripts
  - New `utils.py` module with a `_disable_triton_autotune()` function
  - Works with both IMPORT and COPY kernel import modes
- Code formatting (commit 311e016):
  - Generated reproducers are now properly formatted
- Bug fixes:
🛠️ IR Extraction Tool
- New tool (PR #202):
  - `tritonparse/tools/extract_irs.py` for extracting Triton IRs from trace logs
  - Supports TTIR, TTGIR, LLIR, PTX, and other IR formats
  - Flexible output: flat or by-kernel directory structure
  - Comprehensive documentation in `tritonparse/tools/readme.md`
- Logger fix:
  - Fixed `NameError: 'logger' is not defined` in generated reproducers
  - Added proper logging initialization to templates
🔐 Infrastructure & CI/CD
- PyPI Trusted Publishing (PR #219):
- Migrated from API token to OIDC authentication
- Enabled package attestations for provenance
- No secrets management required
- On-Demand Nightly Publishing (PR #216):
- Flexible PyPI publishing workflow
- Website build CI (PR #224):
- Added CI test for website builds
- Updated frontend dependencies
- Usage tracking (commit 89913ff):
- Extended usage_report_logger to track all subcommands and API calls
- Entry function detection via call stack traversal
- Added `skip_logger` parameter to prevent duplicate logging
🔧 Bug Fixes & Improvements
- CUDA Graph capture fix (PR #197):
- Fixed crash during CUDA graph capture in tensor argument extraction
- Detects capture mode and skips problematic operations
- Fixes compatibility with `triton.testing.do_bench_cudagraph`
- Gzip support (PR #207):
  - Added gzip support for the `load_ndjson()` function
- Compilation metadata (PR #198):
- Sort compilation metadata attributes alphabetically
- Import formatting (commit 86a2229):
- Format imports following Python style guide
- Debug message (commit a205e50):
- Added message for debugging when BlockPingpong exits early
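The gzip support mentioned above typically amounts to sniffing the gzip magic bytes before decoding. A minimal sketch of that pattern — not the actual `load_ndjson()` implementation:

```python
import gzip
import json
from pathlib import Path


def load_ndjson(path: str) -> list:
    """Load NDJSON records, transparently handling gzip-compressed files."""
    raw = Path(path).read_bytes()
    if raw[:2] == b"\x1f\x8b":  # gzip magic number
        raw = gzip.decompress(raw)
    return [json.loads(line) for line in raw.splitlines() if line.strip()]
```

Checking the magic bytes rather than the file extension means `trace.ndjson.gz` and a gzipped file with a plain `.ndjson` name both load correctly.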
📚 Documentation
- Wiki pages (PR #223):
- Added new wiki pages to documentation table
- Dependency cleanup (PR #225):
- Removed unnecessary npm overrides for prismjs and dompurify
Compatibility notes
- Breaking Change: All launch indices are now 0-based. Website displays and CLI arguments have been updated. If you have scripts relying on 1-based line numbers from `--line`, update them to use 0-based indices.
- New Features: The `info` subcommand and the `--kernel`/`--launch-id` reproducer options are additive and don't break existing workflows.
- Reproducer: Generated scripts now include the autotune disabler and dependent functions automatically. Templates have been updated with proper logger initialization.
Upgrade guidance
- Update index references: Change any 1-based line number references to 0-based indices.
- Use the info command: Replace manual trace file inspection with `tritonparseoss info <trace.ndjson>` to list kernels.
- Use kernel name lookup: Instead of `--line N`, use `--kernel <name> --launch-id <id>` for more intuitive reproducer generation.
- Extract IRs: Use the new `python -m tritonparse.tools.extract_irs <trace.ndjson>` for IR extraction tasks.
TritonParse v0.3.1 Release 🎉
TritonParse Release Notes (last 24 commits)
- Date range: 2025-10-14 — 2025-11-03
- Scope: IR Analysis enhancements (beta), Reproducer template extensions, code viewer improvements, bug fixes.
Highlights
- 📊 IR Analysis (Beta): New analysis capabilities for visualizing Software Pipelining (SWP), BufferOps statistics, and loop schedules in Triton IR. Note: This is a beta feature.
- 🏷️ Variable Location Tracking: Complete location alias tracking system for mapping IR locations back to source code with frontend visualization.
- 🔧 TritonBench Template: New reproducer template for easy TritonBench integration and kernel benchmarking.
- 🎨 Code Viewer Enhancements: Full Python source extraction, function highlighting, and performance optimizations.
- 🔄 Reproducer Refactoring: AST-based function extraction eliminates code duplication and simplifies template maintenance.
Changes by area
📊 IR Analysis (Beta)
- Software Pipelining (SWP) visualization (PR #189):
  - Analyzes inner `scf.for` loops and identifies prologue, loop_body, and epilogue stages
  - Tracks `tt.load` and `tt.dot` operations through TTIR → TTGIR → Python source mappings
  - Frontend displays simplified source code with SWP stage information
  - Limitations: does not support Warp Specialization or Blackwell operators yet
- BufferOps backend information (PR #181):
- Statistical analysis of buffer operations (tt.load/store, amdgpu.buffer_load/store, global_load/store) at TTGIR and AMDGCN levels
- Useful for AMD GPU backend optimization analysis
- Web frontend IR Analysis page (PR #184):
  - New dedicated page at the `/ir-analysis` route with integrated display for loop schedules and BufferOps statistics
🏷️ Variable Location Tracking
Complete three-part implementation (PR #186, #187, #188):
- Fixed #loc storage key conflict in IR parser
- Added location alias parsing support in `ir_parser.py` and `trace_processor.py`
- Frontend visualization with CSS styling and interactive location display in the Code Viewer
🔄 Reproducer System
- TritonBench template support (commit 3493ac8):
  - New template: `tritonparse/reproducer/templates/tritonbench.py`
  - CLI option: `--template tritonbench` for TritonBench-compatible reproducers
  - Integrates with TritonBench's `BenchmarkOperator` and benchmark harness
- AST-based refactoring (PR #178):
  - New module: `tritonparse/reproducer/function_extractor.py` using the Python AST
  - Simplified the `example.py` template from ~370 lines to ~20 lines
- Bug fixes:
📝 Callsite Location Support
- TTIR/TTGIR callsite location (PR #190):
- Extended IR parser to extract callsite location information
- Better debugging with call graph information and test coverage
💻 Code Viewer & Frontend
- Full Python source extraction (commit 2976887):
  - Enhanced `structured_logging.py` to extract complete Python source files
- Full file display with function highlighting (commit 220d5a4):
- CodeViewer now supports displaying entire source files with function-level highlighting
- CodeComparisonView performance optimization (commit c17e584):
- Significant rendering performance improvements for large files
- Reduced re-renders and improved memory efficiency
🌐 Website & Maintenance
- Dependency updates (PR #179): Added automation script `website/scripts/update_deps.sh`
- Copyright updates (PR #183): Updated copyright headers across source files
Compatibility notes
- No breaking changes: All updates are backward compatible with v0.3.0.
- IR Analysis (Beta): New optional feature accessible through web UI.
- TritonBench template: Optional, does not impact existing reproducer generation.
Upgrade guidance
- Using IR Analysis (Beta):
- Open web UI and navigate to IR Analysis page after parsing
- View SWP stage information (prologue/loop_body/epilogue) and BufferOps statistics
- Note: Beta feature with some limitations on advanced pipelining patterns
- Generating TritonBench reproducers:

  ```bash
  tritonparseoss reproduce trace.ndjson.gz --line <N> --template tritonbench --out-dir <output>
  ```
- Code viewer enhancements: Automatically enabled with full source display and function highlighting.
TritonParse v0.3.0 Release 🎉
TritonParse Release Notes (last 44 commits)
- Date range: 2025-09-19 — 2025-10-14
- Scope: Major feature release - Reproducer system, tensor storage, SASS support, enhanced context manager, CLI improvements.
Highlights
- 🔄 Reproducer System (Complete): Full-featured standalone kernel script generation with template support, tensor reconstruction, and multiple import modes. Extract any traced kernel into a self-contained Python script for debugging, testing, and sharing.
- 💾 TensorBlobManager: Production-ready content-addressed tensor storage with automatic compression, deduplication, quota management, and efficient disk usage. Enables high-fidelity kernel reproduction with actual tensor data.
- 🔧 SASS Disassembly Support: Optional NVIDIA SASS disassembly during compilation tracing for low-level debugging and performance analysis. Toggle via the `enable_sass_dump` parameter or the `TRITONPARSE_DUMP_SASS` environment variable.
- 🎯 Enhanced Context Manager: Configurable `TritonParseManager` context manager with support for trace launch control, inductor compilation splitting, and flexible parsing parameters.
- ⚡ CLI Modernization: Refactored to a subcommand structure (`tritonparseoss parse`, `tritonparseoss reproduce`) with a unified entry point and improved argument handling.
- 📊 Auto-enable Inductor Launch Tracing: Automatic detection and tracing of PyTorch Inductor-compiled kernels without manual configuration.
- 🌐 Website Improvements: Light mode color scheme, improved stack display in Launch Analysis, and better file diff navigation.
Changes by area
🔄 Reproducer System
- Complete reproducer infrastructure (PR #117-127):
  - CLI subcommand structure: `tritonparse reproduce <ndjson_file> [options]`
  - NDJSON ingestion layer with IR preservation
  - Context bundle system for kernel metadata and parameters
  - Standardized output paths: `repro_output/<kernel_name>/repro_<timestamp>.py`
  - Template support with a placeholder system for custom generation
  - Example templates for tensor loading and kernel invocation
  - Dynamic import generation for kernel dependencies
  - Kernel signature parsing and integration
  - Kernel invocation snippet generation with grid/block configuration
- Kernel import modes (PR #165, #166):
  - `--kernel-import direct`: Import kernel from source file
  - `--kernel-import override-ttir`: Override and inject TTIR for advanced debugging
  - Flexible kernel loading strategies for different debugging workflows
- Enhanced tensor handling (PR #141):
  - Improved tensor metadata logging (shape, dtype, stride, storage offset, device)
  - Better tensor reconstruction quality in generated reproducers
  - Support for non-contiguous tensors (commit 12f1d1b)
- Extensible placeholder system (PR #149):
  - Refactored placeholder replacement with a class-based design
  - Support for `{{KERNEL_IMPORT_PLACEHOLDER}}`, `{{KERNEL_INVOCATION_PLACEHOLDER}}`, `{{KERNEL_SYSPATH_PLACEHOLDER}}`, and `{{JSON_FILE_NAME_PLACEHOLDER}}`
  - Easy extension for future template needs
- Documentation: Comprehensive reproducer section in the README (PR #161) and a Usage Guide in the Wiki
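The placeholder system can be pictured as class-based token substitution over a template; the class names below are illustrative, not the actual reproducer internals:

```python
class Placeholder:
    """One named template token and a zero-arg callable that renders it."""

    def __init__(self, name, render):
        self.token = "{{" + name + "}}"
        self.render = render

    def apply(self, template: str) -> str:
        return template.replace(self.token, self.render())


def fill_template(template: str, placeholders: list) -> str:
    """Apply each placeholder in turn; unknown text is left untouched."""
    for ph in placeholders:
        template = ph.apply(template)
    return template
```

New template needs are then covered by adding another `Placeholder` rather than editing the substitution loop, which is the extensibility the release notes describe.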
💾 TensorBlobManager & Storage
- Production-ready blob storage (PR #156):
  - Content-addressed storage using BLAKE2b hashing
  - Automatic gzip compression for large tensors (>1MB)
  - Two-level directory structure (`xx/hash.bin.gz`) to avoid filesystem limits
  - Automatic deduplication: identical tensors are stored only once
  - Storage quota enforcement (default: 100GB)
  - Per-tensor size limit (default: 10GB) to prevent OOM
  - Real-time statistics: saved count, dedup hits, compression ratio
  - Graceful degradation with warning logs when the quota is exceeded
- Compression support (PR #157):
  - Configurable compression level (default: 4)
  - Atomic writes using temporary files + rename for safety
  - Hash verification for data integrity
- Comprehensive testing (PR #162):
  - Unit tests for compression, deduplication, and quota management
  - Edge case handling and cleanup verification
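The storage scheme described above (BLAKE2b addressing, two-level directories, gzip over a size threshold, dedup, atomic rename) can be sketched as follows; the class name, digest size, and omitted quota logic are simplified assumptions:

```python
import gzip
import hashlib
from pathlib import Path

COMPRESS_THRESHOLD = 1 << 20  # gzip blobs larger than 1 MB


class BlobStore:
    """Minimal content-addressed store: two-level dirs, gzip, dedup."""

    def __init__(self, root: Path):
        self.root = root
        self.dedup_hits = 0

    def save(self, data: bytes) -> Path:
        digest = hashlib.blake2b(data, digest_size=20).hexdigest()
        compress = len(data) > COMPRESS_THRESHOLD
        suffix = ".bin.gz" if compress else ".bin"
        # Two-level layout (xx/hash.bin[.gz]) keeps directories small.
        path = self.root / digest[:2] / (digest + suffix)
        if path.exists():  # identical content already stored
            self.dedup_hits += 1
            return path
        path.parent.mkdir(parents=True, exist_ok=True)
        payload = gzip.compress(data, compresslevel=4) if compress else data
        # Atomic write: temp file then rename, so readers never see partials.
        tmp = path.with_suffix(path.suffix + ".tmp")
        tmp.write_bytes(payload)
        tmp.rename(path)
        return path
```

Because the path is derived from the content hash, saving the same tensor twice hits the `path.exists()` branch and costs no extra disk.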
🔧 SASS Disassembly
- SASS extraction support (PR #137):
  - New tool: `tritonparse/tools/disasm.py` for CUBIN disassembly
  - Integration into structured logging behind an opt-in flag
  - Uses `nvdisasm -c -gp -g -gi` for detailed disassembly
  - Parses output to find function blocks with preserved labels and source mapping
- Configuration:
  - Environment variable: `TRITONPARSE_DUMP_SASS=1`
  - API parameter: `enable_sass_dump=True` in `structured_logging.init()`
  - The API parameter takes precedence over the environment variable
- Robustness:
- Error handling for subprocess failures, missing nvdisasm, and generic exceptions
- Writes marker messages instead of failing the trace
- Requires NVIDIA CUDA Binary Utilities (nvdisasm)
- CUDA testing (PR #138):
- Strengthened tests to validate SASS extraction and persistence
🎯 Context Manager & API
- Enhanced context manager (PR #144, #159):
  - Added an `__init__` method with configurable parameters:
    - `enable_trace_launch`: Control trace launch logging
    - `split_inductor_compilations`: Control inductor compilation splitting
    - `**parse_kwargs`: Additional arguments for `unified_parse`
  - Updated `__exit__` to pass parameters through to the parsing pipeline
  - More flexible for different use cases and workflows
- Split inductor compilations control:
  - Parameter threading: `unified_parse()` → `oss_run()` → `parse_logs()` → `parse_single_file()`
  - Renamed from `split_by_frame_id_and_compile_id` to `split_inductor_compilations` for clarity
  - Default `True`: splits by frame_id, frame_compile_id, attempt_id, compiled_autograd_id
  - When `False`: groups all inductor compilations together, following tlparse's convention
- Unit tests (commit a5338ce):
- Tests for enhanced context manager behavior
- Validation of split inductor compilation modes
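The configurable context-manager pattern can be illustrated with a minimal standalone sketch; this is not the real `TritonParseManager`, only the parameter-threading idea (tracing and parsing are stubbed out):

```python
class TraceManager:
    """Sketch: store options in __init__, forward them on __exit__."""

    def __init__(self, enable_trace_launch=False,
                 split_inductor_compilations=True, **parse_kwargs):
        self.enable_trace_launch = enable_trace_launch
        self.split_inductor_compilations = split_inductor_compilations
        self.parse_kwargs = parse_kwargs
        self.parsed_with = None  # records what would reach the parser

    def __enter__(self):
        # Real code would initialize structured logging here.
        return self

    def __exit__(self, exc_type, exc, tb):
        # On exit, forward the stored options to the parsing pipeline.
        self.parsed_with = {
            "split_inductor_compilations": self.split_inductor_compilations,
            **self.parse_kwargs,
        }
        return False  # never swallow exceptions
```

The point of the design is that any extra keyword argument given at construction reaches `unified_parse` unchanged, without the manager needing to know about it.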
⚡ CLI & Entry Points
- Subcommand structure (PR #117):
  - Refactored from a single-command to a modern subcommand architecture
  - `tritonparse parse <source> [options]` - Run the structured log parser
  - `tritonparse reproduce <ndjson_file> [options]` - Generate reproducers
  - Breaking change: the old `python run.py <source>` invocation no longer works
  - Extracted parser flags into `tritonparse.utils._add_parse_args()`
  - Removed `unified_parse_from_cli` (the programmatic `unified_parse()` remains)
- Unified entry point (PR #133):
- Added proper CLI entry point in package configuration
- Unified argument handling across commands
- CLI entry point fix (PR #154):
  - Fixed `ModuleNotFoundError` for the tritonparse CLI entry point
  - Improved package installation and command availability
📊 Logging & Tracing
- Auto-enable Inductor Launch Tracing (PR #142):
- Automatically detect and trace PyTorch Inductor-compiled kernels
- No manual configuration required for Inductor workflows
- Seamless integration with existing tracing infrastructure
- Kernel source path output (commit 03bc1e1):
  - Output `kernel_src_path` in trace metadata for better debugging
- NDJSON prettifier improvements (PR #135):
- Renamed and inverted flag to default-filter IRs
- More intuitive filtering behavior
- Debug flag deprecation (PR #132):
- Removed unused debugging flags
- Cleaner configuration surface
🌐 Website & UI
- Upgraded to Tailwind CSS v4 (commit 6c42d8a):
  - Migrated from the PostCSS plugin to `@tailwindcss/vite` for improved performance
  - Updated CSS import syntax from `@tailwind` directives to `@import "tailwindcss"`
  - Removed `tailwind.config.js` and `postcss.config.js` (now CSS-based configuration)
  - Updated `shadow` class naming to the v4 convention (`shadow` → `shadow-sm`)
  - Cleaned up global CSS to prevent interference with Tailwind utility classes
- Upgraded all frontend dependencies:
  - Vite: 6.3.5 → 7.1.10
  - React ecosystem: updated to latest versions (React 19+)
  - TypeScript: 5.7.2 → 5.7.3
  - Added `@types/node` for Node.js type definitions
  - Fixed a dompurify security vulnerability (3.1.7 → 3.3.0) via npm overrides
- Light mode color scheme (PR #139):
  - Updated `index.css` to support only light mode
  - Consistent, professional appearance
- Improved stack display (PR #151):
- Better stack trace visualization in Launch Analysis
- Clearer debugging information
- Documentation cleanup (PR #172):
- Removed redundant docs directory and screenshots
- Streamlined repository structure
🔧 Bug Fixes & Maintenance
- General bug fixes (PR #153):
- Multiple stability and reliability improvements
- Better error handling throughout codebase
- Deserialization fix (commit d4d7a20):
- Fixed unhandled types in deserialization
- More robust data loading
- README improvements (PR #158, #164):
- Refactored and cleaned up README
- Fixed command typos in reproducer generation examples
- Clearer installation and usage instructions
- Test cleanup (PR #160):
- Removed deprecated test for triton_kernels Tensor functionality
- Updated test suite for current codebase
Compatibility notes
- Breaking Change: The CLI now uses a subcommand structure. The old usage `python run.py <source>` must be updated to `tritonparse parse <source>` or `python run.py parse <source>`.
- New Dependencies: SASS disassembly requires NVIDIA CUDA Binary Utilities (`nvdisasm`). This is optional and only needed if `enable_sass_dump=True`.
- Storage: TensorBlobManager introduces a new blob storage directory structure. The default quota is 100GB; configure it via `TensorBlobManager` initialization if needed.
- Context Manager API: Enhanced with new parameters. Fully backward compatible with sensible defaults.
...
TritonParse v0.2.3 Release 🎉
TritonParse Release Notes (last 15 commits)
- Date range: 2025-09-13 — 2025-09-18
- Scope: Website UI/UX, core library, CI/CD & packaging, documentation & testing.
Highlights
- Website File Diff tooling: Introduced a new Diff Comparison view and File Diff page, preserved diff sessions across navigation, integrated Monaco editor, added preview mode, and shipped a round of UI polish with a URL redirect fix for File Diff navigation.
- Kernel Overview: Added a tiled kernel view toggle to improve dense overviews.
- Core: Added lazy-import support for Triton repo `triton_kernels` custom types, an attribution check for `torch._utils_internal`, and safer file mapping cleanup in the log parser.
- CI/Packaging: Refactored dependencies in `pyproject.toml`, removed a legacy Triton install script, and updated GitHub Actions workflows.
- Docs & tests: Improved README guidance; added tests and example outputs; minor UI bug fix in `CopyCodeButton` SVG attributes.
Changes by area
- Website UI/UX
  - Introduce `DiffComparisonView` and `FileDiffView`; maintain diff session state; integrate Monaco editor; preview mode; UI polish and navigation fixes.
  - Add tiled kernel view toggle in `KernelOverview`.
- Core library
  - Lazy-import support for `triton_kernels` custom types; extend tensor handling in tests.
  - Add attribution check for `torch._utils_internal`.
  - Refactor file mapping cleanup in `parse_logs`.
- CI/CD & packaging
  - Refactor dependencies in `pyproject.toml`; remove `.ci/install-triton-pip.sh`.
  - Update GitHub Actions workflows; add helper for `triton_kernels` in CI.
- Docs & testing
  - Clarify tool purpose and installation in `README.md`.
  - Add tests and sample outputs; small UI component fixes.
Compatibility notes
- No breaking changes expected. `triton_kernels` support is optional via lazy import.
Upgrade guidance
- Reinstall website dependencies if developing the UI to pick up the Monaco editor.
TritonParse v0.2.0 Release 🎉
TritonParse Release Notes (last 27 commits)
- Date range: 2025-07-25 — 2025-09-11
- Scope: Core library, website UI/UX, performance & scalability, CI/CD & packaging, documentation & maintenance.
Highlights
- PyPI package: TritonParse has been added to PyPI and can be installed with `pip install tritonparse`!
- Website usability: Drag-and-drop to open logs; one-click copy in code viewers; sticky, compact kernel selector; footer shows app version, localized build date, and Git short SHA; tensor arguments in Launch Analysis now display concise summaries with expandable details.
- Large-file parsing: Streaming NDJSON parsing and robust gzip handling significantly reduce memory usage and improve stability for files >100 MB.
- Core & integrations: Persist Inductor kernel config into `inductor_metadata` and pass it to JIT hooks; ensure the Inductor path invokes `jit_post_compile_hook`; new `init_with_env` for environment-based initialization; move compilation timing `times` into `metadata` for automatic frontend rendering.
- Releases & versioning: Adopt setuptools-scm dynamic versioning; add nightly PyPI publishing; enable stable publishing on tag push; fix nightly version potentially being older than stable; correct packaging license metadata.
- CI stability: Ubuntu 24.04 compatibility; improved CUDA/cuDNN setup and detection; parallelize jobs; add parallel CI for pip-installed Triton; better error visibility in install scripts; upgrade libstdc++.
Changes by area
- Core library
  - Save Inductor kernel params to `inductor_metadata` and forward them to JIT hooks.
  - Manually invoke `jit_post_compile_hook` in the Inductor Triton compile path.
  - Add `init_with_env`, which reads `TRITON_TRACE_FOLDER` and `TRITON_TRACE_LAUNCH`.
  - Move compilation `times` into `metadata` so the frontend auto-renders it.
  - Use cached source in the compile listener for stability.
  - Refactor the source-mapping pipeline into modular units for maintainability.
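The `init_with_env` idea reduces to reading those two environment variables and deriving the tracing configuration; a hedged sketch of the pattern (the return shape and the "unset means disabled" behavior are assumptions):

```python
import os


def init_with_env():
    """Initialize tracing from environment variables (sketch of the idea)."""
    folder = os.environ.get("TRITON_TRACE_FOLDER")
    if folder is None:
        return None  # tracing not requested
    trace_launch = os.environ.get("TRITON_TRACE_LAUNCH", "0") == "1"
    return {"trace_folder": folder, "enable_trace_launch": trace_launch}
```

Driving initialization from the environment lets tracing be enabled on an unmodified workload, e.g. `TRITON_TRACE_FOLDER=/tmp/traces python train.py`.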
- Website UI/UX
  - Drag-and-drop to open supported log files.
  - Copy button in code viewer panels.
  - Sticky/collapsible/compact kernel selector in Kernel Overview; vertically resizable compilation stack trace.
  - Launch Analysis: tensor args show concise summaries with expandable details.
  - Footer displays version, localized build date, and Git short SHA.
  - Streaming NDJSON parsing and improved error handling for large logs.
- Performance & scalability
  - Use a streaming path for files >100 MB to reduce memory peaks and improve robustness.
- CI/CD & packaging
  - Enable setuptools-scm and nightly PyPI publishing.
  - Publish stable releases on tag push; improve version computation and tag detection.
  - Fix nightly version possibly lagging behind stable; add a clear error on missing tags.
  - Add parallel CI for pip-installed Triton; recommend pip installation in docs.
  - Improve Ubuntu 24.04 setup, CUDA/cuDNN handling, and job parallelism.
  - Increase error visibility in install scripts and upgrade libstdc++.
  - Define lower bounds for prerequisites in `pyproject.toml`.
- Docs & maintenance
  - Move repository to the `meta-pytorch` org; update links and guidance; add AI assistant context.
  - Update/restore CONTRIBUTING docs to avoid breaking downstream consumers.
- Testing
  - Preserve test outputs when `TEST_KEEP_OUTPUT=1` to aid debugging.
Compatibility notes
- Versioning & publishing: setuptools-scm with tag-based stable releases and nightly dev versions. Ensure `PYPI_API_TOKEN` is configured in CI if publishing is intended.
- Data format: compilation timing `times` moved under `metadata`; update any downstream scripts that referenced the old location.
- Build metadata: the footer shows localized build date and Git short SHA; restart the dev server to refresh these values.
Upgrade guidance
- Prefer Triton from PyPI (≥ 3.4.0) and adhere to the lower bounds declared in `pyproject.toml`.
- For deterministic build metadata in the website, set `BUILD_DATE` and `GIT_COMMIT_SHA_SHORT` in the environment when running dev/build.