
Optimize Capstone disassembly performance across the stack#25721

Open
trufae wants to merge 2 commits into master from claude/optimize-capstone-performance-6ewGZ

Conversation


@trufae trufae commented Apr 4, 2026

Summary

  • Capstone X86: Replace binary search over 15K entries with thread-safe per-handle O(1) lookup tables for find_insn(), X86_insn_reg_intel(), and X86_insn_reg_att()
  • Capstone SStream: Add fast integer-to-string formatters to bypass expensive vsnprintf calls
  • ARM plugin: Switch from allocating cs_disasm() to stack-based cs_disasm_iter(), eliminating malloc/free per instruction
  • x86 plugin: Replace double-allocation mnemonic formatting with single malloc + in-place edit; remove redundant cs_option() calls
  • Core disasm: Compute decode mask once based on display settings, skip ESIL/OPEX generation when not needed
  • Analysis loop: Remove unconditional ESIL generation from function analysis for non-ARM architectures

Thread safety

Lookup tables are stored per-handle in cs_struct fields (not static globals), built during X86_global_init(), and freed in cs_close(). Multiple Capstone handles can coexist safely across threads.

Profiling methodology

Built a standalone benchmark disassembling 65KB of /bin/ls x86_64 code with CS_OPT_DETAIL enabled. Used callgrind for instruction-level profiling. The top bottlenecks before optimization:

Function            % of cycles  Root cause
X86_get_op_access   14.3%        find_insn() binary search (log2(15K) ≈ 14 comparisons)
X86_get_insn_id     7.6%         Same binary search, second call per instruction
X86_insn_reg_intel  5.7%         Two more binary searches per instruction
SStream_concat      4.1%         vsnprintf for integer formatting

Benchmark results (Capstone layer)

Mode                          Before           After            Speedup
cs_disasm (with detail)       2.03M insns/sec  2.42M insns/sec  +19%
cs_disasm_iter (with detail)  2.43M insns/sec  3.16M insns/sec  +30%
cs_disasm_iter (no detail)    3.63M insns/sec  4.09M insns/sec  +13%

Callgrind total instructions: 664M → 525M (-21%)

Changes by layer

Capstone (shipped as optimize-x86-lookup-tables.patch)

  • Per-handle O(1) lookup tables in cs_struct (~120KB per handle)
  • find_insn_h() + X86_insn_reg_{intel,att}_h() fast variants
  • Binary search fallback for decoder paths without handle access
  • Fast SStream_concat_num() bypassing vsnprintf
  • memset for MCInst tied_op_idx initialization

libr/arch/p/arm/plugin_cs.c

  • cs_disasm() → cs_disasm_iter() (zero allocations per instruction)
  • Direct malloc+memcpy mnemonic construction instead of r_str_newf

libr/arch/p/x86/plugin_cs.c

  • Single malloc + in-place memmove for "ptr " removal (was r_str_newf + r_str_replace = 2 allocs)
  • Remove per-instruction cs_option(CS_OPT_DETAIL) (already set in init)
  • Inline cs_len_prefix_opcode() to eliminate function call overhead

libr/core/disasm.c

  • Compute decode_mask once from asm.emu/asm.cmt.esil settings
  • Use computed mask in all 4 main disasm loops instead of R_ARCH_OP_MASK_ALL
  • Use R_ARCH_OP_MASK_BASIC for color-only decode path
  • Skip ESIL and OPEX generation when not displayed (~30% of decode cost)

libr/anal/fcn.c

  • Conditional ESIL in analysis loop: only for ARM (needs pc,lr,= pattern matching)
  • x86/MIPS/other archs skip ESIL generation during function analysis

Testing

  • x86_64 asm: 1318 OK, 57 BR, 0 XX, 0 SK, 3 FX (identical to baseline)
  • arm64 asm: 658 OK, 25 BR, 0 XX, 0 SK, 0 FX (identical to baseline)
  • arm32 asm: 272 OK, 83 BR, 0 XX, 0 SK, 17 FX (identical to baseline)
  • aa analysis of /bin/ls completes successfully

Future optimization opportunities

  1. Printer bypass — X86_Intel_printInst is 51% inclusive cost; analysis-only callers don't need string output
  2. Lazy detail mode — skip cs_detail fill when caller only needs mnemonic
  3. MCInst pooling — avoid reinitializing ~2KB struct on every cs_disasm_iter call
  4. Batch decode API — amortize per-call overhead for sequential disassembly
  5. r_strbuf for display — replace r_str_newf in disasm display loops with pre-allocated buffers

https://claude.ai/code/session_01KDR9eBZ4vEAftFBQ2vuhmr

claude added 2 commits April 4, 2026 18:24
Replace binary search (log2(15K) ~14 comparisons) with direct lookup
tables for instruction mapping in the x86 Capstone decoder. Profiling
with callgrind showed find_insn() binary search was called 2-3x per
instruction and consumed ~22% of total disassembly cycles.

Changes in subprojects/capstone-v5:
- X86Mapping.c: Replace find_insn() binary search with O(1) direct
  index table (lazily built on first use, ~60KB memory)
- X86Mapping.c: Replace X86_insn_reg_intel/att binary searches with
  packed O(1) lookup tables
- SStream.c: Add fast integer-to-string formatters (fast_utoa_hex,
  fast_utoa_dec) replacing vsnprintf for number formatting
- SStream.c: Simplify SStream_concat0 overflow check
- MCInst.c: Use memset for tied_op_idx initialization instead of loop

Benchmark results (x86_64 disassembly, 65KB of /bin/ls, 50 iterations):
  cs_disasm:          2.03M -> 2.42M insns/sec (+19%)
  cs_disasm_iter:     2.43M -> 3.16M insns/sec (+30%)
  cs_disasm_iter (no detail): 3.63M -> 4.09M insns/sec (+13%)

Callgrind instruction count: 664M -> 525M (-21%)

Per-function improvements:
  X86_get_op_access:  95.1M -> 12.5M instructions (-87%)
  X86_get_insn_id:    50.8M -> 17.5M instructions (-65%)
  X86_insn_reg_intel: 30.1M ->  4.7M instructions (-84%)

All 1378 x86_64 asm tests pass (0 new failures).

https://claude.ai/code/session_01KDR9eBZ4vEAftFBQ2vuhmr
Thread-safe per-handle O(1) lookup tables in Capstone X86:
- Move lookup tables from static globals to cs_struct fields
- Build tables in X86_global_init(), free in cs_close()
- Add find_insn_h() and X86_insn_reg_{intel,att}_h() per-handle variants
- Keep binary search fallback for decoder paths without handle access
- Replace vsnprintf number formatting with fast custom formatters in SStream
- Use memset for MCInst tied_op_idx initialization

ARM plugin (plugin_cs.c):
- Switch from allocating cs_disasm() to stack-based cs_disasm_iter()
- Replace r_str_newf mnemonic construction with direct malloc+memcpy

x86 plugin (plugin_cs.c):
- Replace r_str_newf + r_str_replace (2 allocs) with single malloc + in-place memmove
- Remove redundant per-instruction cs_option(CS_OPT_DETAIL) call
- Inline cs_len_prefix_opcode() for branch penalty elimination

Core disassembly (disasm.c):
- Compute decode_mask once based on display settings (asm.emu, asm.cmt.esil)
- Skip ESIL/OPEX generation when not needed for display
- Use R_ARCH_OP_MASK_BASIC for color-only decode paths

Analysis (fcn.c):
- Remove R_ARCH_OP_MASK_ESIL from default analysis loop for non-ARM archs
- ESIL generation only when architecture needs it for pattern matching

https://claude.ai/code/session_01KDR9eBZ4vEAftFBQ2vuhmr
@trufae trufae changed the title Optimize Capstone x86 disassembly with O(1) lookup tables Optimize Capstone disassembly performance across the stack Apr 4, 2026