
Optimize Capstone disassembly performance across the stack#25721

Open
trufae wants to merge 2 commits into master from claude/optimize-capstone-performance-6ewGZ

Conversation


@trufae trufae commented Apr 4, 2026

Summary

  • Capstone X86: Replace binary search over 15K entries with thread-safe per-handle O(1) lookup tables for find_insn(), X86_insn_reg_intel(), and X86_insn_reg_att()
  • Capstone SStream: Add fast integer-to-string formatters to bypass expensive vsnprintf calls
  • ARM plugin: Switch from allocating cs_disasm() to stack-based cs_disasm_iter(), eliminating malloc/free per instruction
  • x86 plugin: Replace double-allocation mnemonic formatting with single malloc + in-place edit; remove redundant cs_option() calls
  • Core disasm: Compute decode mask once based on display settings, skip ESIL/OPEX generation when not needed
  • Analysis loop: Remove unconditional ESIL generation from function analysis for non-ARM architectures

Thread safety

Lookup tables are stored per-handle in cs_struct fields (not static globals), built during X86_global_init(), and freed in cs_close(). Multiple Capstone handles can coexist safely across threads.

Profiling methodology

Built a standalone benchmark disassembling 65KB of /bin/ls x86_64 code with CS_OPT_DETAIL enabled. Used callgrind for instruction-level profiling. The top bottlenecks before optimization:

Function            % of cycles  Root cause
X86_get_op_access   14.3%        find_insn() binary search (log2(15K) ≈ 14 comparisons)
X86_get_insn_id     7.6%         Same binary search, second call per instruction
X86_insn_reg_intel  5.7%         Two more binary searches per instruction
SStream_concat      4.1%         vsnprintf for integer formatting

Benchmark results (Capstone layer)

Mode                          Before           After            Speedup
cs_disasm (with detail)       2.03M insns/sec  2.42M insns/sec  +19%
cs_disasm_iter (with detail)  2.43M insns/sec  3.16M insns/sec  +30%
cs_disasm_iter (no detail)    3.63M insns/sec  4.09M insns/sec  +13%

Callgrind total instructions: 664M → 525M (-21%)

Changes by layer

Capstone (shipped as optimize-x86-lookup-tables.patch)

  • Per-handle O(1) lookup tables in cs_struct (~120KB per handle)
  • find_insn_h() + X86_insn_reg_{intel,att}_h() fast variants
  • Binary search fallback for decoder paths without handle access
  • Fast SStream_concat_num() bypassing vsnprintf
  • memset for MCInst tied_op_idx initialization

libr/arch/p/arm/plugin_cs.c

  • cs_disasm() → cs_disasm_iter() (zero allocations per instruction)
  • Direct malloc+memcpy mnemonic construction instead of r_str_newf

libr/arch/p/x86/plugin_cs.c

  • Single malloc + in-place memmove for "ptr " removal (was r_str_newf + r_str_replace = 2 allocs)
  • Remove per-instruction cs_option(CS_OPT_DETAIL) (already set in init)
  • Inline cs_len_prefix_opcode() to eliminate function call overhead

libr/core/disasm.c

  • Compute decode_mask once from asm.emu/asm.cmt.esil settings
  • Use computed mask in all 4 main disasm loops instead of R_ARCH_OP_MASK_ALL
  • Use R_ARCH_OP_MASK_BASIC for color-only decode path
  • Skip ESIL and OPEX generation when not displayed (~30% of decode cost)

libr/anal/fcn.c

  • Conditional ESIL in analysis loop: only for ARM (needs pc,lr,= pattern matching)
  • x86/MIPS/other archs skip ESIL generation during function analysis

Testing

  • x86_64 asm: 1318 OK, 57 BR, 0 XX, 0 SK, 3 FX (identical to baseline)
  • arm64 asm: 658 OK, 25 BR, 0 XX, 0 SK, 0 FX (identical to baseline)
  • arm32 asm: 272 OK, 83 BR, 0 XX, 0 SK, 17 FX (identical to baseline)
  • aa analysis of /bin/ls completes successfully

Future optimization opportunities

  1. Printer bypass — X86_Intel_printInst is 51% inclusive cost; analysis-only callers don't need string output
  2. Lazy detail mode — skip cs_detail fill when caller only needs mnemonic
  3. MCInst pooling — avoid reinitializing ~2KB struct on every cs_disasm_iter call
  4. Batch decode API — amortize per-call overhead for sequential disassembly
  5. r_strbuf for display — replace r_str_newf in disasm display loops with pre-allocated buffers

https://claude.ai/code/session_01KDR9eBZ4vEAftFBQ2vuhmr

claude added 2 commits April 4, 2026 18:24
Replace binary search (log2(15K) ~14 comparisons) with direct lookup
tables for instruction mapping in the x86 Capstone decoder. Profiling
with callgrind showed find_insn() binary search was called 2-3x per
instruction and consumed ~22% of total disassembly cycles.

Changes in subprojects/capstone-v5:
- X86Mapping.c: Replace find_insn() binary search with O(1) direct
  index table (lazily built on first use, ~60KB memory)
- X86Mapping.c: Replace X86_insn_reg_intel/att binary searches with
  packed O(1) lookup tables
- SStream.c: Add fast integer-to-string formatters (fast_utoa_hex,
  fast_utoa_dec) replacing vsnprintf for number formatting
- SStream.c: Simplify SStream_concat0 overflow check
- MCInst.c: Use memset for tied_op_idx initialization instead of loop

Benchmark results (x86_64 disassembly, 65KB of /bin/ls, 50 iterations):
  cs_disasm:          2.03M -> 2.42M insns/sec (+19%)
  cs_disasm_iter:     2.43M -> 3.16M insns/sec (+30%)
  cs_disasm_iter (no detail): 3.63M -> 4.09M insns/sec (+13%)

Callgrind instruction count: 664M -> 525M (-21%)

Per-function improvements:
  X86_get_op_access:  95.1M -> 12.5M instructions (-87%)
  X86_get_insn_id:    50.8M -> 17.5M instructions (-65%)
  X86_insn_reg_intel: 30.1M ->  4.7M instructions (-84%)

All 1378 x86_64 asm tests pass (0 new failures).

https://claude.ai/code/session_01KDR9eBZ4vEAftFBQ2vuhmr
Thread-safe per-handle O(1) lookup tables in Capstone X86:
- Move lookup tables from static globals to cs_struct fields
- Build tables in X86_global_init(), free in cs_close()
- Add find_insn_h() and X86_insn_reg_{intel,att}_h() per-handle variants
- Keep binary search fallback for decoder paths without handle access
- Replace vsnprintf number formatting with fast custom formatters in SStream
- Use memset for MCInst tied_op_idx initialization

ARM plugin (plugin_cs.c):
- Switch from allocating cs_disasm() to stack-based cs_disasm_iter()
- Replace r_str_newf mnemonic construction with direct malloc+memcpy

x86 plugin (plugin_cs.c):
- Replace r_str_newf + r_str_replace (2 allocs) with single malloc + in-place memmove
- Remove redundant per-instruction cs_option(CS_OPT_DETAIL) call
- Inline cs_len_prefix_opcode() for branch penalty elimination

Core disassembly (disasm.c):
- Compute decode_mask once based on display settings (asm.emu, asm.cmt.esil)
- Skip ESIL/OPEX generation when not needed for display
- Use R_ARCH_OP_MASK_BASIC for color-only decode paths

Analysis (fcn.c):
- Remove R_ARCH_OP_MASK_ESIL from default analysis loop for non-ARM archs
- ESIL generation only when architecture needs it for pattern matching

https://claude.ai/code/session_01KDR9eBZ4vEAftFBQ2vuhmr
@trufae trufae changed the title Optimize Capstone x86 disassembly with O(1) lookup tables Optimize Capstone disassembly performance across the stack Apr 4, 2026