Optimize Capstone disassembly performance across the stack #25721
Open
Replace binary search (log2(15K) ~14 comparisons) with direct lookup tables for instruction mapping in the x86 Capstone decoder. Profiling with callgrind showed the find_insn() binary search was called 2-3x per instruction and consumed ~22% of total disassembly cycles.

Changes in subprojects/capstone-v5:
- X86Mapping.c: Replace find_insn() binary search with an O(1) direct index table (lazily built on first use, ~60KB memory)
- X86Mapping.c: Replace X86_insn_reg_intel/att binary searches with packed O(1) lookup tables
- SStream.c: Add fast integer-to-string formatters (fast_utoa_hex, fast_utoa_dec) replacing vsnprintf for number formatting
- SStream.c: Simplify SStream_concat0 overflow check
- MCInst.c: Use memset for tied_op_idx initialization instead of a loop

Benchmark results (x86_64 disassembly, 65KB of /bin/ls, 50 iterations):
- cs_disasm: 2.03M -> 2.42M insns/sec (+19%)
- cs_disasm_iter: 2.43M -> 3.16M insns/sec (+30%)
- cs_disasm_iter (no detail): 3.63M -> 4.09M insns/sec (+13%)
- Callgrind instruction count: 664M -> 525M (-21%)

Per-function improvements:
- X86_get_op_access: 95.1M -> 12.5M instructions (-87%)
- X86_get_insn_id: 50.8M -> 17.5M instructions (-65%)
- X86_insn_reg_intel: 30.1M -> 4.7M instructions (-84%)

All 1378 x86_64 asm tests pass (0 new failures).

https://claude.ai/code/session_01KDR9eBZ4vEAftFBQ2vuhmr
Thread-safe per-handle O(1) lookup tables in Capstone X86:
- Move lookup tables from static globals to cs_struct fields
- Build tables in X86_global_init(), free in cs_close()
- Add find_insn_h() and X86_insn_reg_{intel,att}_h() per-handle variants
- Keep binary search fallback for decoder paths without handle access
- Replace vsnprintf number formatting with fast custom formatters in SStream
- Use memset for MCInst tied_op_idx initialization
ARM plugin (plugin_cs.c):
- Switch from allocating cs_disasm() to stack-based cs_disasm_iter()
- Replace r_str_newf mnemonic construction with direct malloc+memcpy
x86 plugin (plugin_cs.c):
- Replace r_str_newf + r_str_replace (2 allocs) with single malloc + in-place memmove
- Remove redundant per-instruction cs_option(CS_OPT_DETAIL) call
- Inline cs_len_prefix_opcode() for branch penalty elimination
Core disassembly (disasm.c):
- Compute decode_mask once based on display settings (asm.emu, asm.cmt.esil)
- Skip ESIL/OPEX generation when not needed for display
- Use R_ARCH_OP_MASK_BASIC for color-only decode paths
Analysis (fcn.c):
- Remove R_ARCH_OP_MASK_ESIL from default analysis loop for non-ARM archs
- ESIL generation only when architecture needs it for pattern matching
https://claude.ai/code/session_01KDR9eBZ4vEAftFBQ2vuhmr
Summary
- Replace binary searches in find_insn(), X86_insn_reg_intel(), and X86_insn_reg_att() with O(1) lookup tables
- Replace vsnprintf calls in SStream with fast custom integer formatters
- Switch the ARM plugin from cs_disasm() to stack-based cs_disasm_iter(), eliminating malloc/free per instruction
- Remove redundant per-instruction cs_option() calls

Thread safety
Lookup tables are stored per-handle in cs_struct fields (not static globals), built during X86_global_init(), and freed in cs_close(). Multiple Capstone handles can coexist safely across threads.

Profiling methodology
Built a standalone benchmark disassembling 65KB of /bin/ls x86_64 code with CS_OPT_DETAIL enabled. Used callgrind for instruction-level profiling. The top bottlenecks before optimization:

- X86_get_op_access: dominated by the find_insn() binary search (log2(15K) ≈ 14 comparisons)
- X86_get_insn_id
- X86_insn_reg_intel
- SStream_concat: vsnprintf for integer formatting

Benchmark results (Capstone layer)

- cs_disasm (with detail): 2.03M → 2.42M insns/sec (+19%)
- cs_disasm_iter (with detail): 2.43M → 3.16M insns/sec (+30%)
- cs_disasm_iter (no detail): 3.63M → 4.09M insns/sec (+13%)
- Callgrind total instructions: 664M → 525M (-21%)
Changes by layer
Capstone (shipped as optimize-x86-lookup-tables.patch)

- Per-handle lookup tables stored in cs_struct (~120KB per handle)
- find_insn_h() + X86_insn_reg_{intel,att}_h() fast variants
- SStream_concat_num() bypassing vsnprintf
- memset for MCInst tied_op_idx initialization

libr/arch/p/arm/plugin_cs.c
- cs_disasm() → cs_disasm_iter() (zero-alloc per instruction)
- malloc + memcpy mnemonic construction instead of r_str_newf

libr/arch/p/x86/plugin_cs.c
- Single malloc + in-place memmove for "ptr " removal (was r_str_newf + r_str_replace = 2 allocs)
- Removed redundant per-instruction cs_option(CS_OPT_DETAIL) (already set in init)
- Inlined cs_len_prefix_opcode() to eliminate function call overhead

libr/core/disasm.c
- Compute decode_mask once from the asm.emu / asm.cmt.esil settings
- Avoid R_ARCH_OP_MASK_ALL when ESIL/OPEX output is not needed for display
- R_ARCH_OP_MASK_BASIC for the color-only decode path

libr/anal/fcn.c
- Remove R_ARCH_OP_MASK_ESIL from the default analysis loop for non-ARM archs (ARM keeps ESIL for its pc, lr, = pattern matching)

Testing
- All 1378 x86_64 asm tests pass (0 new failures)
- aa analysis of /bin/ls completes successfully

Future optimization opportunities
- X86_Intel_printInst is 51% inclusive cost; analysis-only callers don't need string output
- Skip the cs_detail fill when the caller only needs the mnemonic from a cs_disasm_iter call
- Replace r_str_newf in disasm display loops with pre-allocated buffers

https://claude.ai/code/session_01KDR9eBZ4vEAftFBQ2vuhmr