Enable case-sensitive LeadingStrings with frequency-based heuristic #31

Closed
danmoseley wants to merge 1 commit into main from regex-redux/leading-strings-frequency


@danmoseley
Owner

@danmoseley danmoseley commented Feb 23, 2026

I thought it would be interesting to see whether AI could take another look at the commented-out search strategy originally introduced by @stephentoub in dotnet#98791, and whether we could enable it and keep the wins without the regressions that caused it to be commented out.

AI tried various experiments and hit a dead end. I recalled the frequency table approach that ripgrep uses. Turns out that fixes the regressions entirely. This means our engine now has assumptions built in about char frequencies in ASCII (only) text. That's an approach that's been proven in ripgrep, one of the fastest engines, for 10 years, and it turns out to work OK for regex-redux as well because a, c, g, t are relatively high frequency in English anyway. Behavior is unchanged if the pattern has anything other than ASCII (see benchmark results below).

This gives us a nice win on regex-redux, a few other wins in existing tests, and no regressions.

====
When a regex has multiple alternation prefixes (e.g. a|b|c|...), this change decides whether to use SearchValues<string> (Teddy/Aho-Corasick) or fall through to FixedDistanceSets (IndexOfAny) based on the frequency of the starting characters.

High-frequency starters (common letters like lowercase vowels) benefit from multi-string search; low-frequency starters (uppercase, digits, rare consonants) are already excellent IndexOfAny filters. Non-ASCII starters bail out (no frequency data), preserving baseline behavior.
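To make the intuition concrete, here is a minimal Python sketch (not the .NET implementation) that counts how often a single-character prefilter would stop compared with how many real prefix matches exist. The sample texts, the prefix lists, and the helper names `candidate_positions` and `true_matches` are all made up for illustration.

```python
def candidate_positions(text: str, starters: set) -> int:
    """Count positions where a 'first character only' filter would stop."""
    return sum(1 for ch in text if ch in starters)


def true_matches(text: str, prefixes: list) -> int:
    """Count positions where one of the full prefixes actually begins."""
    return sum(
        1
        for i in range(len(text))
        if any(text.startswith(p, i) for p in prefixes)
    )


# DNA-like input: the starting characters 'a' and 't' are everywhere,
# so a single-character filter stops far more often than there are matches.
dna = "acgtacgtagggtaaaccgtttaccctgacgt"
dna_prefixes = ["agggtaaa", "tttaccct"]
dna_starters = {p[0] for p in dna_prefixes}

# Prose input with rare (uppercase) starting characters: the filter
# stops almost only where a real match begins.
prose = "the river runs past Tom and Huck near the river"
prose_prefixes = ["Tom", "Huck"]
prose_starters = {p[0] for p in prose_prefixes}

if __name__ == "__main__":
    for name, text, prefixes, starters in [
        ("dna", dna, dna_prefixes, dna_starters),
        ("prose", prose, prose_prefixes, prose_starters),
    ]:
        print(name, candidate_positions(text, starters), true_matches(text, prefixes))
```

When the first count is much larger than the second, most stops are false positives, which is the case where a multi-string search is expected to pay off.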

Benchmark results (444 benchmarks, BDN A/B with --statisticalTest 3ms)

| Benchmark | Baseline | PR | Ratio | Verdict |
| --- | --- | --- | --- | --- |
| RegexRedux_1 (Compiled) | 25.77ms | 14.27ms | 1.81x faster | Faster |
| Leipzig Tom.*river (Compiled) | 6.13ms | 1.87ms | 3.28x faster | Faster |
| RegexRedux_5 (Compiled) | 2.83ms | 2.35ms | 1.20x faster | Faster |
| Sherlock, BinaryData, BoostDocs, Mariomkas, SliceSlice | | | | Same |
| LeadingStrings_NonAscii (all variants) | | | | Same |
| LeadingStrings_BinaryData (all variants) | | | | Same |

No regressions detected. All MannWhitney tests report Same for non-improved benchmarks.

Key design decisions

  • Frequency table: First 128 entries of Rust's BYTE_FREQUENCIES from @BurntSushi's aho-corasick crate
  • Threshold: Average rank >= 200 triggers LeadingStrings; below 200 falls through to FixedDistanceSets
  • Non-ASCII: Returns false (no frequency data), so the heuristic does not engage and behavior is unchanged
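The three decisions above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in the PR: the rank values in `ILLUSTRATIVE_RANK` are invented placeholders (the real table is the first 128 entries of `BYTE_FREQUENCIES`), `DEFAULT_RANK` and the handling of empty input are assumptions, and only the threshold of 200 and the non-ASCII bail-out come from the description.

```python
# HYPOTHETICAL ranks on a 0-255 scale (higher = more frequent).
# These are placeholders for illustration, NOT the real BYTE_FREQUENCIES values.
ILLUSTRATIVE_RANK = {
    "a": 240, "c": 215, "g": 205, "t": 235, "e": 245,
    "T": 120, "S": 125, "Q": 60, "7": 90,
}
DEFAULT_RANK = 100  # assumed rank for ASCII chars missing from the toy table
THRESHOLD = 200     # from the PR description: average rank >= 200


def has_high_frequency_starting_chars(prefixes: list) -> bool:
    """Return True when the distinct starting chars are common on average."""
    starters = set()
    for prefix in prefixes:
        if not prefix:
            return False  # assumption: an empty prefix disables the heuristic
        first = prefix[0]
        if ord(first) >= 128:
            return False  # non-ASCII: no frequency data, keep baseline behavior
        starters.add(first)
    if not starters:
        return False
    total = sum(ILLUSTRATIVE_RANK.get(ch, DEFAULT_RANK) for ch in starters)
    # Compare total against THRESHOLD * count to avoid floating point.
    return total >= THRESHOLD * len(starters)


if __name__ == "__main__":
    print(has_high_frequency_starting_chars(["agggtaaa", "tttaccct"]))
    print(has_high_frequency_starting_chars(["Tom", "Sawyer"]))
    print(has_high_frequency_starting_chars(["été", "abc"]))
```

With these placeholder ranks, a `True` result corresponds to choosing LeadingStrings and `False` to falling through to FixedDistanceSets.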

Companion benchmarks: danmoseley/performance#6

@danmoseley
Owner Author

New benchmark results (not yet in dotnet/performance, won't be picked up by PR bot)

These benchmarks are from the companion PR danmoseley/performance#6.

BenchmarkDotNet v0.16.0-custom.20260127.101, Windows 11 (10.0.26100.7840/24H2/2024Update/HudsonValley)
Intel Core i9-14900K 3.20GHz, 1 CPU, 32 logical and 24 physical cores
| Benchmark | Options | Baseline | PR | Ratio | MannWhitney(3ms) |
| --- | --- | --- | --- | --- | --- |
| LeadingStrings_BinaryData | None | 4,483 μs | 4,365 μs | 0.97 | Same |
| LeadingStrings_BinaryData | Compiled | 2,188 μs | 2,184 μs | 1.00 | Same |
| LeadingStrings_BinaryData | NonBacktracking | 3,734 μs | 3,725 μs | 1.00 | Same |
| LeadingStrings_NonAscii Count | None | 913 μs | 956 μs | 1.05 | Same |
| LeadingStrings_NonAscii Count | Compiled | 244 μs | 243 μs | 1.00 | Same |
| LeadingStrings_NonAscii CountIgnoreCase | None | 1,758 μs | 1,714 μs | 0.98 | Same |
| LeadingStrings_NonAscii CountIgnoreCase | Compiled | 258 μs | 250 μs | 0.97 | Same |
| LeadingStrings_NonAscii Count | NonBacktracking | 392 μs | 398 μs | 1.02 | Same |
| LeadingStrings_NonAscii CountIgnoreCase | NonBacktracking | 409 μs | 431 μs | 1.05 | Same |

All MannWhitney tests report Same — no regressions on binary or non-ASCII input.


Copilot AI left a comment


Pull request overview

This PR enables case-sensitive LeadingStrings optimization with a frequency-based heuristic to decide between SearchValues (Teddy/Aho-Corasick) and IndexOfAny for patterns with multiple alternation prefixes. The heuristic uses empirical ASCII character frequency ranks (borrowed from Rust's aho-corasick crate) to determine when starting characters are common enough that IndexOfAny would match too frequently, making SearchValues the better choice.

Changes:

  • Uncommented and enabled the previously disabled case-sensitive LeadingStrings optimization (lines 186-195)
  • Added HasHighFrequencyStartingChars method that uses frequency-based heuristic with 128-bit bitset for deduplication
  • Added AsciiCharFrequencyRank table with 128 empirical frequency ranks for ASCII characters
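As a rough illustration of the 128-bit bitset deduplication mentioned above, the Python sketch below tracks which ASCII starting characters have been seen using two 64-bit words. The function name `distinct_ascii_starters` and the choice to return `None` for empty or non-ASCII input are assumptions for the example; the actual method may be structured differently.

```python
def distinct_ascii_starters(prefixes: list):
    """Collect each ASCII starting char once, using two 64-bit masks.

    Returns None when a prefix is empty or starts with a non-ASCII char
    (an assumption mirroring the "no frequency data" bail-out).
    """
    low = 0   # bits for code points 0..63
    high = 0  # bits for code points 64..127
    distinct = []
    for prefix in prefixes:
        if not prefix:
            return None
        code = ord(prefix[0])
        if code >= 128:
            return None
        if code < 64:
            bit = 1 << code
            if low & bit:
                continue  # already counted this starting char
            low |= bit
        else:
            bit = 1 << (code - 64)
            if high & bit:
                continue  # already counted this starting char
            high |= bit
        distinct.append(prefix[0])
    return distinct


if __name__ == "__main__":
    print(distinct_ascii_starters(["apple", "avocado", "banana", "1up", "berry"]))
```

Presumably the point of deduplicating is that many prefixes sharing one starting character should contribute that character's rank only once to the average.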

@danmoseley danmoseley force-pushed the regex-redux/leading-strings-frequency branch 2 times, most recently from eb39721 to e877181 Compare February 23, 2026 04:19
@stephentoub

Oops? :)
[image]

@danmoseley danmoseley force-pushed the regex-redux/leading-strings-frequency branch from e877181 to da12aa4 Compare February 23, 2026 04:22
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danmoseley danmoseley force-pushed the regex-redux/leading-strings-frequency branch from da12aa4 to 3744268 Compare February 23, 2026 04:24
@danmoseley
Owner Author

danmoseley commented Feb 23, 2026

Yeah, I fixed the oops right away -- it committed all its scratch binaries... doh.

@danmoseley
Owner Author

Hold on, this is in my fork. You want dotnet#124736

I'm using PRs in my fork to iterate with the AI without bugging anyone.

@danmoseley danmoseley closed this Feb 23, 2026