Enable case-sensitive LeadingStrings with frequency-based heuristic#31
Enable case-sensitive LeadingStrings with frequency-based heuristic#31danmoseley wants to merge 1 commit intomainfrom
Conversation
New benchmark results (not yet in dotnet/performance, won't be picked up by PR bot)These benchmarks are from the companion PR danmoseley/performance#6.
All MannWhitney tests report Same — no regressions on binary or non-ASCII input. |
There was a problem hiding this comment.
Pull request overview
This PR enables case-sensitive LeadingStrings optimization with a frequency-based heuristic to decide between SearchValues (Teddy/Aho-Corasick) and IndexOfAny for patterns with multiple alternation prefixes. The heuristic uses empirical ASCII character frequency ranks (borrowed from Rust's aho-corasick crate) to determine when starting characters are common enough that IndexOfAny would match too frequently, making SearchValues the better choice.
Changes:
- Uncommented and enabled the previously disabled case-sensitive LeadingStrings optimization (lines 186-195)
- Added HasHighFrequencyStartingChars method that uses frequency-based heuristic with 128-bit bitset for deduplication
- Added AsciiCharFrequencyRank table with 128 empirical frequency ranks for ASCII characters
.../System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFindOptimizations.cs
Outdated
Show resolved
Hide resolved
eb39721 to
e877181
Compare
e877181 to
da12aa4
Compare
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
da12aa4 to
3744268
Compare
|
Yeah, i fixed the oops right away -- it committed all its scratch binaries.. doh. |
|
Hold on this is in my fork. YOu want dotnet#124736 I'm using PR's in fork to iterate with the AI without bugging anyone |
I thought it would be interesting to see whether AI could take another look at the commented out search strategy originally introduced by @stephentoub in dotnet#98791 to see whether we can enable it and keep the wins without the regressions that caused it to be commented out.
AI tried various experiments, and got to a dead end. I recalled the frequency table approach that Ripgrep uses. Turns out that fixes the regressions entirely. This means our engine now has assumptions built in about char frequencies in ASCII (only) text. That's an approach that's been proven in ripgrep, one of the fastest engines, for 10 years, and it turns out to work OK for regex-redux as well because a, c, g, t are relatively high frequency in English anyway. Code unchanged if pattern has anything other than ASCII (see benchmark results below).
This gives us a nice win on regex-redux, a few other wins in existing tests, and no regressions.
====
When a regex has multiple alternation prefixes (e.g.
a|b|c|...), this change decides whether to useSearchValues<string>(Teddy/Aho-Corasick) or fall through toFixedDistanceSets(IndexOfAny) based on the frequency of the starting characters.High-frequency starters (common letters like lowercase vowels) benefit from multi-string search; low-frequency starters (uppercase, digits, rare consonants) are already excellent
IndexOfAnyfilters. Non-ASCII starters bail out (no frequency data), preserving baseline behavior.Benchmark results (444 benchmarks, BDN A/B with --statisticalTest 3ms)
No regressions detected. All MannWhitney tests report Same for non-improved benchmarks.
Key design decisions
BYTE_FREQUENCIESfrom @BurntSushi's aho-corasick crateLeadingStrings; below 200 falls through toFixedDistanceSetsCompanion benchmarks: danmoseley/performance#6