Enable case-sensitive LeadingStrings with frequency-based heuristic by danmoseley · Pull Request #31 · danmoseley/runtime

danmoseley · 2026-02-23T01:45:12Z

I thought it would be interesting to see whether AI could take another look at the commented-out search strategy originally introduced by @stephentoub in dotnet#98791, to find out whether we can enable it and keep the wins without the regressions that caused it to be commented out.

AI tried various experiments and reached a dead end. I then recalled the frequency table approach that ripgrep uses, and it turns out that fixes the regressions entirely. This means our engine now has assumptions built in about char frequencies in ASCII (only) text. That approach has been proven in ripgrep, one of the fastest engines, for 10 years, and it turns out to work OK for regex-redux as well because a, c, g, t are relatively high frequency in English anyway. Behavior is unchanged if the pattern has anything other than ASCII (see benchmark results below).

This gives us a nice win on regex-redux, a few other wins in existing tests, and no regressions.

====
When a regex has multiple alternation prefixes (e.g. a|b|c|...), this change decides whether to use SearchValues<string> (Teddy/Aho-Corasick) or fall through to FixedDistanceSets (IndexOfAny) based on the frequency of the starting characters.

High-frequency starters (common letters like lowercase vowels) benefit from multi-string search; low-frequency starters (uppercase, digits, rare consonants) are already excellent IndexOfAny filters. Non-ASCII starters bail out (no frequency data), preserving baseline behavior.

Benchmark results (444 benchmarks, BDN A/B with --statisticalTest 3ms)

Benchmark	Baseline	PR	Ratio	Verdict
RegexRedux_1 (Compiled)	25.77ms	14.27ms	1.81x faster	Faster
Leipzig Tom.*river (Compiled)	6.13ms	1.87ms	3.28x faster	Faster
RegexRedux_5 (Compiled)	2.83ms	2.35ms	1.20x faster	Faster
Sherlock, BinaryData, BoostDocs, Mariomkas, SliceSlice	—	—	—	Same
LeadingStrings_NonAscii (all variants)	—	—	—	Same
LeadingStrings_BinaryData (all variants)	—	—	—	Same

No regressions detected. All MannWhitney tests report Same for non-improved benchmarks.

Key design decisions

Frequency table: First 128 entries of Rust's BYTE_FREQUENCIES from @BurntSushi's aho-corasick crate
Threshold: Average rank >= 200 triggers LeadingStrings; below 200 falls through to FixedDistanceSets
Non-ASCII: Returns false (no frequency data), so the heuristic does not engage and behavior is unchanged
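
To make the decision rule above concrete, here is a minimal Python sketch of it (the PR itself is C#). The rank values, the fallback rank, and the function name are hypothetical and chosen only for illustration; they are not the real BYTE_FREQUENCIES entries. It also assumes the average is taken over the distinct starting characters, which the description above does not spell out.

```python
# Illustrative ranks only: these are NOT the real BYTE_FREQUENCIES values.
# Assumed scale: 0..255, where a higher rank means a more common character.
HYPOTHETICAL_RANKS = {
    "e": 245, "a": 240, "t": 238, "c": 215, "g": 205,
    "Q": 90, "Z": 80, "7": 120,
}
DEFAULT_RANK = 150  # hypothetical fallback for ASCII chars missing from this toy table
THRESHOLD = 200     # from the PR description: average rank >= 200


def has_high_frequency_starting_chars(prefixes):
    """Return True when the alternation prefixes start with common characters."""
    # Assumption: the average is computed over distinct starting characters.
    starters = {p[0] for p in prefixes if p}
    if not starters:
        return False
    # Non-ASCII starter: no frequency data, so the heuristic does not engage.
    if any(ord(ch) >= 128 for ch in starters):
        return False
    total = sum(HYPOTHETICAL_RANKS.get(ch, DEFAULT_RANK) for ch in starters)
    return total / len(starters) >= THRESHOLD


# Common lowercase starters: multi-string search (LeadingStrings) is preferred.
print(has_high_frequency_starting_chars(["agggtaaa", "cattag", "tttaccct"]))
# Rare starters: fall through to the IndexOfAny-style filter.
print(has_high_frequency_starting_chars(["Quick", "Zebra", "7up"]))
```

With these toy ranks the first call averages (240 + 215 + 238) / 3 = 231 and passes the threshold, while the second averages about 97 and does not.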

Companion benchmarks: danmoseley/performance#6

danmoseley · 2026-02-23T02:01:59Z

New benchmark results (not yet in dotnet/performance, won't be picked up by PR bot)

These benchmarks are from the companion PR danmoseley/performance#6.

BenchmarkDotNet v0.16.0-custom.20260127.101, Windows 11 (10.0.26100.7840/24H2/2024Update/HudsonValley)
Intel Core i9-14900K 3.20GHz, 1 CPU, 32 logical and 24 physical cores

Benchmark	Options	Baseline	PR	Ratio (PR/Baseline)	MannWhitney(3ms)
LeadingStrings_BinaryData	None	4,483 μs	4,365 μs	0.97	Same
LeadingStrings_BinaryData	Compiled	2,188 μs	2,184 μs	1.00	Same
LeadingStrings_BinaryData	NonBacktracking	3,734 μs	3,725 μs	1.00	Same
LeadingStrings_NonAscii Count	None	913 μs	956 μs	1.05	Same
LeadingStrings_NonAscii Count	Compiled	244 μs	243 μs	1.00	Same
LeadingStrings_NonAscii CountIgnoreCase	None	1,758 μs	1,714 μs	0.98	Same
LeadingStrings_NonAscii CountIgnoreCase	Compiled	258 μs	250 μs	0.97	Same
LeadingStrings_NonAscii Count	NonBacktracking	392 μs	398 μs	1.02	Same
LeadingStrings_NonAscii CountIgnoreCase	NonBacktracking	409 μs	431 μs	1.05	Same

All MannWhitney tests report Same — no regressions on binary or non-ASCII input.

Copilot

Pull request overview

This PR enables case-sensitive LeadingStrings optimization with a frequency-based heuristic to decide between SearchValues (Teddy/Aho-Corasick) and IndexOfAny for patterns with multiple alternation prefixes. The heuristic uses empirical ASCII character frequency ranks (borrowed from Rust's aho-corasick crate) to determine when starting characters are common enough that IndexOfAny would match too frequently, making SearchValues the better choice.

Changes:

Uncommented and enabled the previously disabled case-sensitive LeadingStrings optimization (lines 186-195)
Added HasHighFrequencyStartingChars method that uses frequency-based heuristic with 128-bit bitset for deduplication
Added AsciiCharFrequencyRank table with 128 empirical frequency ranks for ASCII characters
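
As a rough illustration of the deduplication step mentioned in the changes list, the Python sketch below uses a single integer as a 128-bit bitset, one bit per ASCII code point. The function name and return shape are hypothetical; the actual C# method may represent the 128 bits differently and is not reproduced here.

```python
def distinct_ascii_starters(prefixes):
    """Return the distinct ASCII starting chars in first-seen order,
    or None if any prefix starts with a non-ASCII char."""
    seen = 0  # bit i set means chr(i) has already been counted
    distinct = []
    for prefix in prefixes:
        if not prefix:
            continue  # an empty prefix has no starting char
        code = ord(prefix[0])
        if code >= 128:
            return None  # outside the 128-entry table: bail out
        bit = 1 << code
        if (seen & bit) == 0:
            seen |= bit
            distinct.append(prefix[0])
    return distinct


print(distinct_ascii_starters(["apple", "avocado", "banana", "apricot"]))
```

Deduplicating first means a starting character shared by many alternation branches contributes to the frequency estimate only once.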

.../System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFindOptimizations.cs

stephentoub · 2026-02-23T04:21:57Z

Oops? :)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

danmoseley · 2026-02-23T05:10:25Z

Yeah, I fixed the oops right away. It had committed all its scratch binaries. Doh.

danmoseley · 2026-02-23T05:12:08Z

Hold on, this is in my fork. You want dotnet#124736.

I'm using PRs in my fork to iterate with the AI without bugging anyone.

danmoseley mentioned this pull request Feb 23, 2026

Add LeadingStrings benchmarks for binary and non-ASCII regex patterns danmoseley/performance#6

Closed

danmoseley requested a review from Copilot February 23, 2026 02:02

Copilot started reviewing on behalf of danmoseley February 23, 2026 02:02

Copilot AI reviewed Feb 23, 2026

danmoseley force-pushed the regex-redux/leading-strings-frequency branch 2 times, most recently from eb39721 to e877181, February 23, 2026 04:19

danmoseley force-pushed the regex-redux/leading-strings-frequency branch from e877181 to da12aa4, February 23, 2026 04:22

Enable case-sensitive LeadingStrings with frequency-based heuristic

3744268

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

danmoseley force-pushed the regex-redux/leading-strings-frequency branch from da12aa4 to 3744268, February 23, 2026 04:24

danmoseley closed this Feb 23, 2026

danmoseley wants to merge 1 commit into main from regex-redux/leading-strings-frequency
