
Enable case-sensitive LeadingStrings with frequency-based heuristic #124736

Open
danmoseley wants to merge 1 commit into dotnet:main from danmoseley:regex-redux/leading-strings-frequency

Conversation


@danmoseley (Member) commented Feb 23, 2026

I thought it would be interesting to see whether AI could take another look at the commented-out search strategy originally introduced by @stephentoub in #98791, to see whether we can enable it and keep the wins without the regressions that led to it being disabled.

AI tried various experiments and hit a dead end. I recalled the frequency-table approach that ripgrep uses (credit to @BurntSushi). It turns out that fixes the regressions entirely. This means our engine now has assumptions built in about char frequencies in ASCII (only) text. That approach has been proven in ripgrep, one of the fastest engines, for 10 years, and it works out for regex-redux as well because a, c, g, and t are relatively high-frequency in English anyway. The code path is unchanged if the pattern contains anything other than ASCII (see benchmark results below).

This gives us a nice win on regex-redux, a few other wins in existing tests, and no regressions.

Note: a char frequency table already existed in RegexPrefixAnalyzer.cs for ranking which fixed-distance character sets are most selective. Our new table serves a different purpose: deciding whether to use LeadingStrings vs FixedDistanceSets at all. The two are complementary.

====

When a regex has multiple alternation prefixes (e.g. a|b|c|...), this change decides whether to use SearchValues<string> (Teddy/Aho-Corasick) or fall through to FixedDistanceSets (IndexOfAny) based on the frequency of the starting characters.

High-frequency starters (common letters like lowercase vowels) benefit from multi-string search; low-frequency starters (uppercase, digits, rare consonants) are already excellent IndexOfAny filters. Non-ASCII starters bail out (no frequency data), preserving baseline behavior.

Benchmark results (444 benchmarks, BDN A/B with --statisticalTest 3ms):

| Benchmark | Baseline | PR | Ratio | Verdict |
|---|---|---|---|---|
| RegexRedux_1 (Compiled) | 25.77 ms | 14.27 ms | 1.81x faster | Faster |
| Leipzig Tom.*river (Compiled) | 6.13 ms | 1.87 ms | 3.28x faster | Faster |
| RegexRedux_5 (Compiled) | 2.83 ms | 2.35 ms | 1.20x faster | Faster |
| Sherlock, BinaryData, BoostDocs, Mariomkas, SliceSlice | -- | -- | -- | Same |
| LeadingStrings_NonAscii (all variants) | -- | -- | -- | Same |
| LeadingStrings_BinaryData (all variants) | -- | -- | -- | Same |

The Leipzig win is because the pattern is `Tom.{10,25}river|river.{10,25}Tom`, so there is a short prefix that is common in the text. With this change the engine notices that 'r' is common and 'T' fairly common in English text, so it switches to SearchValues, which looks for "Tom" and "river" simultaneously, causing far fewer false starts.

The regex-redux win is because the engine was previously looking for short, very common prefixes naively; now (specifically because the pattern chars are common) it switches to SearchValues (i.e. Aho-Corasick/Teddy) to search for the longer strings simultaneously.
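To make the "fewer false starts" effect concrete, here is a rough Python sketch; the function name `earliest_candidate` and the toy haystack are mine, not from the PR. A multi-string searcher reports the earliest start of any prefix in one pass, rather than the engine probing every occurrence of a single starting character.

```python
def earliest_candidate(haystack: str, prefixes: list[str]) -> int:
    """Toy stand-in for a multi-string searcher (SearchValues<string>):
    return the earliest index where any of the prefixes occurs, or -1."""
    hits = [i for i in (haystack.find(p) for p in prefixes) if i != -1]
    return min(hits, default=-1)

text = "a ride down the river with Tom"
earliest_candidate(text, ["Tom", "river"])  # -> 16, the start of "river"
```

The real implementation uses vectorized Teddy/Aho-Corasick rather than repeated `find` calls; this only illustrates why matching full multi-character prefixes beats stopping at every individual 'T' or 'r'.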

No regressions detected. All MannWhitney tests report Same for non-improved benchmarks.

Key design decisions

  • Frequency table: First 128 entries of Rust's BYTE_FREQUENCIES from @BurntSushi's aho-corasick crate
  • Threshold: Average rank >= 200 triggers LeadingStrings; below 200 falls through to FixedDistanceSets
  • Non-ASCII: Returns false (no frequency data), so the heuristic does not engage and behavior is unchanged
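A Python sketch of the rank-based guard these bullets describe; the `RANK` values below are illustrative stand-ins (not the actual BYTE_FREQUENCIES entries) and the helper name is hypothetical:

```python
# Illustrative ranks (0-255, higher = more common in typical text); the
# real table is the first 128 entries of aho-corasick's BYTE_FREQUENCIES.
RANK = {c: 250 for c in "etaoinshr"}        # common lowercase letters
RANK.update({c: 60 for c in "QXZ"})         # rare uppercase letters
RANK.update({c: 30 for c in "0123456789"})  # digits

def use_leading_strings(prefixes: list[str], threshold: int = 200) -> bool:
    starters = [p[0] for p in prefixes]
    if any(ord(c) >= 128 for c in starters):
        return False  # non-ASCII: no frequency data, keep baseline behavior
    avg = sum(RANK.get(c, 120) for c in starters) / len(starters)
    return avg >= threshold  # common starters -> prefer LeadingStrings

use_leading_strings(["the", "and"])  # True: 't'/'a' are common starters
use_leading_strings(["Q1", "X2"])    # False: rare starters filter well alone
```

Note that the committed code also weighs this against the best fixed-distance set's own character frequency (see the quoted RegexFindOptimizations.cs excerpt in the review comments), so treat this as the decision being described, not the exact shipped logic.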

Companion benchmarks: dotnet/performance#5126

New benchmark results (not yet in dotnet/performance, won't be picked up by PR bot)

These benchmarks are from the companion PR dotnet/performance#5126.

BenchmarkDotNet v0.16.0-custom.20260127.101, Windows 11 (10.0.26100.7840/24H2/2024Update/HudsonValley)
Intel Core i9-14900K 3.20GHz, 1 CPU, 32 logical and 24 physical cores
| Benchmark | Options | Baseline | PR | Ratio | MannWhitney(3ms) |
|---|---|---|---|---|---|
| LeadingStrings_BinaryData | None | 4,483 us | 4,365 us | 0.97 | Same |
| LeadingStrings_BinaryData | Compiled | 2,188 us | 2,184 us | 1.00 | Same |
| LeadingStrings_BinaryData | NonBacktracking | 3,734 us | 3,725 us | 1.00 | Same |
| LeadingStrings_NonAscii Count | None | 913 us | 956 us | 1.05 | Same |
| LeadingStrings_NonAscii Count | Compiled | 244 us | 243 us | 1.00 | Same |
| LeadingStrings_NonAscii CountIgnoreCase | None | 1,758 us | 1,714 us | 0.98 | Same |
| LeadingStrings_NonAscii CountIgnoreCase | Compiled | 258 us | 250 us | 0.97 | Same |
| LeadingStrings_NonAscii Count | NonBacktracking | 392 us | 398 us | 1.02 | Same |
| LeadingStrings_NonAscii CountIgnoreCase | NonBacktracking | 409 us | 431 us | 1.05 | Same |

Binary didn't regress, even though it's ASCII with non-English frequencies, because the pattern has chars that are not particularly common in English, so it uses the old code path. It's hard to hypothesize about what one might search for in a binary file; searching for a bunch of leading lowercase ASCII chars might regress somewhat. We've never particularly tested on binary before, and I don't recall seeing any bugs mentioning binary, so I don't think this case is particularly interesting.

Non-ASCII didn't regress since, as previously mentioned, non-ASCII leading chars in the pattern (presumably likely for any search of non-ASCII text) cause it to choose the existing code path.

All MannWhitney tests report Same -- no regressions on binary or non-ASCII input.

Copilot AI review requested due to automatic review settings February 23, 2026 02:08
@danmoseley

@MihuBot benchmark Regex

Copilot AI left a comment

Pull request overview

This PR enables case-sensitive multi-string prefix optimization for regex patterns by introducing a frequency-based heuristic to decide between SearchValues<string> (Teddy/Aho-Corasick) and IndexOfAny with character sets. Previously, case-sensitive prefixes were disabled due to regressions in patterns with low-frequency starting characters (e.g., uppercase letters, digits). The new heuristic uses empirical byte frequency data from Rust's aho-corasick crate to determine if starting characters are common enough in typical text to warrant multi-string search, or rare enough that IndexOfAny remains a better filter.

Changes:

  • Uncommented and enabled case-sensitive prefix optimization with a frequency guard
  • Added HasHighFrequencyStartingChars method to evaluate whether prefix starting characters are high-frequency (threshold >= 200)
  • Added AsciiCharFrequencyRank table containing the first 128 entries from BurntSushi's BYTE_FREQUENCIES data

@danmoseley danmoseley force-pushed the regex-redux/leading-strings-frequency branch from fc2a7e2 to eb39721 Compare February 23, 2026 02:31
@MihuBot
MihuBot commented Feb 23, 2026

@danmoseley danmoseley force-pushed the regex-redux/leading-strings-frequency branch from eb39721 to e877181 Compare February 23, 2026 04:19
Copilot AI review requested due to automatic review settings February 23, 2026 04:19
Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 9433 changed files in this pull request and generated no new comments.

@danmoseley danmoseley force-pushed the regex-redux/leading-strings-frequency branch from e877181 to da12aa4 Compare February 23, 2026 04:22
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danmoseley danmoseley force-pushed the regex-redux/leading-strings-frequency branch from da12aa4 to 3744268 Compare February 23, 2026 04:24
Copilot AI review requested due to automatic review settings February 23, 2026 04:24
Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (4)

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFindOptimizations.cs:232

  • leadingStringsFrequency uses -1 as the sentinel for "not computed / not applicable", but the subsequent checks use > 0. A valid computed frequency can be 0 (e.g., prefixes starting with '\x00' or other chars with 0 frequency in the table), which would incorrectly skip the LeadingStrings-vs-set comparison. Consider tracking availability via caseSensitivePrefixes is not null and checking leadingStringsFrequency >= 0 (or using a separate bool) so computed-0 still participates in the heuristic.
                if (leadingStringsFrequency > 0)
                {
                    bool preferLeadingStrings = true;

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFindOptimizations.cs:307

  • Same sentinel issue as above: the fallback if (leadingStringsFrequency > 0) will skip using computed case-sensitive prefixes if the computed value is 0, potentially leaving FindMode as NoSearch when no other strategy is selected. Use an availability check like caseSensitivePrefixes is not null && leadingStringsFrequency >= 0 (or a dedicated flag) instead of > 0.
            // If we have case-sensitive leading prefixes and nothing else was selected, use them.
            if (leadingStringsFrequency > 0)
            {
                LeadingPrefixes = caseSensitivePrefixes!;

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexPrefixAnalyzer.cs:1512

  • The PR description says this change is based on Rust aho-corasick BYTE_FREQUENCIES ranks and a rank threshold (>= 200), but the implementation is using the existing RegexPrefixAnalyzer.Frequency table of percentage occurrences (generated from runtime/Gutenberg text) and compares summed percentages. Either the description needs updating to reflect the actual heuristic/table used, or the code needs to align with the described rank-based approach.
        /// <summary>Percent occurrences in source text (100 * char count / total count).</summary>
        internal static ReadOnlySpan<float> Frequency =>
        [
            0.000f /* '\x00' */, 0.000f /* '\x01' */, 0.000f /* '\x02' */, 0.000f /* '\x03' */, 0.000f /* '\x04' */, 0.000f /* '\x05' */, 0.000f /* '\x06' */, 0.000f /* '\x07' */,

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFindOptimizations.cs:255

  • This PR introduces new behavior for RegexOptions.Compiled / NonBacktracking where case-sensitive LeadingStrings may now be selected based on a frequency heuristic. There are existing RegexFindOptimizationsTests, but none appear to cover this new decision logic in compiled/NB modes; adding a few targeted test cases (e.g., where compiled should choose LeadingStrings vs LeadingSet depending on starter frequency, and a non-ASCII starter that must not engage the heuristic) would help prevent future regressions.
                // Compute case-sensitive leading prefixes, but don't commit yet. We'll compare
                // their starting-char frequency against the best FixedDistanceSet below to decide
                // which strategy to use.
                caseSensitivePrefixes = RegexPrefixAnalyzer.FindPrefixes(root, ignoreCase: false) is { Length: > 1 } csp ? csp : null;
                leadingStringsFrequency = caseSensitivePrefixes is not null ? SumStartingCharFrequencies(caseSensitivePrefixes) : -1;
            }

            // Build up a list of all of the sets that are a fixed distance from the start of the expression.
            List<FixedDistanceSet>? fixedDistanceSets = RegexPrefixAnalyzer.FindFixedDistanceSets(root, thorough: !interpreter);
            Debug.Assert(fixedDistanceSets is null || fixedDistanceSets.Count != 0);

            // See if we can make a string of at least two characters long out of those sets.  We should have already caught
            // one at the beginning of the pattern, but there may be one hiding at a non-zero fixed distance into the pattern.
            if (fixedDistanceSets is not null &&
                FindFixedDistanceString(fixedDistanceSets) is (string String, int Distance) bestFixedDistanceString)
            {
                FindMode = FindNextStartingPositionMode.FixedDistanceString_LeftToRight;
                FixedDistanceLiteral = ('\0', bestFixedDistanceString.String, bestFixedDistanceString.Distance);
                return;
            }

            // As a backup, see if we can find a literal after a leading atomic loop.  That might be better than whatever sets we find, so
            // we want to know whether we have one in our pocket before deciding whether to use a leading set (we'll prefer a leading
            // set if it's something for which we can search efficiently).
            (RegexNode LoopNode, (char Char, string? String, StringComparison StringComparison, char[]? Chars) Literal)? literalAfterLoop = RegexPrefixAnalyzer.FindLiteralFollowingLeadingLoop(root);

            // If we got such sets, we'll likely use them.  However, if the best of them is something that doesn't support an efficient
            // search and we did successfully find a literal after an atomic loop we could search instead, we prefer the efficient search.
            // For example, if we have a negated set, we will still prefer the literal-after-an-atomic-loop because negated sets typically
            // contain _many_ characters (e.g. [^a] is everything but 'a') and are thus more likely to very quickly match, which means any
            // vectorization employed is less likely to kick in and be worth the startup overhead.
            if (fixedDistanceSets is not null)
            {
                // Sort the sets by "quality", such that whatever set is first is the one deemed most efficient to use.
                // In some searches, we may use multiple sets, so we want the subsequent ones to also be the efficiency runners-up.
                RegexPrefixAnalyzer.SortFixedDistanceSetsByQuality(fixedDistanceSets);

                // If we have case-sensitive leading prefixes, compare the frequency of their starting characters
                // against the best fixed-distance set's characters. If the best set isn't more selective than the
                // starting chars (i.e. its frequency is at least as high), prefer LeadingStrings (SearchValues)
                // which can match full multi-character prefixes simultaneously. Also prefer LeadingStrings when
                // the best set is negated or range-based (no Chars), since those are weak filters.
                if (leadingStringsFrequency > 0)
                {
                    bool preferLeadingStrings = true;
                    if (fixedDistanceSets[0].Chars is { } bestSetChars &&
                        !fixedDistanceSets[0].Negated)
                    {
                        ReadOnlySpan<float> frequency = RegexPrefixAnalyzer.Frequency;
                        Debug.Assert(frequency.Length == 128);
                        float bestSetFrequency = 0;
                        foreach (char c in bestSetChars)
                        {
                            bestSetFrequency += c < frequency.Length ? frequency[c] : 0;
                        }

                        preferLeadingStrings = bestSetFrequency >= leadingStringsFrequency;
                    }

                    if (preferLeadingStrings)
                    {
                        LeadingPrefixes = caseSensitivePrefixes!;
                        FindMode = FindNextStartingPositionMode.LeadingStrings_LeftToRight;
#if SYSTEM_TEXT_REGULAREXPRESSIONS
                        LeadingStrings = SearchValues.Create(LeadingPrefixes, StringComparison.Ordinal);
#endif
                        return;
                    }

@danmoseley

@MihuBot benchmark Regex

@danmoseley

I also ran the regex tests locally to verify it's still good, so I think this is ready for final (?) review.

@danmoseley

danmoseley commented Feb 23, 2026

In my local runs, I get these wins. All others are unchanged; no regressions.

| Suite | Benchmark | Options | Mean (Main) | Mean (PR) | Ratio | Alloc (Main) | Alloc (PR) | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|
| Leipzig | Tom.{10,25}river\|river.{10,25}Tom | Compiled | 6,421.7 μs | 1,169.9 μs | 0.18 | 51 B | 10 B | 0.20 |
| Leipzig | Tom.{10,25}river\|river.{10,25}Tom | NonBacktracking | 7,183.8 μs | 1,442.7 μs | 0.20 | 16,244 B | 4,428 B | 0.27 |
| Common | SplitWords | Compiled | 2,517.20 ns | 1,202.89 ns | 0.48 | 7,432 B | 7,432 B | 1.00 |
| Common | MatchesWords | Compiled | 2,551.48 ns | 1,240.66 ns | 0.49 | 3,448 B | 3,448 B | 1.00 |
| Common | ReplaceWords | Compiled | 2,423.42 ns | 1,233.94 ns | 0.51 | 6,848 B | 6,848 B | 1.00 |
| Common | MatchWord | Compiled | 126.98 ns | 64.76 ns | 0.51 | 208 B | 208 B | 1.00 |
| RegexRedux_1 | RegexRedux_1 | Compiled | 37.44 ms | 20.46 ms | 0.55 | 3.39 MB | 3.41 MB | 1.01 |
| RegexRedux_5 | RegexRedux_5 | Compiled | 4.852 ms | 3.558 ms | 0.73 | 3.21 MB | 3.21 MB | 1.00 |
| Common | ReplaceWords | IgnoreCase, Compiled | 1,795.05 ns | 1,488.70 ns | 0.83 | 6,848 B | 6,848 B | 1.00 |
| Common | OneNodeBacktracking | Compiled | 79.92 ns | 70.05 ns | 0.88 | - | - | NA |

@danmoseley

danmoseley commented Feb 23, 2026

How would this affect C# (and F#) on the leaderboard at https://programming-language-benchmarks.vercel.app/problem/regex-redux? (The original does not have a [GeneratedRegex] entry yet.)

Currently the top .NET entry is 6th (AOT+generated). I didn't measure AOT, but assuming the ratio is the same as for jit, we'd get to 3rd on the list.

Rust would still be over 2x faster, and the main reason is very likely that its regex engine operates on UTF-8 while ours uses UTF-16, so there are just twice the bytes to process, meaning SIMD has to do more chunks.
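A quick illustration of the 2x claim, just counting the encoded bytes of an ASCII haystack (Python, for illustration only):

```python
# ASCII text occupies one byte per char in UTF-8 but two in UTF-16,
# so a UTF-16 scanner touches twice the memory for the same input.
s = "GGTATTTTAATTTATAGT" * 1000
assert len(s.encode("utf-16-le")) == 2 * len(s.encode("utf-8"))
```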


MihuBot commented Feb 23, 2026

@danmoseley

OK, fishy: MihuBot was good before, but the second run now doesn't match my good local results. Let me see.

@stephentoub

> meaning SIMD has to do more chunks

Only kinda sorta. We still need to churn through 2x the bytes, but the core logic when only ASCII is concerned is handled by "packed" variants of the algorithms, which combine two vectors of chars into a single vector of bytes.

