Add RegexOptions.AnyNewLine via parser lowering #124701
danmoseley wants to merge 17 commits into dotnet:main
Add AnyNewLine = 0x0800 to RegexOptions enum. Update ValidateOptions to bump MaxOptionShift to 12 and reject AnyNewLine | NonBacktracking. ECMAScript already rejects unknown options via allowlist. Update source generator to include AnyNewLine in SupportedOptions mask. Update tests that used 0x800 as an invalid option value to use 0x1000. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
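The validation described in this commit can be sketched roughly as follows. This is only an illustration under stated assumptions: `OptionCheckSketch` and `IsValid` are hypothetical names rather than the real `ValidateOptions` code, the flag is built from its raw value so the sketch compiles on runtimes that lack the new member, and the ECMAScript allowlist is not modeled.

```csharp
using System.Text.RegularExpressions;

// Hypothetical stand-in for the option validation described above; not the real Regex.ValidateOptions.
public static class OptionCheckSketch
{
    // Value proposed by this PR, cast from raw bits (assumption: the enum member may not exist yet).
    public const RegexOptions AnyNewLine = (RegexOptions)0x0800;

    // With 12 valid bits (0x0001 through 0x0800), any bit at position 12 or above is out of range.
    private const int MaxOptionShift = 12;

    public static bool IsValid(RegexOptions options)
    {
        // Reject unknown high bits such as 0x1000.
        if (((uint)options >> MaxOptionShift) != 0)
        {
            return false;
        }

        // Reject AnyNewLine combined with NonBacktracking.
        if ((options & AnyNewLine) != 0 && (options & RegexOptions.NonBacktracking) != 0)
        {
            return false;
        }

        return true;
    }
}
```

Under these assumptions, `IsValid((RegexOptions)0x1000)` is false, which is why the tests that previously used 0x800 as an invalid value move to 0x1000.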
When AnyNewLine is set without Multiline, lower $ from EndZ into an equivalent sub-tree: (?=\r\n\z|\r?\z)|(?<!\r)(?=\n\z) This matches at end of string, or before \r\n, \r, or \n at end of string, but not between \r and \n. Works across all engines (interpreter, compiled, source generator) since it's pure parser lowering. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When AnyNewLine is set, lower \Z using the same sub-tree as $ without Multiline. \Z is not affected by Multiline, so the same lowering applies regardless. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
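Because the lowering is expressed with ordinary regex constructs, its behavior can be explored on an existing runtime by writing the sub-tree out as a pattern. The sketch below does that for the non-Multiline `$` / `\Z` form quoted in the commit messages above; `EndZLoweringSketch` and its members are hypothetical helper names, and the PR itself builds RegexNode trees in the parser rather than pattern strings.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Illustration only: spells out the lowered sub-tree as pattern text so it runs on an existing Regex.
public static class EndZLoweringSketch
{
    // Pattern text quoted in the commit message, wrapped in a non-capturing group so it can be concatenated.
    public const string LoweredEndZ = @"(?:(?=\r\n\z|\r?\z)|(?<!\r)(?=\n\z))";

    // True when `literal` occurs in `input` immediately followed by the lowered end anchor.
    public static bool EndsWith(string literal, string input) =>
        Regex.IsMatch(input, Regex.Escape(literal) + LoweredEndZ);

    // Every position at which the lowered anchor matches on its own.
    public static List<int> AnchorPositions(string input)
    {
        var positions = new List<int>();
        foreach (Match m in Regex.Matches(input, LoweredEndZ))
        {
            positions.Add(m.Index);
        }
        return positions;
    }
}
```

For example, `EndsWith("abc", "abc\r\n")` is true, whereas the plain pattern `abc$` without Multiline does not match that input because `$` only tolerates a final `\n`. On `"a\r\n"` the anchor matches at positions 1 and 3 but not at 2, between the `\r` and the `\n`.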
When both Multiline and AnyNewLine are set, lower $ to: (?=\r\n|\r|\z)|(?<!\r)(?=\n) This matches at \r\n, \r, \n boundaries and end-of-string, without matching between \r and \n of a \r\n sequence. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When both Multiline and AnyNewLine are set, lower ^ to: (?<=\A|\r\n|\n)|(?<=\r)(?!\n) This matches after \r\n, \n, bare \r (not followed by \n), and at start of string. Without Multiline, ^ remains \A unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
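The two Multiline lowerings can be checked the same way, by writing the quoted sub-trees as pattern text and listing where each one matches. This is a sketch for experimentation, not the parser code: `MultilineLoweringSketch` and `Positions` are hypothetical names, and only the `\r` / `\n` forms quoted in these commit messages are covered.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Illustration only: the Multiline + AnyNewLine lowerings written as plain pattern text.
public static class MultilineLoweringSketch
{
    // Lowered `$`: before \r\n, before \r, at end of string, or before a \n that is not preceded by \r.
    public const string LoweredEol = @"(?:(?=\r\n|\r|\z)|(?<!\r)(?=\n))";

    // Lowered `^`: at start of string, after \r\n, after \n, or after a \r that is not followed by \n.
    public const string LoweredBol = @"(?:(?<=\A|\r\n|\n)|(?<=\r)(?!\n))";

    // Every position at which the given zero-width pattern matches.
    public static List<int> Positions(string anchorPattern, string input)
    {
        var positions = new List<int>();
        foreach (Match m in Regex.Matches(input, anchorPattern))
        {
            positions.Add(m.Index);
        }
        return positions;
    }
}
```

On the input `"a\r\nb\rc\nd"`, the lowered `$` matches at positions 1, 4, 6, and 8, and the lowered `^` at 0, 3, 5, and 7. Neither matches at position 2, the gap inside the `\r\n` pair.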
When AnyNewLine is set (without Singleline), lower . to [^\n\r] instead of [^\n], so dot does not match \r or \n. Add NotNewLineOrCarriageReturnClass constant to RegexCharClass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
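The effect of the dot lowering can be seen by comparing the default `.` (which excludes only `\n`) with the class `[^\n\r]` named in this commit. A minimal sketch, assuming a hypothetical helper `DotLoweringSketch.Runs` that returns each maximal run matched by a one-or-more quantifier:

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Illustration only: compares default dot with the lowered class from the commit message.
public static class DotLoweringSketch
{
    // Class text quoted in the commit message.
    public const string LoweredDot = @"[^\n\r]";

    // Returns every maximal run matched by `atom` followed by `+`.
    public static string[] Runs(string atom, string input) =>
        Regex.Matches(input, atom + "+").Cast<Match>().Select(m => m.Value).ToArray();
}
```

With the input `"ab\r\ncd\re"`, `Runs(".", input)` yields two runs that still carry the carriage returns (`"ab\r"` and `"cd\re"`), while `Runs(DotLoweringSketch.LoweredDot, input)` yields `"ab"`, `"cd"`, and `"e"`.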
Combined ^/$/. tests, Replace/Split, RightToLeft, mixed newlines, empty lines, \Z with trailing newlines, and edge cases. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Integration tests using a ~50 char string with all newline types (\r\n, \r, \n, \u0085, \u2028, \u2029) exercising ^, $, \Z, and . together. Replace/Split tests with MatchEvaluator line numbering. Deduplicated cases moved into per-feature tests (RightToLeft, empty lines). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Expand test coverage across all AnyNewLine-affected constructs:
- Dollar, EndZ, DollarMultiline, CaretMultiline, Dot test data
with adjacent newlines, newlines at string boundaries,
empty segments, RightToLeft, and all Unicode newline types
- Advanced tests: inline options, backreferences, conditionals,
alternation with anchors, lookahead/lookbehind, quantified dot,
lazy quantifiers, named/atomic groups, word boundaries near
newlines, explicit char classes unaffected
- Methods test: IsMatch, Count, EnumerateMatches, Match with
startat, Replace with group ref, Split
- Unicode expansion: \s/\S behavior, \w behavior, \p{Zl}/\p{Zp}
categories, adjacent Unicode+ASCII newlines, baselines without
AnyNewLine
No bugs found — all initial test failures were wrong expectations.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Verify the fixer correctly emits RegexOptions.Multiline | RegexOptions.AnyNewLine in enum value order when upgrading to GeneratedRegex. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Test cases derived from cross-validation with PCRE2 NEWLINE_ANY behavior (BSD-licensed) and analysis of real-world patterns from dotnet/runtime-assets:
- (.+)# greedy where .+ cannot cross newlines (PCRE2 JIT 472)
- (.)(.) requiring consecutive non-newlines (PCRE2 JIT 471)
- (.). with mixed newline types (PCRE2 JIT 469)
- Blank line detection (^ +$) with \n, \r\n, \u0085 separators
All 31,528 tests pass. No bugs found: our implementation is fully consistent with PCRE2 NEWLINE_ANY behavior and handles real-world patterns correctly.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add more RightToLeft + AnyNewLine tests (various newline types, dot, anchors, \Z)
- Add more Singleline | AnyNewLine tests (all newline types, combined with Multiline)
- Replace RegexOptions.AnyNewLine with RegexHelpers.RegexOptionAnyNewLine throughout tests for net481 compilation compatibility
- Wrap Count/EnumerateMatches in #if NET for net481 compat
- Add clarifying comments on Split behavior with/without AnyNewLine
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
(Finally got around to having AI finish my lowering branch.)
Pull request overview
This pull request implements RegexOptions.AnyNewLine (value 0x0800 = 2048), a new regex option that makes ^, $, \Z, and . recognize all Unicode line boundaries (\r, \r\n, \n, \u0085 NEL, \u2028 LS, \u2029 PS) instead of only \n. This addresses a major usability issue where users had to manually work around .NET's hardcoded \n-only line ending behavior.
Changes:
- Added `RegexOptions.AnyNewLine = 0x0800` enum value with incompatibility checks for NonBacktracking and ECMAScript modes
- Implemented parser-level lowering of `^`, `$`, `\Z`, and `.` into equivalent lookaround-based RegexNode trees when AnyNewLine is enabled
- Added comprehensive test coverage (~800 new test lines) covering all anchor types, newline combinations, RightToLeft mode, inline options, and edge cases
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexOptions.cs | Added AnyNewLine = 0x0800 enum value with XML documentation |
| src/libraries/System.Text.RegularExpressions/ref/System.Text.RegularExpressions.cs | Updated ref assembly with AnyNewLine = 2048 |
| src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs | Updated MaxOptionShift to 12 and added AnyNewLine to NonBacktracking incompatibility check |
| src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexParser.cs | Implemented lowering methods (AnyNewLineEndZNode, AnyNewLineEolNode, AnyNewLineBolNode) and integrated into ^, $, \Z, . parsing |
| src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs | Added NotNewLineOrCarriageReturnClass constant for . with AnyNewLine |
| src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Parser.cs | Added AnyNewLine to source generator's supported options |
| src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.Match.Tests.cs | Added ~800 lines of comprehensive tests for all anchor types, newline combinations, and edge cases |
| src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.Tests.Common.cs | Added RegexOptionAnyNewLine constant for test compatibility |
| src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.Ctor.Tests.cs | Updated invalid option test from 0x800 to 0x1000; added NonBacktracking+AnyNewLine incompatibility test |
| src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.MultipleMatches.Tests.cs | Updated invalid option comments and tests from 0x800 to 0x1000 |
| src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.EnumerateMatches.Tests.cs | Updated invalid option tests from 0x800 to 0x1000 |
| src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/RegexGeneratorParserTests.cs | Updated invalid option tests from 0x800 to 0x1000 |
| src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/UpgradeToGeneratedRegexAnalyzerTests.cs | Updated tests for 0x1000 as invalid option; added AnyNewLine test case for code fixer |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@MihuBot benchmark Regex
See benchmark results at https://gist.github.com/MihuBot/c7399f4f318e4febcfd0018436d5fe53
Mihubot confirms zero perf impact on existing patterns/options,
AnyNewLine Performance Analysis (Release, Compiled, .NET 11.0, BenchmarkDotNet)
Measured impact of converting existing newline-workaround patterns to simplified AnyNewLine equivalents. All scenarios use
Section 1: Real-World Patterns on Windows
| Old Pattern | New Pattern (+ AnyNewLine) | Old (µs) | New (µs) | Ratio |
|---|---|---|---|---|
| `^.+\r?$` (1K lines) | `^.+$` | 46.7 | 48.8 | 1.05x |
| `^.+\r?$` (10K lines) | `^.+$` | 1,694 | 1,760 | 1.04x |
| `\[assembly:...\]\s*$(\r?\n)?` | `\[assembly:...\]\s*$` | 38.3 | 32.4 | 0.85x |
| `^([^\s:]+):\s*(.+?)\r?$` | `^([^\s:]+):\s*(.+?)$` | 105.9 | 105.9 | 1.00x |
| `^# .+\r?$` | `^# .+$` | 11.1 | 9.1 | 0.83x |
| `^.+\r?$` (CSV, 1K rows) | `^.+$` | 44.4 | 49.2 | 1.11x |
| `[^\r\n]+` | `.+` | 44.2 | 43.8 | 0.99x |
| `\w+\r?$` | `\w+$` | 90.8 | 128.7 | 1.42x |
| `(?:^\|\r\n)\w+` | `^\w+` | 208.7 | 214.5 | 1.03x |
Section 2: Unix \n Text (overhead of just enabling the flag)
| Old Pattern | New Pattern (+ AnyNewLine) | Old (µs) | New (µs) | Ratio |
|---|---|---|---|---|
| `^.+$` | `^.+$` | 43.5 | 48.9 | 1.12x |
| `[^\n]+` | `.+` | 39.0 | 44.9 | 1.15x |
Section 3: Mixed \n/\r\n Text
| Old Pattern | New Pattern (+ AnyNewLine) | Old (µs) | New (µs) | Ratio |
|---|---|---|---|---|
| `[^\r\n\u0085\u2028\u2029]+` | `.+` | 45.4 | 44.2 | 0.97x |
| `^.+\r?$` (1K lines) | `^.+$` | 44.1 | 50.1 | 1.14x |
Section 4: Non-anchor/dot Patterns (zero impact expected)
| Old Pattern | New Pattern (+ AnyNewLine) | Old (µs) | New (µs) | Ratio |
|---|---|---|---|---|
| `\r\n\|\r\|\n` | `\r\n\|\r\|\n` | 20.0 | 21.7 | 1.08x |
| `\w+` | `\w+` | 322.4 | 336.4 | 1.04x |
Section 5: Pathological Cases (unlikely in practice)
| Old Pattern | New Pattern (+ AnyNewLine) | Old (µs) | New (µs) | Ratio |
|---|---|---|---|---|
| `$` | `$` | 98.2 | 134.1 | 1.37x |
| `^` | `^` | 145.6 | 131.6 | 0.90x |
| `\w+\r?\Z` (329K chars) | `\w+\Z` | 494.2 | 1,039.3 | 2.10x |
Summary
- Most real-world patterns in Compiled mode show 0.83x--1.14x, which is essentially zero cost, and some are faster because the AnyNewLine pattern is simpler (e.g., `^# .+$` vs `^# .+\r?$`: removing the `\r?` node saves more than the lowered `$` costs).
- Where small regressions occur (1.1x--1.4x), the cause is the lowered anchor tree: a native `$` (Eol) is a single "is next char `\n`?" check, but AnyNewLine lowers it to a lookahead alternation like `(?=\r\n|\r|\n|\u0085|\u2028|\u2029|\z)`. Even when the input only contains `\r\n`, the engine must evaluate the alternation branches. This overhead is proportionally more visible when the anchor dominates the work (e.g., `\w+$` where the `\w+` match is short), and nearly invisible when `.+` dominates each line's work (e.g., `^.+$` at 1.04x).
- Patterns without anchors or dot are unaffected (1.04--1.08x, within noise): the flag only changes behavior of `.`, `^`, `$`, `\Z`.
- Only pathological case: `\w+\Z` on very large input (329K chars) at 2.1x, because the lowered `\Z` alternation tree is evaluated during backtracking at many positions. Unlikely in practice.
- In Compiled/source-generated mode, the JIT compiles the lowered alternation branches into efficient single-char comparisons, keeping overhead minimal. Interpreted mode shows larger gaps (2--3x for typical patterns) but AnyNewLine + interpreted + perf-sensitive is an unlikely combination.
Benchmark source code (BenchmarkDotNet)
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Columns;
using BenchmarkDotNet.Reports;
using BenchmarkDotNet.Toolchains.InProcess.Emit;
BenchmarkRunner.Run<AnyNewLineBenchmarks>(
DefaultConfig.Instance
.WithSummaryStyle(SummaryStyle.Default.WithRatioStyle(RatioStyle.Percentage))
.AddJob(Job.ShortRun.WithToolchain(InProcessEmitToolchain.Instance)));
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "RatioSD", "Alloc Ratio")]
public class AnyNewLineBenchmarks
{
private const RegexOptions AnyNewLine = (RegexOptions)0x0800;
private static string GenerateText(int lineCount, string[] newlines)
{
var sb = new StringBuilder();
for (int i = 0; i < lineCount; i++)
{
sb.Append("Lorem ipsum dolor sit amet ");
sb.Append(i);
sb.Append(newlines[i % newlines.Length]);
}
return sb.ToString();
}
private static readonly string WinText1K = GenerateText(1000, ["\r\n"]);
private static readonly string WinText10K = GenerateText(10000, ["\r\n"]);
private static readonly string UnixText1K = GenerateText(1000, ["\n"]);
private static readonly string MixedNR1K = GenerateText(1000, ["\n", "\r\n"]);
private static readonly string MixedAll1K = GenerateText(1000,
["\n", "\r\n", "\r", "\u0085", "\u2028", "\u2029"]);
private static readonly string AssemblyInfo;
private static readonly string KvConfig;
private static readonly string Markdown;
private static readonly string CsvData;
static AnyNewLineBenchmarks()
{
var sb = new StringBuilder();
string[] attrs = {
"[assembly: AssemblyTitle(\"MyApp\")]",
"[assembly: AssemblyDescription(\"A sample app\")]",
"[assembly: AssemblyConfiguration(\"\")]",
"[assembly: AssemblyCompany(\"Contoso\")]",
"[assembly: AssemblyProduct(\"MyApp\")]",
"[assembly: AssemblyCopyright(\"Copyright 2024\")]",
"[assembly: AssemblyTrademark(\"\")]",
"[assembly: AssemblyCulture(\"\")]",
"[assembly: AssemblyVersion(\"1.0.0.0\")]",
"[assembly: AssemblyFileVersion(\"1.0.0.0\")]"
};
foreach (var attr in attrs) { sb.Append(attr); sb.Append("\r\n"); }
AssemblyInfo = string.Concat(Enumerable.Repeat(sb.ToString(), 50));
sb.Clear();
string[] keys = { "Server", "Database", "User", "Password", "Timeout",
"MaxPool", "MinPool", "Encrypt", "TrustCert", "AppName" };
for (int i = 0; i < 50; i++)
{
sb.Append(keys[i % keys.Length]); sb.Append(": value_"); sb.Append(i); sb.Append("\r\n");
}
KvConfig = string.Concat(Enumerable.Repeat(sb.ToString(), 20));
sb.Clear();
for (int i = 0; i < 200; i++)
{
sb.Append($"# Heading {i}\r\n");
sb.Append($"Some paragraph text about topic {i}.\r\n");
sb.Append($"Another line of content here.\r\n\r\n");
}
Markdown = sb.ToString();
sb.Clear();
sb.Append("Name,Age,City,Email\r\n");
for (int i = 0; i < 1000; i++)
sb.Append($"User{i},{20 + i % 50},City{i % 100},user{i}@example.com\r\n");
CsvData = sb.ToString();
}
// Section 1: Real-world on Windows \r\n text
private static readonly Regex Old_1a = new(@"^.+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_1a = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Baseline = true, Description = "1a_Lines1K_Old")]
public int Lines1K_Old() => Old_1a.Matches(WinText1K).Count;
[Benchmark(Description = "1a_Lines1K_New")]
public int Lines1K_New() => New_1a.Matches(WinText1K).Count;
private static readonly Regex Old_1b = new(@"^.+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_1b = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "1b_Lines10K_Old")]
public int Lines10K_Old() => Old_1b.Matches(WinText10K).Count;
[Benchmark(Description = "1b_Lines10K_New")]
public int Lines10K_New() => New_1b.Matches(WinText10K).Count;
private static readonly Regex Old_2 = new(@"\[assembly:\s*\w+\(.*?\)\]\s*$(\r?\n)?",
RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_2 = new(@"\[assembly:\s*\w+\(.*?\)\]\s*$",
RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "2_Assembly_Old")]
public int Assembly_Old() => Old_2.Matches(AssemblyInfo).Count;
[Benchmark(Description = "2_Assembly_New")]
public int Assembly_New() => New_2.Matches(AssemblyInfo).Count;
private static readonly Regex Old_3 = new(@"^([^\s:]+):\s*(.+?)\r?$",
RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_3 = new(@"^([^\s:]+):\s*(.+?)$",
RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "3_KeyVal_Old")]
public int KeyVal_Old() => Old_3.Matches(KvConfig).Count;
[Benchmark(Description = "3_KeyVal_New")]
public int KeyVal_New() => New_3.Matches(KvConfig).Count;
private static readonly Regex Old_4 = new(@"^# .+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_4 = new(@"^# .+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "4_Markdown_Old")]
public int Markdown_Old() => Old_4.Matches(Markdown).Count;
[Benchmark(Description = "4_Markdown_New")]
public int Markdown_New() => New_4.Matches(Markdown).Count;
private static readonly Regex Old_5 = new(@"^.+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_5 = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "5_CSV_Old")]
public int CSV_Old() => Old_5.Matches(CsvData).Count;
[Benchmark(Description = "5_CSV_New")]
public int CSV_New() => New_5.Matches(CsvData).Count;
private static readonly Regex Old_6 = new(@"[^\r\n]+", RegexOptions.Compiled);
private static readonly Regex New_6 = new(@".+", RegexOptions.Compiled | AnyNewLine);
[Benchmark(Description = "6_DotExcl_Old")]
public int DotExcl_Old() => Old_6.Matches(WinText1K).Count;
[Benchmark(Description = "6_DotExcl_New")]
public int DotExcl_New() => New_6.Matches(WinText1K).Count;
private static readonly Regex Old_7 = new(@"\w+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_7 = new(@"\w+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "7_WordEOL_Old")]
public int WordEOL_Old() => Old_7.Matches(WinText1K).Count;
[Benchmark(Description = "7_WordEOL_New")]
public int WordEOL_New() => New_7.Matches(WinText1K).Count;
private static readonly Regex Old_8 = new(@"(?:^|\r\n)\w+", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_8 = new(@"^\w+", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "8_LineSt_Old")]
public int LineStart_Old() => Old_8.Matches(WinText1K).Count;
[Benchmark(Description = "8_LineSt_New")]
public int LineStart_New() => New_8.Matches(WinText1K).Count;
// Section 2: Unix \n text (control)
private static readonly Regex Old_9 = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_9 = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "9_UnixLines_Old")]
public int UnixLines_Old() => Old_9.Matches(UnixText1K).Count;
[Benchmark(Description = "9_UnixLines_New")]
public int UnixLines_New() => New_9.Matches(UnixText1K).Count;
private static readonly Regex Old_10 = new(@"[^\n]+", RegexOptions.Compiled);
private static readonly Regex New_10 = new(@".+", RegexOptions.Compiled | AnyNewLine);
[Benchmark(Description = "10_UnixDot_Old")]
public int UnixDot_Old() => Old_10.Matches(UnixText1K).Count;
[Benchmark(Description = "10_UnixDot_New")]
public int UnixDot_New() => New_10.Matches(UnixText1K).Count;
// Section 3: Mixed newline text
private static readonly Regex Old_11 = new(@"[^\r\n\u0085\u2028\u2029]+", RegexOptions.Compiled);
private static readonly Regex New_11 = new(@".+", RegexOptions.Compiled | AnyNewLine);
[Benchmark(Description = "11_MixedDot_Old")]
public int MixedDot_Old() => Old_11.Matches(MixedAll1K).Count;
[Benchmark(Description = "11_MixedDot_New")]
public int MixedDot_New() => New_11.Matches(MixedAll1K).Count;
private static readonly Regex Old_12 = new(@"^.+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_12 = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "12_MixedLines_Old")]
public int MixedLines_Old() => Old_12.Matches(MixedNR1K).Count;
[Benchmark(Description = "12_MixedLines_New")]
public int MixedLines_New() => New_12.Matches(MixedNR1K).Count;
// Section 4: Non-anchor patterns (zero impact)
private static readonly Regex Old_14 = new(@"\r\n|\r|\n", RegexOptions.Compiled);
private static readonly Regex New_14 = new(@"\r\n|\r|\n", RegexOptions.Compiled | AnyNewLine);
[Benchmark(Description = "14_Literal_Old")]
public int Literal_Old() => Old_14.Matches(MixedAll1K).Count;
[Benchmark(Description = "14_Literal_New")]
public int Literal_New() => New_14.Matches(MixedAll1K).Count;
private static readonly Regex Old_15 = new(@"\w+", RegexOptions.Compiled);
private static readonly Regex New_15 = new(@"\w+", RegexOptions.Compiled | AnyNewLine);
[Benchmark(Description = "15_Words_Old")]
public int Words_Old() => Old_15.Matches(WinText1K).Count;
[Benchmark(Description = "15_Words_New")]
public int Words_New() => New_15.Matches(WinText1K).Count;
// Section 5: Pathological
private static readonly Regex Old_P1 = new(@"$", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_P1 = new(@"$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "P1_BareEOL_Old")]
public int BareEOL_Old() => Old_P1.Matches(WinText1K).Count;
[Benchmark(Description = "P1_BareEOL_New")]
public int BareEOL_New() => New_P1.Matches(WinText1K).Count;
private static readonly Regex Old_P2 = new(@"^", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_P2 = new(@"^", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "P2_BareBOL_Old")]
public int BareBOL_Old() => Old_P2.Matches(WinText1K).Count;
[Benchmark(Description = "P2_BareBOL_New")]
public int BareBOL_New() => New_P2.Matches(WinText1K).Count;
private static readonly Regex Old_P3 = new(@"\w+\r?\Z", RegexOptions.Compiled);
private static readonly Regex New_P3 = new(@"\w+\Z", RegexOptions.Compiled | AnyNewLine);
[Benchmark(Description = "P3_EndZ_Old")]
public bool EndZ_Old() => Old_P3.IsMatch(WinText10K);
[Benchmark(Description = "P3_EndZ_New")]
public bool EndZ_New() => New_P3.IsMatch(WinText10K);
}
Simplify the lowered trees for $, ^, and \Z anchors:
- Eol ($): Merge \r\n|\r|[\u0085\u2028\u2029] into [\r\u0085\u2028\u2029] (4 branches -> 2)
  \r covers both \r\n and bare \r since lookahead only checks first char
- Bol (^): Merge \r\n|\n|[\u0085\u2028\u2029] into [\n\u0085\u2028\u2029] (4 branches -> 2)
  \n covers both \r\n and bare \n since lookbehind only checks last char
- EndZ (\Z): Merge \r? and [\u0085\u2028\u2029] into [\r\u0085\u2028\u2029]? (3 branches -> 2)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
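The claim that the merged character classes are equivalent to the separate branches can be spot-checked by comparing match positions on an input containing every newline kind. The sketch below assumes the branch lists quoted in the commit message sit inside a lookahead (for `$`) and a lookbehind (for `^`); `MergedBranchSketch` and its members are hypothetical names, and the fragments shown are not the complete lowered trees.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Illustration only: compares the separate-branch and merged-class forms of the lookaround fragments.
public static class MergedBranchSketch
{
    public const string EolBranches = @"(?=\r\n|\r|[\u0085\u2028\u2029])";
    public const string EolMerged = @"(?=[\r\u0085\u2028\u2029])";

    public const string BolBranches = @"(?<=\r\n|\n|[\u0085\u2028\u2029])";
    public const string BolMerged = @"(?<=[\n\u0085\u2028\u2029])";

    // Every position at which the given zero-width pattern matches.
    public static List<int> Positions(string pattern, string input) =>
        Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Index).ToList();

    // True when both forms match at exactly the same positions of `input`.
    public static bool SamePositions(string a, string b, string input) =>
        Positions(a, input).SequenceEqual(Positions(b, input));
}
```

On `"a\r\nb\rc\u0085d\u2028e\u2029f\n"` both Eol forms match at positions 1, 4, 6, 8, and 10, and both Bol forms at 3, 7, 9, 11, and 13. A single input is only a sanity check, not a proof of equivalence.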
Thinking about how to optimize '.' more here:
The compiled JIT for Notone is literally one compare-and-branch. The character class, even with a bitmap optimization, has to do range subtraction + bit shifting. Per character, that's maybe
So the overhead is not from "more chars to match against"; it's from crossing the boundary between the fast-path Notone node type (single char compare) and the generic Set node type.
In principle, the engine could be taught to recognize small negated character classes and emit a chain of != comparisons instead of bitmap ops, but that would be an engine optimization beyond the scope of this PR.
Updated AnyNewLine Performance Analysis (with anchor optimization)
After the initial analysis above, I optimized the lowered anchor trees by merging redundant alternation branches:
Re-measured with BenchmarkDotNet MediumRun (15 target iterations, 3 warmup; more stable than previous ShortRun). All scenarios use
Section 1: Real-World Patterns on Windows
| Old Pattern | New Pattern (+ AnyNewLine) | Old (µs) | New (µs) | Ratio |
|---|---|---|---|---|
| `^.+\r?$` (1K lines) | `^.+$` | 44.6 | 49.9 | 1.12x |
| `^.+\r?$` (10K lines) | `^.+$` | 1,749 | 1,797 | 1.03x |
| `\[assembly:...\]\s*$(\r?\n)?` | `\[assembly:...\]\s*$` | 37.1 | 33.1 | 0.89x |
| `^([^\s:]+):\s*(.+?)\r?$` | `^([^\s:]+):\s*(.+?)$` | 105.8 | 100.0 | 0.95x |
| `^# .+\r?$` | `^# .+$` | 11.0 | 9.2 | 0.83x |
| `^.+\r?$` (CSV, 1K rows) | `^.+$` | 44.4 | 47.5 | 1.07x |
| `[^\r\n]+` | `.+` | 41.9 | 43.4 | 1.04x |
| `\w+\r?$` | `\w+$` | 85.1 | 112.9 | 1.33x |
| `(?:^\|\r\n)\w+` | `^\w+` | 195.7 | 193.4 | 0.99x |
Section 2: Unix \n Text (overhead of just enabling the flag)
| Old Pattern | New Pattern (+ AnyNewLine) | Old (µs) | New (µs) | Ratio |
|---|---|---|---|---|
| `^.+$` | `^.+$` | 42.3 | 46.6 | 1.10x |
| `[^\n]+` | `.+` | 36.3 | 42.9 | 1.18x |
Section 3: Mixed \n/\r\n Text
| Old Pattern | New Pattern (+ AnyNewLine) | Old (µs) | New (µs) | Ratio |
|---|---|---|---|---|
| `[^\r\n\u0085\u2028\u2029]+` | `.+` | 43.5 | 45.6 | 1.05x |
| `^.+\r?$` (1K lines) | `^.+$` | 46.0 | 50.5 | 1.10x |
Section 4: Non-anchor/dot Patterns (zero impact expected)
| Old Pattern | New Pattern (+ AnyNewLine) | Old (µs) | New (µs) | Ratio |
|---|---|---|---|---|
| `\r\n\|\r\|\n` | `\r\n\|\r\|\n` | 20.1 | 19.6 | 0.98x |
| `\w+` | `\w+` | 286.5 | 285.9 | 1.00x |
Section 5: Pathological Cases (unlikely in practice)
| Old Pattern | New Pattern (+ AnyNewLine) | Old (µs) | New (µs) | Ratio |
|---|---|---|---|---|
| `$` (1K bare evals) | `$` | 98.6 | 120.3 | 1.22x |
| `^` (1K bare evals) | `^` | 133.3 | 108.3 | 0.81x |
| `\w+\r?\Z` (329K chars) | `\w+\Z` | 480.3 | 931.5 | 1.94x |
Summary
- Most real-world patterns in Compiled mode show 0.83x–1.12x, which is essentially zero to modest cost, and some are faster because the AnyNewLine pattern is simpler (e.g., `^# .+$` vs `^# .+\r?$`: removing the `\r?` node saves more than the lowered `$` costs).
- Where regressions occur (1.1x–1.3x), the cause is the lowered anchor/dot trees: a native `$` (Eol) is a single "is next char `\n`?" check, but AnyNewLine lowers it to a lookahead with a character class `[\r\u0085\u2028\u2029]` plus `\z`. Even when the input only contains `\r\n`, the engine must evaluate the character class. Similarly, `.` becomes `[^\r\n\u0085\u2028\u2029]`, which crosses from the fast-path `Notone` node (single `ch != '\n'` comparison) to a `Set` node (bitmap/bitmask operation; branchless but ~3-5 IL ops per character). This overhead is proportionally more visible when the anchor/dot dominates the work (e.g., `\w+$` where the `\w+` match is short, or `[^\n]+` → `.+` on Unix text), and nearly invisible when the overall pattern has other dominant work.
- The anchor optimization improved the worst real-world case (`\w+$`) from 1.42x to 1.33x, and made bare `^` actually faster (0.81x) by using a 2-branch merged character class instead of 4 separate alternation branches.
- Patterns without anchors or dot are unaffected (0.98x–1.00x): the flag only changes behavior of `.`, `^`, `$`, `\Z`.
- Only pathological case: `\w+\Z` on very large input (329K chars) at 1.94x, because the lowered `\Z` alternation tree is evaluated during backtracking at many positions. Unlikely in practice.
- In Compiled/source-generated mode, the JIT compiles the lowered alternation branches into efficient single-char comparisons, keeping overhead minimal. Interpreted mode would show larger gaps but AnyNewLine + interpreted + perf-sensitive is an unlikely combination.
Benchmark source code (BenchmarkDotNet, MediumRun)
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Columns;
using BenchmarkDotNet.Reports;
using BenchmarkDotNet.Toolchains.InProcess.Emit;
BenchmarkRunner.Run<AnyNewLineBenchmarks>(
DefaultConfig.Instance
.WithSummaryStyle(SummaryStyle.Default.WithRatioStyle(RatioStyle.Percentage))
.AddJob(Job.MediumRun.WithToolchain(InProcessEmitToolchain.Instance)));
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "RatioSD", "Alloc Ratio")]
public class AnyNewLineBenchmarks
{
private const RegexOptions AnyNewLine = (RegexOptions)0x0800;
private static string GenerateText(int lineCount, string[] newlines)
{
var sb = new StringBuilder();
for (int i = 0; i < lineCount; i++)
{
sb.Append("Lorem ipsum dolor sit amet ");
sb.Append(i);
sb.Append(newlines[i % newlines.Length]);
}
return sb.ToString();
}
private static readonly string WinText1K = GenerateText(1000, ["\r\n"]);
private static readonly string WinText10K = GenerateText(10000, ["\r\n"]);
private static readonly string UnixText1K = GenerateText(1000, ["\n"]);
private static readonly string MixedNR1K = GenerateText(1000, ["\n", "\r\n"]);
private static readonly string MixedAll1K = GenerateText(1000,
["\n", "\r\n", "\r", "\u0085", "\u2028", "\u2029"]);
private static readonly string AssemblyInfo;
private static readonly string KvConfig;
private static readonly string Markdown;
private static readonly string CsvData;
static AnyNewLineBenchmarks()
{
var sb = new StringBuilder();
string[] attrs = {
"[assembly: AssemblyTitle(\"MyApp\")]",
"[assembly: AssemblyDescription(\"A sample app\")]",
"[assembly: AssemblyConfiguration(\"\")]",
"[assembly: AssemblyCompany(\"Contoso\")]",
"[assembly: AssemblyProduct(\"MyApp\")]",
"[assembly: AssemblyCopyright(\"Copyright 2024\")]",
"[assembly: AssemblyTrademark(\"\")]",
"[assembly: AssemblyCulture(\"\")]",
"[assembly: AssemblyVersion(\"1.0.0.0\")]",
"[assembly: AssemblyFileVersion(\"1.0.0.0\")]"
};
foreach (var attr in attrs) { sb.Append(attr); sb.Append("\r\n"); }
AssemblyInfo = string.Concat(Enumerable.Repeat(sb.ToString(), 50));
sb.Clear();
string[] keys = { "Server", "Database", "User", "Password", "Timeout",
"MaxPool", "MinPool", "Encrypt", "TrustCert", "AppName" };
for (int i = 0; i < 50; i++)
{
sb.Append(keys[i % keys.Length]); sb.Append(": value_"); sb.Append(i); sb.Append("\r\n");
}
KvConfig = string.Concat(Enumerable.Repeat(sb.ToString(), 20));
sb.Clear();
for (int i = 0; i < 200; i++)
{
sb.Append($"# Heading {i}\r\n");
sb.Append($"Some paragraph text about topic {i}.\r\n");
sb.Append($"Another line of content here.\r\n\r\n");
}
Markdown = sb.ToString();
sb.Clear();
sb.Append("Name,Age,City,Email\r\n");
for (int i = 0; i < 1000; i++)
sb.Append($"User{i},{20 + i % 50},City{i % 100},user{i}@example.com\r\n");
CsvData = sb.ToString();
}
// Section 1: Real-world on Windows \r\n text
private static readonly Regex Old_1a = new(@"^.+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_1a = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Baseline = true, Description = "1a_Lines1K_Old")]
public int Lines1K_Old() => Old_1a.Matches(WinText1K).Count;
[Benchmark(Description = "1a_Lines1K_New")]
public int Lines1K_New() => New_1a.Matches(WinText1K).Count;
private static readonly Regex Old_1b = new(@"^.+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_1b = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "1b_Lines10K_Old")]
public int Lines10K_Old() => Old_1b.Matches(WinText10K).Count;
[Benchmark(Description = "1b_Lines10K_New")]
public int Lines10K_New() => New_1b.Matches(WinText10K).Count;
private static readonly Regex Old_2 = new(@"\[assembly:\s*\w+\(.*?\)\]\s*$(\r?\n)?",
RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_2 = new(@"\[assembly:\s*\w+\(.*?\)\]\s*$",
RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "2_Assembly_Old")]
public int Assembly_Old() => Old_2.Matches(AssemblyInfo).Count;
[Benchmark(Description = "2_Assembly_New")]
public int Assembly_New() => New_2.Matches(AssemblyInfo).Count;
private static readonly Regex Old_3 = new(@"^([^\s:]+):\s*(.+?)\r?$",
RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_3 = new(@"^([^\s:]+):\s*(.+?)$",
RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "3_KeyVal_Old")]
public int KeyVal_Old() => Old_3.Matches(KvConfig).Count;
[Benchmark(Description = "3_KeyVal_New")]
public int KeyVal_New() => New_3.Matches(KvConfig).Count;
private static readonly Regex Old_4 = new(@"^# .+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_4 = new(@"^# .+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "4_Markdown_Old")]
public int Markdown_Old() => Old_4.Matches(Markdown).Count;
[Benchmark(Description = "4_Markdown_New")]
public int Markdown_New() => New_4.Matches(Markdown).Count;
private static readonly Regex Old_5 = new(@"^.+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_5 = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "5_CSV_Old")]
public int CSV_Old() => Old_5.Matches(CsvData).Count;
[Benchmark(Description = "5_CSV_New")]
public int CSV_New() => New_5.Matches(CsvData).Count;
private static readonly Regex Old_6 = new(@"[^\r\n]+", RegexOptions.Compiled);
private static readonly Regex New_6 = new(@".+", RegexOptions.Compiled | AnyNewLine);
[Benchmark(Description = "6_DotExcl_Old")]
public int DotExcl_Old() => Old_6.Matches(WinText1K).Count;
[Benchmark(Description = "6_DotExcl_New")]
public int DotExcl_New() => New_6.Matches(WinText1K).Count;
private static readonly Regex Old_7 = new(@"\w+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_7 = new(@"\w+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "7_WordEOL_Old")]
public int WordEOL_Old() => Old_7.Matches(WinText1K).Count;
[Benchmark(Description = "7_WordEOL_New")]
public int WordEOL_New() => New_7.Matches(WinText1K).Count;
private static readonly Regex Old_8 = new(@"(?:^|\r\n)\w+", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_8 = new(@"^\w+", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "8_LineSt_Old")]
public int LineStart_Old() => Old_8.Matches(WinText1K).Count;
[Benchmark(Description = "8_LineSt_New")]
public int LineStart_New() => New_8.Matches(WinText1K).Count;
// Section 2: Unix \n text (control)
private static readonly Regex Old_9 = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_9 = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "9_UnixLines_Old")]
public int UnixLines_Old() => Old_9.Matches(UnixText1K).Count;
[Benchmark(Description = "9_UnixLines_New")]
public int UnixLines_New() => New_9.Matches(UnixText1K).Count;
private static readonly Regex Old_10 = new(@"[^\n]+", RegexOptions.Compiled);
private static readonly Regex New_10 = new(@".+", RegexOptions.Compiled | AnyNewLine);
[Benchmark(Description = "10_UnixDot_Old")]
public int UnixDot_Old() => Old_10.Matches(UnixText1K).Count;
[Benchmark(Description = "10_UnixDot_New")]
public int UnixDot_New() => New_10.Matches(UnixText1K).Count;
// Section 3: Mixed newline text
private static readonly Regex Old_11 = new(@"[^\r\n\u0085\u2028\u2029]+", RegexOptions.Compiled);
private static readonly Regex New_11 = new(@".+", RegexOptions.Compiled | AnyNewLine);
[Benchmark(Description = "11_MixedDot_Old")]
public int MixedDot_Old() => Old_11.Matches(MixedAll1K).Count;
[Benchmark(Description = "11_MixedDot_New")]
public int MixedDot_New() => New_11.Matches(MixedAll1K).Count;
private static readonly Regex Old_12 = new(@"^.+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_12 = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "12_MixedLines_Old")]
public int MixedLines_Old() => Old_12.Matches(MixedNR1K).Count;
[Benchmark(Description = "12_MixedLines_New")]
public int MixedLines_New() => New_12.Matches(MixedNR1K).Count;
// Section 4: Non-anchor patterns (zero impact)
private static readonly Regex Old_14 = new(@"\r\n|\r|\n", RegexOptions.Compiled);
private static readonly Regex New_14 = new(@"\r\n|\r|\n", RegexOptions.Compiled | AnyNewLine);
[Benchmark(Description = "14_Literal_Old")]
public int Literal_Old() => Old_14.Matches(MixedAll1K).Count;
[Benchmark(Description = "14_Literal_New")]
public int Literal_New() => New_14.Matches(MixedAll1K).Count;
private static readonly Regex Old_15 = new(@"\w+", RegexOptions.Compiled);
private static readonly Regex New_15 = new(@"\w+", RegexOptions.Compiled | AnyNewLine);
[Benchmark(Description = "15_Words_Old")]
public int Words_Old() => Old_15.Matches(WinText1K).Count;
[Benchmark(Description = "15_Words_New")]
public int Words_New() => New_15.Matches(WinText1K).Count;
// Section 5: Pathological
private static readonly Regex Old_P1 = new(@"$", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_P1 = new(@"$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "P1_BareEOL_Old")]
public int BareEOL_Old() => Old_P1.Matches(WinText1K).Count;
[Benchmark(Description = "P1_BareEOL_New")]
public int BareEOL_New() => New_P1.Matches(WinText1K).Count;
private static readonly Regex Old_P2 = new(@"^", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex New_P2 = new(@"^", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
[Benchmark(Description = "P2_BareBOL_Old")]
public int BareBOL_Old() => Old_P2.Matches(WinText1K).Count;
[Benchmark(Description = "P2_BareBOL_New")]
public int BareBOL_New() => New_P2.Matches(WinText1K).Count;
private static readonly Regex Old_P3 = new(@"\w+\r?\Z", RegexOptions.Compiled);
private static readonly Regex New_P3 = new(@"\w+\Z", RegexOptions.Compiled | AnyNewLine);
[Benchmark(Description = "P3_EndZ_Old")]
public bool EndZ_Old() => Old_P3.IsMatch(WinText10K);
[Benchmark(Description = "P3_EndZ_New")]
public bool EndZ_New() => New_P3.IsMatch(WinText10K);
}
Some real test failures...
AnyNewLine (0x800) is not a valid RegexOptions value on .NET Framework, so the Regex constructor throws ArgumentOutOfRangeException. Add [SkipOnTargetFramework] to all 10 AnyNewLine test methods. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Motivation
.NET's Regex class hardcodes \n as the only newline character. With RegexOptions.Multiline, $ matches before \n but not before \r, \r\n, or Unicode line breaks. This is "by far one of the biggest gotchas" with System.Text.RegularExpressions:

Users are forced into fragile workarounds like \r?$ or (\r\n|\n) to handle mixed line endings. Real-world NuGet packages show how common this is -- from the real-world regex patterns dataset:
- (\r\n|\n) (18,474 packages) -- CSV parser manually matching both line endings
- \r?\n in PEM key parsing (1,964 packages) -- \r?\n sprinkled throughout with Multiline
- $(\r?\n)? in assembly attribute matching (2,108 packages) -- using Multiline with manual newline handling
- [\r\n]+ (2,422 packages) -- matching any newline character

These workarounds are error-prone, don't compose well with ^ and $ anchors, and miss Unicode newlines (\u0085, \u2028, \u2029).
Summary
Implements RegexOptions.AnyNewLine (api-approved), which makes $, ^, \Z, and . recognize all Unicode line boundaries: \r, \r\n, \n, \u0085 (NEL), \u2028 (LS), \u2029 (PS) -- consistent with Unicode TR18 RL1.6 and PCRE2's (*ANY) behavior.

With AnyNewLine, the example above just works:
Approach: Parser Lowering
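One rough way to see what lowering buys is to write the replacement sub-patterns by hand on a runtime that does not have the new option. The sketch below assumes the Multiline forms of ^ and $ and the widened . class quoted in this description; the class and method names (AnyNewLineSketch, LinesToday, LinesLowered) are hypothetical and are not part of the PR. It only emulates the result with ordinary patterns and does not reproduce the parser code.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AnyNewLineSketch
{
    // Hand-written copies of the Multiline lowerings quoted in this description.
    // Assumption: they stand in for what the parser emits; this is not the parser itself.
    private const string Bol = @"(?:(?<=\A|\r\n|\n|[\u0085\u2028\u2029])|(?<=\r)(?!\n))";
    private const string Eol = @"(?:(?=\r\n|\r|[\u0085\u2028\u2029]|\z)|(?<!\r)(?=\n))";
    private const string Dot = @"[^\r\n\u0085\u2028\u2029]";

    // Current behavior: ^.+$ with Multiline, where only \n counts as a newline.
    public static string[] LinesToday(string input) =>
        Collect(new Regex(@"^.+$", RegexOptions.Multiline), input);

    // Emulated AnyNewLine behavior: the same pattern with ^, . and $ swapped by hand.
    public static string[] LinesLowered(string input) =>
        Collect(new Regex(Bol + Dot + "+" + Eol), input);

    // Returns the text of every match, in order.
    private static string[] Collect(Regex regex, string input) =>
        regex.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();
}
```

For "one\r\ntwo\r\nthree", LinesToday returns "one\r", "two\r", "three" (the stray \r is the gotcha), while LinesLowered returns "one", "two", "three". Mixed input such as "a\rb\nc\u2028d" yields "a", "b", "c", "d" from LinesLowered.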
All logic lives in RegexParser.cs -- no changes to the interpreter, compiler, or source generator engines. Each affected construct is lowered into an equivalent RegexNode sub-tree:
- $ (no Multiline) / \Z: (?=\r\n\z|\r?\z)|(?<!\r)(?=\n\z)|(?=[\u0085\u2028\u2029]\z)
- $ (Multiline): (?=\r\n|\r|[\u0085\u2028\u2029]|\z)|(?<!\r)(?=\n)
- ^ (Multiline): (?<=\A|\r\n|\n|[\u0085\u2028\u2029])|(?<=\r)(?!\n)
- .: [^\r\n\u0085\u2028\u2029] (but Singleline takes precedence)

Key design choices:
- \r\n is atomic: $ never matches between \r and \n. This is enforced with lookbehind/lookahead guards.
- Singleline takes precedence: . with Singleline | AnyNewLine matches everything (including newlines), consistent with Singleline's documented behavior.
- \A and \z are unaffected: absolute start/end anchors don't change.
- NonBacktracking and ECMAScript: combining either with AnyNewLine throws ArgumentOutOfRangeException (lowered patterns use lookaround).
- The lowering is gated on the AnyNewLine flag, so patterns that don't use it take the same code paths as before. The only new cost is a flag check ((_options & RegexOptions.AnyNewLine) != 0) in the parser for $, ^, \Z, and ., which is negligible.

Out of scope: \R
Unicode TR18 RL1.6 also recommends a meta-character \R for matching any newline sequence (consuming the characters), equivalent to (?:\r\n|[\n\v\f\r\u0085\u2028\u2029]). This is distinct from what AnyNewLine does: AnyNewLine modifies the behavior of existing zero-width anchors (^, $, \Z) and the character class ., while \R would be a new consuming pattern element. Adding \R could be done independently as a separate feature.
Changes
Production code
- RegexOptions.cs -- add AnyNewLine = 0x0800
- RegexParser.cs -- lowering methods AnyNewLineEndZNode(), AnyNewLineEolNode(), AnyNewLineBolNode(), plus . handling
- RegexCharClass.cs -- add NotNewLineOrCarriageReturnClass constant
- Regex.cs / RegexCompilationInfo.cs -- validation
Tests
- Anchors ($, ^, \Z), RightToLeft, Singleline, Multiline, Replace, Split, Count, EnumerateMatches, NonBacktracking rejection, edge cases (adjacent newlines, empty lines, all-newline strings), and PCRE2-inspired scenarios

Fixes #25598
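
The "\r\n is atomic" rule can be checked on a current runtime by comparing match offsets of the built-in \Z with a hand-written copy of the lowering this description gives for \Z (and for $ without Multiline). This is a sketch under that assumption; EndZSketch, EndZToday, and EndZLowered are hypothetical names, not code from the PR.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class EndZSketch
{
    // Hand-written copy of the lowering quoted for \Z / non-Multiline $.
    // Assumption: it stands in for the sub-tree the parser would build.
    private const string LoweredEndZ =
        @"(?:(?=\r\n\z|\r?\z)|(?<!\r)(?=\n\z)|(?=[\u0085\u2028\u2029]\z))";

    // Offsets where the built-in \Z matches today (only \n is treated as a newline).
    public static int[] EndZToday(string input) => Offsets(@"\Z", input);

    // Offsets where the hand-lowered form matches.
    public static int[] EndZLowered(string input) => Offsets(LoweredEndZ, input);

    // Returns the start index of every (zero-width) match, in order.
    private static int[] Offsets(string pattern, string input) =>
        Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Index).ToArray();
}
```

For "abc\r\n", EndZToday returns offsets 4 and 5 (before the final \n and at the end), whereas EndZLowered returns 3 and 5: it matches before the whole \r\n pair and at the end, but never at offset 4 between \r and \n. For "abc\n" both return 3 and 4, so \n-only input behaves the same either way.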