Review normalizer changes by hsivonen · Pull Request #7528 · unicode-org/icu4x

hsivonen · 2026-01-30T11:19:35Z

This PR is meant for review, not for landing. See #7526 for the landing mechanics. CI won't pass, since this has local path references to utf8_iter and utf16_iter.

This PR is missing the code for constructing with old-style canonical composition data.

This PR requires the decomposition data to use a fast trie, unless the serde feature is used. What does this mean for struct versioning?

The numeric passthrough bounds in the data are pretty useless. Since all characters below U+0300 are assigned, their normalization characteristics won't change per Unicode policies, so we might as well not have passthrough bounds in data and could set a single flag in the constructors to remember if we have K or non-K form, and then choose hard-coded bounds based on that flag.
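A minimal sketch of what "a single flag plus hard-coded bounds" could look like. The names `FormKind`, `decomposition_passthrough_bound`, and `is_passthrough` are hypothetical and do not come from the PR. The two constants are illustrative: U+00C0 is the first code point with a canonical decomposition and U+00A0 is the first with a compatibility decomposition, but the bounds the PR would actually hard-code are not shown in this thread.

```rust
/// Hypothetical flag a constructor could remember instead of reading
/// numeric bounds from data: compatibility (K) form or not.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FormKind {
    /// NFD / NFC
    Canonical,
    /// NFKD / NFKC
    Compatibility,
}

/// Illustrative hard-coded bound: code points strictly below the returned
/// value are assumed to pass through decomposition unchanged.
pub const fn decomposition_passthrough_bound(kind: FormKind) -> u32 {
    match kind {
        // U+00C0 is the first code point with a canonical decomposition.
        FormKind::Canonical => 0xC0,
        // U+00A0 (NO-BREAK SPACE) is the first code point with a
        // compatibility decomposition.
        FormKind::Compatibility => 0xA0,
    }
}

/// Returns `true` if `c` is below the hard-coded bound for `kind`.
pub fn is_passthrough(kind: FormKind, c: char) -> bool {
    (c as u32) < decomposition_passthrough_bound(kind)
}
```

Because the bound is derived from the flag alone, nothing in the data can push it outside the range the code was written for.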

The bounds are now only used for slice-mode UTF-16 and in the str case in the composing normalizer. The latter now makes it a safety issue that the bounds must be in the two-byte UTF-8 range. UTS 46 only uses the iterator mode, so it can have a lower bound (that's useless for optimization).
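To illustrate why a bound in the two-byte UTF-8 range matters for `str` input, here is a safe-Rust sketch of a byte-level scan. It is not the PR's code: the function name `passthrough_prefix_len` and the assumption that the bound lies in `0x80..=0x7FF` are made up for this example. With such a bound, only ASCII bytes and two-byte sequences ever need to be compared against it, and any three- or four-byte lead byte ends the passthrough run immediately.

```rust
/// Returns the length in bytes of the longest prefix of `s` whose code
/// points are all strictly below `bound`.
///
/// Assumption for this sketch: `bound` lies in the two-byte UTF-8 range
/// (U+0080..=U+07FF). The assertion makes that explicit.
pub fn passthrough_prefix_len(s: &str, bound: u16) -> usize {
    assert!(
        (0x80..=0x7FF).contains(&bound),
        "bound must be in the two-byte UTF-8 range"
    );
    let bytes = s.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        let b = bytes[i];
        if b < 0x80 {
            // ASCII is always below a bound that is at least 0x80.
            i += 1;
        } else if b < 0xE0 {
            // Lead byte of a two-byte sequence. `str` is valid UTF-8, so
            // exactly one continuation byte follows.
            let c = (u16::from(b & 0x1F) << 6) | u16::from(bytes[i + 1] & 0x3F);
            if c >= bound {
                break;
            }
            i += 2;
        } else {
            // Three- and four-byte sequences encode U+0800 and above,
            // which can never be below a two-byte-range bound.
            break;
        }
    }
    i
}
```

This version keeps bounds checks on every index. A variant that drops them would have to rely on the bound's range as a safety invariant, which is the concern raised above.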

* AbstractCodePointTrie allows code to be generic over both typed and untyped tries.
* UTF-8 accessors allow optimal access from within a UTF-8 decoder.
* Latin1 accessor allows optimal access with Latin1.
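As a rough illustration of code that is generic over more than one trie representation, the sketch below uses toy types. `TrieLookup`, `FastToyTrie`, `DynToyTrie`, and `count_nonzero` are hypothetical stand-ins. They are not the real `AbstractCodePointTrie` API, whose methods are not shown in this thread.

```rust
/// Hypothetical stand-in for the abstraction; not the real ICU4X trait.
pub trait TrieLookup {
    /// Looks up the value for a code point (0 means "no data" here).
    fn get32(&self, code_point: u32) -> u8;
}

/// Toy "typed" trie: the representation is fixed at compile time, so a
/// lookup needs no runtime branch on the kind.
pub struct FastToyTrie {
    pub table: Vec<u8>,
}

impl TrieLookup for FastToyTrie {
    fn get32(&self, code_point: u32) -> u8 {
        self.table.get(code_point as usize).copied().unwrap_or(0)
    }
}

/// Toy "untyped" trie: the representation is only known at run time.
pub enum DynToyTrie {
    Fast(FastToyTrie),
    /// Sorted by code point.
    Sparse(Vec<(u32, u8)>),
}

impl TrieLookup for DynToyTrie {
    fn get32(&self, code_point: u32) -> u8 {
        match self {
            DynToyTrie::Fast(t) => t.get32(code_point),
            DynToyTrie::Sparse(pairs) => pairs
                .binary_search_by_key(&code_point, |&(k, _)| k)
                .map(|i| pairs[i].1)
                .unwrap_or(0),
        }
    }
}

/// Generic caller: monomorphized per trie type, so the typed trie gets a
/// branch-free lookup while the untyped one still works.
pub fn count_nonzero<T: TrieLookup>(trie: &T, text: &str) -> usize {
    text.chars().filter(|&c| trie.get32(c as u32) != 0).count()
}
```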

Manishearth · 2026-01-30T15:20:51Z

Do you have links to the discussion where we first worked out the architecture for this?

Manishearth

This overall appears to be the correct approach. I'll probably do a more careful unsafe review when you have a PR that just does the collections changes.

I think this is already doing the thing I wanted it to do around the two data payloads when we had the discussion; I'm not sure if there's anything left for me to help with.

Manishearth · 2026-01-30T21:47:46Z

components/normalizer/src/lib.rs

+#[derive(Debug)]
+pub(crate) enum CanonicalCompositionsPayload {
+    Current(DataPayload<NormalizerNfcV2>),
+    #[cfg(feature = "serde")]


observation: this seems like a good way to do this

Manishearth · 2026-01-30T21:48:07Z

components/normalizer/src/lib.rs

+#[derive(Debug, Copy, Clone)]
+pub(crate) enum CanonicalCompositionsBorrowed<'data> {
+    Current(&'data CanonicalCompositionsNew<'data>),
+    #[cfg(feature = "serde")]


observation: same
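The pattern in the two snippets above, an owned enum and a `Copy` borrowed counterpart with one variant gated on the `serde` feature, can be sketched with toy types as follows. The second variant is cut off in the diff excerpts, so `Legacy`, `NewLayout`, `OldLayout`, and the `compose` method are hypothetical names chosen for illustration only.

```rust
/// Toy stand-in for the new data layout.
#[derive(Debug)]
pub struct NewLayout {
    /// (starter, combining mark, composed character)
    pub pairs: Vec<(char, char, char)>,
}

/// Toy stand-in for an old layout that only exists with `serde` enabled.
#[cfg(feature = "serde")]
#[derive(Debug)]
pub struct OldLayout {
    pub triples: Vec<(char, char, char)>,
}

/// Owned form: holds whichever payload was loaded.
#[derive(Debug)]
pub enum CompositionsOwned {
    Current(NewLayout),
    #[cfg(feature = "serde")]
    Legacy(OldLayout),
}

/// Borrowed form: cheap to copy into hot loops.
#[derive(Debug, Clone, Copy)]
pub enum CompositionsRef<'a> {
    Current(&'a NewLayout),
    #[cfg(feature = "serde")]
    Legacy(&'a OldLayout),
}

impl CompositionsOwned {
    pub fn as_borrowed(&self) -> CompositionsRef<'_> {
        match self {
            Self::Current(n) => CompositionsRef::Current(n),
            #[cfg(feature = "serde")]
            Self::Legacy(o) => CompositionsRef::Legacy(o),
        }
    }
}

impl CompositionsRef<'_> {
    /// Looks up the composition of `starter` followed by `mark`.
    pub fn compose(self, starter: char, mark: char) -> Option<char> {
        let table = match self {
            Self::Current(n) => &n.pairs,
            #[cfg(feature = "serde")]
            Self::Legacy(o) => &o.triples,
        };
        table
            .iter()
            .find(|&&(s, m, _)| s == starter && m == mark)
            .map(|&(_, _, c)| c)
    }
}
```

Without the feature, each enum has a single variant, so the `match` has no extra arm to compile and the old layout adds no code to the build.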

Manishearth · 2026-01-30T21:52:56Z

components/collections/src/codepointtrie/cptrie.rs

+    }
+
+    #[inline(always)]
+    unsafe fn get_bit_prefix_suffix_assuming_fast_index(


nit: document invariant

Manishearth · 2026-01-30T21:55:48Z

components/collections/src/codepointtrie/cptrie.rs

+    pub unsafe fn get7(&self, ascii: u8) -> T {
+        debug_assert!(ascii < 128);
+        debug_assert!((ascii as usize) < self.data.len());
+        // SAFETY: Length of `self.data` checked in the constructor.


nit: ideally, this should be done with data documented as having an invariant, and anything that sets or mutates data says "data invariant upheld: length >= 128", so here you can just reference the field invariant
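A minimal sketch of the documentation style suggested here, using a toy type rather than the real trie. `AsciiTable`, `try_new`, and `get` are hypothetical. Only the shape of `get7` follows the diff excerpt above.

```rust
/// Toy table with a documented field invariant.
pub struct AsciiTable {
    /// INVARIANT: `data.len() >= 128`.
    data: Box<[u8]>,
}

impl AsciiTable {
    /// Returns `None` if `data` is too short to satisfy the invariant.
    pub fn try_new(data: Box<[u8]>) -> Option<Self> {
        if data.len() < 128 {
            return None;
        }
        // Data invariant upheld: length >= 128 was checked just above.
        Some(Self { data })
    }

    /// # Safety
    ///
    /// `ascii` must be less than 128.
    #[inline(always)]
    pub unsafe fn get7(&self, ascii: u8) -> u8 {
        debug_assert!(ascii < 128);
        // SAFETY: `ascii < 128` by the caller's contract, and
        // `self.data.len() >= 128` by the invariant on the `data` field.
        unsafe { *self.data.get_unchecked(ascii as usize) }
    }

    /// Safe wrapper that checks the precondition of `get7`.
    pub fn get(&self, byte: u8) -> Option<u8> {
        if byte < 128 {
            // SAFETY: `byte < 128` was checked just above.
            Some(unsafe { self.get7(byte) })
        } else {
            None
        }
    }
}
```

With the invariant stated on the field, every `SAFETY` comment can cite it in one line, and a reviewer only has to audit the places that construct or mutate `data`.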

hsivonen · 2026-02-03T08:49:05Z

The changes here address these open issues at least:

hsivonen · 2026-02-25T08:06:41Z

The previously discussed plan was that requiring the NFD and NFKD tries to always be fast, unless the serde feature is used, would mean incrementing the marker version to V2.

I think #7665 can be fixed by taking a currently always-set bit and making it not being set signify the possibility of staying on the fast path across a single so-marked non-starter. This would be compatible with past data; the past data just would not get the fastest path.
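The backward-compatibility argument can be sketched as a single bit test. The bit position and the names `LEGACY_ALWAYS_SET` and `may_stay_on_fast_path` are hypothetical, since the thread does not say which bit would be repurposed.

```rust
/// Hypothetical position of the bit that all existing data sets.
pub const LEGACY_ALWAYS_SET: u32 = 1 << 31;

/// New meaning: a cleared bit marks a non-starter across which the
/// normalizer may stay on the fast path.
///
/// Old data always has the bit set, so this returns `false` for every old
/// value and the normalizer falls back to the existing slower path.
pub fn may_stay_on_fast_path(trie_value: u32) -> bool {
    (trie_value & LEGACY_ALWAYS_SET) == 0
}
```

Old data therefore stays correct and only misses the optimization, while newly generated data opts in by clearing the bit on the relevant non-starters.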

So I think we should bundle this landing and a fix for #7665 into the same ICU4X release to make the V2 aspect for the decomposition trie cover both changes.

Manishearth · 2026-02-25T18:57:21Z

Seems fine. What is the path to actually turning this into landable code? From my end, the architecture seems fine, I just need smaller PRs that I can review for safety/etc.

hsivonen · 2026-03-11T16:36:56Z

> Seems fine. What is the path to actually turning this into landable code? From my end, the architecture seems fine, I just need smaller PRs that I can review for safety/etc.

Let's get #7768 through review and onto crates.io, and then the normalizer and collator PRs become smaller and also runnable in CI.

hsivonen added 20 commits January 22, 2026 12:01

Introduce AbstractCodePointTrie, Latin1 getter, and UTF-8 getters

8fbb1f1

* AbstractCodePointTrie allows code to be generic over both typed and untyped tries.
* UTF-8 accessors allow optimal access from within a UTF-8 decoder.
* Latin1 accessor allows optimal access with Latin1.

Decouple UAX 15 and UTS 46 trie types

73356ab

Add iterators that also do trie lookups

9a4983a

Use trie-aware iterators in the normalizer

ed51573

Use fused trie lookup and UTF decoding in the collator

da07c72

Add functions for normalizing Latin1 to UTF-16

da4183c

Prepare for Gecko

9db4973

Implement new data layout for canonical composition data

db21369

Optimize NFD

7e7b618

Optimize Latin1

67c7783

Stay on NFD fast track for single combining mark

2e2611c

Rework Latin1 norm

a481c01

Likely/unlikely now used in non-UTF-16

5063f39

Merge remote-tracking branch 'origin/latin1chunk' into nfdsinglemark

e291315

Tweak passthrough bounds

9b6b8ce

Prepare for Gecko landing

6110491

if outside loop

3480e1e

Fix clippy lints

e6ecb7c

Removed stale conditional compliation

be60b32

Merge branch 'main' into normreview

2e9c82c

hsivonen requested review from a team, Manishearth, echeran, robertbastian and sffc as code owners January 30, 2026 11:19

Manishearth reviewed Jan 30, 2026

hsivonen changed the title ~~Normreview~~ Review normalizer changes Feb 3, 2026

This was referenced Feb 3, 2026

Make NFD to NFD able to stay on the fast path across more than one combining mark #7555

Open

Implement Hangul NFD to NFC fast path #7516

Open

This was referenced Feb 3, 2026

Figure out how Hangul NFD to NFC normalization can be made faster #6860

Open

Make the collator a lot faster #7600

Open

hsivonen added 2 commits February 10, 2026 15:00

Merge branch 'main' into normreview

4cc1d79

cargo fmt

6b25873

hsivonen mentioned this pull request Feb 18, 2026

Make the composing normalizer stay on the fast path across virama or nukta between starters #7665

Open

Review normalizer changes #7528
hsivonen wants to merge 22 commits into unicode-org:main from hsivonen:normreview


2 participants
