Skip to content

Review normalizer changes#7528

Open
hsivonen wants to merge 22 commits intounicode-org:mainfrom
hsivonen:normreview
Open

Review normalizer changes#7528
hsivonen wants to merge 22 commits intounicode-org:mainfrom
hsivonen:normreview

Conversation

@hsivonen
Copy link
Member

For #7526.

This PR is meant for review, not for landing. See #7526 for the landing mechanics. CI won't pass, since this has local path references to utf8_iter and utf16_iter.

This PR is missing the code for constructing with old-style canonical composition data.

This PR requires the decomposition data to use a fast trie, unless the serde feature is used. What does this mean for struct versioning?

The numeric passthrough bounds in the data are pretty useless. Since all characters below U+0300 are assigned, their normalization characteristics won't change per Unicode policies, so we might as well not have passthrough bounds in data and could set a single flag in the constructors to remember if we have K or non-K form, and then choose hard-code bounds based on that flag.

The bounds are now only used for slice-mode UTF-16 and in the str case in the composing normalizer. The latter now makes it a safety issue that the bounds must be in the two-byte UTF-8 range. UTS 46 only uses the iterator mode, so it can have a lower bound (that's useless for optimization).

@Manishearth
Copy link
Member

Manishearth commented Jan 30, 2026

Do you have links to the discussion where we first worked out the architecture for this?

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This overall appears to be the correct approach. I'll probably do a more careful unsafe review when you have a PR that just does the collections changes.

I think this is already doing the thing I wanted it to do around the two data payloads when we had the discussion; I'm not sure if there's anything left for me to help with

#[derive(Debug)]
pub(crate) enum CanonicalCompositionsPayload {
Current(DataPayload<NormalizerNfcV2>),
#[cfg(feature = "serde")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

observation: this seems like a good way to do this

#[derive(Debug, Copy, Clone)]
pub(crate) enum CanonicalCompositionsBorrowed<'data> {
Current(&'data CanonicalCompositionsNew<'data>),
#[cfg(feature = "serde")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

observation: same

}

#[inline(always)]
unsafe fn get_bit_prefix_suffix_assuming_fast_index(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: document invariant

pub unsafe fn get7(&self, ascii: u8) -> T {
debug_assert!(ascii < 128);
debug_assert!((ascii as usize) < self.data.len());
// SAFETY: Length of `self.data` checked in the constructor.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ideally, this should be done with data documented as having an invariant, and anything that sets or mutates data says "data invariant upheld: length >= 128", so here you can just reference the field invariant

@hsivonen
Copy link
Member Author

The previously-discussed plan was that the requirement for the NFD and NFKD tries to be always fast unless the serde feature is used would require incrementing the marker version to V2.

I think #7665 can be fixed by taking an currently always-set bit and making it not set signify the possibily to stay on the fast patch across a single so marked non-starter. This would be compatible with past data, just the past data would not get the fastest path.

So I think we should bundle this landing and a fix for #7665 into the same ICU4X release to make the V2 aspect for the decomposition trie cover both changes.

@Manishearth
Copy link
Member

Seems fine. What is the path to actually turning this into landable code? From my end, the architecture seems fine, I just need smaller PRs that I can review for safety/etc.

@hsivonen
Copy link
Member Author

Seems fine. What is the path to actually turning this into landable code? From my end, the architecture seems fine, I just need smaller PRs that I can review for safety/etc.

Let's get #7768 through review and onto crates.io, and then the normalizer and collator PRs become smaller and also runnable in CI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants