Conversation
* AbstractCodePointTrie allows code to be generic over both typed and untyped tries. * UTF-8 accessors allow optimal access from within a UTF-8 decoder. * Latin1 accessor allows optimal access with Latin1.
|
Do you have links to the discussion where we first worked out the architecture for this? |
Manishearth
left a comment
There was a problem hiding this comment.
This overall appears to be the correct approach. I'll probably do a more careful unsafe review when you have a PR that just does the collections changes.
I think this is already doing the thing I wanted it to do around the two data payloads when we had the discussion; I'm not sure if there's anything left for me to help with
| #[derive(Debug)] | ||
| pub(crate) enum CanonicalCompositionsPayload { | ||
| Current(DataPayload<NormalizerNfcV2>), | ||
| #[cfg(feature = "serde")] |
There was a problem hiding this comment.
observation: this seems like a good way to do this
| #[derive(Debug, Copy, Clone)] | ||
| pub(crate) enum CanonicalCompositionsBorrowed<'data> { | ||
| Current(&'data CanonicalCompositionsNew<'data>), | ||
| #[cfg(feature = "serde")] |
| } | ||
|
|
||
| #[inline(always)] | ||
| unsafe fn get_bit_prefix_suffix_assuming_fast_index( |
| pub unsafe fn get7(&self, ascii: u8) -> T { | ||
| debug_assert!(ascii < 128); | ||
| debug_assert!((ascii as usize) < self.data.len()); | ||
| // SAFETY: Length of `self.data` checked in the constructor. |
There was a problem hiding this comment.
nit: ideally, this should be done with data documented as having an invariant, and anything that sets or mutates data says "data invariant upheld: length >= 128", so here you can just reference the field invariant
|
The previously-discussed plan was that the requirement for the NFD and NFKD tries to be always fast unless the I think #7665 can be fixed by taking an currently always-set bit and making it not set signify the possibily to stay on the fast patch across a single so marked non-starter. This would be compatible with past data, just the past data would not get the fastest path. So I think we should bundle this landing and a fix for #7665 into the same ICU4X release to make the V2 aspect for the decomposition trie cover both changes. |
|
Seems fine. What is the path to actually turning this into landable code? From my end, the architecture seems fine, I just need smaller PRs that I can review for safety/etc. |
Let's get #7768 through review and onto crates.io, and then the normalizer and collator PRs become smaller and also runnable in CI. |
For #7526.
This PR is meant for review, not for landing. See #7526 for the landing mechanics. CI won't pass, since this has local path references to
utf8_iterandutf16_iter.This PR is missing the code for constructing with old-style canonical composition data.
This PR requires the decomposition data to use a fast trie, unless the
serdefeature is used. What does this mean for struct versioning?The numeric passthrough bounds in the data are pretty useless. Since all characters below U+0300 are assigned, their normalization characteristics won't change per Unicode policies, so we might as well not have passthrough bounds in data and could set a single flag in the constructors to remember if we have K or non-K form, and then choose hard-code bounds based on that flag.
The bounds are now only used for slice-mode UTF-16 and in the
strcase in the composing normalizer. The latter now makes it a safety issue that the bounds must be in the two-byte UTF-8 range. UTS 46 only uses the iterator mode, so it can have a lower bound (that's useless for optimization).