CodePointTrie support for normalizer and collator perf improvements by hsivonen · Pull Request #7768 · unicode-org/icu4x

hsivonen · 2026-03-11T16:35:15Z

Split out of #7526 and #7600. The code here needs to be published to crates.io before those changes can land, because the utf8_iter and utf16_iter crates need to depend on a version of icu_collections that includes this code.

Changelog

Make icu_collections not depend on utf8_iter to avoid a circular dependency.
Add CodePointTrie getters for fusing lookup into iterating over text: getters by Latin1, ASCII, two-byte UTF-8, and three-byte UTF-8.
Serde and databake support for typed CodePointTries.
A WithTrie trait for obtaining the trie referenced by an iterator that iterates over char and TrieValue pairs.
Iterators over char and TrieValue pairs for Latin1, str, and a delegate iterator over char.
A variant of the str iterator that returns TrieValue::default() for ASCII instead of reading from the trie.

Manishearth

Will take some time to properly review

Manishearth · 2026-03-11T16:53:15Z

components/collections/src/codepointtrie/cptrie.rs

+    }
+
+    #[inline(always)]
+    unsafe fn get_bit_prefix_suffix_assuming_fast_index(


issue: please document safety invariants (even if it's obvious from the function name)

edit: it's not; because there are invariants on bit_prefix and bit_suffix.

We should document this and ensure it's upheld by the callers.

Manishearth · 2026-03-11T16:56:18Z

components/collections/src/codepointtrie/cptrie.rs

+    pub unsafe fn get7(&self, ascii: u8) -> T {
+        debug_assert!(ascii < 128);
+        debug_assert!((ascii as usize) < self.data.len());
+        // SAFETY: Length of `self.data` checked in the constructor.


issue: We may add more ctors in the future. This should reference the safety invariant on data: just say // SAFETY: Allowed by data's safety invariant, update data's invariant to require that it has at least 128 elements, and update the constructor validation to say something like // data safety invariant upheld here
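A minimal sketch of the pattern requested above, using a hypothetical `AsciiTable` type rather than the real CodePointTrie: the length requirement lives on the field as a documented safety invariant, each constructor notes where it upholds it, and the unchecked access cites the invariant instead of a particular constructor.

```rust
/// Toy lookup table for illustration only; NOT the real `CodePointTrie`.
pub struct AsciiTable<T> {
    /// # Safety invariant
    ///
    /// `data.len() >= 128`. Every constructor must uphold this.
    data: Vec<T>,
}

impl<T: Copy> AsciiTable<T> {
    /// Returns `None` if `data` is too short for unchecked ASCII access.
    pub fn try_new(data: Vec<T>) -> Option<Self> {
        if data.len() < 128 {
            return None;
        }
        // `data` safety invariant upheld here: length checked above.
        Some(Self { data })
    }

    /// # Safety
    ///
    /// `ascii` must be less than 128.
    pub unsafe fn get7(&self, ascii: u8) -> T {
        debug_assert!(ascii < 128);
        // SAFETY: `ascii < 128` by this function's safety contract, and
        // `self.data.len() >= 128` by the safety invariant on `data`.
        unsafe { *self.data.get_unchecked(ascii as usize) }
    }
}

fn main() {
    let data: Vec<u32> = (0u32..128).collect();
    let table = AsciiTable::try_new(data).expect("128 entries");
    // SAFETY: b'A' is 65, which is below 128.
    let value = unsafe { table.get7(b'A') };
    println!("{value}");
}
```

With this shape, adding a second constructor only requires repeating the "invariant upheld here" check; the SAFETY comment in `get7` stays valid unchanged.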

Manishearth · 2026-03-11T16:56:47Z

components/collections/src/codepointtrie/cptrie.rs

+        debug_assert!(low_six <= 0b111_111); // Safety invariant.
+        debug_assert!(high_five <= 0b11_111); // Safety invariant.
+        debug_assert!(high_five > 0b1); // Non-shortest form; not safety invariant.
+                                        // SAFETY: The highest character representable as a two-byte


nit: maybe introduce a newline so that this formats better

Manishearth · 2026-03-12T01:53:03Z

components/collections/src/codepointtrie/iter.rs

@@ -0,0 +1,1240 @@
+// This file is part of ICU4X. For terms of use, please see the file


This file is a lot of unsafe code and I'm not convinced it is justified. Can we reduce the amount of unsafe code in this PR by writing these iterators to wrap CharIndices? There will still be unsafe in this file, but it will be around CPT invariants rather than also around UTF8 decoding.

Separately we can try and justify additional unsafe using benchmarks if needed, and that would be a nice scoped PR that can be easily reviewed and benchmarked.
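As a rough illustration of the CharIndices-based alternative suggested here, the sketch below delegates all UTF-8 decoding to the standard library and only attaches a value to each char. The type name `CharsWithValues` and the `lookup` closure (standing in for a trie lookup by code point) are hypothetical and not part of the PR.

```rust
use std::str::CharIndices;

/// Yields `(char, value)` pairs. UTF-8 decoding is done entirely by
/// `CharIndices`, so this wrapper contains no unsafe code.
pub struct CharsWithValues<'a, F> {
    inner: CharIndices<'a>,
    /// Hypothetical stand-in for a trie lookup by code point.
    lookup: F,
}

impl<'a, F, V> CharsWithValues<'a, F>
where
    F: Fn(char) -> V,
{
    pub fn new(text: &'a str, lookup: F) -> Self {
        Self {
            inner: text.char_indices(),
            lookup,
        }
    }

    /// The part of the input that has not been consumed yet.
    pub fn as_str(&self) -> &'a str {
        self.inner.as_str()
    }
}

impl<'a, F, V> Iterator for CharsWithValues<'a, F>
where
    F: Fn(char) -> V,
{
    type Item = (char, V);

    fn next(&mut self) -> Option<Self::Item> {
        // The byte offset is available here but unused in this sketch.
        let (_offset, c) = self.inner.next()?;
        Some((c, (self.lookup)(c)))
    }
}

fn main() {
    let mut iter = CharsWithValues::new("aé", |c| c as u32);
    while let Some((c, value)) = iter.next() {
        println!("{c}: {value:#x}, rest = {:?}", iter.as_str());
    }
}
```

The trade-off is that each char is decoded to a scalar value first and then split again for the lookup, which is the extra work the fused getters in this PR aim to avoid.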

Fusing the trie lookup into UTF-8 decoding is the key point of this changeset: CPT in ICU4C has been designed so that its bit split lines up with the bits in the last UTF-8 trail byte, and we've been using it pessimally in ICU4X.

I guess I will need to port the UTF-16 NFC to NFD throughput benchmark to str and then get exact numbers for the effect here.

Hmm, I see. I feel like using CharIndices (especially with its offset function) you might still be able to get the same benefits, but I understand if that was the point of this change.

In that case we should probably have more careful tracking of the invariant on the contained iterator whenever it is advanced.
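The bit alignment discussed in this thread can be sketched with a toy two-stage table. This is not the ICU4X data layout: `ToyTrie`, its fields, and its contents are invented for illustration, and it uses checked indexing throughout. It only shows that, with a 6-bit split, the payload of a two-byte UTF-8 lead byte is already the block number and the payload of the trail byte is already the offset within the block.

```rust
/// Number of low code point bits that select an entry within a block.
/// Six bits is also the payload width of a UTF-8 trail byte.
const SHIFT: u32 = 6;
const BLOCK_LEN: usize = 1 << SHIFT;

/// Toy two-stage table covering U+0000..=U+07FF; NOT the ICU4X layout.
struct ToyTrie {
    /// One entry per 64-code-point block: start offset into `data`.
    index: Vec<u16>,
    data: Vec<u8>,
}

impl ToyTrie {
    fn new() -> Self {
        let blocks = 0x800 / BLOCK_LEN; // 32 blocks
        let index = (0..blocks).map(|b| (b * BLOCK_LEN) as u16).collect();
        // Arbitrary values for the demo: the low byte of each code point.
        let data = (0..0x800u32).map(|cp| cp as u8).collect();
        Self { index, data }
    }

    /// Conventional path: take a decoded code point and split it again.
    fn get(&self, cp: u32) -> u8 {
        let block = self.index[(cp >> SHIFT) as usize] as usize;
        self.data[block + (cp & 0x3F) as usize]
    }

    /// Fused path: use the UTF-8 payload bits directly, without ever
    /// assembling the code point.
    fn get_two_byte(&self, lead: u8, trail: u8) -> u8 {
        let high_five = (lead & 0x1F) as usize;
        let low_six = (trail & 0x3F) as usize;
        self.data[self.index[high_five] as usize + low_six]
    }
}

fn main() {
    let trie = ToyTrie::new();
    // Check every two-byte UTF-8 character against the conventional path.
    for cp in 0x80u32..0x800 {
        let c = char::from_u32(cp).expect("no surrogates below U+0800");
        let mut buf = [0u8; 4];
        let bytes = c.encode_utf8(&mut buf).as_bytes();
        assert_eq!(trie.get(cp), trie.get_two_byte(bytes[0], bytes[1]));
    }
    println!("fused and conventional lookups agree");
}
```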


Manishearth · 2026-03-12T02:00:47Z

components/collections/src/codepointtrie/cptrie.rs

+    pub fn get8(&self, latin1: u8) -> T {
+        let code_point = u32::from(latin1);
+        debug_assert!(code_point <= SMALL_TYPE_FAST_INDEXING_MAX);
+        // SAFETY: `u8` is always below `SMALL_TYPE_FAST_INDEXING_MAX` and,


suggestion (non blocking): worth documenting on those two constants that their precise values are extremely safety relevant and relied upon by many different checks in this file
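One possible way to act on this suggestion, sketched under assumptions: the constant below reuses the name from the diff, but its value (0xFFF) and the helper `latin1_is_fast` are assumptions made for this example. A compile-time assertion next to the documentation makes the build fail if someone changes the value in a way that breaks the `u8` reasoning.

```rust
/// Highest code point served by the fast path of a small-type trie.
/// The value used here is an assumption made for this sketch.
///
/// SAFETY-RELEVANT: unchecked indexing elsewhere relies on this exact
/// value. Do not change it without auditing every `// SAFETY:` comment
/// that mentions it.
const SMALL_TYPE_FAST_INDEXING_MAX: u32 = 0xFFF;

// Fails to compile if the constant is ever lowered below what a `u8`
// (Latin1) code unit can reach.
const _: () = assert!(u8::MAX as u32 <= SMALL_TYPE_FAST_INDEXING_MAX);

/// Hypothetical helper: true if a Latin1 code unit is covered by the
/// fast path. Given the assertion above, this is always true.
fn latin1_is_fast(latin1: u8) -> bool {
    u32::from(latin1) <= SMALL_TYPE_FAST_INDEXING_MAX
}

fn main() {
    println!("{}", latin1_is_fast(0xFF));
}
```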

Manishearth · 2026-03-12T02:04:52Z

components/collections/src/codepointtrie/cptrie.rs

+        debug_assert!(low_six <= 0b111_111); // Safety invariant.
+        debug_assert!(high_five <= 0b11_111); // Safety invariant.
+        debug_assert!(high_five > 0b1); // Non-shortest form; not safety invariant.
+                                        // SAFETY: The highest character representable as a two-byte


issue: the safety invariants on this function are not currently documented, but once they are, this comment should be in terms of those invariants

Added a line break.

Manishearth · 2026-03-12T02:05:43Z

components/collections/src/codepointtrie/cptrie.rs

+    ///
+    /// `low_six` must not have bit positions other than the lowest 6 set to 1.
+    ///
+    /// # Intended Invariant


question: what is this? Is this a non-safety-relevant invariant?

Perhaps explicitly say it is non-safety relevant.

Manishearth · 2026-03-12T02:06:09Z

components/collections/src/codepointtrie/cptrie.rs

+    /// sequence.
+    #[inline(always)]
+    #[allow(clippy::unusual_byte_groupings)]
+    pub unsafe fn get_utf8_three_byte(&self, high_ten: u32, low_six: u32) -> T {


review progress note: this is as far as I've gotten with the thorough review, haven't yet reviewed this body.

Manishearth · 2026-03-12T02:06:40Z

components/collections/src/codepointtrie/cptrie.rs

+/// Method naming intentionally differs from the method naming on
+/// those types in order to disambiguate.
+#[allow(private_bounds)] // Permit sealing
+pub trait AbstractCodePointTrie<'trie, T: TrieValue>: Seal {


nit: I think we call it Sealed in ICU4X, but it doesn't really matter
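For readers unfamiliar with sealing, a generic sketch of the idiom follows. It uses the common form with a public `Sealed` trait inside a private module, which may differ from how this PR spells it; the `Lookup` trait and `ConstantTrie` type are hypothetical.

```rust
mod sealed {
    /// Public trait in a private module: code outside this crate cannot
    /// name it, so it cannot implement any trait that requires it.
    pub trait Sealed {}
}

/// Hypothetical abstraction over trie-like types. Only types in this
/// crate can implement it, because of the `Sealed` supertrait.
pub trait Lookup: sealed::Sealed {
    fn lookup(&self, code_point: u32) -> u8;
}

/// Hypothetical implementor that returns the same value everywhere.
pub struct ConstantTrie(pub u8);

impl sealed::Sealed for ConstantTrie {}

impl Lookup for ConstantTrie {
    fn lookup(&self, _code_point: u32) -> u8 {
        self.0
    }
}

fn main() {
    let trie = ConstantTrie(7);
    println!("{}", trie.lookup(0x41));
}
```

Sealing lets the crate add methods to the trait later without that being a breaking change for downstream implementors, since there cannot be any.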


Co-authored-by: Robert Bastian <4706271+robertbastian@users.noreply.github.com>

CodePointTrie support for normalizer and collator perf improvements

576bf49

hsivonen requested a review from echeran as a code owner March 11, 2026 16:35

hsivonen requested a review from Manishearth March 11, 2026 16:35

hsivonen mentioned this pull request Mar 11, 2026

Review normalizer changes #7528

Open

Manishearth reviewed Mar 11, 2026


Manishearth reviewed Mar 12, 2026


robertbastian reviewed Mar 12, 2026


hsivonen and others added 4 commits March 17, 2026 11:14

Avoid validating whole string in inversion list lookup

dc5c98e

Co-authored-by: Robert Bastian <4706271+robertbastian@users.noreply.github.com>

Adjust comments around fast ASCII asccess invariant

1df7592

Better formatting

35abb96

Rework UTF-8 iteration safety remarks

77f585b

hsivonen requested a review from a team as a code owner March 17, 2026 11:00
