
WIP - use sorted tokens for cache locality in StageAttributeData #859

Draft
KarthikSubbarao wants to merge 27 commits into valkey-io:main from KarthikSubbarao:page

Conversation


@KarthikSubbarao KarthikSubbarao commented Mar 6, 2026

In this PR, we change the Tokenize function's signature so that it returns a vector of Token structs instead of a std::vector<std::string>.

This vector is also sorted by each token's text content at the end of the Tokenize function. The benefit is that the caller of this function, StageAttributeData, sees higher cache locality.
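As a minimal sketch of that shape (the Token fields and the whitespace-splitting stand-in below are illustrative assumptions, not the PR's actual code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical Token struct; the PR's real fields may differ.
struct Token {
  std::string text;
  uint32_t position;  // ordinal position of the token within the field
};

// Whitespace tokenizer standing in for the real Tokenize. The key point is
// the final sort by text content, so identical tokens end up adjacent.
std::vector<Token> Tokenize(const std::string& field) {
  std::vector<Token> tokens;
  std::istringstream in(field);
  std::string word;
  for (uint32_t pos = 0; in >> word; ++pos) {
    tokens.push_back({word, pos});
  }
  std::sort(tokens.begin(), tokens.end(),
            [](const Token& a, const Token& b) { return a.text < b.text; });
  return tokens;
}
```

Because the sort key is the text, every run of equal tokens is contiguous in the returned vector, which is what the consumer exploits below.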

The StageAttributeData function previously looped over a vector of strings in which each element could be a different token, performing a find on a hash map for each element to obtain its PositionMap. I believe these random jumps on every element of the token vector are the cause of the large number of page faults in ingestion perf test scenario 12A (and other similar scenarios). By having identical tokens grouped together and by holding the current token's PositionMap in StageAttributeData, I suspect the page faults will reduce significantly.
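A sketch of that grouped-iteration idea (the type names and position bookkeeping here are illustrative assumptions, not the module's real API):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the real PositionMap type: here, simply the
// list of indices at which a token occurs.
using PositionMap = std::vector<size_t>;

// With tokens sorted by text, identical tokens are adjacent, so we do one
// hash-map lookup per *distinct* token instead of one per occurrence, and
// keep working on the same PositionMap while the run of equal tokens lasts.
void StageAttributeData(const std::vector<std::string>& sorted_tokens,
                        std::unordered_map<std::string, PositionMap>& index) {
  PositionMap* current = nullptr;
  const std::string* current_text = nullptr;
  for (size_t i = 0; i < sorted_tokens.size(); ++i) {
    const std::string& tok = sorted_tokens[i];
    if (current_text == nullptr || tok != *current_text) {
      current = &index[tok];  // single lookup for the whole run
      current_text = &tok;
    }
    current->push_back(i);
  }
}
```

The per-occurrence work now touches only the cached PositionMap pointer rather than jumping to a fresh hash bucket each time, which is the cache-locality win the description claims.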

We have not yet perf tested this.

I also believe the Tokenize function can be improved further based on an approach I describe here: #859 (comment)

Signed-off-by: KarthikSubbarao <karthikrs2021@gmail.com>
sb_stemmer* stemmer = stemming_enabled ? GetStemmer() : nullptr;
// Deque grows by adding new blocks, which avoids the cost of copying
// existing elements during reallocation.
std::deque<std::string> tokens;

@KarthikSubbarao KarthikSubbarao Mar 6, 2026


Note: It should be possible for us to move away from using a string here and instead use a tagged pointer which can operate in two modes:

  1. Original - No transformations were needed (normalization, escaped-character handling, etc.). In this case, we point into the portion of the absl::string_view text data received for the entire field's content.

  2. Transformed - Normalization or escaped-character handling produced a newly allocated string for this token. Therefore, we hold the content of this new string rather than the original content in the field. The token is marked using the pointer's LSB so it can be freed later on.

This needs some testing and review of the tagged pointer Token struct. I have prototyped it in a separate branch, but I don't plan on including this for 1.2 unless we have high confidence in it. See here for details.

So, Tokenize can return a vector of these Tokens as the result.

The benefits of this will be:

  1. Fewer string allocations when no transformations were done on the token
  2. A smaller Token struct size (less overhead for the overall tokenization process)
  3. Faster processing and sorting of tokens at the end of tokenization
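The two-mode, LSB-tagged Token described above can be sketched like this. This is an illustrative design, not the prototype branch's actual code; in particular, shifting borrowed pointers left by one bit is my own assumption to keep the tag unambiguous, since a pointer into arbitrary field text may itself be odd:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical LSB-tagged Token. Heap-allocated std::string* values are at
// least 2-byte aligned, so their LSB is free to use as an ownership tag.
// Borrowed pointers into field text may be odd, so they are stored shifted
// left by one bit (safe on mainstream 64-bit platforms, where addresses fit
// well within 63 bits).
class Token {
 public:
  // Mode 1, "Original": borrow a slice of the field's text; nothing to free.
  static Token Original(std::string_view slice) {
    Token t;
    t.ptr_ = reinterpret_cast<uintptr_t>(slice.data()) << 1;  // LSB = 0
    t.len_ = slice.size();
    return t;
  }
  // Mode 2, "Transformed": own a freshly allocated (normalized) string.
  static Token Transformed(std::string text) {
    Token t;
    auto* owned = new std::string(std::move(text));
    t.ptr_ = reinterpret_cast<uintptr_t>(owned) | 1u;  // LSB = 1 marks owned
    t.len_ = owned->size();
    return t;
  }
  Token(Token&& other) noexcept : ptr_(other.ptr_), len_(other.len_) {
    other.ptr_ = 0;  // leave the source in a safely destructible state
  }
  Token(const Token&) = delete;
  ~Token() {
    if (ptr_ & 1u) delete reinterpret_cast<std::string*>(ptr_ & ~uintptr_t{1});
  }

  std::string_view text() const {
    if (ptr_ & 1u) {
      return *reinterpret_cast<const std::string*>(ptr_ & ~uintptr_t{1});
    }
    return {reinterpret_cast<const char*>(ptr_ >> 1), len_};
  }

 private:
  Token() = default;
  uintptr_t ptr_ = 0;
  std::size_t len_ = 0;
};
```

Note the lifetime requirement this creates: an Original token is only valid while the field's backing buffer is alive, which is why the approach needs careful review before shipping.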

Signed-off-by: KarthikSubbarao <karthikrs2021@gmail.com>
@KarthikSubbarao KarthikSubbarao marked this pull request as draft March 7, 2026 01:46