
WIP - use sorted tokens for cache locality in StageAttributeData #859

Draft
KarthikSubbarao wants to merge 27 commits into valkey-io:main from KarthikSubbarao:page

Conversation


@KarthikSubbarao KarthikSubbarao commented Mar 6, 2026

In this PR, we change the Tokenize function's signature so that it returns a vector of Token structs instead of a std::vector<std::string>.

This vector is also sorted by each token's text content at the end of the Tokenize function. The benefit is that the caller of this function, StageAttributeData, sees higher cache locality.
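As a minimal sketch of that shape (the Token fields and the whitespace-splitting stand-in below are illustrative assumptions, not the PR's actual code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical Token struct; the PR's real fields may differ.
struct Token {
  std::string text;
  uint32_t position;  // ordinal position of the token within the field
};

// Whitespace tokenizer standing in for the real Tokenize. The key point is
// the final sort by text content, so identical tokens end up adjacent.
std::vector<Token> Tokenize(const std::string& field) {
  std::vector<Token> tokens;
  std::istringstream in(field);
  std::string word;
  for (uint32_t pos = 0; in >> word; ++pos) {
    tokens.push_back({word, pos});
  }
  std::sort(tokens.begin(), tokens.end(),
            [](const Token& a, const Token& b) { return a.text < b.text; });
  return tokens;
}
```

Because the sort key is the text, every run of equal tokens is contiguous in the returned vector, which is what the consumer exploits below.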

The StageAttributeData function previously looped over a vector of strings in which each element could be a different token, performing a find on a hash map for each element to obtain its PositionMap. I believe these random jumps on every element of the token vector are the cause of the large number of page faults in ingestion perf test scenario 12A (and other similar scenarios). By having identical tokens grouped together and by holding the current token's PositionMap in StageAttributeData, I suspect the page faults will reduce significantly.
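A sketch of that grouped-iteration idea (the type names and position bookkeeping here are illustrative assumptions, not the module's real API):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the real PositionMap type: here, simply the
// list of indices at which a token occurs.
using PositionMap = std::vector<size_t>;

// With tokens sorted by text, identical tokens are adjacent, so we do one
// hash-map lookup per *distinct* token instead of one per occurrence, and
// keep working on the same PositionMap while the run of equal tokens lasts.
void StageAttributeData(const std::vector<std::string>& sorted_tokens,
                        std::unordered_map<std::string, PositionMap>& index) {
  PositionMap* current = nullptr;
  const std::string* current_text = nullptr;
  for (size_t i = 0; i < sorted_tokens.size(); ++i) {
    const std::string& tok = sorted_tokens[i];
    if (current_text == nullptr || tok != *current_text) {
      current = &index[tok];  // single lookup for the whole run
      current_text = &tok;
    }
    current->push_back(i);
  }
}
```

The per-occurrence work now touches only the cached PositionMap pointer rather than jumping to a fresh hash bucket each time, which is the cache-locality win the description claims.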

We have not yet perf tested this.

I also believe the Tokenize function can be improved further based on an approach I describe here: #859 (comment)

Signed-off-by: KarthikSubbarao <karthikrs2021@gmail.com>
sb_stemmer* stemmer = stemming_enabled ? GetStemmer() : nullptr;
// Deque grows by adding new blocks, which avoids the cost of copying
// existing elements during reallocation.
std::deque<std::string> tokens;

@KarthikSubbarao KarthikSubbarao Mar 6, 2026


Note: It should be possible for us to move away from using a string here and instead use a tagged pointer which can operate in two modes:

  1. Original - No transformations were needed (normalization, escaped-character handling, etc.). In this case, we point into the portion of the absl::string_view text data received for the entire field's content.

  2. Transformed - Normalization or escaped-character handling produced a newly allocated string for this token. Therefore, we hold the content of this new string rather than the original content in the field. The token is marked using the pointer's LSB so it can be freed later on.

This needs some testing and review of the tagged pointer Token struct. I have prototyped it in a separate branch, but I don't plan on including this for 1.2 unless we have high confidence in it. See here for details.

So, Tokenize can return a vector of these Tokens as the result.

The benefits of this will be:

  1. Fewer string allocations when no transformations were done on the token
  2. A smaller Token struct size (less overhead for the overall tokenization process)
  3. Faster processing and sorting of tokens at the end of tokenization
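The two-mode, LSB-tagged Token described above can be sketched like this. This is an illustrative design, not the prototype branch's actual code; in particular, shifting borrowed pointers left by one bit is my own assumption to keep the tag unambiguous, since a pointer into arbitrary field text may itself be odd:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical LSB-tagged Token. Heap-allocated std::string* values are at
// least 2-byte aligned, so their LSB is free to use as an ownership tag.
// Borrowed pointers into field text may be odd, so they are stored shifted
// left by one bit (safe on mainstream 64-bit platforms, where addresses fit
// well within 63 bits).
class Token {
 public:
  // Mode 1, "Original": borrow a slice of the field's text; nothing to free.
  static Token Original(std::string_view slice) {
    Token t;
    t.ptr_ = reinterpret_cast<uintptr_t>(slice.data()) << 1;  // LSB = 0
    t.len_ = slice.size();
    return t;
  }
  // Mode 2, "Transformed": own a freshly allocated (normalized) string.
  static Token Transformed(std::string text) {
    Token t;
    auto* owned = new std::string(std::move(text));
    t.ptr_ = reinterpret_cast<uintptr_t>(owned) | 1u;  // LSB = 1 marks owned
    t.len_ = owned->size();
    return t;
  }
  Token(Token&& other) noexcept : ptr_(other.ptr_), len_(other.len_) {
    other.ptr_ = 0;  // leave the source in a safely destructible state
  }
  Token(const Token&) = delete;
  ~Token() {
    if (ptr_ & 1u) delete reinterpret_cast<std::string*>(ptr_ & ~uintptr_t{1});
  }

  std::string_view text() const {
    if (ptr_ & 1u) {
      return *reinterpret_cast<const std::string*>(ptr_ & ~uintptr_t{1});
    }
    return {reinterpret_cast<const char*>(ptr_ >> 1), len_};
  }

 private:
  Token() = default;
  uintptr_t ptr_ = 0;
  std::size_t len_ = 0;
};
```

Note the lifetime requirement this creates: an Original token is only valid while the field's backing buffer is alive, which is why the approach needs careful review before shipping.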

Signed-off-by: KarthikSubbarao <karthikrs2021@gmail.com>
@KarthikSubbarao KarthikSubbarao marked this pull request as draft March 7, 2026 01:46