WIP - use sorted tokens for cache locality in StageAttributeData #859
KarthikSubbarao wants to merge 27 commits into valkey-io:main
Conversation
Signed-off-by: KarthikSubbarao <karthikrs2021@gmail.com>
```cpp
sb_stemmer* stemmer = stemming_enabled ? GetStemmer() : nullptr;
// Deque grows by adding new blocks—avoids the cost of copying
// existing elements during reallocation.
std::vector<std::string> tokens;
```
Note: It should be possible for us to move away from using a string here and instead use a tagged pointer which can operate in two modes:

- Original - No transformations needed (normalization, escaped-char handling, etc.). In this case, we point into the portion of the `absl::string_view` text data received for the entire field's content.
- Transformed - Due to normalization/escaped-char handling, a new string is allocated for this token. Therefore, we hold the content of this string rather than the original content in the field. The token here is marked using the LSB for freeing later on.

This needs some testing and review of the tagged-pointer Token struct. I have prototyped it in a separate branch, but don't plan on including this for 1.2 unless we have high confidence in it. See here for details.

So, we can return a vector of Tokens as the result.
The benefits of this will be:
- Fewer string allocations when no transformations were done on the token
- Smaller Token struct size (less overhead in the overall tokenization process)
- Faster processing and sorting of tokens at the end of tokenization
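As a rough illustration of the two-mode idea above (hypothetical names and layout; the prototype branch may differ), the tag bit can live in a pointer-sized word. Here the address is shifted left by one so the tag bit stays free even when the original text pointer is odd-aligned, which assumes the top pointer bit is unused, as on mainstream 64-bit platforms:

```cpp
#include <cstdint>
#include <string>
#include <string_view>

// Sketch only: "Original" tokens point into the field's text with no
// allocation; "Transformed" tokens own a heap string, marked by the low bit.
class Token {
 public:
  static Token Original(std::string_view text) {
    // LSB clear: borrowed pointer into the field content.
    return Token(reinterpret_cast<uintptr_t>(text.data()) << 1, text.size());
  }
  static Token Transformed(std::string s) {
    auto* owned = new std::string(std::move(s));
    // LSB set: this Token owns the string and must free it later.
    return Token((reinterpret_cast<uintptr_t>(owned) << 1) | 1u, owned->size());
  }
  bool owns() const { return bits_ & 1u; }
  std::string_view view() const {
    if (owns()) return *reinterpret_cast<const std::string*>(bits_ >> 1);
    return {reinterpret_cast<const char*>(bits_ >> 1), len_};
  }
  Token(Token&& o) noexcept : bits_(o.bits_), len_(o.len_) { o.bits_ = 0; }
  Token(const Token&) = delete;
  Token& operator=(const Token&) = delete;
  ~Token() {
    if (owns()) delete reinterpret_cast<std::string*>(bits_ >> 1);
  }

 private:
  Token(uintptr_t bits, size_t len) : bits_(bits), len_(len) {}
  uintptr_t bits_;
  size_t len_;
};
```

A sorted `std::vector<Token>` of these would then replace the current vector of strings, paying an allocation only for transformed tokens.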
In this PR, we change the Tokenize function signature from returning a `std::vector<std::string>` to returning a vector of a Token struct. This vector is also sorted by the token's text content at the end of the Tokenize function. The benefit is that the user of this function, StageAttributeData, sees higher cache locality.
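A minimal sketch of the sorting step, with a simplified hypothetical Token (the real struct carries more state):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical simplified Token: text content plus position in the field.
struct Token {
  std::string text;
  uint32_t position;
};

// Tail of the new Tokenize: sorting by text content makes identical tokens
// adjacent, so the caller can consume each distinct token's occurrences
// as one contiguous run.
std::vector<Token> SortTokens(std::vector<Token> tokens) {
  std::sort(tokens.begin(), tokens.end(),
            [](const Token& a, const Token& b) { return a.text < b.text; });
  return tokens;
}
```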
Previously, StageAttributeData would loop across a vector of strings where each element could be a different token, performing a find on a hash map for each element to obtain the PositionMap. These random jumps for every element of the token vector are what I believe cause the large number of page faults in the ingestion perf test scenario 12A (and other similar scenarios). With identical tokens grouped together, and with StageAttributeData holding on to the current token's PositionMap, I suspect the page faults will reduce significantly. We have yet to perf-test this.
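To make the locality argument concrete, here is a hedged sketch of the grouped-token loop, with assumed shapes: `std::map` stands in for the real index, and `PositionMap` is assumed to be a list of positions. Because tokens arrive sorted, the lookup runs once per distinct token rather than once per element, and consecutive writes hit the same PositionMap:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Assumed shape: the positions of one token within a field.
using PositionMap = std::vector<uint32_t>;

std::map<std::string, PositionMap> StagePositions(
    const std::vector<std::pair<std::string, uint32_t>>& sorted_tokens) {
  std::map<std::string, PositionMap> index;
  PositionMap* current = nullptr;
  const std::string* current_text = nullptr;
  for (const auto& [text, pos] : sorted_tokens) {
    if (current_text == nullptr || text != *current_text) {
      current = &index[text];  // one lookup only when the token changes
      current_text = &text;
    }
    current->push_back(pos);   // duplicate tokens reuse the cached pointer
  }
  return index;
}
```

With the old unsorted input, every iteration would take the lookup branch, which matches the per-element hash-map `find` the description blames for the page faults.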
I also believe the Tokenize function can be improved further based on an approach I describe here: #859 (comment)