Incorrect result of Korean sample with character '고' in ARM64 architecture #9
Description
Hi there, first thanks so much for this amazing library, it works like a charm.
While developing, I hit a bug using it on my Jetson AGX arm64 box. If we install the Python package and run the tokenization example for the 4 languages, it fails on Korean because of incorrect whitespace stripping. This could be due to a difference between the ARM and x86_64 C++ toolchains (for example, plain char is unsigned by default on ARM64 Linux but signed on x86_64).
What happens with '고' (U+ACE0, UTF-8: EA B3 A0) in the whitespace_tokenize function?
First iteration: *p = 0xEA → is_whitespace_cp(0xEA) → returns false ✓
Second iteration: *p = 0xB3 → is_whitespace_cp(0xB3) → returns false ✓
Third iteration: *p = 0xA0 → is_whitespace_cp(0xA0) → returns true ❌ BUG!
The continuation byte 0xA0 is treated as if it were a whole Unicode code point (non-breaking space, U+00A0), so the function thinks it is whitespace and splits the token, leaving only EA B3.
Solution: in functions.h, apply the following patch to the whitespace_tokenize function, so that it decodes whole UTF-8 code points instead of testing individual bytes.
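The byte-by-byte walk above can be reproduced outside the library with a small sketch. Everything here is an assumption for illustration: is_whitespace_cp is a simplified stand-in for the library's function (it only knows a handful of code points), bytewise_whitespace_offsets is a hypothetical name, and the cast to unsigned char mimics a platform where each byte reaches the check as a value in 0..255.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for the library's is_whitespace_cp (assumption:
// the real function covers more code points than these five).
static bool is_whitespace_cp(int cp) {
    return cp == ' ' || cp == '\t' || cp == '\n' || cp == '\r' || cp == 0x00A0;
}

// Hypothetical helper: returns the offsets of the bytes that a byte-wise
// scan would classify as whitespace.
static std::vector<std::size_t> bytewise_whitespace_offsets(const std::string &text) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < text.size(); ++i) {
        // Each byte is widened to 0..255 and passed off as a code point.
        const int as_cp = static_cast<unsigned char>(text[i]);
        if (is_whitespace_cp(as_cp)) {
            hits.push_back(i);
        }
    }
    return hits;
}
```

Calling bytewise_whitespace_offsets on the three bytes EA B3 A0 reports offset 2, which is the third iteration flagged above.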
static std::pmr::vector<std::pmr::string> whitespace_tokenize(const std::pmr::string &text)
{
    if (text.empty())
    {
        return {};
    }
    // Per-thread scratch memory, reset on every call.
    thread_local char buffer[8192];
    thread_local std::pmr::monotonic_buffer_resource pool(buffer, sizeof(buffer));
    pool.release();
    std::pmr::vector<std::pmr::string> tokens{&pool};
    tokens.reserve(std::count(text.begin(), text.end(), ' ') + 1);
    const char *start = text.data();
    const char *end = start + text.size();
    const char *token_start = nullptr;
    size_t i = 0;
    size_t len = text.size();
    while (i < len) {
        // Decode the full code point at offset i instead of testing a single byte.
        const int cp = utf8_to_codepoint(text, i);
        const size_t char_len = utf8_char_length(static_cast<unsigned char>(text[i]));
        if (!is_whitespace_cp(cp)) {
            if (!token_start) {
                token_start = start + i;
            }
        } else if (token_start) {
            // Whitespace ends the current token.
            tokens.emplace_back(token_start, (start + i) - token_start);
            token_start = nullptr;
        }
        // Advance by the whole UTF-8 sequence so continuation bytes are never inspected alone.
        i += char_len;
    }
    // Flush the trailing token, if any.
    if (token_start) {
        tokens.emplace_back(token_start, end - token_start);
    }
    return tokens;
}
I should have opened a pull request, but I'm quite busy, so I'm just reporting it here to contribute.
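For anyone who wants to check the code-point-aware approach without building the library, here is a minimal standalone sketch. It is not the library's code: utf8_char_length, utf8_to_codepoint and is_whitespace_cp below are hypothetical reimplementations that only share names with the helpers used in the patch, tokenize_by_codepoint is an invented name, and plain std::string and std::vector replace the pmr containers. The decoder does no validation beyond guarding against truncated input.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: sequence length from the lead byte. Invalid lead
// bytes count as length 1 so the scan always advances.
static std::size_t utf8_char_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

// Hypothetical helper: decodes the code point starting at byte offset i.
static int utf8_to_codepoint(const std::string &s, std::size_t i) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    const std::size_t n = utf8_char_length(lead);
    if (n == 1 || i + n > s.size()) return lead;  // ASCII, invalid, or truncated
    int cp = lead & (0xFF >> (n + 1));            // payload bits of the lead byte
    for (std::size_t k = 1; k < n; ++k) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
    }
    return cp;
}

// Simplified stand-in for the library's whitespace check.
static bool is_whitespace_cp(int cp) {
    return cp == ' ' || cp == '\t' || cp == '\n' || cp == '\r' || cp == 0x00A0;
}

// Splits on whitespace code points, never on individual bytes.
static std::vector<std::string> tokenize_by_codepoint(const std::string &text) {
    std::vector<std::string> tokens;
    std::size_t token_start = std::string::npos;
    std::size_t i = 0;
    while (i < text.size()) {
        const int cp = utf8_to_codepoint(text, i);
        if (!is_whitespace_cp(cp)) {
            if (token_start == std::string::npos) token_start = i;
        } else if (token_start != std::string::npos) {
            tokens.emplace_back(text, token_start, i - token_start);
            token_start = std::string::npos;
        }
        i += utf8_char_length(static_cast<unsigned char>(text[i]));
    }
    if (token_start != std::string::npos) {
        tokens.emplace_back(text, token_start, text.size() - token_start);
    }
    return tokens;
}
```

With this sketch, the bytes EA B3 A0 stay together as one token, while a real non-breaking space (C2 A0) still separates tokens.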