Incorrect result of Korean sample with character '고' in ARM64 architecture

Hi there, and first of all thanks so much for this amazing library; it works like a charm.

While developing, I hit a bug when using it on my Jetson AGX (arm64) box. If we install the Python package and run the example that tokenizes the 4 languages, the Korean case fails because of incorrect whitespace splitting. This could be due to a difference between the ARM and x86_64 C++ compilers.

What happens with '고' (UTF-8: EA B3 A0) in the `whitespace_tokenize` function:

- First iteration: `*p = 0xEA` → `is_whitespace_cp(0xEA)` → returns false ✓
- Second iteration: `*p = 0xB3` → `is_whitespace_cp(0xB3)` → returns false ✓
- Third iteration: `*p = 0xA0` → `is_whitespace_cp(0xA0)` → returns true ❌ BUG!

The byte 0xA0 is only a UTF-8 continuation byte here, but it is passed to `is_whitespace_cp` as if it were the Unicode code point U+00A0 (non-breaking space). The function therefore treats it as whitespace and splits the token, leaving only EA B3.
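The byte-wise failure can be reproduced in isolation. The sketch below is not the library's code: `is_ws_cp_demo` and `split_bytewise_demo` are hypothetical stand-ins written for this report, and the whitespace set is an assumption. It also hints at one plausible reason for the platform difference, which I have not verified against the library source: plain `char` is unsigned by default on AArch64 Linux but signed on x86_64, so the byte 0xA0 reads as 160 on ARM and as -96 on x86_64, and only the former compares equal to U+00A0.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for is_whitespace_cp: ASCII whitespace plus U+00A0.
inline bool is_ws_cp_demo(std::uint32_t cp) {
    return cp == ' ' || cp == '\t' || cp == '\n' || cp == '\r' || cp == 0x00A0;
}

// Value of a byte when it is read through a signed char (the x86_64 default).
inline int as_signed_char_value(unsigned char byte) {
    return static_cast<signed char>(byte);
}

// Byte-wise split: every byte is classified as if it were a whole code point.
// The cast to unsigned char mimics what a platform with unsigned plain char sees.
inline std::vector<std::string> split_bytewise_demo(const std::string &text) {
    std::vector<std::string> tokens;
    std::string current;
    for (char c : text) {
        const auto byte = static_cast<unsigned char>(c);
        if (is_ws_cp_demo(byte)) {
            // A continuation byte such as 0xA0 lands here and cuts the token short.
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
        } else {
            current.push_back(c);
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}
```

Calling `split_bytewise_demo("\xEA\xB3\xA0")` yields a single two-byte token `EA B3`, which matches the truncated result described above.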

**Solution**: In `functions.h`, apply the following patch to the `whitespace_tokenize` function so that it classifies decoded code points instead of raw bytes.

```cpp
static std::pmr::vector<std::pmr::string> whitespace_tokenize(const std::pmr::string &text)
{
    if (text.empty())
    {
        return {};
    }
    thread_local char buffer[8192];
    thread_local std::pmr::monotonic_buffer_resource pool(buffer, sizeof(buffer));
    pool.release();
    std::pmr::vector<std::pmr::string> tokens{&pool};
    tokens.reserve(std::count(text.begin(), text.end(), ' ') + 1);
    
    const char *start = text.data();
    const char *end = start + text.size();
    const char *token_start = nullptr;
    
    size_t i = 0;
    size_t len = text.size();
    
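    // Walk the text one UTF-8 character at a time instead of one byte at a time.
    while (i < len) {
        // Decode the full code point so a continuation byte such as 0xA0 is never tested on its own.
        const int cp = utf8_to_codepoint(text, i);
        const size_t char_len = utf8_char_length(static_cast<unsigned char>(text[i]));

        if (!is_whitespace_cp(cp)) {
            // First non-whitespace character after a gap starts a new token.
            if (!token_start) {
                token_start = start + i;
            }
        } else if (token_start) {
            // Whitespace closes the token that is currently open.
            tokens.emplace_back(token_start, (start + i) - token_start);
            token_start = nullptr;
        }

        // Skip every byte of the character that was just classified.
        i += char_len;
    }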
    
    if (token_start) {
        tokens.emplace_back(token_start, end - token_start);
    }
    
    return tokens;
}
```
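Since the patch relies on `utf8_to_codepoint` and `utf8_char_length`, whose definitions are not shown here, the following is a self-contained sketch of the same idea for anyone who wants to test it outside the library. All names ending in `_sketch` are hypothetical, the decoder does not validate continuation bytes or overlong forms, and the whitespace set is an assumption. It uses plain `std::string` and `std::vector` rather than the `std::pmr` types.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: sequence length implied by the lead byte.
// Invalid lead bytes report 1 so that the scan always advances.
inline std::size_t utf8_len_sketch(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead >> 5) == 0x06) return 2;
    if ((lead >> 4) == 0x0E) return 3;
    if ((lead >> 3) == 0x1E) return 4;
    return 1;
}

// Hypothetical helper: decode the code point that starts at byte i.
// Stray or truncated sequences become U+FFFD instead of being read as a raw byte.
inline std::uint32_t utf8_decode_sketch(const std::string &s, std::size_t i, std::size_t len) {
    const auto byte = [&](std::size_t k) {
        return static_cast<std::uint32_t>(static_cast<unsigned char>(s[i + k]));
    };
    const std::uint32_t lead = byte(0);
    if (lead < 0x80) return lead;
    if (len == 1 || i + len > s.size()) return 0xFFFD;
    std::uint32_t cp = lead & (0xFFu >> (len + 1));  // keep only the payload bits of the lead byte
    for (std::size_t k = 1; k < len; ++k) {
        cp = (cp << 6) | (byte(k) & 0x3Fu);
    }
    return cp;
}

// Assumed whitespace set: ASCII whitespace plus U+00A0.
inline bool is_ws_sketch(std::uint32_t cp) {
    return cp == ' ' || cp == '\t' || cp == '\n' || cp == '\r' || cp == 0x00A0;
}

// Split on whitespace, classifying whole code points rather than single bytes.
inline std::vector<std::string> whitespace_tokenize_sketch(const std::string &text) {
    std::vector<std::string> tokens;
    const std::size_t npos = std::string::npos;
    std::size_t token_start = npos;
    std::size_t i = 0;
    while (i < text.size()) {
        const std::size_t len = utf8_len_sketch(static_cast<unsigned char>(text[i]));
        const std::uint32_t cp = utf8_decode_sketch(text, i, len);
        if (!is_ws_sketch(cp)) {
            if (token_start == npos) token_start = i;
        } else if (token_start != npos) {
            tokens.emplace_back(text, token_start, i - token_start);
            token_start = npos;
        }
        i += len;
    }
    if (token_start != npos) tokens.emplace_back(text, token_start, npos);
    return tokens;
}
```

With this version, `"\xEA\xB3\xA0"` stays a single three-byte token, while a real non-breaking space (UTF-8 `C2 A0`) between two letters still splits them.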

I should have opened a pull request, but I'm quite busy, so I'm just reporting it here to contribute.
