Incorrect result of Korean sample with character '고' in ARM64 architecture #9

@vsantosu

Hi there. First, thanks so much for this amazing library; it works like a charm.

While developing, I hit a bug using it on my Jetson AGX (arm64) box. If we install the Python package and run the tokenization example for the 4 languages, the Korean sample fails because of incorrect whitespace stripping. This could be due to a difference between the ARM and x86_64 C++ compilers.
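
One plausible explanation for the platform difference, which this report does not confirm, is the signedness of plain `char`: it is unsigned by default on ARM64 Linux and signed on x86_64. If a byte is widened to `int` through plain `char`, 0xA0 becomes 160 on ARM64 and -96 on x86_64, and only 160 matches U+00A0. The sketch below illustrates that assumption. `demo_is_whitespace_cp` is a hypothetical stand-in for the library's `is_whitespace_cp`, and `byte_is_whitespace_as` exists only for this illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for the library's is_whitespace_cp, reduced to the
// few code points needed for this illustration.
static bool demo_is_whitespace_cp(int cp) {
    return cp == ' ' || cp == '\t' || cp == '\n' || cp == '\r' || cp == 0xA0;
}

// Widen one raw byte to int the way a char of the given signedness would,
// then ask whether the result looks like whitespace.
template <typename CharT>
static bool byte_is_whitespace_as(unsigned char raw) {
    static_assert(sizeof(CharT) == 1, "CharT must be a one-byte character type");
    // For signed char, 0xA0 becomes -96 on two's-complement targets
    // (implementation-defined before C++20, guaranteed from C++20 on).
    const CharT c = static_cast<CharT>(raw);
    return demo_is_whitespace_cp(static_cast<int>(c));
}

// Checks both cases for the last byte of the UTF-8 encoding of U+ACE0.
static void check_signedness_effect() {
    assert(byte_is_whitespace_as<unsigned char>(0xA0)); // ARM64-like: 160 == U+00A0
    assert(!byte_is_whitespace_as<signed char>(0xA0));  // x86_64-like: -96, no match
}
```

If this is indeed the cause, the byte-wise scan is wrong on both platforms, and x86_64 only hides it because the negative value never matches a whitespace code point.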

What happens with '고' (UTF-8: EA B3 A0) in the whitespace_tokenize function?

First iteration: *p = 0xEA → is_whitespace_cp(0xEA) → returns false ✓
Second iteration: *p = 0xB3 → is_whitespace_cp(0xB3) → returns false ✓
Third iteration: *p = 0xA0 → is_whitespace_cp(0xA0) → returns true ❌ BUG!
Here 0xA0 is a UTF-8 continuation byte, but it is treated as a standalone Unicode code point (non-breaking space, U+00A0), so the function takes it for whitespace and splits the token, leaving only the truncated sequence EA B3.
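
The failure described above can be reproduced in isolation with a minimal sketch. It assumes the original loop passes each byte, as an unsigned value, straight to the whitespace check. `demo_is_ws` and `bytewise_split` are hypothetical names written for this illustration, not the library's code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical whitespace predicate over code points (illustration only).
static bool demo_is_ws(int cp) {
    return cp == ' ' || cp == '\t' || cp == '\n' || cp == '\r' || cp == 0xA0;
}

// Byte-wise split: every byte is (incorrectly) treated as a code point.
static std::vector<std::string> bytewise_split(const std::string &text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char byte : text) {
        if (demo_is_ws(byte)) {
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
        } else {
            current.push_back(static_cast<char>(byte));
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}

// Shows that the continuation byte 0xA0 is dropped from U+ACE0 (EA B3 A0).
static void check_bytewise_split() {
    const std::vector<std::string> tokens = bytewise_split("\xEA\xB3\xA0");
    assert(tokens.size() == 1);
    assert(tokens[0] == "\xEA\xB3"); // truncated, no longer valid UTF-8
}
```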

Solution: in functions.h, apply the following patch to the whitespace_tokenize function.

static std::pmr::vector<std::pmr::string> whitespace_tokenize(const std::pmr::string &text)
{
    if (text.empty())
    {
        return {};
    }
    // Per-thread scratch buffer, reset on every call.
    thread_local char buffer[8192];
    thread_local std::pmr::monotonic_buffer_resource pool(buffer, sizeof(buffer));
    pool.release();
    std::pmr::vector<std::pmr::string> tokens{&pool};
    tokens.reserve(std::count(text.begin(), text.end(), ' ') + 1);

    const char *start = text.data();
    const char *end = start + text.size();
    const char *token_start = nullptr;

    size_t i = 0;
    size_t len = text.size();

    while (i < len) {
        // Decode the whole code point instead of testing a single byte,
        // so continuation bytes such as 0xA0 are never checked on their own.
        const int cp = utf8_to_codepoint(text, i);
        const size_t char_len = utf8_char_length(static_cast<unsigned char>(text[i]));

        if (!is_whitespace_cp(cp)) {
            if (!token_start) {
                token_start = start + i;
            }
        } else if (token_start) {
            // Whitespace ends the current token.
            tokens.emplace_back(token_start, (start + i) - token_start);
            token_start = nullptr;
        }

        // Advance by the full length of the encoded character.
        i += char_len;
    }

    // Flush the last token if the text does not end in whitespace.
    if (token_start) {
        tokens.emplace_back(token_start, end - token_start);
    }

    return tokens;
}
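
To try the idea behind the patch outside the library, here is a self-contained sketch of the same approach: decode one code point at a time and advance by its encoded length. Everything in it is an assumption made for illustration. `sketch_seq_len`, `sketch_decode`, `sketch_is_ws` and `sketch_tokenize` are hypothetical stand-ins for `utf8_char_length`, `utf8_to_codepoint`, `is_whitespace_cp` and `whitespace_tokenize`, whose real implementations are not shown in this report, and it uses `std::string` instead of the pmr types.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Encoded length taken from the lead byte; invalid bytes count as length 1.
static std::size_t sketch_seq_len(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead >> 5) == 0x06) return 2;
    if ((lead >> 4) == 0x0E) return 3;
    if ((lead >> 3) == 0x1E) return 4;
    return 1; // stray continuation byte or invalid lead byte
}

// Decode the code point at index i; U+FFFD for invalid or truncated input.
// Continuation bytes are not validated in this sketch.
static int sketch_decode(const std::string &s, std::size_t i, std::size_t n) {
    assert(i < s.size());
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    if (lead < 0x80) return lead;
    if (n == 1 || i + n > s.size()) return 0xFFFD;
    int cp = lead & (0xFF >> (n + 1)); // keep the payload bits of the lead byte
    for (std::size_t k = 1; k < n; ++k) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
    }
    return cp;
}

// Reduced whitespace set, enough for this illustration.
static bool sketch_is_ws(int cp) {
    return cp == ' ' || cp == '\t' || cp == '\n' || cp == '\r' || cp == 0xA0;
}

// Split on whitespace code points, never in the middle of a character.
static std::vector<std::string> sketch_tokenize(const std::string &text) {
    std::vector<std::string> tokens;
    const std::size_t none = std::string::npos;
    std::size_t token_start = none;
    std::size_t i = 0;
    while (i < text.size()) {
        const std::size_t n = sketch_seq_len(static_cast<unsigned char>(text[i]));
        const int cp = sketch_decode(text, i, n);
        if (!sketch_is_ws(cp)) {
            if (token_start == none) token_start = i;
        } else if (token_start != none) {
            tokens.push_back(text.substr(token_start, i - token_start));
            token_start = none;
        }
        i += n;
    }
    if (token_start != none) tokens.push_back(text.substr(token_start));
    return tokens;
}
```

With this sketch, the bytes EA B3 A0 come back as a single three-byte token, while a real non-breaking space (encoded as C2 A0) still separates tokens.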

I should have opened a pull request, but I'm quite busy, so I'm just reporting it here to contribute.
