
[BPE] Fix memory leak caused by cyclic token list ownership#629

Open
sixers wants to merge 3 commits into openvinotoolkit:master from sixers:sixers/fix-bpe-tokenizer-memory-leak

Conversation

sixers commented Mar 13, 2026

Fixes memory leak in BPETokenizer caused by cyclic ownership in TokensList::Node.
The memory leak is unbounded and eventually crashes the process or VM.

Root cause

BPETokenizer uses an internal doubly linked token list:

  • next is std::shared_ptr<Node>
  • prev was also std::shared_ptr<Node>

This creates ownership cycles (A <-> B) that prevent timely reclamation on uncached tokenization paths and manifests as RSS growth under sustained unique-input workloads. Even though TokensList advances head, cyclic references keep nodes alive.
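The cycle can be shown in isolation. Below is a minimal sketch (hypothetical `CyclicNode`/`WeakNode` structs modeled on the description above, not the PR's actual code): with an owning `shared_ptr` back-link, two linked nodes keep each other alive after all external references are dropped; a `weak_ptr` back-link breaks the cycle.

```cpp
#include <memory>

static int g_alive = 0;  // counts currently-live nodes via ctor/dtor

// Buggy shape: both links own their target.
struct CyclicNode {
    std::shared_ptr<CyclicNode> prev;
    std::shared_ptr<CyclicNode> next;
    CyclicNode()  { ++g_alive; }
    ~CyclicNode() { --g_alive; }
};

// Fixed shape: the back-link is non-owning.
struct WeakNode {
    std::weak_ptr<WeakNode> prev;
    std::shared_ptr<WeakNode> next;
    WeakNode()  { ++g_alive; }
    ~WeakNode() { --g_alive; }
};

// Link two nodes A <-> B, drop the external references, report survivors.
int leaked_with_shared_prev() {
    g_alive = 0;
    {
        auto a = std::make_shared<CyclicNode>();
        auto b = std::make_shared<CyclicNode>();
        a->next = b;
        b->prev = a;  // completes the A <-> B ownership cycle
    }
    return g_alive;  // 2: each node still holds a strong ref to the other
}

int leaked_with_weak_prev() {
    g_alive = 0;
    {
        auto a = std::make_shared<WeakNode>();
        auto b = std::make_shared<WeakNode>();
        a->next = b;
        b->prev = a;  // back-link does not own, so no cycle
    }
    return g_alive;  // 0: both nodes reclaimed when the locals go out of scope
}
```

This mirrors why advancing head doesn't help in the buggy version: dropping the external reference leaves each node's strong count at 1, held by its neighbor.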

Testing

Here's a Python script (bpe_ownership_leak_repro.py) that reproduces the bug:

import os
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path
import openvino_tokenizers  # noqa: F401  # registers tokenizer ops/extensions
from openvino import Core, save_model
from openvino_tokenizers import convert_tokenizer
from transformers import AutoTokenizer
HF_MODEL_ID = "jhu-clsp/mmBERT-small"
N_UNIQUE = 5000
LEAK_THRESHOLD_KB = 25000
def rss_kb() -> int:
    for line in Path("/proc/self/status").read_text().splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    raise RuntimeError("VmRSS not found")
def unique_texts(offset: int) -> list[str]:
    base = (
        "Dies ist ein reproduzierbarer Lasttext für BPETokenizer. "
        "Vogliamo misurare la crescita RSS su input unici. "
    )
    return [
        f"id={i:06d} {base} marker={i % 37} section={(i * 17) % 101} "
        + ("ß" * ((i % 5) + 1))
        + (" 😀" * ((i % 3) + 1))
        for i in range(offset, offset + N_UNIQUE)
    ]
def build_tokenizer_model_xml() -> Path:
    hf_tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_ID)
    ov_tokenizer = convert_tokenizer(hf_tokenizer, use_max_padding=True, max_length=128)
    temp_dir = Path(tempfile.mkdtemp(prefix="ovtok_bpe_leak_repro_"))
    tokenizer_xml = temp_dir / "openvino_tokenizer.xml"
    save_model(ov_tokenizer, str(tokenizer_xml))
    tree = ET.parse(tokenizer_xml)
    root = tree.getroot()
    bpe_data = root.find(".//layer[@type='BPETokenizer']/data")
    if bpe_data is None:
        raise RuntimeError("BPETokenizer layer not found")
    if bpe_data.attrib.get("end_suffix", "") != "":
        raise RuntimeError("Expected BPETokenizer end_suffix='' for 1.1-only repro")
    bpe_data.attrib["cache_capacity"] = "0"
    tree.write(tokenizer_xml, encoding="unicode")
    return tokenizer_xml
def run_pass(compiled, texts: list[str]) -> int:
    for text in texts:
        compiled([text])
    return rss_kb()
def main() -> None:
    tokenizer_xml = build_tokenizer_model_xml()
    compiled = Core().compile_model(str(tokenizer_xml), "CPU")
    baseline = rss_kb()
    rss_after_pass1 = run_pass(compiled, unique_texts(0))
    rss_after_pass2 = run_pass(compiled, unique_texts(N_UNIQUE))
    pass1_delta = rss_after_pass1 - baseline
    pass2_delta = rss_after_pass2 - rss_after_pass1
    print(f"[repro] baseline={baseline}kB pass1_delta={pass1_delta}kB pass2_delta={pass2_delta}kB")
    if pass2_delta >= LEAK_THRESHOLD_KB:
        raise AssertionError(f"Leak reproduced: pass2 RSS grew by {pass2_delta} kB (>= {LEAK_THRESHOLD_KB} kB)")
    print("[repro] PASS: no ownership leak detected")
if __name__ == "__main__":
    main()

Run command:

OV_TOKENIZER_PREBUILD_EXTENSION_PATH="/workspace/openvino_tokenizers_iso_cachekey/build_iso/src/libopenvino_tokenizers.so" \
python bpe_ownership_leak_repro.py

Current build's output:

[repro] baseline=975384kB pass1_delta=74040kB pass2_delta=75768kB
Traceback (most recent call last):
  File "/workspace/openvino_tokenizers/tests/bpe_ownership_leak_repro.py", line 91, in <module>
    main()
  File "/workspace/openvino_tokenizers/tests/bpe_ownership_leak_repro.py", line 85, in main
    raise AssertionError(f"Leak reproduced: pass2 RSS grew by {pass2_delta} kB (>= {LEAK_THRESHOLD_KB} kB)")
AssertionError: Leak reproduced: pass2 RSS grew by 75768 kB (>= 25000 kB)

Patched build's output:

[repro] baseline=977888kB pass1_delta=24188kB pass2_delta=0kB
[repro] PASS: no ownership leak detected

Disclaimers:

  • AI assistance used: yes (Codex 5.3 for identifying the source of the leak, implementing the fix, and validation)
  • Human validation performed: confirmed output compatibility, reviewed the code

Copilot AI left a comment

Pull request overview

Fixes an unbounded memory leak in the C++ BPETokenizer implementation by breaking cyclic ownership in its internal doubly-linked token list (the prev link), preventing nodes from being kept alive indefinitely during sustained unique-input workloads.

Changes:

  • Switch TokensList::Node::prev from std::shared_ptr to std::weak_ptr to eliminate A <-> B ownership cycles.
  • Update list relinking logic during merges to use prev.lock() safely.
  • Update BPE merge/priority-queue neighbor handling to use the locked prev node when available.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Changed files:

  • src/bpe_tokenizer.hpp: Breaks cyclic ownership by making prev a weak_ptr and updates merge relinking accordingly.
  • src/bpe_tokenizer.cpp: Adjusts neighbor invalidation and PQ updates to account for prev being weak (via lock()).


Fix build issue: std::weak_ptr has no nullptr constructor.
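For context on that build fix: unlike `std::shared_ptr`, `std::weak_ptr` has no constructor taking `nullptr`, so the member initializer must default-construct it (`prev()` rather than `prev(nullptr)`). A default-constructed `weak_ptr` is already empty, as this small sketch (illustrative `Node`, not the PR's type) shows:

```cpp
#include <memory>

struct Node {};

// A default-constructed weak_ptr is empty: expired() is true and lock()
// returns an empty shared_ptr. There is no weak_ptr(nullptr) overload,
// which is why prev(nullptr) fails to compile while prev() works.
bool default_weak_ptr_is_empty() {
    std::weak_ptr<Node> prev;  // equivalent of the prev() member initializer
    return prev.expired() && prev.lock() == nullptr;
}
```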
Copilot AI left a comment

Pull request overview

Fixes an unbounded memory leak in the BPE tokenizer’s internal token list by breaking shared_ptr ownership cycles in the doubly-linked list nodes, preventing retained nodes under sustained unique-input workloads.

Changes:

  • Replaced TokensList::Node::prev from std::shared_ptr to std::weak_ptr to eliminate cyclic ownership.
  • Updated merge/list-manipulation logic to use prev.lock() when rewiring neighbors.
  • Updated BPETokenizer’s merge loop to work with weak_ptr back-links.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Changed files:

  • src/bpe_tokenizer.hpp: Breaks node ownership cycles by switching prev to weak_ptr and adjusting list merge rewiring accordingly.
  • src/bpe_tokenizer.cpp: Updates BPE merge/invalidation logic to safely access prev via lock() after the ownership change.


Comment on lines +263 to +264
if (const auto prev_node = first_it->prev.lock()) {
const auto prev_pair = std::make_pair(prev_node->data, new_node->data);
Copilot AI Mar 20, 2026

Consider capturing the previous node once per loop iteration (before merge_neighbors) and reusing it for both invalidation and PQ update. Right now first_it->prev.lock() is called twice, which is redundant and also makes it easier for the two checks to diverge if this logic changes later.
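Applied here, the suggestion amounts to locking the back-link once per iteration and deriving both the invalidated pair and the new priority-queue pair from the same result. A sketch with an illustrative `Node` and helper (the real loop in src/bpe_tokenizer.cpp carries more state than this):

```cpp
#include <memory>
#include <utility>

struct Node {
    int data = 0;
    std::weak_ptr<Node> prev;
};

// Lock first_it->prev exactly once, then reuse prev_node for both the
// pair being invalidated and the pair pushed into the priority queue.
std::pair<std::pair<int, int>, std::pair<int, int>>
pairs_around_merge(const std::shared_ptr<Node>& first_it,
                   const std::shared_ptr<Node>& new_node) {
    std::pair<int, int> invalidated{-1, -1}, pushed{-1, -1};
    if (const auto prev_node = first_it->prev.lock()) {  // single lock()
        invalidated = {prev_node->data, first_it->data};
        pushed = {prev_node->data, new_node->data};
    }
    return {invalidated, pushed};
}
```

Besides avoiding a redundant atomic refcount bump, this guarantees both uses observe the same neighbor even if the list is mutated between the two lookups.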

Comment on lines +27 to +29
std::weak_ptr<Node> prev;
std::shared_ptr<Node> next;
Node(const T& data) : data(data), prev(nullptr), next(nullptr) {}
Node(const T& data) : data(data), prev(), next(nullptr) {}
Copilot AI Mar 20, 2026

This change fixes a critical memory leak, but there’s no automated regression test guarding against reintroducing ownership cycles / unreclaimed nodes on the uncached tokenization path (e.g., cache_capacity=0 with many unique inputs). Adding a targeted test (potentially behind a slow/optional marker) would help catch future regressions.

