[BPE] Fix memory leak caused by cyclic token list ownership#629
[BPE] Fix memory leak caused by cyclic token list ownership#629sixers wants to merge 3 commits intoopenvinotoolkit:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes an unbounded memory leak in the C++ BPETokenizer implementation by breaking cyclic ownership in its internal doubly-linked token list (the prev link), preventing nodes from being kept alive indefinitely during sustained unique-input workloads.
Changes:
- Switch
TokensList::Node::prevfromstd::shared_ptrtostd::weak_ptrto eliminateA <-> Bownership cycles. - Update list relinking logic during merges to use
prev.lock()safely. - Update BPE merge/priority-queue neighbor handling to use the locked
prevnode when available.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
src/bpe_tokenizer.hpp |
Breaks cyclic ownership by making prev a weak_ptr and updates merge relinking accordingly. |
src/bpe_tokenizer.cpp |
Adjusts neighbor invalidation and PQ updates to account for prev being weak (via lock()). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Fix build issue, std::weak_ptr has no nullptr constructor.
There was a problem hiding this comment.
Pull request overview
Fixes an unbounded memory leak in the BPE tokenizer’s internal token list by breaking shared_ptr ownership cycles in the doubly-linked list nodes, preventing retained nodes under sustained unique-input workloads.
Changes:
- Replaced
TokensList::Node::prevfromstd::shared_ptrtostd::weak_ptrto eliminate cyclic ownership. - Updated merge/list-manipulation logic to use
prev.lock()when rewiring neighbors. - Updated BPETokenizer’s merge loop to work with
weak_ptrback-links.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/bpe_tokenizer.hpp |
Breaks node ownership cycles by switching prev to weak_ptr and adjusting list merge rewiring accordingly. |
src/bpe_tokenizer.cpp |
Updates BPE merge/invalidation logic to safely access prev via lock() after the ownership change. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (const auto prev_node = first_it->prev.lock()) { | ||
| const auto prev_pair = std::make_pair(prev_node->data, new_node->data); |
There was a problem hiding this comment.
Consider capturing the previous node once per loop iteration (before merge_neighbors) and reusing it for both invalidation and PQ update. Right now first_it->prev.lock() is called twice, which is redundant and also makes it easier for the two checks to diverge if this logic changes later.
| std::weak_ptr<Node> prev; | ||
| std::shared_ptr<Node> next; | ||
| Node(const T& data) : data(data), prev(nullptr), next(nullptr) {} | ||
| Node(const T& data) : data(data), prev(), next(nullptr) {} |
There was a problem hiding this comment.
This change fixes a critical memory leak, but there’s no automated regression test guarding against reintroducing ownership cycles / unreclaimed nodes on the uncached tokenization path (e.g., cache_capacity=0 with many unique inputs). Adding a targeted test (potentially behind a slow/optional marker) would help catch future regressions.
Details
Fixes memory leak in
BPETokenizercaused by cyclic ownership inTokensList::Node.The memory leak is unbounded and eventually crashes the process or VM.
Root cause
BPETokenizeruses an internal doubly linked token list:nextisstd::shared_ptr<Node>prevwas alsostd::shared_ptr<Node>This creates ownership cycles (
A <-> B) that prevent timely reclamation on uncached tokenization paths and manifests as RSS growth under sustained unique-input workloads. Even thoughTokensListadvances head, cyclic references keep nodes alive.Testing
Here's a python script that reproduces the bug:
Click here to show
bpe_ownership_leak_repro.pyRun command:
Current build's output:
Patched build's output:
Disclaimers:
AI assistance used: yes
If yes: Codex 5.3 for identifying the source of the leak, implementing the fix, and validation
Human validation performed: Confirmed output compatibility, reviewed the code