Fix: Integer overflow in calculate_overlap_ratio (utils.py:248) a…#5007

Closed
albcunha wants to merge 22 commits into PaddlePaddle:develop from albcunha:patch-1

Conversation

@albcunha

…t paddleocr

What it does

Updates calculate_overlap_ratio in utils.py to cast its inputs to float64 and to multiply with an explicit float64 dtype.

Problem

calculate_overlap_ratio in paddlex/inference/pipelines/layout_parsing/utils.py triggers a RuntimeWarning: overflow encountered in scalar multiply at line 248:

inter_area = inter_width * inter_height

Root cause

The root cause is at lines 237-238:

bbox1 = np.array(bbox1)
bbox2 = np.array(bbox2)

np.array() without an explicit dtype preserves the input's original type. When bounding boxes come from detection models as int32 arrays, all subsequent arithmetic stays in int32, whose maximum is 2,147,483,647 (~2.1 billion). The product of two large bounding box dimensions (e.g., 50000 × 50000 = 2.5 billion) exceeds this limit, so the result wraps around (here to -1,794,967,296) and yields an incorrect overlap ratio.
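The wraparound described above can be reproduced in isolation. This is a minimal sketch, not PaddleX code: the `dims` array is a hypothetical stand-in for box dimensions that a detector emits as int32.

```python
import warnings

import numpy as np

# Hypothetical width and height, stored as int32 the way a detector might emit them.
dims = np.array([50000, 50000], dtype=np.int32)

with warnings.catch_warnings():
    # Silence "overflow encountered in scalar multiply" for the demo only.
    warnings.simplefilter("ignore", RuntimeWarning)
    wrapped = dims[0] * dims[1]  # int32 * int32 stays int32 and wraps around

# Casting to float64 first gives the true product.
exact = np.float64(dims[0]) * np.float64(dims[1])

print(wrapped.dtype, int(wrapped))
print(exact)
```

The first value is negative because the true product, 2.5 billion, does not fit in a signed 32-bit integer.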

Fix

Two changes in calculate_overlap_ratio:

  1. Cast inputs to float64, which prevents overflow in all downstream arithmetic:
bbox1 = np.array(bbox1, dtype=np.float64)
bbox2 = np.array(bbox2, dtype=np.float64)
  2. Use np.multiply with an explicit dtype as a belt-and-suspenders guard on the exact line that overflows:
inter_area = np.multiply(inter_width, inter_height, dtype=np.float64)
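The two changes can be seen working together in a simplified sketch. This is not the PaddleX implementation: `overlap_ratio_sketch` is a hypothetical helper that assumes `[x1, y1, x2, y2]` boxes and returns intersection over union, whereas the real function may compute its ratio differently.

```python
import numpy as np


def overlap_ratio_sketch(bbox1, bbox2):
    """Intersection-over-union for two [x1, y1, x2, y2] boxes (illustrative only)."""
    # Change 1: cast up front so every later subtraction and product is float64.
    b1 = np.array(bbox1, dtype=np.float64)
    b2 = np.array(bbox2, dtype=np.float64)

    inter_width = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    inter_height = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    # Change 2: explicit dtype on the multiplication that used to overflow.
    inter_area = np.multiply(inter_width, inter_height, dtype=np.float64)

    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    union = area1 + area2 - inter_area
    return float(inter_area / union) if union > 0 else 0.0


# int32 inputs whose areas exceed the int32 range no longer overflow.
big = np.array([0, 0, 60000, 60000], dtype=np.int32)
half = np.array([0, 0, 60000, 30000], dtype=np.int32)
print(overlap_ratio_sketch(big, half))
```

With the cast in place, the explicit `dtype` on `np.multiply` is redundant, which matches the belt-and-suspenders description in the list.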

Why float64?

  • float64 is numpy's default float type, supports values up to ~1.8×10³⁰⁸, and represents every integer up to 2⁵³ exactly, far beyond any realistic pixel area
  • The function returns a floating-point ratio (0.0–1.0), and calculate_bbox_area already uses float internally — float64 keeps all arithmetic in one consistent type
  • int64 would also prevent the overflow, but the intermediate values would be implicitly upcast to float at the division step anyway
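The points above can be checked directly. This is a small sketch with hypothetical values: the area and the denominator are chosen only for illustration.

```python
import numpy as np

# int64 holds the product that overflowed int32 ...
inter = np.int64(50000) * np.int64(50000)
# ... but true division upcasts to float64 anyway.
ratio = inter / np.int64(5_000_000_000)

print(type(inter).__name__, type(ratio).__name__)

# float64 range, and the largest power of two below which all integers are exact.
print(np.finfo(np.float64).max)
print(float(2**53) == 2**53)
```

Since the result ends up as float64 either way, starting in float64 avoids a mid-function type change.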

Impact

This function is called by several other functions in the same module (_get_minbox_if_overlap_by_ratio, remove_overlap_blocks, shrink_supplement_region_bbox) and is imported directly by xycut_enhanced/xycuts.py via from ..utils import calculate_overlap_ratio. The fix is fully backward-compatible — the function signature, behavior, and return type are unchanged.

Bobholamovic and others added 22 commits January 29, 2026 12:48
…addlePaddle#4961)

* bugfix: unexpected change of the constant IMAGE_LABELS

* update doc
Co-authored-by: duqiemng <1640472053@qq.com>
* vllm 0.10.2 needs transformers 4.x

* update
* fix(doc_vlm): cancel pending futures on batch request failure

When a batch of requests is sent to the VLM service and one fails,
the remaining pending futures are now properly cancelled to avoid
wasting VLM service resources.

* chore: remove test file and documentation for async cancellation fix
…ePaddle#4996)

* Use cache mount for genai docker (PaddlePaddle#4954)

* Fix HPS order bug (PaddlePaddle#4955)

* Fix transformers version (PaddlePaddle#4956)

* Fix HPS and remove scipy from required deps (PaddlePaddle#4957)

* [Cherry-Pick]bugfix: unexpected change of the constant IMAGE_LABELS (PaddlePaddle#4961)

* bugfix: unexpected change of the constant IMAGE_LABELS

* update doc

* [METAX] add ppdoclayv3 to METAX_GPU_WHITELIST (PaddlePaddle#4959)



* vllm 0.10.2 needs transformers 4.x (PaddlePaddle#4963)

* vllm 0.10.2 needs transformers 4.x

* update

* Bump version to 3.4.1

* Support setting PDF rendering scale factor (PaddlePaddle#4967)

* Fix/doc vlm async cancellation (PaddlePaddle#4969) (PaddlePaddle#4971)

* fix(doc_vlm): cancel pending futures on batch request failure

When a batch of requests is sent to the VLM service and one fails,
the remaining pending futures are now properly cancelled to avoid
wasting VLM service resources.

* chore: remove test file and documentation for async cancellation fix

* Fix typo (PaddlePaddle#4982)

* Revert "Fix typo (PaddlePaddle#4982)"

This reverts commit 0a936ba.

* feat(ROCm): Add ROCm 7.0 compatibility patches

* version

---------

Co-authored-by: Lin Manhui <bob1998425@hotmail.com>
Co-authored-by: changdazhou <142379845+changdazhou@users.noreply.github.com>
Co-authored-by: SuperNova <91192235+handsomecoderyang@users.noreply.github.com>
Co-authored-by: duqiemng <1640472053@qq.com>
Co-authored-by: zhang-prog <69562787+zhang-prog@users.noreply.github.com>
Co-authored-by: Bobholamovic <mhlin425@whu.edu.cn>
Co-authored-by: Bvicii <98971614+scyyh11@users.noreply.github.com>
* Support setting expiration for BOS URLs

* Fix docs

* Fix bugs
…t paddleocr

@paddle-bot

paddle-bot bot commented Feb 20, 2026

Thanks for your contribution!

@CLAassistant

CLAassistant commented Feb 20, 2026

CLA assistant check
All committers have signed the CLA.

@paddle-bot paddle-bot bot added the contributor External developers label Feb 20, 2026
@albcunha
Author

This bug occurs in PaddleOCR when it calls into PaddleX.

@albcunha
Author

Monkey patch to use while the PR is under review:

from __future__ import annotations

import importlib.util  # plain `import importlib` does not guarantee that importlib.util is loaded
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


# ===========================================================================
# Pre-import patches — must run BEFORE `from paddleocr import PPStructureV3`
# because that import transitively loads all paddlex submodules into memory.
# ===========================================================================

def _get_paddlex_dir() -> Path:
    spec = importlib.util.find_spec("paddlex")
    if spec and spec.submodule_search_locations:
        return Path(next(iter(spec.submodule_search_locations))).resolve()
    raise RuntimeError("Could not locate the paddlex package on sys.path")


_paddlex_dir = _get_paddlex_dir()

# --- Patch: fix integer overflow in calculate_overlap_ratio ---------------
# The original uses np.array(bbox) which preserves int dtypes; multiplying
# large int32/int64 widths × heights overflows at utils.py:248.
# We patch the source file directly because other modules (e.g. xycuts.py)
# import with `from ..utils import calculate_overlap_ratio`, binding to the
# function object — runtime monkey patching can't reach those references.
_layout_utils_path = (
    _paddlex_dir / "inference" / "pipelines" / "layout_parsing" / "utils.py"
)

_OVERFLOW_PATCHES = [
    (
        "    bbox1 = np.array(bbox1)\n    bbox2 = np.array(bbox2)",
        "    bbox1 = np.array(bbox1, dtype=np.float64)\n    bbox2 = np.array(bbox2, dtype=np.float64)",
    ),
    (
        "    inter_area = inter_width * inter_height",
        "    print('--------MONKEYPATCH!!!--------')\n    inter_area = np.multiply(inter_width, inter_height, dtype=np.float64)",
    ),
]

try:
    _utils_src = _layout_utils_path.read_text(encoding="utf-8")
    _patched = False
    for _orig, _fix in _OVERFLOW_PATCHES:
        if _orig in _utils_src:
            _utils_src = _utils_src.replace(_orig, _fix)
            _patched = True
    if _patched:
        _layout_utils_path.write_text(_utils_src, encoding="utf-8")
        # Remove stale bytecode so Python recompiles from the patched source
        _pyc_cache = _layout_utils_path.parent / "__pycache__"
        if _pyc_cache.is_dir():
            for _pyc in _pyc_cache.glob("utils.*.pyc"):
                _pyc.unlink(missing_ok=True)
        logger.info(
            "Patched PaddleX utils.py: calculate_overlap_ratio now uses float64 "
            "(int overflow fix)"
        )
    elif all(_fix in _utils_src for _, _fix in _OVERFLOW_PATCHES):
        logger.debug("PaddleX utils.py already patched (float64 overflow fix)")
except Exception as e:
    logger.warning(f"Failed to patch PaddleX utils.py for overflow fix: {e}")

# ===========================================================================
# Now safe to import paddleocr — patched files will be compiled fresh.
# ===========================================================================
from paddleocr import PPStructureV3  # noqa: E402
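The reason the patch above edits the source file, rather than reassigning the function at runtime, can be shown with a self-contained sketch. The modules `demo_utils` and `demo_consumer` are hypothetical stand-ins for `utils.py` and a module such as `xycuts.py` that uses `from ..utils import calculate_overlap_ratio`.

```python
import sys
import types

# Hypothetical stand-in for the module that defines the function.
utils = types.ModuleType("demo_utils")
exec("def ratio():\n    return 'original'", utils.__dict__)
sys.modules["demo_utils"] = utils

# Hypothetical consumer that binds the function object at import time.
consumer = types.ModuleType("demo_consumer")
exec(
    "from demo_utils import ratio\n"
    "def run():\n"
    "    return ratio()",
    consumer.__dict__,
)

# Runtime monkey patch on the defining module only.
utils.ratio = lambda: "patched"

print(utils.ratio())   # attribute lookup on the module sees the replacement
print(consumer.run())  # the consumer still holds the function it imported
```

Reaching the consumer at runtime would mean patching every importing module's namespace as well, which is why rewriting the file before the first import is the simpler workaround here.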

@luotao1
Collaborator

luotao1 commented Feb 26, 2026

Please submit the PR to the develop branch first.

@albcunha albcunha changed the base branch from release/3.4 to develop February 27, 2026 00:44
@albcunha albcunha closed this Feb 27, 2026
