
feat: add v2.2.0 sync infrastructure, CI/CD, and comprehensive tests#16

Merged

hyunhee-jo merged 1 commit into main from feat/v2.2.0-sync-infra on Apr 3, 2026
Conversation


@hyunhee-jo hyunhee-jo commented Apr 2, 2026

Objective

When opendataloader-pdf releases a new version (e.g., v2.1.0 adding detect_strikethrough), there is no automated way to sync the new options into the langchain wrapper. Every update requires manual parameter copy-paste, README edits, and ad-hoc testing — error-prone and slow. The project also lacks CI to catch regressions on PRs.

Approach

Build three layers of automation — sync, test, release — so upstream releases flow into the wrapper with minimal manual work.

1. Sync markers + upstream workflow (sync-upstream.yml)
Insert BEGIN/END SYNCED PARAMS markers in document_loaders.py and README so the upstream code generator (generate-langchain.mjs) can patch parameters without touching surrounding code. The workflow triggers on repository_dispatch or manual input, generates code, runs tests, and creates a PR (draft on test failure, ready on success). Breaking changes (removed params) are auto-detected and labeled.
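The marker-patching idea can be sketched in a few lines of Python. Note this is an illustrative sketch only: the real generator is generate-langchain.mjs (Node), and the exact marker comment syntax and parameter names below are hypothetical.

```python
import re

def patch_between_markers(source: str, name: str, new_body: str) -> str:
    """Replace everything between a BEGIN/END marker pair, leaving surrounding code untouched."""
    pattern = re.compile(
        rf"(# BEGIN SYNCED {name}\n).*?(# END SYNCED {name})",
        re.DOTALL,
    )
    # A callable replacement avoids backreference-escaping issues in new_body.
    return pattern.sub(lambda m: m.group(1) + new_body + "\n" + m.group(2), source)

original = (
    "# BEGIN SYNCED PARAMS\n"
    "old_param: bool = False,\n"
    "# END SYNCED PARAMS\n"
    "split_pages: bool = False,  # LangChain-specific, outside the synced region\n"
)
patched = patch_between_markers(original, "PARAMS", "new_param: bool = False,")
```

Because only the region between the markers is rewritten, LangChain-specific code such as split_pages survives every sync untouched.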

2. PR + full test workflows (test.yml, test-full.yml)
Every PR runs unit tests across Python 3.10/3.12/3.13 with --disable-socket network isolation, plus a minimum dependency test (langchain-core 1.0.0 + opendataloader-pdf 2.1.0). Manual full tests cover 3 OS × 2 Python with Java integration.

3. Comprehensive test expansion (34 new tests → 117 total)
Unit: Path objects, detect_strikethrough, JSON edge cases, format validation, temp cleanup.
Integration: replace_invalid_chars, content_safety_off, image options, directory loading.
Regression: snapshot comparison (text/markdown/json), metadata consistency, instance isolation.
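The snapshot-comparison pattern behind the regression suite can be sketched as follows. The helper name, file name, and update flag here are illustrative, not the repository's actual test utilities.

```python
import tempfile
from pathlib import Path

def assert_matches_snapshot(actual: str, snapshot_path: Path, update: bool = False) -> None:
    """Write the snapshot on first run (or when updating); otherwise compare byte-for-byte."""
    if update or not snapshot_path.exists():
        snapshot_path.write_text(actual, encoding="utf-8")
        return
    expected = snapshot_path.read_text(encoding="utf-8")
    assert actual == expected, f"snapshot mismatch for {snapshot_path.name}"

with tempfile.TemporaryDirectory() as d:
    snap = Path(d) / "lorem_text.txt"
    assert_matches_snapshot("Lorem ipsum\n", snap)  # first run: snapshot written
    assert_matches_snapshot("Lorem ipsum\n", snap)  # second run: comparison passes
    try:
        assert_matches_snapshot("Changed output\n", snap)
        mismatch_detected = False
    except AssertionError:
        mismatch_detected = True
```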

Security hardening:

  • All run: blocks use shell $VERSION expansion instead of ${{ env.VERSION }} interpolation, preventing command injection from repository_dispatch payloads
  • Cross-job outputs (needs.*) mapped to env vars before shell evaluation
  • Top-level permissions: {} with per-job overrides
  • Slack webhook guarded via env intermediary (secrets not accessible in if: conditions)
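The $VERSION-over-${{ env.VERSION }} point can be demonstrated outside GitHub Actions with a small Python/shell sketch (the payload string is hypothetical, and this assumes a POSIX /bin/sh):

```python
import os
import subprocess

# Hypothetical hostile version string arriving in a repository_dispatch payload.
payload = '"; echo INJECTED #'

# Unsafe: interpolating untrusted input into the command text, analogous to
# ${{ env.VERSION }} expansion inside a run: block — the payload becomes shell code.
unsafe = subprocess.run(
    f'echo "version: {payload}"', shell=True, capture_output=True, text=True
)

# Safer: hand the value to the shell via the environment. $VERSION is expanded
# as data by the shell, never parsed as command text.
safe = subprocess.run(
    'echo "version: $VERSION"', shell=True, capture_output=True, text=True,
    env={**os.environ, "VERSION": payload},
)
```

In the unsafe variant the injected `echo INJECTED` runs as its own command; in the safe variant the same payload appears only as literal output.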

Additional changes:

  • Add detect_strikethrough parameter (new in opendataloader-pdf v2.1.0)
  • Bump minimum dependency to opendataloader-pdf>=2.1.0
  • Fix hybrid_timeout docstring default: "30000" → "0"
  • Fix README AI-AGENT-SUMMARY license: MIT → Apache-2.0
  • Add CHANGELOG.md and pre-release checklist design doc

Evidence

Loaded a real PDF (samples/pdf/lorem.pdf) with the updated wrapper under user-identical conditions:

| Scenario | Expected | Actual |
| --- | --- | --- |
| detect_strikethrough=True, format=markdown | Load succeeds, no error | 1 doc, 462 chars — OK |
| detect_strikethrough omitted (default) | Load succeeds, backward compatible | 1 doc, 462 chars — OK |
| split_pages=True (keyword arg, moved position) | Per-page documents | 1 page doc — OK |
| split_pages=False (keyword arg) | Single whole document | 1 whole doc — OK |
| format="xml" (invalid) | ValueError raised | ValueError: Invalid format 'xml'. Valid options are: 'json', 'text', 'html', 'markdown' |
| opendataloader-pdf 2.2.0 convert() signature | detect_strikethrough param exists | True |

Cannot verify sync-upstream.yml end-to-end — requires an actual upstream release to trigger repository_dispatch. Workflow syntax and job outputs validated via CI dry-run.

🤖 Generated with Claude Code

@hyunhee-jo changed the title from "feat: v2.2.0 동기화 인프라, CI/CD, 테스트 전면 보강" (Korean: "feat: v2.2.0 sync infrastructure, CI/CD, comprehensive test reinforcement") to "feat: add v2.2.0 sync infrastructure, CI/CD, and comprehensive tests" on Apr 2, 2026

claude bot commented Apr 2, 2026

PR Review: feat: v2.2.0 sync infrastructure, CI/CD, comprehensive test reinforcement (original Korean title: "feat: v2.2.0 동기화 인프라, CI/CD, 테스트 전면 보강")

Overall this is a well-structured PR that adds meaningful automation infrastructure and test coverage. Here are my observations:


Bugs / Issues

1. Optional[str] without null guard — document_loaders.py

format and image_output were changed from str to Optional[str], but no null handling was added. If a caller passes format=None, this line will crash at runtime:

self.format = format.lower()  # AttributeError: 'NoneType' object has no attribute 'lower'

Either revert to str (since None is not a valid value), or add a guard:

self.format = format.lower() if format else "text"

2. README format default inconsistency

The synced table shows format with description "Default: json" but the actual default in code is "text". This appears to have been auto-generated from upstream options.json where the default differs. Since this is a wrapper library, the LangChain-specific default should take precedence in the docs:

-| `format` | `str` | `text` | Output formats (comma-separated). Values: json, text, html, pdf, markdown, markdown-with-html, markdown-with-images. Default: json |
+| `format` | `str` | `"text"` | Output format. Values: json, text, html, pdf, markdown, markdown-with-html, markdown-with-images. Default: text |

Test Issues

3. Contradictory test comment in test_split_pages_no_separators

def test_split_pages_no_separators(self):
    """Content with no separators should yield nothing (no preamble match)."""
    ...
    assert len(docs) == 1  # ← contradicts the docstring

The docstring says "yield nothing" but asserts 1 document. Please align the docstring with the actual expected behavior.

4. test_invalid_format_raises_on_load — relies on undocumented behavior

def test_invalid_format_raises_on_load(self):
    loader = OpenDataLoaderPDFLoader(file_path="test.pdf", format="xml")
    with pytest.raises(ValueError, match="Invalid format"):
        list(loader.lazy_load())

This test asserts a ValueError is raised, but the validation logic isn't visible in the diff. If format validation lives in the upstream opendataloader_pdf.convert() call (which is mocked in unit tests), this test may not actually exercise the real code path. Consider checking where format validation is enforced.
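If the project decides validation should live in the wrapper rather than upstream, one explicit approach looks like the following. This is a hypothetical helper matching the error-message shape seen in the evidence table, not the loader's actual code.

```python
VALID_FORMATS = ("json", "text", "html", "markdown")

def validate_format(fmt: str) -> str:
    """Fail fast in the wrapper so the error path does not depend on the (mocked) upstream call."""
    normalized = fmt.lower()
    if normalized not in VALID_FORMATS:
        options = ", ".join(repr(f) for f in VALID_FORMATS)
        raise ValueError(f"Invalid format {fmt!r}. Valid options are: {options}")
    return normalized

try:
    validate_format("xml")
    raised = False
except ValueError as exc:
    raised = "Invalid format" in str(exc)
```

Validating in `__init__` (or at the top of `lazy_load()`) would make the unit test exercise real code even when `opendataloader_pdf.convert()` is mocked.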


CI/CD Concerns

5. sync-upstream.yml: Shell string injection risk in PR body

The PR body is constructed by concatenating shell variables and $(git diff ...) output, then passed directly to gh pr create --body "$BODY". If git diff --stat output contains special characters (e.g., $, backticks, quotes), the shell expansion could break or produce unexpected results. Consider writing to a file instead:

cat > /tmp/pr-body.md << 'EOF_BODY'
...
EOF_BODY
# Append dynamic content separately
git diff HEAD~1 --stat >> /tmp/pr-body.md
gh pr create --body-file /tmp/pr-body.md ...

6. sync-upstream.yml: create-pr job may fail if generate job failed

The create-pr job has if: needs.generate.outputs.changed == 'true' && always(). The always() makes the job run even when test fails (to create a draft PR), but if the generate job itself failed before pushing the branch, needs.generate.outputs.branch will be empty and the checkout step will fail.

Consider adding a guard:

if: needs.generate.result == 'success' && needs.generate.outputs.changed == 'true'

7. sync-upstream.yml: No retry on PyPI availability

The PyPI poll loop exits with ::warning:: after 10 minutes but still exits with exit 0 — meaning the workflow continues and installs whatever version is available (possibly an older one). Consider using exit 1 to fail loudly when the expected version isn't available.
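The fail-loudly behavior suggested here can be sketched with an injectable availability check. The real workflow polls via shell and pip; the function and parameter names below are illustrative, and the fake index stands in for a network call.

```python
import time
from typing import Callable

def wait_for_pypi(version: str, is_available: Callable[[str], bool],
                  attempts: int = 5, delay: float = 0.0) -> None:
    """Poll until the exact version is available; raise loudly instead of continuing with exit 0."""
    for _ in range(attempts):
        if is_available(version):
            return
        time.sleep(delay)
    raise RuntimeError(
        f"opendataloader-pdf=={version} not available after {attempts} attempts"
    )

# Simulated index: the version shows up on the third poll.
calls = {"n": 0}
def fake_index(version: str) -> bool:
    calls["n"] += 1
    return calls["n"] >= 3

wait_for_pypi("2.2.0", fake_index)  # returns once the version is seen

try:
    wait_for_pypi("9.9.9", lambda v: False, attempts=2)
    failed_loudly = False
except RuntimeError:
    failed_loudly = True
```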


Minor Observations

  • split_pages parameter moved to the end of __init__ — This is a technically non-breaking change (keyword-only usage is the norm), but worth noting in the changelog for users who may rely on positional arguments.
  • Snapshot files lack trailing newline (lorem_json.json, lorem_markdown.md, lorem_text.txt) — standard practice is to end text files with a newline.
  • test_temp_files_deleted_after_read — The test places a file directly in the tmpdir root as test.txt, but in production the code reads files matching the PDF filename. The test works because the mock bypasses real conversion, but the setup is fragile. A comment explaining this assumption would help.
  • docs/plans/ directory — Design documents committed to the repo are fine, but consider whether docs/plans/ belongs in the main branch or if it's better tracked elsewhere (e.g., GitHub wiki or issues).

What's Done Well

  • The sync marker pattern (BEGIN/END SYNCED PARAMS/ASSIGNMENTS/CONVERT KWARGS) is an elegant solution for keeping this wrapper in sync with the upstream library automatically.
  • Moving split_pages and page separators outside the synced regions is the right call — these are LangChain-specific concerns.
  • --disable-socket network isolation in unit tests is excellent practice.
  • The min-dep-test job in test.yml is a valuable addition that catches version compatibility regressions early.
  • Test coverage for edge cases (empty kids, unsorted pages, Path objects) is solid.

🤖 Generated with Claude Code

@hyunhee-jo force-pushed the feat/v2.2.0-sync-infra branch from 9807e30 to a66ed7a on April 2, 2026 at 03:37

coderabbitai bot commented Apr 2, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds three GitHub Actions workflows (upstream sync with generate→test→create-pr, per-PR unit/min-dep tests, full multi-OS manual tests), introduces changelog and a pre-release design doc, adds detect_strikethrough to the PDF loader and updates deps, and expands unit, integration, and snapshot/regression tests plus new test utilities and snapshots.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **CI Workflows**<br>`.github/workflows/sync-upstream.yml`, `.github/workflows/test.yml`, `.github/workflows/test-full.yml` | Add sync-upstream workflow (generate → test → create-pr with PyPI polling, breaking-change detection, branch creation, Slack notify), per-PR test workflow (unit matrix + min-dep test), and full manual multi-OS test workflow (Java install, matrix, Slack notify). |
| **Docs & Release**<br>`CHANGELOG.md`, `README.md`, `docs/plans/2026-04-02-pre-release-checklist-design.md` | Add CHANGELOG, change README license and parameter table (defaults/descriptions including detect_strikethrough), and add pre-release/sync design doc describing CI/sync behavior and policies. |
| **Library Code**<br>`langchain_opendataloader_pdf/document_loaders.py`, `pyproject.toml` | Add `detect_strikethrough: bool = False` to `OpenDataLoaderPDFLoader.__init__` and forward to converter; reorder some init params and update hybrid_timeout doc default; bump opendataloader-pdf min to >=2.1.0; add pytest-socket>=0.7.0 to dev deps. |
| **Unit Tests**<br>`tests/test_document_loaders.py` | Large expansion: Path handling tests, detect_strikethrough coverage, format validation, splitting edge cases, lazy_load format errors, temp-file cleanup, and many new unit cases. |
| **Integration Tests & Fixtures**<br>`tests/test_integration.py`, `tests/conftest.py` | Refactor Java detection to `tests.conftest.java_available()`, add integration tests for new options (images, replace_invalid_chars, content_safety_off, detect_strikethrough) and path polymorphism. |
| **Regression & Snapshots**<br>`tests/test_regression.py`, `tests/snapshots/lorem_text.txt`, `tests/snapshots/lorem_markdown.md`, `tests/snapshots/lorem_json.json` | Add Java-gated snapshot regression suite with autouse snapshot-dir fixture, snapshot write/update behavior, JSON-structure checks, metadata assertions, sequential loader concurrency test, and three new snapshot files. |
| **Misc test infra**<br>`tests/...` (new/modified test modules) | Broaden pytest usage (socket isolation), add snapshot helpers and many new/modified test cases—review test flags and snapshot update logic. |

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant GitHub as GitHub Actions
    participant Repo as Repository
    participant PyPI as PyPI
    participant Slack as Slack

    User->>GitHub: Trigger sync (repository_dispatch / workflow_dispatch) with version
    GitHub->>Repo: generate — checkout repo & upstream tag, run generator, detect diffs & breaking changes
    alt Changes detected
        Repo->>Repo: Create branch `sync/v${VERSION}` and push
        GitHub->>Repo: test — checkout branch, run tests matrix
        GitHub->>PyPI: Poll for `opendataloader-pdf==${VERSION}`
        PyPI-->>GitHub: Version available
        GitHub->>Repo: Run pytest matrix (3.10,3.13) with socket isolation
        alt Tests pass
            GitHub->>Repo: create-pr — draft=false
        else Tests fail
            GitHub->>Repo: create-pr — draft=true
        end
        Repo->>GitHub: Create PR (label `breaking-change` if applicable)
        GitHub->>Slack: Send notification (status, breaking-change info, run link)
    else No changes
        GitHub->>Slack: Send notification (no changes)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding v2.2.0 sync infrastructure, CI/CD workflows, and comprehensive tests to match the substantial changeset across automation, testing, and documentation. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |




@claude
Copy link
Copy Markdown

claude bot commented Apr 2, 2026

test pipe content


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

♻️ Duplicate comments (1)
.github/workflows/test-full.yml (1)

1-49: ⚠️ Potential issue | 🟠 Major

Add explicit permissions to scope GITHUB_TOKEN access.

This workflow currently uses default token permissions. Define least-privilege permissions at workflow level (or per job) to reduce blast radius.

Proposed fix
 name: Full Test (Multi-platform)
 
 on:
   workflow_dispatch:
+
+permissions:
+  contents: read
 
 jobs:
   full-test:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test-full.yml around lines 1 - 49, Add an explicit
permissions block to scope GITHUB_TOKEN to least privilege (e.g., at top-level
of the workflow) instead of using defaults; update the workflow to include a
permissions section such as "permissions: contents: read" (optionally also
"actions: read" if any action queries workflows) so the jobs "full-test" and
"notify-slack" run with limited token rights rather than full default
permissions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/sync-upstream.yml:
- Around line 182-191: The Determine PR status step currently only checks
needs.test.result == "success" and treats any other value (including "skipped")
as failed; update the shell logic in that step to explicitly handle the
"skipped" state from needs.test.result (or check needs.generate.outputs.changed
to infer that tests were intentionally skipped) — either treat "skipped" the
same as "success" (set draft=false and set result to a message like "✅ All tests
passed or skipped") or add a separate branch for "skipped" that sets the desired
draft flag and a clear result message; modify the condition in the Determine PR
status step (referencing needs.test.result and optionally
needs.generate.outputs.changed) accordingly so skipped tests are handled
correctly and the result message reflects that decision.

In `@docs/plans/2026-04-02-pre-release-checklist-design.md`:
- Around line 91-95: Update the design doc to match the workflow implementation
by adding pytest-socket to the installation command; specifically edit the pip
install line that currently reads pip install "langchain-core==1.0.0"
"opendataloader-pdf==2.0.0" pytest so it includes pytest-socket (matching the
.github/workflows/test.yml), ensuring the two documents stay consistent.

In `@langchain_opendataloader_pdf/document_loaders.py`:
- Around line 69-70: Update the hybrid_timeout parameter to use a string default
of "0" and a concrete str type so runtime behavior matches the docs: change the
function/class signature's hybrid_timeout annotation from Optional[str] = None
to hybrid_timeout: str = "0", update any internal assignment that sets or
normalizes hybrid_timeout (the assignment around the previous default) to treat
it as a string, and ensure the value passed to opendataloader_pdf.convert(...)
is forwarded as this string (not None). Adjust the inline docstring/example text
if needed so the documented default ("0" meaning no timeout) matches the
implemented default.

In `@README.md`:
- Around line 238-260: The table shows inconsistent quoting for default string
values; standardize all string defaults to use consistent quoting (either quote
all string defaults or none). Update the default cells for parameters like
`format` (text), `reading_order` (xycut), `image_format` (png), `hybrid` (off),
`hybrid_mode` (auto) and any other string defaults so they follow the chosen
style across the entire synced params table (e.g., change text → "text", xycut →
"xycut", png → "png" or remove quotes everywhere), keeping the parameter names
(e.g., `format`, `reading_order`, `image_format`, `hybrid`, `hybrid_mode`)
unchanged.

In `@tests/snapshots/lorem_markdown.md`:
- Line 3: The snapshot ending with the long "Lorem ipsum..." paragraph must have
exactly one trailing newline at EOF to satisfy markdownlint MD047; open the file
containing that paragraph and ensure the file ends with a single '\n' (remove
any extra blank lines or missing newline), so the final byte is a single newline
character after the paragraph.

In `@tests/test_integration.py`:
- Around line 323-334: The test test_load_with_directory_path currently asserts
only ">= 1"; change it to compute the expected number of PDF files in
SAMPLES_DIR and assert equality with the loaded documents length: after creating
OpenDataLoaderPDFLoader (with split_pages=False) call loader.load(), compute
expected_count by listing files in SAMPLES_DIR that match the PDF pattern (e.g.
using pathlib.Path(SAMPLES_DIR).glob("*.pdf")) and replace the assertion with
assert len(documents) == expected_count so the test verifies directory-loading
loads exactly all PDFs.

In `@tests/test_regression.py`:
- Around line 140-160: Rename the test function
test_parallel_loaders_no_conflict (in class TestConcurrency) to
test_multiple_loaders_no_conflict and update its docstring to state it verifies
instance-level isolation (not concurrent execution), or alternatively update
only the docstring to clarify it tests separate loader instances using
OpenDataLoaderPDFLoader and their load() results are independent; keep the same
assertions and use of load() unchanged.
- Around line 18-23: In the java_available() function, explicitly set
check=False on the subprocess.run(...) call so the invocation does not raise
CalledProcessError and your manual returncode check remains the authority;
update the subprocess.run call in java_available to include the check=False
argument while keeping capture_output=True and text=True.

---

Duplicate comments:
In @.github/workflows/test-full.yml:
- Around line 1-49: Add an explicit permissions block to scope GITHUB_TOKEN to
least privilege (e.g., at top-level of the workflow) instead of using defaults;
update the workflow to include a permissions section such as "permissions:
contents: read" (optionally also "actions: read" if any action queries
workflows) so the jobs "full-test" and "notify-slack" run with limited token
rights rather than full default permissions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 88436707-70e6-4793-ab44-e0ca4565b7d5

📥 Commits

Reviewing files that changed from the base of the PR and between 3fc399b and 8467f67.

📒 Files selected for processing (14)
  • .github/workflows/sync-upstream.yml
  • .github/workflows/test-full.yml
  • .github/workflows/test.yml
  • CHANGELOG.md
  • README.md
  • docs/plans/2026-04-02-pre-release-checklist-design.md
  • langchain_opendataloader_pdf/document_loaders.py
  • pyproject.toml
  • tests/snapshots/lorem_json.json
  • tests/snapshots/lorem_markdown.md
  • tests/snapshots/lorem_text.txt
  • tests/test_document_loaders.py
  • tests/test_integration.py
  • tests/test_regression.py

Comment on lines 69 to 70
hybrid_timeout: Optional[str] = None,
hybrid_fallback: bool = False,

@coderabbitai coderabbitai bot Apr 2, 2026


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

python -m pip install --quiet "opendataloader-pdf>=2.0.0"

python - <<'PY'
import inspect
import opendataloader_pdf

sig = inspect.signature(opendataloader_pdf.convert)
p = sig.parameters.get("hybrid_timeout")
print("convert signature:", sig)
print("hybrid_timeout default:", p.default if p else "<missing>")
print("hybrid_timeout annotation:", p.annotation if p else "<missing>")
PY

Repository: opendataloader-project/langchain-opendataloader-pdf

Length of output: 1439


🏁 Script executed:

cat -n langchain_opendataloader_pdf/document_loaders.py | head -150

Repository: opendataloader-project/langchain-opendataloader-pdf

Length of output: 7802


🏁 Script executed:

cat -n langchain_opendataloader_pdf/document_loaders.py | sed -n '270,280p'

Repository: opendataloader-project/langchain-opendataloader-pdf

Length of output: 715


Change hybrid_timeout default from None to "0" to match documented behavior.

Line 69 defaults to None while the documentation at line 114 explicitly states the default should be "0" (no timeout). This value is forwarded unchanged to opendataloader_pdf.convert(...) at line 276, causing the actual default behavior to diverge from the documented contract. Update the parameter default and type annotation to hybrid_timeout: str = "0" to align code with documentation.

Proposed fix
-        hybrid_timeout: Optional[str] = None,
+        hybrid_timeout: str = "0",

Also applies to: lines 114 (documentation), 142 (assignment), 273–277 (forwarding)

🧰 Tools
🪛 Ruff (0.15.7)

[warning] 70-70: Boolean-typed positional argument in function definition

(FBT001)


[warning] 70-70: Boolean default positional argument in function definition

(FBT002)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@langchain_opendataloader_pdf/document_loaders.py` around lines 69 - 70,
Update the hybrid_timeout parameter to use a string default of "0" and a
concrete str type so runtime behavior matches the docs: change the
function/class signature's hybrid_timeout annotation from Optional[str] = None
to hybrid_timeout: str = "0", update any internal assignment that sets or
normalizes hybrid_timeout (the assignment around the previous default) to treat
it as a string, and ensure the value passed to opendataloader_pdf.convert(...)
is forwarded as this string (not None). Adjust the inline docstring/example text
if needed so the documented default ("0" meaning no timeout) matches the
implemented default.

@hyunhee-jo (Collaborator, Author) replied:

Fixed — hybrid_timeout docstring updated to "0" (no timeout). Also bumped min dep to >=2.1.0.

coderabbitai bot replied:

@hyunhee-jo, thanks for updating the docstring and bumping the minimum dependency. However, the core concern is still present in the current file: the actual Python default value for hybrid_timeout remains None (Line 69), while the docstring now documents it as "0". This means callers who don't pass hybrid_timeout will still have None forwarded to opendataloader_pdf.convert(...), not "0".

To fully close the gap, the parameter signature itself still needs to change:

-        hybrid_timeout: Optional[str] = None,
+        hybrid_timeout: str = "0",

If the upstream opendataloader_pdf.convert() treats None and "0" identically (i.e., both mean "no timeout"), then keeping None is functionally safe but still semantically inconsistent with the documented contract. Could you confirm whether that's the case? If so, the current state is acceptable as a doc-only fix, but it would still be cleaner to align the code default with the docs.

@hyunhee-jo (Collaborator, Author) replied:

Fixed — docstring updated to "0" (no timeout). Minimum dep bumped to >=2.1.0.

coderabbitai bot replied:

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

pip install --quiet "opendataloader-pdf>=2.1.0" 2>/dev/null

python - <<'PY'
import inspect
import opendataloader_pdf

sig = inspect.signature(opendataloader_pdf.convert)
p = sig.parameters.get("hybrid_timeout")
print("hybrid_timeout default:", repr(p.default) if p else "<missing>")
print("hybrid_timeout annotation:", p.annotation if p else "<missing>")
PY

# Also try to find source handling of hybrid_timeout=None
python - <<'PY'
import opendataloader_pdf, inspect, pathlib, ast, sys

src_file = pathlib.Path(inspect.getfile(opendataloader_pdf))
print("Source file:", src_file)
src = src_file.read_text(encoding="utf-8")
# Find lines referencing hybrid_timeout
for i, line in enumerate(src.splitlines(), 1):
    if "hybrid_timeout" in line:
        print(f"  L{i}: {line}")
PY

Length of output: 171


@hyunhee-jo, I've now inspected the upstream package directly. The script confirms that opendataloader_pdf.convert() itself declares hybrid_timeout: Optional[str] = None — so passing None from this loader is exactly right and matches the upstream API's own default. The "0" (no timeout) in the docstring correctly describes the effective Java-engine behavior when no value is supplied, not a Python-level literal that needs to be mirrored in the signature.

My original suggestion to change to hybrid_timeout: str = "0" was incorrect — doing so would send an explicit "0" string instead of letting the upstream handle its own default, which could behave subtly differently. No code change is needed here; the current Optional[str] = None default is the right choice.

Apologies for the noise on this one — the docstring update you made is the correct and complete fix.


✏️ Learnings added
Learnt from: hyunhee-jo
Repo: opendataloader-project/langchain-opendataloader-pdf PR: 16
File: langchain_opendataloader_pdf/document_loaders.py:69-70
Timestamp: 2026-04-03T00:22:12.277Z
Learning: In `langchain_opendataloader_pdf/document_loaders.py`, `hybrid_timeout: Optional[str] = None` is intentionally correct. The upstream `opendataloader_pdf.convert()` also declares `hybrid_timeout: Optional[str] = None`. The docstring phrase `"0" (no timeout)` describes the upstream Java engine's effective behavior when `None` is passed — it does NOT mean the Python default should be changed to the string `"0"`. Do not flag this as a mismatch in future reviews.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: hyunhee-jo
Repo: opendataloader-project/langchain-opendataloader-pdf PR: 16
File: langchain_opendataloader_pdf/document_loaders.py:65-73
Timestamp: 2026-04-03T00:14:47.404Z
Learning: In `langchain_opendataloader_pdf/document_loaders.py`, the `OpenDataLoaderPDFLoader.__init__` constructor is documented and contracted as keyword-argument-only usage. All public examples use keyword args, so positional argument ordering (e.g., position of `split_pages` relative to `detect_strikethrough` or hybrid params) is intentionally not part of the public API contract.

Comment on lines +238 to +260
<!-- BEGIN SYNCED PARAMS TABLE -->
| `format` | `str` | `text` | Output format. Values: text, markdown, json, html. Default: text |
| `quiet` | `bool` | `False` | Suppress console logging output |
| `content_safety_off` | `List[str]` | `None` | Disable content safety filters. Values: all, hidden-text, off-page, tiny, hidden-ocg |
| `password` | `str` | `None` | Password for encrypted PDF files |
| `keep_line_breaks` | `bool` | `False` | Preserve original line breaks in extracted text |
| `replace_invalid_chars` | `str` | `" "` | Replacement character for invalid/unrecognized characters. Default: space |
| `use_struct_tree` | `bool` | `False` | Use PDF structure tree (tagged PDF) for reading order and semantic structure |
| `table_method` | `str` | `"default"` | Table detection method. Values: default (border-based), cluster (border + cluster). Default: default |
| `reading_order` | `str` | `"xycut"` | Reading order algorithm. Values: off, xycut. Default: xycut |
| `image_output` | `str` | `off` | Image output mode. Values: off (no images), embedded (Base64), external (file references). Default: off |
| `image_format` | `str` | `"png"` | Output format for extracted images. Values: png, jpeg. Default: png |
| `image_dir` | `str` | `None` | Directory for extracted images |
| `sanitize` | `bool` | `False` | Enable sensitive data sanitization. Replaces emails, phone numbers, IPs, credit cards, and URLs with placeholders |
| `pages` | `str` | `None` | Pages to extract (e.g., "1,3,5-7"). Default: all pages |
| `include_header_footer` | `bool` | `False` | Include page headers and footers in output |
| `content_safety_off` | `List[str]` | `None` | Disable safety filters: `"hidden-text"`, `"off-page"`, `"tiny"`, `"hidden-ocg"`, `"all"` |
| `replace_invalid_chars` | `str` | `None` | Replacement for invalid characters |
| `hybrid` | `str` | `None` | Hybrid AI backend: `"docling-fast"`. Requires running backend server |
| `hybrid_mode` | `str` | `None` | `"auto"` (route complex pages) or `"full"` (route all pages) |
| `hybrid_url` | `str` | `None` | Backend server URL. Default: `http://localhost:5002` |
| `hybrid_timeout` | `str` | `None` | Backend timeout in ms. Default: `"30000"` |
| `hybrid_fallback` | `bool` | `False` | Fall back to Java extraction on backend failure |
| `detect_strikethrough` | `bool` | `False` | Detect strikethrough text and wrap with ~~ in Markdown output (experimental) |
| `hybrid` | `str` | `"off"` | Hybrid backend (requires a running server). Quick start: pip install "opendataloader-pdf[hybrid]" && opendataloader-pdf-hybrid --port 5002. For remote servers use --hybrid-url. Values: off (default), docling-fast |
| `hybrid_mode` | `str` | `"auto"` | Hybrid triage mode. Values: auto (default, dynamic triage), full (skip triage, all pages to backend) |
| `hybrid_url` | `str` | `None` | Hybrid backend server URL (overrides default) |
| `hybrid_timeout` | `str` | `"0"` | Hybrid backend request timeout in milliseconds (0 = no timeout). Default: 0 |
| `hybrid_fallback` | `bool` | `False` | Opt in to Java fallback on hybrid backend error (default: disabled) |
<!-- END SYNCED PARAMS TABLE -->

@coderabbitai coderabbitai bot Apr 2, 2026


🧹 Nitpick | 🔵 Trivial

Minor formatting inconsistency in default values.

Some string defaults are quoted ("default", "xycut", "png") while others are not (text, off). Consider making this consistent for readability.

✏️ Suggested consistency fix
-| `format` | `str` | `text` | Output format. Values: text, markdown, json, html. Default: text |
+| `format` | `str` | `"text"` | Output format. Values: text, markdown, json, html. Default: text |
...
-| `image_output` | `str` | `off` | Image output mode. Values: off (no images), embedded (Base64), external (file references). Default: off |
+| `image_output` | `str` | `"off"` | Image output mode. Values: off (no images), embedded (Base64), external (file references). Default: off |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 238 - 260, The table shows inconsistent quoting for
default string values; standardize all string defaults to use consistent quoting
(either quote all string defaults or none). Update the default cells for
parameters like `format` (text), `reading_order` (xycut), `image_format` (png),
`hybrid` (off), `hybrid_mode` (auto) and any other string defaults so they
follow the chosen style across the entire synced params table (e.g., change text
→ "text", xycut → "xycut", png → "png" or remove quotes everywhere), keeping the
parameter names (e.g., `format`, `reading_order`, `image_format`, `hybrid`,
`hybrid_mode`) unchanged.

Comment on lines +323 to +334
def test_load_with_directory_path(self):
"""Test loading with a directory path."""
loader = OpenDataLoaderPDFLoader(
file_path=str(SAMPLES_DIR),
format="text",
quiet=True,
split_pages=False,
)
documents = loader.load()
# Should load all PDFs in the directory
assert len(documents) >= 1


@coderabbitai coderabbitai bot Apr 2, 2026


🧹 Nitpick | 🔵 Trivial

Strengthen directory-path validation assertion.

Line 333 (>= 1) is too weak for a behavior test that claims directory loading. With split_pages=False, assert the loaded document count matches the number of PDFs in SAMPLES_DIR.

Proposed fix
     def test_load_with_directory_path(self):
         """Test loading with a directory path."""
+        expected_pdf_count = len(list(SAMPLES_DIR.glob("*.pdf")))
+        if expected_pdf_count == 0:
+            pytest.skip("No sample PDFs found in samples directory")
+
         loader = OpenDataLoaderPDFLoader(
             file_path=str(SAMPLES_DIR),
             format="text",
             quiet=True,
             split_pages=False,
         )
         documents = loader.load()
-        # Should load all PDFs in the directory
-        assert len(documents) >= 1
+        assert len(documents) == expected_pdf_count
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def test_load_with_directory_path(self):
"""Test loading with a directory path."""
loader = OpenDataLoaderPDFLoader(
file_path=str(SAMPLES_DIR),
format="text",
quiet=True,
split_pages=False,
)
documents = loader.load()
# Should load all PDFs in the directory
assert len(documents) >= 1
def test_load_with_directory_path(self):
"""Test loading with a directory path."""
expected_pdf_count = len(list(SAMPLES_DIR.glob("*.pdf")))
if expected_pdf_count == 0:
pytest.skip("No sample PDFs found in samples directory")
loader = OpenDataLoaderPDFLoader(
file_path=str(SAMPLES_DIR),
format="text",
quiet=True,
split_pages=False,
)
documents = loader.load()
assert len(documents) == expected_pdf_count
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_integration.py` around lines 323 - 334, The test
test_load_with_directory_path currently asserts only ">= 1"; change it to
compute the expected number of PDF files in SAMPLES_DIR and assert equality with
the loaded documents length: after creating OpenDataLoaderPDFLoader (with
split_pages=False) call loader.load(), compute expected_count by listing files
in SAMPLES_DIR that match the PDF pattern (e.g. using
pathlib.Path(SAMPLES_DIR).glob("*.pdf")) and replace the assertion with assert
len(documents) == expected_count so the test verifies directory-loading loads
exactly all PDFs.

Comment on lines +18 to +23
def java_available() -> bool:
try:
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
return result.returncode == 0
except FileNotFoundError:
return False

@coderabbitai coderabbitai bot Apr 2, 2026


🧹 Nitpick | 🔵 Trivial

Add explicit check=False to subprocess.run.

The check parameter should be explicitly set for clarity. While the current behavior is correct (checking returncode manually), making it explicit improves readability.

✏️ Proposed fix
 def java_available() -> bool:
     try:
-        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
+        result = subprocess.run(["java", "-version"], capture_output=True, text=True, check=False)
         return result.returncode == 0
     except FileNotFoundError:
         return False
🧰 Tools
🪛 Ruff (0.15.7)

[warning] 20-20: subprocess.run without explicit check argument

Add explicit check=False

(PLW1510)


[error] 20-20: Starting a process with a partial executable path

(S607)


[warning] 21-21: Consider moving this statement to an else block

(TRY300)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_regression.py` around lines 18 - 23, In the java_available()
function, explicitly set check=False on the subprocess.run(...) call so the
invocation does not raise CalledProcessError and your manual returncode check
remains the authority; update the subprocess.run call in java_available to
include the check=False argument while keeping capture_output=True and
text=True.

@claude

claude bot commented Apr 2, 2026

Code Review

Overall this is a solid PR. The sync infrastructure, test coverage expansion, and regression snapshots are all valuable additions. A few issues worth addressing before merge:


Bugs

1. Optional[str] on format will crash at runtime if None is passed

format was changed from str = "text" to Optional[str] = "text". But self.format = format.lower() is called unconditionally — passing None raises AttributeError. Either keep str = "text" or add a null guard. Same issue applies to image_output: Optional[str] = "off".

2. Shell injection risk in sync-upstream.yml PR body construction

The PR body is built via shell string concatenation and passed to gh pr create --body BODY_VAR. If env.VERSION contains shell metacharacters (e.g. 2.3.0-beta+1), this can break the command. Prefer writing the body to a temp file and passing --body-file instead.

3. create-pr job logic bug with always()

The condition needs.generate.outputs.changed == 'true' && always() has a subtle issue: always() can override the changed check in certain failure scenarios. Use needs.test.result != 'cancelled' instead of always().


Correctness / Misleading tests

4. test_parallel_loaders_no_conflict does not test parallelism

The test runs two loaders sequentially. Either rename it to test_sequential_loaders_no_conflict or use threading/concurrent.futures to exercise real concurrency.
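A true-concurrency variant could be sketched like this; `load_documents` is a stand-in for the real `loader.load()` call, so the example only demonstrates the threading shape, not actual PDF extraction:

```python
from concurrent.futures import ThreadPoolExecutor


def load_documents(path):
    # Stand-in for OpenDataLoaderPDFLoader(path).load(); swap in the real
    # loader to exercise genuinely concurrent extraction.
    return [f"doc-from-{path}"]


def test_parallel_loaders_no_conflict():
    paths = ["a.pdf", "b.pdf"]
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(load_documents, paths))
    # Each loader must produce its own documents with no cross-talk;
    # pool.map preserves input order.
    assert results == [["doc-from-a.pdf"], ["doc-from-b.pdf"]]
```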

5. test_split_pages_no_separators has a contradictory docstring

The docstring says "should yield nothing (no preamble match)" but the assertion is assert len(docs) == 1. Update the docstring to match the actual expected behavior.

6. Regression snapshot metadata assertion is fragile

assert doc.metadata["source"] == "lorem.pdf" assumes the source is always a bare filename. Prefer Path(doc.metadata["source"]).name == "lorem.pdf".
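The platform-agnostic comparison can be sketched as (the path below is illustrative):

```python
from pathlib import Path

# Compare only the basename so absolute, relative, or temp-dir-prefixed
# source paths all satisfy the assertion.
source = "/tmp/opendataloader/outputs/lorem.pdf"  # illustrative path
assert Path(source).name == "lorem.pdf"
```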


Code quality

7. split_pages positional ordering change

Moving split_pages changes the parameter order in init — a breaking change for callers using positional arguments. A * keyword-only separator would prevent this.
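The keyword-only separator can be sketched on a simplified class; `Loader` below is illustrative, not the real `OpenDataLoaderPDFLoader` signature:

```python
class Loader:
    # Everything after `*` is keyword-only, so reordering or inserting
    # parameters in a later release cannot silently break positional callers.
    def __init__(self, file_path, *, format="text", split_pages=False):
        self.file_path = file_path
        self.format = format
        self.split_pages = split_pages
```

With this shape, `Loader("a.pdf", split_pages=True)` works, while the positional call `Loader("a.pdf", "text")` raises `TypeError` immediately instead of binding to the wrong parameter after a reorder.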

8. Breaking change detection regex is fragile

The grep/sed pipeline in sync-upstream.yml depends on exact code formatting. Any refactoring silently produces an empty diff. Consider parsing the Python AST or at minimum adding a warning comment.

9. subprocess.run in regression tests lacks a timeout

subprocess.run(["java", "-version"], capture_output=True, text=True) will hang the entire test suite if Java hangs. Add timeout=10.
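Combining this with the earlier `check=False` suggestion, a hardened `java_available` might read as follows (the 10-second default is an assumption, not a value from the PR):

```python
import subprocess


def java_available(timeout_seconds: float = 10) -> bool:
    """Return True if `java -version` exits cleanly within the timeout."""
    try:
        result = subprocess.run(
            ["java", "-version"],
            capture_output=True,
            text=True,
            check=False,
            timeout=timeout_seconds,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # Java missing or hung: report unavailable rather than hanging CI.
        return False
    return result.returncode == 0
```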


Documentation

10. README format parameter default is self-contradictory

In the synced table, the Default column shows text but the Description says "Default: json". The code default is "text". The description text needs correction.

11. docs/plans/ design doc is entirely in Korean

Consider adding an English summary or noting the language in the file header for international contributors.


Minor

  • git add .sync-report/ || true silently ignores directory creation failures.
  • The min-dep-test job pins langchain-core==1.0.0 — verify this matches the declared minimum in pyproject.toml.
  • Snapshot files lack a trailing newline, causing noisy one-line diffs.

The regression snapshot approach and the CI/CD generate to test to create-pr separation are well-designed. Priority fixes before merge: 1 (Optional crash), 2 (shell injection), 3 (always() logic).

Generated with Claude Code https://claude.com/claude-code

@claude

claude bot commented Apr 2, 2026

PR Review: feat: add v2.2.0 sync infrastructure, CI/CD, and comprehensive tests

Overall this is a well-structured PR with excellent test coverage and a thoughtful CI/CD design. The three-job sync workflow (generate → test → create-pr) is particularly clean. A few issues worth addressing before merge:


Bugs / Correctness Issues

1. Temp directory is never cleaned up (document_loaders.py:248)

tempfile.mkdtemp() creates a directory that persists after the loader runs. Individual files are deleted (line 325), but the directory itself is not. Over time this leaks temp directories, especially in long-running services.

# Current
output_dir = tempfile.mkdtemp(dir=tempfile.gettempdir())
# ... individual files deleted, but output_dir stays

# Fix: use a context manager
import shutil
output_dir = tempfile.mkdtemp()
try:
    ...
finally:
    shutil.rmtree(output_dir, ignore_errors=True)

2. Silent failure on conversion error swallows errors from callers (document_loaders.py:283-287)

When self.hybrid is falsy, any exception in the convert() call is only logged and the iterator silently returns empty. Users calling loader.load() receive [] with no indication that anything went wrong. This makes debugging difficult.

except Exception as e:
    if self.hybrid:
        raise
    logger.error(f"Error during conversion: {e}")
    return  # caller gets no documents and no exception

Consider always re-raising, or at minimum raising after logging so callers can handle the failure.
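A log-then-reraise shape could be sketched like this; `convert_or_raise` is a hypothetical helper, not existing code in the wrapper:

```python
import logging

logger = logging.getLogger(__name__)


def convert_or_raise(convert, **kwargs):
    # Log for operators, then re-raise so callers never receive a
    # silently empty result from load().
    try:
        return convert(**kwargs)
    except Exception as e:
        logger.error(f"Error during conversion: {e}")
        raise
```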

3. detect_strikethrough may break the minimum-dependency test in integration

detect_strikethrough is new in v2.1.0 but the min-dep test pins opendataloader-pdf==2.0.0. The unit tests mock convert() so they pass, but if integration tests are ever run against minimum deps, passing detect_strikethrough=False to a v2.0.0 convert() that doesn't accept it will raise a TypeError. Worth adding a note in the min-dep test or conditionally omitting unknown kwargs.
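Gating unknown kwargs on the installed `convert()` signature could be sketched as follows; `build_convert_kwargs` is a hypothetical helper, not part of either library:

```python
import inspect


def build_convert_kwargs(convert, **candidate_kwargs):
    """Drop kwargs the installed convert() does not accept, so new optional
    parameters (e.g. detect_strikethrough) degrade gracefully on older
    library versions instead of raising TypeError."""
    params = inspect.signature(convert).parameters
    # If convert already takes **kwargs, pass everything through unchanged.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return candidate_kwargs
    return {k: v for k, v in candidate_kwargs.items() if k in params}
```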


Security / Permissions

4. Missing pull-requests: write permission in create-pr job (sync-upstream.yml:167)

The create-pr job runs gh pr create using secrets.GITHUB_TOKEN but there's no pull-requests: write permission declared. The workflow-level permissions block is absent, so this inherits the repo default. If the repo default is read-only (GitHub's hardened default for org repos), the gh pr create call will fail with a 403. Add an explicit permission:

create-pr:
  needs: [generate, test]
  permissions:
    contents: read
    pull-requests: write

Code Quality

5. ext variable could theoretically be unbound (document_loaders.py:290-296)

The format is validated at line 237, so in practice ext is always assigned — but the if/elif chain has no else branch. A future maintainer adding a new format type might not notice. A dict lookup is more robust:

ext_map = {"json": "json", "text": "txt", "html": "html", "markdown": "md"}
ext = ext_map[self.format]

6. pytest-socket is installed twice in test.yml

Line 33 runs pip install -e ".[dev]" pytest-socket, but pytest-socket>=0.7.0 is now in the [dev] dependencies in pyproject.toml. The explicit pytest-socket after .[dev] is redundant. Minor, but worth cleaning up for consistency.

# Current
run: pip install -e ".[dev]" pytest-socket
# Fix
run: pip install -e ".[dev]"

Test Quality

7. TestConcurrency class name is misleading (test_regression.py:142)

The class is named TestConcurrency and its docstring says "Verify multiple loader instances don't interfere", but the test only runs two loaders sequentially. This doesn't test actual concurrent execution. Either rename it to TestMultipleLoaders / TestSequentialLoaders, or add a note that it's a sequential isolation test rather than a true concurrency test.

8. JSON regression test compares structure, not content (test_regression.py:111-117)

The comment says "minor text extraction improvements shouldn't break this", but comparing only top-level keys and kids count means content regressions won't be caught. For a regression suite, this may be intentionally loose, but it's worth being explicit in the comment about what would trigger a test failure (structural changes only, not text changes).


Minor Nits

  • sync-upstream.yml breaking-change detection (lines 58-60, 91-92): The grep | sed approach to extract param names is clever but fragile — any self.x = inside a method body would be captured. The sync markers (BEGIN/END SYNCED ASSIGNMENTS) suggest you could parse only the content between those markers. Low risk now but worth noting.

  • PR body template (sync-upstream.yml:195-201): The heredoc uses literal ${{ env.VERSION }} inside << 'ENDOFBODY' (single-quoted), which means the version won't be interpolated in the heredoc. The version string will appear literally as ${{ env.VERSION }} in the PR body. The echo lines after the heredoc use unquoted ${{ ... }} so those are fine.


What's Working Well

  • The three-job separation in sync-upstream.yml (generate → test → create-pr) is well-designed. Draft PR on test failure is a good UX choice.
  • Network isolation with pytest-socket in unit tests is excellent practice.
  • The _PAGE_SPLIT_SEPARATOR pattern is distinctive enough to avoid false collisions in real PDF content.
  • Snapshot test auto-creation pattern (create on first run, compare on subsequent runs) is clean.
  • fail-fast: false on all matrix jobs is the right call for a library that needs to know which versions fail.
  • The PyPI polling loop with 20 × 30s is a practical solution for release propagation delay.

@claude

claude bot commented Apr 2, 2026

PR Review: feat: add v2.2.0 sync infrastructure, CI/CD, and comprehensive tests

This is a well-structured PR with solid test coverage and a thoughtful CI/CD design. The sync marker strategy is a particularly nice approach for maintainability. Below is feedback organized by area.


Bug: Temp directory is never cleaned up

langchain_opendataloader_pdf/document_loaders.py, lazy_load()

tempfile.mkdtemp() creates a directory that persists until the OS cleans it up. Individual output files are unlinked via file.unlink(), but the directory itself is never removed. Over many calls this leaks temp directories.

# Current: only files are deleted, not the directory
try:
    file.unlink()
except Exception as e:
    logger.error(f"Error deleting temp file '{file}': {e}")

Consider wrapping the temp directory in a try/finally or using tempfile.TemporaryDirectory() as a context manager:

with tempfile.TemporaryDirectory() as output_dir:
    opendataloader_pdf.convert(input_path=self.file_paths, output_dir=output_dir, ...)
    # process files...
    # directory auto-cleaned on exit

This would also make the file.unlink() calls unnecessary.


Design: Format validation is deferred to lazy_load(), not __init__()

Validation of format happens inside lazy_load(), which means an invalid format string won't raise until .load() is called — potentially long after construction. Users generally expect invalid constructor arguments to fail fast.

Consider moving the format validation into __init__():

VALID_FORMATS = {"json", "text", "html", "markdown"}
if self.format not in VALID_FORMATS:
    raise ValueError(f"Invalid format '{self.format}'. ...")

Security: Shell injection risk in sync-upstream.yml

.github/workflows/sync-upstream.yml

The VERSION value from github.event.client_payload.version (repository_dispatch) is used directly in shell commands without sanitization:

git commit -m "sync: update to opendataloader-pdf v${{ env.VERSION }}"
git checkout -b "$BRANCH"  # BRANCH derived from VERSION
sed -i "s/.../${{ env.VERSION }}/" pyproject.toml

While repository_dispatch requires repo write access (limiting the attack surface), it's still best practice to validate the version format before use. Consider adding a validation step:

# Validate VERSION looks like a semver (e.g., 2.3.0)
if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+$'; then
  echo "::error::Invalid version format: $VERSION"
  exit 1
fi

CI/CD: test-full.yml missing a concurrency group

.github/workflows/test-full.yml

Unlike test.yml and sync-upstream.yml, the full test workflow has no concurrency group. Multiple manual dispatches could run simultaneously, burning GitHub Actions minutes on a 3-OS × 2-Python (6-job) matrix. Adding a concurrency group would cancel stale runs:

concurrency:
  group: test-full-${{ github.ref }}
  cancel-in-progress: true

Minor: min-dep-test installs unpinned pytest

.github/workflows/test.yml, min-dep-test job

pip install "langchain-core==1.0.0" "opendataloader-pdf==2.0.0" pytest pytest-socket

The minimum dependency job is intentionally testing with pinned core dependencies, but pytest itself is unpinned. A future breaking pytest release could cause spurious failures in this job. Consider pinning pytest to a minimum tested version, or pulling it from pyproject.toml's dev extras.


Observation: Committed regression snapshots are environment-sensitive

The snapshots in tests/snapshots/ are committed as part of this PR, meaning they were generated from a specific opendataloader-pdf version and JVM. TestRegressionJson already handles this gracefully (structural comparison only), but TestRegressionText and TestRegressionMarkdown do exact-string comparisons. If the upstream library produces slightly different output across JVM versions or platforms, CI on the test matrix (Ubuntu/Windows/macOS in test-full.yml) could produce false failures on the regression tests.

This is an acceptable trade-off since the snapshots are easy to update, but it's worth documenting in a comment in test_regression.py.


Nitpick: Redundant ext_map after format validation

In lazy_load(), the ext_map is built after the format has already been validated against the same key set. Since the map keys exactly match the valid format values, the lookup will always succeed — but the code reads as if there could be a missing key. This is harmless but slightly misleading. A comment clarifying this would help:

# ext_map keys match the validated formats above; lookup is always safe
ext_map = {"json": "json", "text": "txt", "html": "html", "markdown": "md"}

Summary

Area Finding Severity
document_loaders.py Temp directory never deleted Bug
document_loaders.py Format validation deferred to load time Design
sync-upstream.yml VERSION not validated before use in shell Security
test-full.yml Missing concurrency group Minor
test.yml Unpinned pytest in min-dep job Minor
test_regression.py Snapshot sensitivity across environments Observation

Overall this is good work — the sync marker strategy, breaking change detection, network isolation for unit tests, and minimum dependency test are all solid engineering choices. The temp directory leak is the most important item to address before merging.

@claude

claude bot commented Apr 2, 2026

PR Review: feat: add v2.2.0 sync infrastructure, CI/CD, and comprehensive tests

Good overall structure. The job separation in sync-upstream.yml (generate → test → create-pr) is a solid pattern. The snapshot regression approach is practical. A few issues to address:


Bug: test_invalid_format_raises_on_load will fail

langchain_opendataloader_pdf/document_loaders.py

The refactor from if-elif to dict lookup changes the exception type for invalid formats:

# New code — raises KeyError, not ValueError
ext = ext_map[self.format]

But the new test expects ValueError:

# tests/test_document_loaders.py
def test_invalid_format_raises_on_load(self):
    with pytest.raises(ValueError, match="Invalid format"):

Fix by adding explicit validation, e.g.:

ext = ext_map.get(self.format)
if ext is None:
    raise ValueError(f"Invalid format: {self.format!r}. Must be one of: {list(ext_map)}")

Missing hybrid_fallback in auto-generated README table

The synced params table ends at `hybrid_timeout`; `hybrid_fallback` is gone from the documentation. If it's intentionally excluded from the sync (it's a LangChain-side option), it needs to be manually maintained in the table, similar to split_pages.


Duplicated java_available() function

java_available() is defined identically in both tests/test_integration.py and tests/test_regression.py. Move it to tests/conftest.py as a shared utility to avoid drift.


Security: CI runs generated code from upstream with write permissions

.github/workflows/sync-upstream.yml

The generate job checks out and executes upstream/scripts/generate-langchain.mjs from the upstream repo at a specific tag, then has permissions: contents: write. If the upstream repo is compromised (or a tag is moved), the script runs with write access to this repo. Consider:

  • Pinning actions/checkout to a commit SHA for the upstream checkout (the ref: v${{ env.VERSION }} tag can be moved)
  • Or at minimum, sandboxing the Node.js execution and doing a post-run diff review before committing

replace_invalid_chars default change is undocumented

The README now shows replace_invalid_chars default as " " (space) instead of None. If this is an actual behavior change in the upstream library (not just a doc fix), it should be called out in the CHANGELOG as a behavioral change, not just a docs fix.


Minor: snapshot files missing trailing newline

All three snapshot files (lorem_text.txt, lorem_markdown.md, lorem_json.json) are missing a trailing newline. This is a POSIX convention and some tools (diff, git) will warn about it. Consider adding newlines when writing snapshots:

self.SNAPSHOT_FILE.write_text(actual + "\n", encoding="utf-8")

Nit: README table formatting inconsistency

In the auto-generated table, some default values use backtick-quoted strings ("default", "png") while others don't (text, off). This looks like a templating inconsistency in the upstream generator and is worth fixing there.


Summary: The ext_map KeyError vs ValueError mismatch is the only blocker — that test will fail in CI. The missing hybrid_fallback in docs and the security note around upstream code execution are worth addressing before merge. Everything else is minor.

🤖 Generated with Claude Code


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
langchain_opendataloader_pdf/document_loaders.py (1)

253-278: ⚠️ Potential issue | 🔴 Critical

Remove or gate detect_strikethrough when calling opendataloader_pdf.convert().

The code unconditionally passes detect_strikethrough to convert() (line 272), but this parameter only exists in opendataloader-pdf>=2.1.0. The minimum tested version in .github/workflows/test.yml is still 2.0.0, which lacks this parameter. Any conversion with version 2.0.0 will fail immediately with TypeError: convert() got an unexpected keyword argument 'detect_strikethrough'. Mocked tests in tests/test_document_loaders.py won't catch this. Either gate the parameter based on the installed convert() signature, or bump the minimum dependency requirement to 2.1+ and update the workflow accordingly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@langchain_opendataloader_pdf/document_loaders.py` around lines 253 - 278, The
call to opendataloader_pdf.convert unconditionally passes detect_strikethrough
which doesn't exist in older opendataloader-pdf versions; update the call site
in document_loaders.py to only include detect_strikethrough when convert accepts
it (e.g., use inspect.signature(opendataloader_pdf.convert) or check
hasattr/parameter names) and fall back to calling convert without that kwarg on
older versions, ensuring the code path around opendataloader_pdf.convert (and
any wrapper function that builds kwargs) conditionally adds detect_strikethrough
rather than always passing it; alternatively, if you prefer to require the newer
library, update the package minimum to >=2.1.0 and the test/workflow configs
accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/sync-upstream.yml:
- Around line 181-215: The workflow's git diff HEAD~1 fails because
actions/checkout@v4 checks out a single commit by default; update the checkout
step (the actions/checkout@v4 entry) to fetch the parent commit by adding a
fetch-depth setting (e.g., set fetch-depth: 0 to fetch full history or
fetch-depth: 2 to include the immediate parent), or alternatively run a git
fetch --unshallow before the git diff in the Build PR body step so git diff
HEAD~1 can access the parent commit.
- Around line 105-114: The workflow reuses BRANCH="sync/v${{ env.VERSION }}" and
will fail on git push if that branch already exists; make the update idempotent
by replacing git checkout -b "$BRANCH" with a resetable command (e.g., use git
checkout -B "$BRANCH" to recreate/reset the local branch) and push with safety
(replace git push origin "$BRANCH" with git push --force-with-lease origin
"$BRANCH"); update the git commands around BRANCH, git checkout -b and git push
origin to use git checkout -B and git push --force-with-lease so reruns don’t
error on non-fast-forward.
- Around line 156-159: Change the pip install invocation in the "Install
dependencies" job so the generated code is tested against the exact
opendataloader-pdf version used to generate it: update the pip install command
that currently uses "opendataloader-pdf>=${{ env.VERSION }}" to pin with "=="
(i.e., use equality instead of a greater-than constraint) in the Install
dependencies step so reruns cannot pull a newer, incompatible API.

In `@tests/test_regression.py`:
- Around line 56-58: The test currently auto-creates missing snapshots by
calling self.SNAPSHOT_FILE.write_text(...) and then pytest.skip(...); change
this behavior so tests fail when a committed snapshot is missing instead of
regenerating it: remove the write_text + pytest.skip flow and replace with an
explicit assertion that self.SNAPSHOT_FILE.exists() (or raise pytest.fail with a
clear message) in the snapshot verification code paths (including the blocks
using self.SNAPSHOT_FILE at the ranges around 56-58, 80-82, and 103-107). If you
want to keep an opt-in update path, implement a separate explicit mechanism
(e.g. check an environment variable like UPDATE_SNAPSHOTS or a pytest CLI flag)
and only perform write_text when that opt-in flag is set; do not auto-create or
skip in the normal verification path.
- Around line 127-133: The test iterates over docs = loader.load() without
verifying any items exist, so add an explicit assertion that loader.load()
returned at least one document (e.g., assert docs or assert len(docs) > 0)
before the for loop to ensure the subsequent metadata checks (in the loop
iterating over doc.metadata) actually run; update the test in
tests/test_regression.py around the use of loader.load() and the docs variable
to include this non-empty check prior to iterating.

---

Outside diff comments:
In `@langchain_opendataloader_pdf/document_loaders.py`:
- Around line 253-278: The call to opendataloader_pdf.convert unconditionally
passes detect_strikethrough which doesn't exist in older opendataloader-pdf
versions; update the call site in document_loaders.py to only include
detect_strikethrough when convert accepts it (e.g., use
inspect.signature(opendataloader_pdf.convert) or check hasattr/parameter names)
and fall back to calling convert without that kwarg on older versions, ensuring
the code path around opendataloader_pdf.convert (and any wrapper function that
builds kwargs) conditionally adds detect_strikethrough rather than always
passing it; alternatively, if you prefer to require the newer library, update
the package minimum to >=2.1.0 and the test/workflow configs accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 514cd073-28ee-4d47-9560-80ea01ede0e5

📥 Commits

Reviewing files that changed from the base of the PR and between 8467f67 and c4461bc.

📒 Files selected for processing (6)
  • .github/workflows/sync-upstream.yml
  • .github/workflows/test-full.yml
  • .github/workflows/test.yml
  • langchain_opendataloader_pdf/document_loaders.py
  • tests/test_integration.py
  • tests/test_regression.py

Comment on lines +156 to +159
- name: Install dependencies
run: |
pip install -e . pytest pytest-socket
pip install "opendataloader-pdf>=${{ env.VERSION }}"

@coderabbitai coderabbitai bot Apr 2, 2026


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's read the full workflow file to understand the context
wc -l .github/workflows/sync-upstream.yml

Repository: opendataloader-project/langchain-opendataloader-pdf

Length of output: 134


🏁 Script executed:

# Read the entire workflow file to see the context around lines 156-159
cat -n .github/workflows/sync-upstream.yml

Repository: opendataloader-project/langchain-opendataloader-pdf

Length of output: 10333


Pin to exact version in test job.

The generated code (created in Job 1 from opendataloader-pdf v${VERSION}) should be tested against that exact version. Currently, line 159 uses >=, which allows a newer version to be installed on reruns—potentially testing against a different API than the code was generated for. Use == to ensure consistent testing.

Proposed fix
       - name: Install dependencies
         run: |
           pip install -e . pytest pytest-socket
-          pip install "opendataloader-pdf>=${{ env.VERSION }}"
+          pip install "opendataloader-pdf==${{ env.VERSION }}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- name: Install dependencies
run: |
pip install -e . pytest pytest-socket
pip install "opendataloader-pdf>=${{ env.VERSION }}"
- name: Install dependencies
run: |
pip install -e . pytest pytest-socket
pip install "opendataloader-pdf==${{ env.VERSION }}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/sync-upstream.yml around lines 156 - 159, Change the pip
install invocation in the "Install dependencies" job so the generated code is
tested against the exact opendataloader-pdf version used to generate it: update
the pip install command that currently uses "opendataloader-pdf>=${{ env.VERSION
}}" to pin with "==" (i.e., use equality instead of a greater-than constraint)
in the Install dependencies step so reruns cannot pull a newer, incompatible
API.

Collaborator Author


Fixed — changed to always() && needs.generate.result == 'success' so create-pr runs even when test fails (draft PR).


@hyunhee-jo, it looks like there may be a mismatch — the fix you described (always() && needs.generate.result == 'success' for create-pr) addresses a different concern than the one I raised.

My original comment was specifically about the test job's "Install dependencies" step (Line 159), where >= should be changed to ==:

      - name: Install dependencies
        run: |
          pip install -e . pytest pytest-socket
-          pip install "opendataloader-pdf>=${{ env.VERSION }}"
+          pip install "opendataloader-pdf==${{ env.VERSION }}"

Using >= in the test job means reruns could pull a newer version of opendataloader-pdf than the one used to generate the code in Job 1, potentially testing against a different API than the generated code was built for.

Could you confirm whether this version pinning change was also applied, or is it still pending?


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (4)
tests/test_regression.py (2)

117-123: ⚠️ Potential issue | 🟡 Minor

Assert non-empty docs before iterating metadata checks.

Without this, the loop body can be skipped and the test may pass without validating metadata.

Proposed fix
         docs = loader.load()
+        assert docs
         for doc in docs:
             assert set(doc.metadata.keys()) == {"source", "format", "page"}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_regression.py` around lines 117 - 123, The test iterates over docs
from loader.load() but never asserts docs is non-empty, allowing the loop to be
skipped; add an explicit assertion right after docs = loader.load() (e.g.,
assert docs or assert len(docs) > 0) to ensure the collection contains at least
one document before performing the metadata checks in the subsequent for-loop
that inspects doc.metadata and Path(doc.metadata["source"]).name.

46-49: ⚠️ Potential issue | 🟠 Major

Snapshot tests should fail (not auto-generate + skip) when baselines are missing.

Current behavior can keep CI green while silently dropping regression coverage.

Proposed fix pattern
-        if not self.SNAPSHOT_FILE.exists():
-            self.SNAPSHOT_FILE.write_text(actual, encoding="utf-8")
-            pytest.skip("Snapshot created, run again to verify")
+        assert self.SNAPSHOT_FILE.exists(), (
+            f"Missing snapshot: {self.SNAPSHOT_FILE}. "
+            "Regenerate snapshots explicitly in an opt-in flow."
+        )

Also applies to: 70-73, 93-98

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_regression.py` around lines 46 - 49, The tests currently
auto-create missing snapshot files (using SNAPSHOT_FILE.write_text) and call
pytest.skip, which masks missing baselines; instead, change the behavior in the
snapshot handling (occurrences around the SNAPSHOT_FILE check and the blocks at
the other two locations) to fail the test when the snapshot is absent: remove
the auto-write + pytest.skip path and replace it with a pytest.fail (or an
assert/raise) that reports the missing snapshot and instructs the developer how
to regenerate baselines; keep the SNAPSHOT_FILE existence check and use the same
clear failure message in all three places (the blocks referencing
SNAPSHOT_FILE.write_text/pytest.skip).
tests/test_integration.py (1)

310-320: ⚠️ Potential issue | 🟡 Minor

Directory-loading assertion is still too weak (>= 1).

This can pass even if some PDFs are skipped. Assert equality with actual SAMPLES_DIR.glob("*.pdf") count.

Proposed fix
     def test_load_with_directory_path(self):
         """Test loading with a directory path."""
+        expected_pdf_count = len(list(SAMPLES_DIR.glob("*.pdf")))
+        if expected_pdf_count == 0:
+            pytest.skip("No sample PDFs found in samples directory")
+
         loader = OpenDataLoaderPDFLoader(
             file_path=str(SAMPLES_DIR),
             format="text",
             quiet=True,
             split_pages=False,
         )
         documents = loader.load()
-        # Should load all PDFs in the directory
-        assert len(documents) >= 1
+        assert len(documents) == expected_pdf_count
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_integration.py` around lines 310 - 320, The test
test_load_with_directory_path uses a weak assertion; compute the expected number
of PDFs by counting SAMPLES_DIR.glob("*.pdf") (e.g. expected =
len(list(SAMPLES_DIR.glob("*.pdf")))) and replace the current assert
len(documents) >= 1 with an equality check assert len(documents) == expected so
the test verifies all PDFs in SAMPLES_DIR were loaded; keep references to
SAMPLES_DIR, loader.load(), and documents when making the change.
langchain_opendataloader_pdf/document_loaders.py (1)

69-69: ⚠️ Potential issue | 🟠 Major

hybrid_timeout default is still inconsistent between code and docs.

Line 114 documents default "0", but Line 69 still defaults to None. This keeps the runtime/docs contract mismatched.

Proposed fix
-        hybrid_timeout: Optional[str] = None,
+        hybrid_timeout: str = "0",

Also applies to: 114-114

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@langchain_opendataloader_pdf/document_loaders.py` at line 69, The function
signature currently sets the parameter hybrid_timeout: Optional[str] = None
while the docstring/documentation states the default is "0"; update the
parameter default to hybrid_timeout: Optional[str] = "0" (keeping the
Optional[str] type) so runtime and docs match, and ensure any internal logic
that checks for None handles the new "0" string as intended (adjust checks if
they rely on None); verify the docstring/default text at the documented location
remains "0".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@langchain_opendataloader_pdf/document_loaders.py`:
- Around line 65-73: The constructor parameter order introduced new positional
params (hybrid, hybrid_mode, hybrid_url, hybrid_timeout, hybrid_fallback) after
detect_strikethrough and accidentally moved split_pages, causing positional-call
regressions; restore backward compatibility by moving split_pages back to its
original position in the constructor signature (place split_pages where it was
before the new hybrid-related params) or alternatively make the new hybrid_*
parameters keyword-only (add a * before them) so existing positional calls
mapping to split_pages remain unchanged; update the constructor signature in
document_loaders.py (the class/init that declares detect_strikethrough,
split_pages, hybrid, hybrid_mode, hybrid_url, hybrid_timeout, hybrid_fallback)
accordingly.
- Line 272: The code unconditionally forwards detect_strikethrough to convert(),
which doesn't exist in opendataloader-pdf 2.0.0; fix by either adding a version
guard or raising the minimum dependency: Option A - detect the installed
opendataloader_pdf version (or import attribute) and only include
detect_strikethrough in the convert() call when the package version is >= 2.1.0
(or when convert accepts that parameter); Option B - update the package
constraint to opendataloader-pdf>=2.1.0 so callers always have support;
reference the detect_strikethrough parameter and the convert() call to locate
the change.

---

Duplicate comments:
In `@langchain_opendataloader_pdf/document_loaders.py`:
- Line 69: The function signature currently sets the parameter hybrid_timeout:
Optional[str] = None while the docstring/documentation states the default is
"0"; update the parameter default to hybrid_timeout: Optional[str] = "0"
(keeping the Optional[str] type) so runtime and docs match, and ensure any
internal logic that checks for None handles the new "0" string as intended
(adjust checks if they rely on None); verify the docstring/default text at the
documented location remains "0".

In `@tests/test_integration.py`:
- Around line 310-320: The test test_load_with_directory_path uses a weak
assertion; compute the expected number of PDFs by counting
SAMPLES_DIR.glob("*.pdf") (e.g. expected = len(list(SAMPLES_DIR.glob("*.pdf"))))
and replace the current assert len(documents) >= 1 with an equality check assert
len(documents) == expected so the test verifies all PDFs in SAMPLES_DIR were
loaded; keep references to SAMPLES_DIR, loader.load(), and documents when making
the change.

In `@tests/test_regression.py`:
- Around line 117-123: The test iterates over docs from loader.load() but never
asserts docs is non-empty, allowing the loop to be skipped; add an explicit
assertion right after docs = loader.load() (e.g., assert docs or assert
len(docs) > 0) to ensure the collection contains at least one document before
performing the metadata checks in the subsequent for-loop that inspects
doc.metadata and Path(doc.metadata["source"]).name.
- Around line 46-49: The tests currently auto-create missing snapshot files
(using SNAPSHOT_FILE.write_text) and call pytest.skip, which masks missing
baselines; instead, change the behavior in the snapshot handling (occurrences
around the SNAPSHOT_FILE check and the blocks at the other two locations) to
fail the test when the snapshot is absent: remove the auto-write + pytest.skip
path and replace it with a pytest.fail (or an assert/raise) that reports the
missing snapshot and instructs the developer how to regenerate baselines; keep
the SNAPSHOT_FILE existence check and use the same clear failure message in all
three places (the blocks referencing SNAPSHOT_FILE.write_text/pytest.skip).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: a3e954c3-92a0-4e08-9155-d9d009e10e4d

📥 Commits

Reviewing files that changed from the base of the PR and between c4461bc and 91f442f.

📒 Files selected for processing (4)
  • langchain_opendataloader_pdf/document_loaders.py
  • tests/conftest.py
  • tests/test_integration.py
  • tests/test_regression.py

@claude

claude bot commented Apr 2, 2026

Test comment

@claude

claude bot commented Apr 2, 2026

PR Review — feat: add v2.2.0 sync infrastructure, CI/CD, and comprehensive tests

Great scope of work overall — the sync markers, CI/CD automation, and expanded test suite are all valuable additions. There are a few issues that need to be resolved before merge, ranging from a critical runtime bug to minor style fixes.


Critical

detect_strikethrough unconditionally passed to convert() breaks opendataloader-pdf==2.0.0
(langchain_opendataloader_pdf/document_loaders.py, convert() call)

detect_strikethrough was introduced in opendataloader-pdf>=2.1.0, but the minimum supported (and tested) version is 2.0.0. Passing this kwarg unconditionally will raise TypeError for any user on 2.0.0.

Options:

  • Bump the minimum requirement in pyproject.toml to opendataloader-pdf>=2.1.0, or
  • Gate the kwarg with a runtime version check:
    import importlib.metadata, packaging.version
    _odl_version = packaging.version.parse(importlib.metadata.version("opendataloader-pdf"))
    # then in convert():
    if _odl_version >= packaging.version.parse("2.1.0"):
        kwargs["detect_strikethrough"] = self.detect_strikethrough
    Though bumping the minimum is simpler and more maintainable.

Major

split_pages positional order change is a breaking regression
(document_loaders.py, __init__ signature)

Moving split_pages to after the synced params block changes its positional index. Any caller using positional arguments (e.g. OpenDataLoaderPDFLoader(src, "text", True, True)) will silently pass the wrong value. Either:

  • Restore split_pages to its original position, or
  • Make the new synced params keyword-only by inserting * before them

hybrid_timeout default mismatch
(document_loaders.py line ~69)

The signature has Optional[str] = None but both the docstring and README say the default is "0". Pick one and make them consistent. If the actual runtime default really is None, the docs must say so; if it should be "0", fix the signature.

sync-upstream.yml: multiple CI reliability issues

  1. git diff HEAD~1 in the create-pr job will fail because actions/checkout@v4 defaults to fetch-depth: 1. Add fetch-depth: 2 (or 0) to the checkout step in that job.
  2. git checkout -b sync/v{VERSION} will error on reruns when the branch already exists. Use git checkout -B and git push --force-with-lease instead.
  3. pip install "opendataloader-pdf>={VERSION}" in the test job can pull a newer, potentially incompatible version on reruns. Pin it: "opendataloader-pdf=={VERSION}".

Snapshot tests silently skip instead of fail on missing baselines
(tests/test_regression.py)

The auto-create + pytest.skip() pattern means a missing snapshot file produces a green (skipped) test result, masking a coverage gap. Prefer pytest.fail() with a message directing the developer to regenerate the snapshot intentionally.
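A minimal sketch of that fail-instead-of-skip pattern; `check_snapshot` and the `update` flag are illustrative (the real tests would wire `update` to an explicit opt-in such as an UPDATE_SNAPSHOTS environment variable):

```python
def check_snapshot(snapshot_file, actual, update=False):
    """Fail loudly when a baseline is missing; only write it in an
    explicit update mode, never silently during a normal test run."""
    if not snapshot_file.exists():
        if update:
            snapshot_file.write_text(actual, encoding="utf-8")
            return "created"
        raise AssertionError(
            f"Missing snapshot: {snapshot_file}. "
            "Regenerate baselines explicitly with the update flag."
        )
    assert snapshot_file.read_text(encoding="utf-8") == actual
    return "matched"
```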


Minor

test_load_with_directory_path assertion is too weak
(tests/test_integration.py)

assert len(docs) >= 1 passes even if only one page from one file loaded. This should assert the exact expected count:

expected = len(list(SAMPLES_DIR.glob("*.pdf")))
assert len(docs) == expected

Missing assert docs before metadata loop
(tests/test_regression.py, metadata tests)

If docs is empty the loop silently passes. Add assert docs (or assert len(docs) > 0) before iterating.

test-full.yml missing top-level permissions

Add an explicit permissions: contents: read at the workflow level to follow the principle of least privilege.

README parameters table has inconsistent quoting

Some string defaults are quoted ("default", "xycut") and some are not (text, xycut, off). Pick one convention and apply it consistently across all rows.

tests/snapshots/lorem_markdown.md missing trailing newline

Most text editors and linters (markdownlint MD047) expect a newline at the end of the file. Add one.

Design doc min-dep-test install command missing pytest-socket
(docs/plans/2026-04-02-pre-release-checklist-design.md line ~91)

The doc's example install command for the minimum-dependency test doesn't include pytest-socket, which is required by the actual test.yml workflow. Update the doc to match.


Nits / Style

  • conftest.java_available() should pass check=False to subprocess.run — as written (with check=True), a non-zero exit code raises CalledProcessError rather than letting the helper return False.
  • The regression test method test_parallel_loaders_no_conflict is actually sequential, not parallel. Rename to test_multiple_loaders_no_conflict or test_sequential_loaders_no_conflict for clarity.
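For the first nit, a defensive java_available() might look like the following sketch; the actual conftest.py helper may probe Java differently:

```python
import shutil
import subprocess

def java_available() -> bool:
    """True when a runnable `java` binary is on PATH; never raises."""
    if shutil.which("java") is None:
        return False
    result = subprocess.run(
        ["java", "-version"],
        capture_output=True,
        check=False,  # a non-zero exit should mean "unavailable", not an exception
    )
    return result.returncode == 0
```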

Summary

Severity Count
Critical 1
Major 4
Minor 5
Nit 2

The critical detect_strikethrough compat bug and the split_pages positional regression are the most important to fix before merge. The CI reliability issues in sync-upstream.yml should also be addressed before the workflow is relied upon in production.


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (3)
.github/workflows/sync-upstream.yml (3)

105-114: ⚠️ Potential issue | 🟠 Major

Sync branch creation is not idempotent — re-runs will fail.

This issue was previously flagged. The workflow reuses sync/v${VERSION} but uses git checkout -b (fails if branch exists) and git push without --force-with-lease (fails on non-fast-forward). Re-running the workflow for the same version will fail.

🔧 Proposed fix for idempotent branch handling
           BRANCH="sync/v${{ env.VERSION }}"
           echo "name=$BRANCH" >> $GITHUB_OUTPUT
           git config user.name "github-actions[bot]"
           git config user.email "github-actions[bot]@users.noreply.github.com"
-          git checkout -b "$BRANCH"
+          git fetch origin "$BRANCH" 2>/dev/null || true
+          git checkout -B "$BRANCH"
           git add langchain_opendataloader_pdf/document_loaders.py pyproject.toml README.md
           git add .sync-report/ || true
           git commit -m "sync: update to opendataloader-pdf v${{ env.VERSION }}"
-          git push origin "$BRANCH"
+          git push --force-with-lease origin "$BRANCH"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/sync-upstream.yml around lines 105 - 114, The workflow
currently always runs git checkout -b "$BRANCH" and git push origin "$BRANCH",
which fails if the branch already exists or when non-fast-forward updates occur;
change the checkout to try existing branch first (e.g., git fetch and git
checkout "$BRANCH" or fall back to git checkout -b "$BRANCH") and make the push
idempotent by using git push --force-with-lease origin "$BRANCH" so re-runs
won’t error; update references to BRANCH and the checkout/push lines
accordingly.

183-186: ⚠️ Potential issue | 🔴 Critical

Shallow checkout breaks git diff HEAD~1 in PR body generation.

This issue was previously flagged. The default fetch-depth: 1 doesn't include the parent commit, causing git diff HEAD~1 on line 217 to fail. Add fetch-depth: 2 to include the parent.

🔧 Proposed fix to include parent commit
       - uses: actions/checkout@v4
         with:
           ref: ${{ needs.generate.outputs.branch }}
+          fetch-depth: 2
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/sync-upstream.yml around lines 183 - 186, The checkout
step using actions/checkout@v4 is doing a shallow fetch (default fetch-depth: 1)
which breaks the later git diff HEAD~1 used for PR body generation; update the
checkout step (the actions/checkout@v4 invocation) to set fetch-depth: 2 so the
parent commit is fetched and git diff HEAD~1 succeeds during the PR body
generation step.

158-161: ⚠️ Potential issue | 🟠 Major

Pin opendataloader-pdf to exact version for consistent testing.

This issue was previously flagged. The test job should use == instead of >= to ensure the generated code is tested against the exact version it was generated for. Using >= allows a newer version on re-runs, potentially testing against a different API.

🔧 Proposed fix to pin exact version
       - name: Install dependencies
         run: |
           pip install -e . pytest pytest-socket
-          pip install "opendataloader-pdf>=${{ env.VERSION }}"
+          pip install "opendataloader-pdf==${{ env.VERSION }}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/sync-upstream.yml around lines 158 - 161, Change the pip
install line in the "Install dependencies" step to pin opendataloader-pdf to the
exact generated version: replace the "opendataloader-pdf>=${{ env.VERSION }}"
spec with an exact match using "==" so the workflow installs
opendataloader-pdf==${{ env.VERSION }} to ensure tests run against the exact
generated package version.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In @.github/workflows/sync-upstream.yml:
- Around line 105-114: The workflow currently always runs git checkout -b
"$BRANCH" and git push origin "$BRANCH", which fails if the branch already
exists or when non-fast-forward updates occur; change the checkout to try
existing branch first (e.g., git fetch and git checkout "$BRANCH" or fall back
to git checkout -b "$BRANCH") and make the push idempotent by using git push
--force-with-lease origin "$BRANCH" so re-runs won’t error; update references to
BRANCH and the checkout/push lines accordingly.
- Around line 183-186: The checkout step using actions/checkout@v4 is doing a
shallow fetch (default fetch-depth: 1) which breaks the later git diff HEAD~1
used for PR body generation; update the checkout step (the actions/checkout@v4
invocation) to set fetch-depth: 2 so the parent commit is fetched and git diff
HEAD~1 succeeds during the PR body generation step.
- Around line 158-161: Change the pip install line in the "Install dependencies"
step to pin opendataloader-pdf to the exact generated version: replace the
"opendataloader-pdf>=${{ env.VERSION }}" spec with an exact match using "==" so
the workflow installs opendataloader-pdf==${{ env.VERSION }} to ensure tests run
against the exact generated package version.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6ea29446-26d0-4d08-8fd9-cf3840bb4be5

📥 Commits

Reviewing files that changed from the base of the PR and between 91f442f and 9743ebc.

📒 Files selected for processing (3)
  • .github/workflows/sync-upstream.yml
  • .github/workflows/test-full.yml
  • .github/workflows/test.yml

@claude

claude bot commented Apr 2, 2026

test body - will delete

@claude

claude bot commented Apr 2, 2026

PR Review: feat: add v2.2.0 sync infrastructure, CI/CD, and comprehensive tests

Overall this is a well-structured PR with good automation design. The sync marker approach is clever and the CI/CD job separation (generate → test → create-pr) is solid. A few issues to address below.


Bug: Format validation is unreachable dead code

langchain_opendataloader_pdf/document_loaders.py

The new ValueError for invalid format is placed in the second try block, after convert() is already called. If an invalid format is passed (e.g., "xml"), opendataloader_pdf.convert() will raise first — and that exception is silently caught and causes an early return. The ValueError is never reached.

This also makes test_invalid_format_raises_on_load unreliable — without a mock, convert() raises first (file doesn't exist, or invalid format), the exception is swallowed, and the test gets [] instead of ValueError.

Fix: Validate format in __init__, before any I/O:

VALID_FORMATS = {"text", "markdown", "json", "html"}

def __init__(self, ..., format: str = "text", ...):
    fmt = format.lower()
    if fmt not in VALID_FORMATS:
        raise ValueError(f"Invalid format '{format}'. Valid options: {', '.join(sorted(VALID_FORMATS))}")
    self.format = fmt
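A self-contained version of that early validation; `LoaderStub` is an illustrative stand-in for OpenDataLoaderPDFLoader, not the real class:

```python
VALID_FORMATS = {"text", "markdown", "json", "html"}

class LoaderStub:
    """Illustrative stand-in: validate format in __init__, before any I/O."""

    def __init__(self, file_path: str, format: str = "text"):
        fmt = format.lower()
        if fmt not in VALID_FORMATS:
            raise ValueError(
                f"Invalid format {format!r}. "
                f"Valid options: {', '.join(sorted(VALID_FORMATS))}"
            )
        self.file_path = file_path
        self.format = fmt
```

Failing at construction time also makes test_invalid_format_raises_on_load deterministic: no file I/O or convert() call is needed to trigger the error.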

Minor: sync-upstream.yml — breaking change detection is regex-on-source-code (fragile)

The grep/sed pipeline used to detect removed params parses Python source with regex. This relies on a specific indentation pattern in the source file. If the generator reformats assignments or adds nested logic, detection silently breaks. Consider having the generator script write a machine-readable manifest (e.g., .sync-report/params.json) that the detection step can diff reliably instead.
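The manifest-diff idea could be as simple as the following sketch; the "params" schema and file names are assumptions, not something generate-langchain.mjs emits today:

```python
import json
from pathlib import Path

def removed_params(old_manifest: Path, new_manifest: Path) -> list[str]:
    """Parameter names present in the old manifest but absent from the
    new one indicate a breaking change in the generated loader."""
    old = set(json.loads(old_manifest.read_text())["params"])
    new = set(json.loads(new_manifest.read_text())["params"])
    return sorted(old - new)
```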


Minor: Snapshot files missing trailing newlines

tests/snapshots/lorem_text.txt, lorem_markdown.md, and lorem_json.json all lack a trailing newline. This can cause noisy diffs in some editors and tools. Consider writing actual + "\n" rather than bare actual when saving snapshots.


Minor: min-dep-test version inconsistency in design doc

The docs/plans/ design doc specifies opendataloader-pdf==2.0.0 as the minimum dep test version, but the actual workflow correctly uses 2.1.0 (since detect_strikethrough requires it). The doc is already stale — worth a quick update to avoid future confusion.


Looks good

  • Moving split_pages outside the sync block is the right call — it is a LangChain-specific concern the upstream generator should not touch.
  • ext_map dict replacing the if/elif chain is cleaner.
  • --force-with-lease instead of --force in the push step is the safer choice.
  • concurrency: cancel-in-progress: false on the sync workflow is correct — you don't want a mid-flight sync to be cancelled.
  • Test coverage for Path objects, JSON edge cases, and temp cleanup is genuinely useful.
  • conftest.py refactoring of java_available() is a good DRY improvement.

Generated with Claude Code


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (4)
tests/test_regression.py (2)

127-132: ⚠️ Potential issue | 🟡 Minor

Assert that load() returned documents before iterating.

If loader.load() returns an empty list, the loop body never executes and the test passes without validating any metadata. Add an explicit assertion to catch this regression.

🛡️ Proposed fix
         docs = loader.load()
+        assert docs, "Expected at least one document from loader"
         for doc in docs:
             assert set(doc.metadata.keys()) == {"source", "format", "page"}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_regression.py` around lines 127 - 132, The test iterates over docs
= loader.load() but never asserts that docs is non-empty, allowing a silent
pass; add an explicit assertion like assert docs or assert len(docs) > 0
immediately after calling loader.load() to ensure load() returned at least one
document before the for-loop that checks each doc's metadata (refer to
loader.load() and the docs variable).

103-107: ⚠️ Potential issue | 🟠 Major

JSON snapshot test missing UPDATE_SNAPSHOTS guard.

Unlike TestRegressionText and TestRegressionMarkdown, this test auto-creates missing snapshots and skips without checking UPDATE_SNAPSHOTS. This inconsistency allows CI to pass silently when the snapshot is missing.

🐛 Proposed fix to align with other snapshot tests
         if not self.SNAPSHOT_FILE.exists():
-            self.SNAPSHOT_FILE.write_text(
-                json.dumps(actual_data, ensure_ascii=False, indent=2), encoding="utf-8"
-            )
-            pytest.skip("Snapshot created, run again to verify")
+            if UPDATE_SNAPSHOTS:
+                self.SNAPSHOT_FILE.write_text(
+                    json.dumps(actual_data, ensure_ascii=False, indent=2), encoding="utf-8"
+                )
+                pytest.skip("Snapshot created, run again to verify")
+            else:
+                pytest.fail(f"Snapshot missing: {self.SNAPSHOT_FILE}. Run with UPDATE_SNAPSHOTS=1 to create.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_regression.py` around lines 103 - 107, The snapshot creation block
in TestRegression (the code using self.SNAPSHOT_FILE, actual_data and
pytest.skip) must be guarded the same way as in
TestRegressionText/TestRegressionMarkdown: only auto-create and skip when the
UPDATE_SNAPSHOTS flag is set. Modify the block around
self.SNAPSHOT_FILE.write_text(...) to check the UPDATE_SNAPSHOTS
environment/pytest flag (e.g., os.environ.get("UPDATE_SNAPSHOTS") or the
project’s existing fixture/flag), otherwise fail the test instead of creating
the snapshot and skipping; keep references to self.SNAPSHOT_FILE and pytest.skip
so behavior matches the other snapshot tests.
.github/workflows/sync-upstream.yml (2)

158-161: ⚠️ Potential issue | 🟠 Major

Test against the exact upstream version you generated from.

Line 161 uses >=, so a rerun can validate code generated from v${VERSION} against a newer API than the one that produced the files.

🎯 Suggested fix
       - name: Install dependencies
         run: |
           pip install -e . pytest pytest-socket
-          pip install "opendataloader-pdf>=${{ env.VERSION }}"
+          pip install "opendataloader-pdf==$VERSION"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/sync-upstream.yml around lines 158 - 161, The workflow
step named "Install dependencies" currently installs opendataloader-pdf using a
range (">=${{ env.VERSION }}"), which can let tests run against a newer upstream
API than the one that generated the files; change that install to pin the
package to the exact generated version (use "==${{ env.VERSION }}" or equivalent
exact-match spec) in the pip install for opendataloader-pdf so tests run against
the same upstream version that produced the code.

102-114: ⚠️ Potential issue | 🟠 Major

Fetch the remote branch before relying on --force-with-lease.

This checkout never fetches origin/sync/v..., so git push --force-with-lease still has no lease target on reruns. If the branch already exists remotely, the push can reject and the sync stays non-idempotent.

♻️ Suggested fix
       - name: Create branch and push
         if: steps.changes.outputs.changed == 'true'
         id: branch
         run: |
-          BRANCH="sync/v${{ env.VERSION }}"
+          BRANCH="sync/v$VERSION"
           echo "name=$BRANCH" >> $GITHUB_OUTPUT
           git config user.name "github-actions[bot]"
           git config user.email "github-actions[bot]@users.noreply.github.com"
-          git checkout -b "$BRANCH"
+          git fetch origin "$BRANCH":"refs/remotes/origin/$BRANCH" || true
+          git checkout -B "$BRANCH"
           git add langchain_opendataloader_pdf/document_loaders.py pyproject.toml README.md
           git add .sync-report/ || true
-          git commit -m "sync: update to opendataloader-pdf v${{ env.VERSION }}"
+          git commit -m "sync: update to opendataloader-pdf v$VERSION"
           git push origin "$BRANCH" --force-with-lease
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/sync-upstream.yml around lines 102 - 114, The push uses
--force-with-lease but never fetches the remote branch, so the lease target can
be missing; in the "Create branch and push" step capture BRANCH and run a fetch
for origin/$BRANCH first (e.g., git fetch origin "$BRANCH" || true) then if the
remote exists check out/reset to it (git checkout -B "$BRANCH" "origin/$BRANCH")
else create the new branch as before; finally push with --force-with-lease to
origin "$BRANCH". This updates the branch discovery around BRANCH and ensures
the --force-with-lease has a proper remote lease target.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/sync-upstream.yml:
- Around line 30-31: Add a dedicated validation step that checks the raw VERSION
input (env: VERSION) against a strict semver regex (e.g.,
^[0-9]+\.[0-9]+\.[0-9]+$), fail the job if it doesn't match, and expose the
validated value as an output (steps.version.outputs.value); then update all
later uses (the sed replacement, here-doc markdown expansions, gh pr create
title, and Slack JSON payload) to reference the validated output instead of ${
env.VERSION } or ${VERSION}. Specifically, create a step named e.g.
"validate-version" that reads VERSION, validates with the regex, sets an output
value, and replace all occurrences in the generate/test/create-pr jobs where
VERSION is interpolated (sed invocation, the here-doc blocks, gh pr create
title, and Slack payload) so they consume steps.version.outputs.value to avoid
injection and quoting issues.
- Around line 169-171: The create-pr job is currently skipped whenever any
dependency (including test) fails because it lists needs: [generate, test];
change create-pr to depend only on generate (needs: [generate]) and keep the if
condition checking needs.generate.result == 'success' &&
needs.generate.outputs.changed == 'true' so the job will run even when the test
job failed; then let the existing draft logic (which reads needs.test.result)
run inside the job's steps to decide draft behavior based on the test job
outcome.
- Around line 221-234: Add an idempotency guard before running gh pr create to
skip or update an existing PR for the deterministic branch (sync/v${{
env.VERSION }}): query GitHub for an open PR with --head "$BRANCH" (e.g., gh pr
list or gh pr view targeting the branch used earlier when creating/force-pushing
the branch) and if one exists, avoid calling gh pr create (optionally update its
body/labels via gh pr edit); if no PR is found, run gh pr create with --head
"$BRANCH" and the existing flags (keep the LABELS, DRAFT_FLAG logic intact).

---

Duplicate comments:
In @.github/workflows/sync-upstream.yml:
- Around line 158-161: The workflow step named "Install dependencies" currently
installs opendataloader-pdf using a range (">=${{ env.VERSION }}"), which can
let tests run against a newer upstream API than the one that generated the
files; change that install to pin the package to the exact generated version
(use "==${{ env.VERSION }}" or equivalent exact-match spec) in the pip install
for opendataloader-pdf so tests run against the same upstream version that
produced the code.
- Around line 102-114: The push uses --force-with-lease but never fetches the
remote branch, so the lease target can be missing; in the "Create branch and
push" step capture BRANCH and run a fetch for origin/$BRANCH first (e.g., git
fetch origin "$BRANCH" || true) then if the remote exists check out/reset to it
(git checkout -B "$BRANCH" "origin/$BRANCH") else create the new branch as
before; finally push with --force-with-lease to origin "$BRANCH". This updates
the branch discovery around BRANCH and ensures the --force-with-lease has a
proper remote lease target.

In `@tests/test_regression.py`:
- Around line 127-132: The test iterates over docs = loader.load() but never
asserts that docs is non-empty, allowing a silent pass; add an explicit
assertion like assert docs or assert len(docs) > 0 immediately after calling
loader.load() to ensure load() returned at least one document before the
for-loop that checks each doc's metadata (refer to loader.load() and the docs
variable).
- Around line 103-107: The snapshot creation block in TestRegression (the code
using self.SNAPSHOT_FILE, actual_data and pytest.skip) must be guarded the same
way as in TestRegressionText/TestRegressionMarkdown: only auto-create and skip
when the UPDATE_SNAPSHOTS flag is set. Modify the block around
self.SNAPSHOT_FILE.write_text(...) to check the UPDATE_SNAPSHOTS
environment/pytest flag (e.g., os.environ.get("UPDATE_SNAPSHOTS") or the
project’s existing fixture/flag), otherwise fail the test instead of creating
the snapshot and skipping; keep references to self.SNAPSHOT_FILE and pytest.skip
so behavior matches the other snapshot tests.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 10ed07be-549c-44ea-8ef7-d88ac526898a

📥 Commits

Reviewing files that changed from the base of the PR and between 9743ebc and cbb73b9.

📒 Files selected for processing (4)
  • .github/workflows/sync-upstream.yml
  • .github/workflows/test.yml
  • pyproject.toml
  • tests/test_regression.py

Comment on lines +221 to +234
      - name: Create pull request
        run: |
          DRAFT_FLAG=""
          [ "${{ steps.status.outputs.draft }}" == "true" ] && DRAFT_FLAG="--draft"

          LABELS="sync,automated"
          [ "${{ needs.generate.outputs.breaking }}" == "true" ] && LABELS="$LABELS,breaking-change"

          gh pr create \
            --title "sync: update to opendataloader-pdf v${{ env.VERSION }}" \
            --body-file /tmp/pr-body.md \
            --base main \
            $DRAFT_FLAG \
            --label "$LABELS"

@coderabbitai coderabbitai bot Apr 2, 2026


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n .github/workflows/sync-upstream.yml | sed -n '200,250p'

Repository: opendataloader-project/langchain-opendataloader-pdf

Length of output: 2357


🏁 Script executed:

rg "sync/v|git.*branch|git.*checkout" .github/workflows/sync-upstream.yml

Repository: opendataloader-project/langchain-opendataloader-pdf

Length of output: 174


🏁 Script executed:

cat -n .github/workflows/sync-upstream.yml | sed -n '160,235p'

Repository: opendataloader-project/langchain-opendataloader-pdf

Length of output: 3365


🏁 Script executed:

cat -n .github/workflows/sync-upstream.yml | sed -n '1,160p' | grep -A 100 "generate:"

Repository: opendataloader-project/langchain-opendataloader-pdf

Length of output: 4548


🌐 Web query:

gh pr create behavior when PR already exists for branch

💡 Result:

When running gh pr create on a branch that already has an existing pull request (open or otherwise), the command fails with an error message indicating the PR already exists, including the URL to the existing PR. It does not create a duplicate PR or automatically update the existing one. Example error: a pull request for branch "head" into branch "base" already exists: https://github.com/OWNER/REPO/pull/NUMBER To update an existing PR after pushing new commits to the branch, use git push. There is no built-in "create-or-update" (upsert) behavior; scripts must check for existing PRs first using commands like gh pr list --head BRANCH before creating. Official documentation for gh pr create does not mention handling existing PRs beyond creation prompts and flags, confirming it errors on duplicates.



🏁 Script executed:

rg "gh pr create" .github/workflows/sync-upstream.yml -A 8

Repository: opendataloader-project/langchain-opendataloader-pdf

Length of output: 371


Add idempotency guard to handle PR reruns for the same version.

When the workflow reruns for an already-synced version, gh pr create will fail because a PR already exists for the deterministic branch sync/v${{ env.VERSION }}. The branch is created at line 106, force-pushed at line 114, and the current code attempts to create a new PR without checking if one already exists.

The suggested fix is needed: check for an existing open PR before creation, and optionally add --head "$BRANCH" to be explicit about the target branch.

♻️ Minimal idempotency guard
       - name: Create pull request
         run: |
+          BRANCH="sync/v$VERSION"
+          EXISTING_PR="$(gh pr list --head "$BRANCH" --state open --json number --jq '.[0].number')"
+          if [ -n "$EXISTING_PR" ] && [ "$EXISTING_PR" != "null" ]; then
+            echo "Open PR #$EXISTING_PR already exists for $BRANCH"
+            exit 0
+          fi
+
           DRAFT_FLAG=""
           [ "${{ steps.status.outputs.draft }}" == "true" ] && DRAFT_FLAG="--draft"
 
           LABELS="sync,automated"
           [ "${{ needs.generate.outputs.breaking }}" == "true" ] && LABELS="$LABELS,breaking-change"
 
           gh pr create \
             --title "sync: update to opendataloader-pdf v${{ env.VERSION }}" \
             --body-file /tmp/pr-body.md \
             --base main \
+            --head "$BRANCH" \
             $DRAFT_FLAG \
             --label "$LABELS"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/sync-upstream.yml around lines 221 - 234, Add an
idempotency guard before running gh pr create to skip or update an existing PR
for the deterministic branch (sync/v${{ env.VERSION }}): query GitHub for an
open PR with --head "$BRANCH" (e.g., gh pr list or gh pr view targeting the
branch used earlier when creating/force-pushing the branch) and if one exists,
avoid calling gh pr create (optionally update its body/labels via gh pr edit);
if no PR is found, run gh pr create with --head "$BRANCH" and the existing flags
(keep the LABELS, DRAFT_FLAG logic intact).

@claude

claude bot commented Apr 2, 2026

PR Review posted via separate message due to shell escaping constraints. See next comment.

@opendataloader-project opendataloader-project deleted a comment from coderabbitai bot Apr 3, 2026
@claude

claude bot commented Apr 3, 2026

Code Review

Overall this is a well-structured PR with clear goals: syncing the v2.2.0 upstream options, adding CI/CD pipelines, and expanding test coverage. The sync marker pattern is a smart approach to automating future syncs. A few issues worth addressing below.


Bug: Format validation fires after convert() runs

In document_loaders.py, the new ext_map validation:

ext_map = {"json": "json", "text": "txt", "html": "html", "markdown": "md"}
ext = ext_map.get(self.format)
if ext is None:
    raise ValueError(f"Invalid format '{self.format}'. ...")

This check is inside a second try block that runs after opendataloader_pdf.convert() has already been called with the invalid format. If the upstream library rejects the format (or silently produces no output), the ValueError from our code may never surface. The format should be validated in __init__, where format.lower() already runs — that's the right place to catch it early and give a clean error message before any I/O happens.
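A minimal sketch of that relocation (class body trimmed to the relevant parameters; the real loader signature is much larger):

```python
# Hedged sketch: validate `format` eagerly in __init__ so a bad value fails
# before any conversion I/O. The ext_map keys mirror the ones quoted above.
class OpenDataLoaderPDFLoader:
    _EXT_MAP = {"json": "json", "text": "txt", "html": "html", "markdown": "md"}

    def __init__(self, file_path, format: str = "text"):
        self.file_path = file_path
        self.format = format.lower()  # normalization already happens at init
        if self.format not in self._EXT_MAP:
            raise ValueError(
                f"Invalid format '{self.format}'. "
                f"Expected one of: {sorted(self._EXT_MAP)}"
            )

# Fails fast, with no I/O:
try:
    OpenDataLoaderPDFLoader("a.pdf", format="xml")
except ValueError as exc:
    print(exc)
```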


Potential issue: Slack notification failure fails the workflow

In both test-full.yml and sync-upstream.yml, the Slack notification step runs with if: always(), but if SLACK_WEBHOOK_URL is not configured, slackapi/slack-github-action will exit with an error and mark the job as failed. Consider adding a guard:

- name: Notify Slack
  if: always() && secrets.SLACK_WEBHOOK_URL != ''

Or use continue-on-error: true so a missing webhook doesn't block the workflow result.


Min-dep test pins wrong version

The PR description says:

Minimum dependency test: 79/79 pass (langchain-core 1.0.0 + opendataloader-pdf 2.0.0)

But test.yml actually installs opendataloader-pdf==2.1.0:

pip install "langchain-core==1.0.0" "opendataloader-pdf==2.1.0" pytest pytest-socket

The minimum is 2.1.0 now (correct, since detect_strikethrough was added there), but the PR description should be updated to reflect this. Also pyproject.toml bumps the bound to >=2.1.0 — this is a breaking change for anyone on opendataloader-pdf==2.0.x and should be called out explicitly in the changelog under ### Changed or noted in the PR body.


CI: Breaking change detection regex is fragile

In sync-upstream.yml:

grep -E "^\s+(self\.\w+ =|# --- )" langchain_opendataloader_pdf/document_loaders.py \
  | grep "self\." | sed 's/.*self\.\([a-z_]*\).*/\1/' | sort > /tmp/params-before.txt

This regex matches any self.x = assignment in the file, not just __init__ parameter assignments. Internal assignments (e.g., self.file_paths, self._something) or instance variables set in other methods would show up in the diff and produce false-positive breaking-change alerts, or mask real removals. Consider scoping it to the __init__ method only, or using the sync markers themselves as anchors.


Minor: conftest.py mixes fixtures with utility functions

java_available() is a plain function, not a pytest fixture. Placing utility functions in conftest.py isn't wrong, but it creates confusion since conftest.py is the conventional home for fixtures. Consider moving it to a tests/utils.py module and importing from there in both test_integration.py and test_regression.py.


Positive notes

  • The three-job structure in sync-upstream.yml (generate → test → create-pr) is clean and well-reasoned.
  • force-with-lease on the branch push is the right choice over --force.
  • Snapshot tests with UPDATE_SNAPSHOTS=1 env var is an ergonomic pattern.
  • --disable-socket on unit tests is a good guard against accidental network calls.
  • The ext_map dict replacing the if/elif chain is a real improvement.
  • detect_strikethrough test coverage is thorough (init default, explicit set, convert passthrough, all-options test).

Summary: The main actionable items are (1) move format validation to __init__, (2) handle missing Slack webhook gracefully, and (3) fix the PR description / changelog regarding the minimum dependency bump to 2.1.0.

bundolee

This comment was marked as off-topic.

Contributor

@bundolee bundolee left a comment


Review by benedict

Well-structured PR overall. Sync marker design, CI/CD security posture, and test coverage expansion (83→117) are all solid. Please fix the 3 items below, then I'll merge.

Must Fix

1. Shell injection risk in sync-upstream.yml

${{ env.VERSION }} is expanded by the GitHub Actions template engine before bash runs, so if a repository_dispatch payload contains shell metacharacters, injection occurs during the validation step itself. Use $VERSION (shell variable) instead of ${{ env.VERSION }} in all run: blocks.

# Before (injectable)
if ! echo "${{ env.VERSION }}" | grep -qE ...
# After (safe)
if ! echo "$VERSION" | grep -qE ...

2. Missing UPDATE_SNAPSHOTS guard on JSON snapshot test

test_json_structure_matches_snapshot creates the snapshot unconditionally when missing and skips. The text/markdown tests correctly gate on UPDATE_SNAPSHOTS — apply the same pattern here. Without it, CI silently skips instead of failing when the snapshot is absent.
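The guarded pattern could be sketched like this (function name and return values are illustrative; the real tests live on a class with a SNAPSHOT_FILE attribute and call pytest.skip):

```python
# Sketch: a missing snapshot fails the test unless UPDATE_SNAPSHOTS=1 opts in,
# matching the behavior of the text/markdown snapshot tests.
import json
import os
import pathlib

def check_snapshot(snapshot_file: pathlib.Path, actual_data: dict) -> str:
    update = os.environ.get("UPDATE_SNAPSHOTS") == "1"
    if not snapshot_file.exists():
        if not update:
            raise AssertionError(
                f"Snapshot {snapshot_file} missing; rerun with UPDATE_SNAPSHOTS=1"
            )
        snapshot_file.write_text(json.dumps(actual_data, indent=2))
        return "created"  # the real test would pytest.skip here instead
    assert json.loads(snapshot_file.read_text()) == actual_data
    return "matched"
```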

3. PR body heredoc indentation issue

The heredoc content in the Build PR body step is indented along with the YAML block, so ## headings will have leading whitespace and render as preformatted text. Dedent the heredoc content or use <<- with tabs.

Suggestions

  • git diff HEAD~1 in the create-pr job is fragile if the branch has merge commits. Consider git diff main...HEAD --stat instead.
  • The design doc still references opendataloader-pdf==2.0.0 for the min-dep test, but the actual implementation correctly uses 2.1.0. Worth updating for consistency.
  • README params table has inconsistent quoting of defaults (some quoted, some not). Since this will be auto-generated going forward, worth normalizing.

@claude

claude bot commented Apr 3, 2026

Additional Review Notes

Bug: test_invalid_format_raises_on_load is missing a mock

The test in TestOpenDataLoaderPDFLoaderFormatValidation calls lazy_load() without mocking opendataloader_pdf. This will not pass as written.

In lazy_load(), opendataloader_pdf.convert() runs first inside a try/except Exception. When test.pdf does not exist, convert throws, the exception is silently swallowed, and the function returns early - the second try block containing the ext_map format validation is never reached. pytest.raises(ValueError) then fails because no exception was raised.

The fix is to add mocks like other lazy_load() tests in the file use:

@patch("langchain_opendataloader_pdf.document_loaders.opendataloader_pdf")
@patch("langchain_opendataloader_pdf.document_loaders.tempfile.mkdtemp")
def test_invalid_format_raises_on_load(self, mock_mkdtemp, mock_odl):
    mock_mkdtemp.return_value = "/tmp/test"
    mock_odl.convert = MagicMock()
    loader = OpenDataLoaderPDFLoader(file_path="test.pdf", format="xml")
    with pytest.raises(ValueError, match="Invalid format"):
        list(loader.lazy_load())

Note: The existing reviews already cover the shell injection in sync-upstream.yml and the heredoc indentation issue - those remain the highest priority items.

@claude

claude bot commented Apr 3, 2026

Code Review

Overall this is a well-structured PR with clear purpose and good test coverage. Here are my observations:


Code Quality / Best Practices

Positive:

  • The sync markers (BEGIN/END SYNCED PARAMS/ASSIGNMENTS/CONVERT KWARGS) are a smart design for automated code generation — they make the boundary explicit and machine-parseable.
  • Replacing the if/elif format chain with a dict lookup (ext_map) is cleaner and makes the ValueError message self-documenting.
  • Moving split_pages after the sync block is the right organizational call since it's a LangChain-specific parameter, not an upstream option.
  • java_available() in conftest.py now includes timeout=10, which was missing in the old inline version — good catch.

Minor concerns:

  1. Format validation fires at load time, not __init__.
    test_invalid_format_raises_on_load explicitly captures this, so it's intentional. But constructing a loader with format="xml" and only discovering the error on lazy_load() is a worse UX than failing fast at __init__. Consider validating against ext_map keys in __init__ instead (or in addition to) the lazy_load path.

  2. split_pages positional-argument breakage risk is zero here, since all parameters are keyword-only in practice, but it might be worth a note in CHANGELOG.md that the internal parameter ordering changed (it's already called out as "moved after synced params block" — good).


Potential Bugs

  1. page_separator kwargs are moved outside the sync block — they're still passed to convert() after # --- END SYNCED CONVERT KWARGS ---, which is correct. But this depends on the upstream generator not touching lines below the END marker. If the generator ever truncates/replaces from BEGIN to end-of-call, these would be lost silently. Worth confirming the generator's replacement scope is bounded by the END marker.
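One way to make that bound explicit, shown as a language-agnostic illustration in Python (the real generator is generate-langchain.mjs, whose actual replacement logic is not reproduced here):

```python
# Sketch: a replacement whose scope is bounded by BOTH markers, so kwargs
# placed after the END marker can never be clobbered by the generator.
import re

source = """\
    opendataloader_pdf.convert(
        # --- BEGIN SYNCED CONVERT KWARGS ---
        old_option=self.old_option,
        # --- END SYNCED CONVERT KWARGS ---
        page_separator=self.page_separator,
    )
"""

generated = "        new_option=self.new_option,"
pattern = re.compile(
    r"(# --- BEGIN SYNCED CONVERT KWARGS ---\n)"
    r".*?"
    r"(\s*# --- END SYNCED CONVERT KWARGS ---)",
    re.DOTALL,
)
patched = pattern.sub(lambda m: m.group(1) + generated + m.group(2), source)
assert "page_separator" in patched  # content below END survives untouched
assert "new_option" in patched
print("ok")
```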

  2. min-dep-test floor is 2.1.0, not 2.0.0 — this is intentional (since detect_strikethrough requires 2.1.0 and pyproject.toml now requires >=2.1.0), but the dependency bump is a minor breaking change for anyone currently pinned to >=2.0.0. The CHANGELOG mentions this under [Unreleased] which is correct.


Performance

  1. PyPI polling in sync-upstream.yml (20 × 30s = max 10 min) is acceptable. One note: pip install --dry-run resolves the full dependency tree, which can be slow. A lighter check would be pip index versions opendataloader-pdf | grep "$VERSION" or a curl to the PyPI JSON API endpoint, though the current approach works fine.
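For illustration, the JSON-API variant might look like this (package name from the PR; parsing is shown against an inline sample so the sketch runs offline, whereas a real check would curl https://pypi.org/pypi/opendataloader-pdf/json):

```shell
# Sketch: check whether a version appears among a package's PyPI releases
# without resolving the dependency tree. grep on the release key is a crude
# but fast stand-in for proper JSON parsing (jq would be stricter).
version_on_pypi() {
  # $1 = JSON body from https://pypi.org/pypi/<package>/json, $2 = version
  printf '%s' "$1" | grep -q "\"$2\""
}

sample='{"releases": {"2.1.0": [], "2.2.0": []}}'
if version_on_pypi "$sample" "2.2.0"; then
  echo "v2.2.0 available"
fi
```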

Security

  1. GitHub Actions pinning — all third-party actions (actions/checkout@v4, actions/setup-node@v4, actions/setup-python@v5, slackapi/slack-github-action@v2.1.0) are pinned to mutable version tags, not immutable SHA hashes. A tag can be force-pushed to point to malicious code. Best practice is to pin to commit SHAs:

    uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683  # v4.2.2

    This is especially important in sync-upstream.yml which has contents: write permissions.

  2. Version injection protection — the semver regex check (^[0-9]+\.[0-9]+\.[0-9]+$) before any $VERSION usage is good. No injection risk identified.

  3. --force-with-lease on push — appropriate here since the branch is freshly created in the same job; this is the safer alternative to --force.


Test Coverage

Positive:

  • Regression snapshot tests with UPDATE_SNAPSHOTS=1 opt-in is a clean pattern.
  • Path object normalization tests (TestOpenDataLoaderPDFLoaderPathHandling) cover an easy-to-miss edge case.
  • Temp file cleanup test (test_temp_files_deleted_after_read) is a great addition.
Concerns:

  1. TestConcurrency is sequential, not concurrent — the test name implies parallel execution but it runs two loaders back-to-back. Renaming to TestSequentialIsolation or similar would be more accurate, or actually test with threading.Thread / concurrent.futures for a true concurrency test.

  2. test_format_stored_lowercase / test_format_mixed_case — these tests assert loader.format == "markdown" after passing format="MARKDOWN". If 79/79 unit tests pass, lowercase normalization must already be implemented in __init__. This is fine, just confirming the behavior is intentional and not an accidental test-passes-but-feature-missing scenario.
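A true-concurrency version of the isolation check in point 1 could look like this (the loader is stubbed out; the real test would construct two OpenDataLoaderPDFLoader instances):

```python
# Sketch: run two independent loads on separate threads and assert the
# instances do not leak state into each other. StubLoader is a stand-in.
from concurrent.futures import ThreadPoolExecutor

class StubLoader:
    def __init__(self, tag: str):
        self.tag = tag

    def load(self):
        return [f"{self.tag}-doc"]

def test_concurrent_isolation():
    a, b = StubLoader("a"), StubLoader("b")
    with ThreadPoolExecutor(max_workers=2) as pool:
        fa, fb = pool.submit(a.load), pool.submit(b.load)
        assert fa.result() == ["a-doc"]
        assert fb.result() == ["b-doc"]

test_concurrent_isolation()
```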


Summary

Area | Status
Feature implementation (detect_strikethrough) | ✅ Clean
Sync infrastructure | ✅ Well-designed
Test coverage | ✅ Comprehensive
CI workflows | ✅ Good structure
Action SHA pinning | ⚠️ Recommend fixing
Format validation timing | ⚠️ Minor UX gap
Concurrency test naming | ℹ️ Cosmetic

The action SHA pinning (#6) is the only item I'd recommend addressing before merge — everything else is either minor or informational.

🤖 Generated with Claude Code

@claude

claude bot commented Apr 3, 2026

PR Review: feat: add v2.2.0 sync infrastructure, CI/CD, and comprehensive tests

Overall this is a solid, well-structured PR. The CI/CD automation, sync marker approach, and test additions are all high-quality work. A few issues worth addressing before merge.


Potential Breaking Change

split_pages parameter position shift (document_loaders.py)

split_pages was moved from between include_header_footer and hybrid to after hybrid_fallback. In Python, non-keyword-only parameters are positional, so any caller that passes arguments positionally would silently get wrong behavior (e.g., a True intended for split_pages now lands on hybrid_fallback). This is low risk in practice since keyword usage is overwhelmingly common, but it should be noted as a behavioral risk and ideally enforced keyword-only with *:

def __init__(
    self,
    file_path: Union[str, Path, List[Union[str, Path]]],
    *,  # everything after is keyword-only
    format: str = "text",
    ...

CI/CD Issues

test.yml: minimum version inconsistency

The workflow installs opendataloader-pdf==2.1.0 for the min-dep test, but the PR description states "minimum dependency test: 79/79 pass (langchain-core 1.0.0 + opendataloader-pdf 2.0.0)". These don't match. Since detect_strikethrough was added in v2.1.0, the min version should probably be bumped to 2.1.0 in pyproject.toml as well to reflect the true minimum.

sync-upstream.yml: breaking-change detection regex is fragile

grep -E "^\s+(self\.\w+ =|# --- )" langchain_opendataloader_pdf/document_loaders.py \
  | grep "self\." | sed 's/.*self\.\([a-z_]*\).*/\1/' | sort > /tmp/params-before.txt

This regex only matches simple single-line self.x = ... assignments and will miss:

  • multi-line assignments
  • assignments indented differently (e.g., inside an if branch)

A more robust approach would be to parse the __init__ signature directly, e.g., using Python's ast module in a small helper script.
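For instance, a helper along these lines (class name taken from the repo; the sketch parses a toy source string rather than the real file):

```python
# Sketch: list __init__ parameter names via the ast module instead of
# grepping for `self.x =` assignments, which is immune to multi-line
# assignments and internal instance variables set elsewhere.
import ast

def init_params(source: str, class_name: str = "OpenDataLoaderPDFLoader"):
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            for item in node.body:
                if isinstance(item, ast.FunctionDef) and item.name == "__init__":
                    a = item.args
                    names = [x.arg for x in a.posonlyargs + a.args + a.kwonlyargs]
                    return [n for n in names if n != "self"]
    return []

sample = (
    "class OpenDataLoaderPDFLoader:\n"
    "    def __init__(self, file_path, *, format='text'):\n"
    "        pass\n"
)
print(init_params(sample))  # ['file_path', 'format']
```

Diffing this list before and after generation gives the same breaking-change signal without the regex's false positives.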

sync-upstream.yml: force-with-lease on a newly created branch

git push origin "$BRANCH" --force-with-lease

--force-with-lease without --force-with-lease=<refname>:<expect> falls back to a simple force push on a branch that has no upstream tracking ref yet. On subsequent runs for the same version, this silently overwrites any manual fixes made to the sync branch. Consider either erroring when the branch already exists, or adding a step to check.
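One possible guard, demonstrated against a throwaway local remote so the sketch is runnable (branch name and version are illustrative):

```shell
# Sketch: detect an existing remote sync branch before pushing, so a rerun
# aborts instead of silently overwriting manual fixes on the branch.
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/origin.git"
git clone -q "$tmp/origin.git" "$tmp/work" 2>/dev/null
cd "$tmp/work"
git config user.email "ci@example.com"
git config user.name "ci"
git commit -q --allow-empty -m "init"
BRANCH="sync/v2.2.0"   # illustrative version
git push -q origin "HEAD:$BRANCH"

# The guard itself: in the workflow this would run before any push.
if git ls-remote --exit-code --heads origin "$BRANCH" >/dev/null; then
  echo "branch $BRANCH exists on origin: abort instead of force-pushing"
fi
```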

sync-upstream.yml: git add .sync-report/ || true

Silently ignoring this failure means the PR body could be empty/missing context if the sync report was not generated. Better to fail loudly or add a guard:

[ -d .sync-report ] && git add .sync-report/

test-full.yml: integration tests run without --disable-socket

run: pytest tests/ -v

The full test intentionally runs integration tests (which need the network/Java), but there's no separation between which test files need network and which don't. If test-full.yml ever runs against a restricted network, it will silently skip rather than fail. Consider marking integration tests explicitly with pytest.mark.integration and keeping the isolation boundary clear.
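A sketch of that marking (marker name illustrative; it would also need registering under [tool.pytest.ini_options] markers in pyproject.toml so pytest does not warn):

```python
# Sketch: tag network/Java-dependent tests so unit lanes can exclude them,
# e.g. `pytest -m "not integration" --disable-socket` for the PR workflow.
import pytest

@pytest.mark.integration
def test_loads_real_pdf_with_java():
    ...  # would exercise the real converter against a fixture PDF

# The mark is attached to the function and visible at collection time:
print([m.name for m in test_loads_real_pdf_with_java.pytestmark])
```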


Minor Issues

conftest.py: timeout=10 on java_available() — Nice improvement over the original. The subprocess.TimeoutExpired is now correctly caught. Good fix.

test_temp_files_deleted_after_read: Creates a real temp directory and verifies individual file deletion. This is tightly coupled to the implementation's cleanup strategy. If cleanup is ever changed to shutil.rmtree(tmpdir) instead of per-file deletion, this test will fail even though the behavior is correct (or vice versa). Consider testing the observable outcome (files no longer accessible) rather than the mechanism.

docs/plans/2026-04-02-pre-release-checklist-design.md: The design doc is written primarily in Korean. For an open-source project, this limits accessibility for contributors who don't read Korean. Consider adding an English summary or translating key sections.

README: replace_invalid_chars default value change — The old README showed None, the new table shows " " (space). This is a documentation fix, but it should be verified against the actual upstream default to make sure they match.


What's Done Well

  • Sync markers (BEGIN/END SYNCED PARAMS/ASSIGNMENTS) are a clean, maintainable approach for automated code generation.
  • conftest.py refactor for java_available with timeout is a good improvement.
  • Regression snapshot tests are well-structured with the UPDATE_SNAPSHOTS=1 escape hatch.
  • detect_strikethrough tests are thorough — init, default, convert passthrough, and all-options tests are all covered.
  • test_split_json_pages_sorted and test_split_json_multiple_elements_same_page cover real edge cases.
  • fail-fast: false on all matrix jobs is the right call — seeing all failure modes simultaneously is more useful.
  • Concurrency controls on both test.yml (cancel-in-progress) and sync-upstream.yml (no cancel for releases) are correctly distinguished.

Summary

Severity | Issue
Medium | split_pages positional shift is a silent breaking change risk
Medium | min-dep version mismatch between workflow and PR description
Low | Breaking-change detection regex too fragile for production automation
Low | force-with-lease on repeated sync runs may overwrite manual fixes
Low | Design doc is Korean-only
Nit | git add .sync-report/ || true silently swallows errors

The core feature (detect_strikethrough) and test infrastructure are solid. Resolving the min-dep version inconsistency and the split_pages risk would make this ready to merge.

@claude

claude bot commented Apr 3, 2026

PR Review: feat: add v2.2.0 sync infrastructure, CI/CD, and comprehensive tests

Overall this is a well-designed PR with solid CI/CD infrastructure, good test coverage, and a clean sync mechanism. The breaking-change detection and regression snapshot system are particularly valuable additions. A few things worth addressing:


Bugs / Correctness Issues

1. secrets.* in if: conditions may not work as expected

In both sync-upstream.yml and test-full.yml:

if: always() && secrets.SLACK_WEBHOOK_URL != ''

GitHub Actions docs advise against using secrets.* directly in if: conditions. For fork-based contexts they are masked to '', and the behavior can be surprising. The recommended pattern is to expose the secret as an env: variable and check that instead:

env:
  SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
steps:
  - name: Notify Slack
    if: always() && env.SLACK_WEBHOOK_URL != ''

2. Upstream script execution in sync-upstream.yml

The generate job fetches and runs a Node.js script directly from the upstream repo at the tagged version:

node upstream/scripts/generate-langchain.mjs ...

If the upstream repo or a tag is ever compromised, this executes arbitrary code in your CI with contents: write permission. Consider pinning the action checkout by commit SHA (not just tag) or adding a separate review step before the script runs in production contexts.

3. pip install -e . pytest pytest-socket in sync-upstream.yml test job

This installs dev dependencies manually rather than using pip install -e ".[dev]". If pytest-asyncio or other dev deps are added to pyproject.toml later, the sync test job will silently diverge from the standard test setup. Should be:

pip install -e ".[dev]"

Code Quality

4. Format validation deferred to lazy_load(), not __init__()

format is lowercased at init time but the ValueError for invalid formats is only raised during lazy_load(). Users can construct an invalid loader without immediate feedback:

loader = OpenDataLoaderPDFLoader(file_path="a.pdf", format="xml")  # no error
list(loader.lazy_load())  # ValueError here

Moving the ext_map lookup (and the ValueError) to __init__ would give earlier, clearer feedback. The test test_invalid_format_raises_on_load documents this current behavior but "raises on load" is arguably worse UX than "raises on init."
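
A minimal sketch of the suggested fix, validating in __init__ before any work is done. The class name, the ext_map contents, and the supported-format set here are illustrative assumptions, not the wrapper's actual code:

```python
class OpenDataLoaderPDFLoaderSketch:
    """Sketch: fail fast on an unsupported format at construction time."""

    # Hypothetical mapping of output format -> file extension.
    _EXT_MAP = {"text": ".txt", "markdown": ".md", "json": ".json", "html": ".html"}

    def __init__(self, file_path: str, format: str = "text") -> None:
        fmt = format.lower()
        if fmt not in self._EXT_MAP:
            # Raised here, before any convert() call or file I/O.
            raise ValueError(
                f"Invalid format {format!r}; expected one of {sorted(self._EXT_MAP)}"
            )
        self.file_path = file_path
        self.format = fmt
```

With this shape, `OpenDataLoaderPDFLoaderSketch("a.pdf", format="xml")` raises immediately, and lazy_load() can assume the format is valid.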

5. Direct import from tests.conftest import java_available

In test_integration.py and test_regression.py:

from tests.conftest import java_available

This creates a hard module path dependency. When running tests with pytest tests/test_integration.py from a directory where tests isn't on sys.path, this import will fail with ModuleNotFoundError. The conventional approach is to define this as a pytest fixture in conftest.py (or place utility helpers in tests/utils.py and import from there). Since java_available() is called at collection time via pytest.mark.skipif, a simple move to tests/utils.py would fix this without needing a fixture.
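
A sketch of what a tests/utils.py helper could look like, cached so the subprocess is spawned at most once per session. The module path and exact probe are assumptions based on the review's suggestion:

```python
# tests/utils.py (hypothetical location) -- importable without tests/
# being a package, unlike conftest.py.
import shutil
import subprocess
from functools import lru_cache


@lru_cache(maxsize=None)
def java_available() -> bool:
    """Return True if a working `java` binary is on PATH."""
    if shutil.which("java") is None:
        return False
    try:
        subprocess.run(
            ["java", "-version"],
            capture_output=True,
            check=True,
            timeout=10,
        )
        return True
    except (subprocess.SubprocessError, OSError):
        return False
```

Tests would then use `from tests.utils import java_available` (or a relative import) in `pytest.mark.skipif(not java_available(), ...)`.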

6. README parameter table has inconsistent quoting of defaults

Some defaults in the synced table are quoted, others aren't:

  • image_output shows off (unquoted) but table_method shows "default" (quoted)
  • format shows text (unquoted) but image_format shows "png" (quoted)

This looks like an artifact of the upstream code generator. Worth filing an issue or fixing in langchain-sync-config.json if the generator supports overriding display format.


Minor / Nits

7. CHANGELOG.md missing version comparison links

The Keep a Changelog format conventionally includes comparison links at the bottom:

[Unreleased]: https://github.com/.../compare/v2.0.0...HEAD
[2.0.0]: https://github.com/.../releases/tag/v2.0.0

Not a blocker, but completing this makes the changelog fully conformant.

8. test_split_json_pages_sorted: the "value" key isn't a real schema field

{"page number": 3, "type": "text", "value": "page 3"},

The actual JSON schema uses "content" not "value". The test passes because sorting is done on "page number" regardless of other keys, but using the real schema field name would make the test more representative.

9. conftest.py's java_available isn't a pytest fixture

The function lives in conftest.py (conventionally the home for fixtures) but isn't decorated with @pytest.fixture. This is fine functionally since it's called directly, but it's slightly confusing placement. Either convert it to a fixture (with scope="session" to avoid repeated subprocess calls), or move it to a tests/utils.py helper module.


What's Done Well

  • Sync marker approach is elegant — a lightweight alternative to submodule or package vendoring.
  • Job separation in sync-upstream.yml (generate → test → create-pr) is the right design; draft PR on test failure is a nice touch.
  • Breaking change detection via parameter set diff is simple and effective.
  • --disable-socket in unit tests is the right call for a loader that shouldn't need network in its unit suite.
  • Regression snapshot pattern with UPDATE_SNAPSHOTS=1 env var is clean and familiar to developers.
  • Concurrency controls on both test.yml and sync-upstream.yml are well-considered.
  • Minimum dependency test targeting langchain-core==1.0.0 + opendataloader-pdf==2.1.0 is exactly the right approach for a library maintaining compatibility guarantees.

@claude

claude bot commented Apr 3, 2026

Code Review

Overall this is a well-structured PR. The sync marker approach for automated code generation is smart, and the CI/CD design follows good practices (minimal permissions, network isolation, fail-fast: false). A few issues worth addressing:


Bugs / Issues

1. Format validation fires after convert() call

In document_loaders.py, the ext_map lookup and ValueError happen after opendataloader_pdf.convert() is already called. For an invalid format like "xml", the external tool runs first (and may error or produce no output), then the ValueError is raised:

# lazy_load() — convert() runs first, THEN validation
opendataloader_pdf.convert(...)   # ← already executed with invalid format
...
ext = ext_map.get(self.format)
if ext is None:
    raise ValueError(...)         # ← too late

Validation should be in __init__ to fail fast and avoid the wasted convert call. The existing test (test_invalid_format_raises_on_load) validates the current behavior, but mocks out convert, so it doesn't catch the real-world ordering issue.

2. README table shows wrong default for replace_invalid_chars

The auto-generated README table shows " " (space) as the default:

| `replace_invalid_chars` | `str` | `" "` | ...

But the code has replace_invalid_chars: Optional[str] = None. This discrepancy will mislead users.

3. sync-upstream.yml — PyPI polling can hit job timeout

The polling loop runs up to 20 iterations × 30s = 10 minutes, but timeout-minutes: 10 is set on the same job. In practice the loop may never complete 20 retries. Consider either increasing the timeout to 15 minutes or reducing the loop to 15 attempts.

timeout-minutes: 10  # ← can race with 20×30s=600s loop below
...
for i in $(seq 1 20); do
    ...
    sleep 30
done

4. sync-upstream.yml — no guard against duplicate PRs

If two upstream releases are triggered in quick succession, the create-pr job will attempt to create a PR for the same branch twice. Adding a check like gh pr list --head "$BRANCH" --json number -q '.[0].number' before gh pr create would prevent duplicate PR errors.


Minor Issues

5. Fragile breaking-change detection regex

grep -E "^\s+(self\.\w+ =|# --- )" langchain_opendataloader_pdf/document_loaders.py \
  | grep "self\." | sed 's/.*self\.\([a-z_]*\).*/\1/' | sort

This will false-negative on parameters assigned conditionally or in multiple lines, and false-positive on any self.x = inside helper methods. Consider using an AST-based approach or maintaining a separate manifest of public parameters.
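
One possible AST-based replacement, as a sketch: parse the module and collect only the names assigned to self.<name> inside __init__, ignoring assignments in other methods. The helper name is illustrative; only the file path comes from the workflow:

```python
import ast


def init_params(source: str) -> set[str]:
    """Collect self.<name> assignment targets inside __init__ bodies."""
    params: set[str] = set()
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == "__init__":
            for stmt in ast.walk(node):
                if isinstance(stmt, ast.Assign):
                    for target in stmt.targets:
                        if (
                            isinstance(target, ast.Attribute)
                            and isinstance(target.value, ast.Name)
                            and target.value.id == "self"
                        ):
                            params.add(target.attr)
    return params
```

The workflow could run this against the file before and after generation and diff the two sets; unlike the grep pipeline, it is unaffected by multi-line assignments or self.x = in helper methods.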

6. TestConcurrency tests sequential loading, not concurrent

test_sequential_loaders_no_conflict runs two loaders one after the other. The class name implies concurrent testing — if tempdir isolation under actual concurrent access is the concern, the test should use threading.Thread or asyncio. Otherwise rename to TestSequentialLoaders to avoid confusion.
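
If actual concurrency is the intent, a sketch of a thread-based version could look like this. DummyLoader is a stand-in for the real loader, assuming only that each load() creates its own temp dir:

```python
import tempfile
import threading
from pathlib import Path


class DummyLoader:
    """Stand-in: each load() writes into a fresh, private temp dir."""

    def load(self) -> str:
        self.tmpdir = tempfile.mkdtemp()
        out = Path(self.tmpdir) / "out.txt"
        out.write_text("doc")
        return out.read_text()


def test_concurrent_loaders_no_conflict():
    loaders = [DummyLoader() for _ in range(4)]
    results: list[str] = []
    lock = threading.Lock()

    def run(loader: DummyLoader) -> None:
        text = loader.load()
        with lock:
            results.append(text)

    threads = [threading.Thread(target=run, args=(l,)) for l in loaders]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    assert results == ["doc"] * 4
    # Each loader got its own isolated temp dir, even under concurrent access.
    assert len({l.tmpdir for l in loaders}) == 4
```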

7. split_pages row may be duplicated in the README parameter table

In the old README the split_pages row appeared near the top (under file_path). After the sync, the auto-generated table adds it back, but the non-synced split_pages row at the top of the table is still there, producing a duplicate row in the rendered output. Please verify the rendered README does not have this.


Positives

  • Sync markers (BEGIN/END SYNCED PARAMS) are a clean approach for code generation — easy to grep and hard to accidentally break.
  • java_available() with @lru_cache() in conftest.py is the right call — avoids repeated subprocess spawns across the test suite.
  • Regression snapshots with an UPDATE_SNAPSHOTS=1 escape hatch is a well-known pattern; good to see it here.
  • Minimal GitHub Actions permissions (permissions: {} at top level, then per-job overrides) is a security best practice.
  • The test.yml concurrency block with cancel-in-progress: true correctly avoids duplicate CI runs on fast-push.

🤖 Generated with Claude Code

@bundolee bundolee self-requested a review April 3, 2026 03:41
Contributor

@bundolee bundolee left a comment


All 3 must-fix items confirmed resolved. LGTM — merging.

@hyunhee-jo hyunhee-jo force-pushed the feat/v2.2.0-sync-infra branch from 57bd397 to c997baa Compare April 3, 2026 03:45
@claude

claude bot commented Apr 3, 2026

placeholder test

@claude

claude bot commented Apr 3, 2026

Code Review — PR #16: feat: add v2.2.0 sync infrastructure, CI/CD, and comprehensive tests

Overall this is a well-thought-out PR with strong security hardening in the CI/CD layer and a good expansion of test coverage. The three-layer automation (sync → test → create-PR) is a solid design. Below are specific findings across all changed files.


High Priority

1. Format validation is unreachable for the advertised use case (document_loaders.py)

The new ValueError for invalid format is placed after opendataloader_pdf.convert() in lazy_load():

The flow is:

  1. try: opendataloader_pdf.convert(format=[self.format], ...)
  2. except Exception: return — silently swallows every exception and exits
  3. try: ext = ext_map.get(self.format); if ext is None: raise ValueError(...) — only reachable if step 1 succeeded

If format="xml" is passed, convert() is called with format=["xml"]. If the upstream library rejects it (raises internally), the broad except Exception handler catches it and returns — the ValueError never surfaces. The error is only raised if convert() somehow succeeds with an unknown format.

The test test_invalid_format_raises_on_load passes format="xml" without mocking opendataloader_pdf. In CI with --disable-socket and no Java, convert() will fail silently and pytest.raises(ValueError) will itself fail.

Fix: Validate format in __init__ so it fails fast and consistently, before any I/O is attempted.


2. README parameter defaults do not match Python code defaults

The auto-generated README table (from upstream options.json) shows defaults that differ from the Python wrapper:

  • hybrid: Python default is None, README shows "off"
  • hybrid_mode: Python default is None, README shows "auto"
  • replace_invalid_chars: Python default is None, README shows " " (space)

Users reading the README will expect replace_invalid_chars=None to behave like replace_invalid_chars=" ". If the upstream library does not treat None and these documented defaults identically, callers relying on the README will get unexpected behaviour.

Either update the Python defaults to match, or add a note clarifying that None means "use library default."


Medium Priority

3. conftest.py direct import is fragile (test_integration.py)

The import from tests.conftest import java_available relies on conftest.py being importable as a regular module, which requires tests/ to be a Python package (with an __init__.py) or to be on sys.path. This is not guaranteed across all invocation patterns. The idiomatic approach is either to expose java_available as a pytest fixture, or move it to a non-conftest helper module like tests/helpers.py.

4. Third-party GitHub Action not pinned to a commit SHA

slackapi/slack-github-action@v2.1.0 uses a mutable version tag. Tags can be repointed by the action author. GitHub's security hardening docs recommend pinning to a full commit SHA for third-party actions. This applies to the Slack action in both sync-upstream.yml and test-full.yml.

5. uv.lock shows dependency downgrades

The lockfile records langchain-core 1.2.23 -> 1.2.19 and requests 2.33.0 -> 2.32.5. These appear to be regressions rather than intentional pins. Please verify the lockfile was generated from the intended environment.


Low Priority / Nits

6. Slack webhook field bypasses the env-intermediary guard

The PR description calls out "Slack webhook guarded via env intermediary" as a security hardening measure, but the webhook: input field still references the secret directly. The if: env.SLACK_URL != '' guard is correct for preventing the step from running when the secret is absent, but the description is slightly misleading.

7. Design document in Korean

docs/plans/2026-04-02-pre-release-checklist-design.md is a planning document written primarily in Korean. For a public English-first repo, internal planning docs are usually better placed in issues, wikis, or internal channels. Worth a deliberate decision either way.

8. Test should be updated if format validation moves to __init__

If issue #1 is fixed, test_invalid_format_raises_on_load should assert the ValueError on construction, not on lazy_load().


What's well done

  • Security hardening in sync-upstream.yml is thorough: permissions: {} top-level with per-job overrides, shell variable VERSION instead of context expression in shell commands, cross-job outputs mapped through env: before shell evaluation, --force-with-lease for safe force-push.
  • fail-fast: false across all matrices — catches all breakages in one run.
  • concurrency: cancel-in-progress: false for sync-upstream is correct — an upstream sync should never be interrupted mid-commit.
  • PyPI availability polling in the sync workflow's test job is a nice touch; release propagation delay is a real problem.
  • Snapshot regression tests with UPDATE_SNAPSHOTS=1 env var are a clean, well-established pattern.
  • ext_map dict lookup replacing the if/elif chain is a clear improvement.
  • detect_strikethrough integration is complete: constructor, assignments, convert() kwargs, tests, README, CHANGELOG all updated consistently.

Objective: When opendataloader-pdf releases a new version, there is no
automated way to sync new options into the langchain wrapper. Every
update requires manual parameter copy-paste, README edits, and ad-hoc
testing. The project also lacks CI to catch regressions on PRs.

Approach: Build three layers of automation — sync markers with upstream
workflow (sync-upstream.yml), PR and full test workflows (test.yml,
test-full.yml), and comprehensive test expansion (34 new tests, 117
total). Add detect_strikethrough parameter, bump min dep to >=2.1.0,
harden all workflow run: blocks against shell injection.

Evidence: Loaded samples/pdf/lorem.pdf with the updated wrapper.
detect_strikethrough=True loads without error (1 doc, 462 chars).
split_pages keyword arg works in new position. format="xml" raises
ValueError as expected. opendataloader-pdf 2.2.0 convert() confirms
detect_strikethrough param exists.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hyunhee-jo hyunhee-jo force-pushed the feat/v2.2.0-sync-infra branch from c997baa to ee0bbd5 Compare April 3, 2026 03:57
@claude

claude bot commented Apr 3, 2026

Code Review

This is a well-structured PR that addresses real operational needs — automated sync, CI/CD, and expanded test coverage. The security hardening in the workflows is particularly thoughtful. A few items below warrant attention before merging.


Bug: test_invalid_format_raises_on_load will not catch the ValueError

File: tests/test_document_loaders.py, class TestOpenDataLoaderPDFLoaderFormatValidation

def test_invalid_format_raises_on_load(self):
    loader = OpenDataLoaderPDFLoader(file_path="test.pdf", format="xml")
    with pytest.raises(ValueError, match="Invalid format"):
        list(loader.lazy_load())

This test has no mock. In lazy_load(), the opendataloader_pdf.convert() call happens before the ext_map format validation. When convert raises (file not found, Java unavailable, etc.), the non-hybrid exception is silently swallowed and return is executed — the ValueError check is never reached. The test will fail in CI with Failed: DID NOT RAISE.

The fix is straightforward: add the same @patch decorators used in similar tests.

@patch("langchain_opendataloader_pdf.document_loaders.opendataloader_pdf")
@patch("langchain_opendataloader_pdf.document_loaders.tempfile.mkdtemp")
def test_invalid_format_raises_on_load(self, mock_mkdtemp, mock_odl):
    mock_mkdtemp.return_value = "/tmp/test"
    mock_odl.convert = MagicMock()
    loader = OpenDataLoaderPDFLoader(file_path="test.pdf", format="xml")
    with pytest.raises(ValueError, match="Invalid format"):
        list(loader.lazy_load())

Breaking change detection is fragile (sync-upstream.yml)

The regex used to detect removed parameters:

grep -E "^\s+(self\.\w+ =|# --- )" langchain_opendataloader_pdf/document_loaders.py \
  | grep "self\." | sed 's/.*self\.\([a-z_]*\).*/\1/' | sort > /tmp/params-before.txt

This parses Python source with grep/sed, which breaks on:

  • Multi-line assignments (self.foo = (\n ...)
  • Any self.xxx = in method bodies other than __init__ (e.g., _split_into_pages)
  • Refactors that move assignments around

Since generate-langchain.mjs already has access to options.json, comparing option names from JSON against the generated output would be more reliable than grepping the Python source.
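
A sketch of that JSON-level diff, assuming options.json exposes an "options" list of objects with a "name" field (the actual upstream schema may differ):

```python
import json


def removed_options(old_json: str, new_json: str) -> set[str]:
    """Option names present in the old options.json but missing in the new one."""
    old = {o["name"] for o in json.loads(old_json)["options"]}
    new = {o["name"] for o in json.loads(new_json)["options"]}
    return old - new
```

The sync workflow could fail (or label the PR as breaking) whenever this set is non-empty, with no source-code parsing involved.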


sync-upstream.yml: test job re-declares VERSION env redundantly

The VERSION env var in the test and create-pr jobs re-evaluates the same expression that generate already resolved. If a repository_dispatch payload is malformed between the two evaluations (unlikely but possible), they could diverge. Since generate already validates and uses the version, consider passing it as a job output instead:

outputs:
  version: ${{ env.VERSION }}  # add to generate job outputs

Then reference needs.generate.outputs.version in downstream jobs.


Minor: inconsistent quoting in the synced README params table

The auto-generated <!-- BEGIN SYNCED PARAMS TABLE --> block mixes quoted and unquoted defaults:

| Parameter | Default in table |
| --- | --- |
| format | text (unquoted) |
| image_output | off (unquoted) |
| table_method | "default" (quoted) |
| reading_order | "xycut" (quoted) |

String defaults should be consistently quoted or unquoted. This is an upstream generator issue, but worth noting for the generate-langchain.mjs implementation.


Minor: docs/plans/2026-04-02-pre-release-checklist-design.md

This design document is primarily in Korean and uses a date-stamped filename. A few considerations:

  1. Non-Korean contributors won't be able to review or follow it.
  2. Design docs in docs/plans/ with date-stamped names accumulate over time — consider whether a living document (e.g., docs/architecture.md) would serve better long-term.
  3. The document duplicates workflow YAML from the PR itself — keeping two sources of truth in sync is ongoing overhead.

This doesn't block the PR, but it's worth discussing whether this belongs in the repo or in an issue/wiki.


What's done well

  • Security hardening in the workflows is correct and well-documented: $VERSION shell expansion over ${{ env.VERSION }}, --force-with-lease instead of --force, permissions: {} at top level with per-job overrides, Slack webhook via env intermediary.
  • ext_map refactor in lazy_load() is cleaner than the if/elif chain and adds proper ValueError messaging.
  • conftest.py extraction of java_available() with lru_cache is correct and avoids repeated subprocess calls in the test suite.
  • concurrency: cancel-in-progress: false on sync-upstream is the right call — a release sync should never be cancelled mid-run.
  • Test coverage for edge cases (_split_into_pages, _split_json_into_pages, path handling, temp cleanup) fills real gaps.
  • min-dep-test job in test.yml catching minimum-version regressions is a pattern LangChain uses and is valuable here.

@hyunhee-jo hyunhee-jo merged commit f252560 into main Apr 3, 2026
10 checks passed
@hyunhee-jo hyunhee-jo deleted the feat/v2.2.0-sync-infra branch April 3, 2026 04:21

3 participants