feat: add v2.2.0 sync infrastructure, CI/CD, and comprehensive tests #16

hyunhee-jo merged 1 commit into main
Conversation
PR Review: feat: v2.2.0 sync infrastructure, CI/CD, and comprehensive test hardening

Overall this is a well-structured PR that adds meaningful automation infrastructure and test coverage. Here are my observations:

Bugs / Issues

1. `self.format = format.lower()` raises AttributeError: 'NoneType' object has no attribute 'lower' when format is None. Either revert to `self.format = format.lower() if format else "text"` or validate the argument before normalizing it.

2. README: the synced table shows

-| `format` | `str` | `text` | Output formats (comma-separated). Values: json, text, html, pdf, markdown, markdown-with-html, markdown-with-images. Default: json |
+| `format` | `str` | `"text"` | Output format. Values: json, text, html, pdf, markdown, markdown-with-html, markdown-with-images. Default: text |

Test Issues

3. Contradictory test comment in def test_split_pages_no_separators(self):
"""Content with no separators should yield nothing (no preamble match)."""
...
    assert len(docs) == 1  # ← contradicts the docstring

The docstring says "yield nothing" but asserts 1 document. Please align the docstring with the actual expected behavior.

4.

def test_invalid_format_raises_on_load(self):
loader = OpenDataLoaderPDFLoader(file_path="test.pdf", format="xml")
with pytest.raises(ValueError, match="Invalid format"):
    list(loader.lazy_load())

This test asserts a `ValueError` for unsupported formats.

CI/CD Concerns

5. The PR body is constructed by concatenating shell variables. Consider writing the static part with a quoted heredoc and appending dynamic content separately:

cat > /tmp/pr-body.md << 'EOF_BODY'
...
EOF_BODY
# Append dynamic content separately
git diff HEAD~1 --stat >> /tmp/pr-body.md
gh pr create --body-file /tmp/pr-body.md ...

6. Consider adding a guard so downstream jobs only run after a successful generate with actual changes:

if: needs.generate.result == 'success' && needs.generate.outputs.changed == 'true'

7. The PyPI poll loop exits with …

Minor Observations
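Item 6/7 aside, a bounded poll helper that fails loudly on timeout can be sketched like this; `poll_until` and the attempt/delay values are illustrative, and the real PyPI availability check (e.g. querying `pip index versions`) is stubbed out with `true`/`false` here:

```shell
# Hypothetical sketch: retry a check command up to <max_attempts> times,
# returning non-zero on timeout so the CI job fails explicitly.
poll_until() {  # usage: poll_until <max_attempts> <sleep_secs> <cmd...>
  max="$1"; delay="$2"; shift 2
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then return 0; fi            # check succeeded
    attempt=$((attempt + 1))
    if [ "$attempt" -le "$max" ]; then sleep "$delay"; fi
  done
  echo "timed out after $max attempts" >&2
  return 1
}

# Demo with stub checks: `true` succeeds immediately, `false` never does.
poll_until 3 0 true && echo "available"
poll_until 3 0 false || echo "gave up"
```

The key property is that the timeout path returns 1, so a workflow step using it fails instead of silently continuing.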
What's Done Well
🤖 Generated with Claude Code
9807e30 to a66ed7a
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the relevant setting. Use the following commands to manage reviews:
Walkthrough

Adds three GitHub Actions workflows (upstream sync with generate→test→create-pr, per-PR unit/min-dep tests, full multi-OS manual tests), introduces a changelog and a pre-release design doc, and adds …

Changes
Sequence Diagram(s)

sequenceDiagram
participant User as User
participant GitHub as GitHub Actions
participant Repo as Repository
participant PyPI as PyPI
participant Slack as Slack
User->>GitHub: Trigger sync (repository_dispatch / workflow_dispatch) with version
GitHub->>Repo: generate — checkout repo & upstream tag, run generator, detect diffs & breaking changes
alt Changes detected
Repo->>Repo: Create branch `sync/v${VERSION}` and push
GitHub->>Repo: test — checkout branch, run tests matrix
GitHub->>PyPI: Poll for `opendataloader-pdf==${VERSION}`
PyPI-->>GitHub: Version available
GitHub->>Repo: Run pytest matrix (3.10,3.13) with socket isolation
alt Tests pass
GitHub->>Repo: create-pr — draft=false
else Tests fail
GitHub->>Repo: create-pr — draft=true
end
Repo->>GitHub: Create PR (label `breaking-change` if applicable)
GitHub->>Slack: Send notification (status, breaking-change info, run link)
else No changes
GitHub->>Slack: Send notification (no changes)
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 8
♻️ Duplicate comments (1)
.github/workflows/test-full.yml (1)
Lines 1-49: ⚠️ Potential issue | 🟠 Major

Add explicit permissions to scope GITHUB_TOKEN access.

This workflow currently uses default token permissions. Define least-privilege permissions at workflow level (or per job) to reduce blast radius.
Proposed fix
 name: Full Test (Multi-platform)

 on:
   workflow_dispatch:
+
+permissions:
+  contents: read

 jobs:
   full-test:

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/test-full.yml around lines 1 - 49, Add an explicit permissions block to scope GITHUB_TOKEN to least privilege (e.g., at top-level of the workflow) instead of using defaults; update the workflow to include a permissions section such as "permissions: contents: read" (optionally also "actions: read" if any action queries workflows) so the jobs "full-test" and "notify-slack" run with limited token rights rather than full default permissions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/sync-upstream.yml:
- Around line 182-191: The Determine PR status step currently only checks
needs.test.result == "success" and treats any other value (including "skipped")
as failed; update the shell logic in that step to explicitly handle the
"skipped" state from needs.test.result (or check needs.generate.outputs.changed
to infer that tests were intentionally skipped) — either treat "skipped" the
same as "success" (set draft=false and set result to a message like "✅ All tests
passed or skipped") or add a separate branch for "skipped" that sets the desired
draft flag and a clear result message; modify the condition in the Determine PR
status step (referencing needs.test.result and optionally
needs.generate.outputs.changed) accordingly so skipped tests are handled
correctly and the result message reflects that decision.
In `@docs/plans/2026-04-02-pre-release-checklist-design.md`:
- Around line 91-95: Update the design doc to match the workflow implementation
by adding pytest-socket to the installation command; specifically edit the pip
install line that currently reads pip install "langchain-core==1.0.0"
"opendataloader-pdf==2.0.0" pytest so it includes pytest-socket (matching the
.github/workflows/test.yml), ensuring the two documents stay consistent.
In `@langchain_opendataloader_pdf/document_loaders.py`:
- Around line 69-70: Update the hybrid_timeout parameter to use a string default
of "0" and a concrete str type so runtime behavior matches the docs: change the
function/class signature's hybrid_timeout annotation from Optional[str] = None
to hybrid_timeout: str = "0", update any internal assignment that sets or
normalizes hybrid_timeout (the assignment around the previous default) to treat
it as a string, and ensure the value passed to opendataloader_pdf.convert(...)
is forwarded as this string (not None). Adjust the inline docstring/example text
if needed so the documented default ("0" meaning no timeout) matches the
implemented default.
In `@README.md`:
- Around line 238-260: The table shows inconsistent quoting for default string
values; standardize all string defaults to use consistent quoting (either quote
all string defaults or none). Update the default cells for parameters like
`format` (text), `reading_order` (xycut), `image_format` (png), `hybrid` (off),
`hybrid_mode` (auto) and any other string defaults so they follow the chosen
style across the entire synced params table (e.g., change text → "text", xycut →
"xycut", png → "png" or remove quotes everywhere), keeping the parameter names
(e.g., `format`, `reading_order`, `image_format`, `hybrid`, `hybrid_mode`)
unchanged.
In `@tests/snapshots/lorem_markdown.md`:
- Line 3: The snapshot ending with the long "Lorem ipsum..." paragraph must have
exactly one trailing newline at EOF to satisfy markdownlint MD047; open the file
containing that paragraph and ensure the file ends with a single '\n' (remove
any extra blank lines or missing newline), so the final byte is a single newline
character after the paragraph.
In `@tests/test_integration.py`:
- Around line 323-334: The test test_load_with_directory_path currently asserts
only ">= 1"; change it to compute the expected number of PDF files in
SAMPLES_DIR and assert equality with the loaded documents length: after creating
OpenDataLoaderPDFLoader (with split_pages=False) call loader.load(), compute
expected_count by listing files in SAMPLES_DIR that match the PDF pattern (e.g.
using pathlib.Path(SAMPLES_DIR).glob("*.pdf")) and replace the assertion with
assert len(documents) == expected_count so the test verifies directory-loading
loads exactly all PDFs.
In `@tests/test_regression.py`:
- Around line 140-160: Rename the test function
test_parallel_loaders_no_conflict (in class TestConcurrency) to
test_multiple_loaders_no_conflict and update its docstring to state it verifies
instance-level isolation (not concurrent execution), or alternatively update
only the docstring to clarify it tests separate loader instances using
OpenDataLoaderPDFLoader and their load() results are independent; keep the same
assertions and use of load() unchanged.
- Around line 18-23: In the java_available() function, explicitly set
check=False on the subprocess.run(...) call so the invocation does not raise
CalledProcessError and your manual returncode check remains the authority;
update the subprocess.run call in java_available to include the check=False
argument while keeping capture_output=True and text=True.
---
Duplicate comments:
In @.github/workflows/test-full.yml:
- Around line 1-49: Add an explicit permissions block to scope GITHUB_TOKEN to
least privilege (e.g., at top-level of the workflow) instead of using defaults;
update the workflow to include a permissions section such as "permissions:
contents: read" (optionally also "actions: read" if any action queries
workflows) so the jobs "full-test" and "notify-slack" run with limited token
rights rather than full default permissions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 88436707-70e6-4793-ab44-e0ca4565b7d5
📒 Files selected for processing (14)

.github/workflows/sync-upstream.yml
.github/workflows/test-full.yml
.github/workflows/test.yml
CHANGELOG.md
README.md
docs/plans/2026-04-02-pre-release-checklist-design.md
langchain_opendataloader_pdf/document_loaders.py
pyproject.toml
tests/snapshots/lorem_json.json
tests/snapshots/lorem_markdown.md
tests/snapshots/lorem_text.txt
tests/test_document_loaders.py
tests/test_integration.py
tests/test_regression.py
hybrid_timeout: Optional[str] = None,
hybrid_fallback: bool = False,
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python -m pip install --quiet "opendataloader-pdf>=2.0.0"
python - <<'PY'
import inspect
import opendataloader_pdf
sig = inspect.signature(opendataloader_pdf.convert)
p = sig.parameters.get("hybrid_timeout")
print("convert signature:", sig)
print("hybrid_timeout default:", p.default if p else "<missing>")
print("hybrid_timeout annotation:", p.annotation if p else "<missing>")
PY

Repository: opendataloader-project/langchain-opendataloader-pdf
Length of output: 1439
🏁 Script executed:
cat -n langchain_opendataloader_pdf/document_loaders.py | head -150

Repository: opendataloader-project/langchain-opendataloader-pdf
Length of output: 7802
🏁 Script executed:
cat -n langchain_opendataloader_pdf/document_loaders.py | sed -n '270,280p'

Repository: opendataloader-project/langchain-opendataloader-pdf
Length of output: 715
Change hybrid_timeout default from None to "0" to match documented behavior.
Line 69 defaults to None while the documentation at line 114 explicitly states the default should be "0" (no timeout). This value is forwarded unchanged to opendataloader_pdf.convert(...) at line 276, causing the actual default behavior to diverge from the documented contract. Update the parameter default and type annotation to hybrid_timeout: str = "0" to align code with documentation.
Proposed fix
- hybrid_timeout: Optional[str] = None,
+ hybrid_timeout: str = "0",

Also applies to: lines 114 (documentation), 142 (assignment), 273–277 (forwarding)
🧰 Tools
🪛 Ruff (0.15.7)
[warning] 70-70: Boolean-typed positional argument in function definition
(FBT001)
[warning] 70-70: Boolean default positional argument in function definition
(FBT002)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@langchain_opendataloader_pdf/document_loaders.py` around lines 69 - 70,
Update the hybrid_timeout parameter to use a string default of "0" and a
concrete str type so runtime behavior matches the docs: change the
function/class signature's hybrid_timeout annotation from Optional[str] = None
to hybrid_timeout: str = "0", update any internal assignment that sets or
normalizes hybrid_timeout (the assignment around the previous default) to treat
it as a string, and ensure the value passed to opendataloader_pdf.convert(...)
is forwarded as this string (not None). Adjust the inline docstring/example text
if needed so the documented default ("0" meaning no timeout) matches the
implemented default.
Fixed — hybrid_timeout docstring updated to "0" (no timeout). Also bumped min dep to >=2.1.0.
@hyunhee-jo, thanks for updating the docstring and bumping the minimum dependency. However, the core concern is still present in the current file: the actual Python default value for hybrid_timeout remains None (Line 69), while the docstring now documents it as "0". This means callers who don't pass hybrid_timeout will still have None forwarded to opendataloader_pdf.convert(...), not "0".
To fully close the gap, the parameter signature itself still needs to change:
- hybrid_timeout: Optional[str] = None,
+ hybrid_timeout: str = "0",

If the upstream opendataloader_pdf.convert() treats None and "0" identically (i.e., both mean "no timeout"), then keeping None is functionally safe but still semantically inconsistent with the documented contract. Could you confirm whether that's the case? If so, the current state is acceptable as a doc-only fix, but it would still be cleaner to align the code default with the docs.
Fixed — docstring updated to "0" (no timeout). Minimum dep bumped to >=2.1.0.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
pip install --quiet "opendataloader-pdf>=2.1.0" 2>/dev/null
python - <<'PY'
import inspect
import opendataloader_pdf
sig = inspect.signature(opendataloader_pdf.convert)
p = sig.parameters.get("hybrid_timeout")
print("hybrid_timeout default:", repr(p.default) if p else "<missing>")
print("hybrid_timeout annotation:", p.annotation if p else "<missing>")
PY
# Also try to find source handling of hybrid_timeout=None
python - <<'PY'
import opendataloader_pdf, inspect, pathlib, ast, sys
src_file = pathlib.Path(inspect.getfile(opendataloader_pdf))
print("Source file:", src_file)
src = src_file.read_text(encoding="utf-8")
# Find lines referencing hybrid_timeout
for i, line in enumerate(src.splitlines(), 1):
if "hybrid_timeout" in line:
print(f" L{i}: {line}")
PY
Length of output: 171
@hyunhee-jo, I've now inspected the upstream package directly. The script confirms that opendataloader_pdf.convert() itself declares hybrid_timeout: Optional[str] = None — so passing None from this loader is exactly right and matches the upstream API's own default. The "0" (no timeout) in the docstring correctly describes the effective Java-engine behavior when no value is supplied, not a Python-level literal that needs to be mirrored in the signature.
My original suggestion to change to hybrid_timeout: str = "0" was incorrect — doing so would send an explicit "0" string instead of letting the upstream handle its own default, which could behave subtly differently. No code change is needed here; the current Optional[str] = None default is the right choice.
Apologies for the noise on this one — the docstring update you made is the correct and complete fix.
✏️ Learnings added
Learnt from: hyunhee-jo
Repo: opendataloader-project/langchain-opendataloader-pdf PR: 16
File: langchain_opendataloader_pdf/document_loaders.py:69-70
Timestamp: 2026-04-03T00:22:12.277Z
Learning: In `langchain_opendataloader_pdf/document_loaders.py`, `hybrid_timeout: Optional[str] = None` is intentionally correct. The upstream `opendataloader_pdf.convert()` also declares `hybrid_timeout: Optional[str] = None`. The docstring phrase `"0" (no timeout)` describes the upstream Java engine's effective behavior when `None` is passed — it does NOT mean the Python default should be changed to the string `"0"`. Do not flag this as a mismatch in future reviews.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: hyunhee-jo
Repo: opendataloader-project/langchain-opendataloader-pdf PR: 16
File: langchain_opendataloader_pdf/document_loaders.py:65-73
Timestamp: 2026-04-03T00:14:47.404Z
Learning: In `langchain_opendataloader_pdf/document_loaders.py`, the `OpenDataLoaderPDFLoader.__init__` constructor is documented and contracted as keyword-argument-only usage. All public examples use keyword args, so positional argument ordering (e.g., position of `split_pages` relative to `detect_strikethrough` or hybrid params) is intentionally not part of the public API contract.
<!-- BEGIN SYNCED PARAMS TABLE -->
| `format` | `str` | `text` | Output format. Values: text, markdown, json, html. Default: text |
| `quiet` | `bool` | `False` | Suppress console logging output |
| `content_safety_off` | `List[str]` | `None` | Disable content safety filters. Values: all, hidden-text, off-page, tiny, hidden-ocg |
| `password` | `str` | `None` | Password for encrypted PDF files |
| `keep_line_breaks` | `bool` | `False` | Preserve original line breaks in extracted text |
| `replace_invalid_chars` | `str` | `" "` | Replacement character for invalid/unrecognized characters. Default: space |
| `use_struct_tree` | `bool` | `False` | Use PDF structure tree (tagged PDF) for reading order and semantic structure |
| `table_method` | `str` | `"default"` | Table detection method. Values: default (border-based), cluster (border + cluster). Default: default |
| `reading_order` | `str` | `"xycut"` | Reading order algorithm. Values: off, xycut. Default: xycut |
| `image_output` | `str` | `off` | Image output mode. Values: off (no images), embedded (Base64), external (file references). Default: off |
| `image_format` | `str` | `"png"` | Output format for extracted images. Values: png, jpeg. Default: png |
| `image_dir` | `str` | `None` | Directory for extracted images |
| `sanitize` | `bool` | `False` | Enable sensitive data sanitization. Replaces emails, phone numbers, IPs, credit cards, and URLs with placeholders |
| `pages` | `str` | `None` | Pages to extract (e.g., "1,3,5-7"). Default: all pages |
| `include_header_footer` | `bool` | `False` | Include page headers and footers in output |
| `content_safety_off` | `List[str]` | `None` | Disable safety filters: `"hidden-text"`, `"off-page"`, `"tiny"`, `"hidden-ocg"`, `"all"` |
| `replace_invalid_chars` | `str` | `None` | Replacement for invalid characters |
| `hybrid` | `str` | `None` | Hybrid AI backend: `"docling-fast"`. Requires running backend server |
| `hybrid_mode` | `str` | `None` | `"auto"` (route complex pages) or `"full"` (route all pages) |
| `hybrid_url` | `str` | `None` | Backend server URL. Default: `http://localhost:5002` |
| `hybrid_timeout` | `str` | `None` | Backend timeout in ms. Default: `"30000"` |
| `hybrid_fallback` | `bool` | `False` | Fall back to Java extraction on backend failure |
| `detect_strikethrough` | `bool` | `False` | Detect strikethrough text and wrap with ~~ in Markdown output (experimental) |
| `hybrid` | `str` | `"off"` | Hybrid backend (requires a running server). Quick start: pip install "opendataloader-pdf[hybrid]" && opendataloader-pdf-hybrid --port 5002. For remote servers use --hybrid-url. Values: off (default), docling-fast |
| `hybrid_mode` | `str` | `"auto"` | Hybrid triage mode. Values: auto (default, dynamic triage), full (skip triage, all pages to backend) |
| `hybrid_url` | `str` | `None` | Hybrid backend server URL (overrides default) |
| `hybrid_timeout` | `str` | `"0"` | Hybrid backend request timeout in milliseconds (0 = no timeout). Default: 0 |
| `hybrid_fallback` | `bool` | `False` | Opt in to Java fallback on hybrid backend error (default: disabled) |
<!-- END SYNCED PARAMS TABLE -->
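The BEGIN/END marker pair above is what lets a sync script regenerate the table while leaving hand-written README content untouched. A minimal sketch of that idea — `replace_synced_block` is an illustrative helper, not this repo's actual sync script:

```python
import re

BEGIN = "<!-- BEGIN SYNCED PARAMS TABLE -->"
END = "<!-- END SYNCED PARAMS TABLE -->"

def replace_synced_block(readme: str, new_table: str) -> str:
    """Swap everything between the markers (inclusive) for a fresh table."""
    pattern = re.compile(re.escape(BEGIN) + r".*?" + re.escape(END), re.DOTALL)
    replacement = f"{BEGIN}\n{new_table}\n{END}"
    # A callable replacement avoids backslash-escape surprises if the
    # generated table ever contains characters like "\1".
    return pattern.sub(lambda _match: replacement, readme, count=1)
```

Because only the marked region is rewritten, a diff on the README after sync shows exactly what changed upstream.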
🧹 Nitpick | 🔵 Trivial
Minor formatting inconsistency in default values.
Some string defaults are quoted ("default", "xycut", "png") while others are not (text, off). Consider making this consistent for readability.
✏️ Suggested consistency fix
-| `format` | `str` | `text` | Output format. Values: text, markdown, json, html. Default: text |
+| `format` | `str` | `"text"` | Output format. Values: text, markdown, json, html. Default: text |
...
-| `image_output` | `str` | `off` | Image output mode. Values: off (no images), embedded (Base64), external (file references). Default: off |
+| `image_output` | `str` | `"off"` | Image output mode. Values: off (no images), embedded (Base64), external (file references). Default: off |

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@README.md` around lines 238 - 260, The table shows inconsistent quoting for
default string values; standardize all string defaults to use consistent quoting
(either quote all string defaults or none). Update the default cells for
parameters like `format` (text), `reading_order` (xycut), `image_format` (png),
`hybrid` (off), `hybrid_mode` (auto) and any other string defaults so they
follow the chosen style across the entire synced params table (e.g., change text
→ "text", xycut → "xycut", png → "png" or remove quotes everywhere), keeping the
parameter names (e.g., `format`, `reading_order`, `image_format`, `hybrid`,
`hybrid_mode`) unchanged.
def test_load_with_directory_path(self):
    """Test loading with a directory path."""
    loader = OpenDataLoaderPDFLoader(
        file_path=str(SAMPLES_DIR),
        format="text",
        quiet=True,
        split_pages=False,
    )
    documents = loader.load()
    # Should load all PDFs in the directory
    assert len(documents) >= 1
🧹 Nitpick | 🔵 Trivial
Strengthen directory-path validation assertion.
Line 333 (>= 1) is too weak for a behavior test that claims directory loading. With split_pages=False, assert the loaded document count matches the number of PDFs in SAMPLES_DIR.
Proposed fix
def test_load_with_directory_path(self):
"""Test loading with a directory path."""
+ expected_pdf_count = len(list(SAMPLES_DIR.glob("*.pdf")))
+ if expected_pdf_count == 0:
+ pytest.skip("No sample PDFs found in samples directory")
+
loader = OpenDataLoaderPDFLoader(
file_path=str(SAMPLES_DIR),
format="text",
quiet=True,
split_pages=False,
)
documents = loader.load()
- # Should load all PDFs in the directory
- assert len(documents) >= 1
+    assert len(documents) == expected_pdf_count
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_integration.py` around lines 323 - 334, The test
test_load_with_directory_path currently asserts only ">= 1"; change it to
compute the expected number of PDF files in SAMPLES_DIR and assert equality with
the loaded documents length: after creating OpenDataLoaderPDFLoader (with
split_pages=False) call loader.load(), compute expected_count by listing files
in SAMPLES_DIR that match the PDF pattern (e.g. using
pathlib.Path(SAMPLES_DIR).glob("*.pdf")) and replace the assertion with assert
len(documents) == expected_count so the test verifies directory-loading loads
exactly all PDFs.
tests/test_regression.py (outdated)
def java_available() -> bool:
    try:
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
        return result.returncode == 0
    except FileNotFoundError:
        return False
🧹 Nitpick | 🔵 Trivial
Add explicit check=False to subprocess.run.
The check parameter should be explicitly set for clarity. While the current behavior is correct (checking returncode manually), making it explicit improves readability.
✏️ Proposed fix
def java_available() -> bool:
try:
- result = subprocess.run(["java", "-version"], capture_output=True, text=True)
+ result = subprocess.run(["java", "-version"], capture_output=True, text=True, check=False)
return result.returncode == 0
except FileNotFoundError:
        return False

🧰 Tools
🪛 Ruff (0.15.7)
[warning] 20-20: subprocess.run without explicit check argument
Add explicit check=False
(PLW1510)
[error] 20-20: Starting a process with a partial executable path
(S607)
[warning] 21-21: Consider moving this statement to an else block
(TRY300)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_regression.py` around lines 18 - 23, In the java_available()
function, explicitly set check=False on the subprocess.run(...) call so the
invocation does not raise CalledProcessError and your manual returncode check
remains the authority; update the subprocess.run call in java_available to
include the check=False argument while keeping capture_output=True and
text=True.
Code Review

Overall this is a solid PR. The sync infrastructure, test coverage expansion, and regression snapshots are all valuable additions. A few issues worth addressing before merge:

Bugs

1. Optional[str] on format will crash at runtime if None is passed.

2. Shell injection risk in sync-upstream.yml PR body construction. The PR body is built via shell string concatenation and passed to gh pr create.

3. create-pr job logic bug with always(). The condition always() lets create-pr run even when earlier jobs did not succeed.

Correctness / Misleading tests

4. test_parallel_loaders_no_conflict does not test parallelism. The test runs two loaders sequentially. Either rename it to test_multiple_loaders_no_conflict or make it genuinely concurrent.

5. test_split_pages_no_separators has a contradictory docstring. The docstring says "should yield nothing (no preamble match)" but the assertion is len(docs) == 1.

6. Regression snapshot metadata assertion is fragile.

Code quality

7. split_pages positional ordering change. Moving split_pages changes the constructor's positional argument order.

8. Breaking change detection regex is fragile (the grep/sed pipeline in the sync workflow).

9. subprocess.run in regression tests lacks a timeout.

Documentation

10. README format parameter default is self-contradictory. In the synced table, the Default column shows `text` while the description text ends with "Default: json".

11. docs/plans/ design doc is entirely in Korean. Consider adding an English summary or noting the language in the file header for international contributors.

Minor

The regression snapshot approach and the CI/CD generate → test → create-pr separation are well-designed. Priority fixes before merge: 1 (Optional crash), 2 (shell injection), 3 (always() logic).

Generated with Claude Code https://claude.com/claude-code
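The Optional-format crash flagged above, together with the guard proposed elsewhere in this thread (`format.lower() if format else "text"`), can be sketched with hypothetical names (`LoaderSketch` and `VALID_FORMATS` are illustrative, not the library's real code):

```python
VALID_FORMATS = {"json", "text", "html", "markdown"}

class LoaderSketch:
    """Illustrative loader constructor that fails fast on bad `format`."""

    def __init__(self, file_path, format=None):
        self.file_path = file_path
        # Guard against None before calling .lower(), as the review suggests,
        # then validate eagerly so errors surface at construction time
        # rather than inside lazy_load().
        normalized = format.lower() if format else "text"
        if normalized not in VALID_FORMATS:
            raise ValueError(f"Invalid format: {format!r}")
        self.format = normalized
```

With this shape, `LoaderSketch("a.pdf")` defaults to `"text"`, and an unsupported value raises immediately instead of crashing later with an AttributeError.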
PR Review: feat: add v2.2.0 sync infrastructure, CI/CD, and comprehensive tests

Overall this is a well-structured PR with excellent test coverage and a thoughtful CI/CD design. The three-job sync workflow (generate → test → create-pr) is particularly clean. A few issues worth addressing before merge:

Bugs / Correctness Issues

1. Temp directory is never cleaned up
# Current
output_dir = tempfile.mkdtemp(dir=tempfile.gettempdir())
# ... individual files deleted, but output_dir stays
# Fix: use a context manager
import shutil
output_dir = tempfile.mkdtemp()
try:
...
finally:
    shutil.rmtree(output_dir, ignore_errors=True)

2. Silent failure on conversion error swallows errors from callers

When conversion raises:

except Exception as e:
if self.hybrid:
raise
logger.error(f"Error during conversion: {e}")
    return  # caller gets no documents and no exception

Consider always re-raising, or at minimum raising after logging so callers can handle the failure.

3. …
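The "raise after logging" suggestion can be sketched as follows; `run_conversion` is a hypothetical wrapper, not the loader's actual code:

```python
import logging

logger = logging.getLogger(__name__)

def run_conversion(convert, *args, **kwargs):
    """Log conversion failures for context, then re-raise for the caller."""
    try:
        return convert(*args, **kwargs)
    except Exception:
        # The log entry preserves the traceback; re-raising means callers
        # never receive a silent empty result.
        logger.exception("Error during conversion")
        raise
```

Callers that want the old swallow-and-continue behavior can still catch the exception themselves, but the default stops being silent data loss.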
Security / Permissions

4. Missing explicit permissions on the create-pr job:
needs: [generate, test]
permissions:
contents: read
  pull-requests: write

Code Quality

5. The format is validated at line 237, so in practice:

ext_map = {"json": "json", "text": "txt", "html": "html", "markdown": "md"}
ext = ext_map[self.format]

6. Line 33 runs:

# Current
run: pip install -e ".[dev]" pytest-socket
# Fix
run: pip install -e ".[dev]"

Test Quality

7. The class is named …

8. JSON regression test compares structure, not content
What's Working Well
|
PR Review: feat: add v2.2.0 sync infrastructure, CI/CD, and comprehensive tests

This is a well-structured PR with solid test coverage and a thoughtful CI/CD design. The sync marker strategy is a particularly nice approach for maintainability. Below is feedback organized by area.

Bug: Temp directory is never cleaned up
# Current: only files are deleted, not the directory
try:
file.unlink()
except Exception as e:
    logger.error(f"Error deleting temp file '{file}': {e}")

Consider wrapping the temp directory in a `tempfile.TemporaryDirectory()` context manager:

with tempfile.TemporaryDirectory() as output_dir:
opendataloader_pdf.convert(input_path=self.file_paths, output_dir=output_dir, ...)
# process files...
    # directory auto-cleaned on exit

This would also make the …

Design: Format validation is deferred to load time
| Area | Finding | Severity |
|---|---|---|
| document_loaders.py | Temp directory never deleted | Bug |
| document_loaders.py | Format validation deferred to load time | Design |
| sync-upstream.yml | VERSION not validated before use in shell | Security |
| test-full.yml | Missing concurrency group | Minor |
| test.yml | Unpinned pytest in min-dep job | Minor |
| test_regression.py | Snapshot sensitivity across environments | Observation |

Overall this is good work — the sync marker strategy, breaking change detection, network isolation for unit tests, and minimum dependency test are all solid engineering choices. The temp directory leak is the most important item to address before merging.
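For the "VERSION not validated before use in shell" finding, one minimal guard is to accept only plain semver before the value reaches branch names or PR bodies; `validate_version` is an illustrative helper, not the workflow's current code:

```shell
# Hypothetical guard: reject anything that is not strictly MAJOR.MINOR.PATCH,
# so shell metacharacters in a dispatched input can never be interpolated.
validate_version() {
  printf '%s' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'
}

validate_version "2.2.0" && echo "ok"
validate_version '2.2.0"; rm -rf /tmp/x' || echo "rejected"
```

Placed at the top of the generate job, a failed check can `exit 1` before any `git checkout -b "sync/v${VERSION}"` style interpolation runs.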
PR Review: feat: add v2.2.0 sync infrastructure, CI/CD, and comprehensive tests

Good overall structure. The job separation in the sync workflow is clean.

Bug: …
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
langchain_opendataloader_pdf/document_loaders.py (1)
253-278: ⚠️ Potential issue | 🔴 Critical — Remove or gate `detect_strikethrough` when calling `opendataloader_pdf.convert()`.

The code unconditionally passes `detect_strikethrough` to `convert()` (line 272), but this parameter only exists in `opendataloader-pdf>=2.1.0`. The minimum tested version in `.github/workflows/test.yml` is still 2.0.0, which lacks this parameter. Any conversion with version 2.0.0 will fail immediately with `TypeError: convert() got an unexpected keyword argument 'detect_strikethrough'`. Mocked tests in `tests/test_document_loaders.py` won't catch this. Either gate the parameter based on the installed `convert()` signature, or bump the minimum dependency requirement to 2.1+ and update the workflow accordingly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@langchain_opendataloader_pdf/document_loaders.py` around lines 253 - 278, The call to opendataloader_pdf.convert unconditionally passes detect_strikethrough which doesn't exist in older opendataloader-pdf versions; update the call site in document_loaders.py to only include detect_strikethrough when convert accepts it (e.g., use inspect.signature(opendataloader_pdf.convert) or check hasattr/parameter names) and fall back to calling convert without that kwarg on older versions, ensuring the code path around opendataloader_pdf.convert (and any wrapper function that builds kwargs) conditionally adds detect_strikethrough rather than always passing it; alternatively, if you prefer to require the newer library, update the package minimum to >=2.1.0 and the test/workflow configs accordingly.
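The signature-gating option can be sketched as below. `convert_v200` and `call_convert` are hypothetical stand-ins, not real opendataloader-pdf code; they only illustrate the `inspect.signature` check the prompt describes.

```python
import inspect

def convert_v200(input_path, output_dir):
    """Stand-in for a 2.0.0-era convert() that lacks detect_strikethrough."""
    return {"input_path": input_path, "output_dir": output_dir}

def call_convert(convert, input_path, output_dir, detect_strikethrough=False):
    kwargs = {"input_path": input_path, "output_dir": output_dir}
    # Only forward the kwarg when the installed convert() actually accepts it,
    # so older versions don't raise TypeError on an unexpected keyword.
    params = inspect.signature(convert).parameters
    if "detect_strikethrough" in params:
        kwargs["detect_strikethrough"] = detect_strikethrough
    return convert(**kwargs)
```

With this guard, the same call site works against both the old and new `convert()` signatures; the alternative remains simply bumping the minimum dependency to 2.1+.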
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/sync-upstream.yml:
- Around line 181-215: The workflow's git diff HEAD~1 fails because
actions/checkout@v4 checks out a single commit by default; update the checkout
step (the actions/checkout@v4 entry) to fetch the parent commit by adding a
fetch-depth setting (e.g., set fetch-depth: 0 to fetch full history or
fetch-depth: 2 to include the immediate parent), or alternatively run a git
fetch --unshallow before the git diff in the Build PR body step so git diff
HEAD~1 can access the parent commit.
- Around line 105-114: The workflow reuses BRANCH="sync/v${{ env.VERSION }}" and
will fail on git push if that branch already exists; make the update idempotent
by replacing git checkout -b "$BRANCH" with a resetable command (e.g., use git
checkout -B "$BRANCH" to recreate/reset the local branch) and push with safety
(replace git push origin "$BRANCH" with git push --force-with-lease origin
"$BRANCH"); update the git commands around BRANCH, git checkout -b and git push
origin to use git checkout -B and git push --force-with-lease so reruns don’t
error on non-fast-forward.
- Around line 156-159: Change the pip install invocation in the "Install
dependencies" job so the generated code is tested against the exact
opendataloader-pdf version used to generate it: update the pip install command
that currently uses "opendataloader-pdf>=${{ env.VERSION }}" to pin with "=="
(i.e., use equality instead of a greater-than constraint) in the Install
dependencies step so reruns cannot pull a newer, incompatible API.
In `@tests/test_regression.py`:
- Around line 56-58: The test currently auto-creates missing snapshots by
calling self.SNAPSHOT_FILE.write_text(...) and then pytest.skip(...); change
this behavior so tests fail when a committed snapshot is missing instead of
regenerating it: remove the write_text + pytest.skip flow and replace with an
explicit assertion that self.SNAPSHOT_FILE.exists() (or raise pytest.fail with a
clear message) in the snapshot verification code paths (including the blocks
using self.SNAPSHOT_FILE at the ranges around 56-58, 80-82, and 103-107). If you
want to keep an opt-in update path, implement a separate explicit mechanism
(e.g. check an environment variable like UPDATE_SNAPSHOTS or a pytest CLI flag)
and only perform write_text when that opt-in flag is set; do not auto-create or
skip in the normal verification path.
- Around line 127-133: The test iterates over docs = loader.load() without
verifying any items exist, so add an explicit assertion that loader.load()
returned at least one document (e.g., assert docs or assert len(docs) > 0)
before the for loop to ensure the subsequent metadata checks (in the loop
iterating over doc.metadata) actually run; update the test in
tests/test_regression.py around the use of loader.load() and the docs variable
to include this non-empty check prior to iterating.
---
Outside diff comments:
In `@langchain_opendataloader_pdf/document_loaders.py`:
- Around line 253-278: The call to opendataloader_pdf.convert unconditionally
passes detect_strikethrough which doesn't exist in older opendataloader-pdf
versions; update the call site in document_loaders.py to only include
detect_strikethrough when convert accepts it (e.g., use
inspect.signature(opendataloader_pdf.convert) or check hasattr/parameter names)
and fall back to calling convert without that kwarg on older versions, ensuring
the code path around opendataloader_pdf.convert (and any wrapper function that
builds kwargs) conditionally adds detect_strikethrough rather than always
passing it; alternatively, if you prefer to require the newer library, update
the package minimum to >=2.1.0 and the test/workflow configs accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 514cd073-28ee-4d47-9560-80ea01ede0e5
📒 Files selected for processing (6)
- .github/workflows/sync-upstream.yml
- .github/workflows/test-full.yml
- .github/workflows/test.yml
- langchain_opendataloader_pdf/document_loaders.py
- tests/test_integration.py
- tests/test_regression.py
.github/workflows/sync-upstream.yml
Outdated
```yaml
- name: Install dependencies
  run: |
    pip install -e . pytest pytest-socket
    pip install "opendataloader-pdf>=${{ env.VERSION }}"
```
🧩 Analysis chain
🏁 Script executed:
```shell
# First, let's read the full workflow file to understand the context
wc -l .github/workflows/sync-upstream.yml
```

Repository: opendataloader-project/langchain-opendataloader-pdf
Length of output: 134
🏁 Script executed:
```shell
# Read the entire workflow file to see the context around lines 156-159
cat -n .github/workflows/sync-upstream.yml
```

Repository: opendataloader-project/langchain-opendataloader-pdf
Length of output: 10333
Pin to exact version in test job.
The generated code (created in Job 1 from opendataloader-pdf v${VERSION}) should be tested against that exact version. Currently, line 159 uses >=, which allows a newer version to be installed on reruns—potentially testing against a different API than the code was generated for. Use == to ensure consistent testing.
Proposed fix
```diff
 - name: Install dependencies
   run: |
     pip install -e . pytest pytest-socket
-    pip install "opendataloader-pdf>=${{ env.VERSION }}"
+    pip install "opendataloader-pdf==${{ env.VERSION }}"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
- name: Install dependencies
  run: |
    pip install -e . pytest pytest-socket
    pip install "opendataloader-pdf==${{ env.VERSION }}"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/sync-upstream.yml around lines 156 - 159, Change the pip
install invocation in the "Install dependencies" job so the generated code is
tested against the exact opendataloader-pdf version used to generate it: update
the pip install command that currently uses "opendataloader-pdf>=${{ env.VERSION
}}" to pin with "==" (i.e., use equality instead of a greater-than constraint)
in the Install dependencies step so reruns cannot pull a newer, incompatible
API.
Fixed — changed to always() && needs.generate.result == 'success' so create-pr runs even when test fails (draft PR).
@hyunhee-jo, it looks like there may be a mismatch — the fix you described (always() && needs.generate.result == 'success' for create-pr) addresses a different concern than the one I raised.
My original comment was specifically about the test job's "Install dependencies" step (Line 159), where >= should be changed to ==:
```diff
 - name: Install dependencies
   run: |
     pip install -e . pytest pytest-socket
-    pip install "opendataloader-pdf>=${{ env.VERSION }}"
+    pip install "opendataloader-pdf==${{ env.VERSION }}"
```

Using `>=` in the test job means reruns could pull a newer version of opendataloader-pdf than the one used to generate the code in Job 1, potentially testing against a different API than the generated code was built for.
Could you confirm whether this version pinning change was also applied, or is it still pending?
Actionable comments posted: 2
♻️ Duplicate comments (4)
tests/test_regression.py (2)
117-123: ⚠️ Potential issue | 🟡 Minor — Assert non-empty `docs` before iterating metadata checks.

Without this, the loop body can be skipped and the test may pass without validating metadata.
Proposed fix
```diff
 docs = loader.load()
+assert docs
 for doc in docs:
     assert set(doc.metadata.keys()) == {"source", "format", "page"}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_regression.py` around lines 117 - 123, The test iterates over docs from loader.load() but never asserts docs is non-empty, allowing the loop to be skipped; add an explicit assertion right after docs = loader.load() (e.g., assert docs or assert len(docs) > 0) to ensure the collection contains at least one document before performing the metadata checks in the subsequent for-loop that inspects doc.metadata and Path(doc.metadata["source"]).name.
46-49: ⚠️ Potential issue | 🟠 Major — Snapshot tests should fail (not auto-generate + skip) when baselines are missing.
Current behavior can keep CI green while silently dropping regression coverage.
Proposed fix pattern
```diff
-if not self.SNAPSHOT_FILE.exists():
-    self.SNAPSHOT_FILE.write_text(actual, encoding="utf-8")
-    pytest.skip("Snapshot created, run again to verify")
+assert self.SNAPSHOT_FILE.exists(), (
+    f"Missing snapshot: {self.SNAPSHOT_FILE}. "
+    "Regenerate snapshots explicitly in an opt-in flow."
+)
```

Also applies to: 70-73, 93-98
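An explicit opt-in update path for snapshots might look like this sketch. The `UPDATE_SNAPSHOTS` flag and the `verify_snapshot` helper are illustrative names, not the repository's actual helpers.

```python
import os
from pathlib import Path

# Opt-in flag: snapshots are only (re)written when explicitly requested.
UPDATE_SNAPSHOTS = os.environ.get("UPDATE_SNAPSHOTS") == "1"

def verify_snapshot(snapshot_file: Path, actual: str) -> str:
    if UPDATE_SNAPSHOTS:
        snapshot_file.write_text(actual, encoding="utf-8")
        return "updated"
    # Normal path: a missing baseline is a hard failure, never auto-created,
    # so CI cannot silently drop regression coverage.
    if not snapshot_file.exists():
        raise AssertionError(
            f"Missing snapshot: {snapshot_file}. Run with UPDATE_SNAPSHOTS=1 to create it."
        )
    assert snapshot_file.read_text(encoding="utf-8") == actual
    return "verified"
```

The key property is that the verification path and the regeneration path never overlap: a green run always means the committed baseline was actually compared.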
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_regression.py` around lines 46 - 49, The tests currently auto-create missing snapshot files (using SNAPSHOT_FILE.write_text) and call pytest.skip, which masks missing baselines; instead, change the behavior in the snapshot handling (occurrences around the SNAPSHOT_FILE check and the blocks at the other two locations) to fail the test when the snapshot is absent: remove the auto-write + pytest.skip path and replace it with a pytest.fail (or an assert/raise) that reports the missing snapshot and instructs the developer how to regenerate baselines; keep the SNAPSHOT_FILE existence check and use the same clear failure message in all three places (the blocks referencing SNAPSHOT_FILE.write_text/pytest.skip).

tests/test_integration.py (1)
310-320: ⚠️ Potential issue | 🟡 Minor — Directory-loading assertion is still too weak (`>= 1`).

This can pass even if some PDFs are skipped. Assert equality with the actual `SAMPLES_DIR.glob("*.pdf")` count.

Proposed fix
```diff
 def test_load_with_directory_path(self):
     """Test loading with a directory path."""
+    expected_pdf_count = len(list(SAMPLES_DIR.glob("*.pdf")))
+    if expected_pdf_count == 0:
+        pytest.skip("No sample PDFs found in samples directory")
+
     loader = OpenDataLoaderPDFLoader(
         file_path=str(SAMPLES_DIR),
         format="text",
         quiet=True,
         split_pages=False,
     )
     documents = loader.load()
-    # Should load all PDFs in the directory
-    assert len(documents) >= 1
+    assert len(documents) == expected_pdf_count
```

🤖 Prompt for AI Agents
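The strict-equality idea reduces to counting the files on disk before asserting. A minimal stdlib sketch, where the demo directory and file names are made up for illustration:

```python
from pathlib import Path
import tempfile

def expected_pdf_count(samples_dir: Path) -> int:
    # Count the PDFs on disk so the test can assert strict equality
    # instead of the weaker `len(documents) >= 1`.
    return len(list(samples_dir.glob("*.pdf")))

# Illustrative setup: a throwaway directory with two "PDFs" and one other file.
demo = Path(tempfile.mkdtemp())
(demo / "a.pdf").write_bytes(b"%PDF-1.4")
(demo / "b.pdf").write_bytes(b"%PDF-1.4")
(demo / "notes.txt").write_text("not a pdf")
```

Pinning the expectation to the glob count means a silently skipped PDF fails the test instead of passing under `>= 1`.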
Verify each finding against the current code and only fix it if needed. In `@tests/test_integration.py` around lines 310 - 320, The test test_load_with_directory_path uses a weak assertion; compute the expected number of PDFs by counting SAMPLES_DIR.glob("*.pdf") (e.g. expected = len(list(SAMPLES_DIR.glob("*.pdf")))) and replace the current assert len(documents) >= 1 with an equality check assert len(documents) == expected so the test verifies all PDFs in SAMPLES_DIR were loaded; keep references to SAMPLES_DIR, loader.load(), and documents when making the change.

langchain_opendataloader_pdf/document_loaders.py (1)
69-69: ⚠️ Potential issue | 🟠 Major — `hybrid_timeout` default is still inconsistent between code and docs.

Line 114 documents default `"0"`, but line 69 still defaults to `None`. This keeps the runtime/docs contract mismatched.

Proposed fix
```diff
-    hybrid_timeout: Optional[str] = None,
+    hybrid_timeout: str = "0",
```

Also applies to: 114-114
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@langchain_opendataloader_pdf/document_loaders.py` at line 69, The function signature currently sets the parameter hybrid_timeout: Optional[str] = None while the docstring/documentation states the default is "0"; update the parameter default to hybrid_timeout: Optional[str] = "0" (keeping the Optional[str] type) so runtime and docs match, and ensure any internal logic that checks for None handles the new "0" string as intended (adjust checks if they rely on None); verify the docstring/default text at the documented location remains "0".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@langchain_opendataloader_pdf/document_loaders.py`:
- Around line 65-73: The constructor parameter order introduced new positional
params (hybrid, hybrid_mode, hybrid_url, hybrid_timeout, hybrid_fallback) after
detect_strikethrough and accidentally moved split_pages, causing positional-call
regressions; restore backward compatibility by moving split_pages back to its
original position in the constructor signature (place split_pages where it was
before the new hybrid-related params) or alternatively make the new hybrid_*
parameters keyword-only (add a * before them) so existing positional calls
mapping to split_pages remain unchanged; update the constructor signature in
document_loaders.py (the class/init that declares detect_strikethrough,
split_pages, hybrid, hybrid_mode, hybrid_url, hybrid_timeout, hybrid_fallback)
accordingly.
- Line 272: The code unconditionally forwards detect_strikethrough to convert(),
which doesn't exist in opendataloader-pdf 2.0.0; fix by either adding a version
guard or raising the minimum dependency: Option A - detect the installed
opendataloader_pdf version (or import attribute) and only include
detect_strikethrough in the convert() call when the package version is >= 2.1.0
(or when convert accepts that parameter); Option B - update the package
constraint to opendataloader-pdf>=2.1.0 so callers always have support;
reference the detect_strikethrough parameter and the convert() call to locate
the change.
---
Duplicate comments:
In `@langchain_opendataloader_pdf/document_loaders.py`:
- Line 69: The function signature currently sets the parameter hybrid_timeout:
Optional[str] = None while the docstring/documentation states the default is
"0"; update the parameter default to hybrid_timeout: Optional[str] = "0"
(keeping the Optional[str] type) so runtime and docs match, and ensure any
internal logic that checks for None handles the new "0" string as intended
(adjust checks if they rely on None); verify the docstring/default text at the
documented location remains "0".
In `@tests/test_integration.py`:
- Around line 310-320: The test test_load_with_directory_path uses a weak
assertion; compute the expected number of PDFs by counting
SAMPLES_DIR.glob("*.pdf") (e.g. expected = len(list(SAMPLES_DIR.glob("*.pdf"))))
and replace the current assert len(documents) >= 1 with an equality check assert
len(documents) == expected so the test verifies all PDFs in SAMPLES_DIR were
loaded; keep references to SAMPLES_DIR, loader.load(), and documents when making
the change.
In `@tests/test_regression.py`:
- Around line 117-123: The test iterates over docs from loader.load() but never
asserts docs is non-empty, allowing the loop to be skipped; add an explicit
assertion right after docs = loader.load() (e.g., assert docs or assert
len(docs) > 0) to ensure the collection contains at least one document before
performing the metadata checks in the subsequent for-loop that inspects
doc.metadata and Path(doc.metadata["source"]).name.
- Around line 46-49: The tests currently auto-create missing snapshot files
(using SNAPSHOT_FILE.write_text) and call pytest.skip, which masks missing
baselines; instead, change the behavior in the snapshot handling (occurrences
around the SNAPSHOT_FILE check and the blocks at the other two locations) to
fail the test when the snapshot is absent: remove the auto-write + pytest.skip
path and replace it with a pytest.fail (or an assert/raise) that reports the
missing snapshot and instructs the developer how to regenerate baselines; keep
the SNAPSHOT_FILE existence check and use the same clear failure message in all
three places (the blocks referencing SNAPSHOT_FILE.write_text/pytest.skip).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: a3e954c3-92a0-4e08-9155-d9d009e10e4d
📒 Files selected for processing (4)
- langchain_opendataloader_pdf/document_loaders.py
- tests/conftest.py
- tests/test_integration.py
- tests/test_regression.py
Test comment
PR Review —
| Severity | Count |
|---|---|
| Critical | 1 |
| Major | 4 |
| Minor | 5 |
| Nit | 2 |
The critical detect_strikethrough compat bug and the split_pages positional regression are the most important to fix before merge. The CI reliability issues in sync-upstream.yml should also be addressed before the workflow is relied upon in production.
♻️ Duplicate comments (3)
.github/workflows/sync-upstream.yml (3)
105-114: ⚠️ Potential issue | 🟠 Major — Sync branch creation is not idempotent; re-runs will fail.

This issue was previously flagged. The workflow reuses `sync/v${VERSION}` but uses `git checkout -b` (fails if branch exists) and `git push` without `--force-with-lease` (fails on non-fast-forward). Re-running the workflow for the same version will fail.

🔧 Proposed fix for idempotent branch handling
```diff
 BRANCH="sync/v${{ env.VERSION }}"
 echo "name=$BRANCH" >> $GITHUB_OUTPUT
 git config user.name "github-actions[bot]"
 git config user.email "github-actions[bot]@users.noreply.github.com"
-git checkout -b "$BRANCH"
+git fetch origin "$BRANCH" 2>/dev/null || true
+git checkout -B "$BRANCH"
 git add langchain_opendataloader_pdf/document_loaders.py pyproject.toml README.md
 git add .sync-report/ || true
 git commit -m "sync: update to opendataloader-pdf v${{ env.VERSION }}"
-git push origin "$BRANCH"
+git push --force-with-lease origin "$BRANCH"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/sync-upstream.yml around lines 105 - 114, The workflow currently always runs git checkout -b "$BRANCH" and git push origin "$BRANCH", which fails if the branch already exists or when non-fast-forward updates occur; change the checkout to try existing branch first (e.g., git fetch and git checkout "$BRANCH" or fall back to git checkout -b "$BRANCH") and make the push idempotent by using git push --force-with-lease origin "$BRANCH" so re-runs won’t error; update references to BRANCH and the checkout/push lines accordingly.
183-186: ⚠️ Potential issue | 🔴 Critical — Shallow checkout breaks `git diff HEAD~1` in PR body generation.

This issue was previously flagged. The default `fetch-depth: 1` doesn't include the parent commit, causing `git diff HEAD~1` on line 217 to fail. Add `fetch-depth: 2` to include the parent.

🔧 Proposed fix to include parent commit
```diff
 - uses: actions/checkout@v4
   with:
     ref: ${{ needs.generate.outputs.branch }}
+    fetch-depth: 2
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/sync-upstream.yml around lines 183 - 186, The checkout step using actions/checkout@v4 is doing a shallow fetch (default fetch-depth: 1) which breaks the later git diff HEAD~1 used for PR body generation; update the checkout step (the actions/checkout@v4 invocation) to set fetch-depth: 2 so the parent commit is fetched and git diff HEAD~1 succeeds during the PR body generation step.
158-161: ⚠️ Potential issue | 🟠 Major — Pin opendataloader-pdf to exact version for consistent testing.

This issue was previously flagged. The test job should use `==` instead of `>=` to ensure the generated code is tested against the exact version it was generated for. Using `>=` allows a newer version on re-runs, potentially testing against a different API.

🔧 Proposed fix to pin exact version
```diff
 - name: Install dependencies
   run: |
     pip install -e . pytest pytest-socket
-    pip install "opendataloader-pdf>=${{ env.VERSION }}"
+    pip install "opendataloader-pdf==${{ env.VERSION }}"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/sync-upstream.yml around lines 158 - 161, Change the pip install line in the "Install dependencies" step to pin opendataloader-pdf to the exact generated version: replace the "opendataloader-pdf>=${{ env.VERSION }}" spec with an exact match using "==" so the workflow installs opendataloader-pdf==${{ env.VERSION }} to ensure tests run against the exact generated package version.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.github/workflows/sync-upstream.yml:
- Around line 105-114: The workflow currently always runs git checkout -b
"$BRANCH" and git push origin "$BRANCH", which fails if the branch already
exists or when non-fast-forward updates occur; change the checkout to try
existing branch first (e.g., git fetch and git checkout "$BRANCH" or fall back
to git checkout -b "$BRANCH") and make the push idempotent by using git push
--force-with-lease origin "$BRANCH" so re-runs won’t error; update references to
BRANCH and the checkout/push lines accordingly.
- Around line 183-186: The checkout step using actions/checkout@v4 is doing a
shallow fetch (default fetch-depth: 1) which breaks the later git diff HEAD~1
used for PR body generation; update the checkout step (the actions/checkout@v4
invocation) to set fetch-depth: 2 so the parent commit is fetched and git diff
HEAD~1 succeeds during the PR body generation step.
- Around line 158-161: Change the pip install line in the "Install dependencies"
step to pin opendataloader-pdf to the exact generated version: replace the
"opendataloader-pdf>=${{ env.VERSION }}" spec with an exact match using "==" so
the workflow installs opendataloader-pdf==${{ env.VERSION }} to ensure tests run
against the exact generated package version.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6ea29446-26d0-4d08-8fd9-cf3840bb4be5
📒 Files selected for processing (3)
- .github/workflows/sync-upstream.yml
- .github/workflows/test-full.yml
- .github/workflows/test.yml
test body - will delete
PR Review: feat: add v2.2.0 sync infrastructure, CI/CD, and comprehensive tests

Overall this is a well-structured PR with good automation design. The sync marker approach is clever and the CI/CD job separation (generate → test → create-pr) is solid. A few issues to address below.

Bug: Format validation is unreachable dead code
The new

This also makes

Fix: Validate format in

```python
VALID_FORMATS = {"text", "markdown", "json", "html"}

def __init__(self, ..., format: str = "text", ...):
    fmt = format.lower()
    if fmt not in VALID_FORMATS:
        raise ValueError(f"Invalid format '{format}'. Valid options: {', '.join(sorted(VALID_FORMATS))}")
    self.format = fmt
```

Minor:
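Eager validation of this kind fails at construction time rather than at load time. A self-contained sketch of the same pattern; the class name and format set here are illustrative, not the package's real loader:

```python
VALID_FORMATS = {"text", "markdown", "json", "html"}

class DemoLoader:
    """Illustrative loader that validates `format` in __init__, not at load time."""

    def __init__(self, file_path: str, format: str = "text"):
        fmt = format.lower()  # format=None would already fail loudly here on .lower()
        if fmt not in VALID_FORMATS:
            raise ValueError(
                f"Invalid format '{format}'. Valid options: {', '.join(sorted(VALID_FORMATS))}"
            )
        self.file_path = file_path
        self.format = fmt  # stored normalized to lowercase
```

Failing in `__init__` surfaces a bad `format` at the point of misuse, instead of deferring the error until the first `load()` call.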
Actionable comments posted: 3
♻️ Duplicate comments (4)
tests/test_regression.py (2)
127-132: ⚠️ Potential issue | 🟡 Minor — Assert that `load()` returned documents before iterating.

If `loader.load()` returns an empty list, the loop body never executes and the test passes without validating any metadata. Add an explicit assertion to catch this regression.

🛡️ Proposed fix
```diff
 docs = loader.load()
+assert docs, "Expected at least one document from loader"
 for doc in docs:
     assert set(doc.metadata.keys()) == {"source", "format", "page"}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_regression.py` around lines 127 - 132, The test iterates over docs = loader.load() but never asserts that docs is non-empty, allowing a silent pass; add an explicit assertion like assert docs or assert len(docs) > 0 immediately after calling loader.load() to ensure load() returned at least one document before the for-loop that checks each doc's metadata (refer to loader.load() and the docs variable).
103-107: ⚠️ Potential issue | 🟠 Major — JSON snapshot test missing `UPDATE_SNAPSHOTS` guard.

Unlike `TestRegressionText` and `TestRegressionMarkdown`, this test auto-creates missing snapshots and skips without checking `UPDATE_SNAPSHOTS`. This inconsistency allows CI to pass silently when the snapshot is missing.

🐛 Proposed fix to align with other snapshot tests
```diff
 if not self.SNAPSHOT_FILE.exists():
-    self.SNAPSHOT_FILE.write_text(
-        json.dumps(actual_data, ensure_ascii=False, indent=2), encoding="utf-8"
-    )
-    pytest.skip("Snapshot created, run again to verify")
+    if UPDATE_SNAPSHOTS:
+        self.SNAPSHOT_FILE.write_text(
+            json.dumps(actual_data, ensure_ascii=False, indent=2), encoding="utf-8"
+        )
+        pytest.skip("Snapshot created, run again to verify")
+    else:
+        pytest.fail(f"Snapshot missing: {self.SNAPSHOT_FILE}. Run with UPDATE_SNAPSHOTS=1 to create.")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_regression.py` around lines 103 - 107, The snapshot creation block in TestRegression (the code using self.SNAPSHOT_FILE, actual_data and pytest.skip) must be guarded the same way as in TestRegressionText/TestRegressionMarkdown: only auto-create and skip when the UPDATE_SNAPSHOTS flag is set. Modify the block around self.SNAPSHOT_FILE.write_text(...) to check the UPDATE_SNAPSHOTS environment/pytest flag (e.g., os.environ.get("UPDATE_SNAPSHOTS") or the project's existing fixture/flag), otherwise fail the test instead of creating the snapshot and skipping; keep references to self.SNAPSHOT_FILE and pytest.skip so behavior matches the other snapshot tests.

.github/workflows/sync-upstream.yml (2)
158-161: ⚠️ Potential issue | 🟠 Major — Test against the exact upstream version you generated from.

Line 161 uses `>=`, so a rerun can validate code generated from `v${VERSION}` against a newer API than the one that produced the files.

🎯 Suggested fix
```diff
 - name: Install dependencies
   run: |
     pip install -e . pytest pytest-socket
-    pip install "opendataloader-pdf>=${{ env.VERSION }}"
+    pip install "opendataloader-pdf==$VERSION"
```

🤖 Prompt for AI Agents
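The difference between the two specifiers is easy to demonstrate with plain version-tuple ordering. This toy matcher is illustrative only and is not how pip actually resolves requirements:

```python
def version_tuple(v: str) -> tuple:
    # "2.3.0" -> (2, 3, 0); tuples compare element-wise, like versions.
    return tuple(int(part) for part in v.split("."))

def matches(installed: str, op: str, pinned: str) -> bool:
    # Toy matcher for just the two operators in question.
    if op == ">=":
        return version_tuple(installed) >= version_tuple(pinned)
    if op == "==":
        return version_tuple(installed) == version_tuple(pinned)
    raise ValueError(f"unsupported operator: {op}")

# The code was generated from 2.2.0; a later rerun finds 2.3.0 on PyPI.
# ">=" happily accepts the newer API; "==" keeps the test honest.
```

Under `>=`, a rerun silently tests the generated code against whatever upstream has since published; `==` ties the test to the generating version.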
Verify each finding against the current code and only fix it if needed. In @.github/workflows/sync-upstream.yml around lines 158 - 161, The workflow step named "Install dependencies" currently installs opendataloader-pdf using a range (">=${{ env.VERSION }}"), which can let tests run against a newer upstream API than the one that generated the files; change that install to pin the package to the exact generated version (use "==${{ env.VERSION }}" or equivalent exact-match spec) in the pip install for opendataloader-pdf so tests run against the same upstream version that produced the code.
102-114: ⚠️ Potential issue | 🟠 Major — Fetch the remote branch before relying on `--force-with-lease`.

This checkout never fetches `origin/sync/v...`, so `git push --force-with-lease` still has no lease target on reruns. If the branch already exists remotely, the push can reject and the sync stays non-idempotent.

♻️ Suggested fix
```diff
 - name: Create branch and push
   if: steps.changes.outputs.changed == 'true'
   id: branch
   run: |
-    BRANCH="sync/v${{ env.VERSION }}"
+    BRANCH="sync/v$VERSION"
     echo "name=$BRANCH" >> $GITHUB_OUTPUT
     git config user.name "github-actions[bot]"
     git config user.email "github-actions[bot]@users.noreply.github.com"
-    git checkout -b "$BRANCH"
+    git fetch origin "$BRANCH":"refs/remotes/origin/$BRANCH" || true
+    git checkout -B "$BRANCH"
     git add langchain_opendataloader_pdf/document_loaders.py pyproject.toml README.md
     git add .sync-report/ || true
-    git commit -m "sync: update to opendataloader-pdf v${{ env.VERSION }}"
+    git commit -m "sync: update to opendataloader-pdf v$VERSION"
     git push origin "$BRANCH" --force-with-lease
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/sync-upstream.yml around lines 102 - 114, The push uses --force-with-lease but never fetches the remote branch, so the lease target can be missing; in the "Create branch and push" step capture BRANCH and run a fetch for origin/$BRANCH first (e.g., git fetch origin "$BRANCH" || true) then if the remote exists check out/reset to it (git checkout -B "$BRANCH" "origin/$BRANCH") else create the new branch as before; finally push with --force-with-lease to origin "$BRANCH". This updates the branch discovery around BRANCH and ensures the --force-with-lease has a proper remote lease target.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/sync-upstream.yml:
- Around line 30-31: Add a dedicated validation step that checks the raw VERSION
input (env: VERSION) against a strict semver regex (e.g.,
^[0-9]+\.[0-9]+\.[0-9]+$), fail the job if it doesn't match, and expose the
validated value as an output (steps.version.outputs.value); then update all
later uses (the sed replacement, here-doc markdown expansions, gh pr create
title, and Slack JSON payload) to reference the validated output instead of ${
env.VERSION } or ${VERSION}. Specifically, create a step named e.g.
"validate-version" that reads VERSION, validates with the regex, sets an output
value, and replace all occurrences in the generate/test/create-pr jobs where
VERSION is interpolated (sed invocation, the here-doc blocks, gh pr create
title, and Slack payload) so they consume steps.version.outputs.value to avoid
injection and quoting issues.
- Around line 169-171: The create-pr job is currently skipped whenever any
dependency (including test) fails because it lists needs: [generate, test];
change create-pr to depend only on generate (needs: [generate]) and keep the if
condition checking needs.generate.result == 'success' &&
needs.generate.outputs.changed == 'true' so the job will run even when the test
job failed; then let the existing draft logic (which reads needs.test.result)
run inside the job's steps to decide draft behavior based on the test job
outcome.
- Around line 221-234: Add an idempotency guard before running gh pr create to
skip or update an existing PR for the deterministic branch (sync/v${{
env.VERSION }}): query GitHub for an open PR with --head "$BRANCH" (e.g., gh pr
list or gh pr view targeting the branch used earlier when creating/force-pushing
the branch) and if one exists, avoid calling gh pr create (optionally update its
body/labels via gh pr edit); if no PR is found, run gh pr create with --head
"$BRANCH" and the existing flags (keep the LABELS, DRAFT_FLAG logic intact).
---
Duplicate comments:
In @.github/workflows/sync-upstream.yml:
- Around line 158-161: The workflow step named "Install dependencies" currently
installs opendataloader-pdf using a range (">=${{ env.VERSION }}"), which can
let tests run against a newer upstream API than the one that generated the
files; change that install to pin the package to the exact generated version
(use "==${{ env.VERSION }}" or equivalent exact-match spec) in the pip install
for opendataloader-pdf so tests run against the same upstream version that
produced the code.
- Around line 102-114: The push uses --force-with-lease but never fetches the
remote branch, so the lease target can be missing; in the "Create branch and
push" step capture BRANCH and run a fetch for origin/$BRANCH first (e.g., git
fetch origin "$BRANCH" || true) then if the remote exists check out/reset to it
(git checkout -B "$BRANCH" "origin/$BRANCH") else create the new branch as
before; finally push with --force-with-lease to origin "$BRANCH". This updates
the branch discovery around BRANCH and ensures the --force-with-lease has a
proper remote lease target.
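A self-contained demonstration of that fetch-then-push sequence against throwaway local repositories (branch name and commit messages are illustrative):

```shell
# Demonstrates fetching the remote branch first so --force-with-lease has a
# lease target, falling back to creating the branch when it does not exist yet.
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/origin.git"
git init -q "$tmp/work"
cd "$tmp/work"
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m "init"
git remote add origin "$tmp/origin.git"
git push -q origin HEAD:main

BRANCH="sync/v2.2.0"
git fetch -q origin "$BRANCH" 2>/dev/null || true        # branch may not exist yet
if git rev-parse --verify -q "origin/$BRANCH" >/dev/null; then
  git checkout -q -B "$BRANCH" "origin/$BRANCH"          # reuse remote as lease target
else
  git checkout -q -b "$BRANCH"                           # first sync for this version
fi
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m "sync"
git push -q origin "$BRANCH" --force-with-lease
echo "pushed $BRANCH"
```

On reruns for the same version, the fetch populates `origin/$BRANCH`, so the lease check actually protects against clobbering concurrent updates.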
In `@tests/test_regression.py`:
- Around line 127-132: The test iterates over docs = loader.load() but never
asserts that docs is non-empty, allowing a silent pass; add an explicit
assertion like assert docs or assert len(docs) > 0 immediately after calling
loader.load() to ensure load() returned at least one document before the
for-loop that checks each doc's metadata (refer to loader.load() and the docs
variable).
- Around line 103-107: The snapshot creation block in TestRegression (the code
using self.SNAPSHOT_FILE, actual_data and pytest.skip) must be guarded the same
way as in TestRegressionText/TestRegressionMarkdown: only auto-create and skip
when the UPDATE_SNAPSHOTS flag is set. Modify the block around
self.SNAPSHOT_FILE.write_text(...) to check the UPDATE_SNAPSHOTS
environment/pytest flag (e.g., os.environ.get("UPDATE_SNAPSHOTS") or the
project’s existing fixture/flag), otherwise fail the test instead of creating
the snapshot and skipping; keep references to self.SNAPSHOT_FILE and pytest.skip
so behavior matches the other snapshot tests.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 10ed07be-549c-44ea-8ef7-d88ac526898a
📒 Files selected for processing (4)
- .github/workflows/sync-upstream.yml
- .github/workflows/test.yml
- pyproject.toml
- tests/test_regression.py
```yaml
- name: Create pull request
  run: |
    DRAFT_FLAG=""
    [ "${{ steps.status.outputs.draft }}" == "true" ] && DRAFT_FLAG="--draft"
    LABELS="sync,automated"
    [ "${{ needs.generate.outputs.breaking }}" == "true" ] && LABELS="$LABELS,breaking-change"
    gh pr create \
      --title "sync: update to opendataloader-pdf v${{ env.VERSION }}" \
      --body-file /tmp/pr-body.md \
      --base main \
      $DRAFT_FLAG \
      --label "$LABELS"
```
🧩 Analysis chain

🏁 Script executed:

```shell
cat -n .github/workflows/sync-upstream.yml | sed -n '200,250p'
```

Repository: opendataloader-project/langchain-opendataloader-pdf
Length of output: 2357

🏁 Script executed:

```shell
rg "sync/v|git.*branch|git.*checkout" .github/workflows/sync-upstream.yml
```

Repository: opendataloader-project/langchain-opendataloader-pdf
Length of output: 174

🏁 Script executed:

```shell
cat -n .github/workflows/sync-upstream.yml | sed -n '160,235p'
```

Repository: opendataloader-project/langchain-opendataloader-pdf
Length of output: 3365

🏁 Script executed:

```shell
cat -n .github/workflows/sync-upstream.yml | sed -n '1,160p' | grep -A 100 "generate:"
```

Repository: opendataloader-project/langchain-opendataloader-pdf
Length of output: 4548
🌐 Web query:
gh pr create behavior when PR already exists for branch
💡 Result:
When running gh pr create on a branch that already has an existing pull request (open or otherwise), the command fails with an error message indicating the PR already exists, including the URL to the existing PR. It does not create a duplicate PR or automatically update the existing one. Example error: a pull request for branch "head" into branch "base" already exists: https://github.com/OWNER/REPO/pull/NUMBER To update an existing PR after pushing new commits to the branch, use git push. There is no built-in "create-or-update" (upsert) behavior; scripts must check for existing PRs first using commands like gh pr list --head BRANCH before creating. Official documentation for gh pr create does not mention handling existing PRs beyond creation prompts and flags, confirming it errors on duplicates.
Citations:
- 1: "gh pr create should detect and update already existing pull requests" (cli/cli#5792)
- 2: "`pr create` should fail faster if a PR already exists" (cli/cli#649)
- 3: https://cli.github.com/manual/gh_pr_create
🏁 Script executed:

```shell
rg "gh pr create" .github/workflows/sync-upstream.yml -A 8
```

Repository: opendataloader-project/langchain-opendataloader-pdf
Length of output: 371
Add idempotency guard to handle PR reruns for the same version.
When the workflow reruns for an already-synced version, gh pr create will fail because a PR already exists for the deterministic branch sync/v${{ env.VERSION }}. The branch is created at line 106, force-pushed at line 114, and the current code attempts to create a new PR without checking if one already exists.
The suggested fix is needed: check for an existing open PR before creation, and optionally add --head "$BRANCH" to be explicit about the target branch.
♻️ Minimal idempotency guard

```diff
   - name: Create pull request
     run: |
+      BRANCH="sync/v$VERSION"
+      EXISTING_PR="$(gh pr list --head "$BRANCH" --state open --json number --jq '.[0].number')"
+      if [ -n "$EXISTING_PR" ] && [ "$EXISTING_PR" != "null" ]; then
+        echo "Open PR #$EXISTING_PR already exists for $BRANCH"
+        exit 0
+      fi
+
       DRAFT_FLAG=""
       [ "${{ steps.status.outputs.draft }}" == "true" ] && DRAFT_FLAG="--draft"
       LABELS="sync,automated"
       [ "${{ needs.generate.outputs.breaking }}" == "true" ] && LABELS="$LABELS,breaking-change"
       gh pr create \
         --title "sync: update to opendataloader-pdf v${{ env.VERSION }}" \
         --body-file /tmp/pr-body.md \
         --base main \
+        --head "$BRANCH" \
         $DRAFT_FLAG \
         --label "$LABELS"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
- name: Create pull request
  run: |
    BRANCH="sync/v${{ env.VERSION }}"
    EXISTING_PR="$(gh pr list --head "$BRANCH" --state open --json number --jq '.[0].number')"
    if [ -n "$EXISTING_PR" ] && [ "$EXISTING_PR" != "null" ]; then
      echo "Open PR #$EXISTING_PR already exists for $BRANCH"
      exit 0
    fi
    DRAFT_FLAG=""
    [ "${{ steps.status.outputs.draft }}" == "true" ] && DRAFT_FLAG="--draft"
    LABELS="sync,automated"
    [ "${{ needs.generate.outputs.breaking }}" == "true" ] && LABELS="$LABELS,breaking-change"
    gh pr create \
      --title "sync: update to opendataloader-pdf v${{ env.VERSION }}" \
      --body-file /tmp/pr-body.md \
      --base main \
      --head "$BRANCH" \
      $DRAFT_FLAG \
      --label "$LABELS"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/sync-upstream.yml around lines 221 - 234, Add an
idempotency guard before running gh pr create to skip or update an existing PR
for the deterministic branch (sync/v${{ env.VERSION }}): query GitHub for an
open PR with --head "$BRANCH" (e.g., gh pr list or gh pr view targeting the
branch used earlier when creating/force-pushing the branch) and if one exists,
avoid calling gh pr create (optionally update its body/labels via gh pr edit);
if no PR is found, run gh pr create with --head "$BRANCH" and the existing flags
(keep the LABELS, DRAFT_FLAG logic intact).
PR Review posted via separate message due to shell escaping constraints. See next comment.
Code Review

Overall this is a well-structured PR with clear goals: syncing the v2.2.0 upstream options, adding CI/CD pipelines, and expanding test coverage. The sync marker pattern is a smart approach to automating future syncs. A few issues worth addressing below.

Bug: Format validation fires after …
bundolee
left a comment
Review by benedict
Well-structured PR overall. Sync marker design, CI/CD security posture, and test coverage expansion (83→117) are all solid. Please fix the 3 items below, then I'll merge.
Must Fix
1. Shell injection risk in sync-upstream.yml
${{ env.VERSION }} is expanded by the GitHub Actions template engine before bash runs, so if a repository_dispatch payload contains shell metacharacters, injection occurs during the validation step itself. Use $VERSION (shell variable) instead of ${{ env.VERSION }} in all run: blocks.
```shell
# Before (injectable)
if ! echo "${{ env.VERSION }}" | grep -qE ...
# After (safe)
if ! echo "$VERSION" | grep -qE ...
```

2. Missing UPDATE_SNAPSHOTS guard on JSON snapshot test
test_json_structure_matches_snapshot creates the snapshot unconditionally when missing and skips. The text/markdown tests correctly gate on UPDATE_SNAPSHOTS — apply the same pattern here. Without it, CI silently skips instead of failing when the snapshot is absent.
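The guard this asks for can be sketched as a helper; `UPDATE_SNAPSHOTS` mirrors the flag used by the text/markdown tests, while the helper name and return values are assumptions for illustration, not the PR's actual code:

```python
import json
import os
from pathlib import Path

def check_snapshot(snapshot_file: Path, actual: dict) -> str:
    """Compare against a snapshot; only auto-create it when UPDATE_SNAPSHOTS is set."""
    if not snapshot_file.exists():
        if os.environ.get("UPDATE_SNAPSHOTS"):
            snapshot_file.write_text(json.dumps(actual, indent=2))
            return "created"  # a pytest test would call pytest.skip(...) here
        # Without the flag, a missing snapshot is a hard failure, not a silent skip
        raise AssertionError(
            f"Snapshot missing: {snapshot_file} (run with UPDATE_SNAPSHOTS=1 to create)"
        )
    expected = json.loads(snapshot_file.read_text())
    assert actual == expected, "snapshot mismatch"
    return "matched"
```

With this shape, CI fails loudly on an absent snapshot while local runs can still regenerate it deliberately.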
3. PR body heredoc indentation issue
The heredoc content in the Build PR body step is indented along with the YAML block, so ## headings will have leading whitespace and render as preformatted text. Dedent the heredoc content or use <<- with tabs.
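One way to sketch the dedent fix without switching the whole block to tab indentation (`<<-` strips only leading tabs): strip a fixed indent with `sed` before writing the body. Paths and content here are illustrative:

```shell
# The heredoc body stays indented for YAML readability; sed removes the fixed
# four-space indent so markdown headings land flush-left in the PR body.
cat <<'EOF_BODY' | sed 's/^    //' > /tmp/pr-body.md
    ## Sync report
    Generated for v2.2.0
EOF_BODY
head -n 1 /tmp/pr-body.md
```

The quoted delimiter also prevents variable expansion inside the static part of the body, which matches the injection-safe pattern used elsewhere in the workflow.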
Suggestions

- `git diff HEAD~1` in the `create-pr` job is fragile if the branch has merge commits. Consider `git diff main...HEAD --stat` instead.
- The design doc still references `opendataloader-pdf==2.0.0` for the min-dep test, but the actual implementation correctly uses `2.1.0`. Worth updating for consistency.
- README params table has inconsistent quoting of defaults (some quoted, some not). Since this will be auto-generated going forward, worth normalizing.
Additional Review Notes

Bug: `test_invalid_format_raises_on_load` is missing a mock

The test in TestOpenDataLoaderPDFLoaderFormatValidation calls lazy_load() without mocking opendataloader_pdf. This will not pass as written. In lazy_load(), opendataloader_pdf.convert() runs first inside a try/except Exception. When test.pdf does not exist, convert throws, the exception is silently swallowed, and the function returns early - the second try block containing the ext_map format validation is never reached. pytest.raises(ValueError) then fails because no exception was raised.

The fix is to add mocks like other lazy_load() tests in the file use:

```python
@patch("langchain_opendataloader_pdf.document_loaders.opendataloader_pdf")
```

Note: The existing reviews already cover the shell injection in sync-upstream.yml and the heredoc indentation issue - those remain the highest priority items.
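The swallowing behaviour described above can be reproduced with a minimal stand-in; `convert` and `lazy_load` below are dummies mirroring the described control flow, not the real library:

```python
# Minimal reproduction of the failure mode: a broad try/except around the
# conversion step swallows the file error, so validation placed after it
# never runs and no ValueError surfaces.
VALID = {"json", "text", "html", "markdown"}

def convert(path: str) -> None:
    raise FileNotFoundError(path)  # stand-in for convert() failing on a missing PDF

def lazy_load(path: str, format: str) -> list:
    try:
        convert(path)            # fails first...
    except Exception:
        return []                # ...and is silently swallowed
    if format not in VALID:      # never reached when the file is missing
        raise ValueError(f"Invalid format {format!r}")
    return ["doc"]

print(lazy_load("test.pdf", "xml"))  # prints [] (no ValueError despite the invalid format)
```

Mocking the conversion call, as the comment suggests, lets the flow reach the validation branch that the test actually targets.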
Code Review

Overall this is a well-structured PR with clear purpose and good test coverage. Here are my observations:

Code Quality / Best Practices

Positive:
Minor concerns:
Potential Bugs
Performance
Security
Test Coverage

Positive:
Summary
The action SHA pinning (#6) is the only item I'd recommend addressing before merge — everything else is either minor or informational.

🤖 Generated with Claude Code
PR Review: feat: add v2.2.0 sync infrastructure, CI/CD, and comprehensive tests

Overall this is a solid, well-structured PR. The CI/CD automation, sync marker approach, and test additions are all high-quality work. A few issues worth addressing before merge.

Potential Breaking Change
```python
def __init__(
    self,
    file_path: Union[str, Path, List[Union[str, Path]]],
    *,  # everything after is keyword-only
    format: str = "text",
    ...
```

CI/CD Issues
The workflow installs
```shell
grep -E "^\s+(self\.\w+ =|# --- )" langchain_opendataloader_pdf/document_loaders.py \
  | grep "self\." | sed 's/.*self\.\([a-z_]*\).*/\1/' | sort > /tmp/params-before.txt
```

This regex only matches simple single-line assignments.
A more robust approach would be to parse the
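One possibility for that more robust approach (a sketch under the assumption that parameters appear as `self.<name> = ...` assignments in `__init__`, the same assumption the grep makes):

```python
import ast

def init_params(source: str) -> set:
    """Collect `self.<name> = ...` assignment targets inside __init__ via the AST."""
    params = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name == "__init__":
            for stmt in ast.walk(node):
                if isinstance(stmt, ast.Assign):
                    for target in stmt.targets:
                        if (isinstance(target, ast.Attribute)
                                and isinstance(target.value, ast.Name)
                                and target.value.id == "self"):
                            params.add(target.attr)
    return params

src = '''
class Loader:
    def __init__(self, *, format="text"):
        self.format = (
            format.lower()
        )  # a multi-line assignment the line-based grep would miss
'''
print(sorted(init_params(src)))  # prints ['format']
```

Because the AST sees whole statements, line breaks, comments, and conditional assignments no longer affect the comparison between the before/after parameter sets.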
git push origin "$BRANCH" --force-with-lease
Silently ignoring this failure means the PR body could be empty/missing context if the sync report was not generated. Better to fail loudly or add a guard: [ -d .sync-report ] && git add .sync-report/
```yaml
run: pytest tests/ -v
```

The full test intentionally runs integration tests (which need the network/Java), but there's no separation between which test files need network and which don't.

Minor Issues

README: …

What's Done Well
Summary
The core feature (…)
PR Review: feat: add v2.2.0 sync infrastructure, CI/CD, and comprehensive tests

Overall this is a well-designed PR with solid CI/CD infrastructure, good test coverage, and a clean sync mechanism. The breaking-change detection and regression snapshot system are particularly valuable additions. A few things worth addressing:

Bugs / Correctness Issues

1. In both workflows:

```yaml
if: always() && secrets.SLACK_WEBHOOK_URL != ''
```

GitHub Actions docs advise against referencing secrets in `if:` conditions; route the secret through an env intermediary:

```yaml
env:
  SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
steps:
  - name: Notify Slack
    if: always() && env.SLACK_WEBHOOK_URL != ''
```

2. Upstream script execution

The generate job fetches and runs a Node.js script directly from the upstream repo at the tagged version:

```shell
node upstream/scripts/generate-langchain.mjs ...
```

If the upstream repo or a tag is ever compromised, this executes arbitrary code in your CI.

3. Dev dependencies are installed manually rather than using `pip install -e ".[dev]"`.

Code Quality

4. Format validation deferred to `lazy_load()`

```python
loader = OpenDataLoaderPDFLoader(file_path="a.pdf", format="xml")  # no error
list(loader.lazy_load())  # ValueError here
```

Moving the check into `__init__` would fail fast.

5. Direct import from conftest

In `test_integration.py`:

```python
from tests.conftest import java_available
```

This creates a hard module path dependency when running tests.

6. README parameter table has inconsistent quoting of defaults

Some defaults in the synced table are quoted, others aren't. This looks like an artifact of the upstream code generator. Worth filing an issue or fixing in `generate-langchain.mjs`.

Minor / Nits

7. The Keep a Changelog format conventionally includes comparison links at the bottom. Not a blocker, but completing this makes the changelog fully conformant.

8. `{"page number": 3, "type": "text", "value": "page 3"}` — the actual JSON schema uses …

9. The function lives in …

What's Done Well
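The fail-fast move suggested in item 4 might look like this; a minimal sketch assuming a trimmed constructor and a `VALID_FORMATS` constant (both assumptions, not the PR's actual code):

```python
VALID_FORMATS = {"json", "text", "html", "markdown"}

class OpenDataLoaderPDFLoaderSketch:
    """Trimmed illustration: reject bad formats at construction, before any I/O."""

    def __init__(self, file_path, *, format: str = "text"):
        if format not in VALID_FORMATS:
            raise ValueError(
                f"Invalid format {format!r}. Valid options are: {sorted(VALID_FORMATS)}"
            )
        self.file_path = file_path
        self.format = format

try:
    OpenDataLoaderPDFLoaderSketch("test.pdf", format="xml")
except ValueError as e:
    print(e)  # raised immediately, no convert() call needed
```

Validating at construction also simplifies the tests: the invalid-format case no longer needs a mocked conversion step to reach the check.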
Code Review

Overall this is a well-structured PR. The sync marker approach for automated code generation is smart, and the CI/CD design follows good practices (minimal permissions, network isolation, fail-fast: false). A few issues worth addressing:

Bugs / Issues

1. Format validation fires after `convert()`

In `lazy_load()`, `convert()` runs first, then validation:

```python
opendataloader_pdf.convert(...)  # ← already executed with invalid format
...
ext = ext_map.get(self.format)
if ext is None:
    raise ValueError(...)  # ← too late
```

Validation should be in `__init__`.

2. README table shows wrong default for …

The auto-generated README table shows one value, but the code has another.

3. The polling loop runs up to 20 iterations × 30s = 10 minutes, but `timeout-minutes: 10` can race it:

```yaml
timeout-minutes: 10  # ← can race with 20×30s=600s loop below
```

```shell
...
for i in $(seq 1 20); do
  ...
  sleep 30
done
```

4. If two upstream releases are triggered in quick succession, the …

Minor Issues

5. Fragile breaking-change detection regex

```shell
grep -E "^\s+(self\.\w+ =|# --- )" langchain_opendataloader_pdf/document_loaders.py \
  | grep "self\." | sed 's/.*self\.\([a-z_]*\).*/\1/' | sort
```

This will false-negative on parameters assigned conditionally or in multiple lines, and false-positive on any …

6. …

7. In the old README the …

Positives
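The poll-loop race in item 3 can be avoided with a bounded retry that fails loudly on timeout; a hedged sketch where `check_release` is a stand-in for the real PyPI query (for example a `pip index versions` check), with illustrative counts:

```shell
# Bounded polling that exits non-zero on timeout instead of falling through
# silently; keep retries × interval safely under the job's timeout-minutes.
check_release() { false; }   # stand-in: pretend the version never appears

found=0
for i in $(seq 1 3); do      # real workflow: ~18 × 30s, under a 10-minute timeout
  if check_release; then found=1; break; fi
  sleep 0.1                  # real workflow: sleep 30
done
if [ "$found" -ne 1 ]; then
  echo "Timed out waiting for release"
  # exit 1  # uncomment in CI so the job fails explicitly
fi
```

Sizing the retry budget below `timeout-minutes` means the loop's own failure path, not the runner's hard kill, reports the problem.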
🤖 Generated with Claude Code
bundolee
left a comment
All 3 must-fix items confirmed resolved. LGTM — merging.
force-pushed from 57bd397 to c997baa
placeholder test
Code Review — PR #16: feat: add v2.2.0 sync infrastructure, CI/CD, and comprehensive tests

Overall this is a well-thought-out PR with strong security hardening in the CI/CD layer and a good expansion of test coverage. The three-layer automation (sync → test → create-PR) is a solid design. Below are specific findings across all changed files.

High Priority

1. Format validation is unreachable for the advertised use case (document_loaders.py)

The new ValueError for invalid format is placed after opendataloader_pdf.convert() in lazy_load(). The flow is:

1. convert() is called with the raw format inside a broad try/except Exception
2. any failure is swallowed and the function returns early
3. only then is the format checked against ext_map

If format="xml" is passed, convert() is called with format=["xml"]. If the upstream library rejects it (raises internally), the broad except Exception handler catches it and returns — the ValueError never surfaces. The error is only raised if convert() somehow succeeds with an unknown format. The test test_invalid_format_raises_on_load passes format="xml" without mocking opendataloader_pdf. In CI with --disable-socket and no Java, convert() will fail silently and pytest.raises(ValueError) will itself fail.

Fix: Validate format in `__init__` so it fails fast and consistently — before any I/O is attempted.

2. README parameter defaults do not match Python code defaults

The auto-generated README table (from upstream options.json) shows defaults that differ from the Python wrapper:
Users reading the README will expect replace_invalid_chars=None to behave like replace_invalid_chars=" ". If the upstream library does not treat None and "off" identically, callers relying on the README will get unexpected behaviour. Either update the Python defaults to match, or add a note clarifying that None means "use library default."

Medium Priority

3. conftest.py direct import is fragile (test_integration.py)

The import …

4. Third-party GitHub Action not pinned to a commit SHA

slackapi/slack-github-action@v2.1.0 uses a mutable version tag. Tags can be repointed by the action author. GitHub's security hardening docs recommend pinning to a full commit SHA for third-party actions. This applies to the Slack action in both sync-upstream.yml and test-full.yml.

5. uv.lock shows dependency downgrades

The lockfile records langchain-core 1.2.23 -> 1.2.19 and requests 2.33.0 -> 2.32.5. These appear to be regressions rather than intentional pins. Please verify the lockfile was generated from the intended environment.

Low Priority / Nits

6. Slack webhook field bypasses the env-intermediary guard

The PR description calls out "Slack webhook guarded via env intermediary" as a security hardening measure, but the webhook: input field still references the secret directly. The if: env.SLACK_URL != '' guard is correct for preventing the step from running when the secret is absent, but the description is slightly misleading.

7. Design document in Korean

docs/plans/2026-04-02-pre-release-checklist-design.md is a planning document written primarily in Korean. For a public English-first repo, internal planning docs are usually better placed in issues, wikis, or internal channels. Worth a deliberate decision either way.

8. Test should be updated if format validation moves to `__init__`

If issue #1 is fixed, test_invalid_format_raises_on_load should assert the ValueError on construction, not on lazy_load().

What's well done
Objective: When opendataloader-pdf releases a new version, there is no automated way to sync new options into the langchain wrapper. Every update requires manual parameter copy-paste, README edits, and ad-hoc testing. The project also lacks CI to catch regressions on PRs.

Approach: Build three layers of automation — sync markers with upstream workflow (sync-upstream.yml), PR and full test workflows (test.yml, test-full.yml), and comprehensive test expansion (34 new tests, 117 total). Add detect_strikethrough parameter, bump min dep to >=2.1.0, harden all workflow run: blocks against shell injection.

Evidence: Loaded samples/pdf/lorem.pdf with the updated wrapper. detect_strikethrough=True loads without error (1 doc, 462 chars). split_pages keyword arg works in new position. format="xml" raises ValueError as expected. opendataloader-pdf 2.2.0 convert() confirms detect_strikethrough param exists.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
force-pushed from c997baa to ee0bbd5
Code Review

This is a well-structured PR that addresses real operational needs — automated sync, CI/CD, and expanded test coverage. The security hardening in the workflows is particularly thoughtful. A few items below warrant attention before merging.

Bug: …
| Parameter | Default in table |
|---|---|
| `format` | `text` (unquoted) |
| `image_output` | `off` (unquoted) |
| `table_method` | `"default"` (quoted) |
| `reading_order` | `"xycut"` (quoted) |
String defaults should be consistently quoted or unquoted. This is an upstream generator issue, but worth noting for the generate-langchain.mjs implementation.
Minor: docs/plans/2026-04-02-pre-release-checklist-design.md
This design document is primarily in Korean and uses a date-stamped filename. A few considerations:
- Non-Korean contributors won't be able to review or follow it.
- Design docs in
docs/plans/with date-stamped names accumulate over time — consider whether a living document (e.g.,docs/architecture.md) would serve better long-term. - The document duplicates workflow YAML from the PR itself — keeping two sources of truth in sync is ongoing overhead.
This doesn't block the PR, but it's worth discussing whether this belongs in the repo or in an issue/wiki.
What's done well
- Security hardening in the workflows is correct and well-documented: `$VERSION` shell expansion over `${{ env.VERSION }}`, `--force-with-lease` instead of `--force`, `permissions: {}` at top level with per-job overrides, Slack webhook via `env` intermediary.
- The `ext_map` refactor in `lazy_load()` is cleaner than the `if/elif` chain and adds proper `ValueError` messaging.
- The `conftest.py` extraction of `java_available()` with `lru_cache` is correct and avoids repeated subprocess calls in the test suite.
- `concurrency: cancel-in-progress: false` on `sync-upstream` is the right call — a release sync should never be cancelled mid-run.
- Test coverage for edge cases (`_split_into_pages`, `_split_json_into_pages`, path handling, temp cleanup) fills real gaps.
- The `min-dep-test` job in `test.yml` catching minimum-version regressions is a pattern LangChain uses and is valuable here.
Objective

When opendataloader-pdf releases a new version (e.g., v2.1.0 adding `detect_strikethrough`), there is no automated way to sync the new options into the langchain wrapper. Every update requires manual parameter copy-paste, README edits, and ad-hoc testing — error-prone and slow. The project also lacks CI to catch regressions on PRs.

Approach

Build three layers of automation — sync, test, release — so upstream releases flow into the wrapper with minimal manual work.

1. Sync markers + upstream workflow (`sync-upstream.yml`)

Insert `BEGIN/END SYNCED PARAMS` markers in `document_loaders.py` and README so the upstream code generator (`generate-langchain.mjs`) can patch parameters without touching surrounding code. The workflow triggers on `repository_dispatch` or manual input, generates code, runs tests, and creates a PR (draft on test failure, ready on success). Breaking changes (removed params) are auto-detected and labeled.

2. PR + full test workflows (`test.yml`, `test-full.yml`)

Every PR runs unit tests across Python 3.10/3.12/3.13 with `--disable-socket` network isolation, plus a minimum dependency test (langchain-core 1.0.0 + opendataloader-pdf 2.1.0). Manual full tests cover 3 OS × 2 Python with Java integration.

3. Comprehensive test expansion (34 new tests → 117 total)

Unit: Path objects, `detect_strikethrough`, JSON edge cases, format validation, temp cleanup.
Integration: `replace_invalid_chars`, `content_safety_off`, image options, directory loading.
Regression: snapshot comparison (text/markdown/json), metadata consistency, instance isolation.

Security hardening:
- `run:` blocks use shell `$VERSION` instead of `${{ env.VERSION }}` to prevent injection from `repository_dispatch` payloads
- Job outputs (`needs.*`) mapped to env vars before shell evaluation
- `permissions: {}` with per-job overrides
- Slack webhook via `env` intermediary (secrets not accessible in `if:` conditions)

Additional changes:
- `detect_strikethrough` parameter (new in opendataloader-pdf v2.1.0)
- Minimum dependency bumped to `opendataloader-pdf>=2.1.0`
- `hybrid_timeout` docstring default: `"30000"` → `"0"`

Evidence

Loaded a real PDF (`samples/pdf/lorem.pdf`) with the updated wrapper under user-identical conditions:
- `detect_strikethrough=True, format=markdown`
- `detect_strikethrough` omitted (default)
- `split_pages=True` (keyword arg, moved position)
- `split_pages=False` (keyword arg)
- `format="xml"` (invalid) → `ValueError` raised: `Invalid format 'xml'. Valid options are: 'json', 'text', 'html', 'markdown'`
- `convert()` signature → `detect_strikethrough` param exists → True

Cannot verify `sync-upstream.yml` end-to-end — requires an actual upstream release to trigger `repository_dispatch`. Workflow syntax and job outputs validated via CI dry-run.

🤖 Generated with Claude Code