feat: add PageIndex SDK with local/cloud dual-mode support #207
Draft
KylinMountain wants to merge 22 commits into VectifyAI:main from
f4ca4c5 to 1369cf1
- Critical: preserve text in markdown structure for fallback retrieval
- Cloud: SSE response close, folder cache dict, truncate error body
- Cloud: filter internal tools, async-safe streaming via to_thread
- SQLite: multi-thread connection tracking, context manager
- Security: collection name validation, parse_pages range cap
- Polish: use count_tokens wrapper, _EXAMPLES_DIR naming, QueryStream public
- Backend protocol: add @runtime_checkable
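As a rough illustration of the last item, the sketch below shows what @runtime_checkable adds to a typing.Protocol: isinstance() checks against a structural interface. The Backend class and its query method are hypothetical names chosen for the example; the PR's actual protocol members are not shown in this conversation.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Backend(Protocol):
    """Hypothetical backend interface; the method name is illustrative only."""

    def query(self, collection: str, question: str) -> str: ...


class InMemoryBackend:
    """Satisfies Backend structurally, without inheriting from it."""

    def query(self, collection: str, question: str) -> str:
        return f"[{collection}] no answer for: {question}"


class NotABackend:
    """Has no query() method, so the isinstance check fails."""


# isinstance() only verifies that the methods exist, not their signatures.
is_backend = isinstance(InMemoryBackend(), Backend)
is_not_backend = isinstance(NotABackend(), Backend)
```

Without the decorator, calling isinstance() against the protocol raises TypeError, so the decorator is what makes a runtime guard possible at all.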
- Replace ConfigLoader + config.yaml with Pydantic IndexConfig
- Use bool for config flags (if_add_node_summary etc.) instead of "yes"/"no"
- Enable doc_description by default for better agent QA
- Early API key validation on LocalClient init via litellm provider detection
- Expose index_config parameter on LocalClient for advanced users
- Remove config.yaml dependency from pip package
…aming
- Fix return type annotation: dict -> list (tree structure is a list)
- Fix not-found return: {} -> [] for consistency
- Cloud streaming: replace batch-then-yield with asyncio.Queue for
true real-time event delivery via background thread
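The difference between batch-then-yield and queue-based delivery can be sketched as follows. This is a minimal, stdlib-only illustration and not the PR's code: stream_events, fake_sse_events, and the event names are all hypothetical, and the blocking reader stands in for whatever the cloud backend uses to read the SSE response.

```python
import asyncio
import threading
from typing import AsyncIterator, Callable, Iterable

_DONE = object()  # sentinel marking the end of the stream


async def stream_events(make_events: Callable[[], Iterable[str]]) -> AsyncIterator[str]:
    """Yield events from a blocking iterable as soon as each one is produced."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def worker() -> None:
        try:
            for event in make_events():
                # asyncio.Queue is not thread-safe, so hand items to the loop thread.
                loop.call_soon_threadsafe(queue.put_nowait, event)
        except Exception as exc:
            # Forward the failure so the consumer can re-raise it.
            loop.call_soon_threadsafe(queue.put_nowait, exc)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, _DONE)

    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = await queue.get()
        if item is _DONE:
            return
        if isinstance(item, Exception):
            raise item
        yield item


def fake_sse_events() -> Iterable[str]:
    """Stand-in for a blocking SSE reader; a real one would read from HTTP."""
    yield from ("tool_call", "answer_delta", "answer_done")


async def collect() -> list:
    return [event async for event in stream_events(fake_sse_events)]


events = asyncio.run(collect())
```

Each event reaches the consumer as soon as the background thread produces it, whereas a batch-then-yield design would wait for the whole response before yielding anything. The sketch does not handle a consumer that stops iterating early; a fuller version would need a way to cancel the worker.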
…n type, legacy API fix
- Remove client-side dedup in CloudBackend (server responsibility)
- Cloud streaming: real-time via asyncio.Queue instead of batch-then-yield
- Fix get_document_structure return type: dict -> list, not-found returns []
- Fix legacy page_index() API: use IndexConfig instead of deleted ConfigLoader
- Add folder upgrade warning (once only)
- Demo: always upload, no client-side caching
Local demo was missing LLM provider configuration, making it fail on first run without clear guidance.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
….pageindex
Local-only params are now documented. Default storage_path changed from ~/.pageindex (global) to ./.pageindex (project-local) for better isolation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Was only defined on LocalClient but called from PageIndexClient._init_local(), causing AttributeError when using PageIndexClient directly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…PI improvements
- Extract images from PDF pages preserving text-image reading order
using pymupdf get_text("dict") blocks. Images saved to
files/{collection}/{doc_id}/images/ with relative paths in content.
- Add get_document_structure() and get_page_content() to Collection public API
- get_document() now returns structure; add include_text param to populate
node text from page cache (WARNING in docstring: not for agent/LLM use)
- delete_document() cleans up images directory
- Agent system prompt instructs LLM to preserve image references in answers
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
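To make the reading-order idea concrete, here is a small sketch that walks a page dictionary shaped like the output of pymupdf's get_text("dict") (blocks with type 0 for text and 1 for image, each carrying a bbox) and interleaves text with image references. It runs on a hand-built dictionary rather than a real PDF. The function name, the img_N file naming, the markdown-style image reference, and the explicit sort by bounding box are assumptions made for illustration; the PR's actual ordering logic and path format are not shown here.

```python
from typing import Any, Dict, List


def page_to_content(page_dict: Dict[str, Any], image_dir: str) -> str:
    """Join text and image placeholders in top-to-bottom, left-to-right order."""
    parts: List[str] = []
    image_count = 0
    # bbox is (x0, y0, x1, y1); sort by top edge, then left edge.
    blocks = sorted(page_dict["blocks"], key=lambda b: (b["bbox"][1], b["bbox"][0]))
    for block in blocks:
        if block["type"] == 1:  # image block
            image_count += 1
            # Hypothetical naming scheme; the PR's actual file names are not shown.
            parts.append(f"![image]({image_dir}/img_{image_count}.{block['ext']})")
        else:  # text block: lines -> spans -> text
            lines = ["".join(s["text"] for s in line["spans"]) for line in block["lines"]]
            parts.append("\n".join(lines))
    return "\n\n".join(parts)


# Hand-built sample mimicking the dict layout; the image block is listed first
# on purpose to show that ordering comes from the bounding boxes.
sample_page = {
    "blocks": [
        {"type": 1, "bbox": (50, 200, 300, 400), "ext": "png"},
        {"type": 0, "bbox": (50, 50, 300, 100),
         "lines": [{"spans": [{"text": "Figure 1 shows "}, {"text": "the pipeline."}]}]},
        {"type": 0, "bbox": (50, 420, 300, 450),
         "lines": [{"spans": [{"text": "Caption text."}]}]},
    ]
}
content = page_to_content(sample_page, "images")
```

A simple bounding-box sort like this can misorder multi-column pages, so a real implementation would likely need column detection or would rely on the block order the library returns.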
Allows callers to specify where extracted PDF images are saved. Default behavior unchanged (internal .pageindex/files/.../images/).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Unified Python SDK for document indexing and retrieval, supporting both self-hosted (local) and fully-managed (cloud) modes.
Highlights
- LocalClient (self-hosted, user LLM key) / CloudClient (fully managed, no LLM key)
- col.query(stream=True) returns async-iterable QueryStream
- chat/completions
Usage