feat: add PageIndex SDK with local/cloud dual-mode support#207

Draft
KylinMountain wants to merge 22 commits into VectifyAI:main from KylinMountain:feat/sdk

@KylinMountain
Collaborator

Summary

Unified Python SDK for document indexing and retrieval, supporting both self-hosted (local) and fully-managed (cloud) modes.

Highlights

  • Dual-mode client: LocalClient (self-hosted, user LLM key) / CloudClient (fully managed, no LLM key)
  • Collection-based multi-document management with SHA-256 dedup
  • Streaming query: col.query(stream=True) returns async-iterable QueryStream
  • Pluggable protocols: DocumentParser, StorageEngine (SQLite default)
  • Cloud backend: actual PageIndex API with SSE streaming via chat/completions
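The SHA-256 dedup highlight can be illustrated with a minimal sketch. It assumes dedup is keyed on the hash of the file bytes; `DedupIndex`, its `add` method, and the hash-prefix document id are hypothetical names for illustration, not the SDK's actual implementation.

```python
import hashlib
from pathlib import Path
from typing import Dict, Tuple


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class DedupIndex:
    """Hypothetical in-memory map from content hash to document id."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, str] = {}

    def add(self, path: Path) -> Tuple[str, bool]:
        """Return (doc_id, created); identical bytes reuse the existing id."""
        key = file_sha256(path)
        if key in self._by_hash:
            return self._by_hash[key], False
        doc_id = key[:16]  # assumption: id derived from the hash prefix
        self._by_hash[key] = doc_id
        return doc_id, True
```

Keying on content rather than filename means a renamed copy of the same PDF is not indexed twice.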

Usage

from pageindex import LocalClient, CloudClient

# Local
client = LocalClient()
col = client.collection()
col.add("paper.pdf")
col.query("What is this about?")

# Cloud
client = CloudClient(api_key="pi-xxx")
col = client.collection()
col.add("paper.pdf")
col.query("What is this about?", stream=True)
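Because the streaming query is described as returning an async-iterable, a caller would presumably consume it with `async for`. The sketch below uses a stand-in class, `FakeQueryStream`, and assumes each event is a plain text chunk; the real QueryStream event shape is not shown in this PR description.

```python
import asyncio
from typing import AsyncIterator, List


class FakeQueryStream:
    """Hypothetical stand-in for the object returned by col.query(stream=True)."""

    def __init__(self, chunks: List[str]) -> None:
        self._chunks = chunks

    def __aiter__(self) -> AsyncIterator[str]:
        return self._generate()

    async def _generate(self) -> AsyncIterator[str]:
        for chunk in self._chunks:
            await asyncio.sleep(0)  # yield control, as a network read would
            yield chunk


async def collect_answer(stream) -> str:
    """Consume any async-iterable of text chunks and join them."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


answer = asyncio.run(collect_answer(FakeQueryStream(["Page", "Index"])))
```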

KylinMountain force-pushed the feat/sdk branch 4 times, most recently from f4ca4c5 to 1369cf1 on April 1, 2026 09:47
- Critical: preserve text in markdown structure for fallback retrieval
- Cloud: SSE response close, folder cache dict, truncate error body
- Cloud: filter internal tools, async-safe streaming via to_thread
- SQLite: multi-thread connection tracking, context manager
- Security: collection name validation, parse_pages range cap
- Polish: use count_tokens wrapper, _EXAMPLES_DIR naming, QueryStream public
- Backend protocol: add @runtime_checkable
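As a rough illustration of what `@runtime_checkable` adds to a pluggable protocol, the sketch below defines a protocol named after the StorageEngine mentioned in the highlights. The method names `save_document` and `load_document` are hypothetical; the PR's actual protocol members are not listed here.

```python
from typing import Dict, Optional, Protocol, runtime_checkable


@runtime_checkable
class StorageEngine(Protocol):
    """Sketch of a storage protocol; member names are assumptions."""

    def save_document(self, doc_id: str, payload: dict) -> None: ...

    def load_document(self, doc_id: str) -> Optional[dict]: ...


class InMemoryStorage:
    """Satisfies the protocol structurally, without inheriting from it."""

    def __init__(self) -> None:
        self._docs: Dict[str, dict] = {}

    def save_document(self, doc_id: str, payload: dict) -> None:
        self._docs[doc_id] = payload

    def load_document(self, doc_id: str) -> Optional[dict]:
        return self._docs.get(doc_id)


class WriteOnlyStorage:
    """Missing load_document, so the isinstance check fails."""

    def save_document(self, doc_id: str, payload: dict) -> None:
        pass


ok = isinstance(InMemoryStorage(), StorageEngine)
bad = isinstance(WriteOnlyStorage(), StorageEngine)
```

Note that a runtime `isinstance` check against a protocol only verifies that the members exist, not that their signatures match.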
- Replace ConfigLoader + config.yaml with Pydantic IndexConfig
- Use bool for config flags (if_add_node_summary etc.) instead of "yes"/"no"
- Enable doc_description by default for better agent QA
- Early API key validation on LocalClient init via litellm provider detection
- Expose index_config parameter on LocalClient for advanced users
- Remove config.yaml dependency from pip package
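The move from "yes"/"no" strings to booleans can be sketched as follows. The PR uses a Pydantic model; this version uses a standard-library dataclass only to stay dependency-free. `IndexConfigSketch`, `from_legacy`, the field name `if_add_doc_description`, and both default values are assumptions; only `if_add_node_summary` appears in the commit message.

```python
from dataclasses import dataclass

_TRUE = {"yes", "true", "1"}
_FALSE = {"no", "false", "0"}


def to_bool(value) -> bool:
    """Accept real booleans and legacy "yes"/"no" style strings."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in _TRUE:
            return True
        if lowered in _FALSE:
            return False
    raise ValueError(f"cannot interpret {value!r} as a boolean flag")


@dataclass(frozen=True)
class IndexConfigSketch:
    if_add_node_summary: bool = True      # default value is an assumption
    if_add_doc_description: bool = True   # hypothetical field name

    @classmethod
    def from_legacy(cls, raw: dict) -> "IndexConfigSketch":
        """Build a config from a legacy mapping that may hold "yes"/"no"."""
        return cls(**{key: to_bool(val) for key, val in raw.items()})


cfg = IndexConfigSketch.from_legacy({"if_add_node_summary": "no"})
```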
…aming

- Fix return type annotation: dict -> list (tree structure is a list)
- Fix not-found return: {} -> [] for consistency
- Cloud streaming: replace batch-then-yield with asyncio.Queue for
  true real-time event delivery via background thread
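The batch-then-yield replacement described above can be sketched with a background thread that pushes events into an `asyncio.Queue` as they arrive. This is a minimal illustration under stated assumptions, not the PR's code: `stream_from_thread` and `fake_sse` are hypothetical names, and events are modelled as plain strings.

```python
import asyncio
import threading
from typing import AsyncIterator, Callable, Iterable, List

_DONE = object()  # sentinel marking the end of the stream


async def stream_from_thread(
    produce: Callable[[], Iterable[str]],
) -> AsyncIterator[str]:
    """Yield items from a blocking producer as soon as each one is ready."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def worker() -> None:
        # Runs in a thread; asyncio.Queue is not thread-safe, so every put
        # is scheduled onto the event loop with call_soon_threadsafe.
        try:
            for item in produce():
                loop.call_soon_threadsafe(queue.put_nowait, item)
        except Exception as exc:  # forward producer errors to the consumer
            loop.call_soon_threadsafe(queue.put_nowait, exc)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, _DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = await queue.get()
        if item is _DONE:
            return
        if isinstance(item, Exception):
            raise item
        yield item


def fake_sse() -> Iterable[str]:
    """Hypothetical blocking source standing in for an SSE response."""
    yield from ["alpha", "beta", "gamma"]


async def collect() -> List[str]:
    return [event async for event in stream_from_thread(fake_sse)]


events = asyncio.run(collect())
```

Compared with collecting every event in `to_thread` and yielding afterwards, the consumer here sees the first event before the producer has finished.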
…n type, legacy API fix

- Remove client-side dedup in CloudBackend (server responsibility)
- Cloud streaming: real-time via asyncio.Queue instead of batch-then-yield
- Fix get_document_structure return type: dict -> list, not-found returns []
- Fix legacy page_index() API: use IndexConfig instead of deleted ConfigLoader
- Add folder upgrade warning (once only)
- Demo: always upload, no client-side caching
KylinMountain and others added 8 commits April 3, 2026 17:27
Local demo was missing LLM provider configuration, making it fail
on first run without clear guidance.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
….pageindex

Local-only params are now documented. Default storage_path changed from
~/.pageindex (global) to ./.pageindex (project-local) for better isolation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
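The effect of the new default can be sketched with `pathlib`. `resolve_storage_path` is a hypothetical helper; only the two directory names come from the commit message.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_storage_path(storage_path: Optional[Union[str, Path]] = None) -> Path:
    """Return an absolute storage directory.

    With no argument, the project-local ./.pageindex is used, so each
    working directory gets its own index. Passing "~/.pageindex" explicitly
    restores the older global location.
    """
    if storage_path is None:
        base = Path(".pageindex")
    else:
        base = Path(storage_path).expanduser()
    return base.resolve()


default_dir = resolve_storage_path()
global_dir = resolve_storage_path("~/.pageindex")
```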
Was only defined on LocalClient but called from PageIndexClient._init_local(),
causing AttributeError when using PageIndexClient directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
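The failure mode described in this commit can be reproduced in a few lines. The sketch assumes LocalClient subclasses PageIndexClient; the class names ending in `Sketch` and the method name `_detect_provider` are hypothetical, since the commit title naming the actual method is truncated.

```python
class BrokenBaseSketch:
    """Base class that calls a helper it does not define."""

    def __init__(self) -> None:
        self._init_local()

    def _init_local(self) -> None:
        # AttributeError here when instantiated directly: the helper exists
        # only on the subclass below.
        self.provider = self._detect_provider()


class BrokenLocalSketch(BrokenBaseSketch):
    def _detect_provider(self) -> str:
        return "local"


class FixedBaseSketch:
    """Same flow, with the helper defined where it is called."""

    def __init__(self) -> None:
        self._init_local()

    def _init_local(self) -> None:
        self.provider = self._detect_provider()

    def _detect_provider(self) -> str:
        return "default"


try:
    BrokenBaseSketch()
    direct_use_failed = False
except AttributeError:
    direct_use_failed = True
```

The subclass works in both versions, which is why the bug only surfaces when the base client is used directly.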
…PI improvements

- Extract images from PDF pages preserving text-image reading order
  using pymupdf get_text("dict") blocks. Images saved to
  files/{collection}/{doc_id}/images/ with relative paths in content.
- Add get_document_structure() and get_page_content() to Collection public API
- get_document() now returns structure; add include_text param to populate
  node text from page cache (WARNING in docstring: not for agent/LLM use)
- delete_document() cleans up images directory
- Agent system prompt instructs LLM to preserve image references in answers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
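A rough sketch of how text and images might be interleaved in block order follows. It operates on a dictionary shaped like the output of PyMuPDF's `page.get_text("dict")` as commonly documented (blocks with `type` 0 for text and 1 for images, image blocks carrying `image` bytes and `ext`), so PyMuPDF itself is not imported. The function name, the file naming scheme, and the Markdown image syntax are assumptions.

```python
from pathlib import Path
from typing import List


def page_dict_to_content(page_dict: dict, images_dir: Path, page_no: int) -> str:
    """Walk blocks in the order given, writing images and keeping their position."""
    images_dir.mkdir(parents=True, exist_ok=True)
    parts: List[str] = []
    image_count = 0
    for block in page_dict.get("blocks", []):
        if block.get("type") == 0:  # text block: join spans per line
            lines = [
                "".join(span["text"] for span in line["spans"])
                for line in block["lines"]
            ]
            text = "\n".join(lines).strip()
            if text:
                parts.append(text)
        elif block.get("type") == 1:  # image block: save bytes, emit a reference
            image_count += 1
            name = f"p{page_no}_img{image_count}.{block.get('ext', 'png')}"
            (images_dir / name).write_bytes(block["image"])
            # Path is relative to the document directory (assumption).
            parts.append(f"![image](images/{name})")
    return "\n\n".join(parts)
```

For a real page, `page_dict` would come from `page.get_text("dict")` and `images_dir` would point at the `files/{collection}/{doc_id}/images/` directory named in the commit message.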
Allows callers to specify where extracted PDF images are saved.
Default behavior unchanged (internal .pageindex/files/.../images/).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>