feat: add PageIndex SDK with local/cloud dual-mode support by KylinMountain · Pull Request #207 · VectifyAI/PageIndex

KylinMountain · 2026-04-01T09:18:19Z

Summary

Unified Python SDK for document indexing and retrieval, supporting both self-hosted (local) and fully-managed (cloud) modes.

Highlights

Dual-mode client: LocalClient (self-hosted, user LLM key) / CloudClient (fully managed, no LLM key)
Collection-based multi-document management with SHA-256 dedup
Streaming query: col.query(stream=True) returns async-iterable QueryStream
Pluggable protocols: DocumentParser, StorageEngine (SQLite default)
Cloud backend: actual PageIndex API with SSE streaming via chat/completions
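
The SHA-256 dedup highlight can be pictured with a small sketch. It assumes a document's identity is the SHA-256 digest of its raw bytes; `InMemoryCollection` and `add_bytes` are hypothetical names used only for illustration and are not part of the SDK.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the hex SHA-256 digest of a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


class InMemoryCollection:
    """Hypothetical stand-in for a collection; not the SDK's Collection class."""

    def __init__(self) -> None:
        # Maps content digest -> document id already stored for that content.
        self._doc_ids_by_hash: dict[str, str] = {}

    def add_bytes(self, name: str, data: bytes) -> tuple[str, bool]:
        """Return (doc_id, created); identical content reuses the existing doc_id."""
        digest = content_hash(data)
        if digest in self._doc_ids_by_hash:
            return self._doc_ids_by_hash[digest], False
        doc_id = digest[:12]  # short id derived from the hash, for illustration only
        self._doc_ids_by_hash[digest] = doc_id
        return doc_id, True
```

Keying on content rather than file name means a renamed copy of the same file would not be indexed twice under this assumption.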

Usage

from pageindex import LocalClient, CloudClient

# Local
client = LocalClient()
col = client.collection()
col.add("paper.pdf")
col.query("What is this about?")

# Cloud
client = CloudClient(api_key="pi-xxx")
col = client.collection()
col.add("paper.pdf")
col.query("What is this about?", stream=True)

- Critical: preserve text in markdown structure for fallback retrieval
- Cloud: SSE response close, folder cache dict, truncate error body
- Cloud: filter internal tools, async-safe streaming via to_thread
- SQLite: multi-thread connection tracking, context manager
- Security: collection name validation, parse_pages range cap
- Polish: use count_tokens wrapper, _EXAMPLES_DIR naming, QueryStream public
- Backend protocol: add @runtime_checkable
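
As a rough illustration of what `@runtime_checkable` buys a pluggable protocol, here is a minimal sketch. The method names on `DocumentParser` are assumptions (the PR names the protocol but this page does not show its methods), and `TxtParser` and `register` are hypothetical.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class DocumentParser(Protocol):
    """Hypothetical shape; the SDK's real method names may differ."""

    def supported_extensions(self) -> list[str]: ...

    def parse(self, path: str) -> str: ...


class TxtParser:
    """A plug-in parser that satisfies the protocol without inheriting from it."""

    def supported_extensions(self) -> list[str]:
        return [".txt"]

    def parse(self, path: str) -> str:
        with open(path, encoding="utf-8") as fh:
            return fh.read()


def register(parser: object) -> DocumentParser:
    # isinstance() works on a Protocol only because of @runtime_checkable.
    # It checks that the methods exist, not that their signatures match.
    if not isinstance(parser, DocumentParser):
        raise TypeError(f"{type(parser).__name__} does not implement DocumentParser")
    return parser
```

Without the decorator, the `isinstance` call in `register` would raise `TypeError` for every argument, so a backend could not reject a malformed plug-in up front.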

- Replace ConfigLoader + config.yaml with Pydantic IndexConfig
- Use bool for config flags (if_add_node_summary etc.) instead of "yes"/"no"
- Enable doc_description by default for better agent QA
- Early API key validation on LocalClient init via litellm provider detection
- Expose index_config parameter on LocalClient for advanced users
- Remove config.yaml dependency from pip package
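
The move from "yes"/"no" strings to real booleans can be sketched as follows. This is a dependency-free stand-in built on `dataclasses`, not the Pydantic `IndexConfig` from the PR. The field `if_add_doc_description`, the default values, and the `coerce_flag` helper are assumptions, and the PR does not say whether the real model still accepts legacy strings.

```python
from dataclasses import dataclass


def coerce_flag(value: object) -> bool:
    """Accept a real bool or a legacy "yes"/"no" string and return a bool."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str) and value.strip().lower() in {"yes", "no"}:
        return value.strip().lower() == "yes"
    raise ValueError(f"expected bool or 'yes'/'no', got {value!r}")


@dataclass
class IndexConfigSketch:
    """Illustrative only; the PR's IndexConfig is a Pydantic model."""

    if_add_node_summary: bool = True     # flag name from the PR; default assumed
    if_add_doc_description: bool = True  # assumed field name; PR enables doc_description by default

    def __post_init__(self) -> None:
        # Normalize legacy string values so downstream code only ever sees bools.
        self.if_add_node_summary = coerce_flag(self.if_add_node_summary)
        self.if_add_doc_description = coerce_flag(self.if_add_doc_description)
```

With real booleans, a check such as `if cfg.if_add_node_summary:` behaves as expected, whereas the string "no" would have been truthy.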

…aming
- Fix return type annotation: dict -> list (tree structure is a list)
- Fix not-found return: {} -> [] for consistency
- Cloud streaming: replace batch-then-yield with asyncio.Queue for true real-time event delivery via background thread

…n type, legacy API fix
- Remove client-side dedup in CloudBackend (server responsibility)
- Cloud streaming: real-time via asyncio.Queue instead of batch-then-yield
- Fix get_document_structure return type: dict -> list, not-found returns []
- Fix legacy page_index() API: use IndexConfig instead of deleted ConfigLoader
- Add folder upgrade warning (once only)
- Demo: always upload, no client-side caching
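
The change from batch-then-yield to an `asyncio.Queue` fed by a background thread might look roughly like the sketch below. It is not the PR's code: `stream_from_thread` and the `produce` callable are hypothetical, with `produce` standing in for a blocking SSE reader.

```python
import asyncio
import threading
from typing import AsyncIterator, Callable, Iterable

_DONE = object()  # sentinel marking the end of the stream


async def stream_from_thread(produce: Callable[[], Iterable[str]]) -> AsyncIterator[str]:
    """Yield items from a blocking producer as soon as each one is available."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def worker() -> None:
        try:
            for item in produce():
                # asyncio.Queue is not thread-safe, so hand each item to the loop.
                loop.call_soon_threadsafe(queue.put_nowait, item)
        except Exception as exc:
            # Forward producer errors so the consumer can re-raise them.
            loop.call_soon_threadsafe(queue.put_nowait, exc)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, _DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = await queue.get()
        if item is _DONE:
            return
        if isinstance(item, Exception):
            raise item
        yield item
```

A caller would consume it with `async for event in stream_from_thread(reader): ...`. Each event reaches the consumer as soon as the worker thread produces it, instead of after the whole response has been collected.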


….pageindex

Local-only params are now documented. Default storage_path changed from ~/.pageindex (global) to ./.pageindex (project-local) for better isolation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


…PI improvements
- Extract images from PDF pages preserving text-image reading order using pymupdf get_text("dict") blocks. Images saved to files/{collection}/{doc_id}/images/ with relative paths in content.
- Add get_document_structure() and get_page_content() to Collection public API
- get_document() now returns structure; add include_text param to populate node text from page cache (WARNING in docstring: not for agent/LLM use)
- delete_document() cleans up images directory
- Agent system prompt instructs LLM to preserve image references in answers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


KylinMountain force-pushed the feat/sdk branch 4 times, most recently from f4ca4c5 to 1369cf1 April 1, 2026 09:47

KylinMountain added 5 commits April 1, 2026 17:50

feat(sdk): add foundation — protocols, errors, events, config

b6b4b97

feat(sdk): add parser layer — PdfParser and MarkdownParser

c4e2cf8

feat(sdk): add SQLiteStorage with thread-safe connections

f37319c

feat(sdk): add unified index pipeline and migrate core utils

0e5028e

feat(sdk): add LocalClient, CloudClient, Collection, and AgentRunner

e011160

KylinMountain force-pushed the feat/sdk branch from 1369cf1 to 6786ce4 April 1, 2026 09:50

KylinMountain added 2 commits April 1, 2026 17:52

feat(sdk): add local and cloud demo examples

bc72166

test(sdk): add unit tests for all SDK layers

92974d3

KylinMountain force-pushed the feat/sdk branch from 6786ce4 to 92974d3 April 1, 2026 09:53

KylinMountain added 2 commits April 2, 2026 17:03

KylinMountain force-pushed the feat/sdk branch from 39ca529 to d77d967 April 2, 2026 09:05

KylinMountain added 3 commits April 2, 2026 17:27

fix: replace ConfigLoader with IndexConfig in legacy page_index() API

6d547ab

feat: add document dedup by name in CloudBackend

6a22262

KylinMountain force-pushed the feat/sdk branch from 611eea6 to 6a22262 April 2, 2026 09:57

KylinMountain force-pushed the feat/sdk branch from fe0a263 to 7eb9463 April 2, 2026 11:12

KylinMountain and others added 8 commits April 3, 2026 17:27

fix: early API key check via litellm, remove cloud dedup, clean up gi…

11911ff

…tignore

fix: add API key and model setup to local demo

c8b397d

Local demo was missing LLM provider configuration, making it fail on first run without clear guidance.

add get document structure

aaa394d

add public api

45c8c61

feat: expose index_config parameter in PageIndexClient.__init__

236dcb2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: move _validate_llm_provider to PageIndexClient base class

aaac970

Was only defined on LocalClient but called from PageIndexClient._init_local(), causing AttributeError when using PageIndexClient directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: support custom images_dir via IndexConfig

965de9b

Allows callers to specify where extracted PDF images are saved. Default behavior unchanged (internal .pageindex/files/.../images/).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

KylinMountain wants to merge 22 commits into VectifyAI:main from KylinMountain:feat/sdk
