CustomKB: Production-Ready AI Knowledgebase System

CustomKB transforms your documents into AI-powered, searchable knowledgebases with state-of-the-art embedding models, vector search, and language models to deliver contextually relevant answers from your data.

Key Features

Core Capabilities

  • Semantic Search: Find information by meaning, not just keywords
  • Multi-Provider AI: OpenAI, Anthropic, Google, xAI, and local models via Ollama
  • Universal Document Support: Process Markdown, HTML, code, PDFs, and plain text
  • 27+ Language Support: Multi-language processing with automatic detection
  • Hybrid Search: Combines vector similarity with BM25 keyword matching
  • Cross-Encoder Reranking: Boosts accuracy by 20-40% with advanced models
  • Enterprise Security: Hardened against injection attacks, safe serialization (no pickle), input validation, path protection

Performance & Scale

  • Memory-Optimized Tiers: Automatically adapts from 4GB to 128GB+ systems
  • GPU Acceleration: CUDA support for faster reranking
  • Concurrent Processing: Batch operations with configurable thread pools
  • Smart Caching: Two-tier cache system with LRU eviction
  • Production Ready: Checkpoint saving, automatic retries, graceful error handling

How It Works

CustomKB follows a three-stage pipeline to transform your documents into an intelligent knowledgebase:

1. Document Processing
   ├─ Text extraction (Markdown, HTML, PDF, code, plain text)
   ├─ Language detection (27+ languages)
   ├─ Intelligent chunking (200-400 tokens, context-aware)
   └─ Metadata extraction (filenames, categories, timestamps)

2. Embedding Generation
   ├─ Vector embeddings via OpenAI, Google, or local models
   ├─ Batch processing with checkpoints
   ├─ FAISS index creation for fast similarity search
   └─ Optional BM25 index for hybrid search

3. Semantic Search & Query
   ├─ Query embedding generation
   ├─ Vector similarity search (k-NN via FAISS)
   ├─ Optional: Hybrid search (vector + BM25 keyword matching)
   ├─ Optional: Cross-encoder reranking for precision
   ├─ Context assembly from top results
   └─ LLM response generation with retrieved context

Why This Approach Works:

  • Semantic Understanding: Vector embeddings capture meaning, not just keywords
  • Hybrid Accuracy: Combining vector and keyword search catches both conceptual and exact matches
  • Reranking Precision: Cross-encoders evaluate query-document pairs for superior relevance
  • Efficient Retrieval: FAISS enables sub-millisecond search across millions of vectors

Prerequisites

  • Python: 3.12 or higher
  • SQLite: 3.45+ (usually included with Python)
  • RAM: 4GB+ (8GB+ recommended for optimal performance)
  • GPU (optional): NVIDIA GPU with CUDA 11 or 12 for acceleration
  • API Keys: For your chosen AI providers (OpenAI, Anthropic, Google, xAI)
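
A quick way to sanity-check these prerequisites before installing (a convenience sketch, not part of CustomKB):

python3 --version        # expect Python 3.12+
python3 -c "import sqlite3; print(sqlite3.sqlite_version)"   # expect 3.45+
nvidia-smi               # optional: confirms an NVIDIA GPU and driver are visible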

Installation

1. Clone Repository

git clone https://github.com/Open-Technology-Foundation/customkb.git
cd customkb

2. Setup Virtual Environment

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

3. Install FAISS

# Automatic installation (detects GPU and CUDA version)
./setup/install_faiss.sh

# Or manual installation:
# CPU-only: pip install -r requirements-faiss-cpu.txt
# GPU (CUDA 12): pip install -r requirements-faiss-gpu-cu12.txt
# GPU (CUDA 11): pip install -r requirements-faiss-gpu-cu11.txt

# Force specific variant:
# FAISS_VARIANT=cpu ./setup/install_faiss.sh

4. Install NLTK Data

sudo ./setup/nltk_setup.py download cleanup

5. Setup Knowledgebase Directory

Choose between a system-wide and a user-local installation:

Option A: System-wide (requires sudo)

sudo mkdir -p /var/lib/vectordbs
sudo chown $USER:$USER /var/lib/vectordbs
export VECTORDBS="/var/lib/vectordbs"

Option B: User-local (no sudo required, recommended)

mkdir -p "$HOME/knowledgebases"
export VECTORDBS="$HOME/knowledgebases"

Add to your shell profile (~/.bashrc, ~/.zshrc, etc.):

export VECTORDBS="$HOME/knowledgebases"  # or /var/lib/vectordbs

6. Configure API Keys

export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export GOOGLE_API_KEY="your-google-key"      # Optional
export XAI_API_KEY="your-xai-key"            # Optional

Add these to your shell profile for persistence.
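
For example, with bash (substitute your real keys and preferred profile file):

cat >> ~/.bashrc << 'EOF'
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
EOF
source ~/.bashrc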

Quick Start

Create Your First Knowledgebase

# 1. Create knowledgebase directory
mkdir -p "$VECTORDBS/myproject"

# 2. Create configuration
cat > "$VECTORDBS/myproject/myproject.cfg" << 'EOF'
[DEFAULT]
vector_model = text-embedding-3-small
query_model = gpt-4o-mini
db_min_tokens = 200
db_max_tokens = 400
EOF

# 3. Process documents (from your project directory)
customkb database myproject docs/*.md *.txt

# 4. Generate embeddings
customkb embed myproject

# 5. Query your knowledgebase
customkb query myproject "What are the main features?"

That's it! Your knowledgebase is ready to answer questions about your documents.

Core Commands

database - Import Documents

customkb database <kb_name> [files...] [options]

Process and store text files in the knowledgebase.

Options:

  • -l, --language: Stopwords language (en, fr, de, etc.)
  • --detect-language: Auto-detect language per file
  • -f, --force: Reprocess existing files
  • -v, --verbose: Detailed output

Examples:

# Process all markdown files
customkb database myproject ~/docs/**/*.md

# Auto-detect language for multilingual docs
customkb database myproject ~/docs/ --detect-language

# Force reprocess existing files
customkb database myproject ~/docs/*.md --force

embed - Generate Embeddings

customkb embed <kb_name> [options]

Create vector embeddings for all text chunks.

Options:

  • -r, --reset-database: Reset embedding status
  • -v, --verbose: Show progress

Examples:

# Generate embeddings with progress
customkb embed myproject --verbose

# Reset and regenerate all embeddings
customkb embed myproject --reset-database

query - Search & Ask Questions

customkb query <kb_name> "<question>" [options]

Perform semantic search and generate AI responses.

Options:

  • -c, --context-only: Return only context, no AI response
  • -m, --model: AI model to use
  • -k, --top-k: Number of results (default: 50)
  • -t, --temperature: Response creativity (0-2)
  • -f, --format: Output format (xml, json, markdown, plain)
  • -p, --prompt-template: Response style template

Examples:

# Simple query
customkb query myproject "How does authentication work?"

# Advanced query with specific model
customkb query myproject "Explain the architecture" \
  --model claude-sonnet-4-5 \
  --format json \
  --prompt-template technical

# Get context only (no LLM response)
customkb query myproject "Find authentication docs" --context-only

categorize - AI-Powered Document Categorization

customkb categorize <kb_name> [options]

Automatically categorize articles using AI models.

Options:

  • -S, --sample N: Process only N sample articles
  • -f, --full: Process all articles
  • --fresh: Ignore checkpoint, reprocess all articles
  • --import: Import categories to database after processing
  • --list: List existing categories and counts
  • -m, --model: AI model to use (default: claude-haiku-4-5)
  • -s, --sampling T-M-B: Chunk sampling config (e.g., 5-10-5)
  • -M, --max-concurrent: Maximum concurrent API requests (default: 5)
  • -c, --confidence-threshold: Minimum confidence (default: 0.5)
  • -D, --no-dedup: Disable category deduplication

Examples:

# Categorize with import to database
customkb categorize myproject --full --import

# Process 10 sample articles
customkb categorize myproject --sample 10

# List existing categories
customkb categorize myproject --list

edit - Edit KB Configuration

customkb edit <kb_name>

Open the knowledgebase configuration file in $EDITOR.

Examples:

customkb edit myproject

convert-encoding - Convert Files to UTF-8

customkb convert-encoding <files...> [options]

Convert text files to UTF-8 encoding in-place.

Options:

  • -r, --recursive: Process directories recursively
  • --dry-run: Preview changes without converting
  • --no-backup: Convert without creating backups
  • -v, --verbose: Detailed output

Examples:

# Convert all .txt files
customkb convert-encoding *.txt

# Recursive conversion with preview
customkb convert-encoding docs/ --recursive --dry-run

Configuration

CustomKB uses INI-style configuration with environment variable overrides.

Priority Order

  1. Environment variables (highest)
  2. Configuration file (.cfg)
  3. Default values (lowest)

Example Configuration

[DEFAULT]
# Models
vector_model = text-embedding-3-small
query_model = gpt-4o-mini

# Text Processing
db_min_tokens = 200          # Minimum chunk size
db_max_tokens = 400          # Maximum chunk size

# Query Settings
query_max_tokens = 4096      # Max tokens in LLM response
query_top_k = 30             # Number of chunks to retrieve
query_temperature = 0.1      # LLM creativity (0=precise, 2=creative)
query_role = You are a helpful expert assistant.

# Output Format
reference_format = json      # xml, json, markdown, plain
query_prompt_template = technical  # Response style

[ALGORITHMS]
# Search Configuration
similarity_threshold = 0.6   # Minimum similarity score (0-1)
enable_hybrid_search = true  # Combine vector + keyword search
bm25_weight = 0.5           # Weight for BM25 in hybrid mode
bm25_max_results = 1000     # Max results from BM25

# Reranking
enable_reranking = true      # Use cross-encoder for precision
reranking_model = cross-encoder/ms-marco-MiniLM-L-6-v2
reranking_top_k = 30         # Rerank top N results

[PERFORMANCE]
# Optimization
embedding_batch_size = 100   # Chunks per batch
cache_thread_pool_size = 4   # Concurrent cache operations
memory_cache_size = 10000    # LRU cache entries
checkpoint_interval = 10      # Save every N batches

[API]
# Rate Limiting
api_call_delay_seconds = 0.05  # Delay between API calls
api_max_concurrency = 8        # Parallel API requests
api_max_retries = 20           # Retry attempts for failed calls

Configuration Tips

  • db_min_tokens/db_max_tokens: Controls chunk size. Smaller = more precise, larger = more context
  • similarity_threshold: Lower (0.5) for broader results, higher (0.7) for strict relevance
  • enable_hybrid_search: Enable for technical docs, disable for narrative content
  • query_temperature: 0.0-0.3 for factual, 0.7-1.0 for creative responses
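
Putting these tips together, a knowledgebase of narrative content might start from settings like these (an illustrative baseline, not a prescribed recipe; "stories" is a placeholder name):

cat > "$VECTORDBS/stories/stories.cfg" << 'EOF'
[DEFAULT]
vector_model = text-embedding-3-small
query_model = gpt-4o-mini
db_min_tokens = 300
db_max_tokens = 500
query_temperature = 0.7

[ALGORITHMS]
enable_hybrid_search = false
similarity_threshold = 0.5
EOF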

Advanced Features

Supported Models

Language Models (LLMs)

OpenAI

  • GPT-5.x series (5, 5-mini, 5-nano, 5-pro, 5.1, 5.2)
  • GPT-4.1, GPT-4.1-mini, GPT-4.1-nano (1M context)
  • GPT-4o, GPT-4o-mini (128k context)
  • o3, o4-mini (advanced reasoning)

Anthropic

  • Claude Opus 4.5 (200k context, extended thinking)
  • Claude Sonnet 4.5, Haiku 4.5 (200k context)
  • Claude Opus 4.1 (200k context)

Google

  • Gemini 3.x Pro/Flash (preview)
  • Gemini 2.5 Pro/Flash/Flash-Lite (thinking models, 1M+ context)
  • Gemini 1.5 Pro/Flash-8B

xAI

  • Grok 4, Grok 4-fast (256k-2M context, reasoning)

Local (Ollama)

  • Llama 3.3 (8B-70B)
  • Gemma 3 (4B-27B)
  • DeepSeek R1, Qwen 2.5, Mistral, Phi-4

Embedding Models

OpenAI

  • text-embedding-3-large (3072 dims, best quality)
  • text-embedding-3-small (1536 dims, cost-effective)
  • text-embedding-ada-002 (1536 dims, legacy)

Google

  • gemini-embedding-001 (768/1536/3072 dims)
    • 68% MTEB score vs 64.6% for OpenAI
    • 30k token context vs 8k
    • Matryoshka Representation Learning

Prompt Templates

Customize response styles:

customkb query myproject "question" --prompt-template <template>

Available templates:

  • default: Balanced, helpful responses
  • instructive: Step-by-step explanations
  • scholarly: Academic, citation-rich
  • concise: Brief, to-the-point
  • analytical: Deep analysis with reasoning
  • conversational: Friendly, approachable
  • technical: Precise, developer-focused
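
The same question can be asked in different styles, for example:

# Brief answer for a quick lookup
customkb query myproject "How are chunks sized?" --prompt-template concise

# Citation-rich answer for documentation work
customkb query myproject "How are chunks sized?" --prompt-template scholarly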

Output Formats

Control how results are formatted:

# JSON for APIs
customkb query myproject "search" --format json

# XML with structured references
customkb query myproject "search" --format xml

# Markdown for documentation
customkb query myproject "search" --format markdown

# Plain text
customkb query myproject "search" --format plain
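
The JSON format is convenient for piping into other tools, assuming the response is emitted as a single JSON document:

# Pretty-print the JSON response with jq (if installed)
customkb query myproject "search" --format json | jq .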

Category Filtering

Organize and filter results by categories:

# Categorize documents
customkb categorize myproject --import

# Query with category filters
customkb query myproject "query" --categories "Technical,Legal"
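
A typical workflow is to inspect the assigned categories first, then filter on one of them:

# See which categories exist and their counts
customkb categorize myproject --list

# Restrict retrieval to a single category
customkb query myproject "termination clauses" --categories "Legal"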

Multi-Language Support

# Process with specific language
customkb database myproject docs/*.txt --language french

# Auto-detect languages (recommended for multilingual docs)
customkb database myproject docs/ --detect-language

Supported languages: English, French, German, Spanish, Italian, Portuguese, Dutch, Swedish, Norwegian, Danish, Finnish, Russian, Turkish, Arabic, Hebrew, Japanese, Chinese, Korean, and more.

Security

CustomKB implements enterprise-grade security measures to protect your data and systems.

Security Features

Safe Serialization

  • ✓ Zero pickle deserialization vulnerabilities
  • ✓ JSON format for reranking cache (human-readable, secure)
  • ✓ JSON format for categorization checkpoints
  • ✓ NPZ + JSON hybrid for BM25 indexes (efficient + secure)
  • ✓ Automatic migration from legacy pickle formats

Injection Prevention

  • ✓ SQL injection protection via table name validation
  • ✓ Path traversal protection in file operations
  • ✓ Input validation for all user-provided parameters
  • ✓ Parameterized queries for database operations

API Security

  • ✓ API key validation and secure storage
  • ✓ Environment variable-based configuration
  • ✓ No API keys in logs or error messages
  • ✓ Secure credential handling

Data Protection

  • ✓ Database integrity checks
  • ✓ Atomic operations with rollback support
  • ✓ Backup support for critical operations
  • ✓ File permission validation

Security Best Practices

When deploying CustomKB in production:

  1. API Keys: Store in environment variables, never in code or config files
  2. File Permissions: Restrict knowledgebase directories to the application user only (see the sketch after this list)
  3. Network Access: Run on localhost or behind an authentication proxy
  4. Updates: Regularly check for security patches
  5. Backups: Enable automatic backups before migrations
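
For item 2, a minimal sketch assuming the knowledgebases are owned by a dedicated application user:

# Keep owner read/write (and directory execute); drop all group/other access
chmod -R u=rwX,go-rwx "$VECTORDBS"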

Reporting Security Issues

If you discover a security vulnerability:

  1. Do not create a public GitHub issue
  2. Email security concerns to: [Create issue for security contact]
  3. Include:
    • Steps to reproduce
    • Potential impact assessment
    • Suggested remediation (if any)
  4. Allow reasonable time for patching before public disclosure

See commit history for detailed security update information.

Performance Optimization

Auto-Optimization

# Analyze system and show recommendations
customkb optimize --analyze

# Apply optimizations automatically
customkb optimize myproject

# Preview changes without applying
customkb optimize myproject --dry-run

Memory Tiers

CustomKB automatically configures itself based on available memory:

Memory     Tier        Features                         Batch Size   Cache Size
<16GB      Low         Conservative, no hybrid search   50           5,000
16-64GB    Medium      Balanced, moderate caching       100          10,000
64-128GB   High        Large batches, hybrid search     200          20,000
>128GB     Very High   Maximum performance              300          50,000

Database Indexes

# Verify performance indexes
customkb verify-indexes myproject

# Build BM25 hybrid search index
customkb bm25 myproject

GPU Acceleration

CustomKB automatically detects and uses NVIDIA GPUs for:

  • Cross-encoder reranking (20-40% faster)
  • FAISS index search (GPU-enabled builds)

# Benchmark GPU vs CPU performance
./scripts/benchmark_gpu.py

# Monitor GPU usage during operations
./scripts/gpu_monitor.sh

Troubleshooting

Common Issues

"Knowledgebase not found"

# Verify KB exists
ls -la $VECTORDBS/

# Check for .cfg file
ls -la $VECTORDBS/myproject/myproject.cfg

# Error message shows available KBs
customkb query nonexistent "test"

"API rate limit exceeded"

# Increase delay between calls in config
api_call_delay_seconds = 0.1
api_max_concurrency = 4

"Out of memory during embedding"

# Run optimizer for your system
customkb optimize myproject

# Or manually reduce batch size in config
embedding_batch_size = 50

"Low similarity scores" or poor results

# Try lower threshold
similarity_threshold = 0.5

# Enable hybrid search
enable_hybrid_search = true

# Or use stronger embedding model
vector_model = text-embedding-3-large

"Import failed: unsupported file type"

# CustomKB supports: .md, .txt, .html, .pdf
# Convert other formats to supported types first

# For code files, use .txt extension or markdown fenced blocks

Debug Mode

# Enable verbose logging
customkb query myproject "test" -v

# Check detailed logs
tail -f $VECTORDBS/myproject/logs/myproject.log

# Run diagnostics
./scripts/diagnose_crashes.py myproject

Knowledgebase Structure

All knowledgebases live in $VECTORDBS:

$VECTORDBS/
├── myproject/
│   ├── myproject.cfg       # Configuration (required)
│   ├── myproject.db        # SQLite database with chunks
│   ├── myproject.faiss     # FAISS vector index
│   ├── myproject.bm25      # BM25 index (optional, for hybrid search)
│   ├── .rerank_cache/      # Reranking cache (optional)
│   └── logs/               # Runtime logs

Name Resolution

The system intelligently resolves KB names:

# All resolve to the same KB:
customkb query myproject "test"
customkb query myproject.cfg "test"
customkb query $VECTORDBS/myproject "test"
customkb query $VECTORDBS/myproject/myproject.cfg "test"
# → All use $VECTORDBS/myproject/myproject.cfg

Utility Scripts

Located in the scripts/ directory:

Performance & Optimization

  • show_optimization_tiers.py - Display memory tier settings
  • emergency_optimize.py - Conservative recovery settings
  • clean_corrupted_cache.py - Clean corrupted cache files

GPU

  • benchmark_gpu.py - Compare GPU vs CPU performance
  • gpu_monitor.sh - Real-time GPU utilization monitoring
  • gpu_env.sh - GPU environment setup

Maintenance

  • rebuild_bm25_filtered.py - Rebuild BM25 indexes with filters
  • upgrade_bm25_tokens.py - Upgrade database for BM25 tokens
  • diagnose_crashes.py - Analyze crash logs and system state
  • update_dependencies.py - Check and update Python dependencies
  • security-check.sh - Run security validation checks
  • emergency_cleanup.sh - Emergency cleanup operations
  • test_cuda.sh - CUDA availability testing

Testing

# Install test dependencies
pip install -r requirements-test.txt

# Run all tests
python run_tests.py

# Run specific test suites
python run_tests.py --unit         # Unit tests only
python run_tests.py --integration  # Integration tests only

# Run with safety limits (recommended for CI)
python run_tests.py --safe --memory-limit 2048

# Generate coverage report
python run_tests.py --coverage

Frequently Asked Questions

General

Q: Can I use CustomKB without any API keys?

A: Yes! Use local Ollama models for both embeddings and queries. No external API calls required. Performance depends on your local hardware.

Q: How much does it cost to process documents?

A: Costs vary by provider and model:

  • OpenAI text-embedding-3-small: $0.02 per 1M tokens (~750k words)
  • Google gemini-embedding-001: $0.15 per 1M tokens
  • Local Ollama models: Free (just electricity)

Example: A 500-page technical manual (~250k tokens) costs about $0.005 to embed with OpenAI.

Q: Is my data private and secure?

A: Your documents stay local. Only text chunks are sent to API providers during embedding and query operations. The full document contents never leave your system. For maximum privacy, use local Ollama models.

Q: What's the difference between CustomKB and vector databases like Pinecone?

A: CustomKB is a complete RAG (Retrieval-Augmented Generation) system including:

  • Document processing pipeline
  • Embedding generation
  • Vector + hybrid search
  • LLM integration
  • Response generation

Vector databases only handle storage and retrieval. You'd need to build the rest yourself.

Technical

Q: Can I use multiple embedding models in one knowledgebase?

A: No, each knowledgebase uses one embedding model. To switch models, create a new KB or regenerate embeddings with --reset-database.
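
For example, to move an existing KB to a stronger embedding model (model names as listed under Supported Models):

# 1. Point vector_model at the new model in the KB config
customkb edit myproject        # set: vector_model = text-embedding-3-large

# 2. Rebuild all embeddings against the new model
customkb embed myproject --reset-database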

Q: How do I update my knowledgebase when documents change?

A: Re-run the database command with the updated files, then regenerate embeddings:

customkb database myproject docs/*.md
customkb embed myproject

Only changed and new files are reprocessed by default; add --force to also rebuild files that haven't changed.

Q: What's the maximum knowledgebase size?

A: Tested up to 10M+ chunks (~4GB database). FAISS scales to billions of vectors. Practical limits depend on your RAM and disk space.

Q: Can I run CustomKB in a Docker container?

A: Yes, though there is no official Docker image yet. Use a Python 3.12+ base image, install the dependencies, and mount your $VECTORDBS directory as a volume.
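
A hypothetical starting point, assuming a local checkout and the stock python:3.12-slim image:

docker run --rm -it \
  -v "$PWD:/app" \
  -v "$VECTORDBS:/kb" \
  -e VECTORDBS=/kb \
  -e OPENAI_API_KEY \
  -w /app \
  python:3.12-slim bash
# then, inside the container:
#   pip install -r requirements.txt && ./setup/install_faiss.sh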

Q: Does CustomKB support real-time document monitoring?

A: Not yet. You manually trigger document processing. Consider using filesystem watchers (inotify) to trigger updates automatically.
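
A rough sketch using inotify-tools (a separate package, not part of CustomKB):

# Re-ingest and re-embed whenever a file under ~/docs finishes being written
while inotifywait -r -e close_write ~/docs; do
  customkb database myproject ~/docs/**/*.md
  customkb embed myproject
done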

Contributing

We welcome contributions from the community! Whether you're fixing bugs, adding features, improving documentation, or sharing ideas, your help makes CustomKB better for everyone.

Ways to Contribute

  • Report Bugs: Open an issue
  • Suggest Features: Open an issue
  • Improve Documentation: Fix typos, clarify instructions, add examples
  • Submit Code: Bug fixes, new features, performance improvements
  • Share Knowledge: Answer questions, write tutorials, create examples

Quick Start for Contributors

  1. Fork the repository on GitHub

  2. Clone your fork

    git clone https://github.com/YOUR-USERNAME/customkb.git
    cd customkb

  3. Create a feature branch

    git checkout -b feature/amazing-feature

  4. Set up development environment

    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
    pip install -r requirements-test.txt

  5. Make your changes

    • Write clean, documented code
    • Follow existing code style
    • Add tests for new features
    • Update documentation as needed

  6. Run tests

    python run_tests.py
    python run_tests.py --coverage

  7. Commit your changes

    git add .
    git commit -m "Add amazing feature"

  8. Push to your fork

    git push origin feature/amazing-feature

  9. Open a Pull Request

    • Go to the original repository
    • Click "New Pull Request"
    • Select your branch
    • Describe your changes clearly

Development Guidelines

  • Code Style: Follow PEP 8 for Python code
  • Type Hints: Use type annotations for function signatures
  • Testing: Maintain or improve test coverage
  • Documentation: Update README and docstrings
  • Commits: Write clear, descriptive commit messages

Code of Conduct

  • Be respectful and inclusive
  • Welcome newcomers and different perspectives
  • Focus on what's best for the community
  • Show empathy towards others

Need Help?

  • Join discussions in GitHub Discussions
  • Ask questions in issues (label with question)
  • Review existing PRs to see the process

Support & Community

Stay Updated

  • Releases: Watch the repository for release notifications
  • Changelog: See commit history for version details
  • Security: Check Security section for vulnerability reporting

Complete Example

Building a Production Knowledgebase

Here's a complete workflow for creating a production-ready knowledgebase:

# 1. Setup environment
export VECTORDBS="$HOME/knowledgebases"
export OPENAI_API_KEY="your-key-here"

# 2. Create KB directory
mkdir -p "$VECTORDBS/techbase"
cd "$VECTORDBS/techbase"

# 3. Create optimized configuration
cat > techbase.cfg << 'EOF'
[DEFAULT]
vector_model = text-embedding-3-small
query_model = gpt-4o-mini
db_min_tokens = 250
db_max_tokens = 500

[ALGORITHMS]
enable_hybrid_search = true
enable_reranking = true
similarity_threshold = 0.65
bm25_weight = 0.5

[PERFORMANCE]
embedding_batch_size = 100
memory_cache_size = 20000
checkpoint_interval = 10
EOF

# 4. Process documents with language detection
customkb database techbase ~/docs/**/*.md --detect-language --verbose

# 5. Generate embeddings with progress
customkb embed techbase --verbose

# 6. Build hybrid search index
customkb bm25 techbase

# 7. Optimize for your system
customkb optimize techbase

# 8. Verify everything is set up correctly
customkb verify-indexes techbase

# 9. Test with sample queries
customkb query techbase "What are the best practices?" \
  --prompt-template technical \
  --format markdown

# 10. Test context-only retrieval
customkb query techbase "authentication implementation" \
  --context-only \
  --top-k 10

Quick Reference

Environment Variables

OPENAI_API_KEY       # OpenAI API key
ANTHROPIC_API_KEY    # Anthropic API key
GOOGLE_API_KEY       # Google/Gemini API key
XAI_API_KEY          # xAI API key
VECTORDBS            # Knowledgebase base directory
NLTK_DATA            # NLTK data location (optional)

Model Aliases

# Embedding models
text-embedding-3-small   → OpenAI small (1536 dims)
text-embedding-3-large   → OpenAI large (3072 dims)
gemini-embedding-001     → Google Gemini (configurable dims)

# LLM models (examples)
gpt-5-mini               → OpenAI GPT-5 Mini (latest)
gpt-4o-mini              → OpenAI GPT-4o Mini (cost-effective)
claude-opus-4-5          → Anthropic Claude Opus 4.5
claude-sonnet-4-5        → Anthropic Claude Sonnet 4.5
gemini-2.5-flash         → Google Gemini 2.5 Flash
grok-4                   → xAI Grok 4
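
Any of these aliases can be passed directly to the query command:

customkb query myproject "Summarize the architecture" -m claude-sonnet-4-5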

Performance Tips

  • Large datasets: Increase embedding_batch_size up to system limits
  • Technical content: Enable enable_hybrid_search = true
  • GPU available: Install FAISS GPU variant for 2-4x speedup
  • Low memory: Run customkb optimize to adjust for your system
  • Better accuracy: Enable reranking, lower similarity threshold
  • Faster queries: Increase cache size, disable reranking for speed

License

GPL-3.0 License - see LICENSE file for details.

Copyright © 2025-2026 Indonesian Open Technology Foundation


Actively maintained by the Indonesian Open Technology Foundation
