LEGAL-AI-LLM

A Retrieval-Augmented Generation (RAG) legal assistant built with Flask, SambaNova's Llama-4-Maverick model, sentence-transformers embeddings, and ChromaDB. Users can either ask legal questions against a pre-indexed corpus of legal documents or upload their own PDF, DOCX, or TXT files and have the system answer questions grounded in those uploads.

Overview

This project answers legal questions using Retrieval-Augmented Generation. Instead of relying solely on the parametric memory of a large language model, the system first retrieves relevant passages from a vector index of legal text, then conditions the LLM on those passages when generating its response. The two query modes are exposed through a Flask web UI: the "ask a question" mode hits the persistent corpus index, while the "upload and query" mode builds a transient per-request index from the uploaded files.

RAG is a sensible fit for legal assistance because legal answers depend heavily on the exact wording of statutes, case law, contract clauses, or filings. A general-purpose LLM can hallucinate plausible-sounding citations and clauses; grounding the model in retrieved chunks reduces that risk and lets the answer point back to its source documents. The project also implements a guard rail (is_legal_query in app.py:20) that uses the same LLM as a binary classifier to reject non-legal queries before any retrieval or final generation runs.

The scope is intentionally narrow. This is a student project demonstrating an end-to-end RAG pipeline: ingestion of mixed-format legal documents (JSON, CSV, PDF), sentence-aware sliding-window chunking, dense embedding with all-mpnet-base-v2, persistent vector storage in ChromaDB, retrieval through LlamaIndex, and streaming generation from a hosted Llama-4 model. It is not a production legal-advice service, as the Limitations and Future Work section makes clear.

Key Features

Two query paths: corpus-grounded Q&A and per-upload Q&A
Multi-file upload with grouped, source-attributed answers (app.py:168-177)
Sentence-aware sliding-window chunking with word-level overlap (embed_and_index.py:21-49)
Persistent ChromaDB vector store at index/chroma (embed_and_index.py:16, index_loader.py:12)
Streaming token output to the browser (app.py:94-97, llm_client.py:39-41)
LLM-as-classifier guard rail blocking non-legal queries (app.py:20-45)
Markdown-formatted, structured responses with per-document headings on multi-file queries (app.py:179-190)
Loaders for JSON, CSV, and PDF source data (load_all_data.py)
Configuration via a single .env file containing the SambaNova API key

How It Works

The pipeline has two phases: a one-time offline ingestion phase and a per-request online query phase.

Offline ingestion

load_all_data.py walks data/jsons, data/csvs, and data/pdfs and returns a list of (filename, raw_text) tuples. JSON arrays are flattened to key: value lines, CSVs are stringified through pandas, and PDFs are extracted with pdfplumber.
embed_and_index.py consumes that list. Each document is split into sentences with NLTK's sent_tokenize, then accumulated into chunks of up to max_tokens=500 words with overlap=100 words between adjacent chunks (embed_and_index.py:21). Chunks shorter than 30 characters are skipped (embed_and_index.py:63).
Chunks are encoded in batches of 96 with sentence-transformers/all-mpnet-base-v2 (embed_and_index.py:53, embed_and_index.py:78-89) and upserted into the legal_docs ChromaDB collection along with source_file and chunk_id metadata.
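
The chunking step described above can be sketched as follows. This is a minimal illustration and not the code in embed_and_index.py: chunk_by_sentences and split_sentences are hypothetical names, and a simple regex stands in for NLTK's sent_tokenize so the example runs without extra downloads. Sizes are counted in whitespace-split words, as in the description.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Stand-in for nltk.sent_tokenize: split after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_by_sentences(text: str, max_words: int = 500, overlap: int = 100) -> List[str]:
    """Pack whole sentences into chunks of at most max_words words.

    Each new chunk starts with the trailing sentences of the previous chunk,
    up to `overlap` words. A single sentence longer than max_words still
    becomes its own (oversized) chunk.
    """
    chunks: List[str] = []
    current: List[str] = []  # sentences in the chunk being built
    count = 0                # words currently in `current`

    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward while they fit in the overlap budget.
            carried: List[str] = []
            carried_words = 0
            for prev in reversed(current):
                w = len(prev.split())
                if carried_words + w > overlap:
                    break
                carried.insert(0, prev)
                carried_words += w
            # Drop the carry if it would push the next chunk over the limit.
            if carried_words + n > max_words:
                carried, carried_words = [], 0
            current, count = carried, carried_words
        current.append(sentence)
        count += n

    if current:
        chunks.append(" ".join(current))
    return chunks
```

Carrying whole sentences rather than a fixed number of words keeps every chunk boundary on a sentence boundary, at the cost of an overlap that can be somewhat smaller than the configured value.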

Online query (corpus mode, `/query`)

The browser POSTs the query to /query (app.py:80).
is_legal_query runs the same Llama-4 model as a one-shot LEGAL / NON-LEGAL classifier on the user's text (app.py:28-45). If the verdict is NON-LEGAL, the route returns HTTP 422 with a guard-error JSON body (app.py:86-90).
index_loader.retriever (a LlamaIndex VectorStoreIndex.as_retriever(similarity_top_k=3) over the same Chroma collection, index_loader.py:27-28) returns the top three chunks for the query.
build_prompt joins those chunks into a numbered context block and wraps them in a system instruction telling the model to answer in Markdown (app.py:48-59).
run_llm_query opens a streaming chat completion against the SambaNova endpoint and yields tokens as they arrive (llm_client.py:16-43); Flask streams them to the browser through stream_with_context (app.py:94-97).
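
The prompt-assembly step above might look roughly like the sketch below. build_numbered_prompt is a hypothetical name and the instruction wording is an assumption; the actual text used by build_prompt in app.py is not reproduced here.

```python
from typing import Sequence


def build_numbered_prompt(question: str, chunks: Sequence[str]) -> str:
    """Join retrieved chunks into a numbered context block and append the question."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    # The instruction sentence is illustrative only, not the project's prompt.
    return (
        "You are a legal assistant. Answer in Markdown, using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}\n"
    )
```

With the corpus-mode retriever configured for top_k=3, chunks would hold the three retrieved passages in rank order.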

Online query (upload mode, `/upload_and_query`)

The route accepts a list of files via request.files.getlist("document") (app.py:103).
Each accepted file (.pdf, .docx, or .txt) is saved to a temporary directory, converted to text with PyMuPDF, python-docx, or a plain file read (app.py:61-73), and chunked with the same chunk_text_hybrid used at ingestion.
The guard classifier runs against the user's query plus a 500-character snippet from the first uploaded document (app.py:24-26, app.py:136).
A fresh in-memory ChromaDB collection (temp_upload_combined, app.py:145-156) is created, populated with the embedded chunks, and queried with n_results=15 (app.py:159-163).
The 15 retrieved chunks are grouped by source_document (app.py:168-171) and rendered into a structured prompt that asks the model to produce a separate ### From <filename> section per relevant document (app.py:179-190).
Tokens stream back to the browser exactly as in corpus mode.
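
The grouping and per-document prompt described above can be sketched like this. Both function names and the prompt wording are assumptions made for illustration; only the source_document metadata key and the ### From <filename> heading convention come from the description. The inputs are flat lists, which for a single-query ChromaDB result correspond to result["documents"][0] and result["metadatas"][0].

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_by_source(documents: Sequence[str], metadatas: Sequence[dict]) -> Dict[str, List[str]]:
    """Group retrieved chunk texts by their source_document metadata, keeping rank order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, meta in zip(documents, metadatas):
        grouped[meta.get("source_document", "unknown")].append(text)
    return dict(grouped)


def build_multi_doc_prompt(question: str, grouped: Dict[str, List[str]]) -> str:
    """Render one context section per file and ask for a matching heading per file."""
    sections = []
    for filename, texts in grouped.items():
        excerpts = "\n".join(f"- {t.strip()}" for t in texts)
        sections.append(f"Document: {filename}\n{excerpts}")
    # Instruction wording is illustrative, not the project's actual prompt.
    return (
        "Answer in Markdown. For each document that contains relevant context, "
        "write a separate section headed '### From <filename>'.\n\n"
        + "\n\n".join(sections)
        + f"\n\nQuestion: {question.strip()}\n"
    )
```

Because Python dictionaries preserve insertion order, documents appear in the prompt in the order their best-ranked chunk was retrieved.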

Architecture

                      +----------------------+
                      |   data/jsons         |
                      |   data/csvs          |
                      |   data/pdfs          |
                      +----------+-----------+
                                 |
                                 v
                      +----------------------+
                      |  load_all_data.py    |
                      | (json/csv/pdf -> txt)|
                      +----------+-----------+
                                 |
                                 v
                      +-----------------------+
                      |  embed_and_index.py   |
                      |  - chunk_text_hybrid  |
                      |  - all-mpnet-base-v2  |
                      |  - upsert into Chroma |
                      +----------+------------+
                                 |
                                 v
                      +-----------------------+
                      |   index/chroma        |
                      |  (persistent vectors) |
                      +----------+------------+
                                 |
   browser <-- stream tokens --  |  --> retriever (top_k=3, LlamaIndex)
        ^                        |
        |                  +-----+--------+
        |                  |   app.py     |
        |                  |  Flask       |
        |                  |  /query      |
        |                  |  /upload_... |
        |                  +------+-------+
        |                         |
        |          guard          v
        |     +---------------------------+
        |     |  is_legal_query           |
        |     |  (LLM classifier prompt)  |
        |     +-------------+-------------+
        |                   |
        |                   v
        |          +-----------------+
        +----------+ llm_client.py   |
                   | SambaNova       |
                   | Llama-4-Maverick|
                   +-----------------+

Tech Stack

Layer	Component	Where it lives
Web framework	Flask 3.1	`app.py`
Templating / UI	Jinja2 + Tailwind (CDN) + marked.js	`templates/index.html`
LLM	SambaNova Llama-4-Maverick-17B-128E-Instruct	`llm_client.py:12`
LLM client	`openai` SDK pointed at `api.sambanova.ai/v1`	`llm_client.py:7-10`
Embedding model	`sentence-transformers/all-mpnet-base-v2`	`embed_and_index.py:12`, `index_loader.py:13`
Vector store	ChromaDB (persistent client)	`embed_and_index.py:16`, `index_loader.py:20`
Retrieval	LlamaIndex `VectorStoreIndex` with `similarity_top_k=3`	`index_loader.py:27-28`
Chunking	Sliding window over NLTK sentences, 500-word chunks with 100-word overlap	`embed_and_index.py:21-49`
Document parsing	PyMuPDF, python-docx, pdfplumber, pandas	`app.py:61-73`, `load_all_data.py`
Config	`python-dotenv`	`app.py:16`, `llm_client.py:5`

Project Structure

LEGAL-AI-LLM/
├── app.py                  # Flask routes: /, /query, /upload_and_query
├── embed_and_index.py      # Chunker, embedder, ingestion entry point
├── index_loader.py         # LlamaIndex retriever bound to persistent Chroma
├── llm_client.py           # SambaNova streaming chat client
├── load_all_data.py        # JSON / CSV / PDF loaders
├── requirements.txt        # Pinned dependencies
├── debug.py                # Empty scratchpad
├── templates/
│   ├── index.html          # Chat UI (Tailwind + vanilla JS)
│   └── static/
│       └── styles.css
├── data/                   # not committed; populate before indexing
│   ├── jsons/
│   ├── csvs/
│   └── pdfs/
├── index/                  # not committed; built by embed_and_index.py
│   └── chroma/
├── .env.example            # template; copy to .env and fill in
├── .gitignore
└── README.md

Prerequisites

Python 3.10 or newer (the project was developed in a Python 3.11 virtual environment, so 3.11 is known to work)
A SambaNova Cloud account and API key
Roughly 1.5 GB of free disk space for the Python environment plus the model weights downloaded by sentence-transformers on first run
An internet connection on first launch (to download the embedding model from Hugging Face) and on every query (to call SambaNova)

Installation

# 1. Clone the repo
git clone https://github.com/AYON-ARYAN/LEGAL-AI-LLM.git
cd LEGAL-AI-LLM

# 2. Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate            # on Windows: .venv\Scripts\activate

# 3. Install dependencies
pip install --upgrade pip
pip install -r requirements.txt

# 4. Configure the API key
cp .env.example .env
# then edit .env and paste your real SambaNova key

The first time you run the embedder or retriever, sentence-transformers will download the all-mpnet-base-v2 weights (~420 MB) into your Hugging Face cache. NLTK will also download punkt on first run (embed_and_index.py:9).

Data and Index Setup

The data/ and index/chroma/ directories are deliberately excluded from this repository because they are large (hundreds of megabytes on the source machine) and easily regenerable. To rebuild them locally:

Place your source documents into the appropriate subfolders:
- data/jsons/ for JSON arrays of records
- data/csvs/ for CSV files (only the first 100 are loaded; see load_all_data.py:6)
- data/pdfs/ for raw PDFs

Optionally, sanity-check the loaders:

python load_all_data.py
# prints "Loaded N documents."

Build the persistent vector index:
```
python embed_and_index.py
```
This runs the three loaders, chunks every document, encodes the chunks in batches of 96, and upserts them into index/chroma/ (embed_and_index.py:95-104). On a machine without a GPU, expect indexing to take several minutes for every few hundred documents.
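
The batching in that step can be pictured with the sketch below. iter_batches is a hypothetical helper and the ID format is an assumption; only the batch size of 96 and the source_file and chunk_id metadata keys come from the description above. In the real script each batch would presumably be passed to the embedding model's encode call and then to the collection's upsert.

```python
from typing import Iterator, List, Sequence, Tuple

Batch = Tuple[List[str], List[str], List[dict]]


def iter_batches(chunks: Sequence[str], source_file: str, batch_size: int = 96) -> Iterator[Batch]:
    """Yield (ids, texts, metadatas) triples of at most batch_size chunks each."""
    for start in range(0, len(chunks), batch_size):
        texts = list(chunks[start:start + batch_size])
        # Assumed ID scheme: "<source_file>-<chunk index>"; the real one may differ.
        ids = [f"{source_file}-{start + i}" for i in range(len(texts))]
        metadatas = [
            {"source_file": source_file, "chunk_id": start + i}
            for i in range(len(texts))
        ]
        yield ids, texts, metadatas
```

Deterministic IDs matter for an upsert-based index: re-running the ingestion over the same files overwrites existing entries instead of duplicating them.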

If you only intend to use the upload-and-query flow, you can skip the offline indexing step entirely; that path builds an in-memory Chroma collection per request and does not read index/chroma/ (app.py:145-156).

Usage

Start the Flask server:

python app.py

The app listens on port 5002 with the reloader disabled (app.py:198), so open http://127.0.0.1:5002 in your browser.

Asking a corpus question

Type a legal question in the input box.
The browser POSTs to /query (app.py:80).
The guard classifier runs first; if it returns NON-LEGAL, the UI shows the guard error message.
Otherwise the retriever pulls the top three chunks from index/chroma, the LLM is prompted with them, and tokens stream back into the chat bubble.

Uploading and asking

Attach one or more .pdf, .docx, or .txt files.
Type a question.
The browser POSTs to /upload_and_query (app.py:100).
Files are extracted, chunked, embedded into a per-request Chroma collection, and the top 15 chunks are retrieved and grouped by source filename.
The model returns a Markdown response with one ### From <filename> heading per document that contributed relevant context.

Configuration

Variable	Required	Default	Description
`SAMBA_API_KEY`	yes	(none)	SambaNova Cloud API key, read in `llm_client.py:8`

Other values are currently hard-coded in source and can be tuned by editing the file:

Setting	File:line	Default
Embedding model	`embed_and_index.py:12`	`all-mpnet-base-v2`
Chroma path	`embed_and_index.py:16`	`index/chroma`
Chunk size	`embed_and_index.py:21`	500 words (whitespace-split)
Chunk overlap	`embed_and_index.py:21`	100 words (whitespace-split)
Retriever top_k	`index_loader.py:28`	3
Upload top_k	`app.py:161`	15
LLM model name	`llm_client.py:12`	`Llama-4-Maverick-17B-128E-Instruct`
Temperature	`llm_client.py:34`	0.1
Top-p	`llm_client.py:35`	0.1
Max tokens	`llm_client.py:36`	1500
Flask port	`app.py:198`	5002
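
For readers who would rather not edit source files, the values above could be gathered in one place, as in the sketch below. This is not how the project is organised today: Settings and load_settings are hypothetical, and RETRIEVER_TOP_K and PORT are invented override names. Only SAMBA_API_KEY and the default values are taken from the tables above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    samba_api_key: str
    chunk_size_words: int = 500
    chunk_overlap_words: int = 100
    retriever_top_k: int = 3
    upload_top_k: int = 15
    temperature: float = 0.1
    top_p: float = 0.1
    max_tokens: int = 1500
    port: int = 5002


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the required key and two optional (hypothetical) overrides from env."""
    key = env.get("SAMBA_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "SAMBA_API_KEY is not set; copy .env.example to .env and fill it in."
        )
    return Settings(
        samba_api_key=key,
        retriever_top_k=int(env.get("RETRIEVER_TOP_K", 3)),  # hypothetical variable
        port=int(env.get("PORT", 5002)),                      # hypothetical variable
    )
```

If python-dotenv's load_dotenv() has already run, the values from .env are visible through os.environ, so load_settings() can be called without arguments.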

Guard Rails

Both /query and /upload_and_query invoke is_legal_query (app.py:20-45) before doing any retrieval or final generation. The function builds a short classifier prompt:

Analyze the following text. Is it related to legal matters, proceedings, or documents?
Respond with only the single word 'LEGAL' or 'NON-LEGAL'.

It feeds the user's query (and, in upload mode, a 500-character snippet from the first uploaded document) to the same Llama-4-Maverick endpoint, drains the streamed response, and checks whether the string LEGAL appears in the uppercased result. On a non-legal verdict the route returns HTTP 422 with is_guard_error: true so the frontend can render a distinct error bubble (app.py:86-90, app.py:137-140).

If the classifier itself raises an exception, the function fails open and treats the query as legal (app.py:43-45). This is a deliberate trade-off: a transient SambaNova outage should not silently block all traffic, but it does mean the guard cannot be relied on for hostile inputs while the LLM is unreachable.
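
One detail worth spelling out: the string NON-LEGAL itself contains LEGAL, so a plain substring test cannot tell the two verdicts apart. The sketch below shows one way to parse the verdict that checks the negative case first, together with the fail-open behaviour described above. parse_verdict and is_legal are hypothetical names, and whether app.py handles the verdict this way is not shown in this README.

```python
import re
from typing import Callable, Iterable


def parse_verdict(raw: str) -> bool:
    """Return True for a LEGAL verdict, False for NON-LEGAL or anything else."""
    text = raw.upper()
    # Check the negative verdict first, because "NON-LEGAL" contains "LEGAL".
    if re.search(r"\bNON[-\s]?LEGAL\b", text):
        return False
    return re.search(r"\bLEGAL\b", text) is not None


def is_legal(text: str, classify: Callable[[str], Iterable[str]]) -> bool:
    """Drain the streamed classifier output and parse it; fail open on errors.

    `classify` stands in for the streaming LLM call and yields text fragments.
    """
    try:
        raw = "".join(classify(text))
    except Exception:
        # Fail open: an unreachable classifier must not block all traffic.
        return True
    return parse_verdict(raw)
```

Switching to a fail-closed guard would only require returning False in the except branch.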

Limitations and Future Work

Not legal advice. This is an academic project. Retrieved-and-summarised text is not a substitute for a licensed lawyer, and the model can still hallucinate even when grounded.
Single-collection index. The persistent index lives in one ChromaDB collection (legal_docs); there is no multi-tenant separation.
No auth. The Flask app exposes both routes openly on localhost. There is no login, rate limiting, or upload size cap beyond Flask's defaults.
Static top_k. Retrieval breadth is hard-coded; a follow-up could expose it as a query parameter and add a re-ranker.
Guard fails open. As noted above, classifier errors are treated as legal queries. A stricter mode would default to deny.
No citations in answer text. The corpus-mode prompt numbers chunks but does not ask the model to cite their numeric IDs, so the user cannot trace a sentence to a specific chunk. The upload mode partly addresses this with per-document headings.
PDF extraction is naive. Both pdfplumber (offline) and PyMuPDF (upload) lose table structure and footnote ordering on complex filings.
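Chunking is word-count based. chunk_text_hybrid counts whitespace-split words (embed_and_index.py:31), which under-counts relative to sub-word tokenizers: a 500-word chunk usually maps to more than 500 model tokens. all-mpnet-base-v2 truncates input beyond its maximum sequence length (384 word pieces by default), so the tail of a long chunk may not be reflected in its embedding. A future revision could size chunks with the embedding model's own tokenizer.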
Possible upgrades: hybrid BM25 + dense retrieval, a re-ranker (Cohere Rerank or a cross-encoder), per-jurisdiction collections, conversational memory across turns, and a Postgres-backed history store instead of the in-browser sidebar.

License

Released under the MIT License. The full text follows.

MIT License

Copyright (c) 2026 AYON-ARYAN

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
