A Retrieval-Augmented Generation (RAG) legal assistant built with Flask, SambaNova's Llama-4-Maverick model, sentence-transformers embeddings, and ChromaDB. Users can either ask legal questions against a pre-indexed corpus of legal documents or upload their own PDFs, DOCX, or TXT files and have the system answer questions grounded in those uploads.
This project answers legal questions using Retrieval-Augmented Generation. Instead of relying solely on the parametric memory of a large language model, the system first retrieves relevant passages from a vector index of legal text, then conditions the LLM on those passages when generating its response. The two query modes are exposed through a Flask web UI: the "ask a question" mode hits the persistent corpus index, while the "upload and query" mode builds a transient per-request index from the uploaded files.
RAG is a sensible fit for legal assistance because legal answers depend heavily on the exact wording of statutes, case law, contract clauses, or filings. A general-purpose LLM hallucinates plausible-sounding citations and clauses; grounding the model in retrieved chunks reduces that risk and lets the answer cite document-level provenance. The project also implements a guard rail (is_legal_query in app.py:20) that uses the same LLM as a binary classifier to reject non-legal queries before any retrieval or final generation runs.
The scope is intentionally narrow. This is a student project demonstrating an end-to-end RAG pipeline: ingestion of mixed-format legal documents (JSON, CSV, PDF), sentence-aware sliding-window chunking, dense embedding with all-mpnet-base-v2, persistent vector storage in ChromaDB, retrieval through LlamaIndex, and streaming generation from a hosted Llama-4 model. It is not a production legal advice service and the README's Limitations section makes that clear.
- Two query paths: corpus-grounded Q&A and per-upload Q&A
- Multi-file upload with grouped, source-attributed answers (`app.py:168-177`)
- Sentence-aware sliding-window chunking with token overlap (`embed_and_index.py:21-49`)
- Persistent ChromaDB vector store at `index/chroma` (`embed_and_index.py:16`, `index_loader.py:12`)
- Streaming token output to the browser (`app.py:94-97`, `llm_client.py:39-41`)
- LLM-as-classifier guard rail blocking non-legal queries (`app.py:20-45`)
- Markdown-formatted, structured responses with per-document headings on multi-file queries (`app.py:179-190`)
- Loaders for JSON, CSV, and PDF source data (`load_all_data.py`)
- Configuration via a single `.env` file containing the SambaNova API key
The pipeline has two phases: a one-time offline ingestion phase and a per-request online query phase.
Offline ingestion:

- `load_all_data.py` walks `data/jsons`, `data/csvs`, and `data/pdfs` and returns a list of `(filename, raw_text)` tuples. JSON arrays are flattened to `key: value` lines, CSVs are stringified through pandas, and PDFs are extracted with pdfplumber.
- `embed_and_index.py` consumes that list. Each document is split into sentences with NLTK's `sent_tokenize`, then accumulated into chunks of up to `max_tokens=500` words with `overlap=100` words between adjacent chunks (`embed_and_index.py:21`). Chunks shorter than 30 characters are skipped (`embed_and_index.py:63`).
- Chunks are encoded in batches of 96 with `sentence-transformers/all-mpnet-base-v2` (`embed_and_index.py:53`, `embed_and_index.py:78-89`) and upserted into the `legal_docs` ChromaDB collection along with `source_file` and `chunk_id` metadata.
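
The chunking step above can be illustrated with a minimal sketch. It is not the project's `chunk_text_hybrid`; the names `split_sentences` and `chunk_sentences` are hypothetical, and a naive regex stands in for NLTK's `sent_tokenize` so the example runs without downloading `punkt`. It assumes that "overlap" means carrying the last words of the previous chunk into the next one.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive stand-in for NLTK's sent_tokenize: split after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(text: str, max_words: int = 500, overlap: int = 100) -> List[str]:
    """Pack whole sentences into chunks of at most max_words whitespace-split words.

    A single sentence longer than max_words becomes its own oversized chunk.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("need 0 <= overlap < max_words")
    chunks: List[str] = []
    current: List[str] = []  # words of the chunk being built
    fresh = 0                # words added since the last flush, overlap excluded
    for sentence in split_sentences(text):
        words = sentence.split()
        if fresh and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            # Carry a tail of the flushed chunk, trimmed so the next chunk still fits.
            keep = min(overlap, max(0, max_words - len(words)))
            current = current[-keep:] if keep else []
            fresh = 0
        current.extend(words)
        fresh += len(words)
    if fresh:
        chunks.append(" ".join(current))
    return chunks
```

Calling `chunk_sentences(text, max_words=6, overlap=2)` on three three-word sentences yields two chunks, the second beginning with the last two words of the first. Counting words rather than model tokens keeps the sketch dependency-free, at the cost of the under-counting noted in the limitations.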
Online query, corpus mode:

- Browser POSTs the query to `/query` (`app.py:80`).
- `is_legal_query` runs the same Llama-4 model as a one-shot `LEGAL`/`NON-LEGAL` classifier on the user's text (`app.py:28-45`). If the verdict is `NON-LEGAL`, the route returns a 422 with a guard error JSON (`app.py:86-90`).
- `index_loader.retriever` (a LlamaIndex `VectorStoreIndex.as_retriever(similarity_top_k=3)` over the same Chroma collection, `index_loader.py:27-28`) returns the top three chunks for the query.
- `build_prompt` joins those chunks into a numbered context block and wraps them in a system instruction telling the model to answer in Markdown (`app.py:48-59`).
- `run_llm_query` opens a streaming chat completion against the SambaNova endpoint and yields tokens as they arrive (`llm_client.py:16-43`); Flask streams them to the browser through `stream_with_context` (`app.py:94-97`).
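
As a rough illustration of the prompt-assembly step, the sketch below numbers the retrieved chunks and wraps them in an instruction. The function name `build_prompt_sketch` and the exact instruction wording are assumptions made for this example; the real text lives in `app.py:48-59` and may differ.

```python
from typing import List


def build_prompt_sketch(query: str, chunks: List[str]) -> str:
    """Join retrieved chunks into a numbered context block ahead of the question."""
    # Number chunks from 1 so the context reads [1], [2], [3].
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    # Hypothetical instruction wording; only the Markdown requirement comes from the README.
    return (
        "You are a legal assistant. Answer in Markdown, using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
    )
```

With the corpus-mode setting of three retrieved chunks, the context block holds entries `[1]` through `[3]`.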
Online query, upload mode:

- The `/upload_and_query` route accepts a list of files via `request.files.getlist("document")` (`app.py:103`).
- Each accepted file (`.pdf`, `.docx`, or `.txt`) is saved to a tempdir, text-extracted via PyMuPDF, python-docx, or plain read (`app.py:61-73`), and chunked with the same `chunk_text_hybrid` used at ingestion.
- The guard classifier runs against the user's query plus a 500-character snippet from the first uploaded document (`app.py:24-26`, `app.py:136`).
- A fresh in-memory ChromaDB collection (`temp_upload_combined`, `app.py:145-156`) is created, embedded, and queried for `n_results=15` (`app.py:159-163`).
- The 15 retrieved chunks are grouped by `source_document` (`app.py:168-171`) and rendered into a structured prompt that asks the model to produce a separate `### From <filename>` section per relevant document (`app.py:179-190`).
- Tokens stream back to the browser exactly as in corpus mode.
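
The grouping step in upload mode might look roughly like the sketch below. It assumes the retrieved chunk texts and their metadata dicts arrive as two parallel lists, and that each metadata dict carries a `source_document` key as described above. The helper names `group_by_source` and `render_grouped_context` are hypothetical, and the rendered layout is only one plausible way to present the groups to the model.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_source(documents: List[str], metadatas: List[dict]) -> Dict[str, List[str]]:
    """Bucket retrieved chunk texts by the file they came from, keeping retrieval order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, meta in zip(documents, metadatas):
        grouped[meta.get("source_document", "unknown")].append(text)
    return dict(grouped)


def render_grouped_context(grouped: Dict[str, List[str]]) -> str:
    """Render one '### From <filename>' section per source document."""
    sections = []
    for name, chunks in grouped.items():
        body = "\n".join(f"- {chunk}" for chunk in chunks)
        sections.append(f"### From {name}\n{body}")
    return "\n\n".join(sections)
```

Grouping before prompting lets the model see which passages belong together, which is what makes the per-document headings in the answer possible.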
+----------------------+
| data/jsons |
| data/csvs |
| data/pdfs |
+----------+-----------+
|
v
+----------------------+
| load_all_data.py |
| (json/csv/pdf -> txt)|
+----------+-----------+
|
v
+-----------------------+
| embed_and_index.py |
| - chunk_text_hybrid |
| - all-mpnet-base-v2 |
| - upsert into Chroma |
+----------+------------+
|
v
+-----------------------+
| index/chroma |
| (persistent vectors) |
+----------+------------+
|
browser <-- stream tokens -- | --> retriever (top_k=3, LlamaIndex)
^ |
| +-----+--------+
| | app.py |
| | Flask |
| | /query |
| | /upload_... |
| +------+-------+
| |
| guard v
| +---------------------------+
| | is_legal_query |
| | (LLM classifier prompt) |
| +-------------+-------------+
| |
| v
| +-----------------+
+----------+ llm_client.py |
| SambaNova |
| Llama-4-Maverick|
+-----------------+
| Layer | Component | Where it lives |
|---|---|---|
| Web framework | Flask 3.1 | app.py |
| Templating / UI | Jinja2 + Tailwind (CDN) + marked.js | templates/index.html |
| LLM | SambaNova Llama-4-Maverick-17B-128E-Instruct | llm_client.py:12 |
| LLM client | openai SDK pointed at api.sambanova.ai/v1 | llm_client.py:7-10 |
| Embedding model | sentence-transformers/all-mpnet-base-v2 | embed_and_index.py:12, index_loader.py:13 |
| Vector store | ChromaDB (persistent client) | embed_and_index.py:16, index_loader.py:20 |
| Retrieval | LlamaIndex VectorStoreIndex with similarity_top_k=3 | index_loader.py:27-28 |
| Chunking | Sliding-window over NLTK sentences, 500 / 100 overlap | embed_and_index.py:21-49 |
| Document parsing | PyMuPDF, python-docx, pdfplumber, pandas | app.py:61-73, load_all_data.py |
| Config | python-dotenv | app.py:16, llm_client.py:5 |
LEGAL-AI-LLM/
├── app.py # Flask routes: /, /query, /upload_and_query
├── embed_and_index.py # Chunker, embedder, ingestion entry point
├── index_loader.py # LlamaIndex retriever bound to persistent Chroma
├── llm_client.py # SambaNova streaming chat client
├── load_all_data.py # JSON / CSV / PDF loaders
├── requirements.txt # Pinned dependencies
├── debug.py # Empty scratchpad
├── templates/
│ ├── index.html # Chat UI (Tailwind + vanilla JS)
│ └── static/
│ └── styles.css
├── data/ # not committed; populate before indexing
│ ├── jsons/
│ ├── csvs/
│ └── pdfs/
├── index/ # not committed; built by embed_and_index.py
│ └── chroma/
├── .env.example # template; copy to .env and fill in
├── .gitignore
└── README.md
- Python 3.10 or newer (the source machine used a `venv311`, so 3.11 is known good)
- A SambaNova Cloud account and API key
- Roughly 1.5 GB of free disk for the Python environment plus model weights downloaded by sentence-transformers on first run
- An internet connection on first launch (to download the embedding model from Hugging Face) and on every query (to call SambaNova)
# 1. Clone the repo
git clone https://github.com/AYON-ARYAN/LEGAL-AI-LLM.git
cd LEGAL-AI-LLM
# 2. Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate # on Windows: .venv\Scripts\activate
# 3. Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
# 4. Configure the API key
cp .env.example .env
# then edit .env and paste your real SambaNova key

The first time you run the embedder or retriever, sentence-transformers will download the all-mpnet-base-v2 weights (~420 MB) into your Hugging Face cache. NLTK will also download punkt on first run (embed_and_index.py:9).
The data/ and index/chroma/ directories are deliberately excluded from this repository because they are large (hundreds of megabytes on the source machine) and easily regenerable. To rebuild them locally:
- Place your source documents into the appropriate subfolders:
  - `data/jsons/` for JSON arrays of records
  - `data/csvs/` for CSV files (only the first 100 are loaded; see `load_all_data.py:6`)
  - `data/pdfs/` for raw PDFs
- Optionally, sanity-check the loaders by running `python load_all_data.py`, which prints "Loaded N documents."
- Build the persistent vector index by running `python embed_and_index.py`. This iterates the three loaders, chunks every document, encodes the chunks in batches of 96, and upserts them into `index/chroma/` (`embed_and_index.py:95-104`). On a machine without a GPU, expect this to take several minutes per few hundred documents.
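
The batched encoding mentioned in the last step can be pictured with a small helper. The sketch below only shows the slicing; the helper name `batched` is hypothetical, and the actual embedding and upsert calls in `embed_and_index.py` are not reproduced here.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int = 96) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

Each yielded batch would then be passed to the embedding model and upserted together, which bounds memory use while still amortising per-call overhead.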
If you only intend to use the upload-and-query flow, you can skip the offline indexing step entirely; that path builds an in-memory Chroma collection per request and does not read index/chroma/ (app.py:145-156).
Start the Flask server:
python app.py

The app listens on port 5002 with the reloader disabled (app.py:198), so open http://127.0.0.1:5002 in your browser.
To query the pre-indexed corpus:

- Type a legal question in the input box.
- The browser POSTs to `/query` (`app.py:80`).
- The guard classifier runs first; if it returns `NON-LEGAL`, the UI shows the guard error message.
- Otherwise the retriever pulls the top three chunks from `index/chroma`, the LLM is prompted with them, and tokens stream back into the chat bubble.
To query your own uploads:

- Attach one or more `.pdf`, `.docx`, or `.txt` files.
- Type a question.
- The browser POSTs to `/upload_and_query` (`app.py:100`).
- Files are extracted, chunked, embedded into a per-request Chroma collection, and the top 15 chunks are retrieved and grouped by source filename.
- The model returns a Markdown response with one `### From <filename>` heading per document that contributed relevant context.
| Variable | Required | Default | Description |
|---|---|---|---|
| SAMBA_API_KEY | yes | (none) | SambaNova Cloud API key, read in llm_client.py:8 |
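
Because `SAMBA_API_KEY` has no default, failing early with a clear message is friendlier than a rejected request at query time. The sketch below is one way to do that, not what `llm_client.py` does; the helper name `require_api_key` is hypothetical. It takes any mapping so it can be exercised without touching the real environment, and it assumes python-dotenv has already loaded `.env` into `os.environ`.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None, name: str = "SAMBA_API_KEY") -> str:
    """Return the API key from the environment, or raise if it is missing or blank."""
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; copy .env.example to .env and fill it in")
    return value
```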
Other values are currently hard-coded in source and can be tuned by editing the file:
| Setting | File:line | Default |
|---|---|---|
| Embedding model | embed_and_index.py:12 | all-mpnet-base-v2 |
| Chroma path | embed_and_index.py:16 | index/chroma |
| Chunk size | embed_and_index.py:21 | 500 tokens |
| Chunk overlap | embed_and_index.py:21 | 100 tokens |
| Retriever top_k | index_loader.py:28 | 3 |
| Upload top_k | app.py:161 | 15 |
| LLM model name | llm_client.py:12 | Llama-4-Maverick-17B-128E-Instruct |
| Temperature | llm_client.py:34 | 0.1 |
| Top-p | llm_client.py:35 | 0.1 |
| Max tokens | llm_client.py:36 | 1500 |
| Flask port | app.py:198 | 5002 |
Both /query and /upload_and_query invoke is_legal_query (app.py:20-45) before doing any retrieval or final generation. The function builds a short classifier prompt:
Analyze the following text. Is it related to legal matters, proceedings, or documents?
Respond with only the single word 'LEGAL' or 'NON-LEGAL'.
It feeds the user's query (and, in upload mode, a 500-character snippet from the first uploaded document) to the same Llama-4-Maverick endpoint, drains the streamed response, and checks whether the string LEGAL appears in the uppercased result. On a non-legal verdict the route returns HTTP 422 with is_guard_error: true so the frontend can render a distinct error bubble (app.py:86-90, app.py:137-140).
If the classifier itself raises an exception, the function fails open and treats the query as legal (app.py:43-45). This is a deliberate trade-off: a transient SambaNova outage should not silently block all traffic, but it does mean the guard cannot be relied on for hostile inputs while the LLM is unreachable.
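
The guard's control flow, including the fail-open branch, can be sketched as follows. This is an illustration rather than the code in `app.py`: the `classify` callable is a hypothetical stand-in for the streaming LLM call, the `fail_open` flag is an added parameter that shows what a default-deny mode would change, and the way the text is appended to the instruction is assumed. Because the string `NON-LEGAL` itself contains `LEGAL`, the sketch tests for `NON-LEGAL` first; whether `app.py` handles that case the same way is not shown in this README.

```python
from typing import Callable, Iterable

CLASSIFIER_INSTRUCTION = (
    "Analyze the following text. Is it related to legal matters, proceedings, or documents?\n"
    "Respond with only the single word 'LEGAL' or 'NON-LEGAL'."
)


def is_legal(text: str, classify: Callable[[str], Iterable[str]], fail_open: bool = True) -> bool:
    """Return True when the classifier's verdict is LEGAL.

    classify receives the full prompt and yields response tokens.
    """
    prompt = f"{CLASSIFIER_INSTRUCTION}\n\nText: {text}"
    try:
        # Drain the token stream into one string before inspecting it.
        verdict = "".join(classify(prompt)).strip().upper()
    except Exception:
        # Classifier unreachable: allow the query (fail open) or block it (fail closed).
        return fail_open
    if "NON-LEGAL" in verdict:
        return False
    return "LEGAL" in verdict
```

Passing `fail_open=False` turns an unreachable classifier into a rejection, which trades availability for a stricter guard.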
- Not legal advice. This is an academic project. Retrieved-and-summarised text is not a substitute for a licensed lawyer, and the model can still hallucinate even when grounded.
- Single-collection index. The persistent index lives in one ChromaDB collection (`legal_docs`); there is no multi-tenant separation.
- No auth. The Flask app exposes both routes openly on `localhost`. There is no login, rate limiting, or upload size cap beyond Flask's defaults.
- Static `top_k`. Retrieval breadth is hard-coded; a follow-up could expose it as a query parameter and add a re-ranker.
- Guard fails open. As noted above, classifier errors are treated as legal queries. A stricter mode would default to deny.
- No citations in answer text. The corpus-mode prompt numbers chunks but does not ask the model to cite their numeric IDs, so the user cannot trace a sentence to a specific chunk. The upload mode partly addresses this with per-document headings.
- PDF extraction is naive. Both `pdfplumber` (offline) and `PyMuPDF` (upload) lose table structure and footnote ordering on complex filings.
- Chunking is word-count based. `chunk_text_hybrid` counts whitespace-split tokens (`embed_and_index.py:31`), which under-counts relative to a sub-word tokenizer, so a chunk can hold more model tokens than its word count suggests. A future revision could swap in a proper tokenizer.
- Possible upgrades: hybrid BM25 + dense retrieval, a re-ranker (Cohere Rerank or a cross-encoder), per-jurisdiction collections, conversational memory across turns, and a Postgres-backed history store instead of the in-browser sidebar.
MIT License. See below.
MIT License
Copyright (c) 2026 AYON-ARYAN
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.