LEGAL-AI-LLM

A Retrieval-Augmented Generation (RAG) legal assistant built with Flask, SambaNova's Llama-4-Maverick model, sentence-transformers embeddings, and ChromaDB. Users can either ask legal questions against a pre-indexed corpus of legal documents or upload their own PDFs, DOCX, or TXT files and have the system answer questions grounded in those uploads.

Overview

This project answers legal questions using Retrieval-Augmented Generation. Instead of relying solely on the parametric memory of a large language model, the system first retrieves relevant passages from a vector index of legal text, then conditions the LLM on those passages when generating its response. The two query modes are exposed through a Flask web UI: the "ask a question" mode hits the persistent corpus index, while the "upload and query" mode builds a transient per-request index from the uploaded files.

RAG is a sensible fit for legal assistance because legal answers depend heavily on the exact wording of statutes, case law, contract clauses, or filings. A general-purpose LLM can hallucinate plausible-sounding citations and clauses; grounding the model in retrieved chunks reduces that risk and lets the answer cite document-level provenance. The project also implements a guard rail (is_legal_query in app.py:20) that uses the same LLM as a binary classifier to reject non-legal queries before any retrieval or final generation runs.

The scope is intentionally narrow. This is a student project demonstrating an end-to-end RAG pipeline: ingestion of mixed-format legal documents (JSON, CSV, PDF), sentence-aware sliding-window chunking, dense embedding with all-mpnet-base-v2, persistent vector storage in ChromaDB, retrieval through LlamaIndex, and streaming generation from a hosted Llama-4 model. It is not a production legal advice service, as the Limitations and Future Work section below makes clear.

Key Features

  • Two query paths: corpus-grounded Q&A and per-upload Q&A
  • Multi-file upload with grouped, source-attributed answers (app.py:168-177)
  • Sentence-aware sliding-window chunking with token overlap (embed_and_index.py:21-49)
  • Persistent ChromaDB vector store at index/chroma (embed_and_index.py:16, index_loader.py:12)
  • Streaming token output to the browser (app.py:94-97, llm_client.py:39-41)
  • LLM-as-classifier guard rail blocking non-legal queries (app.py:20-45)
  • Markdown-formatted, structured responses with per-document headings on multi-file queries (app.py:179-190)
  • Loaders for JSON, CSV, and PDF source data (load_all_data.py)
  • Configuration via a single .env file containing the SambaNova API key

How It Works

The pipeline has two phases: a one-time offline ingestion phase and a per-request online query phase.

Offline ingestion

  1. load_all_data.py walks data/jsons, data/csvs, and data/pdfs and returns a list of (filename, raw_text) tuples. JSON arrays are flattened to key: value lines, CSVs are stringified through pandas, and PDFs are extracted with pdfplumber.
  2. embed_and_index.py consumes that list. Each document is split into sentences with NLTK's sent_tokenize, then accumulated into chunks of up to max_tokens=500 words with overlap=100 words between adjacent chunks (embed_and_index.py:21). Chunks shorter than 30 characters are skipped (embed_and_index.py:63).
  3. Chunks are encoded in batches of 96 with sentence-transformers/all-mpnet-base-v2 (embed_and_index.py:53, embed_and_index.py:78-89) and upserted into the legal_docs ChromaDB collection along with source_file and chunk_id metadata.
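
The chunking in step 2 can be sketched as follows. This is a minimal illustration, not the project's chunk_text_hybrid: the name chunk_sentences is hypothetical, and a regex splitter stands in for NLTK's sent_tokenize so the block runs without any downloads. Sizes are counted in whitespace-split words.

```python
import re


def chunk_sentences(text, max_words=500, overlap=100):
    """Group sentences into chunks of roughly max_words words.

    Adjacent chunks share up to `overlap` trailing words. A single
    sentence longer than max_words is kept whole, so the limit is
    approximate rather than strict.
    """
    # Stand-in for nltk.sent_tokenize: split after ., ! or ? plus whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        words = sentence.split()
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            # Seed the next chunk with the tail of the one just closed.
            current = current[-overlap:] if overlap > 0 else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because a chunk is closed only at a sentence boundary, no sentence is split across two chunks; the overlap is copied word by word, so a chunk may begin partway through the previous chunk's final sentences.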

Online query (corpus mode, /query)

  1. Browser POSTs the query to /query (app.py:80).
  2. is_legal_query runs the same Llama-4 model as a one-shot LEGAL / NON-LEGAL classifier on the user's text (app.py:28-45). If the verdict is NON-LEGAL, the route returns a 422 with a guard error JSON (app.py:86-90).
  3. index_loader.retriever (a LlamaIndex VectorStoreIndex.as_retriever(similarity_top_k=3) over the same Chroma collection, index_loader.py:27-28) returns the top three chunks for the query.
  4. build_prompt joins those chunks into a numbered context block and wraps them in a system instruction telling the model to answer in Markdown (app.py:48-59).
  5. run_llm_query opens a streaming chat completion against the SambaNova endpoint and yields tokens as they arrive (llm_client.py:16-43); Flask streams them to the browser through stream_with_context (app.py:94-97).
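
Step 4 above can be sketched as below. The function name build_prompt_sketch and the instruction wording are assumptions made for illustration; the actual text in app.py:48-59 may differ.

```python
def build_prompt_sketch(query, chunks):
    """Join retrieved chunks into a numbered context block and wrap it
    in an instruction that asks for a Markdown answer."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "You are a legal assistant. Answer using only the context below "
        "and format the answer in Markdown.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
    )
```

Numbering the chunks costs nothing and would also give the model something to cite if the instruction were later extended to ask for chunk references.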

Online query (upload mode, /upload_and_query)

  1. The route accepts a list of files via request.files.getlist("document") (app.py:103).
  2. Each accepted file (.pdf, .docx, or .txt) is saved to a tempdir, text-extracted via PyMuPDF, python-docx, or plain read (app.py:61-73), and chunked with the same chunk_text_hybrid used at ingestion.
  3. The guard classifier runs against the user's query plus a 500-character snippet from the first uploaded document (app.py:24-26, app.py:136).
  4. A fresh in-memory ChromaDB collection (temp_upload_combined, app.py:145-156) is created and populated with the embedded chunks, then queried with n_results=15 (app.py:159-163).
  5. The 15 retrieved chunks are grouped by source_document (app.py:168-171) and rendered into a structured prompt that asks the model to produce a separate ### From <filename> section per relevant document (app.py:179-190).
  6. Tokens stream back to the browser exactly as in corpus mode.
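
The grouping and prompt construction in step 5 might look roughly like this. The helper names and the prompt wording are hypothetical. The flat lists passed in are assumed to correspond to the first entries of the documents and metadatas lists that a Chroma query returns (one inner list per query text).

```python
from collections import defaultdict


def group_by_source(documents, metadatas):
    """Group retrieved chunk texts by their source_document metadata."""
    grouped = defaultdict(list)
    for text, meta in zip(documents, metadatas):
        grouped[meta.get("source_document", "unknown")].append(text)
    return dict(grouped)


def build_multi_doc_prompt(query, grouped):
    """Render grouped chunks into a prompt asking for per-document sections."""
    sections = [
        f"Document: {name}\n" + "\n".join(f"- {t}" for t in texts)
        for name, texts in grouped.items()
    ]
    return (
        "Answer the question using only the excerpts below. For each "
        "document that contains relevant material, write a separate "
        "section headed '### From <filename>'.\n\n"
        + "\n\n".join(sections)
        + f"\n\nQuestion: {query}\n"
    )
```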

Architecture

                      +----------------------+
                      |   data/jsons         |
                      |   data/csvs          |
                      |   data/pdfs          |
                      +----------+-----------+
                                 |
                                 v
                      +----------------------+
                      |  load_all_data.py    |
                      | (json/csv/pdf -> txt)|
                      +----------+-----------+
                                 |
                                 v
                      +-----------------------+
                      |  embed_and_index.py   |
                      |  - chunk_text_hybrid  |
                      |  - all-mpnet-base-v2  |
                      |  - upsert into Chroma |
                      +----------+------------+
                                 |
                                 v
                      +-----------------------+
                      |   index/chroma        |
                      |  (persistent vectors) |
                      +----------+------------+
                                 |
   browser <-- stream tokens --  |  --> retriever (top_k=3, LlamaIndex)
        ^                        |
        |                  +-----+--------+
        |                  |   app.py     |
        |                  |  Flask       |
        |                  |  /query      |
        |                  |  /upload_... |
        |                  +------+-------+
        |                         |
        |          guard          v
        |     +---------------------------+
        |     |  is_legal_query           |
        |     |  (LLM classifier prompt)  |
        |     +-------------+-------------+
        |                   |
        |                   v
        |          +-----------------+
        +----------+ llm_client.py   |
                   | SambaNova       |
                   | Llama-4-Maverick|
                   +-----------------+

Tech Stack

| Layer | Component | Where it lives |
| --- | --- | --- |
| Web framework | Flask 3.1 | app.py |
| Templating / UI | Jinja2 + Tailwind (CDN) + marked.js | templates/index.html |
| LLM | SambaNova Llama-4-Maverick-17B-128E-Instruct | llm_client.py:12 |
| LLM client | openai SDK pointed at api.sambanova.ai/v1 | llm_client.py:7-10 |
| Embedding model | sentence-transformers/all-mpnet-base-v2 | embed_and_index.py:12, index_loader.py:13 |
| Vector store | ChromaDB (persistent client) | embed_and_index.py:16, index_loader.py:20 |
| Retrieval | LlamaIndex VectorStoreIndex with similarity_top_k=3 | index_loader.py:27-28 |
| Chunking | Sliding-window over NLTK sentences, 500 / 100 overlap | embed_and_index.py:21-49 |
| Document parsing | PyMuPDF, python-docx, pdfplumber, pandas | app.py:61-73, load_all_data.py |
| Config | python-dotenv | app.py:16, llm_client.py:5 |

Project Structure

LEGAL-AI-LLM/
├── app.py                  # Flask routes: /, /query, /upload_and_query
├── embed_and_index.py      # Chunker, embedder, ingestion entry point
├── index_loader.py         # LlamaIndex retriever bound to persistent Chroma
├── llm_client.py           # SambaNova streaming chat client
├── load_all_data.py        # JSON / CSV / PDF loaders
├── requirements.txt        # Pinned dependencies
├── debug.py                # Empty scratchpad
├── templates/
│   ├── index.html          # Chat UI (Tailwind + vanilla JS)
│   └── static/
│       └── styles.css
├── data/                   # not committed; populate before indexing
│   ├── jsons/
│   ├── csvs/
│   └── pdfs/
├── index/                  # not committed; built by embed_and_index.py
│   └── chroma/
├── .env.example            # template; copy to .env and fill in
├── .gitignore
└── README.md

Prerequisites

  • Python 3.10 or newer (the development machine used a virtual environment named venv311, so 3.11 is known to work)
  • A SambaNova Cloud account and API key
  • Roughly 1.5 GB of free disk space for the Python environment plus the model weights that sentence-transformers downloads on first run
  • An internet connection on first launch (to download the embedding model from Hugging Face) and on every query (to call SambaNova)

Installation

# 1. Clone the repo
git clone https://github.com/AYON-ARYAN/LEGAL-AI-LLM.git
cd LEGAL-AI-LLM

# 2. Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate            # on Windows: .venv\Scripts\activate

# 3. Install dependencies
pip install --upgrade pip
pip install -r requirements.txt

# 4. Configure the API key
cp .env.example .env
# then edit .env and paste your real SambaNova key

The first time you run the embedder or retriever, sentence-transformers will download the all-mpnet-base-v2 weights (~420 MB) into your Hugging Face cache. NLTK will also download punkt on first run (embed_and_index.py:9).

Data and Index Setup

The data/ and index/chroma/ directories are deliberately excluded from this repository because they are large (hundreds of megabytes on the development machine) and easy to regenerate. To rebuild them locally:

  1. Place your source documents into the appropriate subfolders:

    • data/jsons/ for JSON arrays of records
    • data/csvs/ for CSV files (only the first 100 are loaded; see load_all_data.py:6)
    • data/pdfs/ for raw PDFs
  2. Optionally, sanity-check the loaders:

    python load_all_data.py
    # prints "Loaded N documents."
  3. Build the persistent vector index:

    python embed_and_index.py

    This iterates the three loaders, chunks every document, encodes the chunks in batches of 96, and upserts them into index/chroma/ (embed_and_index.py:95-104). On a machine without a GPU, expect indexing to take several minutes for every few hundred documents.
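
The batched encode-and-upsert loop can be sketched as below. The function name index_in_batches and the record layout are hypothetical. The encode and upsert callables are injected so the block runs on its own; in the project they would presumably be the sentence-transformers model's encode method and the Chroma collection's upsert method, and the embeddings may need converting with .tolist() depending on the ChromaDB version.

```python
def index_in_batches(records, encode, upsert, batch_size=96):
    """Embed and store chunk records in fixed-size batches.

    records: list of (chunk_id, source_file, text) tuples (assumed layout).
    encode:  callable mapping a list of texts to a list of vectors.
    upsert:  callable accepting ids, documents, embeddings, metadatas.
    Returns the number of records written.
    """
    total = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        texts = [text for _, _, text in batch]
        upsert(
            ids=[chunk_id for chunk_id, _, _ in batch],
            documents=texts,
            embeddings=encode(texts),
            metadatas=[
                {"source_file": source, "chunk_id": chunk_id}
                for chunk_id, source, _ in batch
            ],
        )
        total += len(batch)
    return total
```

Using upsert rather than add means re-running the script with stable chunk IDs overwrites existing entries instead of failing on duplicates.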

If you only intend to use the upload-and-query flow, you can skip the offline indexing step entirely; that path builds an in-memory Chroma collection per request and does not read index/chroma/ (app.py:145-156).

Usage

Start the Flask server:

python app.py

The app listens on port 5002 with the reloader disabled (app.py:198), so open http://127.0.0.1:5002 in your browser.

Asking a corpus question

  1. Type a legal question in the input box.
  2. The browser POSTs to /query (app.py:80).
  3. The guard classifier runs first; if it returns NON-LEGAL, the UI shows the guard error message.
  4. Otherwise the retriever pulls the top three chunks from index/chroma, the LLM is prompted with them, and tokens stream back into the chat bubble.

Uploading and asking

  1. Attach one or more .pdf, .docx, or .txt files.
  2. Type a question.
  3. The browser POSTs to /upload_and_query (app.py:100).
  4. Files are extracted, chunked, embedded into a per-request Chroma collection, and the top 15 chunks are retrieved and grouped by source filename.
  5. The model returns a Markdown response with one ### From <filename> heading per document that contributed relevant context.
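
The file-type filter implied by step 1 could be as small as the sketch below. The helper name is_supported_upload is hypothetical, and matching extensions case-insensitively is an assumption rather than confirmed behaviour of app.py.

```python
from pathlib import Path

# Extensions the upload route is documented to accept.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt"}


def is_supported_upload(filename):
    """Return True when the filename ends in an accepted extension."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS
```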

Configuration

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| SAMBA_API_KEY | yes | (none) | SambaNova Cloud API key, read in llm_client.py:8 |
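
A minimal sketch of reading the key with an explicit failure message, assuming python-dotenv's load_dotenv() has already copied .env into the process environment. The helper name load_api_key is hypothetical; llm_client.py may read the variable differently.

```python
import os


def load_api_key(env=os.environ):
    """Return the SambaNova API key or raise a descriptive error."""
    key = env.get("SAMBA_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "SAMBA_API_KEY is not set; copy .env.example to .env and add your key."
        )
    return key
```

Failing at startup with a clear message is usually easier to diagnose than an authentication error surfacing on the first query.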

Other values are currently hard-coded in source and can be tuned by editing the file:

| Setting | File:line | Default |
| --- | --- | --- |
| Embedding model | embed_and_index.py:12 | all-mpnet-base-v2 |
| Chroma path | embed_and_index.py:16 | index/chroma |
| Chunk size | embed_and_index.py:21 | 500 words (max_tokens) |
| Chunk overlap | embed_and_index.py:21 | 100 words (overlap) |
| Retriever top_k | index_loader.py:28 | 3 |
| Upload top_k | app.py:161 | 15 |
| LLM model name | llm_client.py:12 | Llama-4-Maverick-17B-128E-Instruct |
| Temperature | llm_client.py:34 | 0.1 |
| Top-p | llm_client.py:35 | 0.1 |
| Max tokens | llm_client.py:36 | 1500 |
| Flask port | app.py:198 | 5002 |
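
If editing source files becomes inconvenient, one option is to gather these defaults in a single object with environment overrides. The sketch below is hypothetical and not part of the project: the Settings class, the settings_from_env helper, and the LEGAL_AI_ variable prefix are all invented for illustration.

```python
import os
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Settings:
    # Defaults mirror the values listed in the table above.
    embedding_model: str = "all-mpnet-base-v2"
    chroma_path: str = "index/chroma"
    chunk_size: int = 500
    chunk_overlap: int = 100
    retriever_top_k: int = 3
    upload_top_k: int = 15
    temperature: float = 0.1
    top_p: float = 0.1
    max_tokens: int = 1500
    port: int = 5002


def settings_from_env(env=os.environ):
    """Build Settings, overriding any field that has a LEGAL_AI_<FIELD> variable."""
    overrides = {}
    for name, default in asdict(Settings()).items():
        raw = env.get(f"LEGAL_AI_{name.upper()}")
        if raw is not None:
            # Cast the string to the type of the default (str, int or float).
            overrides[name] = type(default)(raw)
    return Settings(**overrides)
```

For example, setting LEGAL_AI_RETRIEVER_TOP_K=5 before launch would widen corpus retrieval without touching index_loader.py, provided the code were changed to read from this object.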

Guard Rails

Both /query and /upload_and_query invoke is_legal_query (app.py:20-45) before doing any retrieval or final generation. The function builds a short classifier prompt:

Analyze the following text. Is it related to legal matters, proceedings, or documents?
Respond with only the single word 'LEGAL' or 'NON-LEGAL'.

It feeds the user's query (and, in upload mode, a 500-character snippet from the first uploaded document) to the same Llama-4-Maverick endpoint, drains the streamed response, and checks whether the string LEGAL appears in the uppercased result. On a non-legal verdict the route returns HTTP 422 with is_guard_error: true so the frontend can render a distinct error bubble (app.py:86-90, app.py:137-140).
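
One detail is worth care when parsing the verdict: the string NON-LEGAL contains LEGAL, so a bare substring test on its own would accept both labels. How app.py handles this is not shown here. The sketch below, with hypothetical helper names and an assumed excerpt label, checks the negative label first.

```python
def build_guard_input(query, first_doc_text=None, snippet_chars=500):
    """Combine the query with a short excerpt of the first upload, if any."""
    if first_doc_text:
        return f"{query}\n\nDocument excerpt:\n{first_doc_text[:snippet_chars]}"
    return query


def parse_verdict(token_stream):
    """Drain a stream of text tokens and return True only for a LEGAL verdict."""
    verdict = "".join(token_stream).strip().upper()
    # Check the negative label first: "LEGAL" is a substring of "NON-LEGAL".
    if "NON-LEGAL" in verdict or "NON LEGAL" in verdict:
        return False
    return "LEGAL" in verdict
```

Any output that matches neither label is treated as non-legal here, which is a stricter choice than the project may make.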

If the classifier itself raises an exception, the function fails open and treats the query as legal (app.py:43-45). This is a deliberate trade-off: a transient SambaNova outage should not silently block all traffic, but it does mean the guard cannot be relied on for hostile inputs while the LLM is unreachable.
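
The fail-open behaviour, and the stricter alternative, can be expressed as a single switch. This is a sketch with a hypothetical wrapper name; classify stands for any callable that returns a truthy value for legal queries and may raise on network errors.

```python
def guarded_is_legal(classify, query, fail_open=True):
    """Run the classifier, falling back to `fail_open` if it raises.

    fail_open=True  treats the query as legal when the classifier errors.
    fail_open=False rejects the query instead (default-deny).
    """
    try:
        return bool(classify(query))
    except Exception:
        # Broad catch on purpose: any classifier failure takes the fallback path.
        return fail_open
```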

Limitations and Future Work

  • Not legal advice. This is an academic project. Retrieved-and-summarised text is not a substitute for a licensed lawyer, and the model can still hallucinate even when grounded.
  • Single-collection index. The persistent index lives in one ChromaDB collection (legal_docs); there is no multi-tenant separation.
  • No auth. The Flask app exposes both routes openly on localhost. There is no login, no rate limiting, and no upload size cap (Flask leaves MAX_CONTENT_LENGTH unset by default).
  • Static top_k. Retrieval breadth is hard-coded; a follow-up could expose it as a query parameter and add a re-ranker.
  • Guard fails open. As noted above, classifier errors are treated as legal queries. A stricter mode would default to deny.
  • No citations in answer text. The corpus-mode prompt numbers chunks but does not ask the model to cite their numeric IDs, so the user cannot trace a sentence to a specific chunk. The upload mode partly addresses this with per-document headings.
  • PDF extraction is naive. Both pdfplumber (offline) and PyMuPDF (upload) lose table structure and footnote ordering on complex filings.
  • Chunking is word-count based. chunk_text_hybrid counts whitespace-split words (embed_and_index.py:31), which under-counts relative to a sub-word tokenizer, so a 500-word chunk can contain noticeably more than 500 model tokens. A future revision could swap in a proper tokenizer.
  • Possible upgrades: hybrid BM25 + dense retrieval, a re-ranker (Cohere Rerank or a cross-encoder), per-jurisdiction collections, conversational memory across turns, and a Postgres-backed history store instead of the in-browser sidebar.

License

MIT License. See below.

MIT License

Copyright (c) 2026 AYON-ARYAN

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
