AskMyDocs: Building a Plug-and-Play RAG Pipeline

This guide transforms the AskMyDocs project into a modular blueprint for building your own Retrieval-Augmented Generation (RAG) applications using LangChain.

1. Project Goal and Context

1.1 What is AskMyDocs?

  • Goal: To create an intelligent Q&A assistant that can converse with your private or specialized documents (PDFs).
  • Core Technology: Built using Streamlit (UI) and the LangChain framework (pipeline).
  • Benefit: Allows you to upload documents and ask questions in natural language, eliminating the need for manual searching.

1.2 The Problem RAG Solves

  • LLM Limitation: Large Language Models (LLMs) like GPT-4o-mini have a knowledge cutoff and can't answer questions about current, private, or specialized data.
  • RAG Solution: Retrieval-Augmented Generation (RAG) connects the LLM to an external knowledge base (your documents).
  • Result: The LLM's answers are grounded in your specific source material, which curbs hallucination and keeps responses relevant to your data.

2. The Conventional RAG Pipeline Explained

RAG is a two-phase process: Ingestion (data preparation) and Retrieval & Generation (answering the question).


Phase A: Ingestion (Data Preparation)

Load & Clean

  • The pipeline reads raw data (PDFs).
  • Text is cleaned to remove noise like page numbers, headers, and footers.
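
To make this step concrete, here is a minimal sketch using PyPDF2 and the standard-library re module; the repo's extract_pdf_text may use different cleaning rules, and the regexes below are only illustrative.

```python
import re

from PyPDF2 import PdfReader

def extract_pdf_text(pdf_path: str) -> str:
    """Read every page of a PDF and return lightly cleaned plain text."""
    reader = PdfReader(pdf_path)
    raw_text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # Illustrative cleanup: drop standalone page numbers and collapse extra spaces.
    cleaned = re.sub(r"^\s*\d+\s*$", "", raw_text, flags=re.MULTILINE)
    cleaned = re.sub(r"[ \t]+", " ", cleaned)
    return cleaned.strip()
```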

Split (Chunking)

  • Documents are broken down into small, semantically meaningful chunks because LLMs have input size limits (context window).
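
A sketch of the chunking step with LangChain's RecursiveCharacterTextSplitter, using the chunk size and overlap reported in Section 3 (classic langchain import path assumed):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def create_text_chunks(text: str) -> list[str]:
    """Split cleaned text into overlapping chunks that fit within the context window."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=3000,    # characters per chunk (value from Section 3)
        chunk_overlap=200,  # overlap preserves context across chunk boundaries
    )
    return splitter.split_text(text)
```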

Embeddings

  • Each text chunk is converted into a high-dimensional vector (numerical representation) that captures its meaning.

Vector Store

  • These vectors are stored in a specialized database (the Vector Store) for ultra-fast similarity searching.
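
How the embedding and vector-store steps fit together, as a hedged sketch (classic LangChain imports; assumes an OPENAI_API_KEY is set and the faiss-cpu package is installed):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

def create_vector_db(chunks: list[str]) -> FAISS:
    """Embed each text chunk and index the vectors in an in-memory FAISS store."""
    embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
    return FAISS.from_texts(chunks, embedding=embeddings)
```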

Phase B: Retrieval & Generation (Answering the Query)

User Query

  • A user asks a question (e.g., "What were the Q3 results?").

Query Embedding

  • The question is converted into a vector using the same embedding model.

Retrieval (Similarity Search)

  • The query vector is matched against the Vector Store to find the top K (e.g., top 3) most relevant document chunks.
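
On its own, the retrieval step looks roughly like this; similarity_search embeds the question internally and returns the k closest chunks (vector_db is the store built during ingestion):

```python
from langchain.vectorstores import FAISS

def retrieve_context(vector_db: FAISS, question: str, k: int = 3) -> list[str]:
    """Embed the question and return the text of the k most similar chunks."""
    docs = vector_db.similarity_search(question, k=k)
    return [doc.page_content for doc in docs]

# Example: retrieve_context(vector_db, "What were the Q3 results?")
```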

Augmentation

  • The LLM receives a prompt containing the original question, the conversation history, and the retrieved document chunks.

Generation

  • The LLM synthesizes a final answer based only on the provided context.
(Figure: overview of the two-phase RAG pipeline.)
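
To make augmentation and generation concrete, here is a manual version of these two steps; the actual app delegates this to ConversationalRetrievalChain (see Section 3), so the helper and prompt wording below are illustrative only (classic LangChain chat API assumed):

```python
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

def answer_from_context(question: str, context_chunks: list[str], history: str = "") -> str:
    """Build an augmented prompt from retrieved chunks and answer using only that context."""
    context = "\n\n".join(context_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Conversation so far:\n{history}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
    return llm([HumanMessage(content=prompt)]).content
```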

3. Implementation in app.ipynb

This project maps each step of the RAG process to a dedicated function built from LangChain components:

| RAG Step | Function | Logic |
| --- | --- | --- |
| Load & Clean | extract_pdf_text | Uses PyPDF2 for extraction, followed by RegEx cleaning to strip document artifacts. |
| Chunking | create_text_chunks | Uses RecursiveCharacterTextSplitter with a chunk size of 3000 and an overlap of 200, preserving context across chunks. |
| Embeddings & Vector Store | create_vector_db | Generates embeddings with OpenAI's embedding model and stores them in an in-memory FAISS index. |
| Orchestration & Memory | converse_using_history | Sets up ConversationalRetrievalChain and ConversationBufferMemory to maintain chat history. |
| Retrieval & Generation | process_user_input | Passes the augmented prompt to the LLM (GPT-4o-mini) and displays the grounded response. |
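
Wired together, the conversational pieces of the notebook look roughly like the sketch below (classic LangChain API assumed; the actual function bodies in app.ipynb may differ):

```python
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory

def converse_using_history(vector_db):
    """Build a retrieval chain that carries chat history between turns."""
    llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
    memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
    return ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vector_db.as_retriever(search_kwargs={"k": 3}),
        memory=memory,
    )

def process_user_input(chain, question: str) -> str:
    """Run one retrieval-augmented turn and return the grounded answer."""
    result = chain({"question": question})
    return result["answer"]
```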

4. Appendix: Plug-and-Play RAG Component Choices

The project is modular, so you can swap individual components to build a custom RAG pipeline; a short swap sketch follows the list below.

  • RAG Component: LLM (Brain)

    • Current Choice: gpt-4o-mini
    • Plug-and-Play Alternatives (Examples): GPT-4o (Higher performance/cost), Llama 3, Mistral (Open-Source via HuggingFace).
  • RAG Component: Embeddings (Meaning)

    • Current Choice: text-embedding-ada-002
    • Plug-and-Play Alternatives (Examples): text-embedding-3-small (Newer/cheaper), all-mpnet-base-v2 (Open-Source).
  • RAG Component: Vector Store (Memory)

    • Current Choice: FAISS
    • Plug-and-Play Alternatives (Examples): Chroma (Lightweight), Qdrant, Milvus (Scalable production databases).
  • RAG Component: Chunking

    • Current Choice: RecursiveCharacterTextSplitter
    • Plug-and-Play Alternatives (Examples): CharacterTextSplitter (Simple), NLTKTextSplitter (Sentence-based).
  • RAG Component: Chain (Logic)

    • Current Choice: ConversationalRetrievalChain
    • Plug-and-Play Alternatives (Examples): Map-Reduce or Refine chains (For summarizing large contexts).
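
As an example of the plug-and-play idea, switching to open-source embeddings and a Chroma store only touches the vector-database step. The sketch below assumes the classic LangChain import paths plus the chromadb and sentence-transformers packages; everything downstream of create_vector_db stays unchanged.

```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

def create_vector_db(chunks: list[str]) -> Chroma:
    """Same interface as the FAISS version, but built from open-source components."""
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
    return Chroma.from_texts(chunks, embedding=embeddings)
```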
