This guide transforms the AskMyDocs project into a modular blueprint for building your own Retrieval-Augmented Generation (RAG) applications using LangChain.
- Goal: To create an intelligent Q&A assistant that can converse with your private or specialized documents (PDFs).
- Core Technology: Built using Streamlit (UI) and the LangChain framework (pipeline).
- Benefit: Allows you to upload documents and ask questions in natural language, eliminating the need for manual searching.
- LLM Limitation: Large Language Models (LLMs) like GPT-4o-mini have a knowledge cutoff and can't answer questions about current, private, or specialized data.
- RAG Solution: Retrieval-Augmented Generation (RAG) connects the LLM to an external knowledge base (your documents).
- Result: The LLM's answers are grounded in your specific source material, making them verifiable and highly relevant.
RAG is a two-step process: Ingestion (data prep) and Retrieval & Generation (answering the question).
- The pipeline reads raw data (PDFs).
- Text is cleaned to remove noise like page numbers, headers, and footers.
- Documents are broken down into small, semantically meaningful chunks because LLMs have input size limits (context window).
- Each text chunk is converted into a high-dimensional vector (numerical representation) that captures its meaning.
- These vectors are stored in a specialized database (the Vector Store) for ultra-fast similarity searching.
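The ingestion steps above can be sketched end to end in plain Python. This is a conceptual, dependency-free sketch: the hash-based `embed` function is a stand-in for a real embedding model (the project uses OpenAI's), and `VectorStore` is a toy in-memory replacement for FAISS; all names here are illustrative, not the project's actual code.

```python
import hashlib
import math
import re


def clean_text(raw: str) -> str:
    """Strip common PDF artifacts such as standalone page numbers."""
    lines = [ln.strip() for ln in raw.splitlines()]
    kept = [ln for ln in lines if ln and not re.fullmatch(r"(Page\s+)?\d+", ln)]
    return " ".join(kept)


def chunk_text(text: str, chunk_size: int = 3000, overlap: int = 200) -> list[str]:
    """Fixed-size chunks with overlap, so content spanning a boundary survives."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks


def embed(text: str, dims: int = 64) -> list[float]:
    """Toy embedding: hash character trigrams into a normalized vector."""
    vec = [0.0] * dims
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class VectorStore:
    """In-memory store with brute-force cosine-similarity search."""

    def __init__(self):
        self.items: list[tuple[list[float], str]] = []

    def add(self, chunks: list[str]) -> None:
        for c in chunks:
            self.items.append((embed(c), c))

    def search(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        scored = sorted(self.items, key=lambda it: -sum(a * b for a, b in zip(q, it[0])))
        return [text for _, text in scored[:k]]
```

A real pipeline swaps `embed` for a model call and `VectorStore` for FAISS, but the clean → chunk → embed → store flow is the same.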
- A user asks a question (e.g., "What were the Q3 results?").
- The question is converted into a vector using the same embedding model.
- The query vector is matched against the Vector Store to find the top K (e.g., top 3) most relevant document chunks.
- The LLM receives a prompt containing the original question, the conversation history, and the retrieved document chunks.
- The LLM synthesizes a final answer based only on the provided context.
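The augmented prompt described in the last two steps can be assembled with plain string formatting. The sketch below is illustrative (the function name and prompt wording are assumptions, not the project's code), but it shows exactly what the LLM receives: the instruction to stay grounded, the retrieved chunks, the history, and the question.

```python
def build_prompt(question: str,
                 history: list[tuple[str, str]],
                 retrieved_chunks: list[str]) -> str:
    """Combine the question, chat history, and top-K chunks into one prompt."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved_chunks))
    turns = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in history)
    return (
        "Answer using ONLY the context below. If the answer is not in the "
        "context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Conversation so far:\n{turns}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Frameworks like LangChain hide this assembly behind prompt templates, but the structure is the same.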
The project maps each RAG step to a specific function and LangChain component:
| RAG Step | Function | Logic |
|---|---|---|
| Load & Clean | `extract_pdf_text` | Uses PyPDF2 for extraction, followed by RegEx cleaning to strip document artifacts. |
| Chunking | `create_text_chunks` | Uses `RecursiveCharacterTextSplitter` with a chunk size of 3000 and an overlap of 200, preserving context across chunk boundaries. |
| Embeddings & Vector Store | `create_vector_db` | Generates embeddings via OpenAI's model and stores them in an in-memory FAISS index. |
| Orchestration & Memory | `converse_using_history` | Sets up `ConversationalRetrievalChain` and `ConversationBufferMemory` to maintain chat history. |
| Retrieval & Generation | `process_user_input` | Passes the augmented prompt to the LLM (GPT-4o-mini) and displays the grounded response. |
The project is modular, allowing you to easily swap components to build a custom RAG pipeline.
| RAG Component | Current Choice | Plug-and-Play Alternatives (Examples) |
|---|---|---|
| LLM (Brain) | gpt-4o-mini | GPT-4o (higher performance/cost), Llama 3, Mistral (open-source via HuggingFace) |
| Embeddings (Meaning) | text-embedding-ada-002 | text-embedding-3-small (newer/cheaper), all-mpnet-base-v2 (open-source) |
| Vector Store (Memory) | FAISS | Chroma (lightweight), Qdrant, Milvus (scalable production databases) |
| Chunking | RecursiveCharacterTextSplitter | CharacterTextSplitter (simple), NLTKTextSplitter (sentence-based) |
| Chain (Logic) | ConversationalRetrievalChain | Map-Reduce or Refine chains (for summarizing large contexts) |