A document question-answering app built with Python and Streamlit.
This project allows users to upload PDF or TXT documents, ask questions, retrieve the most relevant document sections, and optionally generate concise answers using the OpenAI API.
Long documents are often difficult to search manually, especially when users need quick answers from FAQs, reports, policies, study notes, or business documents. This project demonstrates a simple document chatbot workflow that retrieves relevant information from uploaded files and avoids answering when the uploaded document does not contain enough information.
- Upload PDF or TXT files
- Extract text from uploaded documents
- Split long documents into searchable text chunks
- Retrieve relevant sections using TF-IDF similarity
- Ask natural-language questions
- Show retrieved source sections
- Detect when a question is not supported by the uploaded document
- Optional OpenAI API support for generated answers
- Simple, clean, and modular Python project structure
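As an illustration of the chunking feature listed above, here is a minimal sketch of a word-based splitter with overlap. The function name `split_text` and the parameters `chunk_size` and `overlap` are assumptions made for this example; the project's own `src/text_splitter.py` may work differently.

```python
def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears whole in at least one chunk.
    Hypothetical helper, not taken from the project source.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Smaller chunks tend to make retrieved sections more focused, while a modest overlap reduces the chance that an answer is split across two chunks.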
- Python
- Streamlit
- scikit-learn
- pypdf
- OpenAI API
- Git and GitHub
ai_document_chatbot/
├── app.py
├── requirements.txt
├── README.md
├── assets/
│   ├── app_home.png
│   ├── correct_answer_demo.png
│   └── not_found_demo.png
├── sample_docs/
│   └── business_faq.txt
├── src/
│   ├── document_loader.py
│   ├── generator.py
│   ├── retriever.py
│   └── text_splitter.py
└── tests/
    └── test_splitter.py
- The user uploads a PDF or TXT document.
- The document text is extracted.
- The text is split into smaller chunks.
- A TF-IDF retriever finds the most relevant chunks for the user question.
- If an OpenAI API key is available, the app generates an answer using the retrieved context.
- If no OpenAI API key is available, the app shows the most relevant document sections.
- If the question is not supported by the uploaded document, the app avoids inventing an answer.
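The retrieval and "not supported" steps above can be sketched roughly as follows. This is a standard-library stand-in written for illustration, not the project's `src/retriever.py`: the project lists scikit-learn, whereas this sketch computes TF-IDF weights and cosine similarity by hand. The names `retrieve`, `STOP_WORDS`, `top_k`, and `min_score` are assumptions, and the threshold value is arbitrary.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "what", "who", "how",
              "do", "does", "can", "in", "on", "for", "and", "or", "by"}


def tokenize(text: str) -> list[str]:
    """Lowercase, keep alphanumeric tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs: list[list[str]]) -> tuple[list[dict[str, float]], dict[str, float]]:
    """Return one sparse TF-IDF vector per document, plus the IDF table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 3,
             min_score: float = 0.1) -> list[tuple[float, str]]:
    """Return up to `top_k` (score, chunk) pairs scoring at least `min_score`.

    An empty result is treated as "the document does not support this question".
    """
    vectors, idf = tfidf_vectors([tokenize(c) for c in chunks])
    # Question terms that never occur in the document carry no weight.
    q_counts = Counter(t for t in tokenize(question) if t in idf)
    q_vec = {t: tf * idf[t] for t, tf in q_counts.items()}
    scored = sorted(((cosine(q_vec, v), c) for v, c in zip(vectors, chunks)),
                    key=lambda pair: pair[0], reverse=True)
    return [(score, chunk) for score, chunk in scored[:top_k] if score >= min_score]
```

A question that shares only incidental words with a chunk can still clear a low threshold, so `min_score` would need tuning against real documents.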
Create and activate a virtual environment:

    python -m venv .venv

On Windows PowerShell:

    .venv\Scripts\Activate.ps1

If PowerShell blocks script activation, run this once:

    Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

Then activate the environment again:

    .venv\Scripts\Activate.ps1

Install dependencies:

    pip install -r requirements.txt

Run the app:

    streamlit run app.py

The app can work without an OpenAI API key by showing the most relevant document sections. To enable generated answers, set an OpenAI API key as an environment variable.

On Windows PowerShell:

    $env:OPENAI_API_KEY="your_api_key_here"

Then run:

    streamlit run app.py

Do not save API keys directly inside the code or upload them to GitHub.
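One way the app might read the key and choose between its modes is sketched below. Only the `OPENAI_API_KEY` variable name comes from this README; the helpers `get_api_key`, `choose_mode`, and `build_prompt`, the mode labels, and the prompt wording are all assumptions for illustration. The sketch stops short of calling the OpenAI API.

```python
import os
from typing import Mapping, Optional


def get_api_key(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Read the key from the environment; blank values count as missing."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    return key or None


def choose_mode(chunks: list[str], env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a behaviour from the retrieved chunks and key availability.

    The three labels are hypothetical names for the behaviours this README
    describes: refuse, generate an answer, or show the retrieved sections.
    """
    if not chunks:
        return "not_found"
    return "generate" if get_api_key(env) else "show_sections"


def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n\n".join(f"[Section {i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say that the document "
        "does not contain enough information.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Passing `env` explicitly keeps the functions easy to test without touching the real environment; in the app itself, calling `choose_mode(chunks)` would fall back to `os.environ`.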
For the sample clinic FAQ document, users can ask:
What services does the clinic provide?
How can new patients book an appointment?
What are the opening hours?
Who is the CEO of the clinic?
The final question is not answered because the uploaded document does not contain CEO information.
This type of app can be adapted for:
- Company FAQ assistants
- Internal document search
- Customer support knowledge bases
- Student study assistants
- Research paper question-answering
- Policy and procedure document search
- This is a portfolio prototype, not a production-ready chatbot.
- TF-IDF retrieval matches on shared words, so it works for simple search but can miss synonyms and paraphrases that embedding-based retrieval would catch.
- Large document collections may require a vector database.
- PDF extraction quality depends on the structure and formatting of the uploaded PDF.
- OpenAI answer generation requires a valid API key.
- Add embedding-based semantic search
- Add ChromaDB or FAISS vector storage
- Add chat history
- Add source highlighting
- Add user authentication
- Deploy the app online
- Add support for DOCX files and web pages
Aun Ali
Applied AI, Machine Learning, and Computer Vision Developer
GitHub: https://github.com/aun151214