This project represents the final phase of LoRAfrica deployment, making the system accessible to users on the web through scalable serverless infrastructure using Modal and an interactive frontend built with Streamlit.
Deploy LoRAfrica via Modal and Streamlit for seamless web-based user interaction.
- Benchmark performance using the naive and vLLM approaches while measuring TTFT, ITL, TPOT, E2E, and TPS (defined in the sketch below)
- Deploy LoRAfrica using Modal serverless GPU infrastructure
- Enable logging and observability with LangSmith, including PII redaction
- Build and render a Streamlit-based frontend for user interaction
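For reference, the metric abbreviations stand for Time To First Token (TTFT), Inter-Token Latency (ITL), Time Per Output Token (TPOT), End-to-End latency (E2E), and Tokens Per Second (TPS). The sketch below is illustrative only, showing how each metric relates to raw request/token timestamps; the benchmark tooling used in this project computes them for you.

```python
# Illustrative only: how the benchmark metrics relate to raw timings.
# `request_start` is when the request was sent; `token_times` holds the
# arrival time of each streamed output token (all in seconds).
def compute_metrics(request_start: float, token_times: list[float]) -> dict:
    e2e = token_times[-1] - request_start   # end-to-end latency
    ttft = token_times[0] - request_start   # time to first token
    itl = [b - a for a, b in zip(token_times, token_times[1:])]  # inter-token latencies
    # time per output token: decode time spread over the non-first tokens
    tpot = (e2e - ttft) / max(len(token_times) - 1, 1)
    tps = len(token_times) / e2e            # output tokens per second
    return {"TTFT": ttft, "ITL": itl, "TPOT": tpot, "E2E": e2e, "TPS": tps}
```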
- Navigate to your desktop and create a new folder called `llm_gpu`
- Install Git and clone the repo using `git clone https://github.com/daniau23/LoRAfrica_Serverless_Modal_Compute_Streamlit.git`, or just download the repo and unzip it
- Download and install Anaconda
- Once Conda is installed, open your CMD and run `C:/Users/your_system_name/anaconda3/Scripts/activate`
  - You should see something like `(anaconda3) C:\Users\your_system_name\Desktop\>` as an output in your CMD
  - NB: Do not close the CMD terminal; it will be needed later on
- On your CMD, navigate into the `llm_gpu` folder using `cd llm_gpu`
- Run `conda env create -f environment.yml -p ../llm_gpu/lorafrica_cloud`
- Run `conda env list` to list all environments created using Anaconda
- Run `conda activate C:\Users\your_system_name\Desktop\llm_gpu\lorafrica_cloud` to activate the environment
  - You should see something like `(lorafrica_cloud) C:\Users\your_system_name\Desktop\llm_gpu>` as an output in your CMD
- Run `conda list` to check that all dependencies have been installed
All benchmarks were run using Runpod, so kindly refer to this video to learn how to create a Pod instance and get started with Runpod.
- Once the instance is created, clone the project into the Runpod workspace environment using `git clone https://github.com/daniau23/LoRAfrica_Serverless_Modal_Compute_Streamlit.git` and navigate into the benchmarks folder, or just drag and drop the files from the benchmarks folder into Runpod
- Once all files in the benchmarks folder are on Runpod, read the `requirements.txt` file and follow its instructions after installing the first round of dependencies using `pip install -r requirements.txt`
- To start the server for the naive approach, run `python deploy_baseline_runpod.py`. Once the server has started, open another CMD terminal and ping the server to be sure it is working with `python test_server_runpod.py` (see the sketch after this list). It should give a JSON response structure for one response, as seen in `phi4_final_bench_20260327.json` in the results folder
- All benchmark tests are run using the `inference_baseline_runpod.ipynb` file
- To kill the server, simply use `CTRL + C` in the CMD terminal where the server was started
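For orientation, here is a minimal sketch of the kind of ping `test_server_runpod.py` performs. The endpoint path, port, and payload shape below are assumptions, so treat the script in the benchmarks folder as authoritative.

```python
# Hypothetical ping of the locally running baseline server; the route,
# port, and JSON fields are placeholders, not the repo's actual values.
import requests

resp = requests.post(
    "http://localhost:8000/generate",  # assumed host/port/route
    json={"prompt": "Who was Mansa Musa?", "max_tokens": 128},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())  # compare with phi4_final_bench_20260327.json in results
```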
vLLM offers a simpler approach to benchmarking LLM metrics.
- Ensure all files from the vllm folder are on Runpod
- Start the vLLM server using the command below; it can also be found in the `inference_vllm_runpod.ipynb` file

```bash
vllm serve microsoft/Phi-4-mini-instruct \
    --enable-lora \
    --lora-modules african-history=DannyAI/phi4_african_history_lora_ds2_axolotl \
    --max-model-len 512 \
    --gpu-memory-utilization 0.9
```

- Run each benchmark using the code in the notebook; an example command is shown below
```bash
vllm bench serve \
    --model african-history \
    --tokenizer microsoft/Phi-4-mini-instruct \
    --dataset-name custom \
    --dataset-path ./data/history_messages_dataset.jsonl \
    --num-prompts 125 \
    --max-concurrency 4 \
    --metric-percentiles 95,99 \
    --save-result \
    --result-dir ./results_african_history/run_c4
```

Results will be saved in the `results_african_history` folder.
To kill the vLLM server, run the `stop_vllm_server()` function in the notebook.
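The notebook's actual implementation may differ, but a rough sketch of what such a helper could look like, sending the equivalent of a `CTRL + C` to the server process:

```python
# Rough sketch of a stop helper; the notebook's real implementation may differ.
import subprocess

def stop_vllm_server() -> None:
    # Send SIGINT to any process whose command line matches "vllm serve",
    # mirroring a CTRL + C in the terminal where the server was started.
    subprocess.run(["pkill", "-INT", "-f", "vllm serve"], check=False)
```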
Congrats! You have benchmarked the LLM with the naive and vLLM approaches. Now for the deployment phase!
- Ensure your environment is configured via the `.yaml` file
- Navigate to the deployment folder
- `server.py` is your file for server-side deployment, i.e. your backend logic (a sketch of this pattern follows below); `client.py` is your frontend logic for testing and interactions
- Create your Modal account
- To deploy your Modal app, use `modal setup` to authenticate yourself with Modal. Once authenticated, run `modal deploy server.py` and your compute instance is created!

NB: Add the same .env variables (LangSmith, Hugging Face) as Modal secrets to enable logging and avoid runtime errors.
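For orientation, a minimal sketch of the serverless pattern `server.py` uses, based on Modal's public API; the app name, GPU type, secret names, and endpoint signature below are illustrative assumptions, not the repo's actual code.

```python
# Minimal Modal app sketch; names and parameters here are illustrative.
import modal

app = modal.App("lorafrica")  # hypothetical app name
image = modal.Image.debian_slim().pip_install("transformers", "torch", "peft")

@app.function(
    image=image,
    gpu="A10G",  # any Modal-supported GPU type
    # Secret names are assumptions; create them in the Modal dashboard
    # from your .env values (LangSmith, Hugging Face).
    secrets=[modal.Secret.from_name("huggingface"), modal.Secret.from_name("langsmith")],
)
@modal.fastapi_endpoint(method="POST")  # @modal.web_endpoint on older Modal versions
def generate(item: dict) -> dict:
    # Load the LoRA model and run inference here (omitted for brevity).
    return {"response": f"echo: {item.get('prompt', '')}"}
```

`modal deploy server.py` prints a public URL for the endpoint, which the client can then call.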
- The Streamlit folder has your client file, `streamlit_app.py` (frontend), for rendering the user interface. Simply run it using `streamlit run streamlit_client.py` in the CMD
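A bare-bones sketch of the Streamlit-to-Modal pattern; the endpoint URL below is a placeholder, and the repo's `streamlit_app.py` is the real implementation.

```python
# Bare-bones frontend sketch; the Modal URL is a placeholder.
import requests
import streamlit as st

st.title("LoRAfrica: African History Q&A")
prompt = st.text_input("Ask a question about African history")

if st.button("Generate") and prompt:
    resp = requests.post(
        "https://your-workspace--lorafrica-generate.modal.run",  # placeholder
        json={"prompt": prompt},
        timeout=120,
    )
    st.write(resp.json().get("response", resp.text))
```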
That's it!! You have your working project all set and ready to GO! 🚀
Check out the CPU version of this project using Ollama and Llama.cpp
NB: Kindly check tests in modal/streamlit/tests
- Setting up LiteLLM as a proxy to monitor billing and token usage, which would allow alert messages on Slack
- Trying to rent the needed GPU on Runpod to conduct experimentation with the naive and vLLM approaches
- Pattern recognition to redact PIIs in LangSmith (see the sketch below)
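One way to approach that last point, assuming LangSmith's `hide_inputs`/`hide_outputs` client options: mask anything matching PII patterns before traces are uploaded. The regexes below are illustrative and far from exhaustive.

```python
# Sketch of regex-based PII masking via LangSmith's hide_inputs/hide_outputs;
# the patterns are illustrative and not exhaustive.
import re

from langsmith import Client

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def redact(payload: dict) -> dict:
    # Flatten to text for simplicity; a real version would walk the dict.
    text = str(payload)
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return {"redacted": text}

client = Client(hide_inputs=redact, hide_outputs=redact)
```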
LoRAfrica demonstrates an end-to-end LLMOps pipeline for deploying a domain-specific Large Language Model, from fine-tuning and benchmarking to scalable production deployment. By combining LoRA fine-tuning with efficient inference strategies (naive and vLLM), the project evaluates key performance metrics such as TTFT, ITL, TPOT, E2E, and TPS to optimise real-world performance.
The system leverages Modal for serverless GPU-based deployment, enabling automatic scaling and simplified infrastructure management, while Streamlit provides an accessible web interface for user interaction. Additional integrations such as LangSmith ensure observability, logging, and PII redaction for safer and more reliable usage.
Overall, the project highlights how modern tools can be orchestrated to deliver a scalable, efficient, and production-ready AI application tailored to African history knowledge.
- Built a domain-specific LLM using LoRA fine-tuning for African history
- Benchmarked inference performance using both naive and vLLM approaches
- Measured and analysed key latency and throughput metrics (TTFT, ITL, TPOT, E2E, TPS)
- Deployed the model using serverless GPU infrastructure via Modal
- Developed an interactive frontend using Streamlit for real-time user interaction
- Integrated LangSmith for monitoring, logging, and PII redaction
Article Links
