z-aqib/text-analytics


text analytics

a repository maintaining all my intro-to-text-analytics assignments and code

Assignment 1 - Evaluation of LLMs

In this assignment for Introduction to Text Analytics, I evaluated three LLMs — Qwen/Qwen2.5-3B-Instruct, open-thoughts/OpenThinker-7B, and microsoft/phi-4 — across tasks like summarization, question answering, keyword extraction, and translation. Each model was scored based on conciseness, clarity, accuracy, completeness, fidelity, fluency, and consistency. Qwen (3B) emerged as the best overall model due to its speed, accuracy, and ability to handle all tasks effectively, making it the most suitable for deployment at IBA. OpenThinker (7B) performed best specifically in keyword extraction. Overall, the assignment highlighted the trade-offs between model size, response quality, and processing time.
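The per-criterion scoring can be sketched as a simple averaging rubric. The model names below are the three actually evaluated, but the scores are hypothetical placeholders, not the assignment's real numbers:

```python
# Hypothetical per-criterion scores (1-5), averaged across tasks; the real
# values live in the assignment notebook -- these are placeholders only.
criteria = ["conciseness", "clarity", "accuracy", "completeness",
            "fidelity", "fluency", "consistency"]

scores = {
    "Qwen/Qwen2.5-3B-Instruct":     [5, 4, 5, 4, 4, 4, 5],
    "open-thoughts/OpenThinker-7B": [3, 3, 4, 4, 4, 3, 3],
    "microsoft/phi-4":              [4, 4, 4, 4, 4, 4, 3],
}

def rank_models(scores):
    """Average each model's criterion scores and sort best-first."""
    averaged = {m: sum(v) / len(v) for m, v in scores.items()}
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_models(scores)
best_model = ranking[0][0]
```

A weighted average (e.g. weighting accuracy and fidelity more heavily) would be a natural extension of the same rubric.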

Assignment 2 - Evaluation of K-Means

In Assignment 02 for Introduction to Text Analytics, the task was to comprehensively evaluate K-Means clustering on text data using a wide range of preprocessing and vectorization combinations. The clustering was tested for three different values of k (5, 9, and 13) across 48 unique setups involving variations in stopword removal, stemming versus lemmatization, unigram versus bigram generation, and vectorization methods like BOW (using TP or TF), TF-IDF, and Truncated SVD with different numbers of components (50, 100, 200). Each configuration was assessed based on silhouette score and within-cluster sum of squares (WSS) to determine clustering quality. Results showed that LSA with 50 components consistently outperformed other methods, delivering the best silhouette scores and the lowest WSS values. Additionally, the analysis revealed that bigrams generally improved performance, stopword removal was always beneficial, and lemmatization slightly outperformed stemming. Overall, LSA 50 was found to be the most effective and efficient approach for clustering in this assignment.
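A minimal sketch of the winning pipeline (TF-IDF → Truncated SVD → K-Means) using scikit-learn; the toy corpus and the component cap for small inputs are illustrative assumptions, not part of the assignment:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_lsa(docs, k, n_components=50):
    """TF-IDF -> Truncated SVD (LSA) -> K-Means; returns labels, silhouette, WSS."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
    # LSA cannot use more components than features or samples; cap for toy corpora.
    n_components = min(n_components, tfidf.shape[1] - 1, len(docs) - 1)
    reduced = TruncatedSVD(n_components=n_components, random_state=0).fit_transform(tfidf)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(reduced)
    wss = km.inertia_  # within-cluster sum of squares
    sil = silhouette_score(reduced, km.labels_)
    return km.labels_, sil, wss

# Toy corpus standing in for the assignment's dataset.
docs = ["the cat sat on the mat", "dogs chase cats in the yard",
        "stock prices rose sharply today", "markets rallied on strong earnings",
        "cats and dogs are popular pets", "investors watched the stock market"]
labels, sil, wss = cluster_lsa(docs, k=2)
```

Sweeping k over (5, 9, 13) and comparing the returned silhouette and WSS values mirrors the evaluation described above.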

Assignment 3 - Evaluation of K-Means (Part 2)

Assignment 3 for the Introduction to Text Analytics course focused on performing and analyzing K-Means clustering using Word2Vec and Doc2Vec embeddings across different hyperparameter settings for three cluster values (k = 5, 9, and 13). Students conducted ten experiments for each ‘k’, systematically tuning variables like vector size, window size, and epochs, and reported silhouette scores and WSS values. The goal was to determine the best-performing embedding and hyperparameter combinations. Through extensive experimentation, it was found that Word2Vec (CBOW) initially performed well, but Doc2Vec (DBOW) with larger vector sizes eventually produced the best clustering results, outperforming previous assignment methods by a large margin. The assignment concluded with a detailed analysis comparing Word2Vec and Doc2Vec performance and the impact of different hyperparameters on clustering quality.

Assignment 4 - RAG System

In Assignment 4, we built a Retrieval-Augmented Generation (RAG) based Question Answering system that combines TF-IDF retrieval with answer generation using two HuggingFace models: LLaMA-2-7B and Microsoft Phi-2. The system first retrieves the top-k relevant document chunks for each question and then generates answers conditioned on those documents. We evaluated the performance using BLEU and ROUGE metrics, comparing the generated answers with ground-truth answers from a synthetic SQuAD-style dataset. Our experiments showed that Microsoft Phi-2 consistently outperformed LLaMA-2-7B, and retrieving the top-3 documents gave the best balance between context relevance and answer quality. The assignment highlighted the effectiveness of combining lightweight models with even simple retrieval techniques like TF-IDF for factual QA tasks.
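The TF-IDF retrieval stage can be sketched as follows; the chunks and question are made-up stand-ins, and the generation step is only indicated in a comment:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_top_k(question, chunks, k=3):
    """Rank document chunks by TF-IDF cosine similarity to the question."""
    vectorizer = TfidfVectorizer()
    chunk_matrix = vectorizer.fit_transform(chunks)
    q_vec = vectorizer.transform([question])
    sims = cosine_similarity(q_vec, chunk_matrix).ravel()
    top = sims.argsort()[::-1][:k]
    return [chunks[i] for i in top]

# Illustrative chunks; the assignment used a synthetic SQuAD-style dataset.
chunks = ["Paris is the capital of France.",
          "The Nile is a river in Africa.",
          "France borders Spain and Germany.",
          "Mount Everest is the tallest mountain."]
context = retrieve_top_k("What is the capital of France?", chunks, k=3)
# The retrieved chunks would then be concatenated into the prompt for the
# generator model (LLaMA-2-7B or Phi-2 in the assignment).
```

The k=3 setting here matches the top-3 retrieval that gave the best balance in the experiments described above.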

Assignment 5 - Fine-Tuning an LLM

In this assignment, we explored how to fine-tune a small language model, TinyLlama-1.1B-Chat, using two advanced methods: LoRA (Low-Rank Adaptation) for supervised fine-tuning and DPO (Direct Preference Optimization) for aligning model outputs with human preferences. We first used the alpaca-cleaned dataset to train multiple LoRA configurations, experimenting with different hyperparameters like r, alpha, learning rates, and batch sizes to find the best-performing setup based on BLEU scores. Once the optimal LoRA model was found, we applied DPO using the orca_dpo_pairs dataset to further refine the model’s ability to produce helpful, safe, and instruction-aligned responses. Our results showed that LoRA improved the model’s base fluency and task alignment, while DPO fine-tuned the responses to better match human expectations—proving that combining both techniques yielded the best overall performance.
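The LoRA update itself is easy to illustrate in plain NumPy: a frozen weight W plus a trainable low-rank correction scaled by alpha/r. The shapes and values below are illustrative, not TinyLlama's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pretrained weight (dimensions are illustrative).
d_out, d_in = 16, 16
W = rng.normal(size=(d_out, d_in))

# LoRA trains only the low-rank factors A and B; r and alpha were among
# the hyperparameters swept in the assignment.
r, alpha = 4, 8
A = rng.normal(size=(r, d_in)) * 0.01   # small random init
B = np.zeros((d_out, r))                # zero init, so the update starts at 0

def lora_forward(x, W, A, B, r, alpha):
    """Adapted forward pass: y = W x + (alpha / r) * B A x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
y = lora_forward(x, W, A, B, r, alpha)
# Because B is zero-initialized, the adapter contributes nothing before
# training and the adapted output equals the frozen model's output.
```

Training updates only A and B (2·r·d parameters instead of d², for square W), which is why LoRA makes fine-tuning a 1.1B model tractable; DPO then optimizes those same adapter weights against preference pairs.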
