This project explores prevalent themes in Twitter discussions related to the 2024 U.S. Presidential Election. Two topic modeling techniques — the traditional Latent Dirichlet Allocation (LDA) and the transformer-based BERTopic — are implemented and compared in terms of performance, coherence, and interpretability.
- Source: Kaggle Dataset – 2024 US Presidential Elections Twitter Data
- Size: 25,652 tweets
- Format: CSV
- Language: Filtered to English using `langdetect`
📝 Note: The Twitter API was not used due to scraping limitations. Instead, a publicly available dataset was selected to allow in-depth topic exploration.
- Removal of HTML tags, URLs, mentions, hashtags
- Lowercasing, tokenisation, and lemmatisation (for LDA)
- Stopword filtering (for LDA) and light cleaning (for BERTopic)
- Language filtering to retain only English tweets
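The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the regular expressions and the function names `clean_tweet` and `is_english` are assumptions, and the lemmatisation and stopword steps used for LDA are left out because the tooling for them is not stated.

```python
import html
import re

# Illustrative patterns for the noise types listed above (assumed, not the project's own).
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+|www\.\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
WHITESPACE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Strip HTML tags, URLs, mentions and hashtags, then lowercase."""
    text = html.unescape(text)  # decode entities such as "&amp;"
    for pattern in (HTML_TAG, URL, MENTION, HASHTAG):
        text = pattern.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip().lower()


def is_english(text: str) -> bool:
    """Return True when langdetect labels the text as English."""
    # Imported lazily so clean_tweet can be used without langdetect installed.
    from langdetect import detect
    from langdetect.lang_detect_exception import LangDetectException

    try:
        return detect(text) == "en"
    except LangDetectException:  # raised when the text has no usable features
        return False


print(clean_tweet("<b>Vote!</b> @user see https://example.com #Election2024 &amp; more"))
# vote! see & more
```

`langdetect` is probabilistic, so very short tweets can be misclassified; treating detection failures as non-English, as done here, is one possible policy.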
- Top hashtags and mentions extracted and visualized
- Most active users and frequent words analyzed
- Provided context before deeper semantic modelling
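A small sketch of how top hashtags and mentions might be counted, assuming the counts are taken from the raw tweets (before the cleaning step strips them). The helper `top_entities` and the sample tweets are hypothetical.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")


def top_entities(tweets, pattern, n=10):
    """Count case-insensitive matches of `pattern` across tweets, most common first."""
    counts = Counter(
        match.lower() for tweet in tweets for match in pattern.findall(tweet)
    )
    return counts.most_common(n)


# Made-up sample tweets, for illustration only.
sample = [
    "Debate night! #Election2024 @candidate_a",
    "Polls are open #election2024 #Vote",
    "@candidate_a responds to @candidate_b #Vote",
]
print(top_entities(sample, HASHTAG, n=2))
print(top_entities(sample, MENTION, n=1))
```

The same function works for most active users or frequent words by swapping in a different pattern or by feeding a `Counter` directly from a username column.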
- Embedding Model: `all-MiniLM-L6-v2`
- UMAP: 40 components, `min_dist=0.01`, `n_neighbors=20`
- HDBSCAN: `min_cluster_size=90`, `min_samples=50`
- Topic Reduction: Reduced to 15 final topics
- Keyword Extraction: c-TF-IDF with `bm25_weighting=True`
- Output: Representative tweets and topic summaries
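One way the BERTopic settings listed above could be wired together is sketched below. Only the listed hyperparameters come from this write-up; the UMAP `metric` and `random_state`, HDBSCAN's `prediction_data`, the use of `nr_topics` for topic reduction, and the function names are assumptions.

```python
# Hyperparameters taken from the list above; everything else is an assumption.
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
UMAP_PARAMS = {"n_components": 40, "min_dist": 0.01, "n_neighbors": 20}
HDBSCAN_PARAMS = {"min_cluster_size": 90, "min_samples": 50}
N_TOPICS = 15


def build_topic_model():
    """Assemble a BERTopic pipeline from the settings above."""
    # Heavy dependencies are imported here so the settings can be inspected without them.
    from bertopic import BERTopic
    from bertopic.vectorizers import ClassTfidfTransformer
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    return BERTopic(
        embedding_model=SentenceTransformer(EMBEDDING_MODEL),
        # metric and random_state are assumed; they are not stated in the write-up.
        umap_model=UMAP(metric="cosine", random_state=42, **UMAP_PARAMS),
        # prediction_data=True is assumed; it lets HDBSCAN assign new documents later.
        hdbscan_model=HDBSCAN(prediction_data=True, **HDBSCAN_PARAMS),
        ctfidf_model=ClassTfidfTransformer(bm25_weighting=True),
        nr_topics=N_TOPICS,  # merge similar topics down to the target count
    )


def fit_topics(docs):
    """Fit on lightly cleaned tweets; return the model and one topic id per tweet."""
    model = build_topic_model()
    topics, _probs = model.fit_transform(docs)
    return model, topics
```

After fitting, `model.get_topic_info()` gives a per-topic summary table and `model.get_representative_docs(topic_id)` returns representative tweets for a topic. Tweets that HDBSCAN treats as noise are assigned to topic `-1`.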
- Vectorisation: Bag-of-Words using Gensim
- Number of Topics: 15
- Training: `passes=40`, `alpha='auto'`, `eta='auto'`
- Output: Top 10 keywords per topic, topic distribution plots
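The LDA configuration above might look like this in Gensim. This is a sketch under assumptions: the input is taken to be a list of token lists (already lemmatised and stopword-filtered), and `random_state` and the function names are not from the original.

```python
# Training settings from the list above.
LDA_PARAMS = {"num_topics": 15, "passes": 40, "alpha": "auto", "eta": "auto"}


def train_lda(tokenized_docs, seed=42):
    """Train LDA on a bag-of-words corpus built from tokenised tweets."""
    # Imported lazily so the settings can be inspected without Gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    # seed is an assumption, added only to make runs repeatable.
    lda = LdaModel(corpus=corpus, id2word=dictionary, random_state=seed, **LDA_PARAMS)
    return lda, dictionary, corpus


def top_keywords(lda, n_words=10):
    """Return {topic_id: [word, ...]} with the highest-weighted words per topic."""
    return {
        topic_id: [word for word, _weight in lda.show_topic(topic_id, topn=n_words)]
        for topic_id in range(lda.num_topics)
    }
```

Note that `alpha='auto'` is supported by `LdaModel` but not by the multicore variant `LdaMulticore`, which is one reason to use the single-process class here.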
To evaluate topic modeling performance, I compared the two approaches — BERTopic and LDA — using both coherence scores and interpretability.
| Model | Coherence Score | Interpretation |
|---|---|---|
| BERTopic | 0.516 ✅ | High coherence — clear, semantically rich topics |
| LDA | 0.380 ✅ | Good — impressive performance for short-form text |
Coherence scores were calculated using the `c_v` metric from Gensim's `CoherenceModel`, which evaluates how frequently top topic words appear together in the dataset.
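A hedged sketch of how both models could be scored with the same metric: each model's topics are reduced to plain word lists and passed to `CoherenceModel`. The helper names are hypothetical, and the assumption that both models are scored against the same tokenised tweets is mine.

```python
def bertopic_topic_words(topic_table, n_words=10):
    """Turn BERTopic's {topic_id: [(word, score), ...]} mapping into word lists.

    Topic -1 is BERTopic's outlier bucket and is skipped.
    """
    return [
        [word for word, _score in pairs[:n_words] if word]
        for topic_id, pairs in sorted(topic_table.items())
        if topic_id != -1
    ]


def c_v_coherence(topic_words, tokenized_docs):
    """Score topic word lists with Gensim's c_v coherence."""
    # Imported lazily so the helper above can be used without Gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models.coherencemodel import CoherenceModel

    dictionary = Dictionary(tokenized_docs)
    model = CoherenceModel(
        topics=topic_words,
        texts=tokenized_docs,
        dictionary=dictionary,
        coherence="c_v",
    )
    return model.get_coherence()


# Made-up topic table in the shape returned by BERTopic's get_topics().
example = {-1: [("the", 0.1)], 0: [("border", 0.4), ("policy", 0.3)], 1: [("vote", 0.5)]}
print(bertopic_topic_words(example))
# [['border', 'policy'], ['vote']]
```

Every topic word must appear in the dictionary built from `tokenized_docs`, otherwise Gensim raises an error, so BERTopic keywords generally need to be tokenised the same way as the reference texts.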
- BERTopic produced more specific, focused clusters using contextual embeddings and clustering algorithms. Example topics included RFK Jr. support, border policy, and religious political discourse.
- LDA served as a strong baseline, producing broader themes around major figures like Trump, Biden, Kamala Harris, and economic issues. It was less granular but still insightful.
- BERTopic: Visualised inter-topic distances from the UMAP/HDBSCAN clustering, plus bar charts of the top keywords per topic.
- LDA: Explored via word clouds, topic-word tables, and topic frequency plots.
BERTopic emerged as the superior technique for this project, achieving a higher coherence score (0.516) and generating more nuanced topics. LDA, with a score of 0.380, performed well given the brevity and noise in tweet data and served as a valuable benchmark for validation.