Open the interactive Streamlit dashboard:
streaming-user-behavior-eda-fpxbmrvygfyrcqvpflpykj.streamlit.app
Explore content performance, user segmentation, retention signals, and growth opportunities through a 4-page interactive dashboard built with Streamlit and Plotly.
An end-to-end exploratory data analysis project that uncovers actionable insights from streaming platform user behavior, simulating the kind of data-driven decision-making used by growth and product teams at audio/video streaming platforms like NOICE.
Understanding how, when, and why users engage with content is critical for any streaming platform looking to grow. This project takes a storytelling-first approach to EDA, going beyond charts and statistics to deliver business insights that product, growth, and content teams can act on.
Key questions explored:
- What content formats and genres drive the highest engagement?
- When do users drop off, and what predicts early churn?
- Which user segments are most valuable, and which are most at risk?
- What does the data suggest about content strategy going forward?
- Perform comprehensive EDA on simulated streaming platform interaction data
- Translate raw patterns into clear, business-relevant narratives
- Identify retention risks and growth opportunities from behavioral signals
- Present findings in a structured, stakeholder-friendly format
streaming-eda-data-story/
├── data/
│   ├── raw/                            # Raw simulated datasets
│   ├── processed/                      # Cleaned & merged datasets
│   └── generate_synthetic_data.py      # Reproducible dataset generator
├── notebooks/
│   ├── 01_data_overview.ipynb          # Dataset structure & quality check
│   ├── 02_content_analysis.ipynb       # Content consumption patterns
│   ├── 03_user_segmentation.ipynb      # User behavior segmentation
│   ├── 04_retention_analysis.ipynb     # Drop-off & churn signals
│   ├── 05_growth_opportunities.ipynb   # Actionable growth findings
│   └── 06_executive_summary.ipynb      # Full data story (end-to-end)
├── visuals/                            # Exported charts & plots
├── report/                             # Stakeholder-ready summary report
├── requirements.txt
├── .gitignore
└── README.md
Preview snapshots from the full-scale synthetic dataset: 2,000 users · 500 content items · 50,000 interaction events.
| Dataset | Source | Description |
|---|---|---|
| Last.fm User Listening History | Kaggle | User-artist play counts & timestamps |
| Spotify Podcast Metadata | Kaggle | Episode descriptions, ratings, categories |
| KKBox User Behavior | Kaggle | Session logs, subscription data |
| Synthetic Data (custom) | data/generate_synthetic_data.py | Simulated NOICE-style interaction logs |
Demo scale: 2,000 users · 500 content items · 50,000 interaction events (synthetic)
All datasets used for educational and portfolio purposes only.
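As a rough illustration of what a reproducible generator at the demo scale could look like, here is a minimal sketch. It is not the actual content of data/generate_synthetic_data.py: the function name `generate_events`, the column names, the 90-day window, and the uniform distributions are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def generate_events(n_users=2_000, n_items=500, n_events=50_000, seed=42):
    """Return simulated interaction events (hypothetical schema)."""
    rng = np.random.default_rng(seed)  # fixed seed keeps the output reproducible
    start = pd.Timestamp("2024-01-01")  # assumed start of the observation window
    duration = rng.integers(60, 3_600, size=n_events)  # content length in seconds
    offsets = rng.integers(0, 90 * 86_400, size=n_events)  # seconds into a 90-day window
    return pd.DataFrame({
        "user_id": rng.integers(1, n_users + 1, size=n_events),
        "content_id": rng.integers(1, n_items + 1, size=n_events),
        "timestamp": start + pd.to_timedelta(offsets, unit="s"),
        "duration_sec": duration,
        # listened time is a random fraction of the content length
        "listened_sec": (duration * rng.uniform(0.05, 1.0, size=n_events)).astype(int),
    })


events = generate_events()
```

Calling `generate_events()` twice with the same seed yields identical frames, which is what makes downstream notebook results repeatable.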
"What are users actually listening to, and for how long?"
- Top genres by total playtime vs. by unique listener count
- Content length sweet spot: what duration drives the highest completion rate?
- Format comparison: podcast episodes vs. music vs. live streams
- Discovery channels: how users find new content
💡 Key Finding: Content under 25 minutes achieves higher completion rates than long-form content, suggesting a strong case for a short-form audio strategy.
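The content-length comparison above can be sketched with pandas as follows. The column names (`duration_min`, `listened_min`), the bucket edges, and the eight toy rows are assumptions for illustration only; they are not the project's data or results.

```python
import pandas as pd

# Toy events (made-up values): content length and minutes actually listened.
events = pd.DataFrame({
    "duration_min": [8, 12, 20, 24, 30, 45, 60, 90],
    "listened_min": [8, 11, 18, 12, 15, 20, 20, 30],
})

# Completion rate per event, capped at 1.0 in case of replays.
events["completion"] = (events["listened_min"] / events["duration_min"]).clip(upper=1.0)

# Left-closed buckets so that "<25 min" really means strictly under 25 minutes.
events["bucket"] = pd.cut(
    events["duration_min"],
    bins=[0, 25, 60, float("inf")],
    labels=["<25 min", "25-60 min", ">=60 min"],
    right=False,
)

# Mean completion rate per duration bucket.
completion_by_bucket = events.groupby("bucket", observed=True)["completion"].mean()
```

On real data, the same grouping could be repeated per format or genre to check whether the sweet spot holds across content types.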
"Not all users are equal: who are your most valuable listeners?"
- RFM Segmentation to identify user tiers
- Power users vs. casual listeners and content preferences
- Time-of-day patterns by day and segment
- Device & platform breakdown for mobile vs. desktop behavior
💡 Key Finding: Power Listeners account for a disproportionate share of total platform playtime, so retaining this segment is critical.
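A minimal sketch of RFM segmentation on listening data is shown below. It assumes that "monetary" is approximated by total minutes listened, that scores come from percentile ranks split into three bins, and that a combined score of 8 or more marks a Power Listener. The column names, the helper `score`, the threshold, and the toy rows are all hypothetical; the notebooks may define these differently.

```python
import numpy as np
import pandas as pd

# Toy events (made-up values) for four users.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 4, 4],
    "timestamp": pd.to_datetime([
        "2024-03-28", "2024-03-29", "2024-03-30",
        "2024-03-10", "2024-03-20",
        "2024-02-01",
        "2024-03-25", "2024-03-29",
    ]),
    "listened_min": [40, 50, 60, 20, 10, 5, 30, 30],
})

# Recency is measured against the day after the last observed event.
snapshot = events["timestamp"].max() + pd.Timedelta(days=1)

rfm = events.groupby("user_id").agg(
    recency_days=("timestamp", lambda ts: (snapshot - ts.max()).days),
    frequency=("timestamp", "count"),
    monetary_min=("listened_min", "sum"),  # assumption: playtime stands in for revenue
)


def score(series, ascending=True, n_bins=3):
    """Map percentile ranks to integer scores 1..n_bins (higher is better)."""
    pct = series.rank(pct=True, ascending=ascending)
    return np.ceil(pct * n_bins).astype(int)


rfm["r"] = score(rfm["recency_days"], ascending=False)  # fewer days since last play scores higher
rfm["f"] = score(rfm["frequency"])
rfm["m"] = score(rfm["monetary_min"])
rfm["rfm_score"] = rfm[["r", "f", "m"]].sum(axis=1)

# Arbitrary illustrative cutoff, not a threshold taken from the project.
rfm["segment"] = np.where(rfm["rfm_score"] >= 8, "Power Listener", "Casual")
```

Rank-based scoring is used here instead of `pd.qcut` because it tolerates ties and very small samples; on the full dataset either approach should work.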
"When do users leave, and what are the warning signs?"
- Session funnel analysis for content journey drop-off
- Day-1, Day-7, Day-30 retention curves by cohort and acquisition channel
- Early churn indicators from first sessions
- Skip rate analysis by content type
💡 Key Finding: High first-session skip behavior is a strong early warning signal for 30-day churn risk.
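Day-1, Day-7, and Day-30 retention can be computed in several ways. The sketch below uses the classic definition (a user counts as retained on day N if they have at least one event exactly N days after their first activity). The column names, the helper `day_n_retention`, and the toy rows are assumptions for illustration, not the project's actual cohort logic.

```python
import pandas as pd

# Toy events (made-up values) for four users.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 4, 4],
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-02 21:00", "2024-01-08 07:30",
        "2024-01-01 12:00", "2024-01-31 18:00",
        "2024-01-05 10:00",
        "2024-01-03 08:00", "2024-01-04 08:00",
    ]),
})

# Work at calendar-day granularity.
events["date"] = events["timestamp"].dt.normalize()

# Each user's first active day, broadcast back onto their events.
first_seen = events.groupby("user_id")["date"].transform("min")
events["day_offset"] = (events["date"] - first_seen).dt.days


def day_n_retention(df, n):
    """Share of all users with at least one event exactly n days after first activity."""
    total_users = df["user_id"].nunique()
    retained_users = df.loc[df["day_offset"] == n, "user_id"].nunique()
    return retained_users / total_users


retention = {n: day_n_retention(events, n) for n in (1, 7, 30)}
```

To break the curves down by cohort or acquisition channel, the same function could be applied per group after a `groupby` on the relevant column.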
"What does the data say about where to invest next?"
- Underserved genres with high search demand and low content supply
- Creator performance analysis by repeat listener rate
- Cross-content journey between podcasts, music, and live streams
- Seasonal & trending patterns behind content spikes
💡 Key Finding: Talk Show and Tech content categories are underrepresented relative to search demand, a clear gap for content acquisition strategy.
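One simple way to quantify "high search demand, low content supply" is to compare each genre's share of searches with its share of the catalog. The sketch below does that with made-up counts; the numbers, the `gap_ratio` name, and the cutoff of 1 are illustrative assumptions and do not reproduce the project's figures.

```python
import pandas as pd

# Made-up counts per genre: search queries (demand) and catalog items (supply).
searches = pd.Series({"Talk Show": 300, "Tech": 250, "Music": 400, "Comedy": 50}, name="searches")
catalog = pd.Series({"Talk Show": 20, "Tech": 25, "Music": 400, "Comedy": 55}, name="items")

gap = pd.concat([searches, catalog], axis=1)
gap["demand_share"] = gap["searches"] / gap["searches"].sum()
gap["supply_share"] = gap["items"] / gap["items"].sum()

# A ratio above 1 means the genre attracts more search interest than its catalog share.
gap["gap_ratio"] = gap["demand_share"] / gap["supply_share"]

underserved = gap[gap["gap_ratio"] > 1].sort_values("gap_ratio", ascending=False)
```

Using shares rather than raw counts keeps the ratio comparable when search volume and catalog size are on very different scales.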
"If you had to act on this data tomorrow, what would you do?"
| Priority | Insight | Recommended Action | Expected Impact |
|---|---|---|---|
| 🔴 High | High churn in first session | Improve onboarding recommendation flow | +15% Day-7 retention |
| 🔴 High | Power users drive most playtime | Build loyalty program / exclusive content | Reduce power user churn |
| 🟡 Medium | Short-form content outperforms | Commission more sub-25min content | +10% avg completion rate |
| 🟡 Medium | Underserved Talk Show/Tech genres | Prioritize creator acquisition in these niches | Capture high-intent users |
| 🟢 Low | Mobile peak hours 7-9 PM | Schedule content drops and push notifications | +8% notification CTR |
| Visual | Description |
|---|---|
| Genre Heatmap | Listening volume by genre × time of day |
| Retention Cohort Chart | Day-1/7/30 retention curves by cohort |
| RFM Segment Bubble Chart | User segments plotted by engagement value |
| Session Funnel | Drop-off at each stage of content consumption |
| Skip Rate by Category | Bar chart of skip rates across content types |
| Content Duration vs Completion | Scatter plot showing optimal content length |
| Creator Loyalty Index | Top creators ranked by repeat listener rate |
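As an example of the data preparation behind one of these visuals, the genre × time-of-day heatmap can be fed from a pivot table like the one sketched here. The column names and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy events (made-up values).
events = pd.DataFrame({
    "genre": ["Tech", "Tech", "Comedy", "Comedy", "Tech"],
    "timestamp": pd.to_datetime([
        "2024-01-01 08:15", "2024-01-01 20:30", "2024-01-02 20:05",
        "2024-01-03 20:45", "2024-01-03 08:50",
    ]),
    "listened_min": [10, 25, 30, 15, 5],
})

events["hour"] = events["timestamp"].dt.hour

# Rows are genres, columns are hours of day, cells are total minutes listened.
heatmap = events.pivot_table(
    index="genre",
    columns="hour",
    values="listened_min",
    aggfunc="sum",
    fill_value=0,
)
```

The resulting frame can be passed directly to `seaborn.heatmap(heatmap)` for a static chart or `plotly.express.imshow(heatmap)` for an interactive one.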
| Tool | Purpose |
|---|---|
| Pandas | Data wrangling & aggregation |
| Matplotlib / Seaborn | Static visualizations |
| Plotly | Interactive charts |
| Streamlit (optional) | Interactive dashboard version |
| Jupyter Notebook | Narrative + code storytelling format |
git clone https://github.com/LuthfiMirza/streaming-user-behavior-eda.git
cd streaming-user-behavior-eda
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
python data/generate_synthetic_data.py
jupyter notebook
# Or run the full story in one notebook
jupyter notebook notebooks/06_executive_summary.ipynb
Most ML portfolios jump straight to models. This project demonstrates something rarer and equally valuable: the ability to look at raw data, ask the right business questions, and communicate findings that non-technical stakeholders can act on.
For a streaming platform like NOICE, this translates directly to:
- Informing content acquisition strategy
- Improving recommendation engine inputs
- Guiding growth and retention campaigns
- Prioritizing product roadmap decisions
- Synthetic dataset generation pipeline
- Content consumption analysis scaffolding
- User segmentation scaffolding
- Retention & drop-off analysis scaffolding
- Business recommendations framework
- Interactive Streamlit dashboard
- Bahasa Indonesia version of executive summary
- Integration with real public streaming datasets
Luthfi Mirza Darsono
Gunadarma University, Information Systems
📧 luthfimirza2004@gmail.com
LinkedIn | GitHub
This project is licensed under the MIT License.




