Python Data Fundamentals

A collection of reusable Python utilities for data loading, exploratory data analysis (EDA), and data cleaning — built as part of the Google Advanced Data Analytics Professional Certificate.

Project Structure

python-data-fundamentals/
├── src/
│   ├── data_loader.py       # CSV loading, validation, dataset description
│   ├── eda_toolkit.py       # Summary stats, distributions, correlations, outlier detection
│   └── data_cleaning.py     # Deduplication, missing values, type conversion, outlier clipping
├── notebooks/
│   ├── 01_python_basics_demo.ipynb
│   └── 02_pandas_eda_walkthrough.ipynb
├── data/
│   └── c2_epa_air_quality.csv
├── requirements.txt
└── README.md

Key Features

Module          Highlights
data_loader     load_csv() with auto-preview, validate_dataframe() with missing-data thresholds, describe_dataset()
eda_toolkit     Extended summary_statistics() (IQR, skew, kurtosis), distribution/boxplot/correlation plots, IQR-based outlier detection
data_cleaning   Duplicate removal, 5 missing-value strategies (drop/mean/median/mode/ffill), dtype conversion, quantile-based outlier clipping
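
The module sources are not reproduced in this README, so the block below is only a rough sketch of three techniques named in the table: strategy-based missing-value handling, IQR-based outlier detection, and quantile-based clipping. The function names (`fill_missing`, `iqr_outlier_mask`, `clip_to_quantiles`), their parameters, the 1.5 IQR multiplier, and the 1st/99th percentile defaults are all assumptions made for illustration and may not match the signatures in src/.

```python
import pandas as pd


def fill_missing(values: pd.Series, strategy: str = "median") -> pd.Series:
    """Handle NaNs with one of five strategies. Hypothetical helper, not the src/ API."""
    if strategy == "drop":
        return values.dropna()
    if strategy == "mean":
        return values.fillna(values.mean())
    if strategy == "median":
        return values.fillna(values.median())
    if strategy == "mode":
        # mode() can return several values; take the first one.
        # Assumes the series has at least one non-missing value.
        return values.fillna(values.mode().iloc[0])
    if strategy == "ffill":
        return values.ffill()
    raise ValueError(f"unknown strategy: {strategy!r}")


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]. Hypothetical helper."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


def clip_to_quantiles(values: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Clip values into the [lower, upper] quantile range. Hypothetical helper."""
    return values.clip(lower=values.quantile(lower), upper=values.quantile(upper))


# Made-up numbers for illustration only, not taken from the EPA dataset.
with_gaps = pd.Series([1.0, None, 3.0])
filled_mean = fill_missing(with_gaps, "mean")
filled_ffill = fill_missing(with_gaps, "ffill")

sample = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
mask = iqr_outlier_mask(sample)
clipped = clip_to_quantiles(sample, lower=0.0, upper=0.75)
```

Detection and clipping are kept as separate functions here so that outliers can be inspected before any values are changed. Whether the repository follows the same split is not stated.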

Dataset

EPA Air Quality Index (AQI) — 1,725 observations of air quality measurements across U.S. states and counties.

Column        Description
state_name    U.S. state
county_name   County within the state
aqi           Air Quality Index value
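
As a small illustration of how the three documented columns fit together, the sketch below computes a mean AQI per state with plain pandas. The rows are made-up placeholders, not values from the EPA file, and the real CSV may contain columns beyond the three listed above.

```python
import pandas as pd

# Placeholder rows that only mirror the documented column names.
df = pd.DataFrame(
    {
        "state_name": ["State A", "State A", "State B"],
        "county_name": ["County 1", "County 2", "County 3"],
        "aqi": [10.0, 20.0, 5.0],
    }
)

# Average AQI per state, highest first.
mean_aqi_by_state = (
    df.groupby("state_name")["aqi"].mean().sort_values(ascending=False)
)
```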

Getting Started

# Clone the repository
git clone https://github.com/asenabeshiktepeli/python-data-fundamentals.git
cd python-data-fundamentals

# Install dependencies
pip install -r requirements.txt

# Quick usage
python -c "
from src.data_loader import load_csv, validate_dataframe
df = load_csv('data/c2_epa_air_quality.csv')
print(validate_dataframe(df))
"
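
The return value of validate_dataframe() is not documented in this README. As a rough idea of what a check based on missing-data thresholds could look like, here is a minimal sketch. The name `missing_data_report`, the `max_missing_fraction` parameter, and the shape of the returned dictionary are assumptions for illustration, not the repository's API.

```python
import pandas as pd


def missing_data_report(df: pd.DataFrame, max_missing_fraction: float = 0.2) -> dict:
    """Flag columns whose share of missing values exceeds a threshold.

    Hypothetical stand-in for a threshold-based validation step.
    """
    fractions = df.isna().mean()  # per-column share of missing values
    flagged = fractions[fractions > max_missing_fraction]
    return {
        "rows": len(df),
        "columns": list(df.columns),
        "over_threshold": {name: float(value) for name, value in flagged.items()},
    }


# Made-up rows for illustration only.
demo = pd.DataFrame(
    {"aqi": [1.0, None, None, 4.0], "state_name": ["A", "B", "C", "D"]}
)
report = missing_data_report(demo, max_missing_fraction=0.25)
```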

Notebooks

  1. Python Basics Demo — Variables, control flow, functions, and list comprehensions demonstrated with real data
  2. Pandas EDA Walkthrough — End-to-end exploratory analysis of the EPA Air Quality dataset using the src/ utilities
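
The notebooks themselves are not shown here. To give a sense of the kind of extended summary an EDA walkthrough might start from, the sketch below computes IQR, skew, and kurtosis alongside the usual statistics. The function name `extended_summary` and its output layout are assumptions and may differ from summary_statistics() in src/eda_toolkit.py.

```python
import pandas as pd


def extended_summary(values: pd.Series) -> pd.Series:
    """Basic descriptive statistics plus IQR, skew, and kurtosis. Hypothetical helper."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return pd.Series(
        {
            "count": values.count(),
            "mean": values.mean(),
            "median": values.median(),
            "std": values.std(),
            "iqr": q3 - q1,
            "skew": values.skew(),
            "kurtosis": values.kurt(),  # excess kurtosis (normal distribution = 0)
        }
    )


# Made-up numbers for illustration only.
stats = extended_summary(pd.Series([1.0, 2.0, 3.0, 4.0, 5.0]))
```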

Technologies

  • Python 3.10+
  • pandas, NumPy
  • matplotlib, seaborn

License

This project is for educational and portfolio purposes. The EPA Air Quality dataset is publicly available from the U.S. Environmental Protection Agency.
