Additional Resources for the Data Engineering Module for Level 5 Data Science students (2024–2025, University of Westminster)
The module is mandatory for Level 5 Data Science students and was taught by me. It provides a basic understanding of data engineering, combining hands-on experience with data engineering pipelines and some supporting theory.
The compiled list includes courses, blog posts, videos, and textbooks sorted by themes.
- Beginner’s Guide to Data Engineering by GeeksforGeeks.
- What is Data Engineering and is it right for you? by Real Python.
- Data Engineering Course for Beginners (FreeCodeCamp YouTube) by FreeCodeCamp.
- W3Schools: SQL Tutorial.
- Learn Data Engineering: Comprehensive tutorials on ETL, databases, and cloud tools.
- Data Engineering Roadmap: A curated roadmap for aspiring data engineers.
- Introduction to Google Cloud Platform (GCP).
- Introduction to AWS for Data Engineering.
- Introduction to Apache Spark.
- Data Engineering on GCP Specialization (Coursera).
- Building Data Pipelines with Python.
- Designing Data-Intensive Applications by Martin Kleppmann.
- ETL in Depth: A Beginner’s Guide.
- Database Design and Implementation.
- Introduction to Big Data.
- The Hadoop Ecosystem: Official Apache Hadoop resources.
- Understanding Apache Kafka: Kafka documentation and guides.
- Big Data Specialization by University of California, San Diego.
- AWS for Data Engineering.
- Data Engineering on Azure.
- Google Cloud Data Engineering.
- Kubernetes for Beginners.
- Data Ethics and Responsible Use (DataCamp).
- Understanding Data Governance.
- Ethics in AI and Big Data.
- Building Data Pipelines with Apache Airflow.
- ETL Frameworks for Python: A Review.
- Apache NiFi: Automating Data Workflows.
- Stream Processing with Apache Flink.
- Best Practices for Data Pipeline Design.
This section contains scripts demonstrating a data engineering pipeline built using MongoDB Atlas (for unstructured data) and SQLite (for structured data). The pipeline covers the essential steps of ingestion, transformation, and storage for both structured data, such as tabular records, and unstructured data, such as JSON documents.
The scripts are designed to be simple and easy to understand for beginners or intermediate data engineers.
Pipeline Features
Structured Data Processing (SQLite):
Ingestion: Import structured data from CSV or Excel files.
Transformation: Perform basic operations like cleaning.
Storage: Store transformed data in SQLite, a lightweight and easy-to-use database.
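The structured-data steps above can be sketched roughly as follows. This is a minimal illustration rather than the module's actual script; the table name `records` and the cleaning rules are example choices.

```python
import sqlite3
import pandas as pd

def run_structured_pipeline(csv_path: str, db_path: str) -> int:
    # Ingestion: import structured data from a CSV file.
    df = pd.read_csv(csv_path)

    # Transformation: basic cleaning -- drop duplicate and fully empty
    # rows, and normalise column names for SQL friendliness.
    df = df.drop_duplicates().dropna(how="all")
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

    # Storage: write the cleaned table into SQLite.
    with sqlite3.connect(db_path) as conn:
        df.to_sql("records", conn, if_exists="replace", index=False)
        count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
    return count
```

The same function works with an Excel file by swapping `pd.read_csv` for `pd.read_excel`.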
Unstructured Data Processing (MongoDB Atlas):
Ingestion: Insert unstructured data (images and text) as BSON and JSON documents into MongoDB Atlas.
Transformation: Use MongoDB's aggregation framework to extract and manipulate nested or hierarchical data.
Storage: MongoDB Atlas serves as the repository for unstructured data, enabling fast retrieval and scalability.
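A rough sketch of the unstructured-data side, assuming a PyMongo client against an Atlas cluster. The database/collection names (`demo_db`, `documents`), the sample documents, and the nested field `meta.tags` are placeholders for illustration, not taken from the module scripts.

```python
def build_tag_count_pipeline(field: str = "meta.tags"):
    # Aggregation pipeline: unwind a nested array field and count
    # how often each value occurs across all documents.
    return [
        {"$unwind": f"${field}"},
        {"$group": {"_id": f"${field}", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
    ]

def run_unstructured_pipeline(uri: str):
    # Requires: pip install pymongo (and a reachable Atlas cluster).
    from pymongo import MongoClient
    client = MongoClient(uri)
    coll = client["demo_db"]["documents"]

    # Ingestion: insert JSON-like documents with nested structure.
    coll.insert_many([
        {"category": "text", "meta": {"tags": ["notes", "raw"]}},
        {"category": "image", "meta": {"tags": ["raw"]}},
    ])

    # Transformation: run the aggregation framework on the nested data.
    return list(coll.aggregate(build_tag_count_pipeline()))
```

Keeping the aggregation pipeline in its own function lets you inspect or test the pipeline specification without a live database connection.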
Folder Structure
├── Weekly Hands-on/   # PDF guides with context, plus the Python script and data for each week
└── README.md          # Project documentation
Prerequisites
Python 3.8+
MongoDB Atlas account with a free-tier cluster set up.
SQLite (pre-installed with Python).
Google Colab was used in the module, but any Python IDE of your choice will work; just make sure to update the directory paths in the Python scripts to match your own data location.
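For the path change mentioned above, one common pattern is to keep the location in a single variable so only one line needs editing per environment. The `DATA_DIR` name and `data` folder here are examples, not names used in the module scripts.

```python
from pathlib import Path

# Point this at wherever you stored the weekly data files
# (e.g. a mounted Google Drive folder when running in Colab).
DATA_DIR = Path("data")  # change to your own location

def data_file(name: str) -> Path:
    # Build the full path to a data file inside DATA_DIR.
    return DATA_DIR / name
```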
- Fundamentals of Data Engineering: Plan and Build Robust Data Systems by Joe Reis and Matt Housley
- Designing Data-Intensive Applications by Martin Kleppmann
- The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling by Ralph Kimball and Margy Ross
- Cloud Data Management by Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom
- Data Engineering with Python by Paul Crickard
- Building the Data Lakehouse by Bill Inmon and Mary Levins
- Streaming Systems by Tyler Akidau, Slava Chernyak, and Reuven Lax
- The Big Data Handbook by Arvind Sathi
- Data Pipelines Pocket Reference by James Densmore
- Database Internals by Alex Petrov
Happy learning!
If this is useful to you, please star or fork the repo. Thank you!