Additional Resources for the Data Engineering Module for Level 5 Data Science students (2024–2025, University of Westminster)
The module is mandatory for Level 5 Data Science students and was taught by me. It provides a basic understanding of data engineering, combining hands-on experience with data engineering pipelines and some supporting theory.
The compiled list includes courses, blog posts, videos, and textbooks sorted by themes.
- Beginner’s Guide to Data Engineering by GeeksforGeeks.
- What is Data Engineering and is it right for you? by Real Python.
- Data Engineering Course for Beginners (FreeCodeCamp YouTube) by FreeCodeCamp.
- W3Schools: SQL Tutorial.
- Learn Data Engineering: Comprehensive tutorials on ETL, databases, and cloud tools.
- Data Engineering Roadmap: A curated roadmap for aspiring data engineers.
- Introduction to Google Cloud Platform (GCP).
- Introduction to AWS for Data Engineering.
- Introduction to Apache Spark.
- Data Engineering on GCP Specialization (Coursera).
- Building Data Pipelines with Python.
- Designing Data-Intensive Applications by Martin Kleppmann.
- ETL in Depth: A Beginner’s Guide.
- Database Design and Implementation.
- Introduction to Big Data.
- The Hadoop Ecosystem: Official Apache Hadoop resources.
- Understanding Apache Kafka: Kafka documentation and guides.
- Big Data Specialization by University of California, San Diego.
- AWS for Data Engineering.
- Data Engineering on Azure.
- Google Cloud Data Engineering.
- Kubernetes for Beginners.
- Data Ethics and Responsible Use (DataCamp).
- Understanding Data Governance.
- Ethics in AI and Big Data.
- Building Data Pipelines with Apache Airflow.
- ETL Frameworks for Python: A Review.
- Apache NiFi: Automating Data Workflows.
- Stream Processing with Apache Flink.
- Best Practices for Data Pipeline Design.
This section contains scripts demonstrating a data engineering pipeline built using MongoDB Atlas (for unstructured data) and SQLite (for structured data). The pipeline covers the essential steps of ingestion, transformation, and storage for both structured data, such as tabular records, and unstructured data, such as JSON documents.
The scripts are designed to be simple and easy to understand for beginners or intermediate data engineers.
Pipeline Features
Structured Data Processing (SQLite):
Ingestion: Import structured data from CSV or Excel files.
Transformation: Perform basic operations like cleaning.
Storage: Store transformed data in SQLite, a lightweight and easy-to-use database.
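The structured-data steps above can be sketched roughly as follows. This is a minimal illustration rather than the module's actual script; the table name `records` and the cleaning rules are example choices.

```python
import sqlite3
import pandas as pd

def run_structured_pipeline(csv_path: str, db_path: str) -> int:
    # Ingestion: import structured data from a CSV file.
    df = pd.read_csv(csv_path)

    # Transformation: basic cleaning -- drop duplicate and fully empty
    # rows, and normalise column names for SQL friendliness.
    df = df.drop_duplicates().dropna(how="all")
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

    # Storage: write the cleaned table into SQLite.
    with sqlite3.connect(db_path) as conn:
        df.to_sql("records", conn, if_exists="replace", index=False)
        count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
    return count
```

The same function works with an Excel file by swapping `pd.read_csv` for `pd.read_excel`.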
Unstructured Data Processing (MongoDB Atlas):
Ingestion: Insert unstructured data (images and text) as BSON and JSON documents into MongoDB Atlas.
Transformation: Use MongoDB's aggregation framework to extract and manipulate nested or hierarchical data.
Storage: MongoDB Atlas serves as the repository for unstructured data, enabling fast retrieval and scalability.
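A rough sketch of the unstructured-data side, assuming a PyMongo client against an Atlas cluster. The database/collection names (`demo_db`, `documents`), the sample documents, and the nested field `meta.tags` are placeholders for illustration, not taken from the module scripts.

```python
def build_tag_count_pipeline(field: str = "meta.tags"):
    # Aggregation pipeline: unwind a nested array field and count
    # how often each value occurs across all documents.
    return [
        {"$unwind": f"${field}"},
        {"$group": {"_id": f"${field}", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
    ]

def run_unstructured_pipeline(uri: str):
    # Requires: pip install pymongo (and a reachable Atlas cluster).
    from pymongo import MongoClient
    client = MongoClient(uri)
    coll = client["demo_db"]["documents"]

    # Ingestion: insert JSON-like documents with nested structure.
    coll.insert_many([
        {"category": "text", "meta": {"tags": ["notes", "raw"]}},
        {"category": "image", "meta": {"tags": ["raw"]}},
    ])

    # Transformation: run the aggregation framework on the nested data.
    return list(coll.aggregate(build_tag_count_pipeline()))
```

Keeping the aggregation pipeline in its own function lets you inspect or test the pipeline specification without a live database connection.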
Folder Structure
├── Weekly Hands-on/   # PDF guides with context, plus the Python script and data for each week
└── README.md          # Project documentation
Prerequisites
Python 3.8+
MongoDB Atlas account with a free-tier cluster set up.
SQLite (pre-installed with Python).
Google Colab was used in the module, but any Python IDE of your choice will work; just make sure to update the directory paths in the Python scripts to match your own data location.
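For the path change mentioned above, one common pattern is to keep the location in a single variable so only one line needs editing per environment. The `DATA_DIR` name and `data` folder here are examples, not names used in the module scripts.

```python
from pathlib import Path

# Point this at wherever you stored the weekly data files
# (e.g. a mounted Google Drive folder when running in Colab).
DATA_DIR = Path("data")  # change to your own location

def data_file(name: str) -> Path:
    # Build the full path to a data file inside DATA_DIR.
    return DATA_DIR / name
```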
- Fundamentals of Data Engineering: Plan and Build Robust Data Systems by Joe Reis and Matt Housley
- Designing Data-Intensive Applications by Martin Kleppmann
- The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling by Ralph Kimball and Margy Ross
- Cloud Data Management by Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom
- Data Engineering with Python by Paul Crickard
- Building the Data Lakehouse by Bill Inmon and Mary Levins
- Streaming Systems by Tyler Akidau, Slava Chernyak, and Reuven Lax
- The Big Data Handbook by Arvind Sathi
- Data Pipelines Pocket Reference by James Densmore
- Database Internals by Alex Petrov
Happy learning!
If this is useful to you, please star or fork the repo. Thank you!