Balogunhabeeb14/UoW_Data_Engineering

Additional Resources for Data-Engineering Module for Level 5 Data Science students (2024-2025 University of Westminster)

This module is mandatory for Level 5 Data Science students, and I taught it in 2024-2025. It provides a basic understanding of data engineering, combining hands-on experience building data engineering pipelines with some theory.

The compiled list includes courses, blog posts, videos, and textbooks, grouped by theme.

Content


Resources for Beginners


Resources for Students Unfamiliar with Cloud Platforms



Additional Resources

Introduction to Data Engineering (General Resources)


Big Data and Distributed Systems


Cloud Platforms and Scalable Solutions


Data Governance, Ethics, and Bias


Data Pipelines and Automation


Weekly Hands-on

This section contains scripts demonstrating a data engineering pipeline built using MongoDB Atlas (for unstructured data) and SQLite (for structured data). The pipeline covers essential steps such as ingestion, transformation, and storage for both structured datasets like tabular records and unstructured data like JSON or documents.

The scripts are designed to be simple and easy to understand for beginners or intermediate data engineers.

Pipeline Features

Structured Data Processing (SQLite):

Ingestion: Import structured data from CSV or Excel files.
Transformation: Perform basic operations like cleaning.
Storage: Store transformed data in SQLite, a lightweight and easy-to-use database.
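
The three structured-data steps above can be sketched as follows. This is a minimal illustration, not the module's actual script: it uses only Python's standard library (`csv`, `sqlite3`) rather than whatever the weekly scripts use, and the sample data, the `students` table, and the function names `ingest`, `transform`, and `store` are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a CSV file on disk.
RAW_CSV = """id,name,score
1, Alice ,82
2,Bob,
3, Carol ,91
3, Carol ,91
"""


def ingest(csv_text):
    """Ingestion: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transformation: trim whitespace, drop rows with no score, remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        name = row["name"].strip()
        score = row["score"].strip()
        if not score:
            continue  # skip rows with a missing score
        record = (int(row["id"]), name, int(score))
        if record in seen:
            continue  # skip exact duplicates
        seen.add(record)
        cleaned.append(record)
    return cleaned


def store(records, conn):
    """Storage: write the cleaned records into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS students "
        "(id INTEGER PRIMARY KEY, name TEXT, score INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO students VALUES (?, ?, ?)", records)
    conn.commit()


# An in-memory database keeps the sketch self-contained;
# pass a file path to sqlite3.connect() to persist the data.
conn = sqlite3.connect(":memory:")
store(transform(ingest(RAW_CSV)), conn)
rows = conn.execute("SELECT id, name, score FROM students ORDER BY id").fetchall()
print(rows)
```

To read a real file instead, open it with `open(path, newline="")` and pass the file object to `csv.DictReader`.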

Unstructured Data Processing (MongoDB Atlas):

Ingestion: Insert unstructured data (images and text) as BSON or JSON documents into MongoDB Atlas.
Transformation: Use MongoDB's aggregation framework to extract and manipulate nested or hierarchical data.
Storage: MongoDB Atlas serves as the repository for unstructured data, enabling fast retrieval and scalability.
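
The unstructured-data steps above might look roughly like the sketch below. It assumes the third-party `pymongo` driver is installed and that an Atlas connection string is available in a `MONGODB_URI` environment variable; the database name `hands_on_demo`, the collection name `documents`, and the sample documents are hypothetical. It covers text documents only and only talks to Atlas when the environment variable is set.

```python
import os


def build_documents():
    """Hypothetical text documents with nested metadata."""
    return [
        {"title": "Week 1 notes", "kind": "text",
         "meta": {"tags": ["etl", "sqlite"], "words": 120}},
        {"title": "Week 2 notes", "kind": "text",
         "meta": {"tags": ["etl", "mongodb"], "words": 95}},
        {"title": "Week 3 notes", "kind": "text",
         "meta": {"tags": ["mongodb"], "words": 210}},
    ]


def build_pipeline():
    """Aggregation pipeline: count how many documents carry each nested tag."""
    return [
        {"$unwind": "$meta.tags"},  # one output document per tag
        {"$group": {"_id": "$meta.tags", "count": {"$sum": 1}}},
        {"$sort": {"count": -1, "_id": 1}},  # most frequent tags first
    ]


def run(uri):
    """Insert the sample documents, then print the per-tag counts."""
    from pymongo import MongoClient  # third-party: pip install pymongo

    client = MongoClient(uri)
    collection = client["hands_on_demo"]["documents"]
    collection.insert_many(build_documents())
    for row in collection.aggregate(build_pipeline()):
        print(row)
    client.close()


# Only connect when a connection string has been provided.
uri = os.environ.get("MONGODB_URI")
if uri:
    run(uri)
```

Keeping the connection string in an environment variable avoids committing credentials to the repository.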

Folder Structure

├── Weekly Hands-on/    # PDF guide with context, plus the Python script and data for each week
└── README.md            # Project documentation

Prerequisites

Python 3.8+
MongoDB Atlas account with a free-tier cluster set up.
SQLite (the sqlite3 module ships with Python's standard library).
Google Colab was used, but any Python IDE will work; update the data directory in each Python script to match where your data is stored.
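
One way to avoid editing paths in several places is to define the data location once at the top of a script. The sketch below is only a suggestion, not how the weekly scripts are written: the `DATA_DIR` environment variable, the `data_path` helper, and the file name `week1.csv` are hypothetical.

```python
import os
from pathlib import Path

# Hypothetical: read the data folder from a DATA_DIR environment variable,
# falling back to the current working directory when it is not set.
DATA_DIR = Path(os.environ.get("DATA_DIR", ".")).expanduser().resolve()


def data_path(filename):
    """Return the full path of a data file inside DATA_DIR."""
    return DATA_DIR / filename


print(data_path("week1.csv"))
```

In Google Colab the same variable could point at a mounted Drive folder, while on a local machine it could point at a project subfolder.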

Recommended Textbooks for Data Engineering

  1. Fundamentals of Data Engineering: Plan and Build Robust Data Systems by Joe Reis and Matt Housley
  2. Designing Data-Intensive Applications by Martin Kleppmann
  3. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling by Ralph Kimball and Margy Ross
  4. Cloud Data Management by Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom
  5. Data Engineering with Python by Paul Crickard
  6. Building the Data Lakehouse by Bill Inmon and Mary Levins
  7. Streaming Systems by Tyler Akidau, Slava Chernyak, and Reuven Lax
  8. The Big Data Handbook by Arvind Sathi
  9. Data Pipelines Pocket Reference by James Densmore
  10. Database Internals by Alex Petrov

Happy learning!

If you find this useful, please star or fork the repo. Thank you!
