Multiomics analysis of the gene expression and epigenetic dynamics across Cardiac differentiation with Variational Autoencoders
This repository contains a computational pipeline designed for the analysis of RNA-seq and ChIP-seq data through various stages of cardiac differentiation in mice. The pipeline leverages multiple tools for data processing, normalization, dimensionality reduction, and clustering to explore gene regulatory dynamics throughout differentiation.
- Data Mapping: Use scripts in
01_Mappingto download and align raw data. - Differential Expression Analysis:
02_DESeq/DESeq2.Rmdruns differential expression on the processed RNA-seq data. - Signal Recovery: Run notebooks and scripts in
03_RecoverSignalto focus on promoter region signals. - Modeling: Use notebooks in
04_Models/jupyter_notebooksto preprocess data, train models, and perform dimensionality reduction and clustering analyses.
Ensure the required Python libraries and R packages are installed. Use DL.yml for setting up the python environment in conda.
| Program | Version |
|---|---|
| NCBI-SRA | 3.0.10 |
| BOWTIE | 2.5.3 |
| TOPHAT | 2.0.14 |
| SAMTOOLS | 1.19.2 |
This directory contains scripts and Jupyter notebooks for downloading, parsing, and mapping ChIP-seq and RNA-seq data. The main steps are as follows:
- ChIP-seq and RNA-seq Parsing and Downloading: Scripts such as
01_1_ChIP_ENA_table_parse.ipynband02_1_RNA_ENA_table_parse.ipynbprocess tables of data from public repositories. - Mapping Scripts:
MapChIPseq2.plandProcessRNAseq.plhandle sequence mapping to the mouse genome.
Files:
ChIP_ENA_table.tsvandRNA_ENA_table.tsv: Metadata tables for sequencing datasets.
This directory includes the DESeq2.Rmd R markdown file for differential expression analysis of RNA-seq data, enabling identification of differentially expressed genes across stages of cardiac differentiation. A helper script, GenerateSymLinks.sh, organizes files for DESeq2 processing.
Scripts for generating BED files and counting reads in specific promoter regions are included here:
- Signal Recovery Notebooks and Scripts:
01_GenerateBed.ipynband02_CountReadsInPromoterRegions.shfocus on recovering signal data in promoter regions to support downstream analysis.
This directory contains Jupyter notebooks and scripts for data preprocessing, dimensionality reduction, deep learning models training, and clustering to identify patterns in gene expression and epigenetic modifications across cardiac differentiation stages.
-
Data Preparation:
01_DataPreprocessing.ipynb: Cleans and prepares RNA-seq and ChIP-seq data, standardizing inputs for modeling.
-
Dimensionality Reduction:
- Autoencoders:
03_AE_tuning.ipynband04_AE_training.ipynbperform hyperparameter tuning and training for the autoencoder model. - Variational Autoencoder (VAE):
03_VAE_tuning.ipynband04_VAE_training.ipynbhandle VAE tuning and training, with VAE used for compressed representations of gene features. - Classical Approaches:
04_PCA_training.ipynband04_UMAP_training.ipynbapply PCA and UMAP, respectively, to reduce data dimensions as alternative methods for comparison.
- Autoencoders:
-
Latent Space and Clustering Analysis:
05_LatentSpaceAnalysis.ipynb: Analyzes the low-dimensional latent space produced by dimensionality reduction models, examining the structure and distribution of gene features.06_GMMClustering.ipynb: Uses Gaussian Mixture Modeling (GMM) for clustering within the latent space, grouping genes with potentially shared regulatory patterns.07_ClustersAnalysis.ipynb: Further explores cluster characteristics to identify functional groups among the clustered genes.
-
Functional Enrichment and Visualization:
- Over-Representation Analysis (ORA):
01_ORA_RNA_CV.ipynband08_ORA_gmm.ipynbperform functional enrichment analysis to determine if gene clusters are associated with specific biological pathways or functions. - Visualization:
09_TSSPlots.ipynbandTSSplots.shgenerate transcription start site (TSS) meta-plots
- Over-Representation Analysis (ORA):
CustomObjects.py: Contains custom Python objects and functions imported in the ipynb files