Analysis of NGS data from TCGA to study chromatin accessibility and predict transcription factor (TF) binding, with a particular focus on CEBPB.
- Overview
- Code vs Data
- Repository Structure
- Server Data Structure
- File Formats & Column Descriptions
- Workflow Overview
- Cancer Type Summary
- Cancer Type → Cell Line Mapping
- Key Analysis Questions
- Tools & Environments
- Quick Start
- References
- Notes
This project focuses on the analysis of ATAC-seq data from TCGA (The Cancer Genome Atlas) to study chromatin accessibility and predict transcription factor (TF) binding, with a particular focus on CEBPB.
The pipeline covers:
- Raw read alignment
- Signal track generation
- Normalization
- Peak calling
- TF binding prediction (MaxATAC)
- Downstream analysis (PCA, t-SNE, DESeq2, clustering)
The goal is to produce high-quality, normalized data suitable for downstream analysis such as transcription factor binding prediction.
This project is split into two parts:
This repository contains the code required to run the analysis, but NOT the large datasets.
All large files are stored on a remote server at:
/data/hichamif/pred_tf_cancer/
This includes:
- Raw ATAC-seq BAM files (TCGA)
- BigWig / BedGraph signal tracks
- Peak files (MACS2)
- TF predictions (MaxATAC)
- QC results
- Processed intermediate files

These files are not included in GitHub due to size constraints.
.
├── src/                  # Core logic (pipeline + analysis)
│   ├── preprocessing/    # BAM processing, indexing, normalization
│   ├── analysis/         # PCA, t-SNE, DESeq2, clustering
│   ├── benchmarking/     # Evaluation of TF predictions
│   ├── visualization/    # Plots and heatmaps
│   └── utils/            # Helper scripts (logging, config, merging)
├── workflows/
│   └── pipeline/         # Full pipeline (step-by-step scripts)
├── notebooks/            # Exploratory & validation notebooks (Jupyter)
├── config/               # Configuration (SLURM, parameters)
├── archive/              # Old / deprecated files
└── README.md
/data/hichamif/pred_tf_cancer/
│
├── reads/                      # Raw ATAC-seq data (symlinked BAMs from TCGA)
├── new_tracks/                 # Initial coverage tracks (BigWig, scaled to 1M reads)
├── normalized_tracks/          # Normalized coverage tracks (RP20M)
├── cluster_normalization/      # TF-specific normalization (e.g., for CEBPB)
├── peak/                       # MACS2 peak calls (filtered)
├── predicions_cluster/         # Subset of ATAC-seq predictions (e.g., LIHC)
├── predictions_cluster_all/    # Full set of ATAC-seq predictions (across TCGA)
├── pca_analysis/               # Downstream analysis (PCA, DESeq2, clustering)
├── QC_results/                 # Output from fast QC pipeline
├── cell_line_data/             # Reference cell line data for comparison
├── processed_files/            # Generated during processing and feature annotation
├── data/                       # Reference data files used for annotations
├── others/                     # Misc files (e.g., hg38_chrom.sizes)
├── subsets/                    # Cancer-type-specific subset analyses
├── annotated_peaks/            # Peak data organized by cancer/sample/genes
├── cancers/                    # Peak overlap analysis results
├── SAMPLEFILE                  # List of samples and metadata
└── [scripts, logs, other metadata]
Contains symbolic links to BAM files downloaded from TCGA, each representing aligned reads for a specific sample. Each BAM file is accompanied by an index (.bai) file. The file number_reads_1.txt provides statistics like number of mapped reads and converted fragments.
reads/
├── ATAC_TCGA-XXX_YYY_1.bam        # BAM: mapped reads for each sample
├── ATAC_TCGA-XXX_YYY_1.bam.bai    # BAM index for fast access
└── number_reads_1.txt             # QC: read counts and mapping stats

Note: Not all stats are available and some samples are missing; check `QC_results/` for complete data.
First-stage signal tracks in BigWig format, converted from BAM files using scaling to 1 million reads. They provide genome-wide coverage for each sample.
new_tracks/
└── ATAC_TCGA-XXX_YYY_1.bw    # BigWig: coverage per sample
Coverage tracks normalized to a common read depth of 20 million (RP20M) to allow comparison across samples. Includes both BedGraph and BigWig formats. Intermediate uncompressed versions are also retained for inspection. Signal is calculated as total signal per sample, weighted by the size of each region.
normalized_tracks/
├── ATAC_TCGA-XXX_YYY_1.bedgraph          # BedGraph: raw coverage (genomic coordinates + signal values)
├── ATAC_TCGA-XXX_YYY_1_RP20M.bedgraph    # BedGraph: scaled to 20M reads
└── ATAC_TCGA-XXX_YYY_1_RP20M.bw          # BigWig: scaled to 20M reads
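
To make the RP20M scaling concrete, here is a minimal Python sketch that rescales BedGraph signal values so that a sample's depth corresponds to 20 million reads. The scaling factor (20,000,000 divided by the number of mapped reads) and the function names are assumptions for illustration; the pipeline scripts may derive the factor differently.

```python
TARGET_DEPTH = 20_000_000  # RP20M: signal expressed per 20 million reads


def rp20m_factor(mapped_reads: int) -> float:
    """Return the multiplier that brings a sample to the target depth (assumed formula)."""
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    return TARGET_DEPTH / mapped_reads


def scale_bedgraph_line(line: str, factor: float) -> str:
    """Scale the signal column (4th field) of one tab-separated BedGraph line."""
    chrom, start, end, value = line.rstrip("\n").split("\t")
    return f"{chrom}\t{start}\t{end}\t{float(value) * factor:.6g}"


if __name__ == "__main__":
    factor = rp20m_factor(40_000_000)  # hypothetical sample with 40M mapped reads
    print(scale_bedgraph_line("chr1\t0\t100\t8\n", factor))
```

A sample sequenced deeper than 20M reads gets a factor below 1, and a shallower sample gets a factor above 1, which is what makes tracks comparable across samples.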
Signal tracks that have undergone TF-specific normalization using the maxatac normalize tool. These are further processed to be used for clustering or input into predictive models. Includes per-chromosome and genome-wide summary statistics.
cluster_normalization/
├── ATAC_TCGA-XXX_YYY_1_RP20M.bw                         # Normalized BigWig
├── ATAC_TCGA-XXX_YYY_1_RP20M_chromosome_min_max.txt     # Per-chromosome min/max stats
└── ATAC_TCGA-XXX_YYY_1_RP20M_genome_stats.txt           # Genome-wide stats
Peak files output by MACS2, filtered to keep only standard chromosomes. One BED file per sample lists identified open chromatin regions. A summary file logs peak counts per sample.
peak/
├── ATAC_TCGA-XXX_YYY_1_peaks_macs.bed    # BED: called peaks per sample
└── peaks_summary.txt                     # Number of peaks per sample
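
The standard-chromosome filter mentioned above could look roughly like the following Python sketch. The definition of "standard" used here (chr1 to chr22 plus chrX and chrY) is an assumption; whether the pipeline also keeps chrM is not stated.

```python
# Assumed definition of "standard chromosomes" (hg38 autosomes + sex chromosomes).
STANDARD_CHROMS = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY"}


def keep_standard_chromosomes(bed_lines):
    """Return only the BED lines whose first field is a standard chromosome."""
    kept = []
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        chrom = line.split(None, 1)[0]
        if chrom in STANDARD_CHROMS:
            kept.append(line)
    return kept


if __name__ == "__main__":
    example = [
        "chr1\t10184\t10284",
        "chrUn_KI270742v1\t5\t95",
        "chrX\t300\t400",
    ]
    for line in keep_standard_chromosomes(example):
        print(line)
```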
Contains a subset of samples: all LIHC samples and one representative from each other cancer type (22 types excluding LIHC).
predicions_cluster/
├── 1LIHC/                                              # Folder for ONE peak file (test)
│   ├── ATAC_TCGA-LIHC_TCGA-BC-A3KF_1.bed               # All peaks, unfiltered by MaxATAC
│   ├── ATAC_TCGA-LIHC_TCGA-BC-A3KF_1_RP20M.bw
│   └── ATAC_TCGA-LIHC_TCGA-BC-A3KF_1_RP20M_peaks.bed
├── ATAC_TCGA-[CANCER]_[SAMPLE]_RP20M.bw                # One per cancer type (22 types, excl. LIHC)
├── ATAC_TCGA-[CANCER]_[SAMPLE]_peaks.bed               # One per cancer type (22 types, excl. LIHC)
└── LIHC/                                               # All LIHC samples
    ├── ATAC_TCGA-LIHC_TCGA-[SAMPLE]_RP20M.bw
    ├── ATAC_TCGA-LIHC_TCGA-[SAMPLE]_RP20M_peaks.bed
    └── logs/                                           # Logs for prediction runs
Full set of ATAC-seq predictions across TCGA. Contains all ACC samples, all BLCA samples, and subsets of BRCA and CESC samples.
predictions_cluster_all/
├── ATAC_TCGA-[CANCER]_[SAMPLE]_RP20M.bw
└── ATAC_TCGA-[CANCER]_[SAMPLE]_peaks.bed

Status: May need verification or cleanup.
Contains results from dimensionality reduction analyses on the ATAC-seq data, including PCA, tSNE, and differential accessibility analysis using DESeq2.
pca_analysis/
├── filtered_bams/                  # Processed filtered BAMs (NOT ALL FILES)
├── generate_counts_*.log
└── results/
    ├── ATAC_all_regions_count*.txt
    ├── cell_line_counts/
    ├── sample_counts/
    ├── metadata.csv, samples.txt
    ├── DESeq_results_*.csv
    ├── DESeq2_significant_genes.csv
    ├── PCA_plot*.png
    ├── tSNE_plot1.png
    ├── heatmap_DESeq2.png
    ├── volcano_plot_*.pdf
    └── CancerType_Clustering1.png
Comprehensive quality control metrics for all samples, divided into three categories:
QC_results/
├── basic/                     # Basic BAM-level QC
│   ├── SAMPLE_flagstat.txt
│   ├── SAMPLE_idxstats.txt
│   └── summary.txt            # Aggregated metrics: mapped %, mt %, pairing
├── fragments/                 # Insert size distributions (subsampled reads)
│   └── summary.txt            # Fragment type proportions (NFR, mono, di, tri)
└── peaks/                     # Peak-level statistics
    └── counts.txt             # Total peak counts per sample
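
The fragment type proportions (NFR, mono, di, tri) can be illustrated with a short Python sketch that bins insert sizes into classes. The size ranges below are commonly used ATAC-seq conventions and are an assumption here; the cutoffs actually used to produce `fragments/summary.txt` are not documented in this README.

```python
from collections import Counter

# Assumed size ranges in bp (inclusive); not taken from the pipeline itself.
FRAGMENT_CLASSES = [
    ("NFR", 0, 99),      # nucleosome-free region
    ("mono", 180, 247),  # mononucleosome
    ("di", 315, 473),    # dinucleosome
    ("tri", 558, 615),   # trinucleosome
]


def classify_fragment(length: int) -> str:
    """Map one insert size to a fragment class, or 'other' if it falls between ranges."""
    for name, low, high in FRAGMENT_CLASSES:
        if low <= length <= high:
            return name
    return "other"


def fragment_proportions(lengths):
    """Return the fraction of fragments in each observed class."""
    counts = Counter(classify_fragment(abs(n)) for n in lengths)  # TLEN may be negative
    total = sum(counts.values())
    return {name: counts[name] / total for name in counts} if total else {}


if __name__ == "__main__":
    print(fragment_proportions([50, 50, 200, 400]))
```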
Reference data from cancer cell lines corresponding to TCGA tumor types, including both ATAC-seq and ChIP-seq peaks. Data was copied from ENCODE for comparison purposes (no direct Jupyter access to ENCODE).
cell_line_data/
├── atac/                                      # ATAC-seq peaks from cell lines
│   └── ATAC_[CELL_LINE]_*_peaks_macs.bed
└── chip/                                      # ChIP-seq peaks from cell lines
    └── CEBPB_[CELL_LINE]_*_peaks_peakzilla.bed
Contains intermediate and final processed files for downstream analyses, including summit locations, fixed-width windows (100bp around summits), and annotation files.
processed_files/
├── summits/                 # Peak summit information
│   ├── *_summit.bed         # Summits from ATAC-seq, ChIP-seq, predictions
│   └── motif_summits.bed    # Summit locations for motif hits
└── windows/clean/           # Fixed-width 100bp windows around summits
    └── *_window.bed
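
As a sketch of how a fixed-width window can be derived from a summit, the Python function below centers a 100bp window on the summit position (50bp on each side) and clips it to the chromosome bounds. The centering rule and the function name are assumptions; the pipeline may place the window differently.

```python
def summit_window(chrom, summit, width=100, chrom_size=None):
    """Return a (chrom, start, end) window of `width` bp centered on `summit`.

    Coordinates are 0-based, half-open (BED style). The window is clipped at 0
    and, when `chrom_size` is given, at the chromosome end.
    """
    half = width // 2
    start = max(0, summit - half)
    end = summit + half
    if chrom_size is not None:
        end = min(end, chrom_size)
    return chrom, start, end


if __name__ == "__main__":
    # Hypothetical summit at chr1:10234 gives the window chr1:10184-10284.
    print(summit_window("chr1", 10234))
```

Chromosome lengths for clipping could be read from `others/hg38_chrom.sizes`.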
Reference data files used throughout the pipeline.
data/
├── hg38_maxatac_blacklist.bed    # Genomic regions to exclude (blacklist)
└── CEBPB_filtered_6mer.bed       # Filtered CEBPB motif locations
Miscellaneous files.
others/
└── hg38_chrom.sizes    # Chromosome sizes for hg38
Cancer-type-specific subset analyses with visualization outputs.
subsets/
└── {BRCA, LIHC, COAD, LUAD, ...}/
    ├── subset_{cancer_type}.tsv
    └── plots/
        ├── plot1.1_box_atac.png
        ├── plot1.2_access_overlap_bar.png
        ├── plot1.3_scatter_patient_vs_cell.png
        ├── plot2.2_motif_heatmap.png
        ├── plot2.3_motif_access_combo.png
        ├── plot4.1_feature_overlap_combo.png
        └── plot4.3_confidence_dist.png
Peak data organized in three different ways:
annotated_peaks/
├── by_cancer_type/    # Subdirectories for 23 TCGA cancer types
│   └── {ACC, BLCA, BRCA, CESC, CHOL, COAD, ESCA, GBM, HNSC, KIRC,
│        KIRP, LGG, LIHC, LUAD, LUSC, MESO, PCPG, PRAD, SKCM,
│        STAD, TGCT, THCA, UCEC}/
├── by_sample/         # Per-sample annotated peak files
│   └── ATAC_TCGA-[CANCER]_TCGA-[PATIENT]_1_peaks_macs_annotated.bed
└── top_genes/         # Top genes per cancer type
    ├── [CANCER_TYPE]_top_100_genes.txt
    └── all_cancer_types_top10_summary.csv
Peak overlap analysis results.
cancers/
├── peak_overlap_TCGA-[CANCER_TYPE].txt                  # Raw peak overlap data
├── peak_overlap_TCGA-[CANCER_TYPE]_percent_table.txt    # Percentage tables
├── heatmap_[CANCER_TYPE].png                            # Heatmap visualizations
└── cancers/
    ├── heatmap_results/                                 # Heatmaps for all 23 cancer types
    └── heatmapsresults_num/                             # Numerical data for heatmaps
A tab-delimited file containing metadata per sample, including:
- Sample name
- Genome reference (e.g., hg38)
- Additional required columns for downstream processing
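
A minimal Python sketch for reading `SAMPLEFILE` is shown below. It assumes the file has no header row, that the first two columns are the sample name and the genome reference, and that any further columns should be kept as-is; the record type and function name are hypothetical.

```python
import csv
import io
from typing import NamedTuple


class SampleRecord(NamedTuple):
    name: str     # e.g. ATAC_TCGA-XXX_YYY_1
    genome: str   # e.g. hg38
    extra: tuple  # remaining columns, left uninterpreted


def read_samplefile(handle):
    """Parse tab-delimited sample metadata from an open text handle."""
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comments
        records.append(SampleRecord(row[0], row[1], tuple(row[2:])))
    return records


if __name__ == "__main__":
    demo = io.StringIO("ATAC_TCGA-XXX_YYY_1\thg38\tTCGA-XXX\tYYY\t1\n")
    for record in read_samplefile(demo):
        print(record.name, record.genome, record.extra)
```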
Location: `peak/`
Format: `ATAC_TCGA-{CANCER_TYPE}_{SAMPLE_ID}_{REPLICATE}_peaks_macs.bed`

| Column | Description |
|---|---|
| `chr` | Chromosome |
| `start` | Start position |
| `end` | End position |
| `length` | Peak length |
| `signal_value` | Signal intensity |
| `p-value` | Statistical significance |

Location: `data/CEBPB_filtered_6mer.bed`

| Column | Description |
|---|---|
| `chr` | Chromosome |
| `start` | Start position |
| `end` | End position |
| `motif_sequence` | Motif sequence |
| `score` | Motif score |
| `strand` | Strand (+/-) |

Location:
- Multiple samples: `predictions_cluster_all/`
- Single sample: `predicions_cluster/`

| Column | Description |
|---|---|
| `chr` | Chromosome |
| `start` | Start position |
| `end` | End position |
| `prediction_score` | Predicted binding probability |

Location: `cell_line_data/atac/`
Format: `ATAC_{CELL_LINE}_{REPLICATE}_peaks_macs.bed`

| Column | Description |
|---|---|
| `chr` | Chromosome |
| `start` | Start position |
| `end` | End position |
| `length` | Peak length |
| `signal_value` | Signal intensity |
| `p-value` | Statistical significance |

Location: `cell_line_data/chip/`
Format: `CEBPB_{CELL_LINE}_{REPLICATE}_peaks_peakzilla.bed`

| Column | Description |
|---|---|
| `chr` | Chromosome |
| `start` | Start position |
| `end` | End position |
| `summit` | Summit position |
| `fold_change` | Fold change |
| `q-value` | Adjusted p-value |

Location: `processed_files/windows/clean/`
If 100bp windows around summits are needed, they are available here.
Patient ATAC-seq peaks (windowed):
`ATAC_{CANCER_TYPE}_{PATIENT_ID}_{REPLICATE}_peaks_macs_window.bed`
chr1 10184 10284 441 199.492 4.68993
chr1 14482 14582 2327 44.2601 2.22135
| Column | Description |
|---|---|
| `chr` | Chromosome |
| `start` | Start position |
| `end` | End position |
| `length` | Peak length |
| `signal_value` | Signal intensity |
| `p-value` | Statistical significance |
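
To show how these six columns map onto a record, here is a small Python sketch that parses one line of a windowed peak file. The `Peak` type and `parse_peak_line` function are hypothetical helpers, not part of the repository.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int
    end: int
    length: int          # length of the original peak, not of the window
    signal_value: float
    p_value: float       # stored as found in the file, without rescaling


def parse_peak_line(line: str) -> Peak:
    """Parse one whitespace-separated peak line with the six columns above."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}")
    return Peak(fields[0], int(fields[1]), int(fields[2]),
                int(fields[3]), float(fields[4]), float(fields[5]))


if __name__ == "__main__":
    peak = parse_peak_line("chr1 10184 10284 441 199.492 4.68993")
    print(peak.chrom, peak.end - peak.start, peak.length)
```

In the example row the window spans 100bp while the `length` column still reports the 441bp width of the original peak.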
Cell line ATAC-seq peaks (windowed):
`ATAC_{CELL_LINE}_{REPLICATE}_peaks_macs_window.bed`
chr1 181569 181669 444 36.79459 7.72453
chr1 191425 191525 219 8.13932 3.35317
Cell line ChIP-seq peaks (windowed):
`CEBPB_{CELL_LINE}_{REPLICATE}_peaks_peakzilla_window.bed`
chr1 920269 920369 1.14 6.04
chr1 1000848 1000948 2.26 11.61

| Column | Description |
|---|---|
| `chr` | Chromosome |
| `start` | Start position |
| `end` | End position |
| `summit` | Summit position |
| `fold_change` | Fold change |
| `q-value` | Adjusted p-value |

Patient predicted ChIP-seq peaks (windowed):
`ATAC_{CANCER_TYPE}_{PATIENT_ID}_{REPLICATE}_RP20M_peaks_pred_window.bed`
Note: Not present for all patients; 78 samples are available.
chr1 10366 10466 0.99420804
chr1 15134 15234 0.5894148
| Column | Description |
|---|---|
| `chr` | Chromosome |
| `start` | Start position |
| `end` | End position |
| `prediction_score` | Predicted binding probability |
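
Because `prediction_score` is a probability, a common follow-up is to keep only confident windows. The Python sketch below filters prediction lines by a minimum score; the default cutoff of 0.5 is an arbitrary illustration, not a threshold used by this project.

```python
def filter_predictions(lines, min_score=0.5):
    """Return (chrom, start, end, score) tuples with score >= min_score."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        score = float(fields[3])
        if score >= min_score:
            kept.append((fields[0], int(fields[1]), int(fields[2]), score))
    return kept


if __name__ == "__main__":
    rows = [
        "chr1 10366 10466 0.99420804",
        "chr1 15134 15234 0.5894148",
    ]
    for window in filter_predictions(rows, min_score=0.9):
        print(window)
```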
Motif file (windowed):
`motif_summits_window.bed`
chr1 19523 19623 ATTGTGAAAT 0.000176 -
chr1 33155 33255 ATTGTGTAAT 7.22e-05 +

| Column | Description |
|---|---|
| `chr` | Chromosome |
| `start` | Start position |
| `end` | End position |
| `motif_sequence` | Motif sequence |
| `score` | Motif score |
| `strand` | Strand (+/-) |
BAM Files (TCGA)
         │
         ▼
┌─────────────────┐
│ 1. BAM QC &     │──▶ reads/
│    Indexing     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 2. Signal Track │──▶ new_tracks/
│    Generation   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 3. Normalize    │──▶ normalized_tracks/
│    (RP20M)      │
└────────┬────────┘
         │
    ┌────┴────┐
    ▼         ▼
┌────────┐ ┌────────────────┐
│ 4. Peak│ │ 5. TF-specific │──▶ cluster_normalization/
│ Calling│ │  Normalization │
└───┬────┘ └───────┬────────┘
    │              │
    ▼              ▼
  peak/    ┌────────────────┐
           │ 6. TF Binding  │──▶ predicions_cluster/
           │    Prediction  │    predictions_cluster_all/
           │    (MaxATAC)   │
           └───────┬────────┘
                   │
                   ▼
           ┌────────────────┐
           │ 7. Feature     │──▶ processed_files/
           │    Integration │
           └───────┬────────┘
                   │
                   ▼
           ┌────────────────┐
           │ 8. Downstream  │──▶ pca_analysis/
           │    Analysis    │    subsets/
           └────────────────┘
- BAM Acquisition & QC: BAM files are linked from TCGA repositories and stored in `reads/`. Indexing and QC are performed.
- Signal Track Generation: Using `bedtools`, BAMs are converted to genome-wide coverage tracks and stored in `new_tracks/`.
- Normalization: Tracks are scaled to 20M reads and saved in `normalized_tracks/`.
- Peak Calling: MACS2 is run on each BAM file. Peaks are filtered and saved in `peak/`.
- TF-specific Normalization: Optional normalization (e.g., for CEBPB) using `maxatac normalize`, stored in `cluster_normalization/`.
- TF Binding Prediction: MaxATAC is used to predict CEBPB binding in each sample.
- Feature Integration: Peak summit locations are extracted from ATAC-seq, ChIP-seq, and predictions. 100bp windows are created around summits. Windows are merged to create a unified region set.
- QC and Metadata Tracking: Sample info is tracked in `SAMPLEFILE`. QC summaries are included in each respective folder. Fragment size distributions are analyzed for NFR and nucleosome patterns.
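
The window-merging part of the Feature Integration step can be sketched in Python as a standard interval merge. This is an illustration only: it assumes that overlapping or directly adjacent windows on the same chromosome are fused into one region, and the pipeline itself may rely on a dedicated tool for this.

```python
def merge_windows(windows):
    """Merge overlapping or adjacent (chrom, start, end) windows into unified regions.

    Coordinates are 0-based, half-open (BED style). Chromosomes are ordered
    lexicographically, which is enough for merging but not for genome order.
    """
    merged = []
    for chrom, start, end in sorted(windows):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end))  # extend the open region
        else:
            merged.append((chrom, start, end))
    return merged


if __name__ == "__main__":
    windows = [
        ("chr1", 10366, 10466),  # e.g. a prediction window
        ("chr1", 10184, 10284),  # e.g. an ATAC-seq window
        ("chr1", 10250, 10350),  # hypothetical overlapping window
    ]
    print(merge_windows(windows))
```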
Total Samples: 410
Based on available reads, peaks, and tracks (predictions still in progress).
| Cancer Type | TCGA Code | Samples |
|---|---|---|
| Adrenocortical carcinoma | ACC | 9 |
| Bladder urothelial carcinoma | BLCA | 10 |
| Breast invasive carcinoma | BRCA | 75 |
| Cervical squamous cell carcinoma | CESC | 4 |
| Cholangiocarcinoma | CHOL | 5 |
| Colon adenocarcinoma | COAD | 41 |
| Esophageal carcinoma | ESCA | 18 |
| Glioblastoma multiforme | GBM | 9 |
| Head and neck squamous cell carcinoma | HNSC | 9 |
| Kidney renal clear cell carcinoma | KIRC | 16 |
| Kidney renal papillary cell carcinoma | KIRP | 34 |
| Brain lower grade glioma | LGG | 13 |
| Liver hepatocellular carcinoma | LIHC | 17 |
| Lung adenocarcinoma | LUAD | 22 |
| Lung squamous cell carcinoma | LUSC | 16 |
| Mesothelioma | MESO | 7 |
| Pheochromocytoma and Paraganglioma | PCPG | 9 |
| Prostate adenocarcinoma | PRAD | 26 |
| Skin cutaneous melanoma | SKCM | 13 |
| Stomach adenocarcinoma | STAD | 21 |
| Testicular germ cell tumors | TGCT | 9 |
| Thyroid carcinoma | THCA | 14 |
| Uterine corpus endometrial carcinoma | UCEC | 13 |
| Total | | 410 |
Cell line data available on the server for comparison:
| Cancer Type | Cell Line | Tissue Type |
|---|---|---|
| BRCA | MCF7 | Breast Cancer |
| LIHC | HepG2 | Liver Cancer |
| LUAD / LUSC | A549 | Lung Cancer |
| COAD | HCT116 | Colon Cancer |
| STAD | SNU719 | Stomach Cancer |
| BLCA | T24 | Bladder Cancer |
| GBM | U87 | Glioblastoma |
| PRAD | LNCaP | Prostate Cancer |
- How does CEBPB binding differ across cancer types?
- Do cancer cell lines accurately represent patient tissue for CEBPB binding?
- Can we identify cancer type-specific CEBPB binding sites?
- What is the correlation between CEBPB binding and chromatin accessibility?
- How does CEBPB binding relate to known cancer pathways?
| Tool | Purpose |
|---|---|
| `samtools` | BAM indexing and statistics (module load) |
| `bedtools` | Coverage and read manipulation |
| `macs2` | Peak calling (pipeline compatible with both versions) |
| `ucsc-bedgraphtobigwig` | BedGraph → BigWig conversion |
| `bigWigToBedGraph` | BigWig → BedGraph conversion |
| `maxatac` | TF-specific normalization, predictions, benchmarking |
| R (DESeq2, ggplot2) | Statistical analysis and visualization |
| Python | Scripting and utilities |
| SLURM | HPC job scheduling |
# MaxATAC environment
source /shared/software/miniconda3/etc/profile.d/conda.sh
conda activate /shared/home/bancquaa/.conda/envs/maxatac
# Python analysis environment
conda activate /data/hichamif/envs/pred_tf_env

1. Link BAM file to reads/ directory:
ln -s /path/to/original/sample.bam reads/ATAC_TCGA-XXX_YYY_1.bam

2. Generate signal track:
bash others/scripts/generate_tracks.sh reads/ATAC_TCGA-XXX_YYY_1.bam

3. Call peaks:
bash others/scripts/call_peaks.sh reads/ATAC_TCGA-XXX_YYY_1.bam

4. Predict TF binding:
# Activate MaxATAC environment
source /shared/software/miniconda3/etc/profile.d/conda.sh
conda activate /shared/home/bancquaa/.conda/envs/maxatac
# Run prediction
maxatac predict \
  --signal normalized_tracks/ATAC_TCGA-XXX_YYY_1_RP20M.bw \
  --model CEBPB \
  --output predictions_cluster_all/ATAC_TCGA-XXX_YYY_1

5. Add to metadata:
echo -e "ATAC_TCGA-XXX_YYY_1\thg38\tTCGA-XXX\tYYY\t1" >> SAMPLEFILE

For batch processing of multiple samples, use the provided SLURM scripts in the `others/scripts/` directory.
- TCGA Research Network: https://www.cancer.gov/tcga
- ENCODE Project: https://www.encodeproject.org/
- MaxATAC: Cazares, T. A. et al. PLOS Computational Biology (2023)
- MACS2: Zhang, Y. et al. Genome Biology (2008)
- CEBPB function: Nerlov, C. Nature Reviews Cancer (2007)
- All files are named in a consistent format, `ATAC_TCGA-<CANCER>_<ID>_<REPLICATE>`, for traceability.
- Scripts used for each step are available on request or in the supplementary scripts folder (if included).
- This repository is optimized for reproducibility and modular processing.
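
The naming convention can be checked programmatically. The Python sketch below splits a sample name into cancer type, sample ID, and replicate; the regular expression is an assumption derived from the `ATAC_TCGA-<CANCER>_<ID>_<REPLICATE>` pattern and may need adjusting for file suffixes such as `_RP20M`.

```python
import re

# Assumed pattern: uppercase cancer code, free-form ID, numeric replicate.
SAMPLE_RE = re.compile(
    r"^ATAC_TCGA-(?P<cancer>[A-Z]+)_(?P<sample_id>.+)_(?P<replicate>\d+)$"
)


def parse_sample_name(name: str):
    """Return (cancer, sample_id, replicate) or None if the name does not match."""
    match = SAMPLE_RE.match(name)
    if match is None:
        return None
    return match.group("cancer"), match.group("sample_id"), int(match.group("replicate"))


if __name__ == "__main__":
    print(parse_sample_name("ATAC_TCGA-LIHC_TCGA-BC-A3KF_1"))
```

Grouping the parsed cancer codes with `collections.Counter` is one way to recompute per-type sample counts from a directory listing.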
`/data/hichamif/pred_tf_cancer/peak/`
This directory contains the ATAC-seq peak files. Each BED file represents the open (accessible) regions in the genome of one sample.
Useful for:
- Locating active regulatory regions.
- Intersecting these regions with CEBPB predictions and motifs.

`/data/hichamif/pred_tf_cancer/predictions_cluster_all/`
`/data/hichamif/pred_tf_cancer/predicions_cluster/`
- `predictions_cluster_all/` contains the MaxATAC predictions for all samples of each cancer type.
- `predicions_cluster/` contains one prediction per representative sample of each cancer type.

Each file lists regions (BED format) predicted to be CEBPB binding sites.

`/data/hichamif/pred_tf_cancer/data/CEBPB_filtered_6mer.bed`
BED file of the identified CEBPB motifs (probably obtained with FIMO, MOODS, or a database such as HOCOMOCO or JASPAR).
Useful for:
- Checking whether predicted ChIP/ATAC peaks contain a canonical CEBPB motif.
- Adding a validation layer (filtering by motif presence).

`/data/hichamif/pred_tf_cancer/processed_files/windows/clean/`
These files contain the genomic windows used for prediction, after filtering (quality, scores, size, etc.).
Useful for:
- Linking predictions to precise genomic regions.
- Enabling a clean intersection of predictions, motifs, and ATAC/ChIP peaks.
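
The intersection of predicted windows with motif windows can be sketched in Python as a simple overlap test on half-open BED intervals. The function names are hypothetical, and this pairwise comparison is only suited to small inputs; for genome-wide files a dedicated interval tool would be the usual choice.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def windows_with_motif(predicted, motifs):
    """Return the predicted windows that overlap at least one motif window."""
    by_chrom = {}
    for motif in motifs:
        by_chrom.setdefault(motif[0], []).append(motif)
    return [
        window for window in predicted
        if any(overlaps(window, motif) for motif in by_chrom.get(window[0], []))
    ]


if __name__ == "__main__":
    predicted = [("chr1", 10366, 10466), ("chr1", 15134, 15234)]
    motifs = [("chr1", 10400, 10500), ("chr1", 19523, 19623)]  # first one is hypothetical
    print(windows_with_motif(predicted, motifs))
```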