🧬 Pred_tf_cancer

Analysis of NGS data from TCGA to study chromatin accessibility and predict transcription factor (TF) binding, with a particular focus on CEBPB.

📌 Table of Contents

Overview
Code vs Data
Repository Structure
Server Data Structure
File Formats & Column Descriptions
Workflow Overview
Cancer Type Summary
Cancer Type ↔ Cell Line Mapping
Key Analysis Questions
Tools & Environments
Quick Start
References
Notes

🔬 Overview

This project focuses on the analysis of ATAC-seq data from TCGA (The Cancer Genome Atlas) to study chromatin accessibility and predict transcription factor (TF) binding, with a particular focus on CEBPB.

The pipeline covers:

Raw read alignment
Signal track generation
Normalization
Peak calling
TF binding prediction (MaxATAC)
Downstream analysis (PCA, t-SNE, DESeq2, clustering)

The goal is to produce high-quality, normalized data suitable for downstream analysis such as transcription factor binding prediction.

⚠️ Important: Code vs Data

This project is split into two parts:

💻 1. This Git Repository (Code Only)

This repository contains the code required to run the analysis, but NOT the large datasets.

📦 2. Server Data (NOT in Git)

All large files are stored on a remote server at:

/data/hichamif/pred_tf_cancer/

This includes:

🧬 Raw ATAC-seq BAM files (TCGA)
📊 BigWig / BedGraph signal tracks
📍 Peak files (MACS2)
🔮 TF predictions (MaxATAC)
📈 QC results
🧪 Processed intermediate files

👉 These are not included in GitHub due to size constraints.

📁 Repository Structure

.
├── src/                # Core logic (pipeline + analysis)
│   ├── preprocessing/  # BAM processing, indexing, normalization
│   ├── analysis/       # PCA, t-SNE, DESeq2, clustering
│   ├── benchmarking/   # Evaluation of TF predictions
│   ├── visualization/  # Plots and heatmaps
│   └── utils/          # Helper scripts (logging, config, merging)
├── workflows/
│   └── pipeline/       # Full pipeline (step-by-step scripts)
├── notebooks/          # Exploratory & validation notebooks (Jupyter)
├── config/             # Configuration (SLURM, parameters)
├── archive/            # Old / deprecated files
└── README.md

🗂 Server Data Structure

/data/hichamif/pred_tf_cancer/
│
├── reads/                          # Raw ATAC-seq data (symlinked BAMs from TCGA)
├── new_tracks/                     # Initial coverage tracks (BigWig, scaled to 1M reads)
├── normalized_tracks/              # Normalized coverage tracks (RP20M)
├── cluster_normalization/          # TF-specific normalization (e.g., for CEBPB)
├── peak/                           # MACS2 peak calls (filtered)
├── predicions_cluster/             # Subset of ATAC-seq predictions (e.g., LIHC)
├── predictions_cluster_all/        # Full set of ATAC-seq predictions (across TCGA)
├── pca_analysis/                   # Downstream analysis (PCA, DESeq2, clustering)
├── QC_results/                     # Output from fast QC pipeline
├── cell_line_data/                 # Reference cell line data for comparison
├── processed_files/                # Generated during processing and feature annotation
├── data/                           # Reference data files used for annotations
├── others/                         # Misc files (e.g., hg38_chrom.sizes)
├── subsets/                        # Cancer-type-specific subset analyses
├── annotated_peaks/                # Peak data organized by cancer/sample/genes
├── cancers/                        # Peak overlap analysis results
├── SAMPLEFILE                      # List of samples and metadata
└── [scripts, logs, other metadata]

📂 Folder Descriptions

`reads/`

Contains symbolic links to BAM files downloaded from TCGA, each holding the aligned reads for one sample. Each BAM file is accompanied by its index (.bai) file. The file number_reads_1.txt records per-sample statistics such as the number of mapped reads and converted fragments.

reads/
├── ATAC_TCGA-XXX_YYY_1.bam        # BAM: mapped reads for each sample
├── ATAC_TCGA-XXX_YYY_1.bam.bai    # BAM index for fast access
└── number_reads_1.txt              # QC: read counts and mapping stats

⚠️ Note: number_reads_1.txt is incomplete (some samples are missing); check QC_results/ for complete statistics.

`new_tracks/`

First-stage signal tracks in BigWig format, generated from the BAM files and scaled to 1 million reads. They provide genome-wide coverage for each sample.

new_tracks/
└── ATAC_TCGA-XXX_YYY_1.bw         # BigWig: coverage per sample

`normalized_tracks/`

Coverage tracks normalized to a common read depth of 20 million (RP20M) to allow comparison across samples. Includes both BedGraph and BigWig formats. Intermediate uncompressed versions are also retained for inspection. Signal is calculated as total signal per sample, weighted by the size of each region.

normalized_tracks/
├── ATAC_TCGA-XXX_YYY_1.bedgraph           # BedGraph: raw coverage (genomic coordinates + signal values)
├── ATAC_TCGA-XXX_YYY_1_RP20M.bedgraph     # BedGraph: scaled to 20M reads
└── ATAC_TCGA-XXX_YYY_1_RP20M.bw           # BigWig: scaled to 20M reads
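
As a rough illustration of the RP20M scaling described above, the following sketch computes the region-size-weighted total signal of a BedGraph and rescales its values. The function names are hypothetical, and the scale factor (20,000,000 divided by the sample's mapped read count) is an assumption about how the normalization is defined, not a copy of the pipeline scripts.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int, float]  # (chrom, start, end, value)

TARGET_DEPTH = 20_000_000  # RP20M target depth, as stated in the text above


def parse_bedgraph(lines: Iterable[str]) -> List[Interval]:
    """Parse 4-column BedGraph lines, skipping blank, track, and comment lines."""
    intervals = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        chrom, start, end, value = fields[:4]
        intervals.append((chrom, int(start), int(end), float(value)))
    return intervals


def weighted_total_signal(intervals: List[Interval]) -> float:
    """Total signal, with each value weighted by the width of its region."""
    return sum((end - start) * value for _, start, end, value in intervals)


def rp20m_scale_factor(mapped_reads: int) -> float:
    """Assumed definition: factor that brings a sample to 20 million reads."""
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    return TARGET_DEPTH / mapped_reads


def scale_intervals(intervals: List[Interval], factor: float) -> List[Interval]:
    """Multiply every signal value by the scale factor."""
    return [(c, s, e, v * factor) for c, s, e, v in intervals]
```

With this definition, a sample with 10 million mapped reads would have every value doubled.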

`cluster_normalization/`

Signal tracks that have undergone TF-specific normalization using the maxatac normalize tool. These are further processed to be used for clustering or input into predictive models. Includes per-chromosome and genome-wide summary statistics.

cluster_normalization/
├── ATAC_TCGA-XXX_YYY_1_RP20M.bw                       # Normalized BigWig
├── ATAC_TCGA-XXX_YYY_1_RP20M_chromosome_min_max.txt    # Per-chromosome min/max stats
└── ATAC_TCGA-XXX_YYY_1_RP20M_genome_stats.txt          # Genome-wide stats

`peak/`

Peak files output by MACS2, filtered to keep only standard chromosomes. One BED file per sample lists identified open chromatin regions. A summary file logs peak counts per sample.

peak/
├── ATAC_TCGA-XXX_YYY_1_peaks_macs.bed     # BED: called peaks per sample
└── peaks_summary.txt                       # Number of peaks per sample
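
The standard-chromosome filter mentioned above could look like the sketch below. The exact set of chromosomes kept by the pipeline is not documented here, so the set used (chr1 to chr22, chrX, chrY) and the function name are assumptions.

```python
from typing import Iterable, List

# Assumed definition of "standard chromosomes"; adjust if the pipeline also keeps chrM.
STANDARD_CHROMS = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY"}


def filter_standard_chromosomes(bed_lines: Iterable[str]) -> List[str]:
    """Keep only BED lines whose first field is a standard chromosome."""
    kept = []
    for line in bed_lines:
        if not line.strip():
            continue  # skip blank lines
        chrom = line.split(maxsplit=1)[0]
        if chrom in STANDARD_CHROMS:
            kept.append(line)
    return kept
```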

`predicions_cluster/`

Contains a subset of samples: all LIHC samples and one representative from each other cancer type (22 types excluding LIHC).

predicions_cluster/
├── 1LIHC/                                              # Folder for ONE peak file (test)
│   ├── ATAC_TCGA-LIHC_TCGA-BC-A3KF_1.bed              # All peaks, unfiltered by MaxATAC
│   ├── ATAC_TCGA-LIHC_TCGA-BC-A3KF_1_RP20M.bw
│   └── ATAC_TCGA-LIHC_TCGA-BC-A3KF_1_RP20M_peaks.bed
├── ATAC_TCGA-[CANCER]_[SAMPLE]_RP20M.bw                # One per cancer type (22 types, excl. LIHC)
├── ATAC_TCGA-[CANCER]_[SAMPLE]_peaks.bed               # One per cancer type (22 types, excl. LIHC)
└── LIHC/                                               # All LIHC samples
    ├── ATAC_TCGA-LIHC_TCGA-[SAMPLE]_RP20M.bw
    ├── ATAC_TCGA-LIHC_TCGA-[SAMPLE]_RP20M_peaks.bed
    └── logs/                                           # Logs for prediction runs

`predictions_cluster_all/`

Intended to hold the predictions for all TCGA ATAC-seq samples (prediction runs are still in progress). Currently contains all ACC samples, all BLCA samples, and subsets of the BRCA and CESC samples.

predictions_cluster_all/
├── ATAC_TCGA-[CANCER]_[SAMPLE]_RP20M.bw
└── ATAC_TCGA-[CANCER]_[SAMPLE]_peaks.bed

`pca_analysis/`

⚠️ Status: May need verification or cleanup.

Contains results from dimensionality reduction (PCA, t-SNE) and differential accessibility analysis (DESeq2) on the ATAC-seq data.

pca_analysis/
├── filtered_bams/                          # Processed filtered BAMs (NOT ALL FILES)
├── generate_counts_*.log
└── results/
    ├── ATAC_all_regions_count*.txt
    ├── cell_line_counts/
    ├── sample_counts/
    ├── metadata.csv, samples.txt
    ├── DESeq_results_*.csv
    ├── DESeq2_significant_genes.csv
    ├── PCA_plot*.png
    ├── tSNE_plot1.png
    ├── heatmap_DESeq2.png
    ├── volcano_plot_*.pdf
    └── CancerType_Clustering1.png

`QC_results/`

Comprehensive quality control metrics for all samples, divided into three categories:

QC_results/
├── basic/                                  # Basic BAM-level QC
│   ├── SAMPLE_flagstat.txt
│   ├── SAMPLE_idxstats.txt
│   └── summary.txt                         # Aggregated metrics: mapped %, mt %, pairing
├── fragments/                              # Insert size distributions (subsampled reads)
│   └── summary.txt                         # Fragment type proportions (NFR, mono, di, tri)
└── peaks/                                  # Peak-level statistics
    └── counts.txt                          # Total peak counts per sample
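
To make the fragment type proportions in fragments/summary.txt more concrete, here is a minimal sketch that bins insert sizes into NFR, mono-, di-, and tri-nucleosome classes. The size boundaries are commonly used ATAC-seq ranges and are an assumption; the thresholds actually used by the QC pipeline are not recorded in this README.

```python
from collections import Counter
from typing import Dict, Iterable

# Assumed insert-size ranges in bp (inclusive); not taken from the QC scripts.
FRAGMENT_CLASSES = (
    ("NFR", 1, 99),
    ("mono", 180, 247),
    ("di", 315, 473),
    ("tri", 558, 615),
)


def classify_fragment(length: int) -> str:
    """Return the fragment class for one insert size, or 'other' if unassigned."""
    for name, low, high in FRAGMENT_CLASSES:
        if low <= length <= high:
            return name
    return "other"


def fragment_proportions(lengths: Iterable[int]) -> Dict[str, float]:
    """Fraction of fragments falling into each class."""
    counts = Counter(classify_fragment(n) for n in lengths)
    total = sum(counts.values())
    if total == 0:
        return {}
    names = [name for name, _, _ in FRAGMENT_CLASSES] + ["other"]
    return {name: counts[name] / total for name in names}
```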

`cell_line_data/`

Reference data from cancer cell lines corresponding to TCGA tumor types, including both ATAC-seq and ChIP-seq peaks. Data was copied from ENCODE for comparison purposes (no direct Jupyter access to ENCODE).

cell_line_data/
├── atac/                                   # ATAC-seq peaks from cell lines
│   └── ATAC_[CELL_LINE]_*_peaks_macs.bed
└── chip/                                   # ChIP-seq peaks from cell lines
    └── CEBPB_[CELL_LINE]_*_peaks_peakzilla.bed

`processed_files/`

Contains intermediate and final processed files for downstream analyses, including summit locations, fixed-width windows (100bp around summits), and annotation files.

processed_files/
├── summits/                                # Peak summit information
│   ├── *_summit.bed                        # Summits from ATAC-seq, ChIP-seq, predictions
│   └── motif_summits.bed                   # Summit locations for motif hits
└── windows/clean/                          # Fixed-width 100bp windows around summits
    └── *_window.bed
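
A fixed-width window around a summit can be derived as in the sketch below. It assumes the window is centered on the summit (50bp on each side) and clipped to the chromosome bounds; the function name and the clipping behavior are illustrative assumptions.

```python
from typing import Optional, Tuple


def summit_window(
    chrom: str,
    summit: int,
    width: int = 100,
    chrom_size: Optional[int] = None,
) -> Tuple[str, int, int]:
    """Return a (chrom, start, end) window of `width` bp around `summit`."""
    half = width // 2
    start = max(0, summit - half)  # never go below position 0
    end = start + width
    if chrom_size is not None and end > chrom_size:
        # Shift the window left so it stays inside the chromosome.
        end = chrom_size
        start = max(0, end - width)
    return chrom, start, end
```

The chromosome size could be looked up in others/hg38_chrom.sizes.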

`data/`

Reference data files used throughout the pipeline.

data/
├── hg38_maxatac_blacklist.bed              # Genomic regions to exclude (blacklist)
└── CEBPB_filtered_6mer.bed                 # Filtered CEBPB motif locations

`others/`

Miscellaneous files.

others/
└── hg38_chrom.sizes                        # Chromosome sizes for hg38

`subsets/`

Cancer-type-specific subset analyses with visualization outputs.

subsets/
└── {BRCA, LIHC, COAD, LUAD, ...}/
    ├── subset_{cancer_type}.tsv
    └── plots/
        ├── plot1.1_box_atac.png
        ├── plot1.2_access_overlap_bar.png
        ├── plot1.3_scatter_patient_vs_cell.png
        ├── plot2.2_motif_heatmap.png
        ├── plot2.3_motif_access_combo.png
        ├── plot4.1_feature_overlap_combo.png
        └── plot4.3_confidence_dist.png

`annotated_peaks/`

Peak data organized in three different ways:

annotated_peaks/
├── by_cancer_type/         # Subdirectories for 23 TCGA cancer types
│   └── {ACC, BLCA, BRCA, CESC, CHOL, COAD, ESCA, GBM, HNSC, KIRC,
│        KIRP, LGG, LIHC, LUAD, LUSC, MESO, PCPG, PRAD, SKCM,
│        STAD, TGCT, THCA, UCEC}/
├── by_sample/              # Per-sample annotated peak files
│   └── ATAC_TCGA-[CANCER]_TCGA-[PATIENT]_1_peaks_macs_annotated.bed
└── top_genes/              # Top genes per cancer type
    ├── [CANCER_TYPE]_top_100_genes.txt
    └── all_cancer_types_top10_summary.csv

`cancers/`

Peak overlap analysis results.

cancers/
├── peak_overlap_TCGA-[CANCER_TYPE].txt              # Raw peak overlap data
├── peak_overlap_TCGA-[CANCER_TYPE]_percent_table.txt # Percentage tables
├── heatmap_[CANCER_TYPE].png                         # Heatmap visualizations
└── cancers/
    ├── heatmap_results/                              # Heatmaps for all 23 cancer types
    └── heatmapsresults_num/                          # Numerical data for heatmaps
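
How the peak overlap tables were computed is not documented here. As one plausible definition, the sketch below reports the fraction of peaks in one set that overlap at least one peak in another set on the same chromosome; the function name and this definition of overlap are assumptions.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Peak = Tuple[str, int, int]  # (chrom, start, end), half-open BED coordinates


def overlap_fraction(query: List[Peak], reference: Iterable[Peak]) -> float:
    """Fraction of query peaks overlapping any reference peak by at least 1 bp."""
    by_chrom = defaultdict(list)
    for chrom, start, end in reference:
        by_chrom[chrom].append((start, end))
    if not query:
        return 0.0
    hits = sum(
        1
        for chrom, start, end in query
        if any(start < r_end and r_start < end for r_start, r_end in by_chrom.get(chrom, ()))
    )
    return hits / len(query)
```

Multiplying the result by 100 gives a percentage comparable in spirit to the percent tables.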

`SAMPLEFILE`

A tab-delimited file containing metadata per sample, including:

Sample name
Genome reference (e.g., hg38)
Additional required columns for downstream processing

📋 File Formats & Column Descriptions

Patient ATAC-seq Peaks

📂 Location: peak/
📄 Format: ATAC_TCGA-{CANCER_TYPE}_{SAMPLE_ID}_{REPLICATE}_peaks_macs.bed

Column	Description
`chr`	Chromosome
`start`	Start position
`end`	End position
`length`	Peak length
`signal_value`	Signal intensity
`p-value`	Statistical significance

CEBPB Motif Data

📂 Location: data/CEBPB_filtered_6mer.bed

Column	Description
`chr`	Chromosome
`start`	Start position
`end`	End position
`motif_sequence`	Motif sequence
`score`	Motif score
`strand`	Strand (+/-)

Prediction Files

📂 Location:

Full set (in progress): predictions_cluster_all/
Subset (all LIHC samples plus one representative sample per other cancer type): predicions_cluster/

Column	Description
`chr`	Chromosome
`start`	Start position
`end`	End position
`prediction_score`	Predicted binding probability

Cell Line ATAC-seq Peaks

📂 Location: cell_line_data/atac/
📄 Format: ATAC_{CELL_LINE}_{REPLICATE}_peaks_macs.bed

Column	Description
`chr`	Chromosome
`start`	Start position
`end`	End position
`length`	Peak length
`signal_value`	Signal intensity
`p-value`	Statistical significance

Cell Line ChIP-seq Peaks

📂 Location: cell_line_data/chip/
📄 Format: CEBPB_{CELL_LINE}_{REPLICATE}_peaks_peakzilla.bed

Column	Description
`chr`	Chromosome
`start`	Start position
`end`	End position
`summit`	Summit position
`fold_change`	Fold change
`q-value`	Adjusted p-value

100bp Window Files

📂 Location: processed_files/windows/clean/

If 100bp windows around summits are needed, they are available here.

Patient ATAC-seq peaks (windowed):
📄 ATAC_{CANCER_TYPE}_{PATIENT_ID}_{REPLICATE}_peaks_macs_window.bed

chr1    10184   10284   441     199.492   4.68993
chr1    14482   14582   2327    44.2601   2.22135

Column	Description
`chr`	Chromosome
`start`	Start position
`end`	End position
`length`	Peak length
`signal_value`	Signal intensity
`p-value`	Statistical significance
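
For readers loading these files in Python, a minimal parser for the six-column windowed peak format shown above might look like this. The class and function names are hypothetical; the field order follows the column table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowedPeak:
    chrom: str
    start: int
    end: int
    length: int          # length of the original peak, not of the window
    signal_value: float
    p_value: float       # stored as given in the file


def parse_windowed_peak(line: str) -> WindowedPeak:
    """Parse one whitespace-separated line of a *_peaks_macs_window.bed file."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    chrom, start, end, length, signal, p_value = fields
    return WindowedPeak(chrom, int(start), int(end), int(length), float(signal), float(p_value))
```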

Cell line ATAC-seq peaks (windowed):
📄 ATAC_{CELL_LINE}_{REPLICATE}_peaks_macs_window.bed

chr1    181569  181669  444     36.79459  7.72453
chr1    191425  191525  219     8.13932   3.35317

Cell line ChIP-seq peaks (windowed):
📄 CEBPB_{CELL_LINE}_{REPLICATE}_peaks_peakzilla_window.bed

chr1    920269  920369  1.14    6.04
chr1    1000848 1000948 2.26    11.61

Column	Description
`chr`	Chromosome
`start`	Start position
`end`	End position
`summit`	Summit position
`fold_change`	Fold change
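`q-value`	Adjusted p-value

⚠️ Note: the example rows above contain only five fields, so one of the columns listed here (possibly `summit`) appears to be absent from the windowed files. Check a file before relying on column positions.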

Patient predicted ChIP-seq peaks (windowed):
📄 ATAC_{CANCER_TYPE}_{PATIENT_ID}_{REPLICATE}_RP20M_peaks_pred_window.bed

⚠️ Not present for all patients: 78 samples are available.

chr1    10366   10466   0.99420804
chr1    15134   15234   0.5894148

Column	Description
`chr`	Chromosome
`start`	Start position
`end`	End position
`prediction_score`	Predicted binding probability
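
A simple way to keep only confident predicted sites is to threshold on `prediction_score`, as sketched below. The function name is hypothetical and the cutoff is left to the caller, since this README does not specify one.

```python
from typing import Iterable, List, Tuple

Prediction = Tuple[str, int, int, float]  # (chrom, start, end, prediction_score)


def filter_predictions(lines: Iterable[str], min_score: float) -> List[Prediction]:
    """Return predicted windows whose score is at least `min_score`."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        score = float(fields[3])
        if score >= min_score:
            kept.append((fields[0], int(fields[1]), int(fields[2]), score))
    return kept
```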

Motif file (windowed):
📄 motif_summits_window.bed

chr1    19523   19623   ATTGTGAAAT   0.000176   -
chr1    33155   33255   ATTGTGTAAT   7.22e-05   +

Column	Description
`chr`	Chromosome
`start`	Start position
`end`	End position
`motif_sequence`	Motif sequence
`score`	Motif score
`strand`	Strand (+/-)

🧬 Workflow Overview

BAM Files (TCGA)
      │
      ▼
┌─────────────────┐
│  1. BAM QC &    │──→ reads/
│     Indexing     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  2. Signal Track │──→ new_tracks/
│     Generation   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  3. Normalize    │──→ normalized_tracks/
│     (RP20M)      │
└────────┬────────┘
         │
    ┌────┴────┐
    ▼         ▼
┌────────┐ ┌────────────────┐
│ 4. Peak│ │ 5. TF-specific │──→ cluster_normalization/
│ Calling│ │  Normalization  │
└───┬────┘ └───────┬────────┘
    │              │
    ▼              ▼
  peak/    ┌────────────────┐
           │ 6. TF Binding  │──→ predicions_cluster/
           │  Prediction    │    predictions_cluster_all/
           │  (MaxATAC)     │
           └───────┬────────┘
                   │
                   ▼
           ┌────────────────┐
           │ 7. Feature     │──→ processed_files/
           │  Integration   │
           └───────┬────────┘
                   │
                   ▼
           ┌────────────────┐
           │ 8. Downstream  │──→ pca_analysis/
           │  Analysis      │    subsets/
           └────────────────┘

Step Details

1. BAM Acquisition & QC: BAM files are linked from TCGA repositories into reads/, then indexed and quality-checked.
2. Signal Track Generation: bedtools converts each BAM into a genome-wide coverage track, stored in new_tracks/.
3. Normalization: Tracks are scaled to 20M reads (RP20M) and saved in normalized_tracks/.
4. Peak Calling: MACS2 is run on each BAM file; peaks are filtered to standard chromosomes and saved in peak/.
5. TF-specific Normalization: Optional normalization (e.g., for CEBPB) with maxatac normalize, stored in cluster_normalization/.
6. TF Binding Prediction: MaxATAC predicts CEBPB binding in each sample; outputs go to predicions_cluster/ and predictions_cluster_all/.
7. Feature Integration: Peak summits are extracted from the ATAC-seq, ChIP-seq, and prediction files. 100bp windows are created around the summits and merged into a unified region set (processed_files/).
8. Downstream Analysis: PCA, t-SNE, DESeq2, and clustering, with results in pca_analysis/ and subsets/.

QC and Metadata Tracking (applies throughout): Sample info is tracked in SAMPLEFILE. QC summaries are stored in each respective folder, and fragment size distributions are analyzed for NFR and nucleosome patterns.
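
The window merging in the Feature Integration step can be sketched as a standard interval merge. This is an illustration only: the pipeline may use a tool such as bedtools for this, and the choice to merge directly adjacent windows as well as overlapping ones is an assumption.

```python
from typing import Iterable, List, Tuple

Region = Tuple[str, int, int]  # (chrom, start, end), half-open BED coordinates


def merge_intervals(intervals: Iterable[Region]) -> List[Region]:
    """Merge overlapping or directly adjacent intervals per chromosome."""
    merged: List[Region] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous region instead of starting a new one.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged
```

Windows from ATAC-seq, ChIP-seq, and prediction files would be concatenated into one list before calling `merge_intervals`.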

📊 Cancer Type Summary

Total Samples: 410
Based on available reads, peaks, and tracks (predictions still in progress).

Cancer Type	TCGA Code	Samples
Adrenocortical carcinoma	ACC	9
Bladder urothelial carcinoma	BLCA	10
Breast invasive carcinoma	BRCA	75
Cervical squamous cell carcinoma	CESC	4
Cholangiocarcinoma	CHOL	5
Colon adenocarcinoma	COAD	41
Esophageal carcinoma	ESCA	18
Glioblastoma multiforme	GBM	9
Head and neck squamous cell carcinoma	HNSC	9
Kidney renal clear cell carcinoma	KIRC	16
Kidney renal papillary cell carcinoma	KIRP	34
Brain lower grade glioma	LGG	13
Liver hepatocellular carcinoma	LIHC	17
Lung adenocarcinoma	LUAD	22
Lung squamous cell carcinoma	LUSC	16
Mesothelioma	MESO	7
Pheochromocytoma and Paraganglioma	PCPG	9
Prostate adenocarcinoma	PRAD	26
Skin cutaneous melanoma	SKCM	13
Stomach adenocarcinoma	STAD	21
Testicular germ cell tumors	TGCT	9
Thyroid carcinoma	THCA	14
Uterine corpus endometrial carcinoma	UCEC	13
Total		410

🔁 Cancer Type ↔ Cell Line Mapping

Cell line data available on the server for comparison:

Cancer Type	Cell Line	Tissue Type
BRCA	MCF7	Breast Cancer
LIHC	HepG2	Liver Cancer
LUAD / LUSC	A549	Lung Cancer
COAD	HCT116	Colon Cancer
STAD	SNU719	Stomach Cancer
BLCA	T24	Bladder Cancer
GBM	U87	Glioblastoma
PRAD	LNCaP	Prostate Cancer

🧪 Key Analysis Questions

How does CEBPB binding differ across cancer types?
Do cancer cell lines accurately represent patient tissue for CEBPB binding?
Can we identify cancer type-specific CEBPB binding sites?
What is the correlation between CEBPB binding and chromatin accessibility?
How does CEBPB binding relate to known cancer pathways?

🛠️ Tools & Environments

Software

Tool	Purpose
`samtools`	BAM indexing and statistics (made available with `module load`)
`bedtools`	Coverage and read manipulation
`macs2`	Peak calling (pipeline compatible with both versions)
`bedGraphToBigWig` (conda package `ucsc-bedgraphtobigwig`)	BedGraph → BigWig conversion
`bigWigToBedGraph`	BigWig → BedGraph conversion
`maxatac`	TF-specific normalization, predictions, benchmarking
R (`DESeq2`, `ggplot2`)	Statistical analysis and visualization
Python	Scripting and utilities
SLURM	HPC job scheduling

Conda Environments

# MaxATAC environment
source /shared/software/miniconda3/etc/profile.d/conda.sh
conda activate /shared/home/bancquaa/.conda/envs/maxatac

# Python analysis environment
conda activate /data/hichamif/envs/pred_tf_env

🚀 Quick Start

Process a New Sample

1. Link BAM file to reads/ directory:

ln -s /path/to/original/sample.bam reads/ATAC_TCGA-XXX_YYY_1.bam

2. Generate signal track:

bash others/scripts/generate_tracks.sh reads/ATAC_TCGA-XXX_YYY_1.bam

3. Call peaks:

bash others/scripts/call_peaks.sh reads/ATAC_TCGA-XXX_YYY_1.bam

4. Predict TF binding:

# Activate MaxATAC environment
source /shared/software/miniconda3/etc/profile.d/conda.sh
conda activate /shared/home/bancquaa/.conda/envs/maxatac

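# Run prediction (confirm the flags with `maxatac predict --help`;
# `--model` may expect a path to a trained model file rather than a TF name)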
maxatac predict \
  --signal normalized_tracks/ATAC_TCGA-XXX_YYY_1_RP20M.bw \
  --model CEBPB \
  --output predictions_cluster_all/ATAC_TCGA-XXX_YYY_1

5. Add to metadata:

echo -e "ATAC_TCGA-XXX_YYY_1\thg38\tTCGA-XXX\tYYY\t1" >> SAMPLEFILE

💡 For batch processing of multiple samples, use the provided SLURM scripts in the others/scripts/ directory.

📚 References

TCGA Research Network: https://www.cancer.gov/tcga
ENCODE Project: https://www.encodeproject.org/
MaxATAC: Cazares, T. A. et al. PLOS Computational Biology (2023)
MACS2: Zhang, Y. et al. Genome Biology (2008)
CEBPB function: Nerlov, C. Nature Reviews Cancer (2007)

📎 Notes

All files are named in a consistent format: ATAC_TCGA-<CANCER>_<ID>_<REPLICATE> for traceability.
Scripts used for each step are available on request or in the supplementary scripts folder (if included).
This repository is optimized for reproducibility and modular processing.
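
Because of the naming convention noted above, sample names can be split into their parts with a regular expression. The sketch below is a hypothetical helper, not part of the repository; it assumes the cancer code is uppercase letters and the replicate is a trailing integer.

```python
import re
from typing import NamedTuple

# Pattern for ATAC_TCGA-<CANCER>_<ID>_<REPLICATE>
SAMPLE_PATTERN = re.compile(
    r"^ATAC_TCGA-(?P<cancer>[A-Z]+)_(?P<sample_id>.+)_(?P<replicate>\d+)$"
)


class SampleName(NamedTuple):
    cancer: str
    sample_id: str
    replicate: int


def parse_sample_name(name: str) -> SampleName:
    """Split a sample name into cancer type, sample ID, and replicate number."""
    match = SAMPLE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a valid sample name: {name!r}")
    return SampleName(
        cancer=match.group("cancer"),
        sample_id=match.group("sample_id"),
        replicate=int(match.group("replicate")),
    )
```

For example, `parse_sample_name("ATAC_TCGA-LIHC_TCGA-BC-A3KF_1")` yields the cancer type `LIHC`, the sample ID `TCGA-BC-A3KF`, and replicate `1`.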

Added details for me

🧬 1. ATAC-seq Data

📁 /data/hichamif/pred_tf_cancer/peak/

This directory contains the ATAC-seq peak files. Each BED file lists the open (accessible) regions in the genome of one sample.

Useful for:

Locating active regulatory regions.
Intersecting these regions with the CEBPB predictions and motifs.

🔮 2. CEBPB Predictions (from ATAC-seq)

📁 /data/hichamif/pred_tf_cancer/predictions_cluster_all/
📁 /data/hichamif/pred_tf_cancer/predicions_cluster/

predictions_cluster_all/ → contains the MaxATAC predictions for all samples of each cancer type.
predicions_cluster/ → contains one prediction per representative sample of each cancer type.

💡 Each file lists regions (BED format) predicted to be CEBPB binding sites.

🧬 3. CEBPB Motifs

📁 /data/hichamif/pred_tf_cancer/data/CEBPB_filtered_6mer.bed

BED file of the identified CEBPB motifs (probably obtained via FIMO, MOODS, or a database such as HOCOMOCO or JASPAR).

Useful for:

Checking whether the predicted ChIP/ATAC peaks contain a canonical CEBPB motif.
Adding a validation layer (filtering by motif presence).

🧪 4. Per-window Prediction Files (clean)

📂 /data/hichamif/pred_tf_cancer/processed_files/windows/clean/

These files contain the genomic windows used for prediction, after filtering (quality, scores, size, etc.).

Useful for:

Linking predictions to precise genomic regions.
Enabling a clean intersection between predictions, motifs, and ATAC/ChIP peaks.
