This repository provides a Python-based solution for generating high-quality volcano plots from differential expression data. Volcano plots are widely used in bioinformatics to identify genes that are significantly differentially expressed between two conditions by combining fold change with statistical significance (adjusted p-value).
- CLI Support: Run the script from the command line with custom arguments.
- Dynamic Thresholds: Set your own Adjusted P-value and Log2 Fold Change cutoffs.
- Automated Labeling: Automatically labels the top N most significant genes.
- Publication Ready: Exports high-resolution PNGs (300 DPI).
- Automated Plot Generation: Generates publication-ready volcano plots from structured input data.
- Customizable Gene Labeling: Supports highlighting and labeling specific genes based on user-defined criteria for fold-change and statistical significance (adjusted p-value).
- Dependency Management: Utilizes a requirements.txt file for straightforward environment setup.
- Jupyter Notebook Integration: Includes an accompanying Jupyter notebook (high_quality_volcano_plots.ipynb) for interactive data exploration and plot customization.
To set up the project environment, ensure you have Python 3.8+ installed. It is recommended to use a virtual environment.
- Clone the repository:

      git clone https://github.com/YOUR_USERNAME/volcano_plot_py.git
      cd volcano_plot_py

  (Note: Replace YOUR_USERNAME with your actual GitHub username and adjust the repository name if different.)

- Create and activate a virtual environment:

      python3 -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install dependencies:

      pip install -r requirements.txt

The primary script is make_volcano_plot.py. You can run it with default settings or specify your own parameters.
- Prepare your data: Ensure your differential expression data is in a CSV file named differential_expression.csv. You can use the provided data_template.csv as a starting point. The file should contain the following columns:

  | Column Header | Description | Required? |
  |---|---|---|
  | "" (Index) | Unique gene identifier (e.g., Ensembl ID). First column. | Yes |
  | baseMean | Mean expression level. Used to determine the size of the points. | Yes |
  | log2FoldChange | Log2 fold change between conditions. Mapped to the X-axis. | Yes |
  | padj | Adjusted p-value. Transformed to -log10(padj) for the Y-axis. | Yes |
  | symbol | Gene symbol or name. Used for labeling and identification. | Yes |
  | lfcSE | Log fold change standard error. | Optional |
  | stat | Wald statistic. | Optional |
  | pvalue | Raw p-value. | Optional |

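Before running the script, it can help to check that a file matches the column layout above. The following is a minimal sketch using only the standard library, not the code in make_volcano_plot.py. The function name `load_rows`, the sample rows, and the choice to skip rows whose padj is empty or `NA` are all assumptions made for illustration.

```python
import csv
import io
import math

# Required headers taken from the column table; "" is the unnamed index column.
REQUIRED_COLUMNS = ["", "baseMean", "log2FoldChange", "padj", "symbol"]


def load_rows(handle):
    """Read a differential-expression CSV and add -log10(padj) to each row."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    rows = []
    for row in reader:
        # Assumption: rows without an adjusted p-value are skipped.
        if row["padj"] in ("", "NA"):
            continue
        # Note: a padj of exactly 0 would need special handling (log10 is undefined).
        row["neg_log10_padj"] = -math.log10(float(row["padj"]))
        rows.append(row)
    return rows


# Hypothetical sample data, for illustration only.
SAMPLE = '''"",baseMean,log2FoldChange,padj,symbol
ID_1,120.5,2.5,0.001,GENE_A
ID_2,80.0,-0.3,0.5,GENE_B
ID_3,15.2,1.1,NA,GENE_C
'''

rows = load_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row["symbol"], round(row["neg_log10_padj"], 3))
```

To check a real file, pass an open file handle instead of the `io.StringIO` sample, for example `load_rows(open("differential_expression.csv", newline=""))`.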
- Run the script:

  Default Run

  Assumes the input is differential_expression.csv and uses the defaults: Adjusted P-value < 0.05, absolute Log2 Fold Change > 1.

      python make_volcano_plot.py

  Custom Run

  Example: using a specific file, stricter thresholds, and labeling the top 20 genes.

      python make_volcano_plot.py --input my_results.csv --output my_plot.png --pval 0.01 --lfc 2.0 --top_n 20

Arguments for Custom Run 😘

| Flag | Description | Default |
|---|---|---|
| -i, --input | Path to input CSV file | differential_expression.csv |
| -o, --output | Path to save the PNG image | volcano.png |
| --pval | Adjusted P-value cutoff for significance | 0.05 |
| --lfc | Log2 Fold Change cutoff (absolute) | 1.0 |
| --top_n | Number of top significant genes to label | 10 |

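For readers who want to see how these flags and defaults fit together, here is a minimal `argparse` sketch that mirrors the table. It is an illustration, not the parser in make_volcano_plot.py: the function name `build_parser` and the help strings are assumptions, and only the flag names and default values come from the table.

```python
import argparse


def build_parser():
    """Build a parser whose flags and defaults mirror the arguments table."""
    parser = argparse.ArgumentParser(
        description="Generate a volcano plot from differential expression data."
    )
    parser.add_argument("-i", "--input", default="differential_expression.csv",
                        help="Path to input CSV file")
    parser.add_argument("-o", "--output", default="volcano.png",
                        help="Path to save the PNG image")
    parser.add_argument("--pval", type=float, default=0.05,
                        help="Adjusted P-value cutoff for significance")
    parser.add_argument("--lfc", type=float, default=1.0,
                        help="Log2 Fold Change cutoff (absolute)")
    parser.add_argument("--top_n", type=int, default=10,
                        help="Number of top significant genes to label")
    return parser


# Parse an explicit argument list so the example runs without a command line.
args = build_parser().parse_args(
    ["--lfc", "2.0", "--pval", "0.01", "--output", "strict_volcano.png"]
)
print(args.input, args.output, args.pval, args.lfc, args.top_n)
```

Flags that are not passed keep their defaults, so in this example the input path and the number of labeled genes stay at differential_expression.csv and 10.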
Another custom run example (favorite features 😃):

    python make_volcano_plot.py --lfc 2.0 --pval 0.01 --output strict_volcano.png

- python make_volcano_plot.py: Runs the script. Because no input file is specified with -i, it looks for the default file differential_expression.csv.
- --lfc 2.0: Sets the Log2 Fold Change cutoff to 2.0.
  - Comment: This is stricter than the default (1.0). A gene must show at least a 4-fold change ($2^2 = 4$) in expression, either up or down, to be considered biologically significant.
- --pval 0.01: Sets the Adjusted P-value cutoff to 0.01.
  - Comment: This is stricter than the default (0.05). It allows an estimated False Discovery Rate (FDR) of only 1% among the genes you highlight.
- --output strict_volcano.png: Saves the resulting image as strict_volcano.png.
  - Comment: This is useful so you don't overwrite your previous volcano.png.

Comparison with default run

Compared to the default run, this plot will show fewer significant genes (fewer red/blue dots) because the criteria to be colored are much harder to meet. This is useful when you have too many "significant" genes and want to narrow focus to only the strongest candidates.

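The effect of the stricter cutoffs can be sketched with a few made-up genes. This is an illustration under stated assumptions, not the script's own logic: the gene values are hypothetical, the helper names `classify` and `count_hits` are invented here, and the use of strict inequalities (padj below the cutoff, absolute log2 fold change above it) is assumed from the defaults described earlier.

```python
# Illustrative values only: (symbol, log2FoldChange, padj). Not real results.
GENES = [
    ("GENE_A", 2.5, 0.001),
    ("GENE_B", -1.4, 0.02),
    ("GENE_C", 1.2, 0.04),
    ("GENE_D", -3.0, 0.005),
    ("GENE_E", 0.4, 0.0001),
]


def classify(log2fc, padj, lfc_cutoff, pval_cutoff):
    """Return 'up', 'down', or 'ns' (not significant) for one gene."""
    # Assumption: both comparisons are strict inequalities.
    if padj < pval_cutoff and abs(log2fc) > lfc_cutoff:
        return "up" if log2fc > 0 else "down"
    return "ns"


def count_hits(genes, lfc_cutoff, pval_cutoff):
    """Count genes that pass both cutoffs."""
    return sum(
        classify(fc, p, lfc_cutoff, pval_cutoff) != "ns" for _, fc, p in genes
    )


default_hits = count_hits(GENES, lfc_cutoff=1.0, pval_cutoff=0.05)
strict_hits = count_hits(GENES, lfc_cutoff=2.0, pval_cutoff=0.01)
min_fold_change = 2 ** 2.0  # a log2 cutoff of 2.0 corresponds to a 4-fold change

print(default_hits, strict_hits, min_fold_change)  # prints: 4 2 4.0
```

With these sample values, four genes pass the default cutoffs and only two pass the strict ones. GENE_E has a very small adjusted p-value but fails both runs because its fold change is too small.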
- Jupyter Notebook: For interactive analysis and further customization, open the provided Jupyter notebook:

      jupyter notebook high_quality_volcano_plots.ipynb

By default, make_volcano_plot.py writes a PNG image named volcano.png that visualizes the differential expression analysis. Use --output to save it under a different name.
This project is inspired by and derived from the excellent work of Mark (mousepixels). Special thanks for his valuable contributions to the bioinformatics community.
- Original Notebook: high_quality_volcano_plots.ipynb on GitHub
- Tutorial Video: Volcano Plot in Python Tutorial by mousepixels
The project relies on the following Python libraries:
- pandas
- seaborn
- matplotlib
- numpy
- adjustText

These are listed in requirements.txt and will be installed during the setup process.
