Solutions and final project for the APBI course at the University of Vienna (SS17).
This repository contains exercises from the course 2017S 300353-1 Applied Programming for Bioinformatics, a practical introduction to programming for solving biological problems with a focus on text processing and automation of external tools. It also contains the final project: an orthologous group analysis using the eggNOG v4 database.
Note: This repo was updated in 2026 with Hermes Agent to test AI-assisted repository maintenance. The original content and structure have been preserved.
AppliedProgrammingBioinfo/
├── weekly_exercises/
│ ├── 01_Basics/ Perl — basic syntax, file I/O, sequence handling
│ ├── 02_Subroutines/ Perl — string manipulation, pattern matching
│ ├── 03_FileIO/ Perl — subroutines, arrays, hash tables
│ ├── 04_CommandLine/ Perl — command-line arguments
│ ├── 05_Modules/ Perl — restriction enzyme analysis (Pattern.pm, Restrictions.pm)
│ ├── 06_Databases/ Perl — SQL database queries via DBI
│ ├── 07_Python_Basics/ Python — sequence analysis, file parsing
│ ├── 08_Python_Patterns/ Python — pattern search, restriction site analysis
│ └── 09_Bash/ Bash — shell scripting, automation
├── final_project/ ← Final project (eggNOG orthology analysis)
│ ├── orthologe_comparison.pl
│ ├── taxonomic_distribution.pl
│ ├── rodent_gene_isolation.pl
│ ├── runscript.sh
│ ├── lib/
│ │ └── FinalProject.pm ← Shared Perl module (taxon lookup, counting, descriptions)
│ └── data/
│ ├── eggnog4.functional_categories.txt
│ └── eggnog4.species_list.txt
├── python_rewrite/ ← Partial Python port of the Perl exercises (unfinished)
│ ├── Exercise01_01.py .. 03_04.py
│ └── ABANDONED.md
├── example_data/ ← Shared datasets used by multiple exercises
│ ├── gene_positions1.txt
│ ├── gene_positions2.txt
│ ├── names.txt
│ ├── sequence.fasta
│ └── swissprot_list.txt
├── perl_cheat_sheet.txt ← Common Perl pitfalls reference
├── README.md
├── LICENSE
└── .gitignore
Nine weekly assignments progressing from Perl basics through Python to Bash scripting.
| Exercise | Language | Topic |
|---|---|---|
| 01 | Perl | Basic syntax, file I/O, sequence handling |
| 02 | Perl | String manipulation, pattern matching |
| 03 | Perl | Subroutines, arrays, hash tables |
| 04 | Perl | Regular expressions (contributed by a friend — not uploaded) |
| 05 | Perl | Restriction enzyme analysis |
| 06 | Perl | SQL database queries |
| 07 | Python | Sequence analysis, file parsing |
| 08 | Python | Pattern search, restriction site analysis |
| 09 | Bash | Shell scripting, automation |
The final_project/ directory contains the capstone project for the course (originally in a separate repository). It analyses orthologous groups from the eggNOG v4 database across multiple species.
The project consists of three Perl scripts orchestrated by a Bash wrapper:
- orthologe_comparison.pl: given two species, counts the orthologous groups they share and the genes from species 1 that have a homolog in species 2
- taxonomic_distribution.pl: given three species, identifies genes from species 1 that have a homolog in species 3 but not in species 2 (lineage-specific gene loss)
- rodent_gene_isolation.pl: given one species, identifies orthologous groups that contain only genes from that species (species-specific orthogroups)
- runscript.sh: interactive orchestrator that chains all three scripts, prompts for species names, and warns about output file conflicts
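The three analyses above reduce to set operations on group membership. The following is a minimal Python sketch of that idea, not the logic of the Perl scripts themselves. It assumes membership is already available as a mapping from group ID to the set of NCBI taxon IDs present in the group; the group IDs (`OG_A` and so on) and the function names are hypothetical. It works at the level of groups, whereas the scripts also report individual genes.

```python
# Hypothetical toy data: orthologous group ID -> taxon IDs with a member in it.
# 9606 = Homo sapiens, 10090 = Mus musculus, 9598 = Pan troglodytes.
groups = {
    "OG_A": {9606, 10090, 9598},
    "OG_B": {9606, 9598},
    "OG_C": {10090},
    "OG_D": {9606, 10090},
}


def shared_groups(groups, taxon_1, taxon_2):
    """Groups that contain members from both taxa."""
    return sorted(
        group_id
        for group_id, taxa in groups.items()
        if taxon_1 in taxa and taxon_2 in taxa
    )


def lost_in_second(groups, taxon_1, taxon_2, taxon_3):
    """Groups with members from taxon 1 and taxon 3 but none from taxon 2."""
    return sorted(
        group_id
        for group_id, taxa in groups.items()
        if taxon_1 in taxa and taxon_3 in taxa and taxon_2 not in taxa
    )


def species_specific(groups, taxon):
    """Groups whose members all come from a single taxon."""
    return sorted(group_id for group_id, taxa in groups.items() if taxa == {taxon})


print(shared_groups(groups, 9606, 10090))
print(lost_in_second(groups, 9606, 10090, 9598))
print(species_specific(groups, 10090))
```

With the toy data, the first call covers the human/mouse comparison, the second finds groups present in human and chimpanzee but absent in mouse, and the third finds mouse-only groups.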
lib/FinalProject.pm provides four subroutines used across all scripts:
- get_taxon_ID(): maps a species name to its NCBI taxon ID
- gene_count_members(): counts gene occurrences for a taxon across orthologous groups
- get_protein_ID(): extracts protein IDs from eggNOG members files
- get_desc(): cross-references COG functional category codes (e.g. J → "Translation") with eggNOG descriptions
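Two of these lookups can be sketched in Python as follows. This is an illustration only, not a port of FinalProject.pm. It assumes the species list is tab-separated with the species name in the first column and the taxon ID in the second; that layout, the sample rows, and the names `load_taxon_ids` and `describe_categories` are assumptions. The three category descriptions are standard COG categories, and a group may carry more than one category letter.

```python
import csv
import io

# A small excerpt of the COG functional categories (one letter per category).
COG_CATEGORIES = {
    "J": "Translation, ribosomal structure and biogenesis",
    "K": "Transcription",
    "L": "Replication, recombination and repair",
}


def load_taxon_ids(handle):
    """Map species name -> taxon ID from a tab-separated species list.

    Assumed layout: name in column 1, taxon ID in column 2, '#' starts a comment.
    """
    taxon_ids = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank, malformed, and comment lines
        taxon_ids[row[0].strip()] = row[1].strip()
    return taxon_ids


def describe_categories(codes):
    """Expand a string of category letters (e.g. 'JK') into descriptions."""
    return [COG_CATEGORIES.get(code, "Unknown category") for code in codes]


# Hypothetical sample standing in for eggnog4.species_list.txt.
sample = io.StringIO(
    "#species\ttaxid\n"
    "Homo sapiens\t9606\n"
    "Mus musculus\t10090\n"
)
taxon_ids = load_taxon_ids(sample)
print(taxon_ids["Mus musculus"])
print(describe_categories("JK"))
```

Loading the whole species list into a dictionary once keeps each later lookup constant-time, which matters when the same species name is resolved by several scripts.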
The pipeline uses eggNOG v4 data files:
- meNOG.members.tsv: the orthologous group membership file (not shipped due to size; must be downloaded from eggNOG)
- eggnog4.functional_categories.txt: COG functional category descriptions
- eggnog4.species_list.txt: species-to-taxon-ID mapping
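Because the members file is large, it is best read line by line rather than loaded whole. The Python sketch below shows one way to turn it into a mapping from group ID to taxon IDs. The column layout is an assumption that should be checked against the downloaded file: six tab-separated columns, with the group ID in column 2 and a comma-separated list of `taxonID.proteinID` entries in column 6. The sample rows and the name `load_group_taxa` are hypothetical.

```python
import io
from collections import defaultdict


def load_group_taxa(handle):
    """Stream a members file and collect the taxon IDs seen in each group.

    Assumed layout per line (tab-separated):
    level, group ID, protein count, species count, category, member list.
    """
    group_taxa = defaultdict(set)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        group_id, members = fields[1], fields[5]
        for member in members.split(","):
            # Each member is assumed to look like "9606.P1": taxon ID, dot, protein ID.
            taxon_id, _, protein_id = member.partition(".")
            if protein_id:
                group_taxa[group_id].add(taxon_id)
    return dict(group_taxa)


# Hypothetical sample standing in for meNOG.members.tsv.
sample = io.StringIO(
    "meNOG\tOG_A\t3\t2\tJ\t9606.P1,9606.P2,10090.P3\n"
    "meNOG\tOG_B\t1\t1\tK\t10090.P4\n"
)
group_taxa = load_group_taxa(sample)
print(sorted(group_taxa["OG_A"]))
```

In real use, the `io.StringIO` sample would be replaced by an open file handle on the downloaded members file.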
cd final_project
# Interactive mode (prompts for species names)
bash runscript.sh
# Direct mode (pass species as command-line arguments)
perl orthologe_comparison.pl "Homo sapiens" "Mus musculus"
perl taxonomic_distribution.pl "Homo sapiens" "Mus musculus" "Pan troglodytes"
perl rodent_gene_isolation.pl "Mus musculus"

- The first six exercises are written in Perl (developed with Eclipse + EPIC plugin on Windows/Linux).
- Exercises 7–8 are written in Python (developed with PyCharm).
- Exercise 09 covers Bash scripting (developed on Fedora/KDE and Windows Subsystem for Linux).
- Exercise 04 (regex) was contributed by a friend and is not included here.
- A cheat sheet with common Perl pitfalls is included as perl_cheat_sheet.txt.
This project is licensed under the MIT License — see the LICENSE file for details.