Bierinformatik/lncRNA_host_gene_classification

Initial steps

This README explains how to use the included Jupyter notebook and command-line scripts to reproduce the findings of our study on identifying lncRNA subtypes.

The folder feature extraction scripts contains Python scripts used to process the datasets, i.e. for data extraction and feature engineering. They are provided as an overview of the elaborate steps involved in data preprocessing.

The folder machine learning steps contains Python scripts that can be used to reproduce the training steps and generate the performance metrics in a terminal, should you choose not to use the Jupyter environment. See Usage for more information.

Prerequisites are Python >= 3.6, Jupyter, and Anaconda.

To make life easier, all the necessary Python packages are bundled in a conda environment file, provided here.

To create the provided conda environment, use

conda env create -f lncRNA_classify_environment.yaml

Afterwards, activate it with conda activate followed by the environment name defined in the YAML file.

More information regarding managing environments can be found here.

If you would rather build everything from scratch, we recommend not skipping the next part; otherwise, please head over to Usage.

To install Anaconda, please follow the instructions on the website.

To install Jupyter, type:

pip install notebook

or use conda:

conda install -c conda-forge notebook

For more choices, please follow the instructions here.

The Python packages required to successfully execute the training are listed in lncRNA_classify_environment.yaml.

Usage

Jupyter environment

To open the notebook, navigate to the notebook's folder and execute

jupyter notebook

and choose lncRNA_classify.ipynb from the tree menu.

The index of each training_* folder stands for a dataset, e.g.

training_1 stands for dataset 1

Each training_* directory contains subdirectories labelled fs_*, where each index stands for a feature set, e.g.

fs_1 stands for feature set 1

Every fs_* directory contains two subdirectories, w_fickett and wo_fickett, which stand for with and without the Fickett score, respectively. These two directories contain the files necessary for executing the code.

Every fs_1 folder contains a database of the k-mer weights transformed using a multi-label binarizer.
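As a rough illustration of what a multi-label binarization of k-mer data looks like, here is a minimal sketch using scikit-learn's MultiLabelBinarizer. The k-mer names and transcript sets below are placeholders, not taken from the provided database:

```python
# Hypothetical sketch of multi-label binarization of k-mer sets,
# using scikit-learn's MultiLabelBinarizer. The k-mers here are
# illustrative only, not the ones stored in the fs_1 databases.
from sklearn.preprocessing import MultiLabelBinarizer

# Each transcript is represented by the set of k-mers it contains.
transcripts = [
    {"AAA", "AAC"},
    {"AAC", "ACG", "CGT"},
]

mlb = MultiLabelBinarizer()
binary_matrix = mlb.fit_transform(transcripts)

print(mlb.classes_)   # sorted k-mer vocabulary: ['AAA' 'AAC' 'ACG' 'CGT']
print(binary_matrix)  # one row per transcript, one column per k-mer
```

The result is a binary matrix with one column per k-mer, which the classifier can consume directly.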

The whole notebook is divided into four parts for the four datasets and each part is further divided into a supervised module and an unsupervised module.

The code to train the classifier using feature set 1 with Fickett score is already provided.

Only the directories fs_[1-3] need to be changed to proceed with training using the features containing structural information and conservation scores. Inclusion of the Fickett score is controlled by choosing the appropriate folder for each feature set.
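The directory layout described above can be summarized with a small path helper. This is a hypothetical illustration of the structure only; the notebook itself selects these folders directly:

```python
# Hypothetical helper illustrating the training_*/fs_*/(w|wo)_fickett
# layout described above; not part of the provided scripts.
import os

def feature_dir(training_set: int, feature_set: int, fickett: bool) -> str:
    """Build the path training_<t>/fs_<fs>/{w,wo}_fickett."""
    fickett_dir = "w_fickett" if fickett else "wo_fickett"
    return os.path.join(f"training_{training_set}",
                        f"fs_{feature_set}", fickett_dir)

print(feature_dir(1, 2, True))   # dataset 1, feature set 2, with Fickett
```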

Command line

Two Python scripts are included in the machine learning steps folder, namely supervised.py and unsupervised.py, to enable command-line usage.

supervised.py trains a random forest classifier on the training sets as described above.
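The core of such a supervised step can be sketched with scikit-learn's RandomForestClassifier. The synthetic data below is a stand-in; supervised.py loads the real feature tables from the training_*/fs_* folders, and its actual hyperparameters may differ:

```python
# Minimal sketch of a random-forest training step with performance
# metrics. The data here is synthetic; supervised.py uses the real
# feature tables, and its settings may differ from this illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Placeholder standing in for one training_*/fs_* feature table.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Precision, recall, and F1 per class on the held-out split.
print(classification_report(y_test, clf.predict(X_test)))
```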

unsupervised.py performs PCA and k-means clustering on the training sets.
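A bare-bones version of this unsupervised step, again with placeholder data and an assumed cluster count rather than the study's actual settings, might look like:

```python
# Minimal sketch of PCA followed by k-means clustering, with the
# plot saved to the working directory. Data, component count, and
# cluster count are placeholders, not the study's settings.
import matplotlib
matplotlib.use("Agg")  # render to file without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=150, n_features=6, centers=3, random_state=0)

pcs = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pcs)

plt.scatter(pcs[:, 0], pcs[:, 1], c=labels)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.savefig("kmeans_pca.png")  # written to the working directory
```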

Each script can be run with the following switches:

-t, -training_set choose the training set for the classifier: [1,2,3,4]. The default value is 1.

-fs, -feature_set choose the feature set: [1,2,3]. The default value is 1.

-fi, -fickett_score enter 1 to include the Fickett score, 0 to exclude it. Disabled (0) by default.
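For reference, the switches documented above could be parsed with argparse roughly as follows. This is a hedged sketch of the interface only; the actual implementation in supervised.py and unsupervised.py may differ:

```python
# Hypothetical argparse setup matching the documented switches;
# the real scripts' implementation may differ from this sketch.
import argparse

parser = argparse.ArgumentParser(description="lncRNA subtype training")
parser.add_argument("-t", "-training_set", dest="training_set",
                    type=int, default=1, choices=[1, 2, 3, 4],
                    help="training set for the classifier")
parser.add_argument("-fs", "-feature_set", dest="feature_set",
                    type=int, default=1, choices=[1, 2, 3],
                    help="feature set")
parser.add_argument("-fi", "-fickett_score", dest="fickett_score",
                    type=int, default=0, choices=[0, 1],
                    help="1 to include the Fickett score, 0 to exclude")

# Example invocation: training set 2, default feature set, with Fickett.
args = parser.parse_args(["-t", "2", "-fi", "1"])
print(args.training_set, args.feature_set, args.fickett_score)  # 2 1 1
```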

Help with the switches can be accessed by:

supervised.py -h

OR

unsupervised.py -h

If no parameters are specified, both scripts run with the default parameters.

unsupervised.py will save the plots generated for PCA and k-means clustering in the working directory.
