A resource for malaria gold-standard validation data

Introduction

The aim of this resource is to provide datasets that can be used to validate bioinformatics pipelines that use genomic data to infer the drug resistance status of Plasmodium falciparum samples. Validation and verification of pipelines are key activities in ISO accreditation.

Ideally, the basis for such a dataset would be a set of samples that (a) have been sequenced, and (b) have a drug resistance status (i.e. sensitive or resistant) that has been confirmed in a clinical setting. However, as no such datasets are freely available for malaria, we take two different approaches to fill this gap. The first is based on real samples for which the genomic data and the prediction of the resistance status have been obtained by two independent means, with concordance used as a marker of quality and confidence. The second is based on custom-designed synthetic read sets, for which the "correct answer" is known by construction.

Approach 1: Real-world data

The data in this approach was obtained from two high-profile public malaria genomics resources:

Pf8
GenRe Mekong

A large number of samples have been sequenced and analysed by both projects, under a common sample ID. This provides an opportunity to create high-quality datasets where two different analysis methodologies based on different sequencing technologies arrive at the same conclusions.
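The concordance idea can be sketched with pandas as follows. This is a minimal illustration, not the code from the accompanying notebook: the sample IDs and the column names `phenotype_a` and `phenotype_b` are made up, and only the notion of a shared sample ID comes from the description above.

```python
import pandas as pd

# Hypothetical calls from two independent projects, keyed by a shared sample ID.
project_a = pd.DataFrame({
    "sample": ["S1", "S2", "S3"],
    "phenotype_a": ["resistant", "sensitive", "resistant"],
})
project_b = pd.DataFrame({
    "sample": ["S1", "S2", "S4"],
    "phenotype_b": ["resistant", "resistant", "sensitive"],
})

# Keep only samples that were analysed by both projects.
shared = project_a.merge(project_b, on="sample", how="inner")

# Keep only samples where both projects reached the same conclusion.
concordant = shared[shared["phenotype_a"] == shared["phenotype_b"]]
concordant_ids = concordant["sample"].tolist()
print(concordant_ids)
```

Samples seen by only one project drop out at the merge step, and samples with conflicting calls drop out at the comparison step.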

Assumptions

Analysis pipelines differ in scope and methodology. Thus, no single dataset will be applicable to every possible genomics pipeline. For the selection of real-world public datasets, we are assuming that the pipeline to be validated produces at least some of the following:

genotype (haplotype) calls at known drug-resistance loci
high-level drug-resistance phenotype calls

Furthermore, we assume that the pipeline can work with P. falciparum data.

For pipelines that are using amplicon sequencing (AmpSeq) data, we assume that the pipeline can work with the SpotMalaria panel. For details about this panel, consult the SpotMalaria technical manual and this SpotMalaria supplementary data file, which provides details for primers used in the panel.

For pipelines that work with AmpSeq data for a specific primer panel that is not SpotMalaria, please take a look at the section on simulated data, which shows how to create simulated runs for cases where real-world data may not exist.

Datasets provided

Samples where all inferred drug resistance phenotypes that have been tested in both projects are identical: Pf8-GenReMekong_concordant_phenotypes.csv
Samples where all genotypes at loci known to be relevant to drug resistance are identical: Pf8-GenReMekong_concordant_genotypes.csv
A subset of samples, each representing one distinct pattern of drug resistance haplotypes: Pf8-GenReMekong_concordant_genotypes_representative_samples.csv
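One way to pick a single representative per distinct haplotype pattern is sketched below. The locus column names and haplotype values are illustrative assumptions, and the notebook may select its representatives differently.

```python
import pandas as pd

# Hypothetical concordant genotype table; column names and values are illustrative.
genotypes = pd.DataFrame({
    "sample": ["S1", "S2", "S3", "S4"],
    "locus_1": ["hapA", "hapB", "hapA", "hapB"],
    "locus_2": ["hapX", "hapY", "hapX", "hapZ"],
})
locus_columns = ["locus_1", "locus_2"]

# Keep the first sample seen for each distinct combination of haplotypes.
representatives = genotypes.drop_duplicates(subset=locus_columns, keep="first")
print(representatives["sample"].tolist())
```

Here S3 shares its pattern with S1 and is dropped, so three samples remain to cover three distinct patterns.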

Full details on how these datasets were created from public sources are provided in the form of an executable Jupyter notebook, pf8genre.ipynb. Python and the pandas library are the only dependencies that need to be installed to run the notebook. Alternatively, consider using a public service for running Jupyter notebooks, such as Google Colab.

For information about data access, please visit https://malariagen.github.io/parasite-data/pf8/Data_access.html

As detailed in the Jupyter notebook, additional files are available in folder additional_output_files. This includes versions of the above files with ENA FTP download links as well as ready-made download manifest data that can be used with a downloader tool provided here to retrieve the read FASTQ files for the samples in these datasets.
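ENA reports the FASTQ locations for a run as a single semicolon-separated string of paths without a URL scheme (its `fastq_ftp` field). Assuming the link columns in these files follow that convention, which should be checked against the actual files, a value could be split into per-file URLs roughly as follows. The helper name `split_fastq_links` and the accession in the example are placeholders, not part of the provided tooling.

```python
def split_fastq_links(field: str, scheme: str = "ftp") -> list[str]:
    """Split a semicolon-separated list of ENA-style FASTQ paths into full URLs."""
    urls = []
    for path in field.split(";"):
        path = path.strip()
        if not path:
            continue  # skip empty entries, e.g. from a trailing semicolon
        # Add a scheme only if the path does not already carry one.
        urls.append(path if "://" in path else f"{scheme}://{path}")
    return urls

# Placeholder accession, used only to show the shape of the input.
example = (
    "ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000000/ERR000000_1.fastq.gz;"
    "ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000000/ERR000000_2.fastq.gz"
)
urls = split_fastq_links(example)
for url in urls:
    print(url)
```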

How to use the datasets

The three datasets are provided as comma-separated tables. Most data fields are either directly taken from public data or are calculated from the public data fields as detailed in the accompanying Jupyter notebook. All of the changes to the data columns are limited to renaming columns or extracting values from columns, unchanged, into new columns to enable comparisons between Pf8 and GenRe Mekong data. Data dictionaries that describe the original public data fields are identified and linked to in the Jupyter notebook.

Each dataset has a column called 'sample', which contains the sample ID that is used in both projects, Pf8 and GenRe Mekong. This is the "primary key" of the data and it can be used to obtain raw sequencing data from public archives. Please note that the field was uploaded as "sample title" to ENA.
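As a rough sketch of how a dataset might be used for validation, the snippet below joins a gold-standard table to a pipeline's output on `sample` and reports the fraction of matching calls. The inline tables stand in for a real CSV file and for the pipeline output; apart from `sample`, the column names are hypothetical.

```python
import io
import pandas as pd

# Stand-in for one of the provided CSV files; only 'sample' is a real column name.
truth_csv = io.StringIO(
    "sample,drug_x\n"
    "S1,resistant\n"
    "S2,sensitive\n"
    "S3,resistant\n"
)
truth = pd.read_csv(truth_csv).set_index("sample")

# Stand-in for the calls produced by the pipeline under validation.
calls = pd.DataFrame(
    {"sample": ["S1", "S2", "S3"], "drug_x": ["resistant", "resistant", "resistant"]}
).set_index("sample")

# Align both tables on the sample ID and compare call by call.
joined = truth.join(calls, how="inner", lsuffix="_expected", rsuffix="_observed")
matches = joined["drug_x_expected"] == joined["drug_x_observed"]

agreement = matches.mean()  # fraction of concordant calls
discordant = joined.index[~matches].tolist()
print(f"agreement: {agreement:.2f}, discordant samples: {discordant}")
```

Listing the discordant sample IDs, rather than only the summary figure, makes it easier to trace each disagreement back to the raw reads.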

The sample ID can also be used to add more of the original metadata to the datasets, if required. For details on how to do this, consult the data analysis guides for Pf8. Note that the sample ID column title in the original Pf8 dataset is spelled with an upper case 'S', whereas the same data column is spelled with a lower case 's' in GenRe Mekong. You may have to convert the title accordingly, depending on where additional metadata is coming from.
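The difference in column titles can be handled with a rename before merging. In this sketch the metadata table and its `Country` column are placeholders; only the `Sample` and `sample` titles come from the description above.

```python
import pandas as pd

# Validation dataset, keyed by lower-case 'sample'.
dataset = pd.DataFrame({"sample": ["S1", "S2"]})

# Placeholder for Pf8-style metadata, where the ID column is titled 'Sample'.
metadata = pd.DataFrame({
    "Sample": ["S1", "S2", "S3"],
    "Country": ["A", "B", "C"],
})

# Harmonise the column title, then attach the metadata to each dataset row.
metadata = metadata.rename(columns={"Sample": "sample"})
annotated = dataset.merge(metadata, on="sample", how="left", validate="one_to_one")
print(annotated)
```

The `validate="one_to_one"` argument makes pandas raise an error if a sample ID is duplicated on either side, which guards against silently multiplying rows.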

To retrieve the FASTQ files from both projects, a custom ENA data helper module is provided alongside the Jupyter notebook. The notebook uses this module to search ENA by sample ID and to add the search results back into the sample data tables. The final section of the notebook creates input files for FASTQ download and demonstrates how to use the ENA data helper on the command line to retrieve the FASTQ files along with a manifest file.

Approach 2: Synthetic data

While real-world data from public resources are an important part of any validation strategy, such data suffer from some issues in the context of pipeline validation, such as:

Real data may not exist to cover all the different configurations that you wish to test in your pipeline
Data is generated using specific lab techniques that may not be compatible with a given pipeline
When developing a pipeline for a specific assay (e.g. one based on enrichment of specific genomic loci), high-quality real data may not yet be available for validation.

We have therefore provided some tools and recipes for creating designed synthetic data sets by simulation. To demonstrate the use of these tools, we have applied them to generate a synthetic validation data set for the SpotMalaria panel and associated pipeline. This data set has been submitted to ENA under BioProject XXXXXX.

Recipe and tools for creating simulated dataset

We have created a pipeline, pop_var_sim, that builds on published tools to facilitate the creation of simulated read datasets with known genotypes. It can also simulate custom AmpSeq panels.

A Jupyter notebook is provided here. It describes the design of the synthetic data set, and includes code for generating the required input files and configuration for pop_var_sim.
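The principle behind synthetic validation data, that the expected genotype is known because it was put there deliberately, can be shown with a toy example. This is not how pop_var_sim works internally: the reference sequence, variant position, read parameters and function names below are all invented for illustration, and no sequencing errors are modelled.

```python
import random

def apply_snp(reference: str, position: int, alt: str) -> str:
    """Return a copy of the reference with one base replaced (0-based position)."""
    return reference[:position] + alt + reference[position + 1:]

def simulate_reads(template: str, read_length: int, n_reads: int, seed: int = 0) -> list[str]:
    """Draw error-free reads of fixed length from random start positions."""
    rng = random.Random(seed)  # fixed seed so the data set is reproducible
    last_start = len(template) - read_length
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [template[s:s + read_length] for s in starts]

# Invented reference and variant; the "truth" is whatever is inserted here.
reference = "ACGTACGTTAGCCGATAGCTAGGCTTAACGGATCCGATCGTA"
truth = {"position": 20, "ref": reference[20], "alt": "T"}

mutant = apply_snp(reference, truth["position"], truth["alt"])
reads = simulate_reads(mutant, read_length=15, n_reads=50)
print(truth, len(reads))
```

A pipeline run on `reads` should report the variant recorded in `truth`; any other call is an error by construction.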

How to use the simulated data

TODO
