SNPs with unusually high/low Fst #46

@kaede0e

Hi Lucas and the Pool-seq community,

Thanks to all of you involved in developing this useful program! We are interested in identifying candidate genes that control the colour polymorphism in salmonberry (Rubus spectabilis), a raspberry relative native to the Pacific Northwest. The species is diploid, with a haploid genome size of ~250 Mbp. We have built haplotype-phased de novo reference genomes for both the red and gold morphs using PacBio HiFi + Hi-C data from the Canada BioGenome Project, and compared the assemblies at a baseline level (nucleotide divergence, alignment, structural variation, etc.) to make sure the two morphs are not completely different 'species' at the genome level. Under the hypothesis that gold morph individuals lack anthocyanin/pigment production gene function in some way, we are using the red morph genome as our reference for now. We note that fruit colour seems to have no correlation with environment, so we assume it is genetically controlled in a relatively simple manner.

To get at the genetic loci, we took a Pool-seq approach: we sampled red and gold individuals, pooled them by population and colour, and did WGS of the pools, targeting 4X coverage per individual with Illumina NovaSeq X PE 150 bp. I am analyzing this dataset of 8 populations (2 colours x 4 locations), i.e. paired red vs. gold pools at four geographically distinct locations. The number of individuals per pool varies due to sampling limitations: two of the four locations have ~100 vs. ~100 individuals, while the other two have only ~20 individuals per pool and are also unevenly split between the red and gold pools.

With that in mind, I initially ran the automated pipeline PoolParty2 by Micheletti SJ & Narum SR 2018 (https://github.com/stuartwillis/poolparty, https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.12784), and calculated Fst between red vs. gold pools per location and across all four locations together. However, the Fst peaks were rather inconsistent across populations: a clear high-Fst region in one population was not significant in the others. This peak was thus not significant in the CMH test, which measures consistent allele frequency differences between red vs. gold pools across populations. To validate the variable peaks from PoolParty2, I wanted to repeat the Fst calculation with grenedalf to confirm that these are real signals. After several attempts, I admit that I don't understand all the parameters that need to be adjusted for our experimental design, and the Fst values I am getting look odd (many more negative values and 1's than I'd expect, even in a 10,000 bp sliding-window analysis). Any ideas why, or what's wrong? Pasted below are my command and the relevant input file.

poolsize_8pop.txt
INTGold 14
INTRed 38
LYCGold 200
LYCRed 200
PSPGold 182
PSPRed 198
SFUGold 28
SFURed 62
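
As a sanity check on these numbers, here is a small Python sketch relating the pool sizes to the sampling design. It rests on two assumptions that the file itself does not state: that each value counts haploid chromosomes (2 x individuals, which would match the ~100-individual pools being listed as ~200), and that every individual contributed equally at the 4X target. The function and variable names are hypothetical.

```python
# Pool sizes copied verbatim from poolsize_8pop.txt above.
POOL_SIZES_TXT = """\
INTGold 14
INTRed 38
LYCGold 200
LYCRed 200
PSPGold 182
PSPRed 198
SFUGold 28
SFURed 62
"""

PLOIDY = 2                       # the species is diploid (stated in the post)
TARGET_DEPTH_PER_INDIVIDUAL = 4  # "4X coverage per individual" (stated in the post)


def parse_pool_sizes(text):
    """Return {sample_name: pool_size} from whitespace-separated lines."""
    sizes = {}
    for line in text.splitlines():
        if line.strip():
            name, size = line.split()
            sizes[name] = int(size)
    return sizes


pool_sizes = parse_pool_sizes(POOL_SIZES_TXT)

# ASSUMPTION: pool size = number of haploid chromosomes in the pool.
individuals = {name: size // PLOIDY for name, size in pool_sizes.items()}

# ASSUMPTION: every individual was sequenced evenly at the target depth.
expected_depth = {
    name: n * TARGET_DEPTH_PER_INDIVIDUAL for name, n in individuals.items()
}

for name in pool_sizes:
    print(f"{name}\t{individuals[name]} individuals\t"
          f"~{expected_depth[name]}X expected pool depth")
```

If these assumptions hold, the expected pool depths differ by more than an order of magnitude between the smallest and largest pools, which is worth keeping in mind when picking any single depth or count threshold for all samples.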

with command:
grenedalf fst \
  --sam-path BAM_ppalign_redH1_refgenome/ \
  --sam-min-map-qual 5 \
  --sam-min-base-qual 20 \
  --reference-genome-dict Salmonberry_redmorph_H1.v2.fa.dict \
  --rename-samples-list rename_sample_names_8pop.txt \
  --filter-sample-min-count 2 \
  --filter-sample-max-count 2000 \
  --window-type single \
  --method unbiased-nei \
  --window-average-policy available-loci \
  --comparand-list comparand_list_8pop.txt \
  --pool-sizes poolsize_8pop.txt \
  --write-pi-tables \
  --separator-char tab \
  --na-entry nan \
  --out-dir redH1_refgenome_output \
  --threads 40 \
  --log-file redH1_refgenome_output/grenedalf_redH1_refgenome_output.log.txt
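
To make the "negatives and 1's" concrete, here is a minimal Python sketch of a Nei-style Fst computed at a single site from read counts. It is a simplification and not grenedalf's implementation: it applies only a read-depth (n/(n-1)) correction and ignores the pool-size correction that the unbiased-nei method takes from --pool-sizes, so actual values will differ. The function names and the example counts are hypothetical.

```python
def unbiased_heterozygosity(counts):
    """1 - sum(f^2), scaled by n/(n-1) for the n reads sampled at this site."""
    n = sum(counts)
    if n < 2:
        raise ValueError("need at least two reads at the site")
    h = 1.0 - sum((c / n) ** 2 for c in counts)
    return h * n / (n - 1)


def between_diversity(counts1, counts2):
    """Probability that one read drawn from each pool carries different alleles."""
    n1, n2 = sum(counts1), sum(counts2)
    return 1.0 - sum((a / n1) * (b / n2) for a, b in zip(counts1, counts2))


def nei_fst(counts1, counts2):
    """Simplified Nei-style Fst = 1 - pi_within / pi_total for two pools."""
    pi_within = 0.5 * (unbiased_heterozygosity(counts1)
                       + unbiased_heterozygosity(counts2))
    pi_between = between_diversity(counts1, counts2)
    pi_total = 0.5 * (pi_within + pi_between)
    if pi_total == 0:
        return float("nan")  # invariant site: Fst is undefined
    return 1.0 - pi_within / pi_total


# Counts are (reference reads, alternate reads) per pool; values are made up.
print(nei_fst((5, 5), (5, 5)))      # identical frequencies: slightly negative
print(nei_fst((10, 0), (0, 10)))    # fixed difference at low depth: exactly 1.0
print(nei_fst((10, 0), (10, 0)))    # invariant site: nan
```

Under this simplified estimator, a site where both pools have the same allele frequency yields a small negative value because the within-pool term is inflated by the sample-size correction, and a low-depth site where the pools happen to show different alleles yields exactly 1. Whether this explains the values seen in windowed output as well is not something the sketch can show.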

Also, how do you decide on --filter-sample-min-count and --filter-sample-max-count, or should I leave them unset? Should I be removing SNPs with very low or very high coverage because they are poor-quality SNPs? What's the difference between --filter-sample-min-count and --filter-sample-min-read-depth? Is one of them better suited for filtering SNPs with too low coverage (such as those in repetitive/low-complexity sequences, where reads map poorly)?
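
One possible reading of the two kinds of filter is sketched below in Python, offered as an assumption to check against the grenedalf documentation rather than as its actual behaviour: a "count" filter acts on each nucleotide's count separately and zeroes out counts outside the bounds, while a "read depth" filter acts on the total depth of the sample at that position and drops the whole position. The function names and example counts are hypothetical.

```python
def apply_count_filter(counts, min_count=0, max_count=0):
    """Zero out individual nucleotide counts outside the bounds.

    ASSUMED semantics of a per-nucleotide count filter; 0 disables a bound.
    """
    filtered = {}
    for base, c in counts.items():
        too_low = min_count > 0 and c < min_count
        too_high = max_count > 0 and c > max_count
        filtered[base] = 0 if (too_low or too_high) else c
    return filtered


def passes_read_depth(counts, min_depth=0, max_depth=0):
    """Keep or drop the whole position based on total depth.

    ASSUMED semantics of a read-depth filter; 0 disables a bound.
    """
    depth = sum(counts.values())
    if min_depth > 0 and depth < min_depth:
        return False
    if max_depth > 0 and depth > max_depth:
        return False
    return True


# A well-covered site with one singleton read (possibly a sequencing error).
deep_site = {"A": 30, "C": 1, "G": 0, "T": 0}
# A poorly covered site with two alleles.
shallow_site = {"A": 3, "C": 2, "G": 0, "T": 0}

# The count filter removes the singleton allele but keeps the site.
print(apply_count_filter(deep_site, min_count=2))
# The same count filter leaves the shallow site untouched ...
print(apply_count_filter(shallow_site, min_count=2))
# ... whereas a depth filter is what would drop it.
print(passes_read_depth(deep_site, min_depth=20))
print(passes_read_depth(shallow_site, min_depth=20))
```

Under this reading, the count filter addresses spurious alleles, while the depth filter addresses poorly covered positions, which is the case described in the question above.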

Thank you for your help,
Kaede
