Description
I have previously used npstat to calculate theta pi, theta Watterson, and Tajima’s D. Because I am very interested in Grenedalf’s greater flexibility, I ran some empirical comparisons between the two tools.
I performed these tests on 8 populations from 4 insect species (2 populations per species), with genome sizes ranging from 250 Mb to 2.5 Gb. Each population is a pooled sample of 40 to 50 individuals, sequenced to an average depth of approximately 100X. All statistics were computed in windows of 100,000 bases.
The general trends I observe are as follows:
• A very high correlation between the two methods for all three statistics (generally r > 0.95), except for one of the four species (details below).
• A systematic bias: npstat tends to yield higher values for theta pi, whereas Grenedalf tends to yield higher values for theta Watterson; no clear bias is observed for Tajima’s D. This trend is consistent across the 8 populations from the 4 species.
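For anyone who wants to run the same kind of comparison, here is a minimal sketch of how the per-window correlation and bias can be computed. It assumes the per-window values from both tools have already been parsed into two aligned lists (same windows, same order). The function names `pearson_r` and `compare_windows` are hypothetical, the numbers in the usage lines are toy values rather than my data, and this is not necessarily the exact script I used.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def compare_windows(npstat_vals, grenedalf_vals):
    """Compare one statistic (e.g. theta pi) across aligned windows.

    Windows where either tool reports None or a non-finite value
    (for example, no data in the window) are skipped.
    """
    pairs = [
        (a, b)
        for a, b in zip(npstat_vals, grenedalf_vals)
        if a is not None and b is not None
        and math.isfinite(a) and math.isfinite(b)
    ]
    x = [a for a, _ in pairs]
    y = [b for _, b in pairs]
    # A positive mean difference means npstat is higher on average.
    mean_diff = sum(a - b for a, b in pairs) / len(pairs)
    return {
        "n_windows": len(pairs),
        "pearson_r": pearson_r(x, y),
        "mean_diff": mean_diff,
    }

# Toy values only, to show the call; the third window is skipped (NaN).
toy_npstat = [0.010, 0.012, float("nan"), 0.015, 0.009]
toy_grenedalf = [0.009, 0.011, 0.013, 0.014, 0.008]
summary = compare_windows(toy_npstat, toy_grenedalf)
```

Reporting the mean difference alongside r matters here because a correlation close to 1 can coexist with a consistent offset between the two tools, which is what I see for the theta estimates.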
The exception among the four species is the one with the largest genome (2.5 Gb, 10 chromosomes). For this species, the correlations for theta pi and theta Watterson are much lower, between 0.6 and 0.7, while the correlations for Tajima’s D remain very high (>0.97).
Notably, for this species only, I also tested larger window sizes (500,000 and 1,000,000 bases), but no improvement was observed (the figures were very similar, and there was even a slight decrease in correlations).
Have you ever observed this kind of trend? Do you have any idea what might be causing the systematic biases in theta estimates?
Overall, I am unsure which of the two tools yields the more reliable results, and I would be very interested in any feedback or comparative evaluations.
Thanks in advance to the community!