This is more of a question than a bug.
My understanding is that when the user set a cutoff, PyHMMER will not report hits that would have hit.included == False. However, the same is not applied at the domain level, so there are cases where the hit is reported because one of the domains passes the cutoffs, but domains that do not satisfy the cutoffs are also reported.
For example:
>P03865 1-315
MATTKLGNTKSASRAINYAEKRAEEKSGLNCDVDYAKSAFKQTRALYGKEDGIQAHTVIQSFKPGEVTPEQCNQLGLELA
EKIAPNHQVAVYTHTDKDHYHNHIVINSVDLETGKKYQSNKKQRDLVKKENDNICREHGLSVTERGIAKMRYTQAEKGIV
FDRDEYSWKDELRDLIENAKTHTSNLETFSEHLEEKGVGVKLRGETISYKPENENKWVRGRTLGSEYEKGAIDHEHERHQ
KQQREPEYADEFKINWDAVEQHTEQLKQRRVERAQETKQAHSKISSRDTRESENQRERAKGNNIRIERGDEGLSR
with open("hmmsearch_plasmid_hallmarks.tsv", "w") as fo:
fo.write("query\thit\tstart\tend\tevalue\tdomain_score\tdomain_included\n")
for p in Path(".").glob("*.faa"):
with pyhmmer.easel.SequenceFile(p, digital=True) as seq_file:
seqs = seq_file.read_block()
with pyhmmer.plan7.HMMFile("Pfam-A.hmm") as hmm_file:
for hits in pyhmmer.hmmsearch(hmm_file, seqs, bit_cutoffs="gathering"):
for hit in hits:
for domain in hit.domains:
fout.write(
f"{hits.query_accession.decode()}\t"
f"{hit.name.decode()}\t"
f"{domain.env_from}\t"
f"{domain.env_to}\t"
f"{hit.evalue:.2E}\t"
f"{domain.score:.2f}\t"
f"{domain.included}\n"
)
Generates this output:
query hit start end evalue domain_score domain_included
PF03432.18 P03865 8 239 3.25E-90 288.01 True
PF03432.18 P03865 240 314 3.25E-90 -0.23 False
To be honest, I don't really know what determines if a domain is included or not, given that the cutoffs are applied to the hit. I expected that when a cutoff is applied, domains were also filtered.
This is more of a question than a bug.
My understanding is that when the user set a cutoff, PyHMMER will not report hits that would have
hit.included == False. However, the same is not applied at the domain level, so there are cases where the hit is reported because one of the domains passes the cutoffs, but domains that do not satisfy the cutoffs are also reported.For example:
Generates this output:
To be honest, I don't really know what determines if a domain is included or not, given that the cutoffs are applied to the hit. I expected that when a cutoff is applied, domains were also filtered.