cogent3-h5seqs: a HDF5 storage driver for cogent3 sequence collections

cogent3-h5seqs is a sequence storage plug-in for cogent3. It uses HDF5 as the storage format for biological sequences, supporting both unaligned sequence collections and alignments. Storage can be in memory (the default) or on disk and sequences are compressed using the lzf compression engine.

The advantage of HDF5 is that once primary sequence formats have been converted from text into numpy arrays, loading and manipulating sequence data is fast and very memory efficient.

Sequences are stored under the hexdigest of their xxhash.hash64(). This means duplicated sequences are stored only once.

Installation

pip install cogent3-h5seqs

Usage

Three types of sequence storage

Storage Class	Suffix	Compression	Notes
`UnalignedSeqsData`	`.c3h5u`	lzf	For variable-length sequences. DNA/RNA moltypes are 2-bit encoded, reducing storage by 75%. The encoding can be turned off using `packed=False`.
`AlignedSeqsData`	`.c3h5a`	lzf	Dense storage for equal-length sequences. Every sequence is stored separately.
`SparseSeqsData`	`.c3h5s`	lzf	Sparse storage for equal-length sequences. Much more memory efficient than `AlignedSeqsData`. Faster to create and write.

Making `cogent3-h5seqs` the default storage

Using cogent3.set_storage_defaults(), you can set cogent3-h5seqs as the default storage. This means whenever a sequence collection is loaded from disk or created in memory, it will use the storage within this package.

The following statement makes cogent3-h5seqs the default for both unaligned and aligned sequence collections.

import cogent3

cogent3.set_storage_defaults(unaligned_seqs="c3h5u",
                             aligned_seqs="c3h5a")

You can undo this setting by

cogent3.set_storage_defaults(reset=True)

Equivalently, you could define

Using `cogent3-h5seqs` as storage per object

You don't have to specify the storage as the default for all instances, but can do it on a per object basis.

coll = cogent3.load_unaligned_seqs(some_path,
                                   moltype="dna",
                                   storage_backend="h5seqs_unaligned")

or, for alignments.

aln = cogent3.load_aligned_seqs(some_path,
                                   moltype="dna",
                                   storage_backend="c3h5s")

The same values can also be provided to the make_unaligned_seqs(), make_aligned_seqs() functions in cogent3.

Note With the 2-bit encoding for DNA/RNA sequences, you can safely turn off compression with compression=False. This can speed up operations. You can also turn off the encoding by setting packed=False.

Saving storage to disk

cogent3-h5seqs supports writing to disk, and employs the filename suffix .c3h5u for unaligned sequences and .c3h5a for aligned sequences. This will work whether your current object is using cogent3-h5seqs for storage or not. For example

sample_aln = cogent3.get_dataset("brca1")  # using the cogent3 builtin storage
outpath = "~/Desktop/alignment_output.c3h5s"
sample_aln.write(outpath)  # writes out as cogent3-h5seqs HDF5 storage

For a sequence collection, do the following.

sample_coll = cogent3.get_dataset("brca1").degap()
# Note the different suffix
outpath = "~/Desktop/alignment_output.c3h5u"
sample_coll.write(outpath)  # writes out as cogent3-h5seqs HDF5 storage

Loading storage from disk

cogent3 correctly directs to cogent3-h5seqs for loading based on the filename suffix.

inpath = "~/Desktop/alignment_output.c3h5u"
sample_coll = cogent3.load_unaligned_seqs(inpath, moltype="dna")

Note You cannot write an alignment instance to an unaligned storage type or vice versa. Nor can you read into the different types.

Name		Name	Last commit message	Last commit date
Latest commit History 222 Commits
.github		.github
src/cogent3_h5seqs		src/cogent3_h5seqs
tests		tests
.gitignore		.gitignore
.hgignore		.hgignore
.sourcery.yaml		.sourcery.yaml
LICENSE		LICENSE
README.md		README.md
noxfile.py		noxfile.py
pyproject.toml		pyproject.toml
ruff.toml		ruff.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

cogent3-h5seqs: a HDF5 storage driver for cogent3 sequence collections

Installation

Usage

Three types of sequence storage

Making `cogent3-h5seqs` the default storage

Using `cogent3-h5seqs` as storage per object

Saving storage to disk

Loading storage from disk

About

Uh oh!

Releases 13

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

License

cogent3/cogent3-h5seqs

Folders and files

Latest commit

History

Repository files navigation

cogent3-h5seqs: a HDF5 storage driver for cogent3 sequence collections

Installation

Usage

Three types of sequence storage

Making cogent3-h5seqs the default storage

Using cogent3-h5seqs as storage per object

Saving storage to disk

Loading storage from disk

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 13

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Making `cogent3-h5seqs` the default storage

Using `cogent3-h5seqs` as storage per object

Packages