Support for on-disk appends, partitioning #36

I've been working on a refactor of Castra. Before I spend any more time on it, I should probably get some feedback. Here's what I have in mind:

Issues I'm attempting to solve:
1. Castra does not validate that partitions split cleanly on index values. A single index value can end up in two adjacent partitions, e.g. `[[1, 2, 3, 3], [3, 3, 4, 5, 6], ...]` (this has happened to me).
2. Castra provides no easy way to say "partition weekly" without manually doing the partitioning elsewhere (issue #3).
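To make the first issue concrete, here is a minimal sketch of the kind of check that is currently missing. The helper name `find_split_values` is hypothetical (it is not part of Castra), and it assumes each partition's index is given as a sorted list.

```python
def find_split_values(partitions):
    """Return index values that end one partition and also start the next.

    `partitions` is a list of sorted index lists, one per partition.
    Hypothetical helper for illustration only; not part of Castra.
    """
    split = []
    for left, right in zip(partitions, partitions[1:]):
        # A value is "split" when it is both the last entry of one
        # partition and the first entry of the following one.
        if left and right and left[-1] == right[0]:
            split.append(left[-1])
    return split


example = [[1, 2, 3, 3], [3, 3, 4, 5, 6]]
print(find_split_values(example))  # [3]
```

A clean layout such as `[[1, 2], [3, 4]]` yields an empty list.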

The plan:
1. Add `partitionby=None` to the `__init__` signature. The value will be stored in `meta`. If it is `None`, Castra does no repartitioning. It can also be a time period (anything you can pass to `resample` in pandas).
2. `extend` checks the new frame against the existing partitions for overlap (even if `partitionby=None`). There are 3 possible cases:
   1. The start of the new frame is before the end of the existing partition. This raises an error.
   2. The start of the new frame is equal to the end of the existing partition. The equal rows are split off and appended to the existing partition. The remainder is stored as a new partition.
   3. The start of the new frame is after the end of the existing partition. The new frame is written to disk as a new partition (current behavior).
3. If `partitionby` is not `None`, Castra partitions the data into blocks. `extend` should still take large dataframes (calling `extend` on a single row is a bad idea), but will group them into partitions based on the rule passed to `partitionby`. Using the functionality provided by bloscpack, the on-disk partitions can be appended to with little overhead. Writes that trigger an append are slightly slower, but there is no penalty on reads.
4. Add an `extend_sequence` function. This takes an iterable of dataframes (it can be a generator) and does the partitioning in memory instead of on disk. This will be faster than calling `extend` in a loop (no on-disk appends), but will result in the same on-disk file format.
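The three cases in step 2 can be sketched as follows. This is only an illustration of the rule, not the actual implementation: `split_for_extend` is a hypothetical helper, and it assumes the new frame's index is sorted and that the last index value already on disk is known.

```python
import pandas as pd


def split_for_extend(last_index_value, new_frame):
    """Split `new_frame` into (rows to append, rows for a new partition).

    Hypothetical helper for illustration; assumes `new_frame` has a
    sorted index and `last_index_value` is the final index value of
    the existing partition.
    """
    if new_frame.empty:
        return new_frame, new_frame

    start = new_frame.index[0]

    # Case 1: new data starts before the existing partition ends.
    if start < last_index_value:
        raise ValueError("new frame starts before end of existing partition")

    # Case 2: rows equal to the existing end are appended to that partition.
    # Case 3: if no rows are equal, the mask is all False and the whole
    # frame becomes a new partition.
    equal = new_frame.index == last_index_value
    return new_frame[equal], new_frame[~equal]


frame = pd.DataFrame({"x": [10, 20, 30]}, index=[3, 3, 4])
to_append, to_write = split_for_extend(3, frame)
print(list(to_append["x"]), list(to_write["x"]))  # [10, 20] [30]
```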

With this approach, what is on disk matches what is in memory once a call to `extend` or `extend_sequence` completes, Castra can do the partitioning for the user, and the partitions are guaranteed to be valid. I have a crude version of this working now. In my experience so far, writes are only slightly slower when an append happens (no penalty when it doesn't), and there is no penalty for reading from disk.
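As a rough idea of what "Castra does the partitioning for the user" could look like, the sketch below groups a large frame by a time rule. It uses `pd.Grouper` as a stand-in for whatever the real implementation does; `partition_frame` is a hypothetical name, and the frame is assumed to have a sorted `DatetimeIndex`.

```python
import pandas as pd


def partition_frame(frame, partitionby=None):
    """Yield (label, sub-frame) pairs according to `partitionby`.

    Hypothetical helper for illustration. `partitionby` is a pandas
    frequency string (as accepted by `resample`), or None to leave
    the frame as a single block.
    """
    if partitionby is None:
        yield None, frame
        return
    # Group rows by the time rule, using the frame's DatetimeIndex.
    for label, part in frame.groupby(pd.Grouper(freq=partitionby)):
        if len(part):  # skip empty periods
            yield label, part


index = pd.date_range("2015-01-01", periods=14, freq="D")
frame = pd.DataFrame({"x": range(14)}, index=index)
for label, part in partition_frame(frame, "W"):
    print(label.date(), len(part))
```

Each yielded block would then either be appended to an existing on-disk partition or written as a new one.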

