Support for on-disk appends, partitioning #36
Open
Description
I've been working on a refactor of Castra. Before I spend any more time on this, I should probably get some feedback. Here's the plan:
Issues I'm attempting to solve:
- Castra provides no validation that partitions split evenly: indices like `[[1, 2, 3, 3], [3, 3, 4, 5, 6], ...]` were possible (and happened to me).
- Castra provides no easy way to say "partition weekly" without manually doing the partitioning elsewhere (issue #3, "Split input into multiple partitions on request?").
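To make the first problem concrete, here is a minimal sketch of the kind of check that is currently missing. It is not Castra code: `overlapping_boundaries` is a hypothetical helper, and it assumes each partition's index is a non-empty, sorted list.

```python
def overlapping_boundaries(partitions):
    """Return (position, previous_last, next_first) for every adjacent pair
    of partitions whose index ranges touch or overlap.

    Hypothetical helper; assumes each partition is a non-empty sorted list.
    """
    problems = []
    for i in range(len(partitions) - 1):
        prev_last = partitions[i][-1]
        next_first = partitions[i + 1][0]
        # A clean split requires the next partition to start strictly
        # after the previous one ends.
        if next_first <= prev_last:
            problems.append((i, prev_last, next_first))
    return problems


# The two partitions shown in the example above share the index value 3.
indices = [[1, 2, 3, 3], [3, 3, 4, 5, 6]]
print(overlapping_boundaries(indices))  # [(0, 3, 3)]
```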
The plan:
- Add `partitionby=None` to the `init` signature. This will live in `meta`. If `None`, no repartitioning is done by Castra. Can also be a time period (things you can pass to `resample` in pandas).
- `extend` checks current partitions for equality overlap (even if `partitionby=None`). There are 3 cases that can happen here:
  - Start of new frame is before end of existing partition. This errors.
  - Start of new frame is equal to end of existing partition. The equal parts are split off and appended to existing partition. Remainder is stored as new partition.
  - Start of new frame is after existing partition. New frame is written to disk (current behavior).
- If `partitionby != None`, then data is partitioned by Castra into blocks. `extend` should still take large dataframes (calling `extend` on a row is a bad idea), but will group them into partitions based on the rule passed to `partitionby`. Using the functionality provided by bloscpack, the on-disk partitions can be appended to with little overhead. This makes writing in cases where this happens slightly slower, but has no penalty on reads.
- Add `extend_sequence` function. This takes an iterable of dataframes (can be a generator), and does the partitioning in memory instead of on disk. This will be faster than calling `extend` in a loop (no on-disk appends), but will result in the same disk file format.
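The three overlap cases for `extend` can be sketched roughly as follows. This is only an illustration of the decision logic, not the refactored implementation: `split_new_index` is a hypothetical name, and plain sorted lists of index values stand in for dataframe indices.

```python
from bisect import bisect_right


def split_new_index(existing_end, new_index):
    """Decide what to do with a new frame's (sorted) index values.

    Returns (append_part, new_part): values to append to the existing last
    partition, and values to store as a new partition.
    Hypothetical helper; lists stand in for dataframe indices.
    """
    if not new_index:
        return [], []
    start = new_index[0]
    if start < existing_end:
        # Case 1: the new frame starts before the existing partition ends.
        raise ValueError("new frame overlaps the existing partition")
    if start == existing_end:
        # Case 2: split off the values equal to the existing end and append
        # them to the existing partition; the remainder is a new partition.
        cut = bisect_right(new_index, existing_end)
        return new_index[:cut], new_index[cut:]
    # Case 3: no overlap, the whole frame becomes a new partition.
    return [], list(new_index)


print(split_new_index(3, [3, 3, 4, 5, 6]))  # ([3, 3], [4, 5, 6])
print(split_new_index(3, [4, 5]))           # ([], [4, 5])
```

With real dataframes the same split would be applied to the rows, not just the index values.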
This method means that the disk will match what's in memory after calls to `extend` or `extend_sequence` complete, will allow Castra to do partitioning for the user, and will ensure that the partitions are valid. I have a crude version of this working now, and have found writes to be only slightly penalized when appends happen (no penalty if they don't), and no penalty for reading from disk.
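As a rough illustration of the in-memory grouping that `extend_sequence` would do for a weekly rule, the sketch below regroups a stream of chunks into one partition per week. Everything here is an assumption for illustration: `partition_weekly` and `week_start` are hypothetical names, rows are `(date, value)` tuples in place of dataframes, weeks are taken to start on Monday, and the input is assumed to be sorted by date across chunks.

```python
from datetime import date, timedelta


def week_start(day):
    """Monday of the week containing `day` (assumed week convention)."""
    return day - timedelta(days=day.weekday())


def partition_weekly(chunks):
    """Regroup an iterable of row chunks into (week_start, rows) partitions.

    `chunks` may be a generator; rows are (date, value) tuples and are
    assumed to arrive sorted by date. Hypothetical stand-in for grouping
    dataframes by a `partitionby` rule.
    """
    current_key = None
    current_rows = []
    for chunk in chunks:
        for row in chunk:
            key = week_start(row[0])
            if current_key is not None and key != current_key:
                # The week changed, so the previous partition is complete.
                yield current_key, current_rows
                current_rows = []
            current_key = key
            current_rows.append(row)
    if current_rows:
        yield current_key, current_rows


def example_chunks():
    # Chunk boundaries do not line up with week boundaries.
    yield [(date(2015, 6, 1), 1), (date(2015, 6, 5), 2)]
    yield [(date(2015, 6, 7), 3), (date(2015, 6, 8), 4)]


for key, rows in partition_weekly(example_chunks()):
    print(key, len(rows))
```

Because a partition is only emitted once its week is complete, nothing already written would need to be appended to, which is the advantage over calling `extend` in a loop.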