Support for on-disk appends, partitioning #36

@jcrist

I've been working on a refactor of Castra - before I spend any more time on this, I should probably get some feedback. Here's the plan:

Issues I'm attempting to solve:

  1. Castra does not validate that partitions split cleanly on index values - partition indices like [[1, 2, 3, 3], [3, 3, 4, 5, 6], ...] are possible, with the same index value (3 here) spread across two partitions (this happened to me)
  2. Castra provides no easy way to say "partition weekly" without manually doing the partitioning elsewhere (see issue #3, "Split input into multiple partitions on request?")
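
To make the first issue concrete, here is a minimal sketch of the kind of check that is currently missing. The function name `find_overlaps` is hypothetical (it is not part of Castra), and it assumes each partition's index is already sorted:

```python
def find_overlaps(partition_indices):
    """Return (i, i + 1) pairs where partition i ends at or after the
    first index value of partition i + 1.

    Hypothetical helper for illustration; assumes each index is sorted.
    """
    overlaps = []
    for i in range(len(partition_indices) - 1):
        left, right = partition_indices[i], partition_indices[i + 1]
        # Skip empty partitions; they cannot share a boundary value.
        if len(left) and len(right) and left[-1] >= right[0]:
            overlaps.append((i, i + 1))
    return overlaps


indices = [[1, 2, 3, 3], [3, 3, 4, 5, 6], [7, 8]]
bad_pairs = find_overlaps(indices)
print(bad_pairs)  # [(0, 1)]: index value 3 appears in both partitions
```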

The plan:

  1. Add partitionby=None to the init signature. This will live in meta. If None, Castra does no repartitioning. It can also be a time period (anything you can pass to resample in pandas).
  2. extend checks the start of the new frame against the end of the last existing partition for overlapping or equal index values (even if partitionby=None). There are 3 cases that can happen here:
    1. Start of new frame is before end of existing partition. This raises an error.
    2. Start of new frame is equal to end of existing partition. The rows with the equal index value are split off and appended to the existing partition. The remainder is stored as a new partition.
    3. Start of new frame is after end of existing partition. New frame is written to disk as a new partition (current behavior).
  3. If partitionby is not None, then data is partitioned by Castra into blocks. extend should still take large dataframes (calling extend on a single row is a bad idea), but will group them into partitions based on the rule passed to partitionby. Using the functionality provided by bloscpack, the on-disk partitions can be appended to with little overhead. This makes writes slightly slower when an append happens, but has no penalty on reads.
  4. Add an extend_sequence function. This takes an iterable of dataframes (can be a generator), and does the partitioning in memory instead of on disk. This will be faster than calling extend in a loop (no on-disk appends), but will result in the same on-disk format.
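
The three cases in step 2 can be sketched roughly as follows. This is only an illustration of the decision logic, not the actual refactor: `split_for_extend` and `last_end` are hypothetical names, and the sketch assumes the new frame is non-empty and has a sorted index.

```python
import pandas as pd


def split_for_extend(new, last_end):
    """Classify a new frame against the end of the last existing partition.

    Returns (to_append, to_write): the rows to append to the existing
    partition and the rows to store as a new partition.
    Hypothetical helper; assumes `new` is non-empty with a sorted index.
    """
    start = new.index[0]
    if start < last_end:
        # Case 1: new data begins inside the existing partition.
        raise ValueError("new frame starts before end of existing partition")
    if start == last_end:
        # Case 2: split off the rows equal to the boundary value.
        equal = new.index == last_end
        return new[equal], new[~equal]
    # Case 3: no overlap, everything goes to a new partition.
    return new.iloc[:0], new


df = pd.DataFrame({"x": [10, 20, 30, 40]}, index=[3, 3, 4, 5])
to_append, to_write = split_for_extend(df, last_end=3)
print(list(to_append.index), list(to_write.index))  # [3, 3] [4, 5]
```

Only case 2 needs the on-disk append; the other two cases either fail early or behave as extend does today.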

With this approach, the disk will match what's in memory after calls to extend or extend_sequence complete, Castra can do the partitioning for the user, and the partitions are guaranteed to be valid. I have a crude version of this working now, and have found writes to be only slightly penalized when appends happen (no penalty if they don't), with no penalty for reading from disk.
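
For a feel of what the in-memory partitioning behind extend_sequence might look like, here is a rough sketch using `pandas.Grouper`. The name `partition_sequence` is hypothetical, and the sketch assumes the frames have a DatetimeIndex, arrive in index order, and that the frequency is an anchored one such as "W" so that bin labels line up across frames:

```python
import pandas as pd


def partition_sequence(frames, freq):
    """Yield one DataFrame per time period from an iterable of frames.

    Hypothetical sketch; assumes time-indexed frames in sorted order
    and an anchored frequency such as "W".
    """
    key, parts = None, []
    for frame in frames:
        for label, part in frame.groupby(pd.Grouper(freq=freq)):
            if part.empty:
                continue  # Grouper can produce empty bins; skip them
            if parts and label != key:
                # Period changed: flush the finished partition.
                yield pd.concat(parts)
                parts = []
            key = label
            parts.append(part)
    if parts:
        yield pd.concat(parts)


days = pd.date_range("2015-01-01", periods=14, freq="D")
df = pd.DataFrame({"x": range(14)}, index=days)
# Two input frames whose boundary falls in the middle of a week.
chunks = [df.iloc[:6], df.iloc[6:]]
partitions = list(partition_sequence(chunks, "W"))
sizes = [len(p) for p in partitions]
```

Pieces of the same week that arrive in different input frames are merged before being yielded, so no on-disk append would be needed in this path.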
