Tinku10/plank


Plank

A simple and efficient binary file format for structured tabular data. Plank stores data in a columnar layout within row groups and keeps the file metadata in a footer, so readers can reach the metadata without scanning the data.


Format Specification

Plank files are organized into two sections: the row groups, followed by a footer.

Layout

[row group-1 size: 4 bytes]
    [row group id: 4 bytes]
        [column-1 size: 4 bytes]
            [data size: 4 bytes]?[data]
        [column-2]
        ...
        [column-n]
    [row count: 4 bytes]
[row group-2]
...
[row group-n]
[schema size]
    [field-1 name size: 4 bytes][field-1 name][field-1 type]
    [field-2]
    ...
    [field-n]
[row count size: 4 bytes][u32]
[column count size: 4 bytes][u32]
[row group count size: 4 bytes][u32]
[offset size]
    [row group-1 offset: 4 bytes]..[row group-n offset]
[sha256 checksum]
[footer offset: 4 bytes]
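
Several entries in the layout pair a 4-byte size with the bytes that follow it (for example `[data size: 4 bytes]?[data]` and `[field-1 name size: 4 bytes][field-1 name]`). The sketch below shows one way such a length-prefixed field could be written and read. It is an illustration only: `write_sized` and `read_sized` are hypothetical helpers, not part of the plank crate, and the little-endian byte order of the size is an assumption (the specification only states it for the footer offset). The `?` in the layout suggests the size prefix is optional for some values; the sketch always writes it.

```rust
use std::io::{self, Read, Write};

/// Writes `data` preceded by its length as a 4-byte little-endian u32.
/// Hypothetical helper; the byte order of the size field is an assumption.
pub fn write_sized<W: Write>(out: &mut W, data: &[u8]) -> io::Result<()> {
    // A 4-byte size field cannot describe more than u32::MAX bytes.
    if data.len() > u32::MAX as usize {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "payload does not fit in a 4-byte size field",
        ));
    }
    out.write_all(&(data.len() as u32).to_le_bytes())?;
    out.write_all(data)
}

/// Reads one length-prefixed field written by `write_sized`.
pub fn read_sized<R: Read>(input: &mut R) -> io::Result<Vec<u8>> {
    // First the 4-byte size...
    let mut len_buf = [0u8; 4];
    input.read_exact(&mut len_buf)?;
    let len = u32::from_le_bytes(len_buf) as usize;
    // ...then exactly that many payload bytes.
    let mut data = vec![0u8; len];
    input.read_exact(&mut data)?;
    Ok(data)
}
```

Because every field announces its own size, a reader can skip over a field it does not need by advancing that many bytes instead of decoding it.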

Row Groups

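A row group is a chunk containing a fixed number of rows. Within a row group the data is stored column by column. In the visualization below, each line represents one column of the row group.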

Jack,Emily,
Johnson,Clark,
28,34,
New York,London,

The above encodes two rows across four columns (first_name, last_name, age, city). The example uses comma-separated values for visualization. The actual values are binary-encoded.
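
The pivot from rows to columns shown above can be sketched in a few lines. This is only an illustration of the idea, using strings for every value; `rows_to_columns` is a hypothetical helper, not part of the plank crate, and it does not reflect how the crate encodes values.

```rust
/// Pivots row-oriented records into one vector per column.
/// Hypothetical helper for illustration; not part of the plank crate.
pub fn rows_to_columns(rows: &[Vec<String>]) -> Vec<Vec<String>> {
    // The number of columns is taken from the first row.
    let width = rows.first().map_or(0, |row| row.len());
    let mut columns: Vec<Vec<String>> = vec![Vec::new(); width];
    for row in rows {
        // Assumes every row has `width` fields; extra fields are ignored by `zip`.
        for (column, value) in columns.iter_mut().zip(row) {
            column.push(value.clone());
        }
    }
    columns
}
```

Given the rows (Jack, Johnson, 28, New York) and (Emily, Clark, 34, London), the first returned column holds Jack and Emily, matching the first line of the visualization.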

Footer

The footer contains complete file metadata and is located at the end of the file. The footer offset (a little-endian u32) is stored in the last 4 bytes of the file, allowing readers to seek directly to the footer without scanning the file.
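
Locating the footer as described can be sketched with the standard library alone. The helpers `read_footer_offset` and `seek_to_footer` are hypothetical names used for illustration, not part of the plank crate, and the sketch assumes the stored offset is measured from the start of the file.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Reads the footer offset from the last 4 bytes of the stream.
/// Hypothetical helper; not part of the plank crate.
pub fn read_footer_offset<R: Read + Seek>(reader: &mut R) -> io::Result<u32> {
    // Jump to 4 bytes before the end of the stream.
    reader.seek(SeekFrom::End(-4))?;
    let mut buf = [0u8; 4];
    reader.read_exact(&mut buf)?;
    // The offset is stored as a little-endian u32.
    Ok(u32::from_le_bytes(buf))
}

/// Positions the reader at the start of the footer and returns that position.
/// Assumes the offset is relative to the start of the file.
pub fn seek_to_footer<R: Read + Seek>(reader: &mut R) -> io::Result<u64> {
    let offset = u64::from(read_footer_offset(reader)?);
    reader.seek(SeekFrom::Start(offset))
}
```

Both functions accept any `Read + Seek` value, so they work with a `std::fs::File` as well as an in-memory `std::io::Cursor`.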


Data Types

The following types are currently supported.

  • Str: Variable size text
  • Int32
  • Int64
  • Bool
  • Struct: Supports fields of any of the supported types
  • List: A homogeneous list of items (homogeneity is not yet enforced)
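
As an illustration of what a homogeneity check for lists could look like, the sketch below models the supported types as a Rust enum and compares the variants of the list items. The `Value` enum and `is_homogeneous` function are hypothetical and do not mirror the plank crate's internal types. The check is shallow: it compares only the top-level variant, so two lists or structs with different inner types still count as the same kind.

```rust
/// A value of one of the supported types.
/// Hypothetical model for illustration; not the plank crate's own type.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Str(String),
    Int32(i32),
    Int64(i64),
    Bool(bool),
    Struct(Vec<(String, Value)>),
    List(Vec<Value>),
}

/// Returns true if all items share the same top-level variant.
/// An empty list is treated as homogeneous.
pub fn is_homogeneous(items: &[Value]) -> bool {
    match items.split_first() {
        None => true,
        Some((first, rest)) => {
            // `discriminant` identifies the variant while ignoring its payload.
            let kind = std::mem::discriminant(first);
            rest.iter().all(|item| std::mem::discriminant(item) == kind)
        }
    }
}
```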

Usage

Reading all rows

use plank::PlankReader;

let mut f = PlankReader::open("/path/to/file.plank")?;

for rg in &mut f {
    if let Ok(rg) = rg {
        for row in rg {
            println!("{:?}", row);
        }
    }
}

Converting a CSV

use plank::PlankWriter;

let mut f = PlankWriter::new("/path/to/file.plank")?;
f.write_from_csv("/path/to/file.csv")?;

Reading specific row groups with selected columns

use plank::PlankReader;

let mut f = PlankReader::open("./data/file.plank").unwrap();

// Read the "name" and "age" columns from row group 0
let result = f.read_row_group_columns(0, &vec!["name", "age"]).unwrap();

println!("{:#?}", result);

Using the Java bindings

import io.plank.*;

PlankReader reader = new PlankReader("/path/to/file.plank");

RecordBatch rb = reader.readRowGroupColumns(0, new String[]{"name", "age"});

System.out.println(rb);

Possible Improvements

  • Each call currently reads the entire row group into memory
  • Lists are not checked for homogeneity, and the item type cannot be recognized in some scenarios
  • A column in a row group contains the full column for that group irrespective of its byte size (maybe good, maybe not)
  • Row groups hold a fixed number of rows that cannot be configured (the row group size is not recorded in the footer)

File Extension

.plank

About

A simple hybrid binary file format for structured data
