A simple and efficient binary file format for structured tabular data. Plank stores data in a columnar layout, with a footer for fast metadata access.
A Plank file is organized into two sections: the row groups, followed by a footer.
[row group-1 size: 4 bytes]
[row group id: 4 bytes]
[column-1 size: 4 bytes]
[data size: 4 bytes]?[data]
[column-2]
...
[column-n]
[row count: 4 bytes]
[row group-2]
...
[row group-n]
[schema size]
[field-1 name size: 4 bytes][field-1 name][field-1 type]
[field-2]
...
[field-n]
[row count size: 4 bytes][u32]
[column count size: 4 bytes][u32]
[row group count size: 4 bytes][u32]
[offset size]
[row group-1 offset: 4 bytes]..[row group-n offset]
[sha256 checksum]
[footer offset: 4 bytes]
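Several entries in the layout above are length-prefixed: a 4-byte size followed by the payload, as in `[field-1 name size: 4 bytes][field-1 name]`. The following is a minimal sketch of reading one such entry. It assumes the size is a little-endian u32 (this README only states the byte order for the footer offset) and that the name is UTF-8. `read_field_name` is a hypothetical helper for illustration, not part of the plank crate.

```rust
use std::io::{self, Read};

/// Reads one length-prefixed name: a 4-byte size followed by that many bytes.
/// Assumption: the size is a little-endian u32 and the name is valid UTF-8.
fn read_field_name<R: Read>(reader: &mut R) -> io::Result<String> {
    // Read the 4-byte size prefix.
    let mut size_buf = [0u8; 4];
    reader.read_exact(&mut size_buf)?;
    let size = u32::from_le_bytes(size_buf) as usize;

    // Read exactly `size` bytes of payload; a short read is an error.
    let mut name = vec![0u8; size];
    reader.read_exact(&mut name)?;

    // Reject payloads that are not valid UTF-8.
    String::from_utf8(name).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

Because the function is generic over `Read`, it can be exercised against an in-memory byte slice as well as a file.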
A row group is a fixed-size chunk of rows. Each line in a row group represents one column.
Jack,Emily,
Johnson,Clark,
28,34,
New York,London,
The above encodes two rows across four columns (first_name, last_name, age, city). The example uses comma-separated values for visualization. The actual values are binary-encoded.
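To make the row-to-column arrangement concrete, here is a small sketch that transposes row-oriented records into one vector per column. It only illustrates the idea; `rows_to_columns` is a hypothetical helper, values are kept as plain strings rather than binary-encoded, and all rows are assumed to have the same number of fields.

```rust
/// Transposes row-oriented records into column-oriented storage.
/// The result has one `Vec` per column, holding that column's values in row order.
/// Assumption: every row has as many fields as the first row.
fn rows_to_columns(rows: &[Vec<String>]) -> Vec<Vec<String>> {
    // The column count is taken from the first row (0 if there are no rows).
    let column_count = rows.first().map_or(0, |row| row.len());
    let mut columns = vec![Vec::with_capacity(rows.len()); column_count];
    for row in rows {
        // Append each field of the row to its matching column.
        for (column, value) in columns.iter_mut().zip(row) {
            column.push(value.clone());
        }
    }
    columns
}
```

Applied to the two rows in the example, the first resulting column holds `Jack` and `Emily`, and the last holds `New York` and `London`.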
The footer contains complete file metadata and is located at the end of the file. The footer offset (a little-endian u32) is stored in the last 4 bytes of the file, allowing readers to seek directly to the footer without scanning the file.
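The lookup described above can be sketched with the standard library alone. This is not the crate's implementation: `read_footer_offset` and `seek_to_footer` are hypothetical names, and the sketch assumes the offset is measured from the start of the file.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Reads the footer offset stored in the last 4 bytes of a Plank file.
/// Works on any seekable reader, such as a `File` or an in-memory `Cursor`.
fn read_footer_offset<R: Read + Seek>(reader: &mut R) -> io::Result<u32> {
    // Jump to 4 bytes before the end; this fails if the input is shorter than 4 bytes.
    reader.seek(SeekFrom::End(-4))?;
    let mut buf = [0u8; 4];
    reader.read_exact(&mut buf)?;
    // The footer offset is a little-endian u32.
    Ok(u32::from_le_bytes(buf))
}

/// Positions the reader at the start of the footer and returns the new position.
/// Assumption: the stored offset is relative to the start of the file.
fn seek_to_footer<R: Read + Seek>(reader: &mut R) -> io::Result<u64> {
    let offset = u64::from(read_footer_offset(reader)?);
    reader.seek(SeekFrom::Start(offset))
}
```

Only two seeks and one 4-byte read are needed before the footer can be parsed, regardless of how large the row groups are.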
The following types are currently supported:
- Str: Variable-size text
- Int32
- Int64
- Bool
- Struct: Supports fields of any of the supported types
- List: A homogeneous list of items (homogeneity is not yet enforced)
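Because Struct and List can contain any supported type, the type system is recursive. One way to picture this is the sketch below. `FieldType` and `nesting_depth` are hypothetical names chosen for illustration and do not reflect the crate's actual definitions.

```rust
/// Hypothetical model of the supported types. The variant names mirror the
/// types listed above, but this is not the library's actual definition.
#[derive(Debug, Clone, PartialEq)]
enum FieldType {
    Str,
    Int32,
    Int64,
    Bool,
    /// Named fields, each of any supported type (including nested structs and lists).
    Struct(Vec<(String, FieldType)>),
    /// A list whose items are expected to share one type.
    List(Box<FieldType>),
}

/// Returns how deeply a type nests structs and lists (0 for scalar types).
fn nesting_depth(ty: &FieldType) -> usize {
    match ty {
        FieldType::Struct(fields) => {
            // One level for the struct itself, plus its deepest field.
            1 + fields.iter().map(|(_, t)| nesting_depth(t)).max().unwrap_or(0)
        }
        FieldType::List(item) => 1 + nesting_depth(item),
        _ => 0,
    }
}
```

In this model, a struct with a `Str` field and a `List` of `Int32` field has a nesting depth of 2: one level for the struct and one for the list inside it.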
use plank::PlankReader;

let mut f = PlankReader::open("/path/to/file.plank")?;
for rg in &mut f {
    if let Ok(rg) = rg {
        for row in rg {
            println!("{:?}", row);
        }
    }
}

use plank::PlankWriter;
let mut f = PlankWriter::new("/path/to/file.plank")?;
f.write_from_csv("/path/to/file.csv")?;

use plank::PlankReader;
let mut f = PlankReader::open("./data/file.plank").unwrap();
// Read the "name" and "age" columns from row group 0
let result = f.read_row_group_columns(0, &vec!["name", "age"]).unwrap();
println!("{:#?}", result);

import io.plank.*;
PlankReader reader = new PlankReader("/path/to/file.plank");
RecordBatch rb = reader.readRowGroupColumns(0, new String[]{"name", "age"});
System.out.println(rb);

- The entire row group is currently read into memory on each call
- Lists are not checked for homogeneity, and the item type cannot be inferred in some scenarios
- A column in a row group holds all of that column's values for the group, regardless of byte size (this may or may not be desirable)
- Row groups contain a fixed number of rows that cannot be configured (this number is not recorded in the footer)
.plank