
# Datasets and Dataloaders

## Datasets

Datasets must satisfy the `IsDataset` concept, which requires:

- A typedef'd `DataType`, which is a tuple of arbitrary types
- A `size()` method returning the number of samples the dataset contains
- A `get(int idx)` method returning the sample at the given index in the range `[0, size())`

```cpp
template <typename T>
concept IsDataset = requires(const T ct, int idx) {
    typename T::DataType;
    requires IsSpecialization<typename T::DataType, std::tuple>;
    { ct.size() } -> std::same_as<int>;
    { ct.get(idx) } -> std::same_as<typename T::DataType>;
};
```

If your dataset simply wraps tensors created externally, you can instantiate the provided `TensorDataset`.

```cpp
Tensor inputs = uniform_real(0, 1, {100, 4});
Tensor targets = uniform_real(0, 1, {100, 4});
data::TensorDataset train_data(inputs, targets);
```

## DatasetView

Views are created over datasets, for example when building train/validation/test partitions. A `DatasetView` is also required as the input to a `DataLoader`.

- Views can be created directly by moving the dataset into one, when the dataset is not to be partitioned:

  ```cpp
  data::DatasetView train_data_view(std::move(train_data));
  ```

- Or via `random_split`, which creates one view per partition; the partition sizes must sum to the total size of the dataset:

  ```cpp
  int seed = 0;
  // dataset has size() == 100
  auto [train_data, validate_data, test_data] = data::random_split(std::move(dataset), seed, 70, 20, 10);
  ```

## Dataloaders

A `DataLoader` takes a `DatasetView` and provides batching and shuffling. It also supports range-based `for` loops via internal iterators.

```cpp
int seed = 0;
Tensor inputs = uniform_real(0, 1, {100, 4});
Tensor targets = uniform_real(0, 1, {100, 4});
data::TensorDataset dataset(inputs, targets);

auto [train_data, validate_data, test_data] = data::random_split(std::move(dataset), seed, 70, 20, 10);

// The train loader shuffles its batches; the validate/test loaders do not
int batch_size = 4;
auto train_loader = data::DataLoader(train_data, batch_size, true, seed);
auto validate_loader = data::DataLoader(validate_data, batch_size, false);
auto test_loader = data::DataLoader(test_data, batch_size, false);

// Iterate over the dataloader.
// Since the dataset was built from two tensors, each iterated value is a tuple of two batched tensors.
for (auto [inp, tar] : train_loader) {
    // inp has shape (batch_size, 4) unless the last batch is partial
    // tar has shape (batch_size, 4) unless the last batch is partial
}
```