This repository was archived by the owner on Feb 27, 2026. It is now read-only.
GMM with Mini-Batches #51
Hi,
Like #7 and #19, I am trying to fit a GMM to a large dataset of shape [10^10, 50] and want to (in fact, need to) use mini-batching.
However, in contrast to the previous answers, gmm.fit only accepts a TensorLike and won't work with my data, which is a torch.utils.data.DataLoader. Even if I pass in a torch.utils.data.Dataset, it only computes a GMM on the first batch.
What is the preferred way to do what I want to do?
Ideally, I would want my code to work like this:

```python
from pycave.bayes import GaussianMixture as GMM
from torch.utils.data import Dataset, DataLoader

data = Data(DATA_PATH).dataloader(batch_size=256)
assert isinstance(data, DataLoader)

gmm = GMM(num_components=3, batch_size=256, trainer_params=dict(accelerator='gpu', devices=1))
class_labels = gmm.fit_predict(data)
means, stds = gmm.model_.means, gmm.model_.covariances
```

Manually changing the code in gmm/estimator.py (among others) from
```python
num_features = len(data[0])
...
loader = DataLoader(
    dataset_from_tensors(data),
    batch_size=self.batch_size or len(data),
    collate_fn=collate_tensor,
)
is_batch_training = self._num_batches_per_epoch(loader) == 1  # Also, shouldn't this be > anyway?
```

to
```python
num_features = data.dataset[0].shape[1]
...
loader = data
is_batch_training = True
```

allows for error-free fitting and prediction, but I am not sure whether the output is trustworthy.
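On the inline question about the `== 1` comparison: a quick standalone check suggests that mini-batch training corresponds to the per-epoch batch count being greater than 1, with exactly 1 batch per epoch being the full-batch case. This is a minimal sketch; `num_batches_per_epoch` below is a hypothetical stand-in for `self._num_batches_per_epoch(loader)`, not PyCave's actual implementation:

```python
import math

# hypothetical stand-in for self._num_batches_per_epoch(loader)
def num_batches_per_epoch(num_items: int, batch_size: int) -> int:
    return math.ceil(num_items / batch_size)

# a batch size smaller than the dataset yields several batches per epoch
assert num_batches_per_epoch(1000, 256) == 4   # mini-batch training
# only the full-batch case yields exactly one batch per epoch
assert num_batches_per_epoch(1000, 1000) == 1  # full-batch training
```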
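On the trustworthiness concern: if each EM iteration accumulates the sufficient statistics over all batches before performing the M-step, the result is mathematically identical to full-data EM, so iterating over a DataLoader once per epoch is sound in principle. Below is a self-contained NumPy sketch of one such exact iteration (this is an illustration of the idea, not PyCave's implementation; the shared log-density constant is dropped because it cancels under normalization, and a small ridge is added to the covariances for numerical stability):

```python
import numpy as np

def batched_em_step(batches, means, covs, weights):
    """One exact EM iteration built from per-batch sufficient statistics.

    Because the statistics are summed over all batches before the M-step,
    the update is identical to full-data EM, regardless of batch size.
    """
    k, d = means.shape
    resp_sum = np.zeros(k)          # sum of responsibilities per component
    sum_x = np.zeros((k, d))        # responsibility-weighted sums
    sum_xx = np.zeros((k, d, d))    # responsibility-weighted outer products

    for x in batches:  # x has shape (batch_size, d)
        # E-step: log responsibilities under the current parameters
        # (the constant -d/2 * log(2*pi) cancels after normalization)
        log_r = np.stack([
            np.log(weights[j])
            - 0.5 * np.log(np.linalg.det(covs[j]))
            - 0.5 * np.einsum('ni,ij,nj->n', x - means[j],
                              np.linalg.inv(covs[j]), x - means[j])
            for j in range(k)
        ], axis=1)
        log_r -= log_r.max(axis=1, keepdims=True)  # stabilize the exp
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # accumulate the batch's sufficient statistics
        resp_sum += r.sum(axis=0)
        sum_x += r.T @ x
        sum_xx += np.einsum('nk,ni,nj->kij', r, x, x)

    # M-step from the aggregated statistics
    weights = resp_sum / resp_sum.sum()
    means = sum_x / resp_sum[:, None]
    covs = (sum_xx / resp_sum[:, None, None]
            - np.einsum('ki,kj->kij', means, means)
            + 1e-6 * np.eye(d))  # small ridge for numerical stability
    return means, covs, weights
```

Running several such iterations over a list of batches recovers the component parameters just as a single full-data fit would, which is the behavior one would want the DataLoader path to reproduce.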