optimization datasets usage #3470

Open
Alexandr-Solovev wants to merge 66 commits into uxlfoundation:main from Alexandr-Solovev:dev/asolovev_optimization_datasets

Conversation

@Alexandr-Solovev (Contributor) commented Jan 6, 2026

Description

PR: Dataset Cleanup, De-duplication, and Example Parameter Refactoring

This PR introduces a cleanup and restructuring of datasets and examples, along with parameter de-hardcoding and new data split utilities.

Changes

  • Moved all datasets into a new unified directory: root/data (or simply data) and removed duplicate files.
  • Removed pre-split datasets for online and distributed modes (CSR format), except for the implicit_als case, where they are still required.
  • Eliminated hardcoded rank parameters in samples.
  • Updated all examples to minimize hardcoded parameters and replace them with dynamic configuration where possible.
  • Added reusable data-splitting functions for DAAL examples and samples.

Impact

  • Reduces repository size by approximately 50 MB.
  • Reduces make onedal_dpc build size by approximately 120 MB.

Checklist:

Completeness and readability

  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes or created a separate PR with updates and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended the testing suite if new functionality was introduced in this PR.

Performance

  • I have measured performance for affected algorithms using scikit-learn_bench and provided at least a summary table with measured data, if performance change is expected.
  • I have provided justification why performance and/or quality metrics have changed or why changes are not expected.
  • I have extended the benchmarking suite and provided a corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.

@Alexandr-Solovev Alexandr-Solovev added the dependencies Pull requests that update a dependency file label Jan 6, 2026
@david-cortes-intel (Contributor)

I think moving the data to a folder named 'dev' makes it harder to find.

@Alexandr-Solovev (Contributor, Author)

> I think moving the data to a folder named 'dev' makes it harder to find.

I’m open to moving the data to a different folder. My first suggestion was oneDAL/data, but I’m not sure. Open to other suggestions, @Vika-F.

@david-cortes-intel (Contributor)

> I think moving the data to a folder named 'dev' makes it harder to find.

> I’m open to moving the data to a different folder. My first suggestion was oneDAL/data, but I’m not sure. Open to other suggestions, @Vika-F.

What's wrong with the current location?

@Alexandr-Solovev (Contributor, Author)

> I think moving the data to a folder named 'dev' makes it harder to find.

> I’m open to moving the data to a different folder. My first suggestion was oneDAL/data, but I’m not sure. Open to other suggestions, @Vika-F.

> What's wrong with the current location?

~100 MB of the datasets are duplicates, and I would like to merge/unify them into a single folder.

@david-cortes-intel (Contributor)

> I think moving the data to a folder named 'dev' makes it harder to find.

> I’m open to moving the data to a different folder. My first suggestion was oneDAL/data, but I’m not sure. Open to other suggestions, @Vika-F.

> What's wrong with the current location?

> ~100 MB of the datasets are duplicates, and I would like to merge/unify them into a single folder.

But why would that require moving them elsewhere? Perhaps they could be grouped by dataset instead of by algorithm, or something like that.

@Alexandr-Solovev (Contributor, Author)

> I think moving the data to a folder named 'dev' makes it harder to find.

> I’m open to moving the data to a different folder. My first suggestion was oneDAL/data, but I’m not sure. Open to other suggestions, @Vika-F.

> What's wrong with the current location?

> ~100 MB of the datasets are duplicates, and I would like to merge/unify them into a single folder.

> But why would that require moving them elsewhere? Perhaps they could be grouped by dataset instead of by algorithm, or something like that.

Okay, the problem is: right now we have four places where datasets are duplicated:

examples
  daal
    datasets
  oneapi
    datasets

samples
  daal
    datasets
  oneapi
    datasets

I want to remove the dataset duplication here. To me it makes sense to move them into a separate folder with common access for all samples and examples.

@david-cortes-intel (Contributor)

> To me it makes sense to move them into a separate folder with common access for all samples and examples

Could it be under a similar folder as where the CSV reader is?

@ethanglaser (Contributor)

Having a data/ folder in the root directory makes sense to me.

@Alexandr-Solovev (Contributor, Author)

> To me it makes sense to move them into a separate folder with common access for all samples and examples

> Could it be under a similar folder as where the CSV reader is?

I don't think so; we have separate CSV readers for oneDAL and DAAL, and they sit fairly deep in the tree (cpp/oneapi/dal/io/csv and cpp/daal/include/data_management/data_source/...), so that location could be complicated for users to find.

@Vika-F (Contributor) commented Jan 8, 2026

I am OK with placing all the data files in the ./data folder, but we need to check that the BOM generation is still OK, because the changes to the _release* directory structure might affect it.

@Alexandr-Solovev Alexandr-Solovev marked this pull request as ready for review February 17, 2026 08:30
Copilot AI review requested due to automatic review settings February 17, 2026 08:30

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@Vika-F Vika-F requested a review from Copilot February 17, 2026 10:16

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Comment on lines +97 to +104
size_t skip = rowStart;
while (skip > 0) {
    size_t s1 = trainDataSource.loadDataBlock(skip);
    size_t s2 = trainLabelSource.loadDataBlock(skip);
    if (s1 == 0 || s2 == 0)
        break;
    skip -= s1;
}

This loop requires a comment at least.
And maybe it is possible to load the data without this loop at all?


Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@Vika-F (Contributor) commented Feb 18, 2026

Copilot AI review requested due to automatic review settings February 20, 2026 08:39

Copilot AI left a comment

Copilot wasn't able to review this pull request. There are 300 or more changed files, try reducing the number of files in this pull request and requesting a review from Copilot again.

DataSource::doAllocateNumericTable,
DataSource::doDictionaryFromContext);

size_t skip = rowStart;

Please add a comment here, like in the other similar place:
/* Skip rows before rowStart */

Though I do not fully understand why the loop is required, and why loadDataBlock(rowStart) is not enough.

@Vika-F (Contributor) left a comment

Thanks for this restructuring!
The changes look good to me. Let's wait for the CI and LGTM!

@Alexandr-Solovev (Contributor, Author)

/intelci: run


4 participants