optimization datasets usage #3470

Open
Alexandr-Solovev wants to merge 66 commits into uxlfoundation:main from Alexandr-Solovev:dev/asolovev_optimization_datasets

Conversation

@Alexandr-Solovev (Contributor) commented Jan 6, 2026

Description

PR: Dataset Cleanup, De-duplication, and Example Parameter Refactoring

This PR introduces a cleanup and restructuring of datasets and examples, along with parameter de-hardcoding and new data split utilities.

Changes

  • Moved all datasets into a new unified directory: root/data (or simply data) and removed duplicate files.
  • Removed pre-split datasets for online and distributed modes (CSR format), except for the implicit_als case, where they are still required.
  • Eliminated hardcoded rank parameters in samples.
  • Updated all examples to minimize hardcoded parameters and replace them with dynamic configuration where possible.
  • Added reusable data-splitting functions for DAAL examples and samples.

Impact

  • Reduces repository size by approximately 50 MB.
  • Reduces make onedal_dpc build size by approximately 120 MB.

Checklist:

Completeness and readability

  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes or created a separate PR with updates and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended the testing suite if new functionality was introduced in this PR.

Performance

  • I have measured performance for affected algorithms using scikit-learn_bench and provided at least a summary table with measured data, if performance change is expected.
  • I have provided justification why performance and/or quality metrics have changed or why changes are not expected.
  • I have extended the benchmarking suite and provided a corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.

@Alexandr-Solovev Alexandr-Solovev added the dependencies Pull requests that update a dependency file label Jan 6, 2026
@david-cortes-intel (Contributor)

I think moving the data to a folder named 'dev' makes it harder to find.

@Alexandr-Solovev (Contributor, Author)

> I think moving the data to a folder named 'dev' makes it harder to find.

I’m open to moving the data to a different folder. My first suggestion was oneDAL/data, but I’m not sure. Open to other suggestions, @Vika-F.

@david-cortes-intel (Contributor)

> I think moving the data to a folder named 'dev' makes it harder to find.

> I’m open to moving the data to a different folder. My first suggestion was oneDAL/data, but I’m not sure. Open to other suggestions, @Vika-F.

What's wrong with the current location?

@Alexandr-Solovev (Contributor, Author)

> I think moving the data to a folder named 'dev' makes it harder to find.

> I’m open to moving the data to a different folder. My first suggestion was oneDAL/data, but I’m not sure. Open to other suggestions, @Vika-F.

> What's wrong with the current location?

~100 MB of the datasets are duplicates, and I would like to merge/unify them into a single folder.

@david-cortes-intel (Contributor)

> I think moving the data to a folder named 'dev' makes it harder to find.

> I’m open to moving the data to a different folder. My first suggestion was oneDAL/data, but I’m not sure. Open to other suggestions, @Vika-F.

> What's wrong with the current location?

> ~100 MB of the datasets are duplicates, and I would like to merge/unify them into a single folder.

But why would that require moving them elsewhere? Perhaps they could be grouped by dataset instead of by algorithm, or something like that.

@Alexandr-Solovev (Contributor, Author)

> I think moving the data to a folder named 'dev' makes it harder to find.

> I’m open to moving the data to a different folder. My first suggestion was oneDAL/data, but I’m not sure. Open to other suggestions, @Vika-F.

> What's wrong with the current location?

> ~100 MB of the datasets are duplicates, and I would like to merge/unify them into a single folder.

> But why would that require moving them elsewhere? Perhaps they could be grouped by dataset instead of by algorithm, or something like that.

Okay, the problem is: right now we have four places where datasets are duplicated:

examples
  daal
    datasets
  oneapi
    datasets

samples
  daal
    datasets
  oneapi
    datasets

I want to remove the dataset duplication here. To me it makes sense to move them into a separate folder with common access for all samples and examples.

@david-cortes-intel (Contributor)

> To me it makes sense to move them into a separate folder with common access for all samples and examples

Could it be under a similar folder as where the CSV reader is?

@ethanglaser (Contributor)

Having a data/ folder in the root directory makes sense to me.

@Alexandr-Solovev (Contributor, Author)

> To me it makes sense to move them into a separate folder with common access for all samples and examples

> Could it be under a similar folder as where the CSV reader is?

I don't think so; we have separate CSV readers for oneDAL and DAAL, and they sit fairly deep in the tree (cpp/oneapi/dal/io/csv and cpp/daal/include/data_management/data_source/...), so that location could be complicated for users to find.

@Vika-F (Contributor) commented Jan 8, 2026

I am OK with placing all the data files in the ./data folder, but we need to check that the BOM generation is still OK, because the changes to the _release* directory structure might affect it.

@Alexandr-Solovev Alexandr-Solovev marked this pull request as ready for review February 17, 2026 08:30
Copilot AI review requested due to automatic review settings February 17, 2026 08:30

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@Vika-F Vika-F requested a review from Copilot February 17, 2026 10:16

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Comment on lines +97 to +104
size_t skip = rowStart;
while (skip > 0) {
    size_t s1 = trainDataSource.loadDataBlock(skip);
    size_t s2 = trainLabelSource.loadDataBlock(skip);
    if (s1 == 0 || s2 == 0)
        break;
    skip -= s1;
}

This loop requires a comment at least.
And maybe it is possible to load the data without this loop at all?


Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@Vika-F (Contributor) commented Feb 18, 2026

Copilot AI review requested due to automatic review settings February 20, 2026 08:39

Copilot AI left a comment

Copilot wasn't able to review this pull request. There are 300 or more changed files, try reducing the number of files in this pull request and requesting a review from Copilot again.

DataSource::doAllocateNumericTable,
DataSource::doDictionaryFromContext);

size_t skip = rowStart;

Please add a comment here, like in the other similar place:
/* Skip rows before rowStart */

Though I do not fully understand why the loop is required, and why loadDataBlock(rowStart) is not enough.

@Vika-F (Contributor) left a comment

Thanks for this restructuring!
The changes look good to me. Let's wait for the CI and LGTM!

@Alexandr-Solovev (Contributor, Author)

/intelci: run


4 participants