Skip to content

Implement flat-first archiving for Action reports to improve limiting and memory consumption#24191

Open
sgiehl wants to merge 15 commits into5.x-devfrom
archivetest
Open

Implement flat-first archiving for Action reports to improve limiting and memory consumption#24191
sgiehl wants to merge 15 commits into5.x-devfrom
archivetest

Conversation

@sgiehl
Copy link
Member

@sgiehl sgiehl commented Mar 9, 2026

Summary

This PR introduces flat-first archiving for Action reports, controlled by a new config flag:

  • datatable_archiving_maximum_rows_actions_flat

Enable / Disable

  • Disabled (legacy behavior): flag is 0 or not set
    • Only hierarchical Action tables are archived (same as before).
  • Enabled (new behavior): flag is > 0
    • A flat Action archive is built first and limited to this configured row count.
    • Hierarchical tables are then built from this already-limited flat data.

A useful initial value is 50000, which matches the current query/report row limit.

Archiving approach changes

With flat-first enabled:

  1. Day Action archives build a flat limited table first.
  2. The hierarchy is built from that flat table (instead of building hierarchy first and flattening later).
  3. Higher periods (week/month/year) aggregate from flat archives.
  4. For mixed historical data, if a period has no flat archive yet, existing hierarchical archives are flattened and
    merged so re-archiving everything is not required.

Limits and Others behavior

  • The main limit is now applied at the flat table level using datatable_archiving_maximum_rows_actions_flat.
  • No additional hierarchical limiting is applied afterward (hierarchy is built from already-limited data).
  • Others is determined during flat limiting, so behavior is more consistent and avoids extra secondary Others created
    by hierarchical re-limiting.

Impact on memory and archiving time

For high-cardinality Action data (many distinct URLs, changing month-to-month), this reduces resource pressure by
avoiding repeated deep hierarchical merges during period aggregation.

  • Memory: lower peak memory usage during aggregation.
  • Archiving time: typically faster/more stable for larger periods.
  • Tradeoff: additional flat archive data is stored, but bounded by the configured limit.

Testing

The changes contain a couple of tests to proof that flat-first archiving in general works. However, all tests were run with flat-archiving enabled, to ensure no other tests regress unexpectedly. Most of those failures were fixed by applying unrelated fixes (see linked PRs).
There is currently one mismatch in tests remaining. See https://github.com/matomo-org/matomo/actions/runs/23338988295

URL Metadata / Segment Mismatch Note

There is an existing inconsistency in how Matomo merges duplicate rows with the same label but different metadata (for
example URL rows where one variant has different segment metadata, empty segment metadata, or different include-depth
behavior like include2 vs include4).

With flat-first archiving, this can become more visible because row merging happens through a slightly different path
than pure legacy hierarchy aggregation. The underlying issue is not specific to flat-first itself: when multiple
candidate rows exist, metadata winner selection is not fully deterministic/explicit in all merge paths.

Practical impact:

  • Metric totals can still be correct.
  • Returned metadata fields (especially segment metadata for URL rows) may differ in edge cases depending on merge
    order/source.
  • This is a pre-existing merge-policy gap and should be fixed separately by defining deterministic metadata precedence
    during row aggregation.

Checklist

  • [✔] I have understood, reviewed, and tested all AI outputs before use
  • [✔] All AI instructions respect security, IP, and privacy rules

Review

@sgiehl sgiehl force-pushed the archivetest branch 2 times, most recently from aafbd67 to 742c6b4 Compare March 20, 2026 12:24
@sgiehl sgiehl changed the title Use flat-first archiving for Action reports to improve limitting and memory consumption Implement flat-first archiving for Action reports to improve limitting and memory consumption Mar 20, 2026
@sgiehl sgiehl changed the title Implement flat-first archiving for Action reports to improve limitting and memory consumption Implement flat-first archiving for Action reports to improve limiting and memory consumption Mar 20, 2026
@sgiehl sgiehl requested review from a team, diosmosis and tsteur March 20, 2026 12:26
@sgiehl sgiehl marked this pull request as ready for review March 20, 2026 12:26
Copy link
Contributor

@nathangavin nathangavin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It generally looks good to me. With regards to testing, would it be feasible to directly compare legacy hierarchy aggregation to flat aggregation to determine that we are getting the same output?

Other than that I don't see any issues.

@sgiehl
Copy link
Member Author

sgiehl commented Mar 24, 2026

@nathangavin you can check the test run linked in the PR description. That one was running with flat-first-archiving enabled for all tests. As I guess the mid-term plan is to enable the flat-first-archiving for everyone and remove the old approach, we don't need any further tests, but simply switch them to run with flat-first-archiving at some point.

Copy link
Member

@tsteur tsteur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sgiehl had a very quick initial look. Will spend more time eventually to properly understand logic. Feel free to ignore what doesn't make sense.

BTW do we need to update anything on https://developer.matomo.org/guides/writing-a-record-builder ?

$lastSegment = (string) $segments[$lastIndex];

if ($actionType === Action::TYPE_PAGE_URL) {
if (strpos($lastSegment, '/') === 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would this need to use action_url_category_delimiter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a leaf node prefix, rather then a delimiter. Basically the reverse handling of getActionExplodedNames:

public static function getActionExplodedNames($name, $type, $urlPrefix = null)
{
// Site Search does not split Search keywords
if ($type == Action::TYPE_SITE_SEARCH) {
return array($name);
}
$name = str_replace("\n", "", $name);
if ($type == Action::TYPE_PAGE_TITLE && self::$actionTitleCategoryDelimiter === '') {
if ($name === '' || $name === false || $name === null || trim($name) === '') {
$name = self::getUnknownActionName($type);
}
return array(' ' . trim($name));
}
$name = self::parseNameFromPageUrl($name, $type, $urlPrefix);
// outlinks and downloads
if (is_array($name)) {
return $name;
}
$split = self::splitNameByDelimiter($name, $type);
if (empty($split)) {
$defaultName = self::getUnknownActionName($type);
return array(trim($defaultName));
}
$lastPageName = end($split);
// we are careful to prefix the page URL / name with some value
// so that if a page has the same name as a category
// we don't merge both entries
if ($type != Action::TYPE_PAGE_TITLE) {
$lastPageName = '/' . $lastPageName;
} else {
$lastPageName = ' ' . $lastPageName;
}
$split[count($split) - 1] = $lastPageName;
return array_values($split);
}

$segments[$lastIndex] = $lastSegment;
$delimiter = self::$actionUrlCategoryDelimiter;
} else {
if (strpos($lastSegment, ' ') === 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would this need to use action_title_category_delimiter?


if ($recordName === Archiver::PAGE_URLS_RECORD_NAME) {
$prefix = $archiveProcessor->getParams()->getSite()->getMainUrl();
$prefix = rtrim($prefix, '/') . '/';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to use action_url_category_delimiter here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this case. This one just provides the URL that will be used for links. You could e.g. define the delimiter to be ?, to only break at query strings. But for the final url you would still prefix http://example.com/ rather then http://example.com?

return $this->multiplePeriodTransform;
}

public function setBuiltFromFlatRecord(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for those new methods could we add some documentation to better understand when and how to use them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some documentation

}

if (!in_array($record->getName(), $requestedReports)) {
$requestedReports[] = $record->getName();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sometimes could be good to add some comments to less needing to figure out yourself why this is needed. I think I understood but takes some time to process.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some comments here.

return true;
}

protected function beforeInsertBuiltFromFlatHierarchyRecord(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming this is becoming API for plugins to add extra action, be good to add some comment to the method here too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}
}

protected function aggregateBuiltFromFlatRecordForNonDay(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you think for all these methods it's possible to add some tests? even if it's just a simple happy test? there's quite a lot of logic in these methods

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generated some additional tests with code for those methods.

$generalConfig['datatable_archiving_maximum_rows_site_search'] = 5;
}

protected static function setUpConfigOptionsFlatFirst()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we already have tests with a low row limit to see if "Other" row is still correctly being aggregated when it comes to flat archiving? I didn't look. Just thought to check.

When running the other existing action tests with flat archiving enabled, would we expect any changes in reporting data? I assume it would be good to run the same tests with flat archiving enabled and without?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your recent comment that we did a test run with the flag enabled and we'll remove the flag eventually so that may be a fine for now assuming we remove the flag soon 👍 Probably no change needed then.

continue;
}

$tableToAddTo->addDataTable($blobTable);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logic around here seems a bit similar with getAggregatedDataTableMapFromBlobs but I assume it is slightly differently and can't be reused?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to refactor this into an external helper to reduce code duplication.


protected function getSubtableIdFromBlobName(string $recordName): ?int
{
$parts = explode('_', $recordName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the same logic we got in ArchiveProcessor::getSubtableIdFromBlobName(). Should we make these reusable? Public method in a new class or so and then have tests for it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also refactored into a external helper

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants