Implement flat-first archiving for Action reports to improve limiting and memory consumption#24191
Implement flat-first archiving for Action reports to improve limiting and memory consumption#24191
Conversation
f4c6778 to
a5ef0f1
Compare
82902eb to
f4fca88
Compare
…or aggregating higher periods with mixed tables (flat + hierachical)
aafbd67 to
742c6b4
Compare
nathangavin
left a comment
There was a problem hiding this comment.
It generally looks good to me. With regards to testing, would it be feasible to directly compare legacy hierarchy aggregation to flat aggregation to determine that we are getting the same output?
Other than that I don't see any issues.
|
@nathangavin you can check the test run linked in the PR description. That one was running with flat-first-archiving enabled for all tests. As I guess the mid-term plan is to enable the flat-first-archiving for everyone and remove the old approach, we don't need any further tests, but simply switch them to run with flat-first-archiving at some point. |
tsteur
left a comment
There was a problem hiding this comment.
@sgiehl had a very quick initial look. Will spend more time eventually to properly understand logic. Feel free to ignore what doesn't make sense.
BTW do we need to update anything on https://developer.matomo.org/guides/writing-a-record-builder ?
plugins/Actions/ArchivingHelper.php
Outdated
| $lastSegment = (string) $segments[$lastIndex]; | ||
|
|
||
| if ($actionType === Action::TYPE_PAGE_URL) { | ||
| if (strpos($lastSegment, '/') === 0) { |
There was a problem hiding this comment.
would this need to use action_url_category_delimiter?
There was a problem hiding this comment.
This is a leaf node prefix, rather then a delimiter. Basically the reverse handling of getActionExplodedNames:
matomo/plugins/Actions/ArchivingHelper.php
Lines 618 to 659 in 2a1891f
plugins/Actions/ArchivingHelper.php
Outdated
| $segments[$lastIndex] = $lastSegment; | ||
| $delimiter = self::$actionUrlCategoryDelimiter; | ||
| } else { | ||
| if (strpos($lastSegment, ' ') === 0) { |
There was a problem hiding this comment.
would this need to use action_title_category_delimiter?
|
|
||
| if ($recordName === Archiver::PAGE_URLS_RECORD_NAME) { | ||
| $prefix = $archiveProcessor->getParams()->getSite()->getMainUrl(); | ||
| $prefix = rtrim($prefix, '/') . '/'; |
There was a problem hiding this comment.
Do we need to use action_url_category_delimiter here?
There was a problem hiding this comment.
Not in this case. This one just provides the URL that will be used for links. You could e.g. define the delimiter to be ?, to only break at query strings. But for the final url you would still prefix http://example.com/ rather then http://example.com?
| return $this->multiplePeriodTransform; | ||
| } | ||
|
|
||
| public function setBuiltFromFlatRecord( |
There was a problem hiding this comment.
for those new methods could we add some documentation to better understand when and how to use them?
| } | ||
|
|
||
| if (!in_array($record->getName(), $requestedReports)) { | ||
| $requestedReports[] = $record->getName(); |
There was a problem hiding this comment.
sometimes could be good to add some comments to less needing to figure out yourself why this is needed. I think I understood but takes some time to process.
There was a problem hiding this comment.
added some comments here.
| return true; | ||
| } | ||
|
|
||
| protected function beforeInsertBuiltFromFlatHierarchyRecord( |
There was a problem hiding this comment.
I'm assuming this is becoming API for plugins to add extra action, be good to add some comment to the method here too.
| } | ||
| } | ||
|
|
||
| protected function aggregateBuiltFromFlatRecordForNonDay( |
There was a problem hiding this comment.
do you think for all these methods it's possible to add some tests? even if it's just a simple happy test? there's quite a lot of logic in these methods
There was a problem hiding this comment.
generated some additional tests with code for those methods.
| $generalConfig['datatable_archiving_maximum_rows_site_search'] = 5; | ||
| } | ||
|
|
||
| protected static function setUpConfigOptionsFlatFirst() |
There was a problem hiding this comment.
Do we already have tests with a low row limit to see if "Other" row is still correctly being aggregated when it comes to flat archiving? I didn't look. Just thought to check.
When running the other existing action tests with flat archiving enabled, would we expect any changes in reporting data? I assume it would be good to run the same tests with flat archiving enabled and without?
There was a problem hiding this comment.
I see your recent comment that we did a test run with the flag enabled and we'll remove the flag eventually so that may be a fine for now assuming we remove the flag soon 👍 Probably no change needed then.
| continue; | ||
| } | ||
|
|
||
| $tableToAddTo->addDataTable($blobTable); |
There was a problem hiding this comment.
this logic around here seems a bit similar with getAggregatedDataTableMapFromBlobs but I assume it is slightly differently and can't be reused?
There was a problem hiding this comment.
I tried to refactor this into an external helper to reduce code duplication.
|
|
||
| protected function getSubtableIdFromBlobName(string $recordName): ?int | ||
| { | ||
| $parts = explode('_', $recordName); |
There was a problem hiding this comment.
I think the same logic we got in ArchiveProcessor::getSubtableIdFromBlobName(). Should we make these reusable? Public method in a new class or so and then have tests for it?
There was a problem hiding this comment.
also refactored into a external helper
Summary
This PR introduces flat-first archiving for Action reports, controlled by a new config flag:
datatable_archiving_maximum_rows_actions_flatEnable / Disable
0or not set> 0A useful initial value is
50000, which matches the current query/report row limit.Archiving approach changes
With flat-first enabled:
merged so re-archiving everything is not required.
Limits and
Othersbehaviordatatable_archiving_maximum_rows_actions_flat.Othersis determined during flat limiting, so behavior is more consistent and avoids extra secondaryOtherscreatedby hierarchical re-limiting.
Impact on memory and archiving time
For high-cardinality Action data (many distinct URLs, changing month-to-month), this reduces resource pressure by
avoiding repeated deep hierarchical merges during period aggregation.
Testing
The changes contain a couple of tests to proof that flat-first archiving in general works. However, all tests were run with flat-archiving enabled, to ensure no other tests regress unexpectedly. Most of those failures were fixed by applying unrelated fixes (see linked PRs).
There is currently one mismatch in tests remaining. See https://github.com/matomo-org/matomo/actions/runs/23338988295
URL Metadata / Segment Mismatch Note
There is an existing inconsistency in how Matomo merges duplicate rows with the same label but different metadata (for
example URL rows where one variant has different segment metadata, empty segment metadata, or different include-depth
behavior like include2 vs include4).
With flat-first archiving, this can become more visible because row merging happens through a slightly different path
than pure legacy hierarchy aggregation. The underlying issue is not specific to flat-first itself: when multiple
candidate rows exist, metadata winner selection is not fully deterministic/explicit in all merge paths.
Practical impact:
order/source.
during row aggregation.
Checklist
Review