Description
Noticed today, when investigating other file-related behaviors
Not all files are in both the archive (s3) and cache (b2) storages:
warehouse=> SELECT COUNT(*) AS file_count FROM release_files WHERE NOT cached;
file_count
------------
7725023
(1 row)
-- Even newer files are missing:
warehouse=> SELECT date_trunc('month', upload_time) AS month,
COUNT(*) AS file_count
FROM release_files
WHERE NOT cached
AND upload_time >= date_trunc('month', now()) - interval '3 years'
GROUP BY month
ORDER BY month;
month | file_count
---------------------+------------
2023-03-01 00:00:00 | 178073
2023-04-01 00:00:00 | 202665
2023-05-01 00:00:00 | 81463
2023-06-01 00:00:00 | 593
2023-07-01 00:00:00 | 165
2023-08-01 00:00:00 | 232
2023-09-01 00:00:00 | 146
2023-10-01 00:00:00 | 389
2023-11-01 00:00:00 | 276
2023-12-01 00:00:00 | 186
2024-01-01 00:00:00 | 150
2024-02-01 00:00:00 | 19
2024-03-01 00:00:00 | 486
2024-04-01 00:00:00 | 3173
2024-05-01 00:00:00 | 1982
2024-06-01 00:00:00 | 749
2024-07-01 00:00:00 | 498
2024-08-01 00:00:00 | 424
2024-09-01 00:00:00 | 983
2024-10-01 00:00:00 | 1097
2024-11-01 00:00:00 | 858
2024-12-01 00:00:00 | 209
2025-01-01 00:00:00 | 199
2025-02-01 00:00:00 | 456
2025-03-01 00:00:00 | 320
2025-04-01 00:00:00 | 1259
2025-05-01 00:00:00 | 106
2025-06-01 00:00:00 | 177
2025-07-01 00:00:00 | 82
2025-08-01 00:00:00 | 94
2025-09-01 00:00:00 | 115
2025-10-01 00:00:00 | 75
2025-11-01 00:00:00 | 231
2025-12-01 00:00:00 | 139
2026-01-01 00:00:00 | 587
2026-02-01 00:00:00 | 97
2026-03-01 00:00:00 | 67
(37 rows)

The metric warehouse.packaging.files.not_cached has been pretty static for the past 12+ months.
So the problem doesn't appear to be getting materially worse, which is great, but fresher files are still not being synced to cache. This issue isn't currently concerned with why that happens, but rather with how to proceed when it does.
The archived column fell out of use in #13651, so it naturally has many false entries; I therefore left it out of the query.
There doesn't seem to be a periodic task to reconcile - was there ever?
I've run a few iterations in a prod shell to confirm that the numbers change. However, the "batch" is only updated once the whole task completes, so it's risky to start a long-running command with more than ~1000 files at a time: if it's interrupted, it has to start over.
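One possible way to reduce that risk is to commit progress after each small chunk, so an interruption loses at most one chunk of work. The following is only a sketch of that idea, not the existing task: it uses an in-memory SQLite table as a stand-in for release_files, and `reconcile_in_chunks`, `copy_to_cache`, `total_limit`, and `chunk_size` are hypothetical names introduced for illustration.

```python
import sqlite3
from typing import Callable


def reconcile_in_chunks(
    conn: sqlite3.Connection,
    copy_to_cache: Callable[[int], None],  # hypothetical: copies one file to cache storage
    total_limit: int,
    chunk_size: int = 100,
) -> int:
    """Process up to total_limit uncached rows, committing after every chunk."""
    done = 0
    while done < total_limit:
        rows = conn.execute(
            "SELECT id FROM release_files WHERE NOT cached ORDER BY id LIMIT ?",
            (min(chunk_size, total_limit - done),),
        ).fetchall()
        if not rows:
            break  # nothing left to reconcile
        for (file_id,) in rows:
            copy_to_cache(file_id)
            conn.execute(
                "UPDATE release_files SET cached = 1 WHERE id = ?", (file_id,)
            )
        conn.commit()  # progress up to here survives an interruption
        done += len(rows)
    return done


# Stand-in data: 250 uncached rows in an in-memory table (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE release_files (id INTEGER PRIMARY KEY, cached INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO release_files (id, cached) VALUES (?, 0)",
    [(i,) for i in range(250)],
)
conn.commit()

copied: list[int] = []
processed = reconcile_in_chunks(conn, copied.append, total_limit=1000, chunk_size=100)
remaining = conn.execute(
    "SELECT COUNT(*) FROM release_files WHERE NOT cached"
).fetchone()[0]
print(processed, remaining)
```

With per-chunk commits, the overall batch size stops mattering for safety; it only bounds how long a single run lasts.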
A batch of 1000 takes ~11 minutes, so if we ran it continuously, we'd eventually get there in ~2-3 months.
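The estimate can be checked with a quick calculation. This sketch uses only the figures quoted in this issue (7,725,023 uncached files, batches of 1000, ~11 minutes per batch) and assumes batches run back to back with no failures or downtime; `estimate_days` is a helper name made up for the example.

```python
# Back-of-the-envelope estimate using the figures quoted in this issue.
UNCACHED_FILES = 7_725_023  # result of the COUNT(*) query
BATCH_SIZE = 1_000          # files per task run
MINUTES_PER_BATCH = 11      # observed, approximate


def estimate_days(total_files: int, batch_size: int, minutes_per_batch: float) -> float:
    """Return wall-clock days needed if batches run continuously."""
    batches = -(-total_files // batch_size)  # ceiling division
    return batches * minutes_per_batch / (60 * 24)


days = estimate_days(UNCACHED_FILES, BATCH_SIZE, MINUTES_PER_BATCH)
print(f"{days:.0f} days")
```

That comes out at roughly 59 days of uninterrupted running, which sits at the low end of the 2-3 month range once restarts and pauses are allowed for.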
