
Many files are not synced across archive and cache #19704

@miketheman

Noticed today while investigating other file-related behaviors.

Not all files are in both the archive (s3) and cache (b2) storages:

warehouse=> SELECT COUNT(*) AS file_count FROM release_files WHERE NOT cached;
 file_count
------------
    7725023
(1 row)

-- Even newer files are missing:

warehouse=> SELECT date_trunc('month', upload_time) AS month,
         COUNT(*) AS file_count
  FROM release_files
  WHERE NOT cached
    AND upload_time >= date_trunc('month', now()) - interval '3 years'
  GROUP BY month
  ORDER BY month;
        month        | file_count
---------------------+------------
 2023-03-01 00:00:00 |     178073
 2023-04-01 00:00:00 |     202665
 2023-05-01 00:00:00 |      81463
 2023-06-01 00:00:00 |        593
 2023-07-01 00:00:00 |        165
 2023-08-01 00:00:00 |        232
 2023-09-01 00:00:00 |        146
 2023-10-01 00:00:00 |        389
 2023-11-01 00:00:00 |        276
 2023-12-01 00:00:00 |        186
 2024-01-01 00:00:00 |        150
 2024-02-01 00:00:00 |         19
 2024-03-01 00:00:00 |        486
 2024-04-01 00:00:00 |       3173
 2024-05-01 00:00:00 |       1982
 2024-06-01 00:00:00 |        749
 2024-07-01 00:00:00 |        498
 2024-08-01 00:00:00 |        424
 2024-09-01 00:00:00 |        983
 2024-10-01 00:00:00 |       1097
 2024-11-01 00:00:00 |        858
 2024-12-01 00:00:00 |        209
 2025-01-01 00:00:00 |        199
 2025-02-01 00:00:00 |        456
 2025-03-01 00:00:00 |        320
 2025-04-01 00:00:00 |       1259
 2025-05-01 00:00:00 |        106
 2025-06-01 00:00:00 |        177
 2025-07-01 00:00:00 |         82
 2025-08-01 00:00:00 |         94
 2025-09-01 00:00:00 |        115
 2025-10-01 00:00:00 |         75
 2025-11-01 00:00:00 |        231
 2025-12-01 00:00:00 |        139
 2026-01-01 00:00:00 |        587
 2026-02-01 00:00:00 |         97
 2026-03-01 00:00:00 |         67
(37 rows)

The metric warehouse.packaging.files.not_cached has been pretty static for the past 12+ months:

[Image: graph of the warehouse.packaging.files.not_cached metric]

So the problem doesn't appear to be getting materially worse, which is great, but newer files are still not being synced to cache. This issue isn't about why that happens, but about how to proceed when it does.

In #13651 the archived column fell out of use, so it naturally has many false entries and I didn't use it in the query.

There doesn't seem to be a periodic task to reconcile the two storages. Was there ever one?

I've run a few iterations in a prod shell to confirm that the numbers change. However, the "batch" only updates once the task has completed, so it's risky to start a long-running command with more than ~1000 at a time: if interrupted, it has to start over.

A batch of 1000 takes ~11 minutes, so if we ran it continuously, we'd eventually get there in ~2-3 months.
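As a rough check on that estimate, the arithmetic can be sketched as below. The count and timing come from the numbers above; the function name is made up for illustration, and the sketch assumes batches run back to back with no newly uncached files arriving in the meantime.

```python
import math

NOT_CACHED = 7_725_023   # from the COUNT(*) query above
BATCH_SIZE = 1000        # batch size used in the prod shell runs
MINUTES_PER_BATCH = 11   # approximate observed duration of one batch


def days_to_catch_up(files, batch_size, minutes_per_batch):
    """Days needed to process `files` in back-to-back batches (hypothetical helper)."""
    batches = math.ceil(files / batch_size)
    return batches * minutes_per_batch / (60 * 24)


continuous_days = days_to_catch_up(NOT_CACHED, BATCH_SIZE, MINUTES_PER_BATCH)
print(f"{continuous_days:.0f} days")
```

That comes out at roughly two months of uninterrupted running, which is the low end of the ~2-3 month range.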

  1. Are we fine with having a persistent fallback, and not everything synced to cache?

    [Image: Fastly fetching from fallback]
  2. Should we set up a periodic reconcile task every 15 minutes to catch us up, and then reduce the frequency to hourly/daily/etc. once caught up?

  3. Something else entirely?
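If option 2 is pursued, one possible shape for the task is to record progress after each small chunk rather than once at the end, so an interrupted run loses at most one chunk of work. The following is only a sketch under stated assumptions: `sqlite3` stands in for the real database, `copy_to_cache` is a hypothetical placeholder for the actual archive-to-cache copy, and the `id` and `path` columns are assumed for illustration. None of this is taken from the warehouse codebase.

```python
import sqlite3


def copy_to_cache(path):
    """Hypothetical placeholder for copying one file from archive to cache."""
    return True


def reconcile(conn, limit=1000, chunk_size=100):
    """Sync up to `limit` uncached files, committing after every chunk."""
    synced = 0
    last_id = 0  # walk forward by id so failed files are not retried in this run
    while synced < limit:
        rows = conn.execute(
            "SELECT id, path FROM release_files "
            "WHERE NOT cached AND id > ? ORDER BY id LIMIT ?",
            (last_id, min(chunk_size, limit - synced)),
        ).fetchall()
        if not rows:
            break
        last_id = rows[-1][0]
        done = [(file_id,) for file_id, path in rows if copy_to_cache(path)]
        conn.executemany("UPDATE release_files SET cached = 1 WHERE id = ?", done)
        conn.commit()  # progress is durable here, even if the next chunk is interrupted
        synced += len(done)
    return synced


# Small in-memory demonstration with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE release_files (id INTEGER PRIMARY KEY, path TEXT, cached INTEGER)"
)
conn.executemany(
    "INSERT INTO release_files (path, cached) VALUES (?, 0)",
    [(f"file-{i}",) for i in range(250)],
)
conn.commit()

first_run = reconcile(conn, limit=100, chunk_size=40)
remaining = conn.execute(
    "SELECT COUNT(*) FROM release_files WHERE NOT cached"
).fetchone()[0]
print(first_run, remaining)
```

Because each run is bounded by `limit`, the same function could be scheduled every 15 minutes while catching up and less often afterwards, without changing the code.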
