Skip to content

fix(cwv): filter out 4xx URLs from CWV opportunities (SITES-40803)#2034

Open
tathagat2241 wants to merge 2 commits intomainfrom
cwv-filter-4xx-urls
Open

fix(cwv): filter out 4xx URLs from CWV opportunities (SITES-40803)#2034
tathagat2241 wants to merge 2 commits intomainfrom
cwv-filter-4xx-urls

Conversation

@tathagat2241
Copy link
Contributor

Summary

Filters out URLs that return 4xx (or where HEAD request fails) so they are not turned into CWV opportunities. Addresses issue where 404/clientlib/sling URLs were appearing as CWV opportunities.

Changes

  • src/cwv/cwv-audit-result.js: Added isUrl4xxOrFailed(url, log) that performs a HEAD request to the URL; if the response is 4xx or the request fails (timeout/network), the URL is excluded. After building the filtered CWV list (homepage + top N + threshold), URL-type entries are checked via HEAD and 4xx/failed URLs are removed before the audit result is returned. Group entries are not HEAD-checked.
  • src/cwv/opportunity-sync.js: Guard when computing maxOrganicForUrls so that when there are no URL entries (e.g. all filtered as 4xx), we use 0 instead of Math.max(...[]) (-Infinity).
  • test/audits/cwv/cwv-audit-result.test.js: Unit tests for isUrl4xxOrFailed (404, 403, 410, 200, 500, fetch throw) and for buildCWVAuditResult (4xx URLs excluded from result, group entries kept without HEAD check).

Behavior

  • HEAD requests run only for URL-type entries in the already-filtered CWV list (not the full RUM dataset).
  • 4xx URLs and URLs where HEAD fails are excluded from auditResult.cwv, so they never become suggestions or part of the CWV opportunity.
  • Existing flow (RUM → filter → persist → sync opportunities/suggestions) is unchanged; the only change is that the list passed forward no longer includes 4xx/failed URLs.

Please ensure your pull request adheres to the following guidelines:

  • make sure to link the related issues in this description
  • when merging / squashing, make sure the fixed issue references are visible in the commits, for easy compilation of release notes
  • If data sources for any opportunity has been updated/added, please update the wiki for same opportunity.

Related Issues

https://jira.corp.adobe.com/browse/SITES-40803

Thanks for contributing!

@github-actions
Copy link

This PR will trigger a patch release when merged.

Copy link

@ramboz ramboz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall!

We could consider centralizing the logic though to limit code duplication across tracks and simplify maintenance long-term

redirect: 'follow',
signal: controller.signal,
headers: {
'User-Agent': 'Mozilla/5.0 (compatible; Spacecat-Audit/1.0)',
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should maybe centralize the user agents we use somewhere so everybody uses the same one.

I also noticed that the internal links also use a HEAD request approach to validate links:

async function checkLinkWithHead(url, log) {

Maybe worth thinking about creating a more centralized helper for both actually to avoid code duplication

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants