Skip to content

Preflight canonical url#2012

Open
jkf276 wants to merge 1 commit intomainfrom
preflight-canonical-url
Open

Preflight canonical url#2012
jkf276 wants to merge 1 commit intomainfrom
preflight-canonical-url

Conversation

@jkf276
Copy link
Contributor

@jkf276 jkf276 commented Feb 18, 2026

Please ensure your pull request adheres to the following guidelines:

  • make sure to link the related issues in this description
  • when merging / squashing, make sure the fixed issue references are visible in the commits, for easy compilation of release notes
  • If data sources for any opportunity has been updated/added, please update the wiki for same opportunity.

Related Issues

Thanks for contributing!

@jkf276 jkf276 self-assigned this Feb 18, 2026
@jkf276 jkf276 force-pushed the preflight-canonical-url branch from 2c2b3f1 to 30e8814 Compare February 18, 2026 22:27
@github-actions
Copy link

This PR will trigger no release when merged.

@jkf276 jkf276 force-pushed the preflight-canonical-url branch 5 times, most recently from 9ee942b to f382f3e Compare February 23, 2026 17:21
re-enabled automatic testing meant for pre-flight canonical url test

canonical validation logic separated into its own method (validateCanoicalTag()

re-enable and re-added a bunch of automated test.
@jkf276 jkf276 force-pushed the preflight-canonical-url branch from f382f3e to 454d217 Compare February 23, 2026 17:30
@codecov
Copy link

codecov bot commented Feb 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!


/**
* Check if canonical is self-referenced (ignoring protocol and case)
*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This function does not check anything, this is a get function that returns the pathname lowercased. Please update the documentation accordingly.
  2. remove unnecessary whitespace.
  3. I suggest removing this function entirely (see comment below about validation)

Comment on lines +439 to +440
const normalizedCanonical = normalizeUrlExport(canonicalUrl);
const normalizedFinal = normalizeUrlExport(finalUrl);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove the normalizeUrlExport function and create a new function called isSelfReferencing that takes the two URLs and validates if they are the same url. This would be a cleaner approach isolating the logic of what validates two urls are the same.

Comment on lines +357 to +363
const url = scrapedObject.url || scrapedObject.finalUrl;
if (!url) {
log.warn(`[canonical] No URL found in S3 object: ${key}`);
return { url: null, canonicalUrl: null, checks: [] };
}

const finalUrl = scrapedObject.finalUrl || url;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If scrapedObject.url exists we set it to url, but then on line 363 we flip it and say if scrapedObject.finalUrl is available we us it over the line on 357. This is a bit confusing.

The logic has a subtle redundancy and could be misleading:
Line 357: const url = scrapedObject.url || scrapedObject.finalUrl
Line 363: const finalUrl = scrapedObject.finalUrl || url

If scrapedObject.url is missing but scrapedObject.finalUrl exists, then url and finalUrl will be the same value (both set to scrapedObject.finalUrl). The finalUrl fallback on line 363 becomes a no-op, and more importantly, the semantic distinction between "the URL we requested" and "the URL we landed on" is lost.

const url = scrapedObject.url;
if (!url) {
  log.warn(`[canonical] No URL found in S3 object: ${key}`);
  return { url: null, canonicalUrl: null, checks: [] };
}

const finalUrl = scrapedObject.finalUrl || url;

This preserves the semantic distinction: url is always the original requested URL, and finalUrl is the resolved destination (defaulting to url when no redirect occurred). If scrapedObject.url is genuinely absent, it's a real data problem worth rejecting cleanly — not silently masking by substituting finalUrl.

return { url: null, canonicalUrl: null, checks: [] };
}

const isPreview = isPreviewPage(baseURL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking isPreview here and using it way down in the code, it would be cleaner if we move this check closer to where isPreview is used.

const canonicalTagChecks = [];

// Check if canonical tag exists
if (!canonicalMetadata.exists || !canonicalUrl) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might need to be broken down into two checks as we might have metadata, but the url is missing and we are returning a response saying the tag is missing when it isn't.

if (!canonicalMetadata.exists) {
  canonicalTagChecks.push({
    check: CANONICAL_CHECKS.CANONICAL_TAG_MISSING.check,
    success: false,
    explanation: CANONICAL_CHECKS.CANONICAL_TAG_MISSING.explanation,
  });
} else if (!canonicalUrl) {
  // Tag exists but href is absent or empty
  canonicalTagChecks.push({
    check: CANONICAL_CHECKS.CANONICAL_TAG_NO_HREF.check,
    success: false,
    explanation: CANONICAL_CHECKS.CANONICAL_TAG_NO_HREF.explanation,
  });
}

This requires a dedicated CANONICAL_TAG_NO_HREF check constant, but it surfaces a genuinely distinct failure mode rather than silently mislabeling a malformed tag as a missing one.

}

// Check if the canonical URL has the same protocol as the base URL
if (!url.href.startsWith(base.protocol)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just check the url.protocol === base.protocol? The comment suggest this comparison but the code isn't doing it.

);
const isSelfReferenced = selfRefCheck?.success === true;

// if self-referenced - skip accessibility
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reading this comment had me think that we are skipping an accessibility audit, not the fact that the url was a 200.


// if self-referenced - skip accessibility
if (isSelfReferenced) {
checks.push({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we have a set of checks that pass when the url is valid, this allows for future conditions to be added with easy and maintainability:

const ACCESSIBILITY_CHECKS = [
  CANONICAL_CHECKS.CANONICAL_URL_STATUS_OK.check,
  CANONICAL_CHECKS.CANONICAL_URL_NO_REDIRECT.check,
];

And then we could:

if (isSelfReferenced) {
  checks.push(...ACCESSIBILITY_CHECKS.map((check) => ({ check, success: true })));
} else {
  ...
}

Comment on lines +481 to +490
// if not self-referenced - validate accessibility

const options = await getPreviewAuthOptions(isPreview, baseURL, site, context, log);

const urlContentCheck = await validateCanonicalRecursively(canonicalUrl, log, options);
checks.push(...urlContentCheck);
}
}

return { url, canonicalUrl, checks };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatting is required.

return { url, canonicalUrl, checks };
} catch (error) {
log.error(`[canonical] Error processing scraped content from ${key}: ${error.message}`);
return { url: null, canonicalUrl: null, checks: [] };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Define a default object here that can be used throughout the code base that has these values and return it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants