Preflight canonical url by jkf276 · Pull Request #2012 · adobe/spacecat-audit-worker

jkf276 · 2026-02-18T22:19:26Z

Please ensure your pull request adheres to the following guidelines:

make sure to link the related issues in this description
when merging / squashing, make sure the fixed issue references are visible in the commits, for easy compilation of release notes
If data sources for any opportunity has been updated/added, please update the wiki for same opportunity.

Related Issues

Thanks for contributing!

github-actions · 2026-02-18T22:27:37Z

This PR will trigger no release when merged.

re-enabled automatic testing meant for pre-flight canonical url test canonical validation logic separated into its own method (validateCanoicalTag() re-enable and re-added a bunch of automated test.

codecov · 2026-02-23T18:31:29Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

bhellema · 2026-03-02T12:47:05Z

src/canonical/handler.js


+/**
+ * Check if canonical is self-referenced (ignoring protocol and case)
+ *


This function does not check anything, this is a get function that returns the pathname lowercased. Please update the documentation accordingly.

remove unnecessary whitespace.

I suggest removing this function entirely (see comment below about validation)

bhellema · 2026-03-02T12:50:17Z

src/canonical/handler.js

+      const normalizedCanonical = normalizeUrlExport(canonicalUrl);
+      const normalizedFinal = normalizeUrlExport(finalUrl);


I would remove the normalizeUrlExport function and create a new function called isSelfReferencing that takes the two URLs and validates if they are the same url. This would be a cleaner approach isolating the logic of what validates two urls are the same.

bhellema · 2026-03-02T13:00:08Z

src/canonical/handler.js

+    const url = scrapedObject.url || scrapedObject.finalUrl;
+    if (!url) {
+      log.warn(`[canonical] No URL found in S3 object: ${key}`);
+      return { url: null, canonicalUrl: null, checks: [] };
+    }
+
+    const finalUrl = scrapedObject.finalUrl || url;


If scrapedObject.url exists we set it to url, but then on line 363 we flip it and say if scrapedObject.finalUrl is available we us it over the line on 357. This is a bit confusing.

The logic has a subtle redundancy and could be misleading:
Line 357: const url = scrapedObject.url || scrapedObject.finalUrl
Line 363: const finalUrl = scrapedObject.finalUrl || url

If scrapedObject.url is missing but scrapedObject.finalUrl exists, then url and finalUrl will be the same value (both set to scrapedObject.finalUrl). The finalUrl fallback on line 363 becomes a no-op, and more importantly, the semantic distinction between "the URL we requested" and "the URL we landed on" is lost.

const url = scrapedObject.url; if (!url) { log.warn(`[canonical] No URL found in S3 object: ${key}`); return { url: null, canonicalUrl: null, checks: [] }; } const finalUrl = scrapedObject.finalUrl || url;

This preserves the semantic distinction: url is always the original requested URL, and finalUrl is the resolved destination (defaulting to url when no redirect occurred). If scrapedObject.url is genuinely absent, it's a real data problem worth rejecting cleanly — not silently masking by substituting finalUrl.

bhellema · 2026-03-02T13:02:53Z

src/canonical/handler.js

+      return { url: null, canonicalUrl: null, checks: [] };
+    }
+
+    const isPreview = isPreviewPage(baseURL);


Checking isPreview here and using it way down in the code, it would be cleaner if we move this check closer to where isPreview is used.

bhellema · 2026-03-02T13:06:17Z

src/canonical/handler.js

+    const canonicalTagChecks = [];
+
+    // Check if canonical tag exists
+    if (!canonicalMetadata.exists || !canonicalUrl) {


I think this might need to be broken down into two checks as we might have metadata, but the url is missing and we are returning a response saying the tag is missing when it isn't.

if (!canonicalMetadata.exists) { canonicalTagChecks.push({ check: CANONICAL_CHECKS.CANONICAL_TAG_MISSING.check, success: false, explanation: CANONICAL_CHECKS.CANONICAL_TAG_MISSING.explanation, }); } else if (!canonicalUrl) { // Tag exists but href is absent or empty canonicalTagChecks.push({ check: CANONICAL_CHECKS.CANONICAL_TAG_NO_HREF.check, success: false, explanation: CANONICAL_CHECKS.CANONICAL_TAG_NO_HREF.explanation, }); }

This requires a dedicated CANONICAL_TAG_NO_HREF check constant, but it surfaces a genuinely distinct failure mode rather than silently mislabeling a malformed tag as a missing one.

bhellema · 2026-03-02T13:38:42Z

src/canonical/handler.js

    }

    // Check if the canonical URL has the same protocol as the base URL
    if (!url.href.startsWith(base.protocol)) {


Why not just check the url.protocol === base.protocol? The comment suggest this comparison but the code isn't doing it.

bhellema · 2026-03-02T13:44:10Z

src/canonical/handler.js

+      );
+      const isSelfReferenced = selfRefCheck?.success === true;
+
+      // if self-referenced - skip accessibility


reading this comment had me think that we are skipping an accessibility audit, not the fact that the url was a 200.

bhellema · 2026-03-02T13:47:50Z

src/canonical/handler.js

+
+      // if self-referenced - skip accessibility
+      if (isSelfReferenced) {
+        checks.push({


Ideally we have a set of checks that pass when the url is valid, this allows for future conditions to be added with easy and maintainability:

const ACCESSIBILITY_CHECKS = [ CANONICAL_CHECKS.CANONICAL_URL_STATUS_OK.check, CANONICAL_CHECKS.CANONICAL_URL_NO_REDIRECT.check, ];

And then we could:

if (isSelfReferenced) { checks.push(...ACCESSIBILITY_CHECKS.map((check) => ({ check, success: true }))); } else { ... }

bhellema · 2026-03-02T13:49:02Z

src/canonical/handler.js

+        // if not self-referenced - validate accessibility
+
+        const options = await getPreviewAuthOptions(isPreview, baseURL, site, context, log);
+
+        const urlContentCheck = await validateCanonicalRecursively(canonicalUrl, log, options);
+        checks.push(...urlContentCheck);
+      }
+    }
+
+    return { url, canonicalUrl, checks };


Formatting is required.

bhellema · 2026-03-02T13:49:44Z

src/canonical/handler.js

+    return { url, canonicalUrl, checks };
+  } catch (error) {
+    log.error(`[canonical] Error processing scraped content from ${key}: ${error.message}`);
+    return { url: null, canonicalUrl: null, checks: [] };


Define a default object here that can be used throughout the code base that has these values and return it.

jkf276 self-assigned this Feb 18, 2026

jkf276 force-pushed the preflight-canonical-url branch from 2c2b3f1 to 30e8814 Compare February 18, 2026 22:27

jkf276 force-pushed the preflight-canonical-url branch 5 times, most recently from 9ee942b to f382f3e Compare February 23, 2026 17:21

[sites 40370] re-enable canonical url for pre-flight audit

454d217

re-enabled automatic testing meant for pre-flight canonical url test canonical validation logic separated into its own method (validateCanoicalTag() re-enable and re-added a bunch of automated test.

jkf276 force-pushed the preflight-canonical-url branch from f382f3e to 454d217 Compare February 23, 2026 17:30

bhellema requested changes Mar 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Preflight canonical url#2012

Preflight canonical url#2012
jkf276 wants to merge 1 commit intomainfrom
preflight-canonical-url

jkf276 commented Feb 18, 2026

Uh oh!

github-actions bot commented Feb 18, 2026

Uh oh!

codecov bot commented Feb 23, 2026

Uh oh!

bhellema Mar 2, 2026

Uh oh!

bhellema Mar 2, 2026

Uh oh!

bhellema Mar 2, 2026

Uh oh!

bhellema Mar 2, 2026

Uh oh!

bhellema Mar 2, 2026

Uh oh!

bhellema Mar 2, 2026

Uh oh!

bhellema Mar 2, 2026

Uh oh!

bhellema Mar 2, 2026

Uh oh!

bhellema Mar 2, 2026

Uh oh!

bhellema Mar 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		const normalizedCanonical = normalizeUrlExport(canonicalUrl);
		const normalizedFinal = normalizeUrlExport(finalUrl);

Conversation

jkf276 commented Feb 18, 2026

Related Issues

Uh oh!

github-actions bot commented Feb 18, 2026

Uh oh!

codecov bot commented Feb 23, 2026

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants