Conversation
2c2b3f1 to
30e8814
Compare
|
This PR will trigger no release when merged. |
9ee942b to
f382f3e
Compare
re-enabled automatic testing meant for pre-flight canonical url test canonical validation logic separated into its own method (validateCanoicalTag() re-enable and re-added a bunch of automated test.
f382f3e to
454d217
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
|
||
| /** | ||
| * Check if canonical is self-referenced (ignoring protocol and case) | ||
| * |
There was a problem hiding this comment.
- This function does not check anything, this is a get function that returns the pathname lowercased. Please update the documentation accordingly.
- remove unnecessary whitespace.
- I suggest removing this function entirely (see comment below about validation)
| const normalizedCanonical = normalizeUrlExport(canonicalUrl); | ||
| const normalizedFinal = normalizeUrlExport(finalUrl); |
There was a problem hiding this comment.
I would remove the normalizeUrlExport function and create a new function called isSelfReferencing that takes the two URLs and validates if they are the same url. This would be a cleaner approach isolating the logic of what validates two urls are the same.
| const url = scrapedObject.url || scrapedObject.finalUrl; | ||
| if (!url) { | ||
| log.warn(`[canonical] No URL found in S3 object: ${key}`); | ||
| return { url: null, canonicalUrl: null, checks: [] }; | ||
| } | ||
|
|
||
| const finalUrl = scrapedObject.finalUrl || url; |
There was a problem hiding this comment.
If scrapedObject.url exists we set it to url, but then on line 363 we flip it and say if scrapedObject.finalUrl is available we us it over the line on 357. This is a bit confusing.
The logic has a subtle redundancy and could be misleading:
Line 357: const url = scrapedObject.url || scrapedObject.finalUrl
Line 363: const finalUrl = scrapedObject.finalUrl || url
If scrapedObject.url is missing but scrapedObject.finalUrl exists, then url and finalUrl will be the same value (both set to scrapedObject.finalUrl). The finalUrl fallback on line 363 becomes a no-op, and more importantly, the semantic distinction between "the URL we requested" and "the URL we landed on" is lost.
const url = scrapedObject.url;
if (!url) {
log.warn(`[canonical] No URL found in S3 object: ${key}`);
return { url: null, canonicalUrl: null, checks: [] };
}
const finalUrl = scrapedObject.finalUrl || url;
This preserves the semantic distinction: url is always the original requested URL, and finalUrl is the resolved destination (defaulting to url when no redirect occurred). If scrapedObject.url is genuinely absent, it's a real data problem worth rejecting cleanly — not silently masking by substituting finalUrl.
| return { url: null, canonicalUrl: null, checks: [] }; | ||
| } | ||
|
|
||
| const isPreview = isPreviewPage(baseURL); |
There was a problem hiding this comment.
Checking isPreview here and using it way down in the code, it would be cleaner if we move this check closer to where isPreview is used.
| const canonicalTagChecks = []; | ||
|
|
||
| // Check if canonical tag exists | ||
| if (!canonicalMetadata.exists || !canonicalUrl) { |
There was a problem hiding this comment.
I think this might need to be broken down into two checks as we might have metadata, but the url is missing and we are returning a response saying the tag is missing when it isn't.
if (!canonicalMetadata.exists) {
canonicalTagChecks.push({
check: CANONICAL_CHECKS.CANONICAL_TAG_MISSING.check,
success: false,
explanation: CANONICAL_CHECKS.CANONICAL_TAG_MISSING.explanation,
});
} else if (!canonicalUrl) {
// Tag exists but href is absent or empty
canonicalTagChecks.push({
check: CANONICAL_CHECKS.CANONICAL_TAG_NO_HREF.check,
success: false,
explanation: CANONICAL_CHECKS.CANONICAL_TAG_NO_HREF.explanation,
});
}
This requires a dedicated CANONICAL_TAG_NO_HREF check constant, but it surfaces a genuinely distinct failure mode rather than silently mislabeling a malformed tag as a missing one.
| } | ||
|
|
||
| // Check if the canonical URL has the same protocol as the base URL | ||
| if (!url.href.startsWith(base.protocol)) { |
There was a problem hiding this comment.
Why not just check the url.protocol === base.protocol? The comment suggest this comparison but the code isn't doing it.
| ); | ||
| const isSelfReferenced = selfRefCheck?.success === true; | ||
|
|
||
| // if self-referenced - skip accessibility |
There was a problem hiding this comment.
reading this comment had me think that we are skipping an accessibility audit, not the fact that the url was a 200.
|
|
||
| // if self-referenced - skip accessibility | ||
| if (isSelfReferenced) { | ||
| checks.push({ |
There was a problem hiding this comment.
Ideally we have a set of checks that pass when the url is valid, this allows for future conditions to be added with easy and maintainability:
const ACCESSIBILITY_CHECKS = [
CANONICAL_CHECKS.CANONICAL_URL_STATUS_OK.check,
CANONICAL_CHECKS.CANONICAL_URL_NO_REDIRECT.check,
];
And then we could:
if (isSelfReferenced) {
checks.push(...ACCESSIBILITY_CHECKS.map((check) => ({ check, success: true })));
} else {
...
}
| // if not self-referenced - validate accessibility | ||
|
|
||
| const options = await getPreviewAuthOptions(isPreview, baseURL, site, context, log); | ||
|
|
||
| const urlContentCheck = await validateCanonicalRecursively(canonicalUrl, log, options); | ||
| checks.push(...urlContentCheck); | ||
| } | ||
| } | ||
|
|
||
| return { url, canonicalUrl, checks }; |
| return { url, canonicalUrl, checks }; | ||
| } catch (error) { | ||
| log.error(`[canonical] Error processing scraped content from ${key}: ${error.message}`); | ||
| return { url: null, canonicalUrl: null, checks: [] }; |
There was a problem hiding this comment.
Define a default object here that can be used throughout the code base that has these values and return it.
Please ensure your pull request adheres to the following guidelines:
Related Issues
Thanks for contributing!