
Link checker selectiveness: run only on changed files on small PRs #2708

Merged
holly-cummins merged 1 commit into quarkusio:main from holly-cummins:incremental-link-checker
Jun 12, 2026

@holly-cummins holly-cummins commented Jun 11, 2026

Contributor

This is the promised follow-on to #2685 to reduce the build-time impact. It only checks changed files for internal dead links. This does introduce a gap: if a file is moved, any files that reference it wouldn't be checked, so the build would be green. However, the external link checker (#2697) would pick that up on its scheduled runs and raise a defect. I think the trade-off is worth it for the improved build speed.

If someone has a PR which is lagging way behind main, they'll also get a slow build, but I think we can live with that.

Here's the logic:

  • Compares the built _site/ against the gh-pages branch (the deployed site) using rsync to find changed HTML files. (A git diff would pick up the adoc changes instead, and we'd have to map from those to the HTML.)
  • When ≤15 HTML files changed and no infrastructure files (build.yml, pom.xml, src/test/java/) were modified, runs the link crawler only on the changed pages with a depth-1 check (verifies outbound links from those pages)
  • Falls back to a full crawl when >15 files changed, infrastructure changed, or gh-pages comparison fails
  • Skips the link crawl entirely when no HTML files changed
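
As an illustration only, the decision rules in the list above could be sketched as below. This is not the script from the PR: the function names, the way infrastructure paths are matched, and the choice to let an infrastructure change take precedence over the "no HTML changed" case are all assumptions. The mode names follow the ones used in this description (full, incremental, skip), and the input list of changed HTML files is assumed to come from something like an rsync dry run of _site/ against a gh-pages checkout.

```python
from __future__ import annotations

from typing import Iterable, Optional

MAX_INCREMENTAL_PAGES = 15  # threshold stated in the PR description


def touches_infrastructure(repo_files: Iterable[str]) -> bool:
    """True if any modified repository file is build infrastructure.

    The three patterns come from the PR description; how they are
    matched here (suffix, exact name, prefix) is an assumption.
    """
    for path in repo_files:
        if (
            path.endswith("build.yml")
            or path == "pom.xml"
            or path.startswith("src/test/java/")
        ):
            return True
    return False


def determine_scope(
    changed_html: Optional[list[str]], repo_files: Iterable[str]
) -> str:
    """Return 'full', 'incremental', or 'skip'.

    changed_html is None when the gh-pages comparison failed.
    """
    if changed_html is None:
        return "full"  # comparison failed: fall back to a full crawl
    if touches_infrastructure(repo_files):
        return "full"  # assumed to win even if no HTML changed
    if not changed_html:
        return "skip"  # nothing to crawl
    if len(changed_html) > MAX_INCREMENTAL_PAGES:
        return "full"  # too many pages for the depth-1 check
    return "incremental"
```

For example, `determine_scope(["_site/blog/index.html"], ["_posts/example.adoc"])` returns `"incremental"`, while the same HTML change alongside a modified `pom.xml` returns `"full"`.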

This PR itself always triggers the 'full' scenario, so it can only exercise the 'full' paths, but I've tried my best to test the others locally. I'll also keep an eye on PRs that go in after this merges to validate that the scope detection is working as expected.

Attempts at testing it

  • Ran incremental mode against quarkus.io with 2 paths (/blog/, /about/) — crawled 217 pages (depth-1) vs thousands for a full crawl, 0 broken links
  • Tested scope determination script locally: Claude claims it correctly detects changed files, skips unchanged files, converts paths to URLs
  • Tested no-changes edge case: correctly produces mode=skip
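
The path-to-URL conversion mentioned in the list above might look roughly like the following. This is a hypothetical sketch, not the code in the PR: the function names, the `_site` root default, and the rule that `index.html` maps to its directory URL are assumptions made for illustration.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def site_path_to_url_path(path: str, site_root: str = "_site") -> str:
    """Map a built file such as '_site/blog/index.html' to '/blog/'.

    Assumes index.html is served at its directory URL; other pages
    keep their file name.
    """
    rel = PurePosixPath(path).relative_to(site_root)
    if rel.name == "index.html":
        parent = rel.parent.as_posix()
        return "/" if parent == "." else f"/{parent}/"
    return f"/{rel.as_posix()}"


def changed_html_to_url_paths(changed: list[str]) -> list[str]:
    """Keep only HTML files and return their URL paths, sorted and de-duplicated."""
    return sorted({site_path_to_url_path(p) for p in changed if p.endswith(".html")})
```

With this sketch, a changed-file list containing `_site/about/index.html` and `_site/blog/index.html` would yield the two paths `/about/` and `/blog/`, matching the shape of the paths used in the incremental test run.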

@holly-cummins holly-cummins changed the title from "Run link checker incrementally on small PRs" to "Link checker selectiveness: run only on changed files on small PRs" Jun 11, 2026
Compare the built site against the gh-pages branch to find changed
HTML files. When 15 or fewer pages changed and no build infrastructure
files were modified, run the link crawler only on those pages (depth-1
check). This avoids a full-site crawl on small PRs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

# Conflicts:
#	src/test/java/io/quarkusio/LinkCrawlerTest.java
@holly-cummins holly-cummins force-pushed the incremental-link-checker branch from c2215d2 to ff5c236 June 12, 2026 13:56
@holly-cummins holly-cummins merged commit 8970328 into quarkusio:main Jun 12, 2026
1 check passed
@holly-cummins holly-cummins deleted the incremental-link-checker branch June 12, 2026 14:26
github-actions Bot commented Jun 12, 2026


🙈 The PR is closed and the preview is expired.
