Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 15 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,20 @@
# Changelog

## 8.0.17 - 2026-05-07

### Fixed

- `kg_build` postprocessor contexts now inject
`Context.extensions["kg_build.iri_lookup"]` from the callback URL and
`existing_web_page_id`, allowing the trailing canonical ID postprocessor to
reuse existing URL-mapped IRIs before falling back to generated IDs.

### Changed

- Bump package version to `8.0.17`.
- Sync release-facing documentation and indexes (`README.md`,
`docs/INDEX.md`, `specs/INDEX.md`).

## 8.0.16 - 2026-04-17

### Fixed
Expand Down
4 changes: 3 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

A Python toolkit for orchestrating WordLift imports: fetch URLs from sitemaps, Google Sheets, or explicit lists, filter out already imported pages, enqueue search console jobs, push RDF graphs, and call the WordLift APIs to import web pages.

Current release: `8.0.16` (2026-04-17).
Current release: `8.0.17` (2026-05-07).

## Features
- URL sources: XML sitemaps, Google Sheets (`url` column), or Python lists, with global optional `URL_REGEX` filtering (also enforced in graph-sync source selection before `new_or_changed` GraphQL lookup).
Expand Down Expand Up @@ -317,6 +317,8 @@ The SDK now includes a profile-driven cloud mapping module under `wordlift_sdk.k
3. otherwise no postprocessors
- Callback canonicalization order: profile postprocessors run first, then built-in canonical ID generation runs on the postprocessed graph immediately before patching.
- Built-in canonical IDs support optional lookup-based root IRI reuse via `Context.extensions["kg_build.iri_lookup"]` (`IriLookup.iri_for_subject(graph, subject)`), with default fallback to generated IDs when lookup misses.
- `kg_build` callback contexts populate that lookup from the callback URL and
existing URL-mapped IRI when the URL source provides one.
- Lookup-based reuse is root-only: dependent nodes (for example `Offer`, `Answer`, `Action`) still follow canonical parent-nested rewrite rules.
- Execution is manifest-based only (hard cutover): no legacy `.py` or `*.command.toml` discovery.
- During callback patch preparation, the SDK annotates first-level URI-subject nodes in the generated graph with `seovoc:source "web-page-import"` where first-level is dataset ID depth `/<dataset>/<bucket>/<id>` (for example `https://data.host/dataset/types/name`); deeper child IDs and blank nodes are not annotated.
Expand Down
2 changes: 1 addition & 1 deletion docs/INDEX.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Docs Index

Release sync: this index is aligned with SDK `8.0.16` (see `CHANGELOG.md`).
Release sync: this index is aligned with SDK `8.0.17` (see `CHANGELOG.md`).

- `docs/public_entry_points.md`: Task-oriented public API inventory for clients/agents.
- `docs/packaging_slices_v7.md`: v7 packaging model for lean base installs plus optional feature extras under the same `wordlift_sdk` namespace.
Expand Down
2 changes: 2 additions & 0 deletions docs/canonical_id_policy.md
Original file line number Diff line number Diff line change
Expand Up @@ -76,6 +76,8 @@ hook:
- protocol: `IriLookup.iri_for_subject(graph, subject) -> str | None`
- builtin dataframe implementation:
`wordlift_sdk.kg_build.DataFrameUrlIriLookup` (`url`/`iri` columns)
- `kg_build` callback contexts populate this lookup from the callback URL and
`existing_web_page_id` when the URL source already resolved a URL-mapped IRI.

Behavior:

Expand Down
2 changes: 1 addition & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
[tool.poetry]
name = "wordlift-sdk"
version = "8.0.16"
version = "8.0.17"
description = "Python toolkit for orchestrating WordLift imports and structured data workflows."
authors = ["David Riccitelli <david@wordlift.io>"]
readme = "README.md"
Expand Down
2 changes: 1 addition & 1 deletion specs/INDEX.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Specs Index

Release sync: this index is aligned with SDK `8.0.16` (see `CHANGELOG.md`).
Release sync: this index is aligned with SDK `8.0.17` (see `CHANGELOG.md`).

- `specs/GSC_CANONICAL_SELECTION.md`: Client integration spec for canonical URL election from GSC impressions (`url,title` CSV input, OAuth/service-account credential handoff, interval/concurrency contract).
- `specs/GRAPH_SYNC_API.md`: Graph-scoped REST API and persistence model for recording sync runs, live progress, aggregate KPIs, and per-URL SHACL reporting.
Expand Down
3 changes: 3 additions & 0 deletions specs/PIPELINE_ARCHITECTURE.md
Original file line number Diff line number Diff line change
Expand Up @@ -114,6 +114,9 @@ Current implementation status:
(`IriLookup.iri_for_subject(graph, subject)`) resolved via
`Context.extensions["kg_build.iri_lookup"]`; lookup applies only to root
subjects and falls back to generated IDs on misses.
- callback postprocessor contexts inject that lookup from the callback URL and
`existing_web_page_id` when the source lookup has resolved an existing
URL-mapped IRI.
- callback graph canonicalization includes a fallback subject pass so non-blank
subject IRIs are rewritten to canonical dataset-rooted paths when they are not
already under canonical root prefixes; static-template graph patching remains
Expand Down
11 changes: 6 additions & 5 deletions tests/kg_build/postprocessors/processors/test_id_generator.py
Original file line number Diff line number Diff line change
Expand Up @@ -616,7 +616,7 @@ def test_dependency_graph_strategy_reparents_faq_under_article_typed_as_faqpage(

Expected transformation:
old root : .../articles/article-1
new root : .../articles/credit-card-debt-relief-freedom-debt-relief-<hash>
new root : .../articles/article-1-<hash>
question : .../articles/<new>/questions/<q-slug>
answer : .../articles/<new>/questions/<q-slug>/answers/answer
"""
Expand Down Expand Up @@ -659,8 +659,9 @@ def test_dependency_graph_strategy_reparents_faq_under_article_typed_as_faqpage(
assert len(articles) == 1
article_iri = str(articles[0])
url_hash = generator._url_hash("https://example.com/articles/article-1")
assert article_iri == (
f"{DATASET}/articles/credit-card-debt-relief-freedom-debt-relief-{url_hash}"
assert (
article_iri
== f"{DATASET}/articles/credit-card-debt-relief-freedom-debt-relief-{url_hash}"
), article_iri

questions = list(output.subjects(RDF.type, URIRef(f"{SCHEMA}Question")))
Expand All @@ -673,9 +674,9 @@ def test_dependency_graph_strategy_reparents_faq_under_article_typed_as_faqpage(
answer_iri = str(answers[0])
assert answer_iri.startswith(f"{question_iri}/answers/"), answer_iri

# No stale article-1 path may remain anywhere in the graph
# No stale external article path may remain anywhere in the graph
all_iris = {str(s) for s in output.subjects() if isinstance(s, URIRef)}
assert not any("article-1" in iri for iri in all_iris), all_iris
assert not any(iri.startswith("https://example.com/") for iri in all_iris), all_iris


def test_dependency_graph_strategy_is_generic_product_offer_pricespec() -> None:
Expand Down
56 changes: 56 additions & 0 deletions tests/kg_build/test_protocol.py
Original file line number Diff line number Diff line change
Expand Up @@ -107,6 +107,62 @@ def _make_validation_outcome(
)


def test_build_pp_context_injects_url_iri_lookup() -> None:
protocol = ProfileImportProtocol(
context=_make_context(),
profile=_make_profile(),
root_dir=Path("."),
)
response = WebPageScrapeResponse(
web_page=WebPage(url="https://example.com/article", html="<html></html>")
)

context = protocol._build_pp_context(
"https://example.com/article",
response,
"https://data.example.com/dataset/articles/existing-article",
"existing-hash",
)
lookup = context.extensions.get("kg_build.iri_lookup")

assert lookup is not None

graph = Graph()
subject = URIRef("https://example.com/article#node")
faq_page = URIRef("https://example.com/article#faq")
graph.add((subject, RDF.type, URIRef("http://schema.org/Article")))
graph.add(
(
subject,
URIRef("http://schema.org/url"),
Literal("https://example.com/article"),
)
)
graph.add((faq_page, RDF.type, URIRef("http://schema.org/FAQPage")))
graph.add((subject, URIRef("http://schema.org/subjectOf"), faq_page))
graph.add((faq_page, URIRef("http://schema.org/about"), subject))
assert (
lookup.iri_for_subject(graph, subject)
== "https://data.example.com/dataset/articles/existing-article"
)

output = CanonicalIdsPostprocessor(strategy="dependency_graph").process_graph(
graph, context
)
assert (
URIRef("https://data.example.com/dataset/articles/existing-article"),
RDF.type,
URIRef("http://schema.org/Article"),
) in output
assert (
URIRef(
"https://data.example.com/dataset/articles/existing-article/faq-pages/faq-page"
),
URIRef("http://schema.org/about"),
URIRef("https://data.example.com/dataset/articles/existing-article"),
) in output


def _passthrough_pp() -> AsyncMock:
return AsyncMock(
side_effect=lambda g, url, resp, ewi, eih: PostprocessorResult(
Expand Down
3 changes: 2 additions & 1 deletion wordlift_sdk/kg_build/postprocessors/types.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
from __future__ import annotations

from dataclasses import dataclass
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import Any, Protocol, runtime_checkable
Expand All @@ -27,6 +27,7 @@ class PostprocessorContext:
existing_import_hash: str | None = None
import_hash_mode: str = "on"
ids: Any | None = None
extensions: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
Expand Down
8 changes: 8 additions & 0 deletions wordlift_sdk/kg_build/protocol.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@
from typing import Any

from jinja2 import UndefinedError
import pandas as pd
from rdflib import Graph, URIRef
from wordlift_client.models.web_page_scrape_response import WebPageScrapeResponse
from wordlift_sdk.protocol import Context
Expand All @@ -27,6 +28,7 @@
from .cloud_flow import run_cloud_workflow as run_cloud_workflow # noqa: F401
from .config import ProfileDefinition
from .entity_patcher import EntityPatcher
from .iri_lookup import DataFrameUrlIriLookup
from .kpi import KgBuildKpiCollector
from .postprocessors import (
LoadedPostprocessor,
Expand Down Expand Up @@ -503,6 +505,11 @@ def _build_pp_context(
profile_settings = dict(profile_payload.get("settings", {}) or {})
profile_settings.setdefault("api_url", "https://api.wordlift.io")
profile_payload["settings"] = profile_settings
extensions: dict[str, Any] = {}
if url and existing_web_page_id:
extensions["kg_build.iri_lookup"] = DataFrameUrlIriLookup(
pd.DataFrame([{"url": url, "iri": existing_web_page_id}])
)
return PostprocessorContext(
profile_name=self.profile.name,
profile=profile_payload,
Expand All @@ -515,6 +522,7 @@ def _build_pp_context(
existing_import_hash=existing_import_hash,
import_hash_mode=self._import_hash_mode,
ids=ids,
extensions=extensions,
)

def _resolve_path(self, raw_path: str) -> Path:
Expand Down
Loading