
[DCP Ingestion]#503

Open
gmechali wants to merge 6 commits into datacommonsorg:master from gmechali:jsonld2

Conversation

@gmechali
Contributor

No description provided.

@codacy-production

codacy-production Bot commented Apr 30, 2026

Not up to standards ⛔

🔴 Issues 6 high · 13 medium · 17 minor

Alerts:
⚠ 36 issues (≤ 0 issues of at least minor severity)

Results:
36 new issues

Category Results
UnusedCode 6 medium
ErrorProne 6 high
Security 3 medium
CodeStyle 17 minor
Complexity 4 medium

View in Codacy

🟢 Metrics 42 complexity · 0 duplication

Metric Results
Complexity 42
Duplication 0

View in Codacy



@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new 'DCP Bridge' mode to the ingestion tool, which facilitates bridging the gap between custom data resolution and the production platform by exporting data to sharded JSON-LD files. The changes include a new exporter module, documentation, and logic to automatically trigger Google Cloud Workflows for data ingestion. The review feedback highlights several important improvements: replacing hardcoded GCP resource fallbacks with environment-driven configuration, adopting parameterized SQL queries to mitigate injection risks in the exporter, and refining exception handling when triggering external workflows to avoid silent failures.
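The exporter module itself is not shown in this thread, but the "sharded JSON-LD files" it produces can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `@context` URL, the shard naming scheme, and the record shape are hypothetical, not the PR's actual format.

```python
import json
import os


def write_jsonld_shards(records, out_dir, shard_size=1000, prefix="triples"):
  """Write records as JSON-LD documents, one file per shard.

  Hypothetical sketch: the real exporter's context, naming, and record
  layout may differ.
  """
  os.makedirs(out_dir, exist_ok=True)
  shard_paths = []
  for start in range(0, len(records), shard_size):
    shard = records[start:start + shard_size]
    # Each shard is a self-contained JSON-LD document with its own @context.
    doc = {"@context": {"dc": "https://datacommons.org/"}, "@graph": shard}
    path = os.path.join(out_dir, f"{prefix}-{start // shard_size:05d}.jsonld")
    with open(path, "w") as f:
      json.dump(doc, f)
    shard_paths.append(path)
  return shard_paths
```

Fixed-size shards keep each output file small enough for downstream ingestion jobs to process in parallel.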

Comment thread simple/stats/runner.py Outdated
Comment on lines +603 to +609
spanner_instance = os.getenv("GCP_SPANNER_INSTANCE_ID", "gabe-test-dcp-instance")
spanner_database = os.getenv("GCP_SPANNER_DATABASE_NAME", "gabe-test-dcp-db-v2")
workflow_name = os.getenv("WORKFLOW_NAME", "gabe-test-ingestion-orchestrator")
project_id = os.getenv("PROJECT_ID", "datcom-website-dev")
location = os.getenv("WORKFLOW_LOCATION", "us-central1")
temp_location = os.getenv("TEMP_LOCATION", "gs://gabe-test-ingestion-bucket-datcom-website-dev/temp")
region = os.getenv("REGION", "us-central1")

high

Hardcoded fallback values for GCP resources (Spanner instances, databases, workflows, etc.) containing personal or test-specific names should be avoided in the codebase. These should be strictly managed via environment variables or configuration files to ensure portability and security.
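One way to act on this feedback is to fail fast when a required variable is unset, rather than silently falling back to a personal or test resource. A minimal sketch (the `require_env` helper is hypothetical, not part of the PR):

```python
import os


def require_env(name: str) -> str:
  """Read a required configuration value from the environment.

  Raises instead of falling back to a hardcoded (possibly personal or
  test-specific) resource name, so misconfiguration surfaces immediately.
  """
  value = os.getenv(name)
  if not value:
    raise RuntimeError(f"Missing required environment variable: {name}")
  return value


# Usage sketch: every GCP resource comes from the environment, no fallbacks.
# spanner_instance = require_env("GCP_SPANNER_INSTANCE_ID")
# project_id = require_env("PROJECT_ID")
```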

offset = 0
while True:
triples_tuples = db.engine.fetch_all(
f"SELECT subject_id, predicate, object_id, object_value FROM triples LIMIT {chunk_size} OFFSET {offset}"

security-medium medium

Using f-strings to construct SQL queries is generally discouraged as it can lead to SQL injection vulnerabilities if the variables are ever sourced from user input. Although chunk_size and offset are currently integers, it is better practice to use parameterized queries.

Suggested change
f"SELECT subject_id, predicate, object_id, object_value FROM triples LIMIT {chunk_size} OFFSET {offset}"
"SELECT subject_id, predicate, object_id, object_value FROM triples LIMIT ? OFFSET ?", (chunk_size, offset)

obs_counter = 0
while True:
obs_tuples = db.engine.fetch_all(
f"SELECT entity, variable, date, value, provenance, unit, scaling_factor, measurement_method, observation_period, properties FROM observations LIMIT {chunk_size} OFFSET {offset}"

security-medium medium

Using f-strings to construct SQL queries is generally discouraged. Please use parameterized queries to ensure safety and follow best practices.

Suggested change
f"SELECT entity, variable, date, value, provenance, unit, scaling_factor, measurement_method, observation_period, properties FROM observations LIMIT {chunk_size} OFFSET {offset}"
"SELECT entity, variable, date, value, provenance, unit, scaling_factor, measurement_method, observation_period, properties FROM observations LIMIT ? OFFSET ?", (chunk_size, offset)

Comment thread simple/stats/runner.py Outdated
Comment on lines +649 to +650
except Exception as e:
logging.error(f"Error triggering workflow via API: {e}")

medium

Catching a broad Exception and only logging it can hide unexpected errors and make debugging difficult. Consider catching more specific exceptions related to the API call (e.g., requests.exceptions.RequestException) or re-raising the exception after logging to ensure the failure is not silently ignored by the caller.
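A sketch of the log-then-re-raise pattern the reviewer suggests. The `WorkflowTriggerError` wrapper and the `execute_fn` stand-in for the Workflows API call are hypothetical; the real code would catch the specific exceptions its client library raises.

```python
import logging


class WorkflowTriggerError(RuntimeError):
  """Raised when the ingestion workflow could not be started."""


def trigger_workflow(execute_fn):
  # execute_fn is a stand-in for the actual Workflows API call.
  try:
    return execute_fn()
  except (ConnectionError, TimeoutError) as e:
    # Log for operators, then re-raise a typed error so the caller sees
    # the failure instead of it being silently swallowed.
    logging.error("Error triggering workflow via API: %s", e)
    raise WorkflowTriggerError("workflow trigger failed") from e
```

Chaining with `raise ... from e` preserves the original traceback while giving callers a single exception type to handle.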
