Skip to content

[#30180] YSQL: Add attributes in yb_tablet_metadata.tablet_attrs#30861

Open
anstormx wants to merge 2 commits intoyugabyte:masterfrom
anstormx:tablet-attrs-30860
Open

[#30180] YSQL: Add attributes in yb_tablet_metadata.tablet_attrs#30861
anstormx wants to merge 2 commits intoyugabyte:masterfrom
anstormx:tablet-attrs-30860

Conversation

@anstormx
Copy link
Copy Markdown

@anstormx anstormx commented Mar 28, 2026

Note

Medium Risk
Adds a new jsonb output column backed by tserver-side fan-out RPCs to enrich tablet metadata, which could impact latency/robustness of the metadata call despite best-effort error handling. Changes touch cross-layer plumbing (tserver RPC -> pggate structs -> Postgres SRF/view) and require careful compatibility with existing callers.

Overview
Exposes new per-tablet disk size metadata via yb_tablet_metadata by adding a tablet_attrs jsonb column (also returned from yb_get_tablet_metadata).

Implements end-to-end plumbing to populate these attributes: the tserver TabletsMetadata RPC now best-effort enriches master tablet metadata by fanning out ListTablets to all tservers (with a 5s timeout) to fetch SST/WAL/uncompressed SST sizes, and the Postgres-side SRF builds tablet_attrs JSONB from the returned sizes.

Adds a migration to update pg_proc/recreate the view, and extends regression tests to validate expected JSON keys and non-negative values.

Written by Cursor Bugbot for commit 976b0f3. This will update automatically on new commits. Configure here.


Phorge: D51642

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Mar 28, 2026

CLA assistant check
All committers have signed the CLA.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a tablet_attrs JSONB column to the yb_tablet_metadata system view, exposing disk size metrics for SST, WAL, and uncompressed SST files. The implementation involves updating the PostgreSQL system catalog, adding logic to aggregate these metrics from remote tablet servers, and providing regression tests. Feedback focuses on enhancing the robustness of the enrichment process through consistent error handling, using dynamic buffers for JSON construction to prevent potential truncation, and replacing hardcoded RPC timeouts with named constants.

@anstormx anstormx force-pushed the tablet-attrs-30860 branch from 94bf97d to 976b0f3 Compare March 28, 2026 18:40
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

@anstormx anstormx force-pushed the tablet-attrs-30860 branch from 976b0f3 to 6a0cbcb Compare March 28, 2026 18:52
@cursor
Copy link
Copy Markdown

cursor bot commented Mar 28, 2026

You have used all of your free Bugbot PR reviews.

To receive reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.

@hari90
Copy link
Copy Markdown
Contributor

hari90 commented Mar 30, 2026

/Gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new tablet_attrs JSONB column to the yb_tablet_metadata view, providing detailed disk size information (SST, WAL, and uncompressed SST sizes) for tablets. The implementation includes backend changes to fetch this data from tablet servers on a best-effort basis, updates to the PostgreSQL catalog and FDW, and new regression tests. Feedback suggests enhancing the regression tests to include a non-negative check for the uncompressed SST size and relaxing a strict null pointer assertion in the C API to better handle cases with zero parameters.

@anstormx anstormx force-pushed the tablet-attrs-30860 branch from 6a0cbcb to 4313d34 Compare March 30, 2026 15:34
@cursor
Copy link
Copy Markdown

cursor bot commented Mar 30, 2026

You have used all of your free Bugbot PR reviews.

To receive reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.

@netlify
Copy link
Copy Markdown

netlify bot commented Mar 30, 2026

Deploy Preview for infallible-bardeen-164bc9 ready!

Built without sensitive environment variables

Name Link
🔨 Latest commit 484c271
🔍 Latest deploy log https://app.netlify.com/projects/infallible-bardeen-164bc9/deploys/69ca9a86cda83300089457b8
😎 Deploy Preview https://deploy-preview-30861--infallible-bardeen-164bc9.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@anstormx anstormx force-pushed the tablet-attrs-30860 branch from 4313d34 to 34c604a Compare March 30, 2026 15:39
@cursor
Copy link
Copy Markdown

cursor bot commented Mar 30, 2026

You have used all of your free Bugbot PR reviews.

To receive reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.

@anstormx anstormx force-pushed the tablet-attrs-30860 branch from 34c604a to 484c271 Compare March 30, 2026 15:45
@cursor
Copy link
Copy Markdown

cursor bot commented Mar 30, 2026

You have used all of your free Bugbot PR reviews.

To receive reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.

@anstormx
Copy link
Copy Markdown
Author

/Gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds a new tablet_attrs JSONB column to the yb_tablet_metadata system view, providing tablet-level disk usage statistics such as SST, WAL, and uncompressed SST file sizes. The implementation includes a best-effort enrichment mechanism in the pg_client_service to aggregate these metrics from both local and remote tablet servers. Feedback suggests that defaulting missing metrics to zero is misleading and recommends using sentinel values to dynamically omit unavailable data from the JSON output. Additionally, the 5-second RPC timeout for data enrichment should be re-evaluated as it may cause excessive latency for synchronous SQL queries.

@anstormx anstormx force-pushed the tablet-attrs-30860 branch from 484c271 to eecee2c Compare March 30, 2026 15:56
@cursor
Copy link
Copy Markdown

cursor bot commented Mar 30, 2026

You have used all of your free Bugbot PR reviews.

To receive reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.

@anstormx
Copy link
Copy Markdown
Author

/Gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the yb_tablet_metadata system view by adding a tablet_attrs JSONB column. This column provides tablet-level disk usage metrics, including SST file size, WAL file size, and uncompressed SST file size. The implementation includes backend changes to fetch this data from tablet servers via RPCs, updates to the PostgreSQL catalog and views, and new regression tests. I have no feedback to provide as there were no review comments to evaluate.

@anstormx
Copy link
Copy Markdown
Author

@hari90 please merge if no issues are present

@cursor
Copy link
Copy Markdown

cursor bot commented Mar 30, 2026

You have used all of your free Bugbot PR reviews.

To receive reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.

@pkj415
Copy link
Copy Markdown
Contributor

pkj415 commented Mar 30, 2026

Trigger Jenkins

@hari90
Copy link
Copy Markdown
Contributor

hari90 commented Mar 30, 2026

Phorge diff synced: D51642
Commit: 75bac24a

Jenkins build has been triggered. Results will be posted here once it completes.

@anstormx anstormx force-pushed the tablet-attrs-30860 branch from 75bac24 to f427d48 Compare March 30, 2026 20:08
@anstormx
Copy link
Copy Markdown
Author

/Gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds a tablet_attrs JSONB column to the yb_tablet_metadata system view to expose tablet-level disk size metrics, including SST and WAL file sizes. The implementation uses a cluster-wide fan-out of RPCs to gather data from TServers, which raised concerns about performance and scalability. Feedback suggests implementing caching or master-side aggregation to mitigate these issues. Additionally, the join logic in the migration script was identified as ambiguous and should be updated to use OIDs to prevent duplicate rows. There is also a suggestion to ensure consistent disk size reporting by prioritizing leader replicas or using aggregates when multiple replicas are present.

@anstormx
Copy link
Copy Markdown
Author

/Gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds a new tablet_attrs JSONB column to the yb_tablet_metadata system view to provide disk size information, including SST and WAL file sizes. The implementation involves updating the PostgreSQL catalog, modifying the yb_get_tablet_metadata function, and introducing a TTL-based caching mechanism in the TServer's PgClientService to aggregate these metrics from across the cluster. Review feedback highlights critical concurrency issues where a mutex is held during synchronous network RPCs, potentially leading to thread starvation, and suggests improving cache availability by using a local map for updates instead of clearing the shared cache during refreshes.

@anstormx anstormx force-pushed the tablet-attrs-30860 branch from 562e725 to 414819c Compare March 30, 2026 20:58
@anstormx
Copy link
Copy Markdown
Author

/Gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new tablet_attrs JSONB column to the yb_tablet_metadata system view, providing disk size metrics such as SST, WAL, and uncompressed SST file sizes. The implementation involves enriching tablet metadata by querying TServers and utilizing a TTL-based cache in PgClientService to manage RPC overhead. Review feedback identified a potential 'thundering herd' performance issue during cache expiration and recommended using sentinel values (e.g., -1) to accurately distinguish between zero and unknown disk sizes in the final JSON output.

@anstormx anstormx force-pushed the tablet-attrs-30860 branch from 414819c to de0d98e Compare March 30, 2026 21:44
@anstormx
Copy link
Copy Markdown
Author

/Gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the yb_tablet_metadata system view by adding a tablet_attrs JSONB column for disk size metrics and updating the join logic with pg_class to use OIDs derived from tablet object UUIDs. The backend implementation introduces a TTL-based cache in the tablet server to aggregate SST and WAL sizes from cluster nodes via remote RPCs. Review feedback recommends strengthening thundering herd protection for the cache when the TTL is disabled, optimizing remote proxy initialization to improve efficiency, and utilizing standard Postgres JSON APIs instead of manual string concatenation for more robust and idiomatic attribute construction.

Comment on lines +2743 to +2748
if (now >= disk_sizes_cache_expiry_) {
needs_refresh = true;
// Push expiry forward so other threads use stale cache while we refresh.
disk_sizes_cache_expiry_ = now + std::chrono::seconds(
FLAGS_tablet_disk_sizes_cache_ttl_sec);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The thundering herd protection logic is weak when FLAGS_tablet_disk_sizes_cache_ttl_sec is set to 0. If the TTL is 0, disk_sizes_cache_expiry_ will always be set to now, causing every concurrent request to trigger a new CollectDiskSizes() call, which fans out RPCs to all tservers. This could overwhelm the cluster if many users query the yb_tablet_metadata view simultaneously. Consider enforcing a minimum internal TTL for the refresh logic or using a separate boolean flag to track if a refresh is already in flight.

Comment on lines +2836 to +2841
auto init_status = remote_tserver->InitProxy(&client());
if (!init_status.ok()) {
LOG(WARNING) << "Failed to init proxy for tserver " << remote_tserver->permanent_uuid()
<< " during disk size enrichment: " << init_status;
continue;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Creating and initializing a proxy for every remote tserver on every cache refresh might be inefficient. While InitProxy usually checks if a proxy already exists, it's generally better to leverage a centralized proxy cache if available in this context, or ensure that remote_tserver already has a warm proxy. Also, consider if ListTabletsAsync should be restricted to only return the necessary metadata to reduce payload size.

Comment on lines +8626 to +8670
StringInfoData tablet_attrs_buf;
bool first_attr = true;

initStringInfo(&tablet_attrs_buf);
appendStringInfoChar(&tablet_attrs_buf, '{');

if (tablet->sst_files_disk_size >= 0)
{
appendStringInfo(&tablet_attrs_buf, "\"sst_files_disk_size\": %lld",
(long long) tablet->sst_files_disk_size);
first_attr = false;
}

if (tablet->wal_files_disk_size >= 0)
{
if (!first_attr)
appendStringInfoString(&tablet_attrs_buf, ", ");
appendStringInfo(&tablet_attrs_buf, "\"wal_files_disk_size\": %lld",
(long long) tablet->wal_files_disk_size);
first_attr = false;
}

if (tablet->uncompressed_sst_files_disk_size >= 0)
{
if (!first_attr)
appendStringInfoString(&tablet_attrs_buf, ", ");
appendStringInfo(&tablet_attrs_buf,
"\"uncompressed_sst_files_disk_size\": %lld",
(long long) tablet->uncompressed_sst_files_disk_size);
first_attr = false;
}

if (first_attr)
{
/* No disk size data available — return NULL instead of {} */
nulls[9] = true;
}
else
{
appendStringInfoChar(&tablet_attrs_buf, '}');
values[9] = DirectFunctionCall1(jsonb_in,
CStringGetDatum(tablet_attrs_buf.data));
}
pfree(tablet_attrs_buf.data);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Manual JSON construction using appendStringInfo is generally discouraged in Postgres code due to potential escaping issues, although it is safe here since the keys are hardcoded and values are numeric. For future-proofing and better adherence to Postgres patterns, consider using the JsonbValue API or datum_to_jsonb if more complex attributes are added. Additionally, ensure that values and nulls arrays are properly initialized to zero/false before this block to avoid garbage data in other columns.

@anstormx
Copy link
Copy Markdown
Author

/Gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds a tablet_attrs JSONB column to the yb_tablet_metadata system view to expose tablet-level disk usage metrics, such as SST and WAL file sizes. The implementation features a best-effort enrichment process in the PgClientService that aggregates data from across the cluster using a TTL-based cache. Review feedback highlights the need to check for field presence in protobuf messages to prevent incorrect zero-value reporting during rolling upgrades. Additionally, suggestions were made to optimize lock management during metadata enrichment and to ensure explicit null-flag handling in the PostgreSQL backend for improved robustness.

Comment on lines +2812 to +2814
local_tablet.sst_files_disk_size(),
local_tablet.wal_files_disk_size(),
local_tablet.uncompressed_sst_files_disk_size());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The code merges disk sizes from the local tablet metadata without checking if the fields are actually present in the local_tablet protobuf. If the tserver is running an older version (e.g., during a rolling upgrade) that does not populate these fields, the accessors will return a default value of 0. This can lead to misleading metadata where a tablet is reported to have 0 bytes of disk usage when the actual size is unknown. It is better to check has_... and only merge if the data is available.

        merge_disk_sizes(local_tablet.tablet_id(),
                         local_tablet.has_sst_files_disk_size() ? local_tablet.sst_files_disk_size() : -1,
                         local_tablet.has_wal_files_disk_size() ? local_tablet.wal_files_disk_size() : -1,
                         local_tablet.has_uncompressed_sst_files_disk_size() ? local_tablet.uncompressed_sst_files_disk_size() : -1);

Comment on lines +2875 to +2877
tablet_status.sst_files_disk_size(),
tablet_status.wal_files_disk_size(),
tablet_status.uncompressed_sst_files_disk_size());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Similar to the local metadata collection, the remote tserver responses should be checked for field presence before merging. Relying on default values (0) for missing fields in the TabletStatusPB will result in incorrect '0 byte' reports for tablets on nodes that haven't been upgraded or are otherwise not reporting these metrics.

        merge_disk_sizes(tablet_status.tablet_id(),
                         tablet_status.has_sst_files_disk_size() ? tablet_status.sst_files_disk_size() : -1,
                         tablet_status.has_wal_files_disk_size() ? tablet_status.wal_files_disk_size() : -1,
                         tablet_status.has_uncompressed_sst_files_disk_size() ? tablet_status.uncompressed_sst_files_disk_size() : -1);

Comment on lines +2773 to +2788
for (int i = 0; i < resp->tablets_size(); ++i) {
auto* tablet = resp->mutable_tablets(i);
auto it = disk_sizes_cache_.find(tablet->tablet_id());
if (it != disk_sizes_cache_.end()) {
if (it->second.sst_files_disk_size >= 0) {
tablet->set_sst_files_disk_size(it->second.sst_files_disk_size);
}
if (it->second.wal_files_disk_size >= 0) {
tablet->set_wal_files_disk_size(it->second.wal_files_disk_size);
}
if (it->second.uncompressed_sst_files_disk_size >= 0) {
tablet->set_uncompressed_sst_files_disk_size(
it->second.uncompressed_sst_files_disk_size);
}
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Holding disk_sizes_cache_mutex_ while iterating over the entire response and performing map lookups can lead to lock contention, especially in large clusters with many tablets (e.g., 100k+). Since GetTabletsMetadata is a synchronous call that can be triggered by multiple clients or monitoring tools, this serialization point might impact performance. Consider copying the cache map or using a std::shared_ptr to a constant map to allow lock-free reads during the enrichment loop.

if (first_attr)
{
/* No disk size data available — return NULL instead of {} */
nulls[9] = true;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When first_attr is true, nulls[9] is explicitly set to true. However, in the else block where the JSON is successfully built, nulls[9] is not explicitly set to false. While the nulls array is likely zero-initialized at the start of the loop, explicitly setting it to false here improves robustness and consistency with other column handling logic in this file.

				nulls[9] = false;
				appendStringInfoChar(&tablet_attrs_buf, '}');
				values[9] = DirectFunctionCall1(jsonb_in,
												CStringGetDatum(tablet_attrs_buf.data));

@hari90
Copy link
Copy Markdown
Contributor

hari90 commented Mar 31, 2026

Trigger Jenkins

@hari90
Copy link
Copy Markdown
Contributor

hari90 commented Mar 31, 2026

Phorge diff synced: D51642
Commit: 7af25994

Jenkins build has been triggered. Results will be posted here once it completes.

@hari90
Copy link
Copy Markdown
Contributor

hari90 commented Apr 1, 2026

Jenkins build for commit 7af25994: Fail

Exceptions:


🚫 Failed Landing Requirements

'Multi-Build Test Failures' failed


🔨 DB Build/Test Job Summary

Build Total Passed Failed Failed After Retries
D51642-arm-mac14-clang-release 17 17 0 0
D51642-mac14-clang-release 2 2 0 0
D51642-arm-alma8-clang21-release 11108 10667 14 14
D51642-alma8-clang21-release 11110 10674 10 10
D51642-alma8-gcc12-fastdebug 11849 11372 14 14
D51642-ubuntu22.04-clang21-debug 2 2 0 0
D51642-alma9-clang21-asan 11726 10981 16 16
D51642-alma8-clang21-tsan 11638 9987 13 13

Full status

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants