[#30180] YSQL: Add attributes in yb_tablet_metadata.tablet_attrs#30861
[#30180] YSQL: Add attributes in yb_tablet_metadata.tablet_attrs#30861anstormx wants to merge 2 commits intoyugabyte:masterfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a tablet_attrs JSONB column to the yb_tablet_metadata system view, exposing disk size metrics for SST, WAL, and uncompressed SST files. The implementation involves updating the PostgreSQL system catalog, adding logic to aggregate these metrics from remote tablet servers, and providing regression tests. Feedback focuses on enhancing the robustness of the enrichment process through consistent error handling, using dynamic buffers for JSON construction to prevent potential truncation, and replacing hardcoded RPC timeouts with named constants.
94bf97d to
976b0f3
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
976b0f3 to
6a0cbcb
Compare
|
You have used all of your free Bugbot PR reviews. To receive reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial. |
|
/Gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a new tablet_attrs JSONB column to the yb_tablet_metadata view, providing detailed disk size information (SST, WAL, and uncompressed SST sizes) for tablets. The implementation includes backend changes to fetch this data from tablet servers on a best-effort basis, updates to the PostgreSQL catalog and FDW, and new regression tests. Feedback suggests enhancing the regression tests to include a non-negative check for the uncompressed SST size and relaxing a strict null pointer assertion in the C API to better handle cases with zero parameters.
6a0cbcb to
4313d34
Compare
|
You have used all of your free Bugbot PR reviews. To receive reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial. |
✅ Deploy Preview for infallible-bardeen-164bc9 ready!Built without sensitive environment variables
To edit notification comments on pull requests, go to your Netlify project configuration. |
4313d34 to
34c604a
Compare
|
You have used all of your free Bugbot PR reviews. To receive reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial. |
34c604a to
484c271
Compare
|
You have used all of your free Bugbot PR reviews. To receive reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial. |
|
/Gemini review |
There was a problem hiding this comment.
Code Review
This pull request adds a new tablet_attrs JSONB column to the yb_tablet_metadata system view, providing tablet-level disk usage statistics such as SST, WAL, and uncompressed SST file sizes. The implementation includes a best-effort enrichment mechanism in the pg_client_service to aggregate these metrics from both local and remote tablet servers. Feedback suggests that defaulting missing metrics to zero is misleading and recommends using sentinel values to dynamically omit unavailable data from the JSON output. Additionally, the 5-second RPC timeout for data enrichment should be re-evaluated as it may cause excessive latency for synchronous SQL queries.
484c271 to
eecee2c
Compare
|
You have used all of your free Bugbot PR reviews. To receive reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial. |
|
/Gemini review |
There was a problem hiding this comment.
Code Review
This pull request enhances the yb_tablet_metadata system view by adding a tablet_attrs JSONB column. This column provides tablet-level disk usage metrics, including SST file size, WAL file size, and uncompressed SST file size. The implementation includes backend changes to fetch this data from tablet servers via RPCs, updates to the PostgreSQL catalog and views, and new regression tests. I have no feedback to provide as there were no review comments to evaluate.
|
@hari90 please merge if no issues are present |
|
You have used all of your free Bugbot PR reviews. To receive reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial. |
|
Trigger Jenkins |
|
Phorge diff synced: D51642 Jenkins build has been triggered. Results will be posted here once it completes. |
75bac24 to
f427d48
Compare
|
/Gemini review |
There was a problem hiding this comment.
Code Review
This pull request adds a tablet_attrs JSONB column to the yb_tablet_metadata system view to expose tablet-level disk size metrics, including SST and WAL file sizes. The implementation uses a cluster-wide fan-out of RPCs to gather data from TServers, which raised concerns about performance and scalability. Feedback suggests implementing caching or master-side aggregation to mitigate these issues. Additionally, the join logic in the migration script was identified as ambiguous and should be updated to use OIDs to prevent duplicate rows. There is also a suggestion to ensure consistent disk size reporting by prioritizing leader replicas or using aggregates when multiple replicas are present.
src/yb/yql/pgwrapper/ysql_migrations/V100__30860__yb_tablet_metadata_attrs.sql
Outdated
Show resolved
Hide resolved
|
/Gemini review |
There was a problem hiding this comment.
Code Review
This pull request adds a new tablet_attrs JSONB column to the yb_tablet_metadata system view to provide disk size information, including SST and WAL file sizes. The implementation involves updating the PostgreSQL catalog, modifying the yb_get_tablet_metadata function, and introducing a TTL-based caching mechanism in the TServer's PgClientService to aggregate these metrics from across the cluster. Review feedback highlights critical concurrency issues where a mutex is held during synchronous network RPCs, potentially leading to thread starvation, and suggests improving cache availability by using a local map for updates instead of clearing the shared cache during refreshes.
562e725 to
414819c
Compare
|
/Gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a new tablet_attrs JSONB column to the yb_tablet_metadata system view, providing disk size metrics such as SST, WAL, and uncompressed SST file sizes. The implementation involves enriching tablet metadata by querying TServers and utilizing a TTL-based cache in PgClientService to manage RPC overhead. Review feedback identified a potential 'thundering herd' performance issue during cache expiration and recommended using sentinel values (e.g., -1) to accurately distinguish between zero and unknown disk sizes in the final JSON output.
414819c to
de0d98e
Compare
|
/Gemini review |
There was a problem hiding this comment.
Code Review
This pull request enhances the yb_tablet_metadata system view by adding a tablet_attrs JSONB column for disk size metrics and updating the join logic with pg_class to use OIDs derived from tablet object UUIDs. The backend implementation introduces a TTL-based cache in the tablet server to aggregate SST and WAL sizes from cluster nodes via remote RPCs. Review feedback recommends strengthening thundering herd protection for the cache when the TTL is disabled, optimizing remote proxy initialization to improve efficiency, and utilizing standard Postgres JSON APIs instead of manual string concatenation for more robust and idiomatic attribute construction.
src/yb/tserver/pg_client_service.cc
Outdated
| if (now >= disk_sizes_cache_expiry_) { | ||
| needs_refresh = true; | ||
| // Push expiry forward so other threads use stale cache while we refresh. | ||
| disk_sizes_cache_expiry_ = now + std::chrono::seconds( | ||
| FLAGS_tablet_disk_sizes_cache_ttl_sec); | ||
| } |
There was a problem hiding this comment.
The thundering herd protection logic is weak when FLAGS_tablet_disk_sizes_cache_ttl_sec is set to 0. If the TTL is 0, disk_sizes_cache_expiry_ will always be set to now, causing every concurrent request to trigger a new CollectDiskSizes() call, which fans out RPCs to all tservers. This could overwhelm the cluster if many users query the yb_tablet_metadata view simultaneously. Consider enforcing a minimum internal TTL for the refresh logic or using a separate boolean flag to track if a refresh is already in flight.
| auto init_status = remote_tserver->InitProxy(&client()); | ||
| if (!init_status.ok()) { | ||
| LOG(WARNING) << "Failed to init proxy for tserver " << remote_tserver->permanent_uuid() | ||
| << " during disk size enrichment: " << init_status; | ||
| continue; | ||
| } |
There was a problem hiding this comment.
Creating and initializing a proxy for every remote tserver on every cache refresh might be inefficient. While InitProxy usually checks if a proxy already exists, it's generally better to leverage a centralized proxy cache if available in this context, or ensure that remote_tserver already has a warm proxy. Also, consider if ListTabletsAsync should be restricted to only return the necessary metadata to reduce payload size.
| StringInfoData tablet_attrs_buf; | ||
| bool first_attr = true; | ||
|
|
||
| initStringInfo(&tablet_attrs_buf); | ||
| appendStringInfoChar(&tablet_attrs_buf, '{'); | ||
|
|
||
| if (tablet->sst_files_disk_size >= 0) | ||
| { | ||
| appendStringInfo(&tablet_attrs_buf, "\"sst_files_disk_size\": %lld", | ||
| (long long) tablet->sst_files_disk_size); | ||
| first_attr = false; | ||
| } | ||
|
|
||
| if (tablet->wal_files_disk_size >= 0) | ||
| { | ||
| if (!first_attr) | ||
| appendStringInfoString(&tablet_attrs_buf, ", "); | ||
| appendStringInfo(&tablet_attrs_buf, "\"wal_files_disk_size\": %lld", | ||
| (long long) tablet->wal_files_disk_size); | ||
| first_attr = false; | ||
| } | ||
|
|
||
| if (tablet->uncompressed_sst_files_disk_size >= 0) | ||
| { | ||
| if (!first_attr) | ||
| appendStringInfoString(&tablet_attrs_buf, ", "); | ||
| appendStringInfo(&tablet_attrs_buf, | ||
| "\"uncompressed_sst_files_disk_size\": %lld", | ||
| (long long) tablet->uncompressed_sst_files_disk_size); | ||
| first_attr = false; | ||
| } | ||
|
|
||
| if (first_attr) | ||
| { | ||
| /* No disk size data available — return NULL instead of {} */ | ||
| nulls[9] = true; | ||
| } | ||
| else | ||
| { | ||
| appendStringInfoChar(&tablet_attrs_buf, '}'); | ||
| values[9] = DirectFunctionCall1(jsonb_in, | ||
| CStringGetDatum(tablet_attrs_buf.data)); | ||
| } | ||
| pfree(tablet_attrs_buf.data); | ||
| } |
There was a problem hiding this comment.
Manual JSON construction using appendStringInfo is generally discouraged in Postgres code due to potential escaping issues, although it is safe here since the keys are hardcoded and values are numeric. For future-proofing and better adherence to Postgres patterns, consider using the JsonbValue API or datum_to_jsonb if more complex attributes are added. Additionally, ensure that values and nulls arrays are properly initialized to zero/false before this block to avoid garbage data in other columns.
|
/Gemini review |
There was a problem hiding this comment.
Code Review
This pull request adds a tablet_attrs JSONB column to the yb_tablet_metadata system view to expose tablet-level disk usage metrics, such as SST and WAL file sizes. The implementation features a best-effort enrichment process in the PgClientService that aggregates data from across the cluster using a TTL-based cache. Review feedback highlights the need to check for field presence in protobuf messages to prevent incorrect zero-value reporting during rolling upgrades. Additionally, suggestions were made to optimize lock management during metadata enrichment and to ensure explicit null-flag handling in the PostgreSQL backend for improved robustness.
| local_tablet.sst_files_disk_size(), | ||
| local_tablet.wal_files_disk_size(), | ||
| local_tablet.uncompressed_sst_files_disk_size()); |
There was a problem hiding this comment.
The code merges disk sizes from the local tablet metadata without checking if the fields are actually present in the local_tablet protobuf. If the tserver is running an older version (e.g., during a rolling upgrade) that does not populate these fields, the accessors will return a default value of 0. This can lead to misleading metadata where a tablet is reported to have 0 bytes of disk usage when the actual size is unknown. It is better to check has_... and only merge if the data is available.
merge_disk_sizes(local_tablet.tablet_id(),
local_tablet.has_sst_files_disk_size() ? local_tablet.sst_files_disk_size() : -1,
local_tablet.has_wal_files_disk_size() ? local_tablet.wal_files_disk_size() : -1,
local_tablet.has_uncompressed_sst_files_disk_size() ? local_tablet.uncompressed_sst_files_disk_size() : -1);| tablet_status.sst_files_disk_size(), | ||
| tablet_status.wal_files_disk_size(), | ||
| tablet_status.uncompressed_sst_files_disk_size()); |
There was a problem hiding this comment.
Similar to the local metadata collection, the remote tserver responses should be checked for field presence before merging. Relying on default values (0) for missing fields in the TabletStatusPB will result in incorrect '0 byte' reports for tablets on nodes that haven't been upgraded or are otherwise not reporting these metrics.
merge_disk_sizes(tablet_status.tablet_id(),
tablet_status.has_sst_files_disk_size() ? tablet_status.sst_files_disk_size() : -1,
tablet_status.has_wal_files_disk_size() ? tablet_status.wal_files_disk_size() : -1,
tablet_status.has_uncompressed_sst_files_disk_size() ? tablet_status.uncompressed_sst_files_disk_size() : -1);| for (int i = 0; i < resp->tablets_size(); ++i) { | ||
| auto* tablet = resp->mutable_tablets(i); | ||
| auto it = disk_sizes_cache_.find(tablet->tablet_id()); | ||
| if (it != disk_sizes_cache_.end()) { | ||
| if (it->second.sst_files_disk_size >= 0) { | ||
| tablet->set_sst_files_disk_size(it->second.sst_files_disk_size); | ||
| } | ||
| if (it->second.wal_files_disk_size >= 0) { | ||
| tablet->set_wal_files_disk_size(it->second.wal_files_disk_size); | ||
| } | ||
| if (it->second.uncompressed_sst_files_disk_size >= 0) { | ||
| tablet->set_uncompressed_sst_files_disk_size( | ||
| it->second.uncompressed_sst_files_disk_size); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Holding disk_sizes_cache_mutex_ while iterating over the entire response and performing map lookups can lead to lock contention, especially in large clusters with many tablets (e.g., 100k+). Since GetTabletsMetadata is a synchronous call that can be triggered by multiple clients or monitoring tools, this serialization point might impact performance. Consider copying the cache map or using a std::shared_ptr to a constant map to allow lock-free reads during the enrichment loop.
| if (first_attr) | ||
| { | ||
| /* No disk size data available — return NULL instead of {} */ | ||
| nulls[9] = true; |
There was a problem hiding this comment.
When first_attr is true, nulls[9] is explicitly set to true. However, in the else block where the JSON is successfully built, nulls[9] is not explicitly set to false. While the nulls array is likely zero-initialized at the start of the loop, explicitly setting it to false here improves robustness and consistency with other column handling logic in this file.
nulls[9] = false;
appendStringInfoChar(&tablet_attrs_buf, '}');
values[9] = DirectFunctionCall1(jsonb_in,
CStringGetDatum(tablet_attrs_buf.data));|
Trigger Jenkins |
|
Phorge diff synced: D51642 Jenkins build has been triggered. Results will be posted here once it completes. |
|
❌ Jenkins build for commit Exceptions:
🚫 Failed Landing Requirements'Multi-Build Test Failures' failed
🔨 DB Build/Test Job Summary
|

Note
Medium Risk
Adds a new
jsonboutput column backed by tserver-side fan-out RPCs to enrich tablet metadata, which could impact latency/robustness of the metadata call despite best-effort error handling. Changes touch cross-layer plumbing (tserver RPC -> pggate structs -> Postgres SRF/view) and require careful compatibility with existing callers.Overview
Exposes new per-tablet disk size metadata via
yb_tablet_metadataby adding atablet_attrsjsonbcolumn (also returned fromyb_get_tablet_metadata).Implements end-to-end plumbing to populate these attributes: the tserver
TabletsMetadataRPC now best-effort enriches master tablet metadata by fanning outListTabletsto all tservers (with a 5s timeout) to fetch SST/WAL/uncompressed SST sizes, and the Postgres-side SRF buildstablet_attrsJSONB from the returned sizes.Adds a migration to update
pg_proc/recreate the view, and extends regression tests to validate expected JSON keys and non-negative values.Written by Cursor Bugbot for commit 976b0f3. This will update automatically on new commits. Configure here.
Phorge: D51642