[YSQL] Fix pagination over hash-sharded tables with yb_hash_code() = const#30651

Open
kirs wants to merge 2 commits into yugabyte:master from kirs:kirs/fix-hash-code-pagination-30524

Conversation


@kirs kirs commented Mar 10, 2026

When yb_hash_code(hash_cols) = constant pins a scan to a single hash bucket, DocDB stores rows in (hash_cols, range_cols) order within that bucket. Three independent code paths prevented exploiting this ordering for pagination:

  1. pathkeys.c (build_index_pathkeys): Hash columns were never allowed to produce sort pathkeys, forcing a Sort node even when ORDER BY matched the index key order within a pinned bucket. Fix: when yb_hash_code equality is detected, pass is_hash_index=false so hash columns generate sort pathkeys.

  2. indxpath.c (match_rowcompare_to_indexcol): ROW comparisons like (h, r) > (?, ?) were unconditionally rejected when the first column was a hash column. Fix: allow through when yb_hash_code equality pins the scan to a single bucket.

  3. yb_scan.c (YbBindRowComparisonKeys): Two guards prevented binding ROW comparison keys to DocDB when any hash column was present in the index or the comparison subkeys. Fix: relax both guards when a yb_hash_code equality scan key is present.

Together these fixes enable efficient keyset pagination:

  SELECT * FROM t
  WHERE yb_hash_code(h) = 10 AND (h, r) > (:last_h, :last_r)
  ORDER BY h, r LIMIT 1000;

Previously this required a full sort of the hash bucket on every page and scanned from the start of the bucket. Now it seeks directly to the cursor position.
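
The cursor loop that this enables can be simulated outside the database. Below is a minimal Python sketch (an in-memory stand-in for one pinned hash bucket; `fetch_page` is a hypothetical name, not part of the patch) showing why `(h, r) > (:last_h, :last_r)` with `ORDER BY h, r LIMIT n` resumes exactly where the previous page ended:

```python
# In-memory sketch of keyset pagination within one pinned hash bucket.
# DocDB stores the bucket's rows in (h, r) order, so each page resumes
# with a tuple comparison instead of re-sorting and re-scanning.

def fetch_page(bucket_rows, last_key, limit):
    """bucket_rows: rows already in (h, r) order, as DocDB stores them.
    last_key: (h, r) cursor from the previous page, or None for page one."""
    if last_key is None:
        matching = bucket_rows
    else:
        # (h, r) > (:last_h, :last_r) -- lexicographic tuple comparison,
        # the same semantics as a SQL row comparison.
        matching = [row for row in bucket_rows if row > last_key]
    return matching[:limit]

bucket = sorted((h, r) for h in (1, 2) for r in range(5))  # one bucket's rows
page1 = fetch_page(bucket, None, 4)
page2 = fetch_page(bucket, page1[-1], 4)

assert page1 == [(1, 0), (1, 1), (1, 2), (1, 3)]
assert page2 == [(1, 4), (2, 0), (2, 1), (2, 2)]
```

With the patch, the tuple filter becomes an index lower bound (a seek), so each page costs O(limit) rather than a scan-and-sort of the whole bucket.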

Fixes: #30524

Test Plan: New regression test yb.orig.hash_code_pagination covering:

  • Sort elimination with yb_hash_code equality + ORDER BY
  • ROW comparison pushdown as Index Cond
  • Negative tests (range predicates, arity mismatches)
  • Multi-column hash key support

Note

Medium Risk
Planner and scan-binding logic for YB hash-sharded primary keys is relaxed based on detecting yb_hash_code(...) = const. This can change query plans and index-bound pushdown behavior, and may impact correctness if the pinning detection misfires.

Overview
Improves keyset pagination and ORDER BY planning on hash-sharded (LSM) primary keys when the scan is pinned to a single hash bucket via yb_hash_code(hash_cols) = const.

This teaches the optimizer to treat hash columns as providing valid ordering/pathkeys in that pinned-bucket case, allows ROW(...) range comparisons to match and be pushed down as index bounds, and updates YbBindRowComparisonKeys to permit binding row-comparison bounds involving hash columns while avoiding IS NOT NULL binds on hash/partition columns.

Adds a new regression test yb.orig.hash_code_pagination (and schedule entry) covering sort elimination, row-compare pushdown, multi-column hash keys, and negative cases.

Written by Cursor Bugbot for commit 321541d.



netlify bot commented Mar 11, 2026

Deploy Preview for infallible-bardeen-164bc9 ready!

Built without sensitive environment variables

Name Link
🔨 Latest commit 321541d
🔍 Latest deploy log https://app.netlify.com/projects/infallible-bardeen-164bc9/deploys/69b1b689fda7ff000827d492
😎 Deploy Preview https://deploy-preview-30651--infallible-bardeen-164bc9.netlify.app


kirs commented Mar 11, 2026

I manually verified this on a VM against a built branch binary:

DROP TABLE IF EXISTS hk_pagination;
CREATE TABLE hk_pagination(
  h int,
  r1 int,
  r2 int,
  v text,
  PRIMARY KEY(h HASH, r1 ASC, r2 ASC)
);

INSERT INTO hk_pagination
SELECT i % 5, i, i * 10, 'val' || i
FROM generate_series(1, 50) AS i;

SELECT h, yb_hash_code(h), count(*)
FROM hk_pagination
GROUP BY h, yb_hash_code(h)
ORDER BY h;

Observed buckets:

 h | yb_hash_code | count
---+--------------+-------
 0 |        47650 |    10
 1 |         4624 |    10
 2 |        49348 |    10
 3 |        64672 |    10
 4 |        40623 |    10

Using bucket 4624 (h = 1), planner behavior matches the intent of the patch:

EXPLAIN (COSTS OFF)
SELECT *
FROM hk_pagination
WHERE yb_hash_code(h) = 4624
ORDER BY h, r1, r2
LIMIT 10;

Limit
  ->  Index Scan using hk_pagination_pkey on hk_pagination
        Index Cond: (yb_hash_code(h) = 4624)

EXPLAIN (COSTS OFF)
SELECT *
FROM hk_pagination
WHERE yb_hash_code(h) = 4624
  AND (h, r1) > (1, 10)
ORDER BY h, r1, r2
LIMIT 10;

Limit
  ->  Index Scan using hk_pagination_pkey on hk_pagination
        Index Cond: ((ROW(h, r1) > ROW(1, 10)) AND (yb_hash_code(h) = 4624))

EXPLAIN (COSTS OFF)
SELECT *
FROM hk_pagination
WHERE yb_hash_code(h) = 4624
  AND (h, r1, r2) > (1, 11, 110)
ORDER BY h, r1, r2
LIMIT 10;

Limit
  ->  Index Scan using hk_pagination_pkey on hk_pagination
        Index Cond: ((ROW(h, r1, r2) > ROW(1, 11, 110)) AND (yb_hash_code(h) = 4624))

Negative checks also look right:

EXPLAIN (COSTS OFF)
SELECT *
FROM hk_pagination
WHERE yb_hash_code(h) < 4624
ORDER BY h, r1, r2
LIMIT 10;

Limit
  ->  Sort
        Sort Key: h, r1, r2
        ->  Index Scan using hk_pagination_pkey on hk_pagination
              Index Cond: (yb_hash_code(h) < 4624)

EXPLAIN (COSTS OFF)
SELECT *
FROM hk_pagination
WHERE yb_hash_code(h) = 4624
ORDER BY r1, r2
LIMIT 10;

Limit
  ->  Sort
        Sort Key: r1, r2
        ->  Index Scan using hk_pagination_pkey on hk_pagination
              Index Cond: (yb_hash_code(h) = 4624)

EXPLAIN (COSTS OFF)
SELECT *
FROM hk_pagination
WHERE yb_hash_code(h) < 4624
  AND (h, r1) > (1, 10)
LIMIT 10;

Limit
  ->  Index Scan using hk_pagination_pkey on hk_pagination
        Index Cond: (yb_hash_code(h) < 4624)
        Filter: (ROW(h, r1) > ROW(1, 10))

I also hit an execution-time crash on the initial version of the patch (before the follow-up yb_scan.c adjustment): the EXPLAINs above succeeded, but the actual SELECT with row comparison aborted the backend because the row-comparison bind path synthesized IS NOT NULL on the first RC column, which is now allowed to be a hash/partition key in the pinned-bucket case. Pggate rejects BindColumnCondIsNotNull on partition columns.

After applying the small yb_scan.c follow-up to skip that synthetic IS NOT NULL for hash columns, the actual SELECTs also executed successfully.
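
The follow-up guard can be modeled in a few lines of Python (a hedged sketch; the real code is C in yb_scan.c and operates on scan-key structures, not booleans):

```python
# Model of the yb_scan.c follow-up: the row-comparison bind path marks
# the first RC subkey column IS NOT NULL, but pggate rejects that bind
# on partition (hash) columns, so the synthetic null-check is skipped
# for them. For a primary-key hash column the check is redundant anyway,
# since a hash primary key cannot hold NULL.

def synthesize_is_not_null(subkey_index, is_hash_column):
    """True when the bind path should add a synthetic IS NOT NULL."""
    return subkey_index == 0 and not is_hash_column

# (h, r1, r2) > (...) with h as the hash column: no IS NOT NULL on h,
# so the bind no longer trips pggate's partition-column check.
assert not synthesize_is_not_null(0, True)
# A leading range column still gets the null check.
assert synthesize_is_not_null(0, False)
# Non-leading subkeys never get it.
assert not synthesize_is_not_null(1, False)
```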

* IS NOT NULL on partition columns, and for primary-key hash
* columns the null check is redundant anyway.
*/
is_not_null[att_idx] |= (subkey_index == 0 &&
Author

Manual repro context for this guard: after allowing ROW-comparison pushdown on hash columns in the pinned-bucket case, planning was fine but actual execution crashed in pggate. The RC bind path still synthesizes IS NOT NULL on the first RC column; for queries like WHERE yb_hash_code(h) = 4624 AND (h, r1, r2) > (...) ORDER BY h, r1, r2, that first column is now the hash/partition key h. Pggate rejects BindColumnCondIsNotNull on partition columns (pg_dml_read.cc:765, CHECK(!col.is_partition())), so skipping the synthetic null-check here for hash columns fixes the crash while keeping the row-bound pushdown.

Contributor

commit cd0d858 never explains why IS NOT NULL was not supported for hash. Would have been good to know that.

Author

Do you see this as a requirement for this PR or as something that can be done as a follow-up? Sorry, I have very little context on cd0d858.

@kirs kirs marked this pull request as ready for review March 11, 2026 18:41


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


return true;
}
return false;
}

Executor hash-code check omits arity validation present in planner

Low Severity

yb_has_hash_code_equality_scan_key returns true for any BTEqualStrategyNumber hash-code scan key without validating that the yb_hash_code call's arity matches index->rd_index->indnkeyatts hash column count. The planner-side counterparts (yb_has_hash_code_equality_qual) both enforce an arity check (list_length(funcexpr->args) != index->nhashcolumns). This inconsistency means the executor guard is weaker than the planner guard, reducing defense-in-depth for multi-column hash keys.
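
The combined check can be sketched in Python (hypothetical names; the real planner code compares parse-tree expressions in C, not strings). It shows why arity alone, as in the executor guard, is weaker than also matching the argument identity, the point raised later in review:

```python
# Sketch of the pinning check: a yb_hash_code(...) = const qual pins an
# index's bucket only if the call's arguments are exactly that index's
# hash key columns, in order. Checking arity alone would accept a
# yb_hash_code call over a different column set of the same length.

def pins_bucket(qual_args, index_hash_cols):
    """qual_args: column names passed to yb_hash_code in the qual.
    index_hash_cols: the index's hash key columns, in key order."""
    # Arity check, as in the planner-side yb_has_hash_code_equality_qual ...
    if len(qual_args) != len(index_hash_cols):
        return False
    # ... plus the argument-identity check the executor guard lacks.
    return list(qual_args) == list(index_hash_cols)

assert pins_bucket(["h"], ["h"])
assert not pins_bucket(["h"], ["r1"])          # same arity, different column
assert not pins_bucket(["h1"], ["h1", "h2"])   # arity mismatch
```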



cursor bot commented Mar 12, 2026

You have used all of your free Bugbot PR reviews.

To receive reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.


@mtakahar mtakahar left a comment


Great work, and thank you so much for putting this together!

I've left a couple of inline comments: one correctness issue, and a follow-up on what the cursor bot pointed out (code duplication).


if (list_length(funcexpr->args) != index->nhashcolumns)
continue;


Since index->rel->baserestrictinfo contains all the predicates on the table, not only the ones relevant to this index, I think the yb_hash_code arguments need to match the ones in the expression-based index key item expression. Otherwise the code can mix up another index on yb_hash_code over a different column, if such an index exists.

Example:

CREATE TABLE hk_pagination(
    h int,
    r1 int,
    r2 int,
    v text,
    PRIMARY KEY(h HASH, r1 ASC, r2 ASC)
);
INSERT INTO hk_pagination SELECT i % 5, i, i * 10, 'val' || i FROM generate_series(1, 20) i;
create index hk_pagination_r1_r2_idx on hk_pagination (r1 hash, r2 asc);
/*+ IndexScan(hk_pagination hk_pagination_r1_r2_idx) */
EXPLAIN (COSTS OFF) SELECT * FROM hk_pagination
    WHERE yb_hash_code(h) = 49348
    ORDER BY r1, r2
    LIMIT 2;
                           QUERY PLAN                            
-----------------------------------------------------------------
 Limit
   ->  Index Scan using hk_pagination_r1_r2_idx on hk_pagination
         Filter: (yb_hash_code(h) = 49348)
(3 rows)

/*+ IndexScan(hk_pagination hk_pagination_r1_r2_idx) */
SELECT * FROM hk_pagination
    WHERE yb_hash_code(h) = 49348
    ORDER BY r1, r2
    LIMIT 2;
 h | r1 | r2  |   v   
---+----+-----+-------
 2 | 12 | 120 | val12
 2 |  7 |  70 | val7
(2 rows)

/*+ SeqScan(hk_pagination) */
EXPLAIN (COSTS OFF) SELECT * FROM hk_pagination
    WHERE yb_hash_code(h) = 49348
    ORDER BY r1, r2
    LIMIT 2;
                   QUERY PLAN                    
-------------------------------------------------
 Limit
   ->  Sort
         Sort Key: r1, r2
         ->  Seq Scan on hk_pagination
               Filter: (yb_hash_code(h) = 49348)
(5 rows)

/*+ SeqScan(hk_pagination) */
SELECT * FROM hk_pagination
    WHERE yb_hash_code(h) = 49348
    ORDER BY r1, r2
LIMIT 2;
 h | r1 | r2 |  v   
---+----+----+------
 2 |  2 | 20 | val2
 2 |  7 | 70 | val7
(2 rows)


jasonyb commented Mar 12, 2026

Based on a run off of commit 321541d, 4 test failures:

TestPgRegressDml:

ybd fastdebug --java-test 'org.yb.pgsql.TestPgRegressDml#schedule'

01:13:40.324 (Time-limited test) [WARN - org.yb.pgsql.PgRegressRunner.run(PgRegressRunner.java:136)] Side-by-side diff between expected output and actual output of yb.orig.dml_scankey:
--                                                                              --
-- Test index scan using both hash code and row comparison filter.              -- Test index scan using both hash code and row comparison filter.
--                                                                              --
CREATE TABLE hashcode_rowfilter (i int, j int, k int);                          CREATE TABLE hashcode_rowfilter (i int, j int, k int);
CREATE INDEX ON hashcode_rowfilter (k, j, i);                                   CREATE INDEX ON hashcode_rowfilter (k, j, i);
INSERT INTO hashcode_rowfilter VALUES (2, 3, 4), (1, 2, 4), (3, 4, 5);          INSERT INTO hashcode_rowfilter VALUES (2, 3, 4), (1, 2, 4), (3, 4, 5);
\set filter 'yb_hash_code(k) = yb_hash_code(4) AND row(j, i) > row(2, 3)'       \set filter 'yb_hash_code(k) = yb_hash_code(4) AND row(j, i) > row(2, 3)'
/*+IndexScan(hashcode_rowfilter)*/                                              /*+IndexScan(hashcode_rowfilter)*/
SELECT * FROM hashcode_rowfilter WHERE :filter;                                 SELECT * FROM hashcode_rowfilter WHERE :filter;
 i | j | k                                                                    | server closed the connection unexpectedly
---+---+---                                                                   |         This probably means the server terminated abnormally
 2 | 3 | 4                                                                    |         before or while processing the request.
(1 row)                                                                       | connection to server was lost

TestPgRegressIndex:

ybd fastdebug --java-test 'org.yb.pgsql.TestPgRegressIndex#schedule'

diff -U3 ${TEST_TMPDIR}/pgregress_output/yb_index_schedule/expected/yb.orig.hash_code_pagination.out ${TEST_TMPDIR}/pgregress_output/yb_index_schedule/results/yb.orig.hash_code_pagination.out
--- ${TEST_TMPDIR}/pgregress_output/yb_index_schedule/expected/yb.orig.hash_code_pagination.out	2026-03-12 01:12:21.029403367 +0000
+++ ${TEST_TMPDIR}/pgregress_output/yb_index_schedule/results/yb.orig.hash_code_pagination.out	2026-03-12 01:12:20.963403454 +0000
@@ -21,8 +21,8 @@
     WHERE yb_hash_code(h) = 1000
     ORDER BY h, r1, r2
     LIMIT 10;
-                       QUERY PLAN                       
---------------------------------------------------------
+                         QUERY PLAN                         
+------------------------------------------------------------
  Limit
    ->  Index Scan using hk_pagination_pkey on hk_pagination
          Index Cond: (yb_hash_code(h) = 1000)
@@ -33,8 +33,8 @@
     WHERE yb_hash_code(h) < 1000
     ORDER BY h, r1, r2
     LIMIT 10;
-                          QUERY PLAN                          
---------------------------------------------------------------
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Limit
    ->  Sort
          Sort Key: h, r1, r2
@@ -47,8 +47,8 @@
     WHERE yb_hash_code(h) = 1000
     ORDER BY r1, r2
     LIMIT 10;
-                          QUERY PLAN                          
---------------------------------------------------------------
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Limit
    ->  Sort
          Sort Key: r1, r2
@@ -66,11 +66,11 @@
     WHERE yb_hash_code(h) = 1000
       AND (h, r1) > (2, 10)
     LIMIT 10;
-                              QUERY PLAN                              
-----------------------------------------------------------------------
+                                  QUERY PLAN                                  
+------------------------------------------------------------------------------
  Limit
    ->  Index Scan using hk_pagination_pkey on hk_pagination
-         Index Cond: ((yb_hash_code(h) = 1000) AND (ROW(h, r1) > ROW(2, 10)))
+         Index Cond: ((ROW(h, r1) > ROW(2, 10)) AND (yb_hash_code(h) = 1000))
 (3 rows)
 
 -- Full keyset pagination pattern: hash_code equality + ROW comparison + ORDER BY + LIMIT.
@@ -79,11 +79,11 @@
       AND (h, r1, r2) > (2, 10, 100)
     ORDER BY h, r1, r2
     LIMIT 10;
-                                   QUERY PLAN                                   
---------------------------------------------------------------------------------
+                                      QUERY PLAN                                       
+---------------------------------------------------------------------------------------
  Limit
    ->  Index Scan using hk_pagination_pkey on hk_pagination
-         Index Cond: ((yb_hash_code(h) = 1000) AND (ROW(h, r1, r2) > ROW(2, 10, 100)))
+         Index Cond: ((ROW(h, r1, r2) > ROW(2, 10, 100)) AND (yb_hash_code(h) = 1000))
 (3 rows)
 
 -- Negative: without yb_hash_code equality, ROW comparison on hash column
@@ -92,8 +92,8 @@
     WHERE yb_hash_code(h) < 1000
       AND (h, r1) > (2, 10)
     LIMIT 10;
-                          QUERY PLAN                          
---------------------------------------------------------------
+                         QUERY PLAN                         
+------------------------------------------------------------
  Limit
    ->  Index Scan using hk_pagination_pkey on hk_pagination
          Index Cond: (yb_hash_code(h) < 1000)
@@ -138,8 +138,8 @@
     WHERE yb_hash_code(h1, h2) = 500
     ORDER BY h1, h2, r
     LIMIT 10;
-                      QUERY PLAN                      
-------------------------------------------------------
+                    QUERY PLAN                    
+--------------------------------------------------
  Limit
    ->  Index Scan using hk_multi_pkey on hk_multi
          Index Cond: (yb_hash_code(h1, h2) = 500)
@@ -150,8 +150,8 @@
     WHERE yb_hash_code(h1) = 500
     ORDER BY h1, h2, r
     LIMIT 10;
-                       QUERY PLAN                       
---------------------------------------------------------
+                   QUERY PLAN                   
+------------------------------------------------
  Limit
    ->  Sort
          Sort Key: h1, h2, r

TestPgRegressPlanner:

ybd fastdebug --java-test 'org.yb.pgsql.TestPgRegressPlanner#testPgRegressPlanner'

diff -U3 ${TEST_TMPDIR}/pgregress_output/yb_planner_serial_schedule/expected/yb.orig.planner_joins.out ${TEST_TMPDIR}/pgregress_output/yb_planner_serial_schedule/results/yb.orig.planner_joins.out
--- ${TEST_TMPDIR}/pgregress_output/yb_planner_serial_schedule/expected/yb.orig.planner_joins.out	2026-03-12 01:11:33.933313581 +0000
+++ ${TEST_TMPDIR}/pgregress_output/yb_planner_serial_schedule/results/yb.orig.planner_joins.out	2026-03-12 01:11:33.887313590 +0000
@@ -199,15 +199,13 @@
          Join Filter: (t1.h = t2.h)
          ->  Merge Join
                Merge Cond: (t1.h = t3.h)
-               ->  Sort
-                     Sort Key: t1.h
-                     ->  Index Scan using t1_pkey on t1
-                           Index Cond: ((yb_hash_code(h) = 4624) AND (r = 2))
+               ->  Index Scan using t1_pkey on t1
+                     Index Cond: ((yb_hash_code(h) = 4624) AND (r = 2))
                ->  Index Scan using t3_pkey on t3
                      Index Cond: (r = 2)
          ->  Index Scan using t2_pkey on t2
                Index Cond: ((h = ANY (ARRAY[t3.h, $1, $2, ..., $1023])) AND (r = 2))
-(14 rows)
+(12 rows)
 
 SELECT *
 FROM t1

TestPgRegressProc:

ybd fastdebug --java-test 'org.yb.pgsql.TestPgRegressProc#testPgRegressProc'

diff -U3 ${TEST_TMPDIR}/pgregress_output/yb_proc_schedule/expected/yb.orig.yb_hash_code.out ${TEST_TMPDIR}/pgregress_output/yb_proc_schedule/results/yb.orig.yb_hash_code.out
--- ${TEST_TMPDIR}/pgregress_output/yb_proc_schedule/expected/yb.orig.yb_hash_code.out	2026-03-12 01:05:58.018393145 +0000
+++ ${TEST_TMPDIR}/pgregress_output/yb_proc_schedule/results/yb.orig.yb_hash_code.out	2026-03-12 01:05:57.968393184 +0000
@@ -1401,13 +1401,11 @@
 -- Try a 1 element IN with row constructor. Should get an index scan since this turns into an equality.
 \set query 'SELECT i, j, yb_hash_code(i) hash_code_i, yb_hash_code(1) hash_code_1 FROM tt WHERE row(j, yb_hash_code(i)) IN (row(1, yb_hash_code(1))) ORDER BY 1, 2'
 :explain2run2
-                         QUERY PLAN                         
-------------------------------------------------------------
- Sort
-   Sort Key: i
-   ->  Index Only Scan using tt_i_j_idx on tt
-         Index Cond: ((yb_hash_code(i) = 4624) AND (j = 1))
-(4 rows)
+                      QUERY PLAN                      
+------------------------------------------------------
+ Index Only Scan using tt_i_j_idx on tt
+   Index Cond: ((yb_hash_code(i) = 4624) AND (j = 1))
+(2 rows)
 
                 QUERY PLAN                
 ------------------------------------------

@kirs kirs force-pushed the kirs/fix-hash-code-pagination-30524 branch from 50b7e14 to 1634cdc on March 13, 2026 05:50

@kirs kirs force-pushed the kirs/fix-hash-code-pagination-30524 branch from 1634cdc to 1909a29 on March 13, 2026 05:53

bool has_hash_code_eq = yb_has_hash_code_equality_scan_key(ybScan);
int hash_cols_covered = 0;

for (int i = 0; i < index->rd_index->indnkeyatts; i++)
Contributor

nit: really tiny optimization i++ to ++i but maybe it just matches the rest of the code.

if (opexpr->opno != Int4EqualOperator)
continue;

if (list_length(opexpr->args) != 2)
Contributor

nit: ideally this should be like an assert in debug build.


@jasonyb jasonyb left a comment


Don't want to expand the scope too much, but do you mind checking what happens when ybctid = const? That case is very similar to yb_hash_code() = const.

break; /* no good, volatile comparison value */

/*
* The Var side can match any key column of the index.
Contributor

note: it appears this planner code doesn't care about the order of the columns in the row compare, but I believe (could be wrong) the executor yb_scan.c code does care and does not support pushing down the filter in out-of-order cases (e.g. key (r1, r2) and row compare (r2, r1) op (a, b)).
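
The distinction can be illustrated with toy data in Python (a hedged sketch, not the executor code): with index key (r1, r2), an in-order row comparison selects a contiguous suffix of the index order and can become a seek bound, while an out-of-order comparison over (r2, r1) selects scattered positions and can only be a filter:

```python
# Rows in index key order (r1, r2), as the LSM would store them.
rows = sorted((r1, r2) for r1 in range(3) for r2 in range(3))

# In-order row compare (r1, r2) > (1, 1): matches a contiguous suffix
# of the index order, so it can serve as an index lower bound (a seek).
in_order = [i for i, (r1, r2) in enumerate(rows) if (r1, r2) > (1, 1)]
assert in_order == list(range(in_order[0], len(rows)))  # contiguous

# Out-of-order row compare (r2, r1) > (1, 1): matching positions are
# scattered through the index order, so it can only be applied as a
# post-scan filter, not a bound.
out_of_order = [i for i, (r1, r2) in enumerate(rows) if (r2, r1) > (1, 1)]
assert out_of_order != list(range(out_of_order[0], len(rows)))
```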

 */
 if (!indexcol_is_bool_constant_for_query(root, index, i) ||
-    yb_is_hash_column)
+    (yb_is_hash_column && !yb_hash_bucket_pinned))
Contributor

note: I don't think this whole clause is necessary, but better to take that investigation separately

for (int i = 0; i < index->rd_index->indnkeyatts; i++)
{
if (!(index->rd_indoption[i] & INDOPTION_HASH))
continue;
Contributor

could break instead of continue, since there cannot be any more hash columns after the first range column; alternatively, this condition could go in the for loop condition

* Set the first column in this RC to not null, except for
* partition/hash columns. Pggate does not allow binding
* IS NOT NULL on partition columns, and for primary-key hash
* columns the null check is redundant anyway.
Contributor

a hash primary-key column cannot hold NULL, but a hash secondary-index key column can

WHERE yb_hash_code(h1) = 500
ORDER BY h1, h2, r
LIMIT 10;

Contributor

there is a lack of tests involving

  • secondary indexes
  • nulls
  • multi-column hash + row compare

Author

Added the tests.

@kirs kirs force-pushed the kirs/fix-hash-code-pagination-30524 branch from 1909a29 to cafdc87 on March 16, 2026 02:05

…const

Summary:
When yb_hash_code(hash_cols) = constant pins a scan to a single hash
bucket, DocDB stores rows in (hash_cols, range_cols) order within that
bucket. Three independent code paths prevented exploiting this ordering
for pagination:

1. pathkeys.c (build_index_pathkeys): Hash columns were never allowed
   to produce sort pathkeys, forcing a Sort node even when ORDER BY
   matched the index key order within a pinned bucket. Fix: when
   yb_hash_code equality is detected, pass is_hash_index=false so
   hash columns generate sort pathkeys.

2. indxpath.c (match_rowcompare_to_indexcol): ROW comparisons like
   (h, r) > (?, ?) were unconditionally rejected when the first column
   was a hash column. Fix: allow through when yb_hash_code equality
   pins the scan to a single bucket.

3. yb_scan.c (YbBindRowComparisonKeys): Two guards prevented binding
   ROW comparison keys to DocDB when any hash column was present in
   the index or the comparison subkeys. Fix: relax both guards when
   a yb_hash_code equality scan key is present.

Together these fixes enable efficient keyset pagination:
  SELECT * FROM t
  WHERE yb_hash_code(h) = 10 AND (h, r) > (:last_h, :last_r)
  ORDER BY h, r LIMIT 1000;

Previously this required a full sort of the hash bucket on every page
and scanned from the start of the bucket. Now it seeks directly to the
cursor position.

Fixes: yugabyte#30524

Test Plan: New regression test yb.orig.hash_code_pagination covering:
- Sort elimination with yb_hash_code equality + ORDER BY
- ROW comparison pushdown as Index Cond
- Negative tests (range predicates, arity mismatches)
- Multi-column hash key support
@kirs kirs force-pushed the kirs/fix-hash-code-pagination-30524 branch from cafdc87 to ef908c6 on March 16, 2026 02:13


kirs commented Mar 16, 2026

@jasonyb thanks for feedback!

I checked and ybctid = const takes the Tid Scan path, which doesn't go through any of the modified code paths (pathkeys, index matching, or row comparison binding).

Do you have anything particular in mind there? My understanding is that since ybctid = const is always a single-row lookup, it's not a similar pagination case.



@jasonyb jasonyb left a comment


I'm going to do another test run. Hope you saw the previous test results I posted and fixed those issues.

>>> Lint for src/postgres/src/backend/access/yb_access/yb_scan.c:


   Error  (S&RX) bad_variable_declaration_spacing
    Variable declarations should align variable names to the 12 column mark

            1684         */
            1685        if (can_pushdown)
            1686        {
    >>>     1687                bool    has_hash_code_eq = yb_has_hash_code_equality_scan_key(ybScan);<TAB><TAB>bool<TAB>has_hash_code_eq = yb_has_hash_code_equality_scan_key(ybScan);
            1688 
            1689                for (int i = 0;
            1690                         i < index->rd_index->indnkeyatts &&

   Error  (S&RX) bad_variable_declaration_spacing
    Variable declarations should align variable names to the 12 column mark

            1693                {
            1694                        if (has_hash_code_eq)
            1695                        {
    >>>     1696                                bool    found = false;<TAB><TAB><TAB><TAB>bool<TAB>found = false;
            1697                                for (int j = 0; j < subkey_count; j++)
            1698                                {
            1699                                        if (subkeys[j]->sk_attno - 1 == i)
>>> Lint for src/postgres/src/backend/utils/misc/pg_yb_utils.c:


   Error  (S&RX) bad_variable_declaration_spacing
    Variable declarations should align variable names to the 12 column mark

            8922                col = 0;
            8923                foreach(arg_lc, funcexpr->args)
            8924                {
    >>>     8925                        Node   *arg = (Node *) lfirst(arg_lc);<TAB><TAB><TAB>Node   *arg = (Node *) lfirst(arg_lc);
            8926 
            8927                        if (IsA(arg, RelabelType))
            8928                                arg = (Node *) ((RelabelType *) arg)->arg;
 LINT ERRORS  Lint raised errors!

Comment on lines +31 to +35
-- Negative: range predicate on yb_hash_code should still Sort (multiple buckets).
EXPLAIN (COSTS OFF) SELECT * FROM hk_pagination
WHERE yb_hash_code(h) < 1000
ORDER BY h, r1, r2
LIMIT 10;
Contributor

note: this is not ready in the near future, but yb_hash_code(h) IN (0, 1, ..., 999) + yb_max_saop_merge_streams = 1024 + yb_enable_cbo + pg_hint_plan hints to force SAOP merge + yb_hash_code IN-clause Index Cond support would be able to avoid Sort node

Comment on lines +106 to +123
-- Verify results for yb_hash_code equality + ORDER BY (no Sort).
SELECT * FROM hk_pagination
WHERE yb_hash_code(h) = 1000
ORDER BY h, r1, r2;
h | r1 | r2 | v
---+----+----+---
(0 rows)

-- Verify ROW comparison pagination produces correct results.
-- First page:
SELECT * FROM hk_pagination
WHERE yb_hash_code(h) = 1000
AND (h, r1, r2) > (0, 0, 0)
ORDER BY h, r1, r2
LIMIT 5;
h | r1 | r2 | v
---+----+----+---
(0 rows)
Contributor

these are useless as no rows are returned. pick better numbers or load more data

Comment on lines +234 to +235
-- The hash column is NOT in the ROW comparison, so pushdown must NOT
-- happen (EncodeRowKeyForBound needs all hash columns in the bound).
Contributor

so what happens if you omit a range column, such as r1?


--
-- NULL handling tests.
-- NULLs hash to a different bucket, so yb_hash_code(col) = const already
Contributor

What is "different bucket"? Isn't there at least one bucket they hash to? Wouldn't it be best to test using that one specific const rather than others? That same const should include rows with nulls and rows without nulls.

INSERT INTO hk_pagination VALUES (NULL, 100, 1000, 'null_h');
INSERT INTO hk_pagination VALUES (1, NULL, 1000, 'null_r1');
INSERT INTO hk_pagination VALUES (1, 100, NULL, 'null_r2');
-- Verify NULLs do not appear in a hash-code-pinned scan.
Contributor

misleading comment if you read it as a matter of fact rather than as specific to this table; the primary culprit is the use of "a"

Comment on lines +255 to +257
INSERT INTO hk_pagination VALUES (NULL, 100, 1000, 'null_h');
INSERT INTO hk_pagination VALUES (1, NULL, 1000, 'null_r1');
INSERT INTO hk_pagination VALUES (1, 100, NULL, 'null_r2');
Contributor

shouldn't these fail because PKs cannot contain nulls? did you run the test before pushing?

1 | 16 | 160 | val16
(4 rows)

-- Secondary index with NULLs: secondary hash indexes allow NULL values.
Contributor

secondary hash indexes allow NULL values

then why did you have inserts of nulls to PKs in the above queries?

(4 rows)

-- Secondary index with NULLs: secondary hash indexes allow NULL values.
-- yb_hash_code(col) works when col is NULL (function is non-strict).
Contributor

yb_hash_code(col) works when col is NULL (function is non-strict).

I see no null involved in the following query

-- Secondary hash indexes allow NULL values, and yb_hash_code(NULL::type)
-- returns a valid hash code, so NULL rows live in a real bucket.
-- The IS NOT NULL guard is skipped for hash columns (pggate rejects it),
-- but SQL-layer recheck excludes NULLs because NULL comparisons yield NULL.
Contributor

seems like you have a lot of null tests above that are disorganized. Can all tests be more sensibly organized as follows:

  • pk_table (h1, h2, r1, r2, v), PRIMARY KEY ((h1, h2) HASH, r1 ASC, r2 ASC)
  • ik_table (h1, h2, r1, r2, v), UNIQUE INDEX KEY ((h1, h2) HASH, r1 ASC, r2 ASC)
  • Skip single-column tests and go straight to multi-column
  • Incorporate NULLS at the very beginning of data load: pk_table and ik_table have the same data, but ik_table additionally has extra NULL key columns. Ensure that for yb_hash_code = const values you test, there are both NULL and non-NULL rows matching that constant.
  • For each query you want to test, you want to test both pk_table and ik_table. No need for pg_hint_plan hints as both tables have only the choice between index/primary key scan and sequential scan. You also want to show both explain plan and results. I suggest you use :explain1run1 on both table names like
\getenv abs_srcdir PG_ABS_SRCDIR                                                                                                                                                                                                                                                                                                                                             
\set filename :abs_srcdir '/yb_commands/explainrun.sql'                                                                                                                                                                                                                                                                                                           
\i :filename

\set explain 'EXPLAIN (COSTS OFF)'
\set query 'SELECT * FROM :table_name
    WHERE yb_hash_code(h1, h2) = <some_meaningful_constant>
    ORDER BY h1, h2'

\set table_name 'pk_table'
:explain1run1
\set table_name 'ik_table'
:explain1run1

Since I see this pattern come up a few times, it may be worth augmenting the explainrun framework to generalize the "hints" system to "params": instead of defining \set explain1 ':explain :hint1 :query;', define \set explain1 ':explain :query;' and make it the query's responsibility to have the hint as a param at the front. I can take care of this, so in the meantime you can use the slight duplication of back-and-forth \set table_name above.

Contributor

Commit d258233 lets you parameterize tablename

* IS NOT NULL on partition columns, and for primary-key hash
* columns the null check is redundant anyway.
*/
is_not_null[att_idx] |= (subkey_index == 0 &&
Contributor

commit cd0d858 never explains why IS NOT NULL was not supported for hash. Would have been good to know that.

@jasonyb
Contributor

jasonyb commented Mar 17, 2026

Your test is broken. Are you running your tests?

diff -U3 ${TEST_TMPDIR}/pgregress_output/yb_index_schedule/expected/yb.orig.hash_code_pagination.out ${TEST_TMPDIR}/pgregress_output/yb_index_schedule/results/yb.orig.hash_code_pagination.out
--- ${TEST_TMPDIR}/pgregress_output/yb_index_schedule/expected/yb.orig.hash_code_pagination.out	2026-03-17 03:48:04.703478227 +0000
+++ ${TEST_TMPDIR}/pgregress_output/yb_index_schedule/results/yb.orig.hash_code_pagination.out	2026-03-17 03:48:04.623478227 +0000
@@ -165,8 +165,8 @@
       AND (h1, h2, r) > (1, 2, 5)
     ORDER BY h1, h2, r
     LIMIT 10;
-                                       QUERY PLAN                                        
------------------------------------------------------------------------------------------
+                                       QUERY PLAN                                       
+----------------------------------------------------------------------------------------
  Limit
    ->  Index Scan using hk_multi_pkey on hk_multi
          Index Cond: ((ROW(h1, h2, r) > ROW(1, 2, 5)) AND (yb_hash_code(h1, h2) = 500))
@@ -178,13 +178,13 @@
       AND (h1, h2, r) > (1, 2, 5)
     ORDER BY h1, h2, r
     LIMIT 10;
-                        QUERY PLAN                        
-----------------------------------------------------------
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
  Limit
    ->  Sort
          Sort Key: h1, h2, r
-         ->  Index Scan using hk_multi_pkey on hk_multi
-               Filter: (ROW(h1, h2, r) > ROW(1, 2, 5))
+         ->  Seq Scan on hk_multi
+               Filter: ((yb_hash_code(h1) = 500) AND (ROW(h1, h2, r) > ROW(1, 2, 5)))
 (5 rows)
 
 --
@@ -195,8 +195,8 @@
 EXPLAIN (COSTS OFF) SELECT r1, r2 FROM hk_pagination
     WHERE yb_hash_code(r1) = 2000
     ORDER BY r1, r2;
-                             QUERY PLAN                              
----------------------------------------------------------------------
+                          QUERY PLAN                          
+--------------------------------------------------------------
  Index Only Scan using hk_pagination_sec_idx on hk_pagination
    Index Cond: (yb_hash_code(r1) = 2000)
 (2 rows)
@@ -207,8 +207,8 @@
       AND (r1, r2) > (5, 50)
     ORDER BY r1, r2
     LIMIT 10;
-                                        QUERY PLAN                                         
--------------------------------------------------------------------------------------------
+                                   QUERY PLAN                                   
+--------------------------------------------------------------------------------
  Limit
    ->  Index Only Scan using hk_pagination_sec_idx on hk_pagination
          Index Cond: ((ROW(r1, r2) > ROW(5, 50)) AND (yb_hash_code(r1) = 2000))
@@ -220,13 +220,13 @@
     WHERE yb_hash_code(h) = 1000
     ORDER BY r1, r2
     LIMIT 10;
-                               QUERY PLAN                                
--------------------------------------------------------------------------
+                   QUERY PLAN                   
+------------------------------------------------
  Limit
    ->  Sort
          Sort Key: r1, r2
-         ->  Index Scan using hk_pagination_sec_idx on hk_pagination
-               Storage Filter: (yb_hash_code(h) = 1000)
+         ->  Seq Scan on hk_pagination
+               Filter: (yb_hash_code(h) = 1000)
 (5 rows)
 
 --
@@ -239,13 +239,12 @@
       AND (r1, r2) > (5, 50)
     ORDER BY h, r1, r2
     LIMIT 10;
-                         QUERY PLAN                         
-------------------------------------------------------------
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
  Limit
    ->  Index Scan using hk_pagination_pkey on hk_pagination
-         Index Cond: (yb_hash_code(h) = 1000)
-         Filter: (ROW(r1, r2) > ROW(5, 50))
-(4 rows)
+         Index Cond: ((yb_hash_code(h) = 1000) AND (ROW(r1, r2) > ROW(5, 50)))
+(3 rows)
 
 --
 -- NULL handling tests.
@@ -253,29 +252,33 @@
 -- excludes them.  Verify queries work correctly with NULL data.
 --
 INSERT INTO hk_pagination VALUES (NULL, 100, 1000, 'null_h');
+ERROR:  null value in column "h" of relation "hk_pagination" violates not-null constraint
+DETAIL:  Failing row contains (null, 100, 1000, null_h).
 INSERT INTO hk_pagination VALUES (1, NULL, 1000, 'null_r1');
+ERROR:  null value in column "r1" of relation "hk_pagination" violates not-null constraint
+DETAIL:  Failing row contains (1, null, 1000, null_r1).
 INSERT INTO hk_pagination VALUES (1, 100, NULL, 'null_r2');
+ERROR:  null value in column "r2" of relation "hk_pagination" violates not-null constraint
+DETAIL:  Failing row contains (1, 100, null, null_r2).
 -- Verify NULLs do not appear in a hash-code-pinned scan.
 SELECT * FROM hk_pagination
     WHERE yb_hash_code(h) = yb_hash_code(1)
     ORDER BY h, r1, r2;
- h | r1 | r2  |    v    
----+----+-----+---------
+ h | r1 | r2  |   v   
+---+----+-----+-------
  1 |  1 |  10 | val1
  1 |  6 |  60 | val6
  1 | 11 | 110 | val11
  1 | 16 | 160 | val16
- 1 |100 |     | null_r2
- 1 |    |1000 | null_r1
-(6 rows)
+(4 rows)
 
 -- ROW comparison with NULLs present in the table.
 SELECT * FROM hk_pagination
     WHERE yb_hash_code(h) = yb_hash_code(1)
       AND (h, r1, r2) > (1, 0, 0)
     ORDER BY h, r1, r2;
- h | r1 | r2  |    v    
----+----+-----+---------
+ h | r1 | r2  |   v   
+---+----+-----+-------
  1 |  1 |  10 | val1
  1 |  6 |  60 | val6
  1 | 11 | 110 | val11
@@ -288,24 +291,22 @@
 EXPLAIN (COSTS OFF) SELECT h, v FROM hk_pagination
     WHERE yb_hash_code(h) = yb_hash_code(1)
     ORDER BY h, v;
-                              QUERY PLAN                               
------------------------------------------------------------------------
+                            QUERY PLAN                             
+-------------------------------------------------------------------
  Index Only Scan using hk_pagination_nullable_idx on hk_pagination
-   Index Cond: (yb_hash_code(h) = yb_hash_code(1))
+   Index Cond: (yb_hash_code(h) = 4624)
 (2 rows)
 
 SELECT h, v FROM hk_pagination
     WHERE yb_hash_code(h) = yb_hash_code(1)
     ORDER BY h, v;
- h |    v    
----+---------
- 1 | null_r1
- 1 | null_r2
+ h |   v   
+---+-------
  1 | val1
  1 | val11
  1 | val16
  1 | val6
-(6 rows)
+(4 rows)
 
 -- Secondary hash index with NULLs and ROW comparison pushdown.
 -- Secondary hash indexes allow NULL values, and yb_hash_code(NULL::type)
@@ -327,10 +328,10 @@
     WHERE yb_hash_code(a) = yb_hash_code(1)
       AND (a, b) > (1, 10)
     ORDER BY a, b;
-                                    QUERY PLAN                                     
------------------------------------------------------------------------------------
+                              QUERY PLAN                               
+-----------------------------------------------------------------------
  Index Only Scan using hk_sec_null_idx on hk_sec_null
-   Index Cond: ((ROW(a, b) > ROW(1, 10)) AND (yb_hash_code(a) = yb_hash_code(1)))
+   Index Cond: ((ROW(a, b) > ROW(1, 10)) AND (yb_hash_code(a) = 4624))
 (2 rows)
 
 SELECT a, b FROM hk_sec_null
03:50:33.853 (Time-limited test) [WARN - org.yb.pgsql.PgRegressRunner.run(PgRegressRunner.java:124)] Failed tests: [yb.orig.hash_code_pagination]

@jasonyb
Contributor

jasonyb commented Mar 22, 2026

Couple things:

  • explainrun rewrite is in the works, but it shouldn't be a blocker as you can just duplicate queries for each table in the meantime.
  • there is a lack of tests using inequality conditions on hash columns; only ORDER BY is used. Is inequality not supported on the planner side, or are both ORDER BY and inequality supported?
  • Any tests for backwards scan? ORDER BY hash columns DESC.
  • Some test results are concerning. For example,
 SELECT h, v FROM hk_pagination
     WHERE yb_hash_code(h) = yb_hash_code(1)
     ORDER BY h, v;
- h |    v    
----+---------
- 1 | null_r1
- 1 | null_r2
+ h |   v   
+---+-------
  1 | val1
  1 | val11
  1 | val16
  1 | val6
-(6 rows)
+(4 rows)

@andrei-mart
Contributor

@kirs, thanks for pointing out the issue and submitting the PR.

I see this PR addresses two orthogonal issues.

  1. Condition like yb_hash_code(PK) = constant indeed makes the scan ordered by (hash_cols, range_cols). If the planner recognizes this case and build_index_pathkeys creates the path keys out of the key columns, it should be sufficient to get rid of the Sort node in the query plan.
  2. Support for inequality conditions on the hash columns. That includes single-column conditions, like h1 < constant, or row conditions, like ROW(h1, r1) < (constant1, constant2). If those conditions are pushed down to the storage layer and used to filter data, there would be additional benefits: less traffic, the ability to push down aggregates, etc. The main problem with that feature is the support on the storage side. I'm not sure about the details (what exactly the storage layer supports, whether a condition on the hash code makes any difference, etc.); I'll check and get back to you.

In the meantime, would you mind splitting your PR?
Item 1 alone is likely to get merged soon; I don't see any obstacles to that.
Item 2 definitely needs more testing and may need more work.

@andrei-mart
Contributor

So, it looks like we convert pushable row conditions to document key bounds.
Basically, in the case of a hash key, if we have a condition like ROW(h1, r1) < (constant1, constant2) the row bounds would be (-inf, DocKey[yb_hash_code(constant1), constant1, constant2]), which would work correctly if the scan has also a condition yb_hash_code(h1) = yb_hash_code(constant1), but would not work for arbitrary row constant in the inequality.
The above paragraph is applicable to multi-column hash keys.
If the hash key is a single column and there's no range component, the case is reduced to h1 < constant1, which in the presence of yb_hash_code(h1) = yb_hash_code(constant1) would make a valid document key condition. The problem is, this kind of condition does not go down the document key bounds path.

We probably need to discuss whether we need to implement this optimization, and how best to implement it.
You see, it may be confusing why one row condition is pushable but another is not, when the constants in the ROW and in the yb_hash_code condition do not match.
We could consider supporting yb_hash_code(h1) in the ROW, like this:
ROW( yb_hash_code(h1), h1, r1) < ( yb_hash_code(constant1), constant2, constant3)
That intuitively makes a document bound, as long as the columns are in the index's order and form a prefix (even an incomplete one). It is also pushable and semantically correct for arbitrary constants.

As I mentioned above, that part is independent of the yb_hash_code + ORDER BY support; the two changes may be made separately.
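To see why the plain ROW condition cannot be a storage bound while the hash-code-prefixed one can, here is a toy Python model (the hash function is invented; this is not YugabyteDB code):

```python
# Toy model (not YugabyteDB code): DocDB orders rows by (hash_code(h), h, r).
# A plain ROW(h, r) > (a, b) condition is not a contiguous bound in that
# order, but prefixing the hash code turns it into a single seekable bound.
def hash_code(h):
    return (h * 7) % 4          # stand-in for yb_hash_code, not the real hash

rows = sorted((hash_code(h), h, r) for h in range(4) for r in range(2))

a, b = 2, 0
# Rows satisfying (h, r) > (a, b) are scattered through the storage order:
matching = [row for row in rows if (row[1], row[2]) > (a, b)]
suffix_from_first = rows[rows.index(matching[0]):]
assert matching != suffix_from_first      # not a contiguous suffix

# Prefixed with the hash code, the same cursor is one contiguous lower bound:
bound = (hash_code(a), a, b)
matching2 = [row for row in rows if row > bound]
assert matching2 == rows[len(rows) - len(matching2):]
```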

@kirs
Author

kirs commented Mar 24, 2026

@andrei-mart thanks for your feedback! Here's the main issue that I'm after:

SELECT * FROM foo.orders
WHERE yb_hash_code(user_uuid) = 10
  AND (user_uuid, uuid) > ('01961f1f-...', '01968344-...')
LIMIT 5000;
Condition lands in Filter (post-filter on the PG side, after rows are returned from storage):

Index Scan using orders_pkey on orders (actual time=256.437..269.259 rows=5000)
      Index Cond: (yb_hash_code(user_uuid) = 10)
      Filter: (ROW(user_uuid, uuid) > ROW('01961f1f-...'::uuid, '01968344-...'::uuid))
      Rows Removed by Filter: 90001
      Storage Table Read Requests: 93
      Storage Table Rows Scanned: 95232
Execution Time: 270.568 ms

Even without ORDER BY, this still has to scan 90k rows to get to 5k in the middle of the LSM, making pagination over large tables perform very poorly.

Sort elimination is nice to have, but only if we also have a way to paginate over a table with yb_hash_code efficiently.

Btw it's the same issue if I expand that into WHERE yb_hash_code(user_uuid) = 10 AND (user_uuid > '01961f1f-...' OR (user_uuid = '01961f1f-...' AND uuid > '01968344-...')). I've outlined this in more detail in #30524 (comment).
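As a sanity check of that equivalence, here is a toy Python model (not planner code) verifying that the lexicographic ROW comparison matches its expanded keyset form on all value combinations:

```python
# Toy check (not YugabyteDB code): a lexicographic ROW comparison
# (a, b) > (x, y) is exactly the expanded keyset form
# a > x OR (a = x AND b > y), for all value combinations.
def row_gt(a, b, x, y):
    return (a, b) > (x, y)          # Python tuples compare lexicographically

def expanded(a, b, x, y):
    return a > x or (a == x and b > y)

pairs = [(a, b) for a in range(3) for b in range(3)]
for a, b in pairs:
    for x, y in pairs:
        assert row_gt(a, b, x, y) == expanded(a, b, x, y)
```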

It sounds like ROW( yb_hash_code(h1), h1, r1) < ( yb_hash_code(constant1), constant2, constant3) is exactly the feature we're after for that sort of pagination.

I'm all for splitting the work into multiple PRs. Let me know what you think is best given the context above.

@andrei-mart
Contributor

Ah, you want efficient pagination, I see.
You basically want to grab the values from the last returned row and make the condition for the next page out of them.
Is (user_uuid HASH, uuid ASC) your primary key? If there are duplicates, that algorithm may skip some data.
Did you consider opening a cursor and fetching pages from it?
Any particular reason why user_uuid is HASH? If you had (user_uuid ASC, uuid ASC) you would already have support for such pagination.

I suggested splitting the PR because the ORDER BY part works and can be merged soon, while the row condition does not seem to work as expected. And in general, it is better to make independent changes separately.

So, for your use case, I think the condition (yb_hash_code(user_uuid), user_uuid, uuid) > (10, '01961f1f-...', '01968344-...') would work. The difference is, it would continue into the next hash code. If you don't want that, you can add the upper bound like this: (yb_hash_code(user_uuid), user_uuid, uuid) > (10, '01961f1f-...', '01968344-...') AND yb_hash_code(user_uuid) < 11. I tested real quick; a condition like that is allowed and even makes it into YbBindRowComparisonKeys, so it can be turned into a document bound. The tricky part is to confirm that the scan key expression is the yb_hash_code condition on the hash columns.

It is interesting that the planner has done most of the validations by the time the condition reaches YbBindRowComparisonKeys. If, for example, yb_hash_code does not take the hash column arguments, the index is not even used and the condition becomes a sequential filter. On the other hand, if the index has an expression key, a condition on it is also an expression and may be hard to distinguish from the yb_hash_code one.

Beware, it seems like the YBCPgDmlAddRow(Lower|Upper)Bound functions do not take a value for the hash code. I don't think we use them with hash keys, but judging from the fact that it works with your change, these functions calculate the hash code from the provided hash values. That's not correct.

@kirs
Author

kirs commented Mar 24, 2026

@andrei-mart for PgBouncer and conn multiplexing reasons, we've been aiming to make this pagination as stateless as possible, so using a cursor is less desirable. And yes, (user_uuid HASH, uuid ASC) in this case is the primary key, so duplicates are not a concern.
Hash sharding suits our workloads better than range sharding, as it scales significantly better on our data size and QPS. Note that (user_uuid HASH, uuid ASC) is just an example; in reality it's 1000+ tables, each using hash sharding, often based on something like a tenant id.

Is there a reason you've suggested (yb_hash_code(user_uuid), user_uuid, uuid) > (10, '01961f1f-...', '01968344-...') AND yb_hash_code(user_uuid) < 11 and not something like yb_hash_code(user_uuid) = 10 AND (yb_hash_code(user_uuid), user_uuid, uuid) > (10, '01961f1f-...', '01968344-...')? I could imagine that the latter version could simplify things for the planner knowing that the lookup is constrained to a single tablet.

@andrei-mart
Contributor

andrei-mart commented Mar 24, 2026

Is there a reason you've suggested

These two are equivalent.
The common part (yb_hash_code(user_uuid), user_uuid, uuid) > (10, '01961f1f-...', '01968344-...') defines the scan's lower bound at [hash = 10, user_uuid = '01961f1f-...', uuid = '01968344-...'].
The difference is yb_hash_code(user_uuid) < 11 vs yb_hash_code(user_uuid) = 10. The first defines the scan's upper bound at [hash = 11]. The second defines the same upper bound AND a lower bound at [hash = 10], which is useless because the row condition defines a stricter lower bound. So yb_hash_code(user_uuid) < 11 is less redundant, more straightforward (I hope), and there's no real difference in implementation difficulty.
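A toy Python check of that equivalence (not YugabyteDB code; the row values are arbitrary):

```python
# Toy check (not YugabyteDB code): with an integer hash code, the filters
#   (hc, h, r) > (10, a, b) AND hc < 11
#   (hc, h, r) > (10, a, b) AND hc = 10
# pick the same rows; the first simply expresses the upper bound directly.
rows = [(hc, h, r) for hc in range(9, 13) for h in range(2) for r in range(2)]
cursor = (10, 0, 1)             # (hash code, last h, last r)

lt_form = [row for row in rows if row > cursor and row[0] < 11]
eq_form = [row for row in rows if row > cursor and row[0] == 10]
assert lt_form == eq_form
```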

@kirs
Author

kirs commented Mar 25, 2026

@andrei-mart thank you! I'd like to take a stab at that as a separate PR and then we could close this one.

Just a few questions to save you some review rounds:

  1. EncodeRowKeyForBound in pg_dml_read.cc — Right now this function takes col_values[] mapped to index columns, and always recomputes the hash code from the hash column values via BuildDocKey. To pass through a pre-computed hash code from the ROW expression, I'll need to change this interface somehow. I'm not sure what's cleanest here — a new EncodeRowKeyForBoundWithHashCode variant that takes the hash code as a separate parameter, or threading it through the existing col_values array with some convention, or something else?

  2. Opfamily matching in expand_indexqual_rowcompare — This function checks each ROW element's operator against an index column's opfamily. The yb_hash_code(h) element is int4 and its > operator lives in INTEGER_LSM_FAM_OID, but match_index_to_operand maps it to indexcol=0, whose opfamily is the actual column type (e.g. uuid_ops). So get_op_opfamily_strategy(int4gt, uuid_opfamily) won't find a match and the whole ROW gets rejected. It seems like I'll need to special-case the leading yb_hash_code element to use INTEGER_LSM_FAM_OID instead of the column's opfamily — is that the right direction, or is there a cleaner way to handle this?

  3. indexcol collision — For ROW(yb_hash_code(h), h, r), both yb_hash_code(h) and h match indexcol=0 via match_index_to_operand, so expand_indexqual_rowcompare builds indexcols = [0, 0, 1]. I'm not sure if that duplicate causes problems downstream. Should I give yb_hash_code a synthetic column index to avoid the collision, or skip indexcol=0 when matching subsequent elements after a leading yb_hash_code?

@andrei-mart
Contributor

EncodeRowKeyForBound in pg_dml_read.cc — Right now this function takes col_values[] mapped to index columns, and always recomputes the hash code from the hash column values via BuildDocKey. To pass through a pre-computed hash code from the ROW expression, I'll need to change this interface somehow. I'm not sure what's cleanest here — a new EncodeRowKeyForBoundWithHashCode variant that takes the hash code as a separate parameter, or threading it through the existing col_values array with some convention, or something else?

That's a good question.
The EncodeRowKeyForBound implies a specific index, and I think, if the index is hash-sharded, the hash code is mandatory.
You see, without the hash code the ROW condition is ambiguous. Semantically it is not a bound, it is a filter: from the execution point of view, the condition would have to be applied in each hash bucket. If you read the function name EncodeRowKeyForBound, it is specifically a row key, and specifically for a bound. So in the case of a hash index, the col_values parameter should take the hash code, followed by the hash column values in their order, followed by the range column values in their respective order. It looks like we don't need to require all columns to form a bound, not even all the hash keys. In many places in our code we require all the hash columns, but not here.
So that means YbBindRowComparisonKeys should verify the keys match the index's key prefix (with the hash code, if hash) and call YBCPgDmlAddRow(Lower|Upper)Bound if they do. There's an option to try to make a filter otherwise; I believe currently we just evaluate the condition locally.

Anyway, that means EncodeRowKeyForBound is implemented incorrectly. Well, let's better say the hash case is not supported. Perhaps we should not use BuildDocKey to make a bound: a bound, while similar to a DocKey, has certain differences. Give me a couple of days; I think I can fix that for you, or provide more detailed guidance. For now, let's stick to this: the hash code should come first in col_values. If you need a working EncodeRowKeyForBound for testing, you can patch it locally and ignore the first value in the input.
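The layout described above can be sketched with a toy Python helper (encode_bound is invented for illustration; it is not the real EncodeRowKeyForBound):

```python
# Toy sketch (invented helper, not the real EncodeRowKeyForBound): for a
# hash-sharded key, a bound is (hash_code, *hash_cols, *range_cols), and a
# prefix of that tuple is still a usable bound under lexicographic order.
def encode_bound(hash_code, hash_vals=(), range_vals=()):
    # assumption: values arrive in the index's key order
    return (hash_code, *hash_vals, *range_vals)

# rows stored as (hash_code, h1, h2, r)
rows = [(9, 0, 0, 0), (10, 1, 2, 5), (10, 1, 2, 6), (10, 2, 0, 0)]

prefix = encode_bound(10, hash_vals=(1,))   # incomplete prefix, still valid
at_or_above = [r for r in rows if r[:len(prefix)] >= prefix]
assert at_or_above == rows[1:]              # only the hash-code-9 row is excluded
```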

For 2 and 3 I need to gather context. It's been a while since I touched the planner part. In my quick tests I saw ScanKeys coming into YbBindRowComparisonKeys and assumed all is good in the planner. Let me look into the details and get back to you.

@andrei-mart
Contributor

Update:
I realized that we, specifically in PgDmlRead::EncodeRowKeyForBound, allow "gaps" in the row conditions.
That is, if the key columns are k1, k2, k3, we allowed conditions like (k1, k3) OP (a, c) and made a bound like [a, <Lowest|Highest>, c] (Lowest or Highest, depending on the OP).
I also realized that bounds with gaps are not particularly helpful. The bound [a, <Lowest|Highest>, c] is a functional equivalent of [a, <Lowest|Highest>], which is produced by the condition k1 OP a. Since the k3 OP c part is effectively ignored, the scan may return non-matching rows that have to be rechecked in Postgres.
That means YbBindRowComparisonKeys, when preparing arguments for YBCPgDmlAddRow(Lower|Upper)Bound, should collect constants only while their respective columns form the index's key prefix. If all of the row condition's key/value pairs are taken for the YBCPgDmlAddRow(Lower|Upper)Bound call, the bound produces accurate results; otherwise it needs a recheck. A row condition without the first column looks useless for pushdown purposes.
I don't think we have sufficient support on the storage side to make any row condition accurate. That sounds like a future improvement of the row condition support.
All that is applicable to the hash case, if you think about the hash code as the first column of the index. Hence a row condition without the yb_hash_code(h) can't be used to make a bound. I'm working on PgDmlRead::EncodeRowKeyForBound; here is my WIP patch: https://gist.github.com/andrei-mart/2755d1d02002040b5ba99f599f2f303f. It should support hash already. Use it if you need it for testing. There's a unit test; refer to it as a usage example.
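The gap-bound equivalence can be demonstrated with a toy Python model (not YugabyteDB code; Lowest is modeled as a sentinel that sorts below every value):

```python
# Toy model (not YugabyteDB code): a gap bound like [a, Lowest, c] built
# from the condition (k1, k3) > (a, c) admits exactly the same rows as the
# shorter bound [a, Lowest]; the k3 part is effectively ignored, so
# non-matching rows must be rechecked in Postgres.
LOWEST = float("-inf")          # sentinel sorting below any real value

rows = [(a, b, c) for a in range(2) for b in range(2) for c in range(2)]

gap_bound = (1, LOWEST, 1)
short_bound = (1, LOWEST)

with_gap = [r for r in rows if r >= gap_bound]
without_gap = [r for r in rows if r[:2] >= short_bound]
assert with_gap == without_gap
```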

For the best pagination support I think your query should be like this:

SELECT ... FROM table WHERE (yb_hash_code(h), h, r) > (yb_hash_code(a), a, b) ORDER BY yb_hash_code(h), h, r LIMIT n;

The nice difference from WHERE yb_hash_code(h) = yb_hash_code(a) AND (h, r) > (a, b) is that the former returns n rows per page (except the very last page), while the latter may return fewer, even 0 if the hash bucket yb_hash_code(h) = yb_hash_code(a) does not have more rows. Also, the transition to the next hash bucket is tricky.
The ORDER BY yb_hash_code(h), h, r part is needed mostly to prevent parallel reading from the multiple tablets, so proper support for yb_hash_code(h) in the ORDER BY is required for your use case too.
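The suggested pagination scheme can be sketched as a toy Python model (the hash function and page size are invented; fetch_page stands in for the SQL query above):

```python
# Toy sketch (not YugabyteDB code) of the suggested stateless pagination:
# order rows by (hash_code(h), h, r), carry the full triple as the cursor,
# and every page except the last comes back full, flowing across buckets.
def hash_code(h):
    return (h * 7) % 4          # stand-in for yb_hash_code

rows = sorted((hash_code(h), h, r) for h in range(5) for r in range(3))

def fetch_page(cursor, n):
    # models: WHERE (yb_hash_code(h), h, r) > cursor ORDER BY ... LIMIT n
    return [row for row in rows if row > cursor][:n]

pages, cursor = [], (-1, -1, -1)
while True:
    page = fetch_page(cursor, 4)
    if not page:
        break
    pages.append(page)
    cursor = page[-1]           # last returned row becomes the next cursor

assert [r for p in pages for r in p] == rows
assert all(len(p) == 4 for p in pages[:-1])
```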

Sorry, I did not have time to look into your other two questions. Will do the first thing Monday.

@andrei-mart
Contributor

So, I've looked into 2. and 3.
What you are seeing is a bug, more accurately, a bunch of bugs.
To begin with, the hash_code is not a column of the index. We did a number of hacks to support conditions on yb_hash_code(), since the Postgres logic does not work well in this special case.

The Postgres logic basically starts from the match_clause_to_index function. The purpose is to determine if the given index is helpful with the evaluation of the given clause. Most of the function is a loop over all the index columns, where the match_clause_to_indexcol function is used to determine if the given column in the given index is helpful with the evaluation of the given clause. That's where the indexcol is coming from. The idea is to find the first column in the index which is helpful, because subsequent columns may still be helpful, but less so.

The match_clause_to_indexcol is basically a big switch which calls the respective handler function based on the clause type. In the case of a row compare clause, it is match_rowcompare_to_indexcol. We were not supposed to optimize row compares, so there is a blocker:

if (is_hash_column_in_lsm_index(index, indexcol))
        return NULL;
and the first bug. This statement was supposed to reject row conditions on the hash column, probably on the thinking that row conditions on the range portion of the key may still be helpful. That's incorrect, at least now that we make bounds out of row conditions. Bounds are only possible if the row condition is on a prefix of the key, or, in other words, indexcol == 0.

Nonetheless, the block works while the match_clause_to_index function evaluates the hash keys; when it comes to the range keys, if there are any, match_rowcompare_to_indexcol proceeds. Here comes the second bug: if yb_hash_code is the first expression in the row, match_index_to_operand is always successful due to the special handling for yb_hash_code:


Honestly, I don't know why it is there. It is pretty old code; the people involved are not available to comment, and even if they were, I wouldn't expect them to remember the details. My assumption is that it worked for its purpose and did not break things. Even now that we have found a false positive, the query works correctly; it just prevents optimal execution.

The last bastion is here, where expand_indexqual_rowcompare is invoked if the opfamily matches. It may be pure coincidence that the current range column is an int4 and the opfamily matches.

The expand_indexqual_rowcompare expects the first expression to match the index. In the case where yb_hash_code is the first expression, this expectation is surprisingly true; just the indexcol is random. Fortunately, it is not used in expand_indexqual_rowcompare's logic. That's why I saw seemingly valid scan keys in my test. Interestingly, in Postgres expand_indexqual_rowcompare considers the row good as long as it references the index columns in the same direction, while in Yugabyte it is only good if it references the index columns in the same direction and in the same order. Otherwise the bound would not be correct.

I'm not sure how exactly it should be fixed. I think it is not an option to make the hash code a column in the index so it can be referenced by indexcol; that is too fundamental. We could technically use -1, as indexcol in struct IndexClause is an AttrNumber and technically allows negatives. I'm not sure how OK that is, but Postgres itself uses InvalidAttrNumber (0) for the first column. I think the check for the hash code should happen before the loop over the index columns, and only if the target index is a hash index. There we can also check when it is pointless to iterate over the index columns at all: for example, if we have a row comparison, it is good if the first row expression is yb_hash_code and bad otherwise. The good news is that if the planner does its job properly, the YbBindRowComparisonKeys implementation does not need any validation: if the row comparison scan keys are there, they match the index's prefix.

Hopefully you are not overwhelmed yet. Please let us know if you are still committed to take on this feature.

@kirs
Author

kirs commented Apr 1, 2026

> For the best pagination support I think your query should be like this:
>
>     SELECT ... FROM table WHERE (yb_hash_code(h), h, r) > (yb_hash_code(a), a, b) ORDER BY yb_hash_code(h), h, r LIMIT n;
>
> The nice difference from WHERE yb_hash_code(h) = yb_hash_code(a) AND (h, r) > (a, b) is that the former returns n rows per page, except the very last page, while the latter may return fewer, even 0, if the hash bucket yb_hash_code(h) = yb_hash_code(a) has no more rows. Also, the transition to the next hash bucket is tricky.
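
The difference between the two forms can be simulated with a toy model (Python, with made-up rows and bucket assignments; an illustration, not the engine's behavior):

```python
# Model: rows live in hash buckets; the global pagination order is
# (hash_code, h, r). Bucket sizes are deliberately uneven.
rows = [(hc, h, r) for hc, h in [(10, 'a'), (10, 'b'), (42, 'c')]
        for r in range(3)]
rows.sort()
n = 4

# Form 1: WHERE (yb_hash_code(h), h, r) > cursor ORDER BY ... LIMIT n.
# Every page is full except the last one.
def page_global(cursor):
    return [row for row in rows if row > cursor][:n]

page1 = page_global((-1, '', -1))
page2 = page_global(page1[-1])

# Form 2: WHERE yb_hash_code(h) = const AND (h, r) > cursor.
# A page can come up short (even empty) when its bucket runs dry.
def page_bucket(hc, cursor):
    return [row for row in rows
            if row[0] == hc and row[1:] > cursor][:n]

short = page_bucket(10, ('b', 0))  # bucket 10 has only 2 rows left
```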

My original plan on the application side has been to spawn N routines, each of which gets its own chunk of the yb_hash_code range from 0 to 65536. So if a routine gets assigned hash codes 0 to 999, having an upper boundary is still very useful there.
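
That chunking scheme can be sketched as follows (a hypothetical helper, assuming the yb_hash_code space is [0, 65536) as described above):

```python
def hash_ranges(num_workers, total=65536):
    """Split the yb_hash_code space [0, total) into contiguous
    per-worker ranges [lo, hi], inclusive, covering every code once."""
    base, extra = divmod(total, num_workers)
    ranges, lo = [], 0
    for i in range(num_workers):
        size = base + (1 if i < extra else 0)
        ranges.append((lo, lo + size - 1))
        lo += size
    return ranges

# Each worker would then scan something like (hypothetical query shape):
#   WHERE yb_hash_code(h) BETWEEN lo AND hi
ranges = hash_ranges(64)
```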

> Hopefully you are not overwhelmed yet. Please let us know if you are still committed to take on this feature.

😃 I'm thinking of starting with making the planner recognize ROW(yb_hash_code(h), h, r) > (...) as a valid Index Cond on hash-sharded indexes. That would involve updating match_clause_to_index. I'd then submit that as a separate PR. Let me know if this makes sense.

@andrei-mart
Contributor

Welcome back @kirs!
In the meantime I finished fixing PgDmlRead::EncodeRowKeyForBound; it is under review now and hopefully will be merged soon. Meanwhile, the patch I shared before should work for testing the Postgres-side changes, if needed.

Yeah, you will be able to distribute the work between multiple routines.
You can assign one hash bucket at a time by adding a yb_hash_code(h) = const condition, or a range of hash buckets with yb_hash_code(h) BETWEEN const1 AND const2.
However, I have realized that we are missing a feature required for pagination to work properly: support for ORDER BY yb_hash_code(h). It works correctly today, but the planner does not understand that yb_hash_code(h) is the hash code, which is a valid index key. It is closely related to the problem you pointed out from the beginning: a yb_hash_code(h) = const condition should tell the planner that the index is ordered by the hash key. The point is that the planner should know that the hash code represented by the yb_hash_code(h) expression is a sort key, and that the hash columns follow it. With that knowledge Postgres would understand that ORDER BY yb_hash_code(h) makes the hash order desired, and that WHERE yb_hash_code(h) = const makes it redundant.
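
The key observation can be checked mechanically (a Python sketch; the tuples are made-up stand-ins for (yb_hash_code(h), h, r)):

```python
# Within a single hash bucket the (yb_hash_code(h), h, r) order and the
# (h, r) order coincide, because the first component is constant. So once
# yb_hash_code(h) = const pins the bucket, ORDER BY yb_hash_code(h), h, r
# needs no extra sort on top of the index's own (h, r) order.
bucket = [(7, h, r) for h in ('x', 'y') for r in (2, 1, 0)]
by_full_key = sorted(bucket)                        # (hash, h, r) order
by_range_key = sorted(bucket, key=lambda t: t[1:])  # (h, r) order
```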

So I've started looking at that ORDER BY yb_hash_code(h). I've moved the code that matches a yb_hash_code expression to an index out of match_index_to_operand() into a separate function, since it is supposed to be used in multiple places. Then I took the whole yb_hash_code handling out of the loop over the index columns in match_clause_to_index(), as it is not related to the columns. I'm pausing here to run the regression tests. I uploaded my WIP patch here: https://gist.github.com/andrei-mart/3836fc3ca0f3322be35537fd68a98da2 so you can take a look. I think the yb_match_clause_to_index function the patch adds is a good place to identify the "special" row condition. "Special" here should mean a row condition whose first element is a yb_hash_code matching the index. I can hand it over to you, but if we are both about to work on these items, we will need to synchronize somehow.



Development

Successfully merging this pull request may close these issues.

[YSQL] Performance issues with paginating over an index scan on yb_hash_code(hash_cols)
