SNOW-3176017: Fix accidental removal of aliases in certain JOIN statements#4096
SNOW-3176017: Fix accidental removal of aliases in certain JOIN statements#4096sfc-gh-joshi wants to merge 9 commits intomainfrom
Conversation
|
some merge gates are still failing |
|
can you help run the ML test job to verify that the change won't cause any error related to identifier compilation error? |
The affected test in the SnowML job appears to be failing for database permission reasons, but I manually verified that the issue is no longer present when running the test locally. |
| # We use the session of the LHS DataFrame to report this telemetry | ||
| lhs._session._conn._telemetry_client.send_alias_in_join_telemetry() |
There was a problem hiding this comment.
Telemetry is now sent unconditionally for all joins, even when there are no common column names requiring aliasing. Previously (line 333), this was conditional on if common_col_names:. This will cause telemetry spam for joins that don't actually need aliasing.
Fix: Restore the conditional check:
if common_col_names:
# We use the session of the LHS DataFrame to report this telemetry
lhs._session._conn._telemetry_client.send_alias_in_join_telemetry()| # We use the session of the LHS DataFrame to report this telemetry | |
| lhs._session._conn._telemetry_client.send_alias_in_join_telemetry() | |
| if common_col_names: | |
| # We use the session of the LHS DataFrame to report this telemetry | |
| lhs._session._conn._telemetry_client.send_alias_in_join_telemetry() | |
Spotted by Graphite
Is this helpful? React 👍 or 👎 to let us know.
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-3176017
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
Reverts #4095, restoring the optimizations made in SNOW-2895675.
The original optimization, which replaced a layer of
SELECT "A" AS "A", "B" AS "B"withSELECT *in certain join operations, caused a test failure in SnowML's CI pipeline. A minimal version of the test is reproduced in this PR intest_query_generator.py::test_disambiguate_skips_quoted_alias, and is as follows:This would generate SQL like the following:
bad SQL
SELECT "COL_0", "COL_1" FROM ( SELECT * FROM ( ( SELECT $1 AS "ID", $2 AS """COL_0""", $3 AS """COL_1""" FROM VALUES (0 :: INT, 1 :: INT, 2 :: INT), (3 :: INT, 4 :: INT, 5 :: INT ) ) AS SNOWPARK_LEFT INNER JOIN ( SELECT "ID", "A", "B" FROM ( SELECT $1 AS "ID", $2 AS "A", $3 AS "B" FROM VALUES (0 :: INT, 1 :: INT, 2 :: INT), (3 :: INT, 4 :: INT, 5 :: INT) ) ) AS SNOWPARK_RIGHT USING (ID) ) )The file read operation was producing aliases of triple-quoted column names (which are used in SQL to produce single-quoted column names), which were silently dropped by the join optimization, resulting in the top-level references to
COL_0andCOL_1becoming invalid.When _disambiguate is called during join operations, the new optimization skipped alias generation if the left and right side did not share any common column names. However, the
_alias_if_neededactually serves the additional function of stripping quotes from identifiers it processes, including from triple-quoted identifiers. Even though there were no common column names in this case, aliasing was still necessary to rename"""COL_0"""to"COL_0".This error path is triggered only by
ReadFileoperations, which more or less directly generate their own SQL SELECT/COPY statements with aliasing inside. The join operation then implicitly normalized triple-quoted names like"""COL_0"""to"COL_0"(whether this behavior is itself correct should be investigated separately).