fix: add configurable file encoding to resolve #2759 (UnicodeDecodeError on JP Windows)#2791
fix: add configurable file encoding to resolve #2759 (UnicodeDecodeError on JP Windows)#2791sfc-gh-moczko wants to merge 2 commits intomainfrom
Conversation
…ror on JP Windows) On non-UTF-8 Windows locales (e.g., Japanese CP932), `snow sql -f` fails with UnicodeDecodeError because file I/O operations use the platform default encoding instead of UTF-8. This adds a configuration- driven encoding system that lets users set their preferred encoding via SNOWFLAKE_CLI_ENCODING_FILE_IO env var or config.toml without changing behavior for existing users on UTF-8 systems. Closes #2759 .... Generated with [Cortex Code](https://docs.snowflake.com/en/user-guide/cortex-code/cortex-code) Co-Authored-By: Cortex Code <noreply@snowflake.com>
0e10769 to
97c6e56
Compare
tests/sql/test_statement_reader.py writes UTF-8 files then reads them without configuring an encoding. On Windows the platform default is charmap (cp1252), causing a UnicodeDecodeError. Use monkeypatch to set SNOWFLAKE_CLI_ENCODING_FILE_IO=utf-8 so the tests verify the intended configured behaviour on all platforms. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sfc-gh-olorek
left a comment
There was a problem hiding this comment.
Thank you for the contibution! Left some comments
| prompt = SecurePath(file).read_text( | ||
| file_size_limit_mb=DEFAULT_SIZE_LIMIT_MB, | ||
| encoding=get_file_io_encoding(), | ||
| ) |
There was a problem hiding this comment.
Thanks for adding this! Just to be complete I noticed that nativeapp/artifacts.py and spcs/services/manager.py still uses the old version. Both would still fail on JP Windows. Could you also change it there?
| def get_file_io_encoding() -> Optional[str]: | ||
| """Get configured file I/O encoding, or None for platform default. | ||
|
|
||
| Resolution order: | ||
| 1. SNOWFLAKE_CLI_ENCODING_FILE_IO environment variable | ||
| 2. config.toml [cli.encoding] file_io value | ||
| 3. None (use platform default) | ||
| """ | ||
| env_encoding = os.environ.get("SNOWFLAKE_CLI_ENCODING_FILE_IO") | ||
| if env_encoding: | ||
| return env_encoding | ||
|
|
||
| try: | ||
| from snowflake.cli.api.config import get_config_value | ||
|
|
||
| return get_config_value("cli", "encoding", key="file_io") | ||
| except Exception: | ||
| return None |
There was a problem hiding this comment.
In case of invalid encoding for exmpale: SNOWFLAKE_CLI_ENCODING_FILE_IO=not-a-codec the error will be generated late during file read. Could you add as a last step a verfication that the encoding is valid?:
try:
codecs.lookup(encoding)
except LookupError:
raise CliError(
f"Unknown encoding '{encoding}'. Check SNOWFLAKE_CLI_ENCODING_FILE_IO "
f"or [cli.encoding] file_io in config.toml."
)| def get_subprocess_encoding() -> Optional[str]: | ||
| """Get configured subprocess encoding, or None for platform default. | ||
|
|
||
| Resolution order: | ||
| 1. SNOWFLAKE_CLI_ENCODING_SUBPROCESS environment variable | ||
| 2. config.toml [cli.encoding] subprocess value | ||
| 3. None (use platform default) | ||
| """ | ||
| env_encoding = os.environ.get("SNOWFLAKE_CLI_ENCODING_SUBPROCESS") | ||
| if env_encoding: | ||
| return env_encoding | ||
|
|
||
| try: | ||
| from snowflake.cli.api.config import get_config_value | ||
|
|
||
| return get_config_value("cli", "encoding", key="subprocess") | ||
| except Exception: | ||
| return None |
There was a problem hiding this comment.
This method is never used in the non-test codebase. Can we move it to the test package if it is a helper method or delete it if it not needed at all?
| PLUGINS_SECTION_PATH = [CLI_SECTION, PLUGINS_SECTION] | ||
| PLUGIN_ENABLED_KEY = "enabled" | ||
| FEATURE_FLAGS_SECTION_PATH = [CLI_SECTION, "features"] | ||
| ENCODING_SECTION_PATH = [CLI_SECTION, "encoding"] |
There was a problem hiding this comment.
This variable is never used. Could we delete it or acutally use it? : see comment about: get_config_value
| try: | ||
| from snowflake.cli.api.config import get_config_value | ||
|
|
||
| return get_config_value("cli", "encoding", key="file_io") |
There was a problem hiding this comment.
We can use ENCODING_SECTION_PATH for the hardcoded constants.
| except Exception: | ||
| return None |
There was a problem hiding this comment.
except Exception with return None swallows all exceptions. What do you think about making the catch more granural (for example: KeyError, MissingConfiguration) and log the problem?
| "locale": locale.getpreferredencoding(), | ||
| } | ||
|
|
||
| encodings = {v.lower().replace("-", "") for v in env_info.values()} |
There was a problem hiding this comment.
Nit: codec name can also contain underscore for exmple: utf_8. Could you use Python's codecs.lookup(name).name instead of replacing the -?
Summary
Changes
Test plan
Closes #2759
.... Generated with Cortex Code