Skip to content

fix: add configurable file encoding to resolve #2759 (UnicodeDecodeError on JP Windows)#2791

Open
sfc-gh-moczko wants to merge 2 commits intomainfrom
feature/encoding-fix
Open

fix: add configurable file encoding to resolve #2759 (UnicodeDecodeError on JP Windows)#2791
sfc-gh-moczko wants to merge 2 commits intomainfrom
feature/encoding-fix

Conversation

@sfc-gh-moczko
Copy link
Collaborator

Summary

  • Adds a configuration-driven encoding system (SNOWFLAKE_CLI_ENCODING_FILE_IO env var or [cli.encoding] file_io in config.toml) to allow users on non-UTF-8 locales (e.g., Japanese Windows with CP932) to specify the file I/O encoding
  • Passes configured encoding through SecurePath.read_text() and SecurePath.open() at all file-reading callsites: SQL statement reader, Jinja rendering, Cortex commands, Snowpark requirements, and entity script templates
  • Returns None (platform default) when unconfigured, preserving existing behavior for all current users

Changes

  • New: src/snowflake/cli/api/encoding.py -- get_file_io_encoding(), get_subprocess_encoding(), detect_encoding_environment()
  • Modified: statement_reader.py -- from_file() and files_reader() now pass encoding (root cause of SNOW-3066557: UnicodeDecodeError when executing SQL files with UTF-8 encoding on Japanese Windows #2759)
  • Modified: rendering/jinja.py -- read_file_content() and procedure_from_js_file()
  • Modified: entities/utils.py -- render_script_template()
  • Modified: cortex/commands.py and cortex/manager.py -- 5 read_text() callsites
  • Modified: snowpark/commands.py -- _read_snowflake_requirements_file()
  • Modified: api/config.py -- added ENCODING_SECTION_PATH constant

Test plan

  • 12 unit tests for encoding.py (env var, config, precedence, detection, mismatch warning)
  • 2 new tests in test_statement_reader.py verifying UTF-8 Japanese SQL files work through both files_reader() and ParsedStatement.from_file()
  • 2 new tests in test_secure_path.py verifying read_text(encoding=) and open(encoding=) pass-through
  • Bug reproduction confirmed: CP932 locale + UTF-8 Japanese SQL -> UnicodeDecodeError without fix, succeeds with SNOWFLAKE_CLI_ENCODING_FILE_IO=utf-8
  • All pre-commit hooks pass (ruff, black, mypy, check-ast, codespell, license headers)

Closes #2759

.... Generated with Cortex Code

@sfc-gh-moczko sfc-gh-moczko requested a review from a team as a code owner February 27, 2026 16:45
…ror on JP Windows)

On non-UTF-8 Windows locales (e.g., Japanese CP932), `snow sql -f`
fails with UnicodeDecodeError because file I/O operations use the
platform default encoding instead of UTF-8. This adds a configuration-
driven encoding system that lets users set their preferred encoding
via SNOWFLAKE_CLI_ENCODING_FILE_IO env var or config.toml without
changing behavior for existing users on UTF-8 systems.

Closes #2759

.... Generated with [Cortex Code](https://docs.snowflake.com/en/user-guide/cortex-code/cortex-code)

Co-Authored-By: Cortex Code <noreply@snowflake.com>
tests/sql/test_statement_reader.py writes UTF-8 files then reads them
without configuring an encoding. On Windows the platform default is
charmap (cp1252), causing a UnicodeDecodeError. Use monkeypatch to set
SNOWFLAKE_CLI_ENCODING_FILE_IO=utf-8 so the tests verify the intended
configured behaviour on all platforms.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link

@sfc-gh-olorek sfc-gh-olorek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contibution! Left some comments

Comment on lines +172 to +175
prompt = SecurePath(file).read_text(
file_size_limit_mb=DEFAULT_SIZE_LIMIT_MB,
encoding=get_file_io_encoding(),
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this! Just to be complete I noticed that nativeapp/artifacts.py and spcs/services/manager.py still uses the old version. Both would still fail on JP Windows. Could you also change it there?

Comment on lines +26 to +43
def get_file_io_encoding() -> Optional[str]:
"""Get configured file I/O encoding, or None for platform default.

Resolution order:
1. SNOWFLAKE_CLI_ENCODING_FILE_IO environment variable
2. config.toml [cli.encoding] file_io value
3. None (use platform default)
"""
env_encoding = os.environ.get("SNOWFLAKE_CLI_ENCODING_FILE_IO")
if env_encoding:
return env_encoding

try:
from snowflake.cli.api.config import get_config_value

return get_config_value("cli", "encoding", key="file_io")
except Exception:
return None

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of invalid encoding for exmpale: SNOWFLAKE_CLI_ENCODING_FILE_IO=not-a-codec the error will be generated late during file read. Could you add as a last step a verfication that the encoding is valid?:

try:
  codecs.lookup(encoding)
except LookupError:
   raise CliError(
                f"Unknown encoding '{encoding}'. Check SNOWFLAKE_CLI_ENCODING_FILE_IO "
                f"or [cli.encoding] file_io in config.toml."
            )

Comment on lines +46 to +63
def get_subprocess_encoding() -> Optional[str]:
"""Get configured subprocess encoding, or None for platform default.

Resolution order:
1. SNOWFLAKE_CLI_ENCODING_SUBPROCESS environment variable
2. config.toml [cli.encoding] subprocess value
3. None (use platform default)
"""
env_encoding = os.environ.get("SNOWFLAKE_CLI_ENCODING_SUBPROCESS")
if env_encoding:
return env_encoding

try:
from snowflake.cli.api.config import get_config_value

return get_config_value("cli", "encoding", key="subprocess")
except Exception:
return None

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is never used in the non-test codebase. Can we move it to the test package if it is a helper method or delete it if it not needed at all?

PLUGINS_SECTION_PATH = [CLI_SECTION, PLUGINS_SECTION]
PLUGIN_ENABLED_KEY = "enabled"
FEATURE_FLAGS_SECTION_PATH = [CLI_SECTION, "features"]
ENCODING_SECTION_PATH = [CLI_SECTION, "encoding"]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable is never used. Could we delete it or acutally use it? : see comment about: get_config_value

try:
from snowflake.cli.api.config import get_config_value

return get_config_value("cli", "encoding", key="file_io")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use ENCODING_SECTION_PATH for the hardcoded constants.

Comment on lines +42 to +43
except Exception:
return None

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

except Exception with return None swallows all exceptions. What do you think about making the catch more granural (for example: KeyError, MissingConfiguration) and log the problem?

"locale": locale.getpreferredencoding(),
}

encodings = {v.lower().replace("-", "") for v in env_info.values()}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: codec name can also contain underscore for exmple: utf_8. Could you use Python's codecs.lookup(name).name instead of replacing the -?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SNOW-3066557: UnicodeDecodeError when executing SQL files with UTF-8 encoding on Japanese Windows

2 participants