Skip to content

fix: leaking password in logs#75

Open
usernameisnull wants to merge 2 commits intoBaizeAI:mainfrom
usernameisnull:fix/leaking-password-in-log
Open

fix: leaking password in logs#75
usernameisnull wants to merge 2 commits intoBaizeAI:mainfrom
usernameisnull:fix/leaking-password-in-log

Conversation

@usernameisnull
Copy link
Copy Markdown
Contributor

Sanitize sensitive credentials across all data-source loaders and the shared command utility to prevent passwords, tokens, and keys from appearing in logs and error messages.

when use fakeuser/fake/password in git, it outputs the https://fakeuser:fakepass%2F123@gitlab.daocloud.cn/ndx/mcamel/mcamel-mysql.git

➜ kl dataset-git0401-round-1-6sfdk|grep fake
time="2026-04-01T06:34:18Z" level=debug msg="executing command" func="utils.ExecuteCommandWithAllOutput()" file="command.go:16" alteredFromURI="https://******:******@gitlab.daocloud.cn/ndx/mcamel/mcamel-mysql.git" applicationWorkingDirectory=/workspace branch= cloneToPath=. command="/usr/bin/git clone https://fakeuser:fakepass%2F123@gitlab.daocloud.cn/ndx/mcamel/mcamel-mysql.git . -v" commit=HEAD depth= fromURI="https://gitlab.daocloud.cn/ndx/mcamel/mcamel-mysql.git" path=. root=/baize/dataset/data submodules= type=GIT workingDirectory=/baize/dataset/data
time="2026-04-01T06:34:18Z" level=debug msg="command output: " func="utils.ExecuteCommandWithAllOutput()" file="command.go:43" alteredFromURI="https://******:******@gitlab.daocloud.cn/ndx/mcamel/mcamel-mysql.git" applicationWorkingDirectory=/workspace branch= cloneToPath=. command="/usr/bin/git clone https://fakeuser:fakepass%2F123@gitlab.daocloud.cn/ndx/mcamel/mcamel-mysql.git . -v" commit=HEAD depth= fromURI="https://gitlab.daocloud.cn/ndx/mcamel/mcamel-mysql.git" path=. root=/baize/dataset/data submodules= type=GIT workingDirectory=/baize/dataset/data
time="2026-04-01T06:34:18Z" level=error msg="command failed to execute, error: Cloning into '.'...\nremote: HTTP Basic: Access denied\nfatal: Authentication failed for 'https://gitlab.daocloud.cn/ndx/mcamel/mcamel-mysql.git/'\n" func="utils.ExecuteCommandWithAllOutput()" file="command.go:45" alteredFromURI="https://******:******@gitlab.daocloud.cn/ndx/mcamel/mcamel-mysql.git" applicationWorkingDirectory=/workspace branch= cloneToPath=. command="/usr/bin/git clone https://fakeuser:fakepass%2F123@gitlab.daocloud.cn/ndx/mcamel/mcamel-mysql.git . -v" commit=HEAD depth= fromURI="https://gitlab.daocloud.cn/ndx/mcamel/mcamel-mysql.git" path=. root=/baize/dataset/data submodules= type=GIT workingDirectory=/baize/dataset/data
failed to load data: failed to execute command /usr/bin/git clone https://fakeuser:fakepass%2F123@gitlab.daocloud.cn/ndx/mcamel/mcamel-mysql.git . -v, err: exit status 128

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request upgrades the project to Go 1.26.1 and Python 3.14, while also updating several core dependencies. The primary functional change is the implementation of more robust secret masking across various data sources (Git, HTTP, HuggingFace, ModelScope, and S3) to prevent sensitive credentials from leaking into logs and error messages. The feedback focuses on extending this masking to URIs that might contain embedded credentials and optimizing the secret obscuration logic to handle large keys more efficiently and avoid redundant processing.

if err != nil {
logger.Errorf("rclone copy command error: %s", errBuffer)
return fmt.Errorf("failed to copy data from %s to %s with rclone command %s, err: %s", fromURI, toPath, cmd.String(), err)
return fmt.Errorf("failed to copy data from %s to %s with rclone command %s, err: %s", fromURI, toPath, utils.ObscureString(cmd.String(), []string{basicAuthBase64}), err)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The fromURI is included in the error message without being obscured. If the user provides a URI that contains embedded credentials (e.g., http://user:pass@host), these will be leaked in the error logs. Please ensure fromURI is also passed through utils.ObscureString before being included in the error message.

Suggested change
return fmt.Errorf("failed to copy data from %s to %s with rclone command %s, err: %s", fromURI, toPath, utils.ObscureString(cmd.String(), []string{basicAuthBase64}), err)
return fmt.Errorf("failed to copy data from %s to %s with rclone command %s, err: %s", utils.ObscureString(fromURI, []string{basicAuthBase64}), toPath, utils.ObscureString(cmd.String(), []string{basicAuthBase64}), err)

Comment on lines +76 to 90
func obscureSecrets(secrets ...string) []string {
var result []string
for _, s := range secrets {
if s == "" {
continue
}
result = append(result, s)
result = append(result, url.QueryEscape(s))
// In certain scenarios (such as shell script processing, cmd.String() formatting, etc.),
// URL-encoded strings may be encoded again (e.g., % becomes %25).
// Add double-encoded version to defensively cover this case.
result = append(result, url.QueryEscape(url.QueryEscape(s)))
}
return result
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The obscureSecrets function can be improved to avoid redundant string replacements and handle large secrets more efficiently. Currently, it adds multiple versions of every secret, including large blobs like PEM-encoded SSH private keys, which can lead to performance overhead and potential false positives in logs. Additionally, using a map to ensure the returned list contains unique strings will improve the efficiency of the subsequent strings.ReplaceAll calls.

func obscureSecrets(secrets ...string) []string {
	var result []string
	seen := make(map[string]struct{})
	add := func(s string) {
		if s != "" {
			if _, ok := seen[s]; !ok {
				result = append(result, s)
				seen[s] = struct{}{}
			}
		}
	}
	for _, s := range secrets {
		add(s)
		// Only URL-encode secrets that are short enough to likely appear in a URL.
		// Large blobs like PEM-encoded private keys are not used this way and
		// encoding them is expensive and increases false positives.
		if len(s) < 1024 {
			escaped := url.QueryEscape(s)
			add(escaped)
			add(url.QueryEscape(escaped))
		}
	}
	return result
}

if err != nil {
logger.Errorf("rclone copy command error: %s", errBuffer)
return fmt.Errorf("failed to copy data from %s to %s with rclone command %s, err: %s", fromURI, toPath, cmd.String(), err)
return fmt.Errorf("failed to copy data from %s to %s with rclone command %s, err: %s", fromURI, toPath, utils.ObscureString(cmd.String(), []string{accessKeyID, secretAccessKey}), err)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The fromURI is included in the error message without being obscured. Although S3 URIs typically do not contain credentials in the same way HTTP URIs might, it is best practice to obscure it consistently to prevent accidental leaks if a non-standard or proxied URI is used.

Suggested change
return fmt.Errorf("failed to copy data from %s to %s with rclone command %s, err: %s", fromURI, toPath, utils.ObscureString(cmd.String(), []string{accessKeyID, secretAccessKey}), err)
return fmt.Errorf("failed to copy data from %s to %s with rclone command %s, err: %s", utils.ObscureString(fromURI, []string{accessKeyID, secretAccessKey}), toPath, utils.ObscureString(cmd.String(), []string{accessKeyID, secretAccessKey}), err)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant