
Cherry-pick CSV parser for GIN (RTA-8526) #81

Open
bkm71A wants to merge 3 commits into pinterest:pinterest-integration-4.1.0-rc from bkm71A:rta-8526

Conversation

@bkm71A bkm71A commented Mar 30, 2026

Apply "[Feature] Add CSV parser for GIN"/"pinterest-integration-3.5" changes to "pinterest-integration-4.1.0-rc" #79

Why I'm doing:

We want to provide a list of pre-tokenized words to the inverted index, so that clients can use an external tokenizer to generate the data ingested by StarRocks. In some scenarios we cannot add custom tokenizers to StarRocks because of license issues, so tokenization has to happen in a stage before StarRocks.

Unfortunately, the inverted index does not support an array of strings, so a comma-separated values format is needed to forward the output of the external tokenizer. Existing tokenizers don't allow us to escape separators when necessary.

What I'm doing:

I am adding a CSV tokenizer that can be used as an intermediary format between an external tokenizer and StarRocks. Essentially:

  • The client generates tokens for a given string.
  • The client sends these tokens as a single string of comma-separated values.
  • StarRocks ingests those comma-separated values (tokens).

As in any CSV, a comma can be escaped when necessary, giving the client full control over the resulting tokens.
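
A minimal client-side sketch of that flow, for illustration only. It assumes RFC 4180-style quoting as implemented by Python's standard `csv` module; the escaping convention actually accepted by the tokenizer in this PR is not stated here and may differ. The helper names `encode_tokens` and `decode_tokens` are hypothetical.

```python
import csv
import io


def encode_tokens(tokens):
    """Join pre-tokenized words into one CSV line.

    Assumption: tokens that contain a comma or a double quote are wrapped
    in double quotes (RFC 4180 style), which is what csv.writer does.
    """
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n").writerow(tokens)
    # Drop the single trailing line terminator added by the writer.
    return buf.getvalue()[:-1]


def decode_tokens(line):
    """Split a CSV line back into tokens (what the ingesting side would see)."""
    return next(csv.reader([line]), [])


# Tokens produced by some external tokenizer; "a,b" contains the separator.
tokens = ["new york", "a,b", "c"]
line = encode_tokens(tokens)
print(line)                 # new york,"a,b",c
print(decode_tokens(line))  # ['new york', 'a,b', 'c']
```

A plain split on `,` would break the second token into `a` and `b`, which is the case the escaping is meant to handle.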

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
    • This pr needs auto generate documentation
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 4.1
    • 4.0
    • 3.5
    • 3.4

Apply "[Feature] Add CSV parser for GIN"/"pinterest-integration-3.5" changes to "pinterest-integration-4.1.0-rc"
pinterest#79

bkm71A commented Mar 30, 2026

Hi @ramtinb @henriqueng, could you please review?

@ramtinb ramtinb left a comment

Thanks!

@bkm71A bkm71A marked this pull request as draft March 30, 2026 20:50
Apply "[Feature] Add CSV parser for GIN"/"pinterest-integration-3.5" changes to "pinterest-integration-4.1.0-rc"
pinterest#79
@bkm71A bkm71A marked this pull request as ready for review March 31, 2026 09:53
Apply "[Feature] Add CSV parser for GIN"/"pinterest-integration-3.5" changes to "pinterest-integration-4.1.0-rc"
pinterest#79
2 participants