Valkey FullTextSearch RFC #24

Open
KarthikSubbarao wants to merge 7 commits into valkey-io:main from KarthikSubbarao:main
Conversation

@KarthikSubbarao (Member) commented Jul 14, 2025

Valkey FullTextSearch RFC

Content taken from: https://github.com/valkey-io/valkey-search/tree/fulltext/rfc

I can help update the RFC based on any questions or comments added to the PR.

Signed-off-by: Karthik Subbarao <karthikrs2021@gmail.com>
@madolson (Member) left a comment
Did a quick read and asked questions about the parts I didn't follow. @stockholmux for your reference.


### Limits

To avoid combinatorial explosion, certain operations have configurable limits applied.
Member
What are the limits?

Member

Default values are TBD.

Member

Include that and at least include how these limits will be derived.

Member Author

Some of the limits are already implemented in valkey-search and are not specific to full text search (query nesting depth, number of nodes in the query syntax tree, etc.).

Some full text specific limits (other than what is below) are: MAXEXPANSIONS, MAXPREFIXEXPANSIONS.

We can also consider adding a limit for MAXSEARCHRESULTS


## Motivation

Text searching is useful in many applications.
Member

We should identify the persona here. This is one of the most important questions to get the use case right. Is it for caching, a primary database, etc.?


+1

Coming from the OpenSearch project, the potential scope of this RFC is enormous. OpenSearch leverages Lucene for most of the search functionality described here. Except for stemming, I don't see any other libraries mentioned, so does that mean you plan to implement everything else? More details about the targeted persona and use case could help narrow the scope.

Member

Yes, the plan is to implement the necessary search structures in Valkey code.

This implementation distinguishes itself from OpenSearch primarily in the time/space trade-off: OpenSearch is balanced toward saving space, while this implementation focuses on low latency. We expect dramatically shorter query and ingestion times than what OpenSearch supports. In general, this targets a more dynamic corpus, i.e., very short latency from an insertion until it is visible in a query.

Member

Use cases will differentiate based on the fundamentals:

  1. Lower $/query, due to time-optimized data structures and reduced parsing/command processing overhead.
  2. Incremental update machinery: data structures support incremental updates without GC.
  3. Current queries: query operations include the most recent mutations, with mutation-to-query visibility in less than a millisecond.

Member Author

We should identify the persona here.

Thanks everyone. I will update this section with more details

Member

One use-case that I have seen repeated is to search session histories. In this use-case, per-user history is loaded when a session is initiated (either a web-session or sometimes a "log on" of an application).

Another use-case is in RAG operations with filters that have additional text phrasing -- phrasing that can't be predicted and turned into a TAG operation.

@msfroh commented Jul 15, 2025

Reading through the RFC, I'm curious about the implementation of the term dictionary.

Will it just be a key/value lookup from full terms to postings lists? (I guess that would fit in well with Valkey.) I guess each term is indexed in its forward and reversed representation in the prefix/suffix trees?

Lucene uses finite-state transducers to encode the term dictionary. Similarly, Rust has the fst crate which is used for full-text search in LanceDB and (with some modifications) in the Tantivy library. Just wondering if there's a plan to build a similar FST solution for Valkey. It makes the dictionary pretty compact, while more efficiently handling wildcard/fuzzy matches. I don't know if there are good off-the-shelf FST libraries in C (but I wouldn't be surprised).

@stockholmux (Member) left a comment

Looks like there are some localization-related issues.

The tokenization process has four steps.

1. The text is tokenized, removing punctuation and redundant whitespace, generating a vector of words.
2. Latin upper-case characters are converted to lower-case.
Member

This feels like a miss. What about other scripts with upper/lower case? Cyrillic? Greek?

Member

Will reword, the intention is to be case-insensitive.

Member

Would it make sense to have an option control this?


Step 1 of the tokenization process is controlled by a specified list of punctuation characters.
Member

is the specified list customizable?

Member

Yes.

The text field-index is a mapping from a term to a list of locations.
Depending on the configuration, the list of locations may be just a list of keys or a list of key/term-offset pairs. The list of locations is also known as a postings or a postings list.

The mapping structure is built from one or two prefix trees. When present, the second prefix tree is built from each inserted term sequenced in reverse order, effectively providing a suffix tree.
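As a sketch, the two-tree arrangement can be modeled with plain dicts (an illustrative toy only; the class and method names here are hypothetical, and the real implementation uses compressed radix trees):

```python
class TermDictionary:
    """Toy term dictionary backed by two prefix trees.

    Terms are inserted forward for prefix (wild*) matching and reversed
    for suffix (*card) matching. Each terminal node stores the postings
    list: a list of (key, term_offset) pairs.
    """

    def __init__(self):
        self.forward = {}   # prefix tree over terms as written
        self.reverse = {}   # prefix tree over reversed terms

    def _insert(self, tree, word, posting):
        node = tree
        for ch in word:
            node = node.setdefault(ch, {})
        node.setdefault("$postings", []).append(posting)

    def add(self, term, key, offset):
        self._insert(self.forward, term, (key, offset))
        self._insert(self.reverse, term[::-1], (key, offset))

    def _collect(self, node):
        # Gather postings from this node and all descendants.
        out = list(node.get("$postings", []))
        for ch, child in node.items():
            if ch != "$postings":
                out.extend(self._collect(child))
        return out

    def prefix_search(self, prefix):
        node = self.forward
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        return self._collect(node)

    def suffix_search(self, suffix):
        # Walk the reverse tree with the suffix read backwards.
        node = self.reverse
        for ch in suffix[::-1]:
            if ch not in node:
                return []
            node = node[ch]
        return self._collect(node)
```

Maintaining the second tree doubles insertion work and space, which is the price paid for suffix/wildcard matches without scanning the whole dictionary.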
Member

how does this work for LTR and RTL languages?

Member

Not sure that LTR vs RTL makes a difference here. Options like $inorder and $slop are trivially sensitive to ordering.

| Suffix Tree | Boolean | Controls the presence/absence of a suffix tree. |
| Word Offsets | Boolean | Indicates if the postings for a word contain word offsets or just a count. |

### Extension of the ```FT.CREATE``` command
Member

So schema changes require a FT.DROP and complete re-index?

Member

Yes. Right now, there are no plans to mutate existing index definitions.


## Abstract

The existing search platform is extended with a new field type, ```TEXT```. The query language is extended to support locating keys with text fields based on term, wildcard, fuzzy and exact phrase matching.
Member

So, using this type of query engine for full text search is next to impossible without FT.EXPLAIN. What are the plans for this command?

Member

Support for FT.EXPLAIN makes sense.

Member Author

using this type of query engine for full text search is next to impossible without FT.EXPLAIN. What are the plans for this command?

Basic text queries can be simple to understand. Yes, they can get complex, but so can hybrid vector queries (without text).

FT.EXPLAIN provides value for all these cases, so it makes sense to support it. We can work towards adding support for it.

```
[PUNCTUATION <string>]
```
The characters of this string are used to split the input string into words. Note that the splitting process allows escaping of input characters using the usual backslash notation. This string cannot be empty. Default value is: _TBD_
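A minimal sketch of this splitting-with-escapes behavior (illustrative Python; the function name and default punctuation set are assumptions, and the actual semantics are defined by the module):

```python
def split_with_escapes(text, punctuation="-. ,"):
    """Split text on unescaped punctuation characters.

    A backslash protects the following character from being treated
    as punctuation, so it stays inside the current word.
    """
    words, current = [], []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":                # escape: keep the next char literally
            current.append(next(chars, ""))
        elif ch in punctuation:       # unescaped punctuation: word boundary
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words
```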
Member

This is language dependent, so like the problems with global language defaults in stopwords (see below), the default punctuation should probably be linked with the stemming language.

Member

Agreed, I believe that ultimately the input tokenizer will have substantial per-language dependent functionality.

The text searching facility provides machinery that decomposes text fields into terms and field-indexes them.
The query facility of the ```FT.SEARCH``` and ```FT.AGGREGATE``` commands is enhanced to select keys based on combinations of term, fuzzy and phrase matching together with the existing selection operators of TAG, Numeric range and Vector searches.

### Tokenization process
Member

How does this work with CJK and similar languages?

Member Author

We are planning to support English first, and the tokenization process can be extended in the future to support other languages.


Controls the stemming algorithm. Supported languages are:

**none, arabic, armenian, basque, catalan, danish, dutch, english, estonian, finnish, french, german, greek, hindi, hungarian, indonesian, irish, italian, lithuanian, nepali, norwegian, porter, portuguese, romanian, russian, scripts, serbian, spanish, swedish, tamil, turkish, yiddish**
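Stemming maps inflected forms to a common root so that, e.g., "searching" and "searches" match the same indexed term. A toy suffix-stripping example conveys the idea (this is NOT the Snowball algorithm the list above refers to; real stemmers handle far more cases):

```python
def toy_stem(word):
    """Crude English-ish stemmer for illustration only."""
    # Strip one common suffix, keeping at least a 3-letter stem.
    for suf in ("ing", "ed", "es", "s"):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            word = word[: -len(suf)]
            break
    # Collapse a trailing doubled letter: "running" -> "runn" -> "run".
    if len(word) >= 4 and word[-1] == word[-2]:
        word = word[:-1]
    return word
```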
Member

miss: hebrew. Consider something pluggable in cases where Snowball doesn't cover.

Member Author

I can include this detail. We are only planning on supporting English in the initial version, but we can make a note that Hebrew needs different handling.

@allenss-amazon (Member)

The FST data structure was examined and rejected due to its high run-time costs. Applications like OpenSearch that are FST-based use a log-structured merge update technique that causes both significant write amplification and longer query operations. This choice of more time-efficient (but less space-efficient) data structures is how this implementation distinguishes itself from OpenSearch (and equivalents).

@allenss-amazon (Member)

@msfroh (Sorry, the comment is located here, github wouldn't let me comment directly on your comment).

Yes, the intent is to implement word dictionaries using forward and reverse direction radix trees. Each word entry in the dictionary references a full postings list.

Signed-off-by: Karthik Subbarao <karthikrs2021@gmail.com>
@rikkarth commented Jul 28, 2025

Hey, I'm currently using Valkey (AWS) as the backend for my "redis-like" cache in an A.I.-related project, where Valkey acts as the memory cache of this agent.

Because we are using langgraph, we thought we could use https://github.com/redis-developer/langgraph-redis, but using this library with a Valkey backend proves to be incompatible.

So we had to implement our own custom BaseCheckpointSaver (see https://github.com/redis-developer/langgraph-redis), since the redis one isn't compatible with Valkey's API.

The first error we hit was in RedisConnectionFactory.validate_sync_redis, which we worked around.

Then, after connecting, other errors were raised: redis.exceptions.ResponseError: unknown command 'ft._list'

recommended to post here by @madolson

Thank you so much, would be amazing if we could make this work for Valkey!

Description

```python
from langgraph.checkpoint.redis import RedisSaver

with RedisSaver.from_conn_string("redis://localhost:6379") as checkpointer:
    return checkpointer
```

Produces

Exception Type: <class 'redisvl.exceptions.RedisModuleVersionError'>
Unexpected error occurred: Required Redis db module search >= 20600 OR searchlight >= 20600 not installed. See Redis Stack docs at https://redis.io/docs/latest/operate/oss_and_stack/install/install-stack/.
Traceback (most recent call last):
... 
  File "./test.py", line XX, in _redis_cache
    with RedisSaver.from_conn_string("redis://localhost:6379") as checkpointer:
         ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./.pyenv/versions/3.13.1/lib/python3.13/contextlib.py", line 141, in __enter__
    return next(self.gen)
  File "./lib/python3.13/site-packages/langgraph/checkpoint/redis/__init__.py", line 497, in from_conn_string
    saver = cls(
        redis_url=redis_url,
    ...<2 lines>...
        ttl=ttl,
    )
  File "./lib/python3.13/site-packages/langgraph/checkpoint/redis/__init__.py", line 55, in __init__
    super().__init__(
    ~~~~~~~~~~~~~~~~^
        redis_url=redis_url,
        ^^^^^^^^^^^^^^^^^^^^
    ...<2 lines>...
        ttl=ttl,
        ^^^^^^^^
    )
    ^
  File "./lib/python3.13/site-packages/langgraph/checkpoint/redis/base.py", line 125, in __init__
    self.configure_client(
    ~~~~~~~~~~~~~~~~~~~~~^
        redis_url=redis_url,
        ^^^^^^^^^^^^^^^^^^^^
        redis_client=redis_client,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
        connection_args=connection_args or {},
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "./lib/python3.13/site-packages/langgraph/checkpoint/redis/__init__.py", line 70, in configure_client
    self._redis = redis_client or RedisConnectionFactory.get_redis_connection(
                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        redis_url, **connection_args
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "./lib/python3.13/site-packages/redisvl/redis/connection.py", line 261, in get_redis_connection
    RedisConnectionFactory.validate_sync_redis(
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        client, required_modules=required_modules
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "./lib/python3.13/site-packages/redisvl/redis/connection.py", line 411, in validate_sync_redis
    validate_modules(installed_modules, required_modules)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./lib/python3.13/site-packages/redisvl/redis/connection.py", line 189, in validate_modules
    raise RedisModuleVersionError(error_message)
redisvl.exceptions.RedisModuleVersionError: Required Redis db module search >= 20600 OR searchlight >= 20600 not installed. See Redis Stack docs at https://redis.io/docs/latest/operate/oss_and_stack/install/install-stack/.

Adding the workaround below bypasses the module check:

```python
from redisvl.redis.connection import RedisConnectionFactory

# Skip the Redis module version validation, which Valkey fails.
RedisConnectionFactory.validate_sync_redis = (
    lambda redis_client, lib_name=None, required_modules=None: None
)
with RedisSaver.from_conn_string(
    "redis://localhost:6379"
) as checkpointer:
    return checkpointer
```

Then

Exception Type: <class 'redisvl.exceptions.RedisSearchError'>
Error while searching: unknown command 'FT.SEARCH', with args beginning with: 'checkpoints' '(@thread_id:{15cb9aad\-5021\-4dbb\-a5cc\-2bb7a350c34e} @checkpoint_ns:{__empty__})' 'RETURN' '6' 'thread_id' 'chec' 
Traceback (most recent call last):
  File "./lib/python3.13/site-packages/redisvl/index/index.py", line 934, in search
    return self._redis_client.ft(self.schema.index.name).search(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        *args, **kwargs
        ^^^^^^^^^^^^^^^
    )  # type: ignore
    ^
  File "./lib/python3.13/site-packages/redis/commands/search/commands.py", line 519, in search
    res = self.execute_command(SEARCH_CMD, *args, **options)
  File "./lib/python3.13/site-packages/redis/client.py", line 622, in execute_command
    return self._execute_command(*args, **options)
           ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "./lib/python3.13/site-packages/redis/client.py", line 633, in _execute_command
    return conn.retry.call_with_retry(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~^
        lambda: self._send_command_parse_response(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<2 lines>...
        lambda _: self._close_connection(conn),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "./lib/python3.13/site-packages/redis/retry.py", line 87, in call_with_retry
    return do()
  File "./lib/python3.13/site-packages/redis/client.py", line 634, in <lambda>
    lambda: self._send_command_parse_response(
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        conn, command_name, *args, **options
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ),
    ^
  File "./lib/python3.13/site-packages/redis/client.py", line 605, in _send_command_parse_response
    return self.parse_response(conn, command_name, **options)
           ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./lib/python3.13/site-packages/redis/client.py", line 649, in parse_response
    response = connection.read_response(disable_decoding=True)
  File "./lib/python3.13/site-packages/redis/connection.py", line 666, in read_response
    raise response
redis.exceptions.ResponseError: unknown command 'FT.SEARCH', with args beginning with: 'checkpoints' '(@thread_id:{15cb9aad\-5021\-4dbb\-a5cc\-2bb7a350c34e} @checkpoint_ns:{__empty__})' 'RETURN' '6' 'thread_id' 'chec' 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  ...
  File "./lib/python3.13/site-packages/langgraph/checkpoint/redis/__init__.py", line 361, in get_tuple
    results = self.checkpoints_index.search(checkpoints_query)
  File "./lib/python3.13/site-packages/redisvl/index/index.py", line 942, in search
    raise RedisSearchError(f"Error while searching: {str(e)}") from e
redisvl.exceptions.RedisSearchError: Error while searching: unknown command 'FT.SEARCH', with args beginning with: 'checkpoints' '(@thread_id:{15cb9aad\-5021\-4dbb\-a5cc\-2bb7a350c34e} @checkpoint_ns:{__empty__})' 'RETURN' '6' 'thread_id' 'chec' 

@allenss-amazon (Member)

This would be more appropriate in the valkey-search repo issues, which it looks like it was referenced in.


Step 1 of the tokenization process is controlled by a specified list of punctuation characters.
Sequences of one or more punctuation characters delimit word boundaries.
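As an illustrative sketch, steps 1 and 2 of the tokenization process might behave like the following (the punctuation set shown is an assumption, not the module's default):

```python
import re

def tokenize(text, punctuation=".,;:!?()'\"-"):
    # Step 1: split on runs of punctuation and whitespace, dropping
    # empty tokens produced by redundant separators.
    splitter = "[" + re.escape(punctuation) + r"\s]+"
    words = [w for w in re.split(splitter, text) if w]
    # Step 2: case-fold so matching is case-insensitive.
    return [w.lower() for w in words]
```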


How will hyphenated words behave ("batman-dark-knight-rises")?
Will "batman", "dark", "knight", "rises" be separate? Or should certain punctuation be considered “in-word” optionally?

Member Author

Yes, during ingestion it will be tokenized into the separate words "batman", "dark", "knight", "rises", IF - is one of the punctuation characters specified at index creation (by default it is).

On the query side, if - is a punctuation character, in an exact phrase it is simply stripped out. If it is not an exact phrase, it will be parsed as: Term(batman), Negate(Term(dark)), Negate(Term(knight)), Negate(Term(rises)). This is because - is a query syntax character indicating negation. If their punctuation set does not include - and they want queries to match what was ingested, users can escape the - with a backslash.

Signed-off-by: Karthik Subbarao <karthikrs2021@gmail.com>