Valkey FullTextSearch RFC #24

Open
KarthikSubbarao wants to merge 7 commits into valkey-io:main from KarthikSubbarao:main
Conversation

@KarthikSubbarao (Member) commented Jul 14, 2025

Valkey FullTextSearch RFC

Content taken from: https://github.com/valkey-io/valkey-search/tree/fulltext/rfc

I can help update the RFC based on any questions or comments added to the PR.

Signed-off-by: Karthik Subbarao <karthikrs2021@gmail.com>
@madolson (Member) left a comment
Did a quick read and asked questions about the parts I didn't follow. @stockholmux for your reference.


### Limits

To avoid combinatorial explosion, certain operations have configurable limits applied.
Member
What are the limits?

Member

Default values are TBD.

Member

Include that and at least include how these limits will be derived.

Member Author

Some of the limits are already implemented in valkey-search and are not specific to full text search (query nesting depth, number of nodes in the query syntax tree, etc.).

Some full text specific limits (other than what is below) are: MAXEXPANSIONS, MAXPREFIXEXPANSIONS.

We can also consider adding a limit for MAXSEARCHRESULTS


## Motivation

Text searching is useful in many applications.
Member

We should identify the persona here. This is one of the most important questions to get the use case right. Is it for caching, a primary database, etc.?


+1

Coming from the OpenSearch project, the potential scope of this RFC is enormous. OpenSearch leverages Lucene for most of the search functionality described here. Except for stemming, I don't see any other libraries mentioned, so does that mean you plan to implement everything else? More details about the targeted persona and use case could help narrow the scope.

Member

Yes, the plan is to implement the necessary search structures in Valkey code.

This implementation distinguishes itself from OpenSearch primarily in the time/space trade-off: OpenSearch is balanced toward saving space, while this implementation focuses on low latency. We expect dramatically shorter query and ingestion times than what OpenSearch supports. In general, this targets a more dynamic corpus, i.e., very short latency from an insertion until it is visible in a query.

Member

Use cases will differentiate based on the fundamentals:

  1. Lower $/query, due to time-optimized data structures and reduced parsing/command processing overhead.
  2. Incremental update machinery: data structures support incremental updates without GC.
  3. Current queries: query operations include the most recent mutations, with mutation-to-query visibility in less than a millisecond.

Member Author

We should identify the persona here.

Thanks everyone. I will update this section with more details

Member

One use-case that I have seen repeated is to search session histories. In this use-case, per-user history is loaded when a session is initiated (either a web-session or sometimes a "log on" of an application).

Another use-case is in RAG operations with filters that have additional text phrasing -- phrasing that can't be predicted and turned into a TAG operation.

@msfroh commented Jul 15, 2025

Reading through the RFC, I'm curious about the implementation of the term dictionary.

Will it just be a key/value lookup from full terms to postings lists? (I guess that would fit in well with Valkey.) I guess each term is indexed in its forward and reversed representation in the prefix/suffix trees?

Lucene uses finite-state transducers to encode the term dictionary. Similarly, Rust has the fst crate which is used for full-text search in LanceDB and (with some modifications) in the Tantivy library. Just wondering if there's a plan to build a similar FST solution for Valkey. It makes the dictionary pretty compact, while more efficiently handling wildcard/fuzzy matches. I don't know if there are good off-the-shelf FST libraries in C (but I wouldn't be surprised).

@stockholmux (Member) left a comment

Looks like there are some localization-related issues.

The tokenization process has four steps.

1. The text is tokenized, removing punctuation and redundant whitespace, generating a vector of words.
2. Latin upper-case characters are converted to lower-case.
Member

This feels like a miss. What about other scripts with upper/lower case? Cyrillic? Greek?

Member

Will reword, the intention is to be case-insensitive.

Member

Would it make sense to have an option control this?


Step 1 of the tokenization process is controlled by a specified list of punctuation characters.
Member

is the specified list customizable?

Member

Yes.

The text field-index is a mapping from a term to a list of locations.
Depending on the configuration, the list of locations may be just a list of keys or a list of key/term-offset pairs. The list of locations is also known as a postings or a postings list.

The mapping structure is built from one or two prefix trees. When present, the second prefix tree is built from each inserted term sequenced in reverse order, effectively providing a suffix tree.
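As a sketch, the two-tree arrangement can be modeled with plain dicts (an illustrative toy only; the class and method names here are hypothetical, and the real implementation uses compressed radix trees):

```python
class TermDictionary:
    """Toy term dictionary backed by two prefix trees.

    Terms are inserted forward for prefix (wild*) matching and reversed
    for suffix (*card) matching. Each terminal node stores the postings
    list: a list of (key, term_offset) pairs.
    """

    def __init__(self):
        self.forward = {}   # prefix tree over terms as written
        self.reverse = {}   # prefix tree over reversed terms

    def _insert(self, tree, word, posting):
        node = tree
        for ch in word:
            node = node.setdefault(ch, {})
        node.setdefault("$postings", []).append(posting)

    def add(self, term, key, offset):
        self._insert(self.forward, term, (key, offset))
        self._insert(self.reverse, term[::-1], (key, offset))

    def _collect(self, node):
        # Gather postings from this node and all descendants.
        out = list(node.get("$postings", []))
        for ch, child in node.items():
            if ch != "$postings":
                out.extend(self._collect(child))
        return out

    def prefix_search(self, prefix):
        node = self.forward
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        return self._collect(node)

    def suffix_search(self, suffix):
        # Walk the reverse tree with the suffix read backwards.
        node = self.reverse
        for ch in suffix[::-1]:
            if ch not in node:
                return []
            node = node[ch]
        return self._collect(node)
```

Maintaining the second tree doubles insertion work and space, which is the price paid for suffix/wildcard matches without scanning the whole dictionary.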
Member

how does this work for LTR and RTL languages?

Member

Not sure that LTR vs RTL makes a difference here. Options like $inorder and $slop are trivially sensitive to ordering.

| Suffix Tree | Boolean | Controls the presence/absence of a suffix tree. |
| Word Offsets | Boolean | Indicates if the postings for a word contain word offsets or just a count. |

### Extension of the ```FT.CREATE``` command
Member

So schema changes require a FT.DROP and complete re-index?

Member

Yes. Right now, there are no plans to mutate existing index definitions.


## Abstract

The existing search platform is extended with a new field type, ```TEXT```. The query language is extended to support locating keys with text fields based on term, wildcard, fuzzy and exact phrase matching.
Member

So, using this type of query engine for full text search is next to impossible without FT.EXPLAIN. What are the plans for this command?

Member

Support for FT.EXPLAIN makes sense.

Member Author

using this type of query engine for full text search is next to impossible without FT.EXPLAIN. What are the plans for this command?

Basic text queries can be simple to understand. Yes, they can get complex, but so can hybrid vector queries (without text).

FT.EXPLAIN provides value for all these cases, so it makes sense to support it. We can work towards adding support for it.

```
[PUNCTUATION <string>]
```
The characters of this string are used to split the input string into words. Note that the splitting process allows escaping of input characters using the usual backslash notation. This string cannot be empty. Default value is: _TBD_
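A minimal sketch of this splitting-with-escapes behavior (illustrative Python; the function name and default punctuation set are assumptions, and the actual semantics are defined by the module):

```python
def split_with_escapes(text, punctuation="-. ,"):
    """Split text on unescaped punctuation characters.

    A backslash protects the following character from being treated
    as punctuation, so it stays inside the current word.
    """
    words, current = [], []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":                # escape: keep the next char literally
            current.append(next(chars, ""))
        elif ch in punctuation:       # unescaped punctuation: word boundary
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words
```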
Member

This is language dependent, so like the problems with global language defaults in stopwords (see below), the default punctuation should probably be linked with the stemming language.

Member

Agreed, I believe that ultimately the input tokenizer will have substantial per-language dependent functionality.

The text searching facility provides machinery that decomposes text fields into terms and field-indexes them.
The query facility of the ```FT.SEARCH``` and ```FT.AGGREGATE``` commands is enhanced to select keys based on combinations of term, fuzzy and phrase matching together with the existing selection operators of TAG, Numeric range and Vector searches.

### Tokenization process
Member

How does this work with CJK and similar languages?

Member Author

We are planning to support English first, and the tokenization process can be extended in the future to support other languages.


Controls the stemming algorithm. Supported languages are:

**none, arabic, armenian, basque, catalan, danish, dutch, english, estonian, finnish, french, german, greek, hindi, hungarian, indonesian, irish, italian, lithuanian, nepali, norwegian, porter, portuguese, romanian, russian, scripts, serbian, spanish, swedish, tamil, turkish, yiddish**
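Stemming maps inflected forms to a common root so that, e.g., "searching" and "searches" match the same indexed term. A toy suffix-stripping example conveys the idea (this is NOT the Snowball algorithm the list above refers to; real stemmers handle far more cases):

```python
def toy_stem(word):
    """Crude English-ish stemmer for illustration only."""
    # Strip one common suffix, keeping at least a 3-letter stem.
    for suf in ("ing", "ed", "es", "s"):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            word = word[: -len(suf)]
            break
    # Collapse a trailing doubled letter: "running" -> "runn" -> "run".
    if len(word) >= 4 and word[-1] == word[-2]:
        word = word[:-1]
    return word
```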
Member

miss: hebrew. Consider something pluggable in cases where Snowball doesn't cover.

Member Author

I can include this detail. We are only planning on supporting English in the initial version, but we can make a note that Hebrew needs different handling.

@allenss-amazon (Member)

The FST data structure was examined and rejected due to its high run-time costs. Applications like OpenSearch that are FST-based use a log-structured merge update technique that causes both significant write amplification and longer query operations. This choice of more time-efficient (but less space-efficient) data structures is how this implementation distinguishes itself from OpenSearch (and equivalents).

@allenss-amazon (Member)

@msfroh (Sorry, the comment is located here, github wouldn't let me comment directly on your comment).

Yes, the intent is to implement word dictionaries using forward and reverse direction radix trees. Each word entry in the dictionary references a full postings list.

Signed-off-by: Karthik Subbarao <karthikrs2021@gmail.com>
@rikkarth commented Jul 28, 2025

Hey, I'm currently using Valkey (AWS) as the backend for my "redis-like" cache in an A.I.-related project, where Valkey acts as the memory cache of this agent.

Because we are using langgraph, we thought we could use https://github.com/redis-developer/langgraph-redis, but using this library with a Valkey backend proves to be incompatible.

So we had to implement our own custom BaseCheckpointSaver (see https://github.com/redis-developer/langgraph-redis), since the redis one isn't compatible with Valkey's API.

The first error we hit was in RedisConnectionFactory.validate_sync_redis, which we worked around.

Then, after connecting, other errors were raised: redis.exceptions.ResponseError: unknown command 'ft._list'

recommended to post here by @madolson

Thank you so much, would be amazing if we could make this work for Valkey!

Description

```python
from langgraph.checkpoint.redis import RedisSaver

with RedisSaver.from_conn_string("redis://localhost:6379") as checkpointer:
    return checkpointer
```

Produces

Exception Type: <class 'redisvl.exceptions.RedisModuleVersionError'>
Unexpected error occurred: Required Redis db module search >= 20600 OR searchlight >= 20600 not installed. See Redis Stack docs at https://redis.io/docs/latest/operate/oss_and_stack/install/install-stack/.
Traceback (most recent call last):
... 
  File "./test.py", line XX, in _redis_cache
    with RedisSaver.from_conn_string("redis://localhost:6379") as checkpointer:
         ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./.pyenv/versions/3.13.1/lib/python3.13/contextlib.py", line 141, in __enter__
    return next(self.gen)
  File "./lib/python3.13/site-packages/langgraph/checkpoint/redis/__init__.py", line 497, in from_conn_string
    saver = cls(
        redis_url=redis_url,
    ...<2 lines>...
        ttl=ttl,
    )
  File "./lib/python3.13/site-packages/langgraph/checkpoint/redis/__init__.py", line 55, in __init__
    super().__init__(
    ~~~~~~~~~~~~~~~~^
        redis_url=redis_url,
        ^^^^^^^^^^^^^^^^^^^^
    ...<2 lines>...
        ttl=ttl,
        ^^^^^^^^
    )
    ^
  File "./lib/python3.13/site-packages/langgraph/checkpoint/redis/base.py", line 125, in __init__
    self.configure_client(
    ~~~~~~~~~~~~~~~~~~~~~^
        redis_url=redis_url,
        ^^^^^^^^^^^^^^^^^^^^
        redis_client=redis_client,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
        connection_args=connection_args or {},
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "./lib/python3.13/site-packages/langgraph/checkpoint/redis/__init__.py", line 70, in configure_client
    self._redis = redis_client or RedisConnectionFactory.get_redis_connection(
                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        redis_url, **connection_args
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "./lib/python3.13/site-packages/redisvl/redis/connection.py", line 261, in get_redis_connection
    RedisConnectionFactory.validate_sync_redis(
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        client, required_modules=required_modules
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "./lib/python3.13/site-packages/redisvl/redis/connection.py", line 411, in validate_sync_redis
    validate_modules(installed_modules, required_modules)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./lib/python3.13/site-packages/redisvl/redis/connection.py", line 189, in validate_modules
    raise RedisModuleVersionError(error_message)
redisvl.exceptions.RedisModuleVersionError: Required Redis db module search >= 20600 OR searchlight >= 20600 not installed. See Redis Stack docs at https://redis.io/docs/latest/operate/oss_and_stack/install/install-stack/.

Adding the workaround below bypasses the module check:

```python
from redisvl.redis.connection import RedisConnectionFactory

# Skip the Redis module version validation, which Valkey fails.
RedisConnectionFactory.validate_sync_redis = (
    lambda redis_client, lib_name=None, required_modules=None: None
)
with RedisSaver.from_conn_string(
    "redis://localhost:6379"
) as checkpointer:
    return checkpointer
```

Then

Exception Type: <class 'redisvl.exceptions.RedisSearchError'>
Error while searching: unknown command 'FT.SEARCH', with args beginning with: 'checkpoints' '(@thread_id:{15cb9aad\-5021\-4dbb\-a5cc\-2bb7a350c34e} @checkpoint_ns:{__empty__})' 'RETURN' '6' 'thread_id' 'chec' 
Traceback (most recent call last):
  File "./lib/python3.13/site-packages/redisvl/index/index.py", line 934, in search
    return self._redis_client.ft(self.schema.index.name).search(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        *args, **kwargs
        ^^^^^^^^^^^^^^^
    )  # type: ignore
    ^
  File "./lib/python3.13/site-packages/redis/commands/search/commands.py", line 519, in search
    res = self.execute_command(SEARCH_CMD, *args, **options)
  File "./lib/python3.13/site-packages/redis/client.py", line 622, in execute_command
    return self._execute_command(*args, **options)
           ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "./lib/python3.13/site-packages/redis/client.py", line 633, in _execute_command
    return conn.retry.call_with_retry(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~^
        lambda: self._send_command_parse_response(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<2 lines>...
        lambda _: self._close_connection(conn),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "./lib/python3.13/site-packages/redis/retry.py", line 87, in call_with_retry
    return do()
  File "./lib/python3.13/site-packages/redis/client.py", line 634, in <lambda>
    lambda: self._send_command_parse_response(
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        conn, command_name, *args, **options
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ),
    ^
  File "./lib/python3.13/site-packages/redis/client.py", line 605, in _send_command_parse_response
    return self.parse_response(conn, command_name, **options)
           ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./lib/python3.13/site-packages/redis/client.py", line 649, in parse_response
    response = connection.read_response(disable_decoding=True)
  File "./lib/python3.13/site-packages/redis/connection.py", line 666, in read_response
    raise response
redis.exceptions.ResponseError: unknown command 'FT.SEARCH', with args beginning with: 'checkpoints' '(@thread_id:{15cb9aad\-5021\-4dbb\-a5cc\-2bb7a350c34e} @checkpoint_ns:{__empty__})' 'RETURN' '6' 'thread_id' 'chec' 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  ...
  File "./lib/python3.13/site-packages/langgraph/checkpoint/redis/__init__.py", line 361, in get_tuple
    results = self.checkpoints_index.search(checkpoints_query)
  File "./lib/python3.13/site-packages/redisvl/index/index.py", line 942, in search
    raise RedisSearchError(f"Error while searching: {str(e)}") from e
redisvl.exceptions.RedisSearchError: Error while searching: unknown command 'FT.SEARCH', with args beginning with: 'checkpoints' '(@thread_id:{15cb9aad\-5021\-4dbb\-a5cc\-2bb7a350c34e} @checkpoint_ns:{__empty__})' 'RETURN' '6' 'thread_id' 'chec' 

@allenss-amazon (Member)

This would be more appropriate in the valkey-search repo issues, which it looks like it was referenced in.


Step 1 of the tokenization process is controlled by a specified list of punctuation characters.
Sequences of one or more punctuation characters delimit word boundaries.
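As an illustrative sketch, steps 1 and 2 of the tokenization process might behave like the following (the punctuation set shown is an assumption, not the module's default):

```python
import re

def tokenize(text, punctuation=".,;:!?()'\"-"):
    # Step 1: split on runs of punctuation and whitespace, dropping
    # empty tokens produced by redundant separators.
    splitter = "[" + re.escape(punctuation) + r"\s]+"
    words = [w for w in re.split(splitter, text) if w]
    # Step 2: case-fold so matching is case-insensitive.
    return [w.lower() for w in words]
```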


How will hyphenated words behave ("batman-dark-knight-rises")?
Will "batman", "dark", "knight", "rises" be separate? Or should certain punctuation be considered “in-word” optionally?

Member Author

Yes, during ingestion it will be tokenized into the separate words "batman", "dark", "knight", "rises", IF - is one of the punctuation characters specified at index creation (by default it is).

On the query side, if - is a punctuation character, in an exact phrase it is simply stripped out. If it is not an exact phrase, it will be parsed as: Term(batman), Negate(Term(dark)), Negate(Term(knight)), Negate(Term(rises)). This is because - is a query syntax character indicating negation. If their punctuation set does not include - and they want queries to match what was ingested, users can escape the - with a backslash.

Signed-off-by: Karthik Subbarao <karthikrs2021@gmail.com>