Conversation
Signed-off-by: Karthik Subbarao <karthikrs2021@gmail.com>
madolson
left a comment
Did a quick read and asked questions about things I didn't follow. @stockholmux for your reference.
ValkeyFullTextSearch.md
> ### Limits
>
> To avoid combinatorial explosion certain operations have configurable limits applied.
Include that and at least include how these limits will be derived.
Some of the limits are already implemented in valkey-search and are not specific to full-text search (query nesting depth, number of nodes in the query syntax tree, etc.).
Some full-text-specific limits (other than what is below) are: MAXEXPANSIONS and MAXPREFIXEXPANSIONS.
We can also consider adding a limit for MAXSEARCHRESULTS.
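A hedged sketch of how a nesting-depth limit could be enforced while validating a query syntax tree (illustrative only; the class and function names here are invented, not taken from valkey-search):

```python
# Toy query-syntax-tree node; "AND"/"OR" are interior operators,
# anything else is treated as a leaf term.
class Node:
    def __init__(self, op, children=None):
        self.op = op
        self.children = children or []

def check_depth(node, limit, depth=1):
    """Reject trees that nest deeper than `limit` levels."""
    if depth > limit:
        raise ValueError(f"query exceeds nesting depth limit of {limit}")
    for child in node.children:
        check_depth(child, limit, depth + 1)

# AND(OR(a, b), c) is 3 levels deep counting the leaf terms.
q = Node("AND", [Node("OR", [Node("a"), Node("b")]), Node("c")])
check_depth(q, limit=3)          # accepted
try:
    check_depth(q, limit=2)      # rejected: leaves sit at depth 3
    rejected = False
except ValueError:
    rejected = True
```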
> ## Motivation
>
> Text searching is useful in many applications.
We should identify the persona here. This is one of the most important questions to get the use case right. Is it for caching, primary database, etc.
+1
Coming from the OpenSearch project, the potential scope of this RFC is enormous. OpenSearch leverages Lucene for most of the search functionality described here. Except for stemming, I don't see any other libraries mentioned, so does that mean you plan to implement everything else? More details about the targeted persona and use case could help narrow the scope.
Yes, the plan is to implement the necessary search structures in Valkey code.
This implementation distinguishes itself from OpenSearch primarily in the time/space trade-off: OpenSearch is balanced toward saving space, while this implementation focuses on low latency. We expect dramatically shorter query and ingestion times than what OpenSearch supports. In general, this is targeting a more dynamic corpus, i.e., very short latency from an insertion until it is visible in a query.
Use cases will differentiate based on the fundamentals:
- Lower $/query, due to time-optimized data structures and reduced parsing/command-processing overhead.
- Incremental update machinery: data structures support incremental updates without GC.
- Current queries: query operations include the most recent mutations, with mutation-to-query visibility in less than a millisecond.
> We should identify the persona here.
Thanks everyone. I will update this section with more details
One use-case that I have seen repeated is to search session histories. In this use-case, per-user history is loaded when a session is initiated (either a web-session or sometimes a "log on" of an application).
Another use-case is in RAG operations with filters that have additional text phrasing -- phrasing that can't be predicted and turned into a TAG operation.
Reading through the RFC, I'm curious about the implementation of the term dictionary. Will it just be a key/value lookup from full terms to postings lists? (I guess that would fit in well with Valkey.) I guess each term is indexed in its forward and reversed representation in the prefix/suffix trees? Lucene uses finite-state transducers to encode the term dictionary. Similarly, Rust has the `fst` crate.
stockholmux
left a comment
Looks like there are some localization-related issues.
ValkeyFullTextSearch.md
> The tokenization process has four steps.
>
> 1. The text is tokenized, removing punctuation and redundant whitespace, generating a vector of words.
> 2. Latin upper-case characters are converted to lower-case.
This feels like a miss. What about other scripts with upper/lower case? Cyrillic? Greek?
There was a problem hiding this comment.
Will reword; the intention is to be case-insensitive.
Would it make sense to have an option control this?
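For reference, full Unicode case folding covers the scripts raised above; Python's built-in `str.casefold()` illustrates the behaviour such an option could enable:

```python
# .casefold() lowers Cyrillic and Greek like .lower(), and additionally
# handles special cases such as German sharp s (ß -> "ss").
assert "МОСКВА".casefold() == "москва"              # Cyrillic
assert "ΑΘΗΝΑ".casefold() == "αθηνα"                # Greek
assert "STRASSE".casefold() == "Straße".casefold()  # ß folds to "ss"
```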
> ```
> └───────────────┘
> ```
>
> Step 1 of the tokenization process is controlled by a specified list of punctuation characters.
is the specified list customizable?
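As a sketch of the punctuation-driven splitting described in the quoted step (illustrative only, assuming a customizable punctuation list and the backslash escaping the RFC mentions; not the valkey-search implementation):

```python
def tokenize(text, punctuation=".,;:!? -"):
    """Split `text` into words at runs of punctuation characters.
    A backslash escapes the next character so it stays in the word."""
    words, current = [], []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):   # escaped: keep literally
            current.append(text[i + 1])
            i += 2
            continue
        if ch in punctuation:                  # punctuation ends a word
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
        i += 1
    if current:
        words.append("".join(current))
    return words

assert tokenize("batman-dark-knight-rises") == ["batman", "dark", "knight", "rises"]
assert tokenize(r"U\-turn") == ["U-turn"]      # escaped '-' stays in the word
```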
> The text field-index is a mapping from a term to a list of locations.
> Depending on the configuration, the list of locations may be just a list of keys or a list of key/term-offset pairs. The list of locations is also known as a postings or a postings list.
>
> The mapping structure is built from one or two prefix trees. When present, the second prefix tree is built from each inserted term sequenced in reverse order, effectively providing a suffix tree.
how does this work for LTR and RTL languages?
Not sure that LTR vs RTL makes a difference here. Options like $inorder and $slop are trivially sensitive to ordering.
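The one-or-two-tree design quoted above can be sketched as follows: a second index over reversed terms turns suffix queries into prefix lookups. Sorted lists stand in for the radix trees in this sketch:

```python
import bisect

terms = ["running", "runner", "jumping", "walking"]
forward = sorted(terms)                    # prefix-tree stand-in
reverse = sorted(t[::-1] for t in terms)   # "suffix tree": reversed terms

def prefix_matches(index, prefix):
    """All entries in a sorted list that start with `prefix`."""
    i = bisect.bisect_left(index, prefix)
    out = []
    while i < len(index) and index[i].startswith(prefix):
        out.append(index[i])
        i += 1
    return out

# run* -> plain prefix lookup in the forward index
starts = prefix_matches(forward, "run")            # ['runner', 'running']
# *ing -> prefix lookup of the reversed pattern, results un-reversed
ends = [t[::-1] for t in prefix_matches(reverse, "ing"[::-1])]
```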
> | Suffix Tree | Boolean | Controls the presence/absence of a suffix tree. |
> | Word Offsets | Boolean | Indicates if the postings for a word contain word offsets or just a count. |
>
> ### Extension of the ```FT.CREATE``` command
So schema changes require a FT.DROP and complete re-index?
Yes. Right now, there are no plans to mutate existing index definitions.
> ## Abstract
>
> The existing search platform is extended: a new field type, ```TEXT```, is defined. The query language is extended to support locating keys with text fields based on term, wildcard, fuzzy and exact phrase matching.
So, using this type of query engine for full text search is next to impossible without FT.EXPLAIN. What are the plans for this command?
Support for FT.EXPLAIN makes sense.
> using this type of query engine for full text search is next to impossible without FT.EXPLAIN what are the plans for this command?
Basic text queries can be simple to understand. Yes, they can get complex, but so can hybrid vector queries (without text).
FT.EXPLAIN provides value in all of these cases, so it makes sense to support it. We can work towards adding support for it.
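Since FT.EXPLAIN is on the table, here is a purely illustrative sketch of the kind of query-plan tree such a command could render. The node names (`INTERSECT`, `NOT`, `TERM`) are invented for this sketch, not taken from valkey-search:

```python
def explain(node, depth=0):
    """Render a toy query AST as an indented plan, one node per line."""
    pad = "  " * depth
    if isinstance(node, tuple):           # interior operator node
        op, *children = node
        lines = [f"{pad}{op}"]
        for child in children:
            lines.extend(explain(child, depth + 1))
        return lines
    return [f"{pad}TERM({node})"]         # leaf term

# Roughly "@title:(batman -dark)" as a hand-built AST:
plan = ("INTERSECT", "batman", ("NOT", "dark"))
print("\n".join(explain(plan)))
# INTERSECT
#   TERM(batman)
#   NOT
#     TERM(dark)
```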
ValkeyFullTextSearch.md
> ```
> [PUNCTUATION <string>]
> ```
>
> The characters of this string are used to split the input string into words. Note, the splitting process allows escaping of input characters using the usual backslash notation. This string cannot be empty. Default value is: _TBD_
This is language dependent, so like the problems with global language defaults in stopwords (see below), the default punctuation should probably be linked with the stemming language.
Agreed, I believe that ultimately the input tokenizer will have substantial per-language functionality.
> The text searching facility provides machinery that decomposes text fields into terms and field-indexes them.
> The query facility of the ```FT.SEARCH``` and ```FT.AGGREGATE``` commands is enhanced to select keys based on combinations of term, fuzzy and phrase matching together with the existing selection operators of TAG, Numeric range and Vector searches.
>
> ### Tokenization process
How does this work with CJK and similar languages?
We are planning to support English first, and the tokenization process can be extended in the future to support other languages.
ValkeyFullTextSearch.md
> Controls the stemming algorithm. Supported languages are:
>
> **none, arabic, armenian, basque, catalan, danish, dutch, english, estonian, finnish, french, german, greek, hindi, hungarian, indonesian, irish, italian, lithuanian, nepali, norwegian, porter, portuguese, romanian, russian, scripts, serbian, spanish, swedish, tamil, turkish, yiddish**
Miss: Hebrew. Consider something pluggable for cases that Snowball doesn't cover.
I can include this detail. We are only planning on supporting English in the initial version, but we can make a note that Hebrew needs different handling.
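For illustration only: stemming collapses inflected forms of a word to a common stem. This toy suffix-stripper is far cruder than the Snowball stemmers listed above, but shows the idea:

```python
def toy_stem(word):
    """Crude English suffix stripping (illustrative; real stemmers
    such as Snowball/Porter apply many more rules)."""
    for suffix in ("ing", "edly", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

assert toy_stem("running") == "runn"      # "running" and "runn*" now collide
assert toy_stem("searches") == "search"
assert toy_stem("cats") == "cat"
```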
The FST data structure was examined and rejected due to its high run-time costs. Applications like OpenSearch that are FST-based use a log-structured-merge update technique that causes both significant write amplification and longer query operations. This choice of more time-efficient (but less space-efficient) data structures is how this implementation distinguishes itself from OpenSearch (and equivalents).
@msfroh (Sorry, the comment is located here; GitHub wouldn't let me comment directly on your comment.) Yes, the intent is to implement word dictionaries using forward- and reverse-direction radix trees. Each word entry in the dictionary references a full postings list.
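The dictionary-to-postings relationship described above can be sketched with a plain dict standing in for the radix trees (illustrative only; the key names and the whitespace tokenization are invented for this sketch):

```python
from collections import defaultdict

# term -> postings list of (key, word-offset) pairs,
# built incrementally as documents are ingested.
postings = defaultdict(list)

def index_document(key, text):
    for offset, term in enumerate(text.lower().split()):
        postings[term].append((key, offset))

index_document("doc:1", "the dark knight")
index_document("doc:2", "dark matter")

# A term lookup yields every location; phrase matching ("dark knight")
# can then check for adjacent offsets within the same key.
assert postings["dark"] == [("doc:1", 1), ("doc:2", 0)]
assert postings["knight"] == [("doc:1", 2)]
```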
Hey, I'm currently using Valkey (AWS) as the backend for my "redis-like" cache in an A.I.-related project, where Valkey acts as the memory cache of this agent. We had to implement our own custom BaseCheckpointSaver (see https://github.com/redis-developer/langgraph-redis), since the Redis one isn't compatible with Valkey's API. The first error we hit is below; other errors were raised after connecting. I was recommended to post here by @madolson. Thank you so much, it would be amazing if we could make this work for Valkey!

Description:

```python
from langgraph.checkpoint.redis import RedisSaver

with RedisSaver.from_conn_string("redis://localhost:6379") as checkpointer:
    return checkpointer
```

Produces:

```
Exception Type: <class 'redisvl.exceptions.RedisModuleVersionError'>
Unexpected error occurred: Required Redis db module search >= 20600 OR searchlight >= 20600 not installed. See Redis Stack docs at https://redis.io/docs/latest/operate/oss_and_stack/install/install-stack/.
Traceback (most recent call last):
  ...
  File "./test.py", line XX, in _redis_cache
    with RedisSaver.from_conn_string("redis://localhost:6379") as checkpointer:
  File "./.pyenv/versions/3.13.1/lib/python3.13/contextlib.py", line 141, in __enter__
    return next(self.gen)
  File "./lib/python3.13/site-packages/langgraph/checkpoint/redis/__init__.py", line 497, in from_conn_string
    saver = cls(
        redis_url=redis_url,
        ...<2 lines>...
        ttl=ttl,
    )
  File "./lib/python3.13/site-packages/langgraph/checkpoint/redis/__init__.py", line 55, in __init__
    super().__init__(
        redis_url=redis_url,
        ...<2 lines>...
        ttl=ttl,
    )
  File "./lib/python3.13/site-packages/langgraph/checkpoint/redis/base.py", line 125, in __init__
    self.configure_client(
        redis_url=redis_url,
        redis_client=redis_client,
        connection_args=connection_args or {},
    )
  File "./lib/python3.13/site-packages/langgraph/checkpoint/redis/__init__.py", line 70, in configure_client
    self._redis = redis_client or RedisConnectionFactory.get_redis_connection(
        redis_url, **connection_args
    )
  File "./lib/python3.13/site-packages/redisvl/redis/connection.py", line 261, in get_redis_connection
    RedisConnectionFactory.validate_sync_redis(
        client, required_modules=required_modules
    )
  File "./lib/python3.13/site-packages/redisvl/redis/connection.py", line 411, in validate_sync_redis
    validate_modules(installed_modules, required_modules)
  File "./lib/python3.13/site-packages/redisvl/redis/connection.py", line 189, in validate_modules
    raise RedisModuleVersionError(error_message)
redisvl.exceptions.RedisModuleVersionError: Required Redis db module search >= 20600 OR searchlight >= 20600 not installed. See Redis Stack docs at https://redis.io/docs/latest/operate/oss_and_stack/install/install-stack/.
```

Adding this below works around the module check:

```python
from redisvl.redis.connection import RedisConnectionFactory

RedisConnectionFactory.validate_sync_redis = (
    lambda redis_client, lib_name=None, required_modules=None: None
)
with RedisSaver.from_conn_string(
    "redis://localhost:6379"
) as checkpointer:
    return checkpointer
```

Then:

Exception Type: <class 'redisvl.exceptions.RedisSearchError'>
```
Error while searching: unknown command 'FT.SEARCH', with args beginning with: 'checkpoints' '(@thread_id:{15cb9aad\-5021\-4dbb\-a5cc\-2bb7a350c34e} @checkpoint_ns:{__empty__})' 'RETURN' '6' 'thread_id' 'chec'
Traceback (most recent call last):
  File "./lib/python3.13/site-packages/redisvl/index/index.py", line 934, in search
    return self._redis_client.ft(self.schema.index.name).search(
        *args, **kwargs
    )  # type: ignore
  File "./lib/python3.13/site-packages/redis/commands/search/commands.py", line 519, in search
    res = self.execute_command(SEARCH_CMD, *args, **options)
  File "./lib/python3.13/site-packages/redis/client.py", line 622, in execute_command
    return self._execute_command(*args, **options)
  File "./lib/python3.13/site-packages/redis/client.py", line 633, in _execute_command
    return conn.retry.call_with_retry(
        lambda: self._send_command_parse_response(
        ...<2 lines>...
        lambda _: self._close_connection(conn),
    )
  File "./lib/python3.13/site-packages/redis/retry.py", line 87, in call_with_retry
    return do()
  File "./lib/python3.13/site-packages/redis/client.py", line 634, in <lambda>
    lambda: self._send_command_parse_response(
        conn, command_name, *args, **options
    ),
  File "./lib/python3.13/site-packages/redis/client.py", line 605, in _send_command_parse_response
    return self.parse_response(conn, command_name, **options)
  File "./lib/python3.13/site-packages/redis/client.py", line 649, in parse_response
    response = connection.read_response(disable_decoding=True)
  File "./lib/python3.13/site-packages/redis/connection.py", line 666, in read_response
    raise response
redis.exceptions.ResponseError: unknown command 'FT.SEARCH', with args beginning with: 'checkpoints' '(@thread_id:{15cb9aad\-5021\-4dbb\-a5cc\-2bb7a350c34e} @checkpoint_ns:{__empty__})' 'RETURN' '6' 'thread_id' 'chec'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  ...
  File "./lib/python3.13/site-packages/langgraph/checkpoint/redis/__init__.py", line 361, in get_tuple
    results = self.checkpoints_index.search(checkpoints_query)
  File "./lib/python3.13/site-packages/redisvl/index/index.py", line 942, in search
    raise RedisSearchError(f"Error while searching: {str(e)}") from e
redisvl.exceptions.RedisSearchError: Error while searching: unknown command 'FT.SEARCH', with args beginning with: 'checkpoints' '(@thread_id:{15cb9aad\-5021\-4dbb\-a5cc\-2bb7a350c34e} @checkpoint_ns:{__empty__})' 'RETURN' '6' 'thread_id' 'chec'
```
This would be more appropriate in the valkey-search repo issues, which it looks like it was referenced in.
> Step 1 of the tokenization process is controlled by a specified list of punctuation characters.
> Sequences of one or more punctuation characters delimit word boundaries.
How will hyphenated words behave ("batman-dark-knight-rises")?
Will "batman", "dark", "knight", "rises" be separate? Or should certain punctuation be considered “in-word” optionally?
Yes, at ingestion it will be tokenized into the separate terms "batman", "dark", "knight", "rises" IF "-" is specified as a punctuation character at index creation (by default it is).
On the query side, if "-" is a punctuation character, then inside an exact phrase it is simply stripped out. Outside an exact phrase, the query will be parsed as: Term(batman), Negate(Term(dark)), Negate(Term(knight)), Negate(Term(rises)), because "-" is a query-syntax character indicating negation. If their punctuation list does not include "-" and they want queries to match what was ingested, users can escape it with a backslash.
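The ingestion/query asymmetry described above can be sketched as follows (toy code; the real tokenizer and parser are more involved, and the escaping rule here is an assumption):

```python
def ingest_tokens(text, punctuation="-"):
    """Ingestion side: punctuation characters delimit word boundaries."""
    for p in punctuation:
        text = text.replace(p, " ")
    return text.split()

def parse_query_terms(query):
    """Query side (toy): a leading '-' negates the term unless escaped."""
    ops = []
    for tok in query.split():
        if tok.startswith("\\-"):
            ops.append(("TERM", tok[1:]))   # escaped: keep '-' literal
        elif tok.startswith("-"):
            ops.append(("NOT", tok[1:]))
        else:
            ops.append(("TERM", tok))
    return ops

assert ingest_tokens("batman-dark-knight-rises") == ["batman", "dark", "knight", "rises"]
assert parse_query_terms("batman -dark") == [("TERM", "batman"), ("NOT", "dark")]
```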
Valkey FullTextSearch RFC
Content taken from: https://github.com/valkey-io/valkey-search/tree/fulltext/rfc
I can help update the RFC based on any question / comment added to the PR.