A while ago we introduced CodecServiceFactory to let plugins supply their own custom codecs. It does the job and is actively used by:
- custom-codecs
- k-nn
- neural-search
- security-analytics
- jvector
Is your feature request related to a problem? Please describe
The plugins above supply their own codecs using CodecServiceFactory. One of the major limitations of CodecServiceFactory is composability: for a particular index, only one CodecServiceFactory can be provided; otherwise an exception is raised (see https://forum.opensearch.org/t/knn-doesnt-coexist-with-seismic-sparse/27709 for one recent example).
Currently, there are two major uses of CodecServiceFactory, mutually exclusive within the scope of the same index:
- Register an additional codec with non-trivial instantiation (custom-codecs, for example)
- Wrap every single codec (k-nn, neural-search, security-analytics, jvector)
The first use is already addressed by #20411; the second one needs more upfront design and decisions, hence this proposal.
Where We Are
The majority of the plugins (k-nn, neural-search, security-analytics, jvector) follow the same pattern of wrapping whatever codec the index uses, for example:
public class KNNCodecService extends CodecService {
    ...
    @Override
    public Codec codec(String name) {
        return KNN1030Codec.builder()
            .delegate(super.codec(name))
            .mapperService(mapperService)
            .knnVectorsFormat(new KNN9120PerFieldKnnVectorsFormat(Optional.ofNullable(mapperService), nativeIndexBuildStrategyFactory))
            .build();
    }
}
Or
public class SparseCodecService extends CodecService {
    ...
    @Override
    public Codec codec(String name) {
        return new SparseCodec(super.codec(name));
    }
}
It somewhat works when an index references only one plugin (for example, k-nn), but falls apart if an index wants to benefit from several plugins (neural-search, k-nn); see https://forum.opensearch.org/t/knn-doesnt-coexist-with-seismic-sparse/27709.
In fact, most of the plugins only need to encode / decode one single field (knn_vector or sparse_vector_field), and that could be done within the scope of the same index, since every field has only one type (stored fields are one exception, which we will touch upon later). This is not possible at the moment, but we believe it can be solved.
Describe the solution you'd like
There are multiple ways we could approach the problem.
Option 1. Support deeply nested wrappings (for example, new SparseCodec(new KNN1030Codec(super.codec(name)))). At a high level it looks like an easy fix, but:
a) since there is no way to control the plugin initialization sequence / ordering, it is very fragile and difficult to fix, and
b) there could be conflicts when different plugins alter the same stored fields (such as _source)
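The ordering concern in Option 1 can be illustrated with a minimal sketch. It uses no OpenSearch or Lucene types: `SketchCodec`, `wrap`, and `applyInOrder` are hypothetical stand-ins, and the assumption is that each plugin wrapper takes over stored-fields handling regardless of what it wraps.

```java
import java.util.Arrays;
import java.util.List;

final class WrappingOrderSketch {
    /** Hypothetical stand-in for a codec; it only reports who handles stored fields. */
    interface SketchCodec {
        String storedFieldsOwner();
    }

    /** The index's default codec before any plugin wraps it. */
    static SketchCodec base() {
        return () -> "default";
    }

    /** A plugin wrapper that takes over stored-fields handling and ignores the delegate's choice. */
    static SketchCodec wrap(String plugin, SketchCodec delegate) {
        return () -> plugin;
    }

    /** Applies the wrappers in plugin load order, so the last plugin ends up outermost. */
    static SketchCodec applyInOrder(List<String> pluginLoadOrder) {
        SketchCodec codec = base();
        for (String plugin : pluginLoadOrder) {
            codec = wrap(plugin, codec);
        }
        return codec;
    }

    public static void main(String[] args) {
        // Same two plugins, different load order, different effective owner.
        System.out.println(applyInOrder(Arrays.asList("k-nn", "neural-search")).storedFieldsOwner());
        System.out.println(applyInOrder(Arrays.asList("neural-search", "k-nn")).storedFieldsOwner());
    }
}
```

Under these assumptions the outermost wrapper silently wins, so the effective behavior depends on a load order that the plugins do not control.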
Option 2. Introduce per-field customizations (see #20411 (comment) for the context) that would make it possible to reliably collect per-field PostingsFormat / KnnVectorsFormat / DocValuesFormat, and a per-index StoredFieldsFormat (still backed by per-field customizations that would help in detecting conflicts).
The 2nd option is clearly superior, although it requires large-scale changes (that, luckily, we could roll out incrementally). To reduce the scope of the changes but keep the door fully open for any future enhancements, we are going to:
- focus on KnnVectorsFormat / PostingsFormat, and
- focus on binary fields only (since this is how vectors are stored now)
- synthetic per-field StoredFieldsFormat will not be supported
This is what the k-nn, neural-search, security-analytics, and jvector plugins override. Nothing prevents supporting all features later; it is only a matter of effort and the number of additional APIs.
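To make the conflict-detection idea behind per-field customizations concrete, here is a minimal sketch that records which plugin supplies the format for each field and fails fast when two plugins claim the same one. `PerFieldClaimsSketch` and its methods are hypothetical names used only for illustration; the proposal does not prescribe this shape.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PerFieldClaimsSketch {
    // Field name -> name of the plugin that supplies the format for that field.
    private final Map<String, String> ownerByField = new LinkedHashMap<>();

    /** Registers a plugin's claim on a field and rejects a second, different claimant. */
    void claim(String plugin, String field) {
        String previous = ownerByField.putIfAbsent(field, plugin);
        if (previous != null && !previous.equals(plugin)) {
            throw new IllegalStateException(
                "Field [" + field + "] is claimed by both [" + previous + "] and [" + plugin + "]");
        }
    }

    /** Returns the claiming plugin, or "default" when no plugin customizes the field. */
    String ownerOf(String field) {
        return ownerByField.getOrDefault(field, "default");
    }

    public static void main(String[] args) {
        PerFieldClaimsSketch claims = new PerFieldClaimsSketch();
        claims.claim("k-nn", "my_vector");
        claims.claim("neural-search", "my_sparse");
        System.out.println(claims.ownerOf("my_vector"));
        System.out.println(claims.ownerOf("title"));
        try {
            claims.claim("jvector", "my_vector"); // second claimant for the same field
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because each field has exactly one type, claims on different fields compose without any ordering between plugins, while a double claim surfaces as an explicit error.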
Initial Design
Apache Lucene has pretty good out-of-the-box support for per-field abstractions (PerFieldKnnVectorsFormat, PerFieldPostingsFormat, ...) which we could leverage (and already do in different places). What we need is to allow plugins to signal that they have their own formats and allow those to be injected into the default index codec. For this reason, we may introduce the new PerFieldCodecBuilder abstraction:
import java.util.Optional;

import org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat;
import org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat;
import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;

// Each method returns an empty Optional when the plugin does not customize that format.
public interface PerFieldCodecBuilder {
    Optional<PerFieldDocValuesFormat> docValuesFormat();
    Optional<PerFieldKnnVectorsFormat> knnVectorsFormat();
    Optional<PerFieldPostingsFormat> postingsFormat();
}
And new extension points for it in Plugin APIs:
Optional<PerFieldCodecBuilder> getPerFieldCodecBuilder(@Nullable MapperService mapperService, IndexSettings indexSettings);
The APIs purposely use Optional and not Collection since all PerFieldXxx classes expose abstract getXxxForField methods, naturally supporting multiple fields. It is also doubtful that an individual plugin needs multiple PerFieldCodecBuilder instances. The DelegatingPerFieldCodecBuilder would collect the PerFieldCodecBuilder instances from all plugins:
public class DelegatingPerFieldCodecBuilder {
    private final Collection<PerFieldCodecBuilder> builders;

    public DelegatingPerFieldCodecBuilder(Collection<PerFieldCodecBuilder> builders) {
        this.builders = builders;
    }

    public DocValuesFormat buildDocValues(DocValuesFormat delegate) {
        return new CompositeDelegatingPerFieldDocValues(
            delegate,
            builders.stream().map(PerFieldCodecBuilder::docValuesFormat)
        );
    }

    public KnnVectorsFormat buildKnnVectors(KnnVectorsFormat delegate) {
        return new CompositeDelegatingPerFieldKnnVectors(
            delegate,
            builders.stream().map(PerFieldCodecBuilder::knnVectorsFormat)
        );
    }

    public PostingsFormat postingsFormat(PostingsFormat delegate) {
        return new CompositeDelegatingPerFieldPostingsFormat(
            delegate,
            builders.stream().map(PerFieldCodecBuilder::postingsFormat)
        );
    }
}
And used by CompositeDelegatingPerFieldCodec later on:
public class CompositeDelegatingPerFieldCodec extends FilterCodec {
    private static final String NAME = "CompositeDelegating103PerFieldCodec";
    private final DelegatingPerFieldCodecBuilder builder;

    public CompositeDelegatingPerFieldCodec(Codec delegate, DelegatingPerFieldCodecBuilder builder) {
        super(NAME, delegate);
        this.builder = builder;
    }

    @Override
    public DocValuesFormat docValuesFormat() {
        return builder.buildDocValues(delegate.docValuesFormat());
    }

    @Override
    public KnnVectorsFormat knnVectorsFormat() {
        return builder.buildKnnVectors(delegate.knnVectorsFormat());
    }

    @Override
    public PostingsFormat postingsFormat() {
        return builder.postingsFormat(delegate.postingsFormat());
    }
}
This is roughly the high-level idea and the key APIs.
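The CompositeDelegatingPerFieldXxx classes are not spelled out above. One way their per-field lookup could behave is sketched below with plain strings standing in for Lucene formats. Everything here is an assumption for illustration: `CompositeLookupSketch` and `owning` are hypothetical, and a plugin signals "not my field" by returning an empty `Optional`, whereas Lucene's own getXxxForField methods return a format directly, so the real signalling mechanism remains an open design point.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

final class CompositeLookupSketch {
    // Hypothetical: each plugin maps a field name to a format name, or empty if the field is not its own.
    private final List<Function<String, Optional<String>>> pluginLookups;
    // Format of the wrapped (delegate) codec, used for every field no plugin claims.
    private final String delegateFormat;

    CompositeLookupSketch(String delegateFormat, List<Function<String, Optional<String>>> pluginLookups) {
        this.delegateFormat = delegateFormat;
        this.pluginLookups = pluginLookups;
    }

    /** Convenience factory: a plugin lookup that owns exactly one field. */
    static Function<String, Optional<String>> owning(String ownedField, String format) {
        return field -> field.equals(ownedField) ? Optional.of(format) : Optional.<String>empty();
    }

    /** Asks each plugin in turn; falls back to the delegate's format when none claims the field. */
    String formatForField(String field) {
        for (Function<String, Optional<String>> lookup : pluginLookups) {
            Optional<String> format = lookup.apply(field);
            if (format.isPresent()) {
                return format.get();
            }
        }
        return delegateFormat;
    }

    public static void main(String[] args) {
        CompositeLookupSketch lookup = new CompositeLookupSketch(
            "delegate-format",
            Arrays.asList(owning("my_vector", "knn-format"), owning("my_sparse", "sparse-format")));
        System.out.println(lookup.formatForField("my_vector"));
        System.out.println(lookup.formatForField("my_sparse"));
        System.out.println(lookup.formatForField("title"));
    }
}
```

This sketch takes the first plugin that answers; a production version would more likely reject a field claimed by two plugins, in line with the conflict detection mentioned under Option 2.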
Backward Compatibility
One complication of the solution above is the evolution of the formats and backward compatibility. Since we have detached the plugins from the actual codec, the versions (or their equivalents) may need to be included along with the field info (using attributes(), for example). This area is still under exploration right now.
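Since this area is still being explored, the following is only a sketch of what recording a version alongside the field info might look like. A plain `Map<String, String>` stands in for the field attributes, and the attribute key, the `name/version` encoding, and the reader names are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class FormatVersionAttributeSketch {
    // Hypothetical attribute key; not an existing OpenSearch or Lucene constant.
    static final String VERSION_KEY = "sketch.plugin.format.version";

    /** At write time, records which plugin format (and version) encoded the field. */
    static void recordOnWrite(Map<String, String> fieldAttributes, String formatName, int version) {
        fieldAttributes.put(VERSION_KEY, formatName + "/" + version);
    }

    /** At read time, picks the reader matching the recorded version, or the default if none was recorded. */
    static String selectReader(Map<String, String> fieldAttributes, Map<String, String> readersByVersion) {
        String recorded = fieldAttributes.get(VERSION_KEY);
        if (recorded == null) {
            return "default-reader";
        }
        String reader = readersByVersion.get(recorded);
        if (reader == null) {
            throw new IllegalStateException("No reader registered for format version [" + recorded + "]");
        }
        return reader;
    }

    public static void main(String[] args) {
        Map<String, String> readers = new HashMap<>();
        readers.put("knn/1", "knn-reader-v1");
        readers.put("knn/2", "knn-reader-v2");

        Map<String, String> oldSegmentField = new HashMap<>();
        recordOnWrite(oldSegmentField, "knn", 1);
        System.out.println(selectReader(oldSegmentField, readers));
        System.out.println(selectReader(new HashMap<>(), readers));
    }
}
```

The point of the sketch is that a segment written with an older plugin format stays readable after the plugin upgrades, because the version travels with the field rather than with the codec name.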
Related component
Plugins
Describe alternatives you've considered
Keep things unchanged