[RFC] Support pluggable and composable per-field codecs #20491

@reta

History

A while ago we introduced CodecServiceFactory to let plugins supply their own custom codecs. It does the job and is actively used by:

  • custom-codecs
  • k-nn
  • neural-search
  • security-analytics
  • jvector

Is your feature request related to a problem? Please describe

The plugins above supply their own codecs using CodecServiceFactory. One of the major limitations of CodecServiceFactory is composability: for a particular index, only one CodecServiceFactory can be provided, otherwise an exception is raised (see https://forum.opensearch.org/t/knn-doesnt-coexist-with-seismic-sparse/27709 for a recent example).

Currently, there are two major uses of CodecServiceFactory, which are mutually exclusive within the same index:

  1. Register an additional codec with non-trivial instantiation (for example, custom-codecs)
  2. Wrap every single codec (k-nn, neural-search, security-analytics, jvector)

The first issue is already addressed by #20411. The second one needs more upfront design and decisions, hence this proposal.
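The single-factory limitation described above can be modeled with a small, self-contained sketch. It does not use the OpenSearch APIs: `SingleFactoryCheckSketch`, `selectFactory`, and the string factory names are hypothetical stand-ins, and the exception message is invented for illustration.

```java
import java.util.List;
import java.util.Optional;

final class SingleFactoryCheckSketch {
    // Hypothetical model: each entry is the name of the codec service factory
    // that one plugin offers for the index, or empty if the plugin offers none.
    static Optional<String> selectFactory(List<Optional<String>> offers) {
        Optional<String> selected = Optional.empty();
        for (Optional<String> offer : offers) {
            if (offer.isEmpty()) {
                continue; // this plugin does not customize codecs
            }
            if (selected.isPresent()) {
                // A second offer for the same index is rejected outright.
                throw new IllegalStateException(
                    "Factory [" + selected.get() + "] is already set, cannot also use [" + offer.get() + "]"
                );
            }
            selected = offer;
        }
        return selected;
    }

    public static void main(String[] args) {
        // One plugin offering a factory works.
        System.out.println(selectFactory(List.of(Optional.empty(), Optional.of("knn"))));
        // Two plugins offering factories for the same index fail.
        try {
            selectFactory(List.of(Optional.of("knn"), Optional.of("sparse")));
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the sketch is that the failure is independent of what the two factories actually do: the second offer is refused even if the plugins care about entirely different fields.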

Where We Are

The majority of the plugins (k-nn, neural-search, security-analytics, jvector) follow the same pattern of wrapping whatever codec the index uses, for example:

public class KNNCodecService extends CodecService {
    ...
    @Override
    public Codec codec(String name) {
        return KNN1030Codec.builder()
            .delegate(super.codec(name))
            .mapperService(mapperService)
            .knnVectorsFormat(new KNN9120PerFieldKnnVectorsFormat(Optional.ofNullable(mapperService), nativeIndexBuildStrategyFactory))
            .build();
    }
}

Or

public class SparseCodecService extends CodecService {
    ...
    @Override
    public Codec codec(String name) {
        return new SparseCodec(super.codec(name));
    }
}

It somewhat works when an index references only one plugin (for example, k-nn), but falls apart if the index wants to benefit from several plugins (neural-search, k-nn); see https://forum.opensearch.org/t/knn-doesnt-coexist-with-seismic-sparse/27709.

In fact, what most of the plugins need is just to encode / decode a single field (knn_vector or sparse_vector_field), and that could be done within the same index, since every field has only one type (one exception is stored fields, which we touch upon later). It is not possible at the moment, but we believe it can be solved.

Describe the solution you'd like

There are multiple ways we could approach the problem.

Option 1. Support deeply nested wrappings (for example, new SparseCodec(new KNN1030Codec(super.codec(name)))). At a high level it looks like an easy fix, but:
a) since there is no way to control the plugin initialization sequence / ordering, it is very fragile and difficult to fix, and
b) there could be conflicts when different plugins alter the same stored fields (such as _source)
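The ordering concern in (a) can be illustrated with a minimal sketch. It deliberately avoids the Lucene and plugin classes: `NamedCodec`, `NestedWrappingSketch`, and the two wrappers are hypothetical stand-ins that only track the nesting order.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a codec: only the name matters for this illustration.
final class NamedCodec {
    final String name;

    NamedCodec(String name) {
        this.name = name;
    }
}

final class NestedWrappingSketch {
    // Hypothetical wrappers, one per plugin.
    static final UnaryOperator<NamedCodec> KNN = c -> new NamedCodec("KNN(" + c.name + ")");
    static final UnaryOperator<NamedCodec> SPARSE = c -> new NamedCodec("Sparse(" + c.name + ")");

    // Applies each plugin's wrapper in the order the plugins happened to be loaded.
    static NamedCodec wrapAll(NamedCodec base, List<UnaryOperator<NamedCodec>> wrappers) {
        NamedCodec current = base;
        for (UnaryOperator<NamedCodec> wrapper : wrappers) {
            current = wrapper.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        NamedCodec base = new NamedCodec("default");
        System.out.println(wrapAll(base, List.of(KNN, SPARSE)).name); // Sparse(KNN(default))
        System.out.println(wrapAll(base, List.of(SPARSE, KNN)).name); // KNN(Sparse(default))
    }
}
```

The same two plugins produce two different codec stacks depending on load order, so whichever wrapper ends up outermost gets the first say on every format it overrides.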

Option 2. Introduce per-field customizations (see #20411 (comment) for the context) that would allow us to reliably collect per-field PostingsFormat / KnnVectorsFormat / DocValuesFormat, and a per-index StoredFieldsFormat (still backed by per-field customizations that would help in detecting conflicts).

The second option is clearly superior, although it requires large-scale changes (which, luckily, we could roll out incrementally). To reduce the scope of the changes but keep the door fully open for future enhancements, we are going to:

  • focus on KnnVectorsFormat / PostingsFormat,
  • focus on binary fields only (since this is how vectors are stored now), and
  • not support a synthetic per-field StoredFieldsFormat

This is what the k-nn, neural-search, security-analytics, and jvector plugins override. Nothing prevents supporting the remaining features later; it is only a matter of effort and the number of additional APIs.

Initial Design

Apache Lucene has pretty good out-of-the-box support for per-field abstractions (PerFieldKnnVectorsFormat, PerFieldPostingsFormat, ...) which we could leverage (and already do in different places). What we need is to allow plugins to signal that they have their own formats and allow those to be injected into the default index codec. For this reason, we may introduce a new PerFieldCodecBuilder abstraction:

import java.util.Optional;

import org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat;
import org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat;
import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;

public interface PerFieldCodecBuilder {
    // Each method returns Optional.empty() when the plugin does not customize that format.
    Optional<PerFieldDocValuesFormat> docValuesFormat();
    Optional<PerFieldKnnVectorsFormat> knnVectorsFormat();
    Optional<PerFieldPostingsFormat> postingsFormat();
}

And a new extension point for it in the Plugin APIs:

Optional<PerFieldCodecBuilder> getPerFieldCodecBuilder(@Nullable MapperService mapperService, IndexSettings indexSettings);

The APIs purposely use Optional and not Collection since all PerFieldXxx abstractions expose getXxxForField methods, naturally supporting multiple fields. It is also doubtful that an individual plugin needs multiple PerFieldCodecBuilder instances. The DelegatingPerFieldCodecBuilder would collect the PerFieldCodecBuilder instances from all plugins:

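import java.util.Collection;

import org.apache.lucene.codecs.DocValuesFormat;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.PostingsFormat;

public class DelegatingPerFieldCodecBuilder {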
    private final Collection<PerFieldCodecBuilder> builders;

    public DelegatingPerFieldCodecBuilder(Collection<PerFieldCodecBuilder> builders) {
        this.builders = builders;
    }

    public DocValuesFormat buildDocValues(DocValuesFormat delegate) {
        return new CompositeDelegatingPerFieldDocValues(
            delegate,
            builders.stream().map(PerFieldCodecBuilder::docValuesFormat)
        );
    }

    public KnnVectorsFormat buildKnnVectors(KnnVectorsFormat delegate) {
        return new CompositeDelegatingPerFieldKnnVectors(
            delegate,
            builders.stream().map(PerFieldCodecBuilder::knnVectorsFormat)
        );
    }

    public PostingsFormat postingsFormat(PostingsFormat delegate) {
        return new CompositeDelegatingPerFieldPostingsFormat(
            delegate,
            builders.stream().map(PerFieldCodecBuilder::postingsFormat)
        );
    }

}

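It is then used by CompositeDelegatingPerFieldCodec later on:

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.DocValuesFormat;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.PostingsFormat;

public class CompositeDelegatingPerFieldCodec extends FilterCodec {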
    private static final String NAME = "CompositeDelegating103PerFieldCodec";
    private final DelegatingPerFieldCodecBuilder builder;

    public CompositeDelegatingPerFieldCodec(Codec delegate, DelegatingPerFieldCodecBuilder builder) {
        super(NAME, delegate);
        this.builder = builder;
    }

    @Override
    public DocValuesFormat docValuesFormat() {
        return builder.buildDocValues(delegate.docValuesFormat());
    }

    @Override
    public KnnVectorsFormat knnVectorsFormat() {
        return builder.buildKnnVectors(delegate.knnVectorsFormat());
    }

    @Override
    public PostingsFormat postingsFormat() {
        return builder.postingsFormat(delegate.postingsFormat());
    }
}

This is roughly the high-level idea and the key APIs.
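To make the per-field composition more concrete, here is a minimal sketch of how a composite format might pick a format for a given field. It is not tied to Lucene: formats are plain strings, each plugin is modeled as a hypothetical function from field name to an optional format, and the names `PerFieldResolutionSketch`, `claims`, and `resolve` are invented for illustration. Rejecting competing claims for the same field is an assumption of this sketch, not something the proposal has settled.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.stream.Collectors;

final class PerFieldResolutionSketch {
    // Hypothetical plugin: claims exactly one field and names the format it wants for it.
    static Function<String, Optional<String>> claims(String ownedField, String format) {
        return field -> field.equals(ownedField) ? Optional.of(format) : Optional.<String>empty();
    }

    // Resolves the format for a field by asking every plugin, falling back to the delegate.
    static String resolve(String field, String delegateFormat, List<Function<String, Optional<String>>> plugins) {
        List<String> found = plugins.stream()
            .map(plugin -> plugin.apply(field))
            .flatMap(Optional::stream)
            .collect(Collectors.toList());
        if (found.isEmpty()) {
            return delegateFormat; // no plugin is interested in this field
        }
        if (found.size() > 1) {
            // Assumption: conflicts are reported instead of being decided by plugin load order.
            throw new IllegalStateException("Multiple formats claimed for field [" + field + "]: " + found);
        }
        return found.get(0);
    }

    public static void main(String[] args) {
        List<Function<String, Optional<String>>> plugins = List.of(
            claims("embedding", "KnnFormat"),
            claims("sparse_tokens", "SparseFormat")
        );
        System.out.println(resolve("embedding", "DefaultFormat", plugins));
        System.out.println(resolve("sparse_tokens", "DefaultFormat", plugins));
        System.out.println(resolve("title", "DefaultFormat", plugins));
    }
}
```

Unlike nested codec wrapping, the result here does not depend on the order of the plugin list, because each field is claimed by at most one plugin and everything else goes to the delegate.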

Backward Compatibility

One complication of the solution above is the evolution of the formats and backward compatibility. Since we have detached the plugins from the actual codec, the versions (or their equivalents) may need to be included along with the field info (using attributes(), for example). This area is still under exploration.
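As one possible direction, the format version could travel with the field as a string attribute. The sketch below is only an illustration of that idea under several assumptions: a plain `Map<String, String>` stands in for the per-field attributes, and the attribute key, the version numbers, and the reader names are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class FormatVersionAttributeSketch {
    // Hypothetical attribute key and version; a real plugin would choose its own.
    static final String VERSION_KEY = "example_plugin.format_version";
    static final int CURRENT_VERSION = 2;

    // Write side: record which format version produced the field's data.
    static void recordVersion(Map<String, String> fieldAttributes) {
        fieldAttributes.put(VERSION_KEY, Integer.toString(CURRENT_VERSION));
    }

    // Read side: choose a reader from the recorded version instead of the codec name.
    static String readerFor(Map<String, String> fieldAttributes) {
        String raw = fieldAttributes.get(VERSION_KEY);
        if (raw == null) {
            throw new IllegalStateException("Missing attribute [" + VERSION_KEY + "]");
        }
        int version = Integer.parseInt(raw);
        switch (version) {
            case 1:
                return "ExampleV1Reader";
            case 2:
                return "ExampleV2Reader";
            default:
                throw new IllegalStateException("Unsupported format version [" + version + "]");
        }
    }

    public static void main(String[] args) {
        Map<String, String> attributes = new HashMap<>();
        recordVersion(attributes);
        System.out.println(readerFor(attributes));
    }
}
```

How segments written before such an attribute existed should be handled is left open here, in line with the note that this area is still being explored.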

Related component

Plugins

Describe alternatives you've considered

Keep things unchanged

Additional context
