Previously reported in #2 (comment) by @aamir-s18
Streaming mode could be useful for very big models. It would help real-time use cases, where generating one token at a time improves the user experience.
Triton Server supports streaming with decoupled models.
It needs to be investigated how CTranslate2 can be used to get decoded tokens one by one. Additionally, this might be trickier in a beam-decode setting, unless we are willing to always return the current best hypothesis, which could flip previously emitted words.
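To illustrate the beam-decode caveat, here is a minimal pure-Python sketch (it does not use CTranslate2 or Triton, and all token scores are invented) showing why streaming the current best hypothesis can flip earlier words: a lower-ranked prefix can overtake the leader once its continuation scores come in.

```python
def next_logprobs(prefix):
    """Toy context-dependent log-probs; all values are invented for illustration."""
    if not prefix:
        return {"the": -0.1, "a": -0.3}
    if prefix[-1] == "the":
        return {"cat": -1.5, "dog": -1.6}
    if prefix[-1] == "a":
        return {"dog": -0.1}
    return {"<eos>": 0.0}

def stream_best_prefix(beam_size=2, steps=2):
    """Yield the top-ranked beam hypothesis after each decoding step.

    Because beam search keeps alternatives alive, the leading prefix can
    change between steps, so words streamed earlier may be revised.
    """
    beams = [((), 0.0)]  # (token tuple, cumulative log-prob)
    for _ in range(steps):
        candidates = [
            (tokens + (tok,), score + lp)
            for tokens, score in beams
            for tok, lp in next_logprobs(tokens).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
        yield beams[0][0]  # stream the current best guess

for prefix in stream_best_prefix():
    print(prefix)
```

Here the first streamed word is "the", but after the second step the best hypothesis becomes ("a", "dog"), flipping the already-emitted word. A streaming API either has to tolerate such revisions or restrict itself to greedy decoding.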