Previously reported in #2 (comment) by @aamir-s18
Streaming mode could be useful for very big models. It would help real-time use cases, where generating one token at a time improves the user experience.
Triton Server supports streaming with decoupled models.
It needs to be investigated how CTranslate2 can be used to get decoded tokens one by one. Additionally, this might be trickier in a beam-decode setting, unless we are willing to always return the current best hypothesis, which could flip previously emitted words.
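To illustrate the beam-decode caveat, here is a minimal pure-Python sketch (it does not use CTranslate2 or Triton, and all token scores are invented) showing why streaming the current best hypothesis can flip earlier words: a lower-ranked prefix can overtake the leader once its continuation scores come in.

```python
def next_logprobs(prefix):
    """Toy context-dependent log-probs; all values are invented for illustration."""
    if not prefix:
        return {"the": -0.1, "a": -0.3}
    if prefix[-1] == "the":
        return {"cat": -1.5, "dog": -1.6}
    if prefix[-1] == "a":
        return {"dog": -0.1}
    return {"<eos>": 0.0}

def stream_best_prefix(beam_size=2, steps=2):
    """Yield the top-ranked beam hypothesis after each decoding step.

    Because beam search keeps alternatives alive, the leading prefix can
    change between steps, so words streamed earlier may be revised.
    """
    beams = [((), 0.0)]  # (token tuple, cumulative log-prob)
    for _ in range(steps):
        candidates = [
            (tokens + (tok,), score + lp)
            for tokens, score in beams
            for tok, lp in next_logprobs(tokens).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
        yield beams[0][0]  # stream the current best guess

for prefix in stream_best_prefix():
    print(prefix)
```

Here the first streamed word is "the", but after the second step the best hypothesis becomes ("a", "dog"), flipping the already-emitted word. A streaming API either has to tolerate such revisions or restrict itself to greedy decoding.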