I have discovered that running the same model with the same parameters from llm (gguf branch) and llama.cpp results in different behavior. llm does not seem to be reading the EOS token, so the model keeps generating output until the max-token limit is reached.
Here is llama.cpp:

And the same model from llm:

According to a discussion on Discord, it might indeed be a bug.
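
For reference, here is a minimal sketch of the stop condition I would expect the gguf branch to apply in its generation loop. The function and type names are hypothetical stand-ins (I haven't traced the actual sampling code); the point is only that the EOS id should come from the model's metadata and terminate the loop:

```rust
// Sketch of the expected EOS stop condition. `sample_next_token` and
// `Vocabulary` are hypothetical placeholders; a real GGUF loader would
// read the EOS id from the file's metadata (e.g. the
// `tokenizer.ggml.eos_token_id` key) instead of hardcoding it.

struct Vocabulary {
    eos_token_id: u32, // should be populated from the GGUF metadata
}

fn generate(vocab: &Vocabulary, max_tokens: usize) -> Vec<u32> {
    let mut output = Vec::new();
    for _ in 0..max_tokens {
        let token = sample_next_token(&output);
        // Without this check, generation only ever stops at `max_tokens`,
        // which matches the behavior I'm seeing from llm.
        if token == vocab.eos_token_id {
            break;
        }
        output.push(token);
    }
    output
}

fn sample_next_token(_context: &[u32]) -> u32 {
    // Placeholder: a real implementation runs the model and samples a token.
    42
}

fn main() {
    let vocab = Vocabulary { eos_token_id: 2 }; // 2 is common for LLaMA-family models
    let tokens = generate(&vocab, 128);
    println!("generated {} tokens", tokens.len());
}
```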