This is the changelog for the open source version of tiktoken.
- Add `encoding_name_for_model`, undo some renames to variables that are implementation details
- Add `tiktoken._educational` submodule to better document how byte pair encoding works
- Ensure `encoding_for_model` knows about several new models
- Add `decode_with_offsets`
- Better error for failures with the plugin mechanism
- Make more tests public
- Update versions of dependencies
- Add `decode_batch` and `decode_bytes_batch`
- Improve error messages and handling
`tiktoken` will now make a best-effort attempt to replace surrogate pairs with the corresponding Unicode character and will replace lone surrogates with the Unicode replacement character.
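
The surrogate handling described above can be illustrated with the Python standard library alone. This is a minimal sketch, not necessarily how the library does it internally: the function name `replace_surrogates` is hypothetical, and the UTF-16 round trip is just one way to get the described behavior.

```python
def replace_surrogates(text: str) -> str:
    """Best-effort cleanup of surrogate code points in a Python string.

    Hypothetical helper for illustration; not part of the tiktoken API.
    """
    # "surrogatepass" lets lone or paired surrogates through the encoder.
    raw = text.encode("utf-16", "surrogatepass")
    # Decoding joins valid high/low pairs into one character; "replace"
    # turns anything that cannot be paired into U+FFFD.
    return raw.decode("utf-16", "replace")


if __name__ == "__main__":
    paired = "\ud83d\ude00"   # high + low surrogate written as two code points
    lone = "a\ud83db"         # a high surrogate with no partner
    print(ascii(replace_surrogates(paired)))
    print(ascii(replace_surrogates(lone)))
```

Strings that contain no surrogates pass through unchanged, so a cleanup like this only needs to run when ordinary UTF-8 encoding of the input fails.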
- Add encoding for GPT-4
- Build aarch64 wheels
- Make `blobfile` an optional dependency
Thank you to @messense for the environment variable that makes cargo not OOM under emulation!
- Improve performance by 5-20%; thank you to @nistath!
- Add `gpt-3.5-turbo` models to `encoding_for_model`
- Add prefix matching to `encoding_for_model` to better support future model versions
- Fix a bug in the README instructions on extending tiktoken
- Update the set of available encodings
- Add packaging metadata
- Add `tiktoken.encoding_for_model` to get the encoding for a specific model
- Improve portability of caching logic
Thank you to @fritzo, @arvid220u, @khanhvu207, @henriktorget for various small corrections
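
The model lookup and prefix matching mentioned in the entries above could work roughly as follows. This is a sketch under stated assumptions: the two tables and the function name `encoding_name_for_model_sketch` are hypothetical and hold only a couple of illustrative entries, so the library's actual tables and error handling may differ.

```python
from typing import Dict

# Illustrative tables only (assumption): exact model names first,
# then prefixes that cover dated or future variants of a model.
MODEL_TO_ENCODING: Dict[str, str] = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
}
MODEL_PREFIX_TO_ENCODING: Dict[str, str] = {
    "gpt-4-": "cl100k_base",
    "gpt-3.5-turbo-": "cl100k_base",
}


def encoding_name_for_model_sketch(model_name: str) -> str:
    """Return an encoding name for a model, trying exact match, then prefixes."""
    if model_name in MODEL_TO_ENCODING:
        return MODEL_TO_ENCODING[model_name]
    # Fall back to prefix matching so versioned names still resolve.
    for prefix, encoding_name in MODEL_PREFIX_TO_ENCODING.items():
        if model_name.startswith(prefix):
            return encoding_name
    raise KeyError(f"Could not map {model_name!r} to an encoding")


if __name__ == "__main__":
    print(encoding_name_for_model_sketch("gpt-4"))
    print(encoding_name_for_model_sketch("gpt-3.5-turbo-0301"))
```

Checking exact names before prefixes keeps known models on an explicit mapping, while the prefix fallback lets a versioned name resolve without a new table entry.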
- Avoid use of `blobfile` for public files
- Add support for Python 3.8
- Add py.typed
- Improve the public tests
- Initial release