OpenAI's new embeddings API uses the cl100k_base tokenizer. I'm calling it from the Node.js client, but I don't see any easy way of slicing my strings so they don't exceed the OpenAI limit of 8192 tokens.
This would be trivial if I could first encode the string, slice it to the limit, then decode it and send it to the API.
Update: David Duong created a JavaScript port of openai/tiktoken with JS/WASM bindings. The package can be installed via npm:
npm install tiktoken
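With that package in place, the encode → slice → decode approach from the question becomes straightforward. A minimal sketch (the helper name and the 8192 default are my own; adjust the cap to your model's limit):

// Truncate a string to at most maxTokens tokens using the
// cl100k_base encoding.
const { get_encoding } = require("tiktoken");

function truncateToTokenLimit(text, maxTokens = 8192) {
  const enc = get_encoding("cl100k_base");
  try {
    const tokens = enc.encode(text);
    if (tokens.length <= maxTokens) return text;
    // decode() returns UTF-8 bytes; TextDecoder turns them back into a
    // string. Cutting at an arbitrary token boundary can occasionally
    // split a multi-byte character, which TextDecoder replaces with U+FFFD.
    return new TextDecoder().decode(enc.decode(tokens.slice(0, maxTokens)));
  } finally {
    enc.free(); // release the WASM-allocated encoder
  }
}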
Credit to Lars Grammel's answer below for the discovery/update.
Original interim solution (before the aforementioned package was available):
A general rule of thumb is that one token corresponds to roughly four characters of common English text, which works out to about 3/4 of a word per token. So a limit of 8,192 tokens translates to roughly 6,144 words. You could therefore slice your strings so they stay under ~6,144 words (e.g., start with a 6,100-word limit and, if a request still fails, reduce the limit until the API accepts it).
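If you go that route, a rough sketch of the word-based truncation could look like this (the function name and the 6,100 default are illustrative; note that joining on a single space discards the original whitespace):

// Approximate fallback: ~3/4 word per token, so 8,192 tokens is about
// 6,144 words; a 6,100-word cap leaves a small safety margin.
function truncateToWordLimit(text, maxWords = 6100) {
  const words = text.split(/\s+/).filter(Boolean);
  if (words.length <= maxWords) return text;
  return words.slice(0, maxWords).join(" ");
}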