OpenAI's new embeddings API uses the cl100k_base tokenizer. I'm calling it from the Node.js client, but I don't see any easy way of slicing my strings so they don't exceed the OpenAI limit of 8192 tokens.
This would be trivial if I could first encode the string, slice it to the limit, then decode it and send it to the API.
Update: David Duong created a JavaScript port of openai/tiktoken with JS/WASM bindings. The package can be installed via npm:
npm install tiktoken
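With that package in place, the encode → slice → decode approach from the question becomes straightforward. A minimal sketch (the helper name and the 8192 default are my own; adjust the cap to your model's limit):

// Truncate a string to at most maxTokens tokens using the
// cl100k_base encoding.
const { get_encoding } = require("tiktoken");

function truncateToTokenLimit(text, maxTokens = 8192) {
  const enc = get_encoding("cl100k_base");
  try {
    const tokens = enc.encode(text);
    if (tokens.length <= maxTokens) return text;
    // decode() returns UTF-8 bytes; TextDecoder turns them back into a
    // string. Cutting at an arbitrary token boundary can occasionally
    // split a multi-byte character, which TextDecoder replaces with U+FFFD.
    return new TextDecoder().decode(enc.decode(tokens.slice(0, maxTokens)));
  } finally {
    enc.free(); // release the WASM-allocated encoder
  }
}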
Credit to Lars Grammel's answer below for the discovery/update.
Original interim solution (before the aforementioned package was available):
A general rule of thumb is that one token corresponds to roughly four characters of common English text, which works out to about 3/4 of a word per token. So a limit of 8,192 tokens translates to roughly 6,144 words. You could therefore slice your strings so they stay under ~6,144 words (e.g., start with a 6,100-word limit and, if a request still fails, reduce the limit until the API accepts it).
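If you go that route, a rough sketch of the word-based truncation could look like this (the function name and the 6,100 default are illustrative; note that joining on a single space discards the original whitespace):

// Approximate fallback: ~3/4 word per token, so 8,192 tokens is about
// 6,144 words; a 6,100-word cap leaves a small safety margin.
function truncateToWordLimit(text, maxWords = 6100) {
  const words = text.split(/\s+/).filter(Boolean);
  if (words.length <= maxWords) return text;
  return words.slice(0, maxWords).join(" ");
}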