Embeddings
Here's how to use EnergeticAI's pre-trained embeddings model.
Need an introduction to embeddings? Check out this overview.
About the model
EnergeticAI uses the lightweight, English-only version of the Universal Sentence Encoder from Google. This model is trained on a variety of data sources, and is designed to be a general-purpose model that can be used for a variety of tasks.
Given a sentence or short paragraph in English, the model will return a 512-dimensional vector that represents the meaning of the text.
Creating embeddings from text
You can install the embeddings package using npm:
npm install --save @energetic-ai/core @energetic-ai/embeddings
The embeddings package can compute embeddings for a single string or multiple strings at once. If you pass in an array of strings, you'll get an array of embeddings back. If you pass a single string, you'll get a single embedding back.
import { initModel } from "@energetic-ai/embeddings";
(async () => {
const model = await initModel();
// You can also embed a single string
const embedding = await model.embed(
"Embeddings are a powerful machine learning tool"
);
// Embed multiple strings at once
const [healthy, delicious] = await model.embed([
"Fruit is healthy",
"Fruit is delicious",
]);
})();
Improving cold-start performance
The first time you call initModel()
, it will download the model weights from the internet. This can take a few seconds, but you can speed it up by installing the English language model weights:
npm install --save @energetic-ai/model-embeddings-en
Then, you can pass the model weights directly into initModel()
:
import { initModel } from "@energetic-ai/embeddings";
import { modelSource } from "@energetic-ai/model-embeddings-en";
(async () => {
const model = await initModel(modelSource);
// ... snip ...
})();
Comparing embeddings
As we alluded to above, there's a convenient distance()
function that computes the cosine similarity between two embeddings. The result is a number between 0 and 1, where 0 means the embeddings are completely different, and 1 means the embeddings are identical.
import { initModel, distance } from "@energetic-ai/embeddings";
(async () => {
const model = await initModel();
const [healthy, delicious, embeddings] = await model.embed([
"Fruit is healthy",
"Fruit is delicious",
"Embeddings are a powerful machine learning tool",
]);
console.log(distance(healthy, delicious)); // 0.89 (high similarity)
console.log(distance(healthy, embeddings)); // 0.24 (low similarity)
})();
If you're building something simple, it's worth starting with this function before you try to build something more complex (e.g. using a vector database). It's pretty fast, and it's often good enough.
Storing embeddings
If you're trying to make comparisons with millions of items or more, it's worth considering a vector database such as Postgres, Redis, or Milvus, which can perform these comparisons efficiently.
Open-source vector databases
General purpose databases with vector support:
- Postgres: You can store and index vectors in Postgres using the pgvector extension. A number of managed Postgres providers including Supabase, Neon, and Fly.io support this extension. Some of the managed Postgres offerings from larger cloud providers also support
pgvector
as well, including Amazon's RDS. - Redis: You can store an index vectors in Redis using the RediSearch module.
- SQLite: You can store and index vectors in SQLite using the sqlite-vss extension, which leverages Meta's Faiss library under the hood.
Dedicated vector databases:
- Chroma: The open-source vector database called Chroma is designed for AI use-cases and has an official JavaScript client.
- Milvus: The dedicated open-source vector database Milvus comes with a managed cloud offering and has an official JavaScript client.
- Weaviate: The open-source vector database Weaviate has an official TypeScript client.
Proprietary vector databases
- Pinecone: The proprietary database Pinecone is a managed vector database that supports text embeddings out of the box, and has official JavaScript bindings.
Limitations
English-only
The model is currently English-only. Please chime in on the GitHub issue if you'd like to see support for one of the pre-trained multilingual models.
Handling longer text
This embedding model performs best on sentences and short paragraphs. If you have longer text, consider:
- Splitting the text: If you have a long document, you can split it into sentences or paragraphs and embed each one separately.
- Truncating the text: If you have a long document, you can truncate it to a specific section that represents the meaning well (e.g. an abstract section).
- Averaging the embeddings: If you have a long document, you can embed each sentence or paragraph separately, then average the embeddings together to find the collective meaning.