
Understanding Semantic Similarity Evals

How we calculate semantic similarity at Arato


Similarity evals help you assess how close two pieces of text are in meaning, even if they use different wording. This is especially useful when exact matches aren’t expected, but you want to make sure a model is producing outputs that convey the right ideas.

This article explains how we calculate semantic similarity at Arato, using embedding vectors and cosine similarity to compare texts.

What Is Semantic Similarity?

Semantic similarity captures the meaning of text (the essence), rather than just matching words. Instead of comparing text character-by-character or token-by-token, we transform text into a vector that represents its meaning in a high-dimensional space. Then, we compare those vectors to see how close the meanings are.

How We Calculate Semantic Similarity

To run a similarity eval, you first need to select which two texts you want to compare, typically the model’s response and your expected response. These are the inputs to the evaluation.

Here’s how the similarity evaluation works step-by-step:

1. Embedding the Texts

Each text is passed through an embedding model (typically based on a large language model). The model outputs a vector, a list of floating-point numbers that represents the semantic meaning of the text.

Each vector typically contains 256 to 1024 floating-point numbers and is normalized to unit length (a length of 1), which is important for comparing vectors using cosine similarity.
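The embedding step can be sketched as follows. The `embed` function here is a stand-in for the real embedding model call, producing a deterministic fake vector just for illustration; the normalization step at the end is the part that matters:

```python
import hashlib

import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for an embedding model call (in practice an LLM-based
    embedding API); here we derive a deterministic fake vector from the
    text purely for illustration."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    vector = rng.standard_normal(1024)       # dimension in the 256-1024 range
    return vector / np.linalg.norm(vector)   # normalize to unit length

vec = embed("The cat sat on the mat.")
print(vec.shape, float(np.linalg.norm(vec)))  # 1024 numbers, length ~1.0
```

Because both vectors are normalized, the cosine comparison in the next step reduces to a simple dot product.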

2. Measuring the Cosine Similarity

Once we have the two embedding vectors, we calculate the cosine of the angle between them.

Cosine similarity measures how aligned the two vectors are.

In geometric terms:

  • If two vectors point in the same direction (meaning the texts are semantically similar), the angle between them is close to 0°, and the cosine is close to 1.

  • If they are orthogonal (unrelated), the angle is close to 90°, and the cosine is around 0.

  • If they point in opposite directions (semantically opposite), the angle is near 180°, and the cosine is close to -1.

Note: In practice, it’s rare to get strongly negative values from real-world texts. When this happens, we treat them as completely dissimilar and assign a score of 0.

3. Converting to a User-Friendly Scale

We rescale the cosine similarity value to a 0–100 scale, so it’s easier to interpret:

  • 100 = Highly similar

  • 0 = Completely different or unrelated

This is not a percentage, just a normalized score to make the results easier to work with.
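One plausible mapping consistent with this description (clamp negative cosines to 0, then scale by 100) looks like this; it is a sketch of the idea, not necessarily Arato's exact formula:

```python
def to_score(cosine: float) -> float:
    """Map a cosine similarity value to the 0-100 display scale.
    Negative cosines are clamped to 0 (treated as unrelated)."""
    return round(max(0.0, cosine) * 100, 1)

print(to_score(1.0))    # 100.0 = highly similar
print(to_score(0.5))    # 50.0
print(to_score(-0.3))   # 0.0 = completely dissimilar
```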


Summary

Semantic similarity allows you to evaluate model responses in a more human-like way, capturing intent and meaning rather than rigid word patterns. This makes your evaluations more robust, especially for open-ended or generative tasks.

It’s particularly useful when you have an expected response/value in mind and want to check if the model’s output is meaningfully aligned with it, even if the wording is different. This helps you catch subtle differences or validate that the output is “close enough” for your use case.
