The rise of Large Language Models (LLMs) like GPT-4, Claude, and Llama has reshaped technology—from writing code and emails to powering advanced chatbots. Their abilities often feel magical, but for developers, product leaders, and researchers tasked with integrating this power into real-world applications, a critical question emerges: How do you move beyond impressive demos and truly measure the effectiveness of an LLM? In this context, Evaluating Large Language Models becomes essential for understanding their true capabilities.

A model that dazzles with poetry or summaries might just as easily produce a factual error, a biased remark, or an irrelevant answer. The challenge is that LLM performance isn’t one-dimensional. Strength in creativity doesn’t guarantee reliability in precise, task-oriented work. Evaluating an LLM isn’t about finding “the best” model. It’s about asking: Is this model accurate and reliable for our specific use case? Is it safe and unbiased enough for our users? Does its value outweigh its computational cost? Without structured evaluation, adopting an LLM is a risky bet—jeopardizing trust, quality, and resources.
This blog post is your guide to navigating this complex terrain. We will move beyond the hype and delve into the practical metrics that matter and the evaluation methods behind them: from mathematical measures like Perplexity, which peek under the hood of probability distributions, to reference-based metrics like BLEU and ROUGE, to the emerging trend of LLMs evaluating other LLMs. We will also cover best practices for building a robust evaluation framework. Whether you’re selecting a model from an API provider or fine-tuning your own, this guide will give you the tools to make informed, confident decisions and deploy LLMs that are not just powerful, but also predictable and trustworthy.
1. Perplexity
Perplexity is perhaps the most fundamental metric in the language modeling world – think of it as a measure of how “surprised” your model is by a piece of text. At its core, perplexity quantifies how well a language model predicts the next word in a sequence. A model with low perplexity confidently assigns high probabilities to the actual next words that appear in real text, while a high perplexity indicates the model is frequently caught off-guard by what comes next. Mathematically elegant and computationally straightforward, perplexity gives us a direct window into how well our model has learned the statistical patterns of language – making it an essential first stop in any evaluation pipeline.
- Perplexity measures uncertainty: It tells us how “surprised” a model is by the actual sequence of words it’s asked to predict. Lower perplexity means higher confidence and better predictive performance from the model.
- Interpretation: If a model has a perplexity of 10, it is as “confused” as if it had to choose among 10 equally probable options for each word.
- High perplexity means the model is unsure about the next word: it spreads roughly equal probability across many candidate words, and that lack of confidence carries over into the text it generates.
- Low perplexity means the model is confident about the next word: it assigns a clearly higher probability to one word, with the alternatives trailing well behind.
How to calculate Perplexity (Step by step)
- Tokenize your evaluation text with the same tokenizer as the model.
- For each token position \( x_t \), get the model’s probability for the true next token $$ p(x_t \mid x_{<t}) $$
- Average negative log-likelihood \( \text{NLL} \) over all predicted tokens: $$ \text{NLL} = -\frac{1}{N} \sum_{t=1}^{N} \log p(x_t \mid x_{<t}) $$
- Perplexity is given by: $$ \text{PPL} = \exp(\text{NLL}) $$
Example → Let’s say we want to calculate the perplexity of the generated text below:
The cat sat on the mat
Step 1: The probabilities and log probabilities are given as below in table, which is calculated for each word in the generated response.
Token | The | cat | sat | on | the | mat |
---|---|---|---|---|---|---|
p | 0.25 | 0.40 | 0.60 | 0.12 | 0.80 | 0.35 |
ln(p) | -1.38 | -0.91 | -0.51 | -2.12 | -0.22 | -1.04 |
Step 2: Sum of all log probabilities,
$$-1.38-0.91-0.51-2.12-0.22-1.04 = -6.18$$
Step 3: Average Negative Log Likelihood
$$ \begin{aligned} \text{NLL} &= -\frac{1}{N} \sum_{i=1}^{N} \ln(p_i) \\ &= -\frac{1}{6} \times (-6.18) \\ &= 1.03 \end{aligned} $$
Step 4: Perplexity (PPL):
$$ e^{1.03} \approx 2.80 $$
Interpretation → Given the prompt, the model’s uncertainty over the answer tokens averages out to “about 2.80 equally likely choices per token.” This sample covers only one sentence, but in practice perplexity is calculated over an entire dataset.
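To sanity-check the arithmetic, here is a tiny plain-Python sketch that reproduces the hand calculation above (the probabilities are the illustrative values from the table):

import math

# token probabilities from the worked example above
probs = [0.25, 0.40, 0.60, 0.12, 0.80, 0.35]
nll = -sum(math.log(p) for p in probs) / len(probs)   # average negative log-likelihood
ppl = math.exp(nll)                                   # perplexity
print(round(nll, 2), round(ppl, 2))                   # ≈ 1.03 and ≈ 2.81 (2.80 above due to rounded logs)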
When we talk about dataset-level perplexity, we’re looking beyond a single sentence and asking, “How well does the model handle an entire test set of text?” Instead of stopping at one example, we calculate the negative log-likelihood (NLL) for every sequence in the dataset. Then, we take the average of these values, which gives us a sense of the model’s overall prediction ability. Finally, by exponentiating this average NLL, we arrive at the dataset-level perplexity—a single number that captures how confidently the model navigates through a wide variety of sentences. This makes it a much more reliable measure of performance than just testing on one or two examples. Perplexity is usually reported on a validation/test dataset.

Paper : Language Models are Unsupervised Multitask Learners

Paper : One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling
Python Code for calculating Perplexity (using Transformers Library)
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import math

# Load a small model
model_name = "gpt2"  # swap in the model you want to check
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Text(s) you want to evaluate perplexity on
texts = [
    "India is a beautiful country.",
    "The cat sat on the mat."
]

def calculate_perplexity(text):
    encodings = tokenizer(text, return_tensors="pt")
    input_ids = encodings.input_ids
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
    loss = outputs.loss  # mean negative log-likelihood over the predicted tokens
    return math.exp(loss.item())

for t in texts:
    print(f"Perplexity('{t}') = {calculate_perplexity(t):.3f}")
# Perplexity('India is a beautiful country.') = 22.309
# Perplexity('The cat sat on the mat.') = 90.243
# mean Perplexity = 56.276 (simple average of the two sentence-level scores)
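To get the dataset-level perplexity described earlier, a common convention is to weight each text’s mean NLL by its number of predicted tokens and exponentiate once at the end, rather than averaging per-sentence perplexities. A minimal sketch reusing the model, tokenizer, texts, and imports from the snippet above:

total_nll, total_tokens = 0.0, 0
for t in texts:
    enc = tokenizer(t, return_tensors="pt")
    n_predicted = enc.input_ids.size(1) - 1          # tokens the model actually predicts
    with torch.no_grad():
        out = model(enc.input_ids, labels=enc.input_ids)
    total_nll += out.loss.item() * n_predicted       # loss is the mean NLL, so re-weight it
    total_tokens += n_predicted
print("Dataset-level Perplexity:", round(math.exp(total_nll / total_tokens), 3))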
Limitations of Perplexity
- Narrow scope – only measures next-token prediction, not truthfulness or usefulness.
- Poor alignment with humans – low perplexity ≠ high-quality or factual text.
- Dataset dependent – results vary across domains and are not always generalizable.
- Not comparable across models – tokenization differences make cross-model PPL unfair.
- Biased to frequent words – ignores rare/long-tail tokens.
- Weak on long contexts – hides reasoning failures over longer sequences.
- Relative metric – no universal “good” or “bad” threshold.
2. MMLU (Massive Multitask Language Understanding)
One of the most widely used benchmarks for evaluating large language models is MMLU (Massive Multitask Language Understanding). Unlike metrics such as perplexity or BLEU that focus on text quality and fluency, MMLU is designed to test whether an LLM can reason and recall knowledge across diverse academic and professional domains.

The benchmark consists of over 15,000 multiple-choice questions spanning 57 subjects — everything from mathematics, history, and computer science to law, medicine, and even philosophy. The idea is simple: if a model truly “understands” language in a meaningful way, it should be able to apply its knowledge across different fields, not just generate fluent text.
Evaluation on MMLU is straightforward: the model is presented with a multiple-choice question and must select the correct option. Accuracy is then calculated as the percentage of correct answers. While this might sound like a simple test, its strength lies in the breadth and difficulty of the subjects covered. High performance on MMLU suggests that a model isn’t just memorising surface patterns but can generalise knowledge to new contexts.
$$
\text{Accuracy} = \frac{\text{Number of correct answers}}{\text{Total questions}} \times 100
$$
In practice, MMLU has become a gold standard for comparing cutting-edge LLMs. For example, newer frontier models like GPT-4 or Claude are often benchmarked on MMLU to demonstrate their superiority over older versions. Still, MMLU has its limitations: it emphasizes factual recall and reasoning but doesn’t fully capture skills like creativity, dialogue flow, or ethical judgment.
Model | MMLU Accuracy | MMLU Pro Accuracy | Key Notes |
---|---|---|---|
GPT-4.5 (OpenAI) | 85.1% | – | General-purpose model with reduced hallucinations |
Claude 3.7 Sonnet (Anthropic) | 91.0% | 82.7% | Excels in coding and reasoning; strong in step-by-step problem-solving |
Gemini 2.5 Pro (Google) | – | 84.1% | Leads MMLU Pro benchmark (esp. physics, chemistry) |
Grok-3 (xAI) | – | ~80% | Excels in real-time data processing, but lags behind Pro |
o1 (OpenAI) | – | 83.5% | High performance but expensive |
GPT-4o (OpenAI) | 86.4% | – | Tested in 2023; strong in humanities and specialized topics |
Gemini Ultra (2024) | 90.0% | – | Reported in 2024 but lacks transparency in testing |
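To make the protocol concrete, here is a toy sketch of MMLU-style accuracy scoring. The question format, the ask_model callable, and the answer-extraction rule are illustrative assumptions, not the official evaluation harness:

sample_questions = [
    {"question": "What is the derivative of x^2?",
     "choices": ["A. 2x", "B. x", "C. x^2 / 2", "D. 2"],
     "answer": "A"},
]

def mmlu_accuracy(questions, ask_model):
    # ask_model(prompt) should return the model's chosen option letter, e.g. "A" (hypothetical callable)
    correct = 0
    for q in questions:
        prompt = q["question"] + "\n" + "\n".join(q["choices"]) + "\nAnswer:"
        if ask_model(prompt).strip().upper().startswith(q["answer"]):
            correct += 1
    return 100.0 * correct / len(questions)

# Dummy model that always answers "A", just to show the plumbing
print(mmlu_accuracy(sample_questions, lambda prompt: "A"))  # 100.0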
3. BLEU Score
The BLEU (Bilingual Evaluation Understudy) score is one of the most widely used metrics for evaluating machine translation and other natural language generation tasks. It measures how closely a machine-generated output matches one or more human reference translations by comparing overlapping n-grams. A higher BLEU score indicates closer similarity to human translations, while a lower score reflects greater divergence. Although it doesn’t capture semantic meaning or fluency perfectly, BLEU provides a simple, quantitative way to benchmark language models and translation systems.
The Core Idea → The BLEU (Bilingual Evaluation Understudy) score provides an answer by asking one simple question:
“How closely does the machine-generated text match high-quality human-written text?”
It does this by comparing the LLM’s output (the “candidate” text) against one or more high-quality human-written reference texts.
How Does BLEU Work? The Key Components
BLEU doesn’t just look at exact matches. It breaks down the comparison into smaller pieces, called n-grams.
- Unigram (1-gram): Single words (e.g., “the”, “cat”)
- Bigram (2-gram): Pairs of words (e.g., “the cat”, “cat sat”)
- Trigram (3-gram): Triplets of words (e.g., “the cat sat”)
- 4-gram: Four-word sequences
1. Modified n-gram Precision
Basic precision would be:
$$
\text{Precision} \;=\;
\frac{\text{Number of $n$-grams in Candidate that appear in Reference}}{\text{Total $n$-grams in Candidate}}
$$
However, this has a problem. A candidate text could just repeat one correct word (“the the the the”) and get a perfect precision score, which is useless.
BLEU uses a “modified” precision: It clips the count of each candidate word by the maximum number of times it appears in any single reference sentence. This prevents gaming the system by repeating correct words.
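A minimal hand-rolled sketch of clipped (modified) unigram precision, showing how the clipping blocks the repeated-word trick:

from collections import Counter

def modified_unigram_precision(candidate_tokens, reference_tokens):
    cand_counts = Counter(candidate_tokens)
    ref_counts = Counter(reference_tokens)
    # clip each candidate word's count by its count in the reference
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    return clipped / max(len(candidate_tokens), 1)

print(modified_unigram_precision("the the the the".split(),
                                 "the cat is on the mat".split()))  # 0.5, not 1.0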
2. The Brevity Penalty (BP)
Precision favors short outputs. A very short candidate might have high precision (e.g., just one correct word) but misses most of the meaning. The Brevity Penalty penalizes candidates that are shorter than the reference text. It’s a multiplier that reduces the score if the candidate is too short.
\[
BP =
\begin{cases}
1 & \text{if } c > r \\[6pt]
e^{\left(1 - \tfrac{r}{c}\right)} & \text{if } c \leq r
\end{cases}
\]
Where,
- 𝑐 = length of candidate translation
- 𝑟 = length of reference translation
3. Combining It All: The BLEU Formula
The final BLEU score is a weighted geometric mean of the modified n-gram precisions (for n=1 to 4), multiplied by the Brevity Penalty.
$$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
Where:
- \(BP\) is the Brevity Penalty.
- \(N\) is the maximum n-gram order (almost always 4).
- \(w_n\) is the weight for each n-gram (usually 1/4 for each, so equal weight).
- \(p_n\) is the modified precision for n-grams of order \(n\).
The output is a number between 0 and 1, which is almost always expressed as a percentage between 0 and 100. A higher score indicates a better match to the reference quality.
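Putting the pieces together, below is a compact, unsmoothed sketch of the formula for a single reference with equal weights (it reproduces the worked example in the next section; for real evaluations use a library such as SacreBLEU or evaluate):

import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_p_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngram_counts(cand, n), ngram_counts(ref, n)
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        if clipped == 0:                      # any zero precision collapses the geometric mean
            return 0.0
        log_p_sum += (1.0 / max_n) * math.log(clipped / sum(cand_ngrams.values()))
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_p_sum)

print(round(bleu("The cat sits on mat", "The cat is on the mat", max_n=1), 3))  # 0.655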
Simple Example
Scenario: Suppose a large language model generates a sentence, and we want to evaluate it against a reference sentence using the 1-gram BLEU score.
Reference Sentence: "The cat is on the mat"
Generated Sentence: "The cat sits on mat"
Step-by-Step 1-Gram BLEU Score Calculation
- Tokenize the Sentences:
- Reference: [“The”, “cat”, “is”, “on”, “the”, “mat”]
- Generated: [“The”, “cat”, “sits”, “on”, “mat”]
- Both sentences are tokenized into individual words (1-grams).
- Count 1-Grams in Generated Sentence:
- Generated 1-grams: [“The”, “cat”, “sits”, “on”, “mat”]
- Total 1-grams in generated sentence: 5
- Count Matches of 1-Grams:
- For each 1-gram in the generated sentence, check if it appears in the reference sentence.
- Generated 1-grams:
- “The”: Appears in the reference (2 times, but the match count is capped at its count in the candidate, i.e., 1 match).
- “cat”: Appears in reference (1 time, 1 match).
- “sits”: Does not appear in reference (0 matches).
- “on”: Appears in reference (1 time, 1 match).
- “mat”: Appears in reference (1 time, 1 match).
- Total matches: “The” (1) + “cat” (1) + “sits” (0) + “on” (1) + “mat” (1) = 4 matches
- Calculate 1-Gram Precision:
- Precision = (Number of matching 1-grams) / (Total 1-grams in generated sentence)
- Precision = 4 / 5 = 0.8
- Apply Brevity Penalty (BP):
- BLEU includes a brevity penalty to penalize generated sentences that are too short.
- Reference length: 6 words
- Generated length: 5 words
- BP = 1 if generated length ≥ reference length, else $$BP = \exp\left(1 - \frac{\text{reference length}}{\text{generated length}}\right)$$
- Since 5 < 6,
$$\begin{aligned} BP &= \exp\left(1 - \frac{6}{5}\right) \\ &= \exp(-0.2) \\ &\approx 0.8187 \end{aligned} $$
- Calculate BLEU Score:
- For 1-gram BLEU, the score is the precision multiplied by the brevity penalty.
- BLEU can be calculated as below,
$$
\begin{aligned}
\text{BLEU} &= \text{Precision} \times BP \\
&= 0.8 \times 0.8187 \\
&\approx 0.655
\end{aligned}
$$
Final Answer → The 1-gram BLEU score for the generated sentence “The cat sits on mat” compared to the reference sentence “The cat is on the mat” is approximately 0.655. Things to note here is that:
- This is a simplified example focusing only on 1-grams. In practice, BLEU often considers higher n-grams (e.g., 2-grams, 3-grams) and combines their precisions.
- The BLEU score ranges from 0 to 1, where higher scores indicate better similarity to the reference.
- This example assumes basic tokenization and no smoothing. In real-world applications, you might use libraries like NLTK or SacreBLEU for precise calculations.
Python code for calculating BLEU score
# !pip install evaluate
import evaluate

bleu = evaluate.load("bleu")

llm_output = ["The cat sits on mat"]        # candidate (generated) sentence
references = ["The cat is on the mat"]      # human reference sentence

results = bleu.compute(predictions=llm_output,
                       references=references,
                       max_order=1)
print("BLEU Score:", results["bleu"])
# Output : BLEU Score: ~0.655 (matches the hand calculation above)
4. ROUGE Score
The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score is a popular metric for evaluating automatic text summarization and other natural language generation tasks. Unlike BLEU, which emphasizes precision, ROUGE focuses more on recall, measuring how much of the reference content is captured by the generated text. It does this by comparing overlapping units such as n-grams, word sequences, and word pairs between candidate and reference summaries. A higher ROUGE score indicates that the generated output covers more of the important information from the reference. While it does not fully capture fluency or meaning, ROUGE remains a simple and widely used benchmark for summarization quality.
Core Idea → While BLEU asks, “How much of the generated text appears in a human reference?”, ROUGE flips the question. It was designed for evaluating summarization, so its core question is:
“How much of the human-written reference summary is captured by the machine-generated summary?”
Think of it this way:
- BLEU (for Translation): Is the machine’s output precise and does it avoid adding extra, fluff words? (Precision-oriented)
- ROUGE (for Summarization): Did the machine’s summary recall all the important facts from the source document? (Recall-oriented)
A good summary must include the key points. Missing a crucial point (low recall) is a bigger failure than including a minor, irrelevant detail (lower precision).
Versions of ROUGE
- ROUGE-N → n-gram overlap (like BLEU but recall-focused).
- ROUGE-1 = unigram recall
- ROUGE-2 = bigram recall
- ROUGE-L → based on the Longest Common Subsequence (LCS).
- Captures fluency and order.
- ROUGE-S → skip-grams (allows gaps between words).
Most common in papers: ROUGE-1, ROUGE-2, ROUGE-L.
- The Formula (for ROUGE-N):
$$ \text{ROUGE-N} = \frac{\sum_{S \in \text{Ref Summaries}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \text{Ref Summaries}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)} $$
In plain English: it’s the number of overlapping n-grams divided by the total number of n-grams in the reference summary(s). This is the definition of recall. A hand-rolled sketch follows below.
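As promised, a minimal hand-rolled sketch of ROUGE-N recall (for illustration; the rouge_score library used later also handles stemming and F-measures):

from collections import Counter

def rouge_n_recall(candidate_tokens, reference_tokens, n=1):
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate_tokens), ngrams(reference_tokens)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / max(sum(ref.values()), 1)   # denominator = n-grams in the reference

print(round(rouge_n_recall("the cat sat on the mat".split(),
                           "the cat is on the mat".split()), 3))  # 0.833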
BLEU vs ROUGE — The Core Difference
Metric | Denominator | Focus | Intuition |
---|---|---|---|
BLEU | Candidate (generated text) n-grams | Precision | “Of the words I wrote, how many were correct?” |
ROUGE | Reference (gold text) n-grams | Recall | “Of the gold words, how many did I manage to cover?” |
When it comes to evaluating text generation, BLEU and ROUGE often appear side by side. At first glance they seem identical — both count overlapping n-grams between a model’s output and a reference text. But the key difference lies in what they measure:
- BLEU (Bilingual Evaluation Understudy) focuses on precision: “Of the words I generated, how many were correct?”
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation) focuses on recall: “Of the important reference words, how many did I include?”
This subtle denominator change flips the entire perspective.
Example
Reference - "the cat is on the mat"
Candidate A - "the cat sat on the mat" #(longer, with a mistake)
Candidate B - "the cat is" #(short, all correct)
- BLEU (precision-oriented):
- Candidate A has one wrong word but many overlaps → moderately high score.
- Candidate B is short and exact → perfect precision on the words it does contain (though full BLEU’s brevity penalty would dock it for being so short).
- BLEU rewards being concise and correct, and penalizes answers padded with extra or wrong words.
- For machine translation, we prefer BLEU since extra words hurt meaning.
- ROUGE (recall-oriented):
- Candidate A covers almost all of the reference (even with “sat” instead of “is”) → high score.
- Candidate B misses half the reference → low score.
- ROUGE rewards covering as much of the reference as possible, even if noisy.
- For summarisation, we prefer ROUGE since missing key ideas is worse than including a few redundant words.
Python code for calculating Rouge Score
#!pip install rouge_score
from pprint import pprint
from rouge_score import rouge_scorer
candidate = "the cat sat on the mat"
reference = "the cat is on the mat"
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference, candidate)
pprint(scores)
#{'rouge1': Score(precision=0.8333333333333334, recall=0.8333333333333334, fmeasure=0.8333333333333334),
#'rouge2': Score(precision=0.6, recall=0.6, fmeasure=0.6),
#'rougeL': Score(precision=0.8333333333333334, recall=0.8333333333333334, fmeasure=0.8333333333333334)}
5. METEOR: Looking Beyond Exact Matches
When we judge the quality of text generated by an AI, metrics like BLEU and ROUGE often fall short because they only care about exact word overlap. But humans don’t work that way — we know that “boy” and “kid” mean the same thing, or that “fast” and “quick” are synonyms. That’s where METEOR (Metric for Evaluation of Translation with Explicit ORdering) comes in.
METEOR was designed to better reflect human judgment by rewarding not only exact word matches, but also stem matches (like “run” vs. “running”) and synonyms (like “fast” vs. “quick”). On top of that, it also cares about word order, so scrambled sentences won’t get a perfect score even if all the right words are there.
A Simple Example
Let’s compare two candidate sentences against the reference sentence:
- Reference: “the boy is quick”
- Candidate 1: “the kid is fast”
- Candidate 2: “fast is the kid”
Candidate 1 is a great match. Even though it uses synonyms (kid instead of boy, fast instead of quick), METEOR recognizes the meaning is the same. It gives this a score close to 1.0, or almost perfect.
Candidate 2, however, scrambles the order of words: “fast is the kid.” All the right ideas are there, but the phrasing feels odd. METEOR penalizes this by lowering the score, showing that word order matters for readability.
How the Score is Calculated → The core of METEOR combines precision and recall, but in a weighted way:
$$ F_{\text{mean}} = \frac{10\,P\,R}{R + 9P} $$
- P = Precision (fraction of candidate words that match the reference)
- R = Recall (fraction of reference words that are covered by candidate)
This favours recall a bit more (notice the “9” weight in the denominator). Then, METEOR applies a penalty if the matched words are far apart or out of order:
$$ \text{Penalty} = 0.5 \left( \frac{\text{chunks}}{\text{matches}} \right)^3 $$
- matches = number of words matched (exact, stem, or synonym)
- chunks = groups of matched words that are in the same order
Finally, the METEOR score is:
$$ \text{Score} = \left( 1 - \text{Penalty} \right) \cdot F_{\text{mean}} $$
So Candidate 1 (“the kid is fast”) has 4 matches in a single in-order chunk, so the penalty is tiny. Candidate 2 (“fast is the kid”) has the same matches but broken into multiple chunks → bigger penalty → lower score.
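A toy sketch of just these scoring formulas, with the match and chunk counts supplied by hand (real implementations such as nltk’s meteor_score also do the synonym and stem matching; the chunk counts below are illustrative assumptions):

def meteor_from_counts(matches, cand_len, ref_len, chunks):
    if matches == 0:
        return 0.0
    P, R = matches / cand_len, matches / ref_len
    f_mean = 10 * P * R / (R + 9 * P)            # recall-weighted harmonic mean
    penalty = 0.5 * (chunks / matches) ** 3      # fragmentation penalty
    return (1 - penalty) * f_mean

# Candidate 1: 4 matches in one in-order chunk vs. Candidate 2: same matches split into 3 chunks (assumed)
print(round(meteor_from_counts(4, 4, 4, chunks=1), 2))  # ≈ 0.99
print(round(meteor_from_counts(4, 4, 4, chunks=3), 2))  # ≈ 0.79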
Python code for calculating Meteor score
# pip install evaluate
import evaluate
# Load METEOR metric
meteor = evaluate.load("meteor")
preds = ["India is a beautiful country"]
refs = ["India is a wonderful nation"]
# Compute METEOR
results = meteor.compute(predictions=preds, references=refs)
print("METEOR score:", results["meteor"])
# Output -> METEOR score: 0.588
Why not 1.0?
- METEOR credits stem and synonym matches (in the spirit of “kid” ≈ “boy”, “fast” ≈ “quick”), not just exact words.
- Here the candidate only partially lines up with the reference, so it earns a decent score rather than a perfect one.
6. BERT Score
When we evaluate text generation, traditional metrics like BLEU or ROUGE often fall short because they only look at word overlap. But language is much richer—two sentences can have different words yet mean the same thing. This is where BERTScore shines.
BERTScore leverages pretrained language models (like BERT, RoBERTa, DeBERTa) to evaluate how semantically close a generated sentence is to a reference. Instead of just counting matching words, it compares vector embeddings of words to capture meaning.
Example →
- Reference: “India is a beautiful country.”
- Candidate: “India is a wonderful nation.”
- BLEU/ROUGE may penalize this because of low word overlap.
- BERTScore will give a high score because “beautiful ↔ wonderful” and “country ↔ nation” are semantically similar in embedding space.
How BERTScore Works
- Embed Token → Use a pretrained LM (BERT, RoBERTa, etc.) to get contextual embeddings for each token in candidate and reference.
- Token matching via cosine similarity → Compare candidate tokens with reference tokens using cosine similarity.
- Precision, Recall, and F1 →
- Precision (P): Candidate’s semantic overlap with reference.
- Recall (R): Reference’s semantic overlap with candidate.
- F1: Balanced measure(Harmonic mean of P and R.)
Mathematical Formulation
Let \(x = \{x_1, x_2, \ldots, x_m\}\) be the candidate tokens,
and \(y = \{y_1, y_2, \ldots, y_n\}\) be the reference tokens.
$$
\begin{aligned}
\textbf{Precision:}\quad
P &= \frac{1}{m} \sum_{i=1}^{m} \max_{j} \cos \big( e(x_i), e(y_j) \big) \\[6pt]
\textbf{Recall:}\quad
R &= \frac{1}{n} \sum_{j=1}^{n} \max_{i} \cos \big( e(y_j), e(x_i) \big) \\[6pt]
\textbf{F1:}\quad
F1 &= \frac{2PR}{P+R}
\end{aligned}
$$
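To see the greedy matching in action, here is a toy sketch that computes P, R, and F1 from a made-up cosine-similarity matrix (rows = candidate tokens, columns = reference tokens; the numbers are invented for illustration):

import numpy as np

sim = np.array([[0.90, 0.20, 0.10],    # similarities of candidate token 1 to each reference token
                [0.30, 0.80, 0.40]])   # similarities of candidate token 2 to each reference token

P = sim.max(axis=1).mean()   # each candidate token greedily takes its best reference match
R = sim.max(axis=0).mean()   # each reference token greedily takes its best candidate match
F1 = 2 * P * R / (P + R)
print(round(P, 3), round(R, 3), round(F1, 3))  # 0.85 0.7 0.768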
Python code to calculate BERTScore
from evaluate import load
# Load BERTScore metric
bertscore = load("bertscore")
# Candidate (model output)
candidate = ["India is a wonderful nation."]
# Reference (ground truth)
reference = ["India is a beautiful country."]
# Compute BERTScore
results = bertscore.compute(predictions=candidate, references=reference, lang="en")
print("Precision:", results["precision"][0])
print("Recall:", results["recall"][0])
print("F1 Score:", results["f1"][0])
# Outputs
# Precision: 0.9834926128387451
# Recall: 0.9834926128387451
# F1 Score: 0.9834926128387451
The output will be three values (P, R, F1), usually ranging between 0 and 1, where higher is better. For our example, you’ll see high values because “wonderful nation” is semantically close to “beautiful country.”
Why Use BERTScore?
- Captures semantic similarity beyond word overlap.
- Works well across tasks: summarization, translation, dialogue, text generation.
- Correlates better with human judgment than BLEU/ROUGE.
Though keep in mind that BERTScore calculation is computationally heavier than BLEU/ROUGE since it needs a pretrained LM to generate embeddings.
7. Human Evaluation for Large Language Models
While automatic metrics like BLEU, ROUGE, METEOR, BERTScore help us quickly measure LLM performance, they don’t always tell the full story. Why? Because language is subjective — sometimes a response might be semantically correct but still score low on automatic metrics. That’s where Human Evaluation comes in.
Human evaluation means asking real people (often domain experts or annotators) to judge the outputs of an LLM. Instead of only relying on numbers, we rely on human judgment of qualities like correctness, fluency, style, or usefulness.
Common Dimensions in Human Evaluation
- Fluency / Grammar – Does the response sound natural and grammatically correct?
- Relevance – Does it actually answer the user’s question?
- Coherence – Does the response make sense as a whole, without contradictions?
- Factual Accuracy – Are the facts correct and not hallucinated?
- Helpfulness – Is the response useful to the user’s needs?
- Safety / Bias – Does it avoid harmful, offensive, or biased content?
Methods of Human Evaluation
- Likert Scale Ratings (1–5 or 1–7): Annotators rate responses on a scale (e.g., 1 = terrible, 5 = excellent).
- Pairwise Comparison (A/B Testing): Show two responses from different models and ask “Which is better?”
- Ranking: Rank multiple responses from best to worst.
- Task Success Rate: For applied use-cases (like customer support), measure whether the model helps complete the task.
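Aggregating these judgments is straightforward. A toy sketch with made-up ratings, reporting a mean Likert score per model and a pairwise win rate:

likert = {                    # annotator ratings on a 1-5 scale (made-up numbers)
    "Model A": [4, 5, 3, 4],
    "Model B": [5, 5, 4, 5],
}
for model, ratings in likert.items():
    print(model, "mean Likert:", sum(ratings) / len(ratings))

pairwise = ["A", "B", "B", "tie", "B"]            # per-example preferences from A/B tests
print("Model B win rate:", pairwise.count("B") / len(pairwise))  # 0.6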
Example → Suppose we ask an LLM to explain photosynthesis in a way a 10-year-old can understand:
- Model A Response: “Photosynthesis is a process by which plants convert carbon dioxide and water into glucose and oxygen using sunlight.”
- Model B Response: “Plants eat sunlight and breathe out oxygen. This process, called photosynthesis, helps them grow and gives us fresh air.”
Now let’s rate them on fluency, relevance, and understandability (scale 1–5):
\[
\begin{array}{|l|c|c|}
\hline
\textbf{Criteria} & \textbf{Model A} & \textbf{Model B} \\
\hline
Fluency & 5 & 5 \\
\hline
Relevance & 5 & 5 \\
\hline
Understandability & 2 & 5 \\
\hline
\end{array}
\]
Here, Model B scores higher for a 10-year-old audience, even though Model A may score better on BLEU/ROUGE. This shows why human evaluation is critical.
Pros and Cons of Human Evaluation
Pros of Human Evaluation
- Captures nuances missed by automatic metrics.
- Can judge style, tone, helpfulness, and correctness.
- More aligned with real-world user experience.
Cons of Human Evaluation
- Expensive (requires paying human annotators).
- Time-consuming compared to automatic metrics.
- Results can vary across evaluators (subjectivity).
8. Using LLM-as-a-Judge to Evaluate LLMs
At its core, LLM-as-a-Judge is a method where a highly capable, powerful LLM (like GPT-4) is prompted to evaluate the outputs of other LLMs. It acts as an automated, scalable stand-in for human judges. Think of it as a wise, automated referee in a debate tournament between AI models. You give the referee (the judge model) a question, the answers from two contestants (e.g., GPT-4 and Llama 3), and a clear set of rules. The referee then deliberates and declares a winner, often with a detailed explanation.
How Does It Work? The Anatomy of a Judgment
The magic isn’t just in asking “which is better?” It’s in the carefully crafted prompt that guides the judge model. A typical judge prompt includes:
- The System Prompt (The Rulebook): This sets the context and the rules of the game.
- “You are a helpful and impartial assistant designed to evaluate AI models.”
- The User Query (The Topic): The original question or instruction given to the models.
- “Please write a creative introduction for a blog post about quantum computing.”
- The Model Responses (The Contestants): The outputs from the models being evaluated, clearly labeled (e.g., Response A and Response B).
- The Evaluation Rubric (The Scoring Criteria): Detailed instructions on what to look for.
- “Evaluate based on creativity, clarity, accuracy, and engagement. Identify any factual inaccuracies.”
- The Output Format (The Verdict Form): A strict instruction on how to deliver the judgment, often requiring JSON or a specific string to make parsing easy.
- “Output your final verdict as JSON: {"winner": "A" or "B" or "tie", "reason": "..."}”
This structured approach forces the judge model to reason step-by-step (a technique called Chain-of-Thought) before delivering a verdict, significantly improving the reliability of its judgments.
Python-based prompt structure
# A simplified conceptual example of a judge prompt.
# Placeholder inputs so the snippet runs as-is; in practice these come from your models/pipeline.
user_question = "Please write a creative introduction for a blog post about quantum computing."
response_from_model_a = "Quantum computing is a new paradigm of computation..."   # model A's output (placeholder)
response_from_model_b = "Imagine a coin that is heads and tails at once..."       # model B's output (placeholder)

judge_prompt = f"""
### INSTRUCTION:
You are an impartial judge. You will be given a USER QUESTION and two AI RESPONSES.
Your job is to determine which response is better.
### USER QUESTION:
{user_question}
### RESPONSE A:
{response_from_model_a}
### RESPONSE B:
{response_from_model_b}
### CRITERIA:
- Helpfulness: Does the response fulfill the user's request?
- Accuracy: Are the facts presented correct?
- Creativity: Is the response original and engaging?
### OUTPUT FORMAT:
First, explain your reasoning. Then, output a JSON object with the key "winner"
(values: "A", "B", or "tie") and "reason".
Your judgment:
"""
Advantages of LLM-as-a-Judge
- Scalable: Evaluate thousands of examples quickly.
- Cost-effective: No need for large human annotator teams.
- Consistent: Reduces variability from different human raters.
Limitations
- Bias: Judge LLM may favor responses written in its own “style.”
- Hallucination risk: It might misjudge correctness if it doesn’t “know” the fact.
- Reliance on Judge Quality: If the judge LLM isn’t very strong, the evaluation will be unreliable.
Best Practices using the LLM-as-a-judge
To get the most reliable results, follow these guidelines:
- Use a Powerful Judge: The quality of the judgment is directly tied to the capability of the judge model; frontier models such as GPT-4 are the usual choice.
- Calibrate with Humans: Always run a smaller “golden set” of evaluations with human judges to check the agreement between the LLM and humans. High agreement means you can trust the LLM to scale.
- Randomize Response Order: Always shuffle whether your model’s output is shown as Response A or B to avoid positional bias.
- Use Detailed Rubrics: Vague prompts lead to vague judgments. Be as specific as possible in your criteria.
- Think Chain-of-Thought: Force the judge to explain its reasoning before giving a verdict. This often leads to more accurate and thoughtful outcomes.
- Don’t Use it for Everything: This method is best for relative judgments (is A better than B?). It’s weaker for absolute scoring (on a scale of 1-10, how good is this?).
Human evaluation is still the gold standard, but LLM-as-a-Judge is quickly becoming the practical standard for large-scale evaluation of LLMs.
Summary of LLM Evaluation Metrics
Evaluating Large Language Models (LLMs) requires looking at them from multiple angles—no single metric captures the full picture.
- Perplexity → Measures how well a model predicts text. Great for fluency but doesn’t reflect meaning or usefulness.
- BLEU / ROUGE / METEOR → Compare model outputs with references using word overlap. Useful for translation/summarization but limited in judging creativity or reasoning.
- Embedding-based Metrics (e.g., BERTScore) → Capture semantic similarity beyond surface word matches, making them more robust than BLEU/ROUGE.
- MMLU (Massive Multitask Language Understanding) → Benchmark across 57 subjects to test reasoning and knowledge breadth.
- Human Evaluation → Gold standard for judging style, tone, correctness, and real-world usefulness. However, it’s expensive and time-consuming.
For a reliable assessment, combine automatic metrics for scale with selective human evaluation for depth. Together, they provide a balanced view of how well an LLM performs in real-world scenarios. Ultimately, Evaluating Large Language Models provides clarity on their performance across diverse tasks.