By Ryan McBride in AI

Model Evaluation Metrics

You built a model. It runs. It produces output. But is it any good? That question — deceptively simple — is where model evaluation metrics come in. Picking the wrong metric is one of the most common mistakes in applied ML, and it can make a terrible model look great or a solid model look broken. Here's what you actually need to know.

The Two Worlds: Classification vs. Regression

Every supervised ML metric lives in one of two buckets: classification (predicting categories) or regression (predicting numbers). Mixing them up is an instant red flag. You'd never use RMSE to evaluate a spam filter, and you'd never use F1 to evaluate a house price predictor. The task dictates the metric.

Classification Metrics

The Confusion Matrix — Your Starting Point

Before you touch any single-number metric, understand the confusion matrix. It's a 2x2 grid (for binary classification) that shows:

  • True Positives (TP): model said yes, actually yes

  • True Negatives (TN): model said no, actually no

  • False Positives (FP): model said yes, actually no (Type I error)

  • False Negatives (FN): model said no, actually yes (Type II error)

Every classification metric below is just a different way of slicing these four numbers. For multiclass problems, you get an NxN matrix — same idea, more cells.
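The four cells can be counted directly from paired labels and predictions. A minimal pure-Python sketch (the function name and toy data are illustrative, not from any particular library):

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 2, 1, 1)
```

Every metric in the rest of this section is some ratio of these four counts.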

Accuracy

(TP + TN) / (TP + TN + FP + FN)

The proportion of all predictions that were correct. Accuracy is the most intuitive metric, and it's fine when your dataset is balanced — roughly equal numbers of positive and negative examples. A loan approval model with 50/50 approved vs. denied applications? Accuracy works.

But here's the trap: accuracy is misleading on imbalanced datasets. If 99% of your emails are not spam, a model that predicts "not spam" for everything gets 99% accuracy while catching zero actual spam. It's technically accurate and completely useless.
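The trap is easy to demonstrate with a toy dataset (the 99/1 split below is made up to mirror the spam example):

```python
# 1000 emails, only 10 are spam (label 1). A model that always predicts
# "not spam" (0) scores 99% accuracy while catching zero actual spam.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(accuracy)  # 0.99
print(recall)    # 0.0
```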

Precision

TP / (TP + FP)

Of all the things the model said were positive, how many actually were? Precision answers: "When the model raises an alarm, can I trust it?"

High precision matters when false positives are expensive. Think about a content moderation system flagging user posts — if precision is low, you're censoring legitimate content and frustrating users. Or a fraud detection system that freezes bank accounts: too many false positives and you're locking out real customers.

Recall (Sensitivity)

TP / (TP + FN)

Of all the actual positives, how many did the model catch? Recall answers: "Is the model missing things it shouldn't?"

High recall matters when false negatives are dangerous. Medical diagnosis is the classic example — if your model classifies tumors and misses actual cancers (false negatives), people could die. You'd rather flag some benign cases for review (lower precision) than miss a malignant one.
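Both ratios fall straight out of the confusion-matrix counts. A quick sketch with hypothetical tumor-classifier numbers (the counts are invented for illustration):

```python
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical tumor classifier: 90 cancers caught, 10 missed,
# 30 benign cases incorrectly flagged for review.
tp, fn, fp = 90, 10, 30
print(precision(tp, fp))  # 0.75 -- 1 in 4 alarms is a false alarm
print(recall(tp, fn))     # 0.9  -- but 90% of real cancers are caught
```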

The Precision-Recall Tradeoff

You rarely get to maximize both at once. Tuning a model's decision threshold trades one for the other. Lower the threshold and you catch more positives (recall goes up), but you also catch more false positives (precision goes down). The right balance depends on the cost of each error type in your specific domain.
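You can watch the tradeoff happen by sweeping a threshold over model scores. A small sketch with made-up scores and labels (the `pr_at_threshold` helper is hypothetical):

```python
def pr_at_threshold(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, labels))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, labels))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, labels))
    prec = tp / (tp + fp) if tp + fp else 1.0  # convention: no alarms, no false alarms
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

scores = [0.95, 0.85, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,    1,    0,   1,   0,   1,   0,   0]

for thr in (0.8, 0.5, 0.25):
    print(thr, pr_at_threshold(scores, labels, thr))
```

Lowering the threshold from 0.8 to 0.25 pushes recall from 0.5 up to 1.0 while precision falls from 1.0 toward 0.67: exactly the tradeoff described above.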

F1 Score

2 × (Precision × Recall) / (Precision + Recall)

The harmonic mean of precision and recall. F1 gives you a single number that balances both, and it's the go-to metric for imbalanced datasets where accuracy would be misleading. If someone asks you "what metric for imbalanced classification?" — F1 is almost always the answer.

Why harmonic mean instead of arithmetic? Because it punishes extreme imbalances. If precision is 1.0 and recall is 0.01, the arithmetic mean is 0.505 (looks okay), but the harmonic mean is 0.02 (correctly reflects that the model is terrible).
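The harmonic-vs-arithmetic point from the paragraph above, checked numerically (a minimal sketch):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

p, r = 1.0, 0.01
print((p + r) / 2)         # 0.505 -- arithmetic mean looks deceptively okay
print(round(f1(p, r), 3))  # 0.02  -- harmonic mean exposes the imbalance
```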

AUC-ROC

The Receiver Operating Characteristic (ROC) curve plots true positive rate (recall) against false positive rate at every possible decision threshold. The Area Under this Curve (AUC) gives you a threshold-independent measure of model quality. AUC = 1.0 means perfect separation; AUC = 0.5 means the model is no better than random.

AUC-ROC is useful when you haven't committed to a specific threshold yet, or when you want to compare models irrespective of threshold choice. It's the standard metric for many binary classification benchmarks.
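AUC has an intuitive equivalent formulation: the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one (ties count half). That makes a brute-force sketch possible without plotting anything — fine for intuition, though real toolkits compute it more efficiently:

```python
def auc_roc(scores, labels):
    """AUC as the probability a random positive outranks a random negative."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.3]
labels = [1,   1,   0,   1,   0,    0]
print(auc_roc(scores, labels))  # one positive/negative pair is misordered

# Perfect separation: every positive outscores every negative.
print(auc_roc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0
```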

Regression Metrics

When your model predicts continuous values — prices, temperatures, quantities — you need different tools.

RMSE (Root Mean Squared Error)

sqrt(mean((predicted - actual)²))

The square root of the average squared prediction error. RMSE penalizes large errors more heavily than small ones because of the squaring, which makes it sensitive to outliers. If you're predicting house prices and occasionally being off by $500K, RMSE will punish that hard — which is often what you want.

RMSE is in the same units as your target variable, which keeps it interpretable: it approximates the typical size of a prediction error, with large misses weighted more heavily than the plain average would suggest.

MAE (Mean Absolute Error)

mean(|predicted - actual|)

The average magnitude of prediction errors, without squaring. MAE treats all error sizes linearly — being off by $100K twice is the same as being off by $200K once. Use MAE when outliers are expected and you don't want them dominating your evaluation.

R² (R-Squared)

The proportion of variance in the target variable explained by the model. R² = 1.0 means the model perfectly explains all variance; R² = 0 means it's no better than predicting the mean. R² can even go negative if your model is worse than just guessing the average.

R² is useful for understanding how much of the underlying signal your model captures, but it doesn't tell you the magnitude of errors — combine it with RMSE or MAE for the full picture.
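All three regression metrics are one-liners over the same residuals. A self-contained sketch with invented house-price data (values in $K), showing how one large miss pushes RMSE above MAE:

```python
import math

def rmse(actual, predicted):
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

def r2(actual, predicted):
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual    = [300, 250, 400, 550]  # hypothetical house prices ($K)
predicted = [320, 240, 390, 500]  # note the one $50K miss

print(mae(actual, predicted))   # 22.5
print(rmse(actual, predicted))  # ~27.8 -- the squared $50K error dominates
print(r2(actual, predicted))    # ~0.94 -- most variance explained
```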

Generative AI Metrics

Foundation models and LLMs need their own evaluation toolkit because their output is open-ended text, not a label or number. These metrics compare generated text against reference text using different strategies.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

The go-to metric for text summarization. ROUGE measures the overlap between a generated summary and a reference summary. Variants include:

  • ROUGE-N: n-gram overlap (ROUGE-1 for unigrams, ROUGE-2 for bigrams)

  • ROUGE-L: longest common subsequence

ROUGE is recall-oriented — it asks "how much of the reference content did the generated text capture?" Higher ROUGE means the summary is covering more of the key points from the reference.

BLEU (Bilingual Evaluation Understudy)

The standard metric for machine translation. BLEU compares a machine-generated translation against one or more human reference translations by analyzing n-gram overlap. A BLEU score ranges from 0 to 1 — higher means the translation more closely matches the reference.

BLEU is precision-oriented (the opposite of ROUGE): it asks "how much of the generated text's content appears in the reference?" This makes sense for translation where you want the output to stick close to known-good translations.

Key distinction: ROUGE for summarization, BLEU for translation. They show up on exams and in interviews as a pair — know which is which.
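The recall/precision duality between the two can be sketched with clipped unigram overlap. This is a deliberately simplified illustration: real ROUGE has more variants, and real BLEU combines multiple n-gram orders with a brevity penalty — the helper names here are made up:

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    """Clipped unigram matches between candidate and reference text."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    return sum(min(n, ref[w]) for w, n in cand.items())

def rouge1_recall(candidate, reference):
    # ROUGE view: what fraction of the reference did we cover?
    return unigram_overlap(candidate, reference) / len(reference.split())

def bleu1_precision(candidate, reference):
    # BLEU view: what fraction of our output appears in the reference?
    return unigram_overlap(candidate, reference) / len(candidate.split())

ref = "the cat sat on the mat"
cand = "the cat sat"
print(rouge1_recall(cand, ref))    # 0.5 -- covers 3 of 6 reference words
print(bleu1_precision(cand, ref))  # 1.0 -- every output word is in the reference
```

The short candidate scores perfect unigram precision but only 0.5 recall, which is exactly why BLEU needs a brevity penalty in practice.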

BERTScore

Uses pre-trained contextual embeddings from BERT to compute semantic similarity between generated and reference text. Unlike ROUGE and BLEU which do surface-level n-gram matching, BERTScore captures meaning — "the cat sat on the mat" and "a feline rested on the rug" would score low on ROUGE but high on BERTScore.

BERTScore is used in Amazon Bedrock's automatic model evaluation to assess foundation model outputs, particularly for summarization tasks.

Perplexity

Measures how well a language model predicts a sequence of words. Lower perplexity = better model. Intuitively, perplexity represents how "surprised" the model is by the text — a good language model assigns high probability to real text and thus has low perplexity.

Perplexity is primarily used for comparing language models against each other on the same dataset, not for evaluating task-specific output quality.
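Given the per-token probabilities a model assigned to a text, perplexity is the exponential of the average negative log-probability. A minimal sketch with invented probability sequences:

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to each token."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

confident = [0.9, 0.8, 0.95, 0.85]  # model expected these tokens
surprised = [0.1, 0.2, 0.05, 0.15]  # model found the text unlikely

print(perplexity(confident))  # low: the model fits this text well
print(perplexity(surprised))  # high: the model is "surprised"
```

A model that assigns every token probability 0.5 has perplexity exactly 2: it is, in effect, choosing between two equally likely options at each step.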

Choosing the Right Metric — The Decision Framework

This is where engineers get tripped up. The metric must match the problem:

  • Balanced binary classification → Accuracy: simple, interpretable, no class imbalance to skew it

  • Imbalanced classification → F1 Score: balances precision/recall, doesn't get fooled by the majority class

  • False positives are costly → Precision: minimize incorrect positive predictions

  • False negatives are dangerous → Recall: minimize missed positives

  • Multiclass classification → Confusion Matrix: see the per-class performance breakdown

  • Threshold-independent comparison → AUC-ROC: compare models without committing to a threshold

  • Regression with outlier sensitivity → RMSE: large errors penalized quadratically

  • Regression, robust to outliers → MAE: all errors weighted equally

  • Text summarization → ROUGE: recall-oriented n-gram overlap

  • Machine translation → BLEU: precision-oriented n-gram overlap

  • Semantic text similarity → BERTScore: embedding-based meaning comparison

  • Language model quality → Perplexity: lower = better word prediction

  • Runtime efficiency → Average Response Time: how fast predictions are generated

Model Evaluation on AWS

Amazon Bedrock offers two evaluation paths: automatic and human. Automatic evaluation computes quantitative scores using metrics like BERTScore and F1 — you supply a dataset (or use a built-in one) and get a report card. Human evaluation brings in subject matter experts for qualitative assessment: coherence, relevance, accuracy, overall quality.

Use automatic evaluation for fast, reproducible benchmarking. Use human evaluation when you need judgment calls that statistics can't capture — like whether a response is actually helpful or just technically correct.

For custom models, benchmark datasets (ImageNet, CIFAR-10, SQuAD, etc.) provide standardized evaluation that lets you compare against published results. Always evaluate on held-out data the model hasn't seen during training — otherwise you're just measuring memorization.

The Bottom Line

Metrics are a lens, not a verdict. A single number never tells the whole story. The best practice is to track multiple complementary metrics, understand what each one reveals and hides, and always tie your metric choice back to the real-world cost of different error types. A model with 95% accuracy and 12% recall on the minority class isn't a good model — it's a model that's good at ignoring the hard cases. The metric you choose determines whether you catch that or not.