
Beyond Accuracy: A Practical Guide to Model Evaluation Metrics
Precision, Recall, F1-Score for classification; MAE, RMSE for regression; Perplexity for LLMs.
In many ML applications, accuracy alone is misleading – especially with imbalanced data or asymmetric error costs. This guide covers key metrics beyond accuracy, combining theory and Python examples. We’ll examine classification metrics (precision, recall, F1), regression metrics (MAE, RMSE), and LLM evaluation via perplexity. Each metric is defined, explained with use cases, and demonstrated in code.
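As a quick illustration with made-up labels (not a real dataset), consider a degenerate classifier that always predicts the majority class on a 95/5 imbalanced set: it scores 95% accuracy while catching none of the positives.
from sklearn.metrics import accuracy_score
# Hypothetical imbalanced dataset: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a "model" that always predicts the majority class
print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.95, despite learning nothing
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
print("Positives caught:", caught, "of", sum(y_true))  # 0 of 5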
📊 Classification Metrics: Precision, Recall, F1 Score
- Precision (P) measures the fraction of positive predictions that are actually correct. Formally, Precision = TP / (TP + FP). Intuitively, precision answers “Of all items labeled positive, how many truly are?” A high precision means few false positives. Precision is crucial when false alarms are costly: for example, flagging a healthy patient as sick (a false positive) may require expensive follow-up.
- Recall (R, or True Positive Rate) measures the fraction of actual positives that are identified. Formally, Recall = TP / (TP + FN). It answers “Of all true positive cases, how many did we catch?” High recall means few false negatives. Optimize recall when missing a positive is costly: failing to detect a disease (a false negative) can be dangerous.
- F1 Score is the harmonic mean of precision and recall: F1 = 2 * (P * R) / (P + R). F1 balances precision vs. recall and is most useful when both types of error matter equally. It is especially recommended for imbalanced classes, where accuracy can be misleading.
- Precision vs. Recall Tradeoff: Increasing precision (by raising the decision threshold) typically reduces recall, and vice versa; a threshold sweep after the code example below illustrates this. Use precision when false positives are very undesirable, and recall when false negatives hurt more. The F1 score provides a single summary when both matter.
from sklearn.metrics import precision_score, recall_score, f1_score
y_true = [0, 1, 1, 0, 1] # Ground truth labels
y_pred = [0, 0, 1, 1, 0] # Model predictions
print("Precision:", precision_score(y_true, y_pred)) # TP/(TP+FP)
print("Recall: ", recall_score(y_true, y_pred)) # TP/(TP+FN)
print("F1 Score: ", f1_score(y_true, y_pred)) # 2*P*R/(P+R)
Sample output: Precision: 0.50, Recall: 0.33, F1 Score: 0.40.
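To make the tradeoff concrete, the sketch below sweeps a decision threshold over hypothetical predicted probabilities (made-up scores, not from a trained model): as the threshold rises, precision tends to rise while recall falls.
from sklearn.metrics import precision_score, recall_score
y_true = [0, 0, 1, 0, 1, 1, 0, 1]                       # ground-truth labels
y_scores = [0.1, 0.4, 0.35, 0.45, 0.65, 0.9, 0.2, 0.8]  # hypothetical predicted probabilities
for threshold in (0.3, 0.5, 0.7):
    y_pred = [1 if s >= threshold else 0 for s in y_scores]
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred, zero_division=0)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")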
🧮 Regression Metrics: MAE vs RMSE
- Mean Absolute Error (MAE): the average absolute difference between predictions and true values. MAE = (1/n) * sum(|y_i - y_hat_i|). MAE is in the same units as the target, making it easily interpretable. It penalizes all errors equally and is more robust to outliers than RMSE.
- Root Mean Squared Error (RMSE): the square root of the average squared error. RMSE = sqrt((1/n) * sum((y_i - y_hat_i)^2)). RMSE penalizes larger errors more heavily. Its units also match the target, but its value is always at least as large as the MAE; the two are equal only when all errors have the same magnitude.
Choosing between MAE and RMSE:
- Use MAE for uniform error treatment and robustness to outliers.
- Use RMSE when large errors should be penalized more heavily or when residuals are expected to be roughly Gaussian.
- Computing both is often informative: RMSE >> MAE indicates the presence of a few large errors (illustrated after the code example below).
from sklearn.metrics import mean_absolute_error, mean_squared_error
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
mae = mean_absolute_error(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred) ** 0.5  # RMSE (or root_mean_squared_error in scikit-learn >= 1.4)
print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
Sample output: MAE: 0.50, RMSE: 0.61.
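To illustrate the RMSE >> MAE signal mentioned above, the sketch below compares two sets of toy predictions (made-up numbers): with small, uniform errors the two metrics stay close, while a single large error inflates RMSE far more than MAE.
from sklearn.metrics import mean_absolute_error, mean_squared_error
y_true = [3.0, -0.5, 2.0, 7.0, 10.0]
y_pred_good = [2.5, 0.0, 2.0, 8.0, 10.5]     # small errors everywhere
y_pred_outlier = [2.5, 0.0, 2.0, 8.0, 20.0]  # one large error on the last point
for name, y_pred in [("small errors", y_pred_good), ("one outlier", y_pred_outlier)]:
    mae = mean_absolute_error(y_true, y_pred)
    rmse = mean_squared_error(y_true, y_pred) ** 0.5
    print(f"{name}: MAE={mae:.2f}  RMSE={rmse:.2f}")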
🤖 LLM Evaluation: Perplexity
Perplexity (PPL) measures how well a language model predicts a sequence. It is the exponentiated average negative log-likelihood of the tokens. Intuitively, it measures how “surprised” the model is by the text: lower is better.
- Usage: Compare LLMs on held-out text. Lower perplexity means better prediction of token sequences.
- Limitations: Sensitive to tokenization, vocabulary, and sequence length. Does not directly measure understanding. Should be complemented with task-specific evaluation.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
# Load GPT-2 model and tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
text = "Transformers models are powerful for natural language processing."
inputs = tokenizer.encode(text, return_tensors="pt")
# Compute loss (cross-entropy) and perplexity
with torch.no_grad():
    outputs = model(inputs, labels=inputs)
    loss = outputs.loss
    perplexity = torch.exp(loss)
print(f"Perplexity: {perplexity.item():.2f}")