Our Classification Metrics
In many functions of the Forest Foresight package and in our public communications we use the F0.5 score to compare our predictive power against the baseline, over time, and across different areas and landscapes. In this article we explain the different aspects of this metric.
Basic Classification Outcomes
To understand F0.5 and related metrics, we first need to define the four possible outcomes in a binary classification problem:
True Positive (TP): The model correctly predicts the positive class.
False Positive (FP): The model incorrectly predicts the positive class when it's actually negative.
True Negative (TN): The model correctly predicts the negative class.
False Negative (FN): The model incorrectly predicts the negative class when it's actually positive.
These outcomes are often presented in a confusion matrix:
                | Predicted Positive | Predicted Negative
Actual Positive | TP                 | FN
Actual Negative | FP                 | TN
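As a concrete illustration, the sketch below builds these four counts from a small set of made-up labels using scikit-learn (an assumed dependency here; this is not code from the Forest Foresight package itself):

```python
# A minimal sketch with made-up labels (1 = deforestation, 0 = no deforestation).
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# scikit-learn orders the matrix as [[TN, FP], [FN, TP]] for labels [0, 1],
# so its layout is mirrored compared to the table above.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")  # TP=3, FP=1, TN=3, FN=1
```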
Precision and Recall
Precision and recall are two fundamental metrics in classification problems:
Precision
Precision measures the accuracy of positive predictions.
Precision = TP / (TP + FP)
It answers the question: "Of all the instances the model labeled as positive, what fraction was actually positive?"
Recall
Recall measures the completeness of positive predictions.
Recall = TP / (TP + FN)
It answers the question: "Of all the actual positive instances, what fraction did the model correctly identify?"
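To make both definitions concrete, here is a short sketch with hypothetical counts (made-up numbers, not actual Forest Foresight results):

```python
# Hypothetical counts from a single validation run.
tp, fp, fn = 80, 20, 30

precision = tp / (tp + fp)  # 80 / 100 = 0.80: most alerts were real
recall    = tp / (tp + fn)  # 80 / 110 ≈ 0.73: most real events were caught

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```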
F-Score
The F-score is a metric that combines precision and recall into a single value. The general formula for the F-score is:
F_β = (1 + β²) * (precision * recall) / (β² * precision + recall)
Where β is a parameter that determines the relative weight of recall versus precision: β < 1 favors precision, β > 1 favors recall, and β = 1 weights them equally.
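The formula translates directly into code. The helper below is a plain implementation of the general F-score (an illustrative sketch, not a function from the Forest Foresight package), evaluated for an assumed precision of 0.8 and recall of 0.5:

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """General F-score: beta controls the weight of recall relative to precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# With precision = 0.8 and recall = 0.5:
print(f_beta(0.8, 0.5, beta=1.0))  # ≈ 0.615 (F1, the harmonic mean)
print(f_beta(0.8, 0.5, beta=0.5))  # ≈ 0.714 (F0.5, rewarding the higher precision)
```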
F1 Score
The F1 score is the harmonic mean of precision and recall, giving equal weight to both:
F1 = 2 * (precision * recall) / (precision + recall)
This is equivalent to the F-score formula with β = 1.
F0.5 Score
The F0.5 score gives more weight to precision than to recall:
F0.5 = 1.25 * (precision * recall) / (0.25 * precision + recall)
This is equivalent to the F-score formula with β = 0.5.
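In practice the score is usually computed directly from labels and predictions rather than from precision and recall. A minimal sketch using scikit-learn (an assumed dependency), reusing the toy labels from the confusion matrix example above:

```python
from sklearn.metrics import fbeta_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# beta=0.5 weights precision more heavily than recall.
print(fbeta_score(y_true, y_pred, beta=0.5))  # 0.75 for this toy example
```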
Why Use F0.5 Instead of F1?
The choice between F0.5 and F1 depends on the specific requirements of your classification problem:
Use F1 when you want to balance precision and recall equally.
Use F0.5 when precision is more important than recall.
F0.5 is particularly useful in scenarios where the cost of false positives is higher than the cost of false negatives. For example:
In spam detection, it might be better to let a few spam emails through (false negatives) than to incorrectly flag legitimate emails as spam (false positives).
In Forest Foresight, it is more important that the areas we flag as likely deforestation are genuinely at risk (high precision), even if that means missing some potential deforestation events (lower recall), because responding to false alerts costs scarce time in the field. The worked comparison below shows how F0.5 reflects this preference.
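In the sketch below, two hypothetical models have mirrored precision and recall (made-up numbers, purely for illustration). F1 cannot tell them apart, while F0.5 clearly prefers the more precise one:

```python
# Two hypothetical models with mirrored precision and recall.
models = {"A (precise)": (0.9, 0.6), "B (sensitive)": (0.6, 0.9)}

for name, (p, r) in models.items():
    f1  = 2 * p * r / (p + r)
    f05 = 1.25 * p * r / (0.25 * p + r)
    print(f"Model {name}: F1 = {f1:.2f}, F0.5 = {f05:.2f}")

# Both models reach F1 = 0.72, but F0.5 separates them
# (about 0.82 for A versus 0.64 for B), rewarding the precise model.
```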
Area Under the Curve (AUC)
During training of a model we calculate the area under the curve (AUC), a summary of how well the model ranks positive cases above negative ones across all classification thresholds. We do this because, in our experience, it does not lead to different F0.5 scores for the final model, and it is much faster than calculating F0.5 at every training step, especially with a lot of training data.
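The Forest Foresight package itself is written in R; as a language-neutral illustration of the idea, the sketch below uses the Python xgboost and scikit-learn packages (assumptions on our part, not the package's actual training code) to monitor AUC on a validation set after every boosting round:

```python
# Illustrative only: tracking AUC on a validation set during training.
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic data standing in for real training features and labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(5_000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=5_000) > 1.5).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# eval_metric="auc" makes the booster report validation AUC after every round,
# without needing a custom F0.5 evaluation callback at each step.
model = XGBClassifier(n_estimators=200, eval_metric="auc")
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

print(model.evals_result()["validation_0"]["auc"][-1])  # AUC after the final round
```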
The Area Under the Curve typically refers to the area under the Receiver Operating Characteristic (ROC) curve, often called AUC-ROC.
The ROC curve plots the True Positive Rate (Recall) against the False Positive Rate at various classification thresholds.
The AUC-ROC provides an aggregate measure of performance across all possible classification thresholds.
AUC-ROC ranges from 0 to 1:
0.5 represents a model that performs no better than random guessing
1.0 represents a perfect model
Values below 0.5 suggest the model is worse than random guessing
Advantages of AUC-ROC:
It's threshold-invariant: it measures model performance across all possible thresholds.
It's scale-invariant: it measures how well predictions are ranked, rather than their absolute values.
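The ranking interpretation can be checked directly: with no tied scores, AUC-ROC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. A small sketch with made-up scores, assuming scikit-learn:

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Made-up scores: 1 = deforestation, 0 = no deforestation.
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.60, 0.90]

pos = [s for t, s in zip(y_true, y_score) if t == 1]
neg = [s for t, s in zip(y_true, y_score) if t == 0]

# Fraction of correctly ordered (positive, negative) pairs.
pairs_correct = sum(p > n for p, n in product(pos, neg)) / (len(pos) * len(neg))

print(roc_auc_score(y_true, y_score))  # 0.875
print(pairs_correct)                   # 0.875, the same value
```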
However, AUC-ROC may not be suitable when:
There's a large class imbalance
The costs of false positives and false negatives are significantly different
In these cases, metrics like F0.5 or F1 might be more appropriate, as they focus more on the positive class and can be tuned to emphasize precision or recall as needed.
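The class-imbalance caveat can be illustrated with synthetic scores: when positives are rare, a model can rank cases well (high AUC-ROC) yet still produce far more false alerts than true ones at a realistic threshold, which precision-focused metrics like F0.5 expose immediately. A sketch assuming numpy and scikit-learn, with made-up data:

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, roc_auc_score

# Heavily imbalanced synthetic data: 10,000 negatives, 100 positives.
rng = np.random.default_rng(0)
y_score = np.concatenate([rng.normal(0.0, 1.0, 10_000),  # negatives score low on average
                          rng.normal(2.0, 1.0, 100)])    # positives score higher on average
y_true = np.concatenate([np.zeros(10_000), np.ones(100)]).astype(int)
y_pred = (y_score > 1.0).astype(int)  # alerts at a fixed threshold

print(roc_auc_score(y_true, y_score))         # high, roughly 0.9
print(precision_score(y_true, y_pred))        # very low: most alerts are false
print(fbeta_score(y_true, y_pred, beta=0.5))  # low, reflecting the poor precision
```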
Conclusion
Understanding these metrics is crucial for evaluating and comparing classification models:
Precision and recall provide insights into different aspects of model performance.
F-scores (like F1 and F0.5) combine precision and recall into a single value, with β controlling how the two are weighted.
The choice between F1 and F0.5 depends on whether you prioritize balance or precision.
AUC provides a threshold-independent measure of model performance but may not be suitable for all scenarios.
When working on a classification problem, it's important to consider the specific requirements and constraints of your application when choosing which metrics to prioritize.