Technical Deep Dive
This section provides an in-depth look at the technology and methodology behind Forest Foresight, offering insights into our tech stack, data sources, preprocessing techniques, and machine learning approach.
Tech Stack
Our technology stack is designed to provide robust backend processing with versatile frontend options:
Frontend:
ArcGIS Online Dashboards: Primary visualization tool
WRI Map Builder App: Enhanced visualization capabilities
Web Interface: User-friendly access for custom area analysis
Open Access Data Repository: Direct data access for researchers and developers
Backend:
Server/Computer running the ForestForesight R package
AWS S3 Bucket: Cloud storage for prediction data
This architecture allows for scalable processing and flexible data access, catering to a wide range of user needs.
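As a small illustration of the backend-to-frontend data flow, the sketch below fetches a prediction layer from an S3 bucket using the aws.s3 R package. The bucket and object names are hypothetical placeholders, and valid AWS credentials are assumed.

```r
# Hypothetical example: pull one prediction GeoTIFF from cloud storage.
# Bucket and object names are placeholders; AWS credentials must be set.
library(aws.s3)
library(terra)

save_object(
  object = "predictions/2024-06-01.tif",   # placeholder object key
  bucket = "forestforesight-predictions",  # placeholder bucket name
  file   = "prediction.tif"
)
prediction <- rast("prediction.tif")
plot(prediction)  # quick look at the predicted risk surface
```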
Data Integration and Prediction
Forest Foresight combines multiple data sources to create accurate deforestation predictions:
Near Real-Time Deforestation Data:
Optical satellite imagery
Radar satellite data
Contextual Data:
Forest height
Distance to oil palm mills
Distance to roads
Elevation and slope
Agricultural potential
Various other relevant factors
This diverse dataset is fed into our XGBoost algorithm, which is trained on historical data with a six-month gap between the input features and the deforestation labels. This approach allows the model to learn patterns that predict deforestation six months into the future.
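To make the training setup concrete, here is a minimal sketch using the xgboost R package with synthetic stand-in data; the variable names are illustrative and not taken from the ForestForesight package itself.

```r
# Minimal sketch of the training setup with synthetic stand-in data;
# real features would be the flattened raster layers described above.
library(xgboost)

set.seed(42)
n_cells <- 1000
# One row per grid cell: e.g. distance to roads, slope, forest height...
features_t <- matrix(rnorm(n_cells * 5), ncol = 5,
                     dimnames = list(NULL, paste0("feature_", 1:5)))
# 1 if deforestation was confirmed six months after time t, else 0.
deforested_t6 <- rbinom(n_cells, 1, plogis(features_t[, 1]))

dtrain <- xgb.DMatrix(data = features_t, label = deforested_t6)
model  <- xgb.train(
  params = list(objective = "binary:logistic",  # outputs a probability
                eta = 0.1,                      # learning rate
                max_depth = 6),                 # tree depth
  data    = dtrain,
  nrounds = 100
)

# Risk scores six months ahead, one per grid cell.
risk <- predict(model, dtrain)
```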
Data Sources
Forest Foresight utilizes a wide array of data sources to ensure comprehensive and accurate predictions. A detailed Excel spreadsheet is available, listing all resources used in the model. This includes satellite imagery providers, geospatial databases, and various environmental and socio-economic datasets. The full list can be found here: List of ForestForesight features
Data Processing and Model Evaluation
All input data is resampled to a consistent resolution of 0.004 degrees latitude and longitude (approximately 445x445 meters at the equator). This standardization ensures compatibility across different data sources and enables consistent analysis.
Model evaluation is performed using historical data, calculating:
Precision: The fraction of predicted deforestation events that actually occurred (penalizes false positives)
Recall: The fraction of actual deforestation events the model detected (penalizes false negatives)
F0.5 score: A weighted combination of precision and recall that places more emphasis on precision
These metrics allow us to assess and refine the model's performance continually.
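For reference, these metrics follow directly from the counts of true positives, false positives, and false negatives; a minimal sketch:

```r
# Computing the evaluation metrics from predictions vs. observed labels.
# 'predicted' and 'actual' are 0/1 vectors (threshold already applied).
evaluate <- function(predicted, actual, beta = 0.5) {
  tp <- sum(predicted == 1 & actual == 1)  # true positives
  fp <- sum(predicted == 1 & actual == 0)  # false positives
  fn <- sum(predicted == 0 & actual == 1)  # false negatives

  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  # F-beta score: beta < 1 weights precision more heavily than recall.
  fbeta <- (1 + beta^2) * precision * recall /
           (beta^2 * precision + recall)
  c(precision = precision, recall = recall, f0.5 = fbeta)
}

evaluate(predicted = c(1, 1, 0, 1, 0), actual = c(1, 0, 0, 1, 1))
```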
Machine Learning Algorithm Comparison
Through rigorous testing, we've found that XGBoost's gradient-boosted tree ensembles consistently outperform other methods such as Support Vector Machines (SVMs) and neural networks for our specific use case. XGBoost offers superior predictive power and robustness, making it ideal for the complex task of deforestation prediction.
Data Preprocessing Techniques
Several preprocessing techniques are employed to ensure data quality and compatibility:
Reprojection: Aligning all data to a common coordinate reference system
Reclassification: Grouping or categorizing data values for simplified analysis
Rasterization: Converting vector data to raster format for uniform analysis
Resampling: Adjusting the resolution of raster data to our standard 0.004 degree grid
Data Flattening: Transforming multi-dimensional data into a 2D format for machine learning input
These techniques are crucial for creating a consistent and high-quality dataset for our model.
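As a sketch of how these steps fit together, the following uses the terra R package; the file names, class codes, and spatial extent are illustrative placeholders rather than the actual Forest Foresight inputs.

```r
# Illustrative preprocessing pipeline; file names, class codes, and the
# extent are placeholders, not the actual Forest Foresight inputs.
library(terra)

# Template grid: the standard 0.004-degree resolution in WGS84.
template <- rast(xmin = 100, xmax = 101, ymin = -1, ymax = 0,
                 resolution = 0.004, crs = "EPSG:4326")

# Reprojection: bring a layer into the common coordinate reference system.
elevation <- project(rast("elevation.tif"), "EPSG:4326")

# Reclassification: group land-cover codes into broader categories.
landcover <- classify(rast("landcover.tif"),
                      rcl = cbind(is = c(10, 20, 30),
                                  becomes = c(1, 1, 2)))

# Rasterization: burn vector roads onto the template grid.
roads <- rasterize(vect("roads.shp"), template, field = 1)

# Resampling: snap every layer onto the 0.004-degree grid.
elevation <- resample(elevation, template, method = "bilinear")
landcover <- resample(landcover, template, method = "near")

# Flattening: stack the layers and convert to a 2D matrix with one row
# per grid cell and one column per feature, ready for machine learning.
layers   <- c(elevation, landcover, roads)
features <- values(layers)
```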
Time Gap in Model Training
A six-month gap is maintained between the training data and the prediction target. This gap serves two crucial purposes:
It prevents the model from "cheating" by using information that wouldn't be available in a real-world prediction scenario.
It accounts for the lag in confirming deforestation events, ensuring that our training data is complete and accurate.
This approach ensures that our model learns to make genuine predictions rather than simply recapitulating known data.
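In code, maintaining the gap is simply a matter of shifting the feature date six months behind the label date, as in this base R illustration:

```r
# The label (confirmed deforestation) comes from 'prediction_date';
# the input features must be at least six months older.
prediction_date <- as.Date("2024-06-01")
feature_date    <- seq(prediction_date, by = "-6 months", length.out = 2)[2]
feature_date    # "2023-12-01"
```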
Automatic Feature Selection
Our software incorporates an intelligent feature selection mechanism:
For each feature, the system automatically selects the most appropriate dataset based on the date in the filename.
It chooses the data closest to, but never after, the chosen date for training, validation, or testing.
This ensures that the model always uses the most relevant and temporally appropriate data for each prediction task, maintaining the integrity of the time-based prediction model.
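A simplified version of this selection logic is sketched below; the filename pattern and helper function are illustrative, not the actual package internals.

```r
# Simplified illustration of date-based file selection.
# Assumes filenames embed a date as YYYY-MM-DD (pattern is illustrative).
select_closest_file <- function(files, target_date) {
  dates <- as.Date(regmatches(files, regexpr("\\d{4}-\\d{2}-\\d{2}", files)))
  eligible <- dates <= target_date          # never use data after the date
  if (!any(eligible)) return(NA_character_)
  files[eligible][which.max(dates[eligible])]
}

files <- c("forestheight_2022-01-01.tif",
           "forestheight_2023-01-01.tif",
           "forestheight_2024-01-01.tif")
select_closest_file(files, as.Date("2023-06-01"))
# -> "forestheight_2023-01-01.tif"
```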
By combining these advanced techniques and careful methodological considerations, Forest Foresight delivers highly accurate and reliable deforestation predictions, providing valuable insights for conservation efforts and land management strategies.
XGBoost
XGBoost (eXtreme Gradient Boosting) is a non-spatial machine learning algorithm based on gradient-boosted decision trees: each new tree is trained to correct the mistakes of the previous trees by minimizing a loss function. Trees are built sequentially, with each subsequent tree focusing on the residual errors (the differences between predictions and actual values) left by the trees before it, while regularization techniques keep the model from overfitting. Unlike spatial algorithms that explicitly consider geographic relationships, XGBoost makes predictions purely from feature values, and its implementation includes optimizations such as sparse-data handling and parallel processing.
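To make the idea of sequentially correcting residuals concrete, here is a stripped-down boosting loop for squared-error regression built on rpart trees; real XGBoost adds regularization, second-order gradient information, and many engineering optimizations on top of this core mechanism.

```r
# Toy gradient boosting for regression: each small tree is fit to the
# residuals (current errors) of the ensemble built so far.
library(rpart)

set.seed(1)
df <- data.frame(x1 = runif(300), x2 = runif(300))
df$y <- sin(6 * df$x1) + df$x2 + rnorm(300, sd = 0.1)

eta  <- 0.1                       # learning rate: shrink each tree's vote
pred <- rep(mean(df$y), nrow(df)) # start from a constant prediction

for (i in 1:100) {
  df$residual <- df$y - pred                      # what is still wrong
  tree <- rpart(residual ~ x1 + x2, data = df,
                control = rpart.control(maxdepth = 2))
  pred <- pred + eta * predict(tree, df)          # nudge toward the truth
}

mean((df$y - pred)^2)  # training error shrinks as trees accumulate
```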
To further illustrate our machine learning approach, we present three key concepts through visual representations:
Decision Tree Principles
The first image demonstrates the fundamental concept of decision trees:
A decision tree is a flowchart-like structure where each internal node represents a "test" on an attribute (e.g., "Is the distance to the nearest road less than 1 km?").
Each branch represents the outcome of the test.
Each leaf node represents a class label (the decision taken after computing all attributes).
In the context of Forest Foresight, a decision tree might use factors like proximity to roads, forest density, and historical deforestation patterns to classify areas as high or low risk for future deforestation.
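Written as code, one such tree is nothing more than nested conditions; the thresholds below are invented for illustration.

```r
# One hypothetical decision path, written as plain nested conditions.
# Thresholds are invented for illustration.
classify_risk <- function(dist_to_road_km, forest_height_m, past_loss) {
  if (dist_to_road_km < 1) {
    if (past_loss > 0) "high risk" else "medium risk"
  } else {
    if (forest_height_m < 10) "medium risk" else "low risk"
  }
}

classify_risk(dist_to_road_km = 0.4, forest_height_m = 25, past_loss = 3)
# -> "high risk"
```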
XGBoost Performance Comparison
The second image illustrates XGBoost's superior performance compared to single decision trees and random forests:
Single Decision Tree: Shows moderate performance but is prone to overfitting.
Random Forest: Improves upon single trees by aggregating multiple trees, reducing overfitting.
XGBoost: Demonstrates the highest performance, particularly in complex scenarios like deforestation prediction.
This visualization underscores why Forest Foresight utilizes XGBoost for its predictive modeling, as it consistently outperforms other tree-based methods in both accuracy and robustness.
XGBoost Basic Principles
The third image outlines the fundamental principles of XGBoost (eXtreme Gradient Boosting):
Sequential Tree Building: XGBoost builds trees one at a time, where each new tree helps to correct errors made by previously trained trees.
Gradient Boosting: It uses gradient descent to minimize errors, optimizing the loss function with each new tree.
Regularization: XGBoost employs regularization techniques to prevent overfitting, balancing model complexity with predictive power.
Feature Importance: The algorithm automatically calculates feature importance, helping identify the most crucial factors in deforestation prediction.
Handling Missing Data: XGBoost has built-in methods for handling missing data, which is particularly useful in large-scale environmental modeling where data gaps are common.
These principles make XGBoost particularly well-suited for the complex task of deforestation prediction, allowing Forest Foresight to create highly accurate and robust models from diverse environmental and socio-economic data sources.
By leveraging the power of XGBoost, Forest Foresight can process large amounts of multidimensional data efficiently, capture complex non-linear relationships between variables, and produce reliable predictions of future deforestation risks.
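For example, feature importance scores are available directly from a trained booster; the snippet below refits a tiny synthetic model so it stands alone, but the same calls apply to a real Forest Foresight model.

```r
# Inspect which inputs drive the model's predictions.
# Synthetic refit so the snippet stands alone; names are illustrative.
library(xgboost)

X <- matrix(rnorm(500 * 3), ncol = 3,
            dimnames = list(NULL, c("dist_roads", "slope", "forest_height")))
y <- rbinom(500, 1, plogis(X[, 1]))
model <- xgb.train(params = list(objective = "binary:logistic"),
                   data = xgb.DMatrix(X, label = y), nrounds = 50)

importance <- xgb.importance(model = model)
importance                     # columns: Feature, Gain, Cover, Frequency
xgb.plot.importance(importance)
```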
To explain XGBoost in non-technical terms:
Imagine you're trying to predict house prices, and you want to make really good guesses. XGBoost is like having a team of experts who each specialize in spotting different patterns about houses.
The first expert might notice that bigger houses tend to cost more. They make predictions based just on house size, but they make some mistakes. Then, the second expert focuses specifically on fixing those mistakes by looking at another feature - maybe the neighborhood. The third expert then looks at what's still being predicted wrong and tries to fix those errors by looking at the age of the house.
Each new expert builds on what the previous experts learned, but focuses mainly on the houses that were hardest to price correctly. It's like having a group of people where each person is really good at catching the mistakes others missed.
What makes XGBoost special is:
It's really fast - like having all these experts work together super efficiently
It's really accurate - because it keeps focusing on and fixing its mistakes
It's careful not to go overboard - it has ways to prevent experts from making wild guesses
In real-world terms, this is why XGBoost is used for things like:
Predicting whether someone might like a movie on Netflix
Figuring out if a credit card transaction might be fraudulent
Guessing how many products a store needs to stock