Technical Deep Dive
This section provides an in-depth look at the technology and methodology behind Forest Foresight, offering insights into our tech stack, data sources, preprocessing techniques, and machine learning approach.
Tech Stack
Our technology stack is designed to provide robust backend processing with versatile frontend options:
Frontend:
ArcGIS Online Dashboards: Primary visualization tool
WRI Map Builder App: Enhanced visualization capabilities
Web Interface: User-friendly access for custom area analysis
Open Access Data Repository: Direct data access for researchers and developers
Backend:
Server/Computer running the ForestForesight R package
AWS S3 Bucket: Cloud storage for prediction data
This architecture allows for scalable processing and flexible data access, catering to a wide range of user needs.
Data Integration and Prediction
Forest Foresight combines multiple data sources to create accurate deforestation predictions:
Near Real-Time Deforestation Data:
Optical satellite imagery
Radar satellite data
Contextual Data:
Forest height
Distance to oil palm mills
Distance to roads
Elevation and slope
Agricultural potential
Various other relevant factors
This diverse dataset is fed into our XGBoost algorithm, which is trained on historical data with a six-month gap between the input features and the observed deforestation. This approach allows the model to learn patterns that predict deforestation six months into the future.
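As a rough illustration of this step, the sketch below fits a gradient-boosted classifier with the xgboost R package on a matrix of grid-cell features. The object names (feature_matrix, deforested_6m_later, new_feature_matrix) and the parameter values are placeholders for illustration; they are not the ForestForesight package API or its tuned settings.

```r
# Minimal sketch of fitting the predictive model with the xgboost R package.
# feature_matrix: one row per 0.004-degree cell, one column per input layer.
# deforested_6m_later: 0/1 labels observed six months after the feature date.
library(xgboost)

dtrain <- xgb.DMatrix(data = feature_matrix, label = deforested_6m_later)

model <- xgb.train(
  params = list(
    objective   = "binary:logistic",  # probability that a cell is deforested
    eval_metric = "aucpr",            # precision-recall oriented evaluation
    max_depth   = 6,
    eta         = 0.1
  ),
  data    = dtrain,
  nrounds = 200
)

# Predicted deforestation risk for new cells (features from the current date)
risk <- predict(model, xgb.DMatrix(new_feature_matrix))
```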
Data Sources
Forest Foresight utilizes a wide array of data sources to ensure comprehensive and accurate predictions. A detailed Excel spreadsheet is available, listing all resources used in the model. This includes satellite imagery providers, geospatial databases, and various environmental and socio-economic datasets.
Data Processing and Model Evaluation
All input data is resampled to a consistent resolution of 0.004 degrees in latitude and longitude (approximately 445 x 445 meters at the equator). This standardization ensures compatibility across different data sources and enables consistent analysis.
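A minimal sketch of this resampling step using the terra package is shown below; the area of interest and the file names are invented for illustration and are not the actual Forest Foresight inputs.

```r
# Sketch of snapping an input layer onto the common 0.004-degree grid with terra.
library(terra)

aoi <- ext(94, 141, -11, 6)  # illustrative bounding box in degrees (xmin, xmax, ymin, ymax)

# Template raster defining the shared 0.004-degree analysis grid
template <- rast(aoi, resolution = 0.004, crs = "EPSG:4326")

# Resample a continuous layer (e.g. elevation) onto the template grid
elevation     <- rast("elevation.tif")
elevation_004 <- resample(elevation, template, method = "bilinear")

writeRaster(elevation_004, "elevation_0004deg.tif", overwrite = TRUE)
```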
Model evaluation is performed using historical data, calculating:
Precision: The proportion of cells flagged as deforestation that were actually deforested (penalizes false positives)
Recall: The proportion of actual deforestation that the model correctly flags (penalizes false negatives)
F0.5 score: A weighted combination of precision and recall that places more emphasis on precision
These metrics allow us to assess and refine the model's performance continually.
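The sketch below shows how these metrics follow from a confusion matrix of predicted versus observed deforestation. The vectors risk and observed, and the 0.5 decision threshold, are placeholders for illustration.

```r
# Evaluation metrics computed from predicted risk vs. observed deforestation.
predicted <- as.integer(risk > 0.5)   # illustrative decision threshold

tp <- sum(predicted == 1 & observed == 1)
fp <- sum(predicted == 1 & observed == 0)
fn <- sum(predicted == 0 & observed == 1)

precision <- tp / (tp + fp)   # share of flagged cells that were truly deforested
recall    <- tp / (tp + fn)   # share of deforested cells that were flagged
beta      <- 0.5              # F0.5 weights precision more heavily than recall
f05 <- (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
```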
Machine Learning Algorithm Comparison
Through rigorous testing, we've found that XGBoost's gradient-boosted tree ensemble consistently outperforms other methods such as Support Vector Machines (SVM) and Neural Networks for our specific use case. XGBoost offers superior predictive power and robustness, making it well suited to the complex task of deforestation prediction.
Data Preprocessing Techniques
Several preprocessing techniques are employed to ensure data quality and compatibility:
Reprojection: Aligning all data to a common coordinate reference system
Reclassification: Grouping or categorizing data values for simplified analysis
Rasterization: Converting vector data to raster format for uniform analysis
Resampling: Adjusting the resolution of raster data to our standard 0.004 degree grid
Data Flattening: Transforming multi-dimensional data into a 2D format for machine learning input
These techniques are crucial for creating a consistent and high-quality dataset for our model.
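The sketch below illustrates the remaining steps (reprojection, reclassification, rasterization, and flattening) with the terra package, reusing the 0.004-degree template grid from the earlier sketch. File names, class breaks, and field values are invented for illustration.

```r
# Sketch of the preprocessing steps with terra; inputs are illustrative.
library(terra)

# Reprojection: align a layer to the common lat/lon reference system
landcover <- project(rast("landcover.tif"), "EPSG:4326", method = "near")

# Reclassification: collapse detailed land-cover codes into broader groups
rcl <- matrix(c(0, 5, 1,     # codes 1-5  -> class 1 (forest)
                5, 10, 2),   # codes 6-10 -> class 2 (non-forest)
              ncol = 3, byrow = TRUE)
landcover_grouped <- classify(landcover, rcl)

# Rasterization: burn a vector layer (e.g. roads) onto the template grid
roads       <- vect("roads.gpkg")
road_raster <- rasterize(roads, template, field = 1, background = 0)

# Resampling + flattening: stack layers on the 0.004-degree grid, then turn
# them into a 2D matrix (rows = grid cells, columns = features) for the model
stack <- c(resample(landcover_grouped, template, method = "near"),
           resample(road_raster, template, method = "near"))
feature_matrix <- values(stack)  # matrix ready for machine learning input
```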
Time Gap in Model Training
A six-month gap is maintained between the training data and the prediction target. This gap serves two crucial purposes:
It prevents the model from "cheating" by using information that wouldn't be available in a real-world prediction scenario.
It accounts for the lag in confirming deforestation events, ensuring that our training data is complete and accurate.
This approach ensures that our model learns to make genuine predictions rather than simply recapitulating known data.
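For concreteness, the date arithmetic behind this gap might look like the sketch below; the dates are arbitrary examples.

```r
# Six-month offset between feature dates and label dates.
# Training: features observed at train_feature_date are paired with
# deforestation confirmed six months later. Prediction: the same offset
# projects the current features six months ahead.
train_feature_date <- as.Date("2022-07-01")
train_label_date   <- seq(train_feature_date, by = "6 months", length.out = 2)[2]    # 2023-01-01

predict_feature_date <- as.Date("2023-01-01")
predict_target_date  <- seq(predict_feature_date, by = "6 months", length.out = 2)[2]  # 2023-07-01
```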
Automatic Feature Selection
Our software incorporates an intelligent feature selection mechanism:
For each feature, the system automatically selects the most appropriate dataset based on the date in the filename.
It chooses the data closest to, but never after, the chosen date for training, validation, or testing.
This ensures that the model always uses the most relevant and temporally appropriate data for each prediction task, maintaining the integrity of the time-based prediction model.
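A simplified version of this selection logic is sketched below. The file-naming pattern and dates are illustrative and not necessarily those used by the ForestForesight package.

```r
# Select, for one feature, the file whose embedded date is closest to but
# never after the target date.
files <- c("forestheight_2022-06-01.tif",
           "forestheight_2022-12-01.tif",
           "forestheight_2023-06-01.tif")
target_date <- as.Date("2023-01-01")

# Extract the date embedded in each filename
file_dates <- as.Date(sub(".*_([0-9]{4}-[0-9]{2}-[0-9]{2})\\.tif$", "\\1", files))

# Keep files dated on or before the target date, then take the most recent one
eligible <- file_dates <= target_date
selected <- files[eligible][which.max(as.numeric(file_dates[eligible]))]
selected  # "forestheight_2022-12-01.tif"
```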
By combining these advanced techniques and careful methodological considerations, Forest Foresight delivers highly accurate and reliable deforestation predictions, providing valuable insights for conservation efforts and land management strategies.
XGBoost
To further illustrate our machine learning approach, we present three key concepts through visual representations:
Decision Tree Principles
The first image demonstrates the fundamental concept of decision trees:
A decision tree is a flowchart-like structure where each internal node represents a "test" on an attribute (e.g., "Is the distance to the nearest road less than 1 km?").
Each branch represents the outcome of the test.
Each leaf node represents a class label (the decision taken after computing all attributes).
In the context of Forest Foresight, a decision tree might use factors like proximity to roads, forest density, and historical deforestation patterns to classify areas as high or low risk for future deforestation.
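To make the flowchart idea concrete, the toy sketch below fits a single classification tree with the rpart package on synthetic data; the variable names, thresholds, and labelling rule are invented for illustration and do not reflect the actual model inputs.

```r
# Toy decision tree on synthetic data to illustrate the splitting idea.
library(rpart)

set.seed(42)
n <- 1000
toy <- data.frame(
  dist_to_road   = runif(n, 0, 10),   # km
  forest_density = runif(n, 0, 1),
  past_loss      = rbinom(n, 1, 0.2)
)
# Synthetic label: deforestation is more likely near roads with prior loss
toy$deforested <- factor(ifelse(toy$dist_to_road < 1 & toy$past_loss == 1,
                                "high_risk", "low_risk"))

tree <- rpart(deforested ~ dist_to_road + forest_density + past_loss,
              data = toy, method = "class")
print(tree)   # shows the learned splits, e.g. "dist_to_road < 1"
```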
XGBoost Performance Comparison
The second image illustrates XGBoost's superior performance compared to single decision trees and random forests:
Single Decision Tree: Shows moderate performance but is prone to overfitting.
Random Forest: Improves upon single trees by aggregating multiple trees, reducing overfitting.
XGBoost: Demonstrates the highest performance, particularly in complex scenarios like deforestation prediction.
This visualization underscores why Forest Foresight utilizes XGBoost for its predictive modeling, as it consistently outperforms other tree-based methods in both accuracy and robustness.
XGBoost Basic Principles
The third image outlines the fundamental principles of XGBoost (eXtreme Gradient Boosting):
Sequential Tree Building: XGBoost builds trees one at a time, where each new tree helps to correct errors made by previously trained trees.
Gradient Boosting: It uses gradient descent to minimize errors, optimizing the loss function with each new tree.
Regularization: XGBoost employs regularization techniques to prevent overfitting, balancing model complexity with predictive power.
Feature Importance: The algorithm automatically calculates feature importance, helping identify the most crucial factors in deforestation prediction.
Handling Missing Data: XGBoost has built-in methods for handling missing data, which is particularly useful in large-scale environmental modeling where data gaps are common.
These principles make XGBoost particularly well-suited for the complex task of deforestation prediction, allowing Forest Foresight to create highly accurate and robust models from diverse environmental and socio-economic data sources.
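As a rough sketch of how these behaviours surface in the xgboost R package, the example below sets illustrative regularization parameters, trains on the placeholder feature matrix from the earlier sketch (NA cells are handled natively), and inspects feature importance. The values shown are not the tuned Forest Foresight settings.

```r
# Regularization, missing-data handling and feature importance in xgboost.
library(xgboost)

params <- list(
  objective = "binary:logistic",
  eta       = 0.1,   # shrinkage: each new tree corrects earlier trees gently
  max_depth = 6,
  lambda    = 1,     # L2 regularization on leaf weights
  alpha     = 0,     # L1 regularization on leaf weights
  gamma     = 1      # minimum loss reduction required to make a split
)

# NA values in the feature matrix are handled natively: each split learns a
# default direction for missing values.
dtrain <- xgb.DMatrix(data = feature_matrix, label = deforested_6m_later)
model  <- xgb.train(params = params, data = dtrain, nrounds = 200)

# Feature importance: which input layers contribute most to the predictions
xgb.importance(model = model)
```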
By leveraging the power of XGBoost, Forest Foresight can process large amounts of multidimensional data efficiently, capture complex non-linear relationships between variables, and produce reliable predictions of future deforestation risks.