Now that you've learned how the Data Preprocessing Pipeline transforms your messy soil measurements into clean, analysis-ready datasets, it's time to explore the first layer of AgReFed-ML's prediction system: Mean Function Models.
Think of this relationship like creating a masterpiece painting. The Data Preprocessing Pipeline is like preparing your canvas and organizing your paints. Mean Function Models are like having a skilled sketch artist create the basic outline and composition of your painting. This rough sketch captures the main shapes, proportions, and overall structure. Later, a master artist (the Gaussian Process Models) will add fine details, subtle shading, and beautiful textures to create the final masterpiece.
Imagine you're trying to predict soil organic carbon levels across a 1,000-hectare farm. You have soil measurements from 50 sample points scattered across the property, plus environmental data like elevation, rainfall, vegetation indices, and slope for every square meter of the farm.
Without Mean Function Models, you'd face these challenges:
Mean Function Models solve these problems by learning the primary relationships between soil properties and environmental factors. They're called "mean function" models because they predict the average (mean) soil property value you'd expect at any location, given the environmental conditions there.
Think of Mean Function Models as experienced agricultural consultants who can give you a good first estimate of soil conditions based on what they can see about the landscape:
AgReFed-ML provides two types of agricultural consultants, each with different strengths:
The first consultant, Bayesian Linear Regression (BLR), looks for simple, interpretable relationships. BLR assumes that soil properties change in predictable, linear ways with environmental factors. For example: "For every 100 meters of elevation gain, soil organic carbon typically increases by 0.2%."
The second consultant, Random Forest (RF), is excellent at finding complex, non-linear patterns. RF can discover relationships like: "High soil carbon occurs when elevation is above 300 m AND rainfall is between 800 and 1,200 mm AND the area has moderate slope."
Every Mean Function Model in AgReFed-ML has three essential components:
These are the landscape characteristics that help predict soil properties:
    # Common environmental covariates
    covariates = [
        'elevation',    # Height above sea level
        'slope',        # Steepness of terrain
        'aspect',       # Direction slope faces
        'ndvi',         # Vegetation greenness
        'rainfall',     # Annual precipitation
        'temperature'   # Average temperature
    ]

These covariates are available everywhere across your study area, unlike soil measurements, which are only available at sample points.
The model learns relationships by studying your soil sample locations:
    # Training data: known soil samples + their environmental conditions
    # (the right-hand names are placeholders for your own arrays and training routine)
    X_train = environmental_data_at_sample_points  # Shape: (50 samples, 6 covariates)
    y_train = soil_carbon_measurements             # Shape: (50 samples,)

    # Train the model to learn the relationships
    model = train_model(X_train, y_train)

The model studies these 50 examples to learn how environmental conditions relate to soil carbon levels.
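To make the placeholders above concrete, here is a minimal sketch that trains on synthetic data. It assumes scikit-learn's `BayesianRidge` (the estimator used in the BLR listing later in this chapter) as a stand-in for `train_model`. The covariate values, the coefficients 0.5 and -0.3, and the noise level are invented for illustration and are not real soil data.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)

# Synthetic stand-in for 50 sample points with 6 covariates (illustrative only)
n_samples, n_covariates = 50, 6
X_train = rng.normal(size=(n_samples, n_covariates))

# Hypothetical relationship: the target depends on the first two covariates only
y_train = (
    2.0
    + 0.5 * X_train[:, 0]
    - 0.3 * X_train[:, 1]
    + rng.normal(scale=0.1, size=n_samples)
)

# Assumption: BayesianRidge plays the role of train_model here
model = BayesianRidge()
model.fit(X_train, y_train)

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("learned coefficients:", np.round(model.coef_, 2))
```

The learned coefficients for the first two covariates should land close to the values used to generate the data, while the remaining four stay near zero.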
Once trained, the model can predict soil properties anywhere:
    # Prediction data: environmental conditions everywhere on the farm
    # (1,000 ha at 1 m resolution = 10,000,000 grid cells)
    X_predict = environmental_data_full_farm  # Shape: (10,000,000 locations, 6 covariates)

    # Generate predictions across the entire farm
    predictions = model.predict(X_predict)    # Shape: (10,000,000 predictions,)

Now you have soil carbon estimates for every location on your farm!
Let's walk through creating a soil organic carbon map using both types of mean function models.
For soil organic carbon mapping, you need to decide between the two consultant types:
    # Option 1: Bayesian Linear Regression (good for simple relationships)
    model_type = 'blr'

    # Option 2: Random Forest (good for complex patterns)
    model_type = 'rf'

When to use BLR: When you want interpretable results and suspect simple relationships (e.g., "carbon increases with elevation").
When to use RF: When you suspect complex interactions between variables (e.g., "high carbon occurs under specific combinations of elevation, rainfall, and slope").
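If it is unclear which description fits your data, one option is to compare both model types with cross-validation on the training samples. The sketch below is an assumption about workflow, not part of AgReFed-ML's documented API: it calls scikit-learn's `cross_val_score` directly on synthetic, purely linear data, and uses a smaller forest (100 trees) to keep the run short.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic training set: 60 samples, 6 covariates, linear signal (illustrative only)
X = rng.normal(size=(60, 6))
y = 1.0 * X[:, 0] - 0.8 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=60)

candidates = {
    "blr": BayesianRidge(),
    "rf": RandomForestRegressor(n_estimators=100, min_samples_leaf=2, random_state=42),
}

# 5-fold cross-validation, scored by R^2 on the held-out fold
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    for name, model in candidates.items()
}

# Pick the model type with the higher mean R^2
model_type = max(scores, key=scores.get)
print(scores, "->", model_type)
```

On data like this, where the true relationship is linear, BLR should score higher. With strong interactions between covariates the ranking can reverse.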
Let's start with the systematic analyst approach:

    from model_blr import blr_train

    # Train BLR model on your soil samples
    blr_model = blr_train(X_train, y_train)

This trains the BLR model to find the best linear relationships between environmental factors and soil carbon. The model will identify which environmental variables are most important and how strongly they relate to soil carbon.
    from model_blr import blr_predict

    # Generate predictions and uncertainty estimates
    predictions, uncertainties, _ = blr_predict(X_predict, blr_model)

The call keeps two of the returned values (the third is discarded with `_`):
- Predictions: Estimated soil carbon values for each location
- Uncertainties: The standard deviation of each prediction, which indicates how confident the model is
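To see where a per-location uncertainty can come from, the following sketch works through Bayesian linear regression in closed form with NumPy. It is a simplified illustration, not the `blr_predict` implementation: the noise precision `alpha` and the weight precision `lam` are fixed by hand here (an assumption), there is no intercept, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: 50 samples, 3 covariates, no intercept (illustrative only)
true_w = np.array([0.5, -0.3, 0.2])
noise_std = 0.1
X_train = rng.normal(size=(50, 3))
y_train = X_train @ true_w + rng.normal(scale=noise_std, size=50)

alpha = 1.0 / noise_std**2  # noise precision (assumed known)
lam = 1.0                   # prior precision on the weights (assumed)

# Posterior over the weights: covariance S and mean m
S = np.linalg.inv(lam * np.eye(3) + alpha * X_train.T @ X_train)
m = alpha * S @ X_train.T @ y_train

def predict_with_std(X_new):
    """Predictive mean and standard deviation at each row of X_new."""
    mean = X_new @ m
    # Predictive variance = noise variance + x^T S x for each row x
    var = 1.0 / alpha + np.einsum("ij,jk,ik->i", X_new, S, X_new)
    return mean, np.sqrt(var)

X_new = np.array([
    [0.0, 0.0, 0.0],     # conditions typical of the training data
    [10.0, 10.0, 10.0],  # conditions far outside the training range
])
predictions, uncertainties = predict_with_std(X_new)
print("predictions:", predictions)
print("uncertainties:", uncertainties)
```

The second location gets a larger uncertainty because it lies far from the sampled conditions, so uncertainty about the coefficients contributes more to the prediction there.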
Now let's try the pattern recognition expert:

    from model_rf import rf_train

    # Train RF model on the same soil samples
    rf_model = rf_train(X_train, y_train)

The Random Forest trains 1,000 decision trees, each learning slightly different patterns from your data. This ensemble approach makes RF robust to noise and capable of capturing complex relationships.

    from model_rf import rf_predict

    # Generate predictions with uncertainty from tree ensemble
    predictions, uncertainties, _ = rf_predict(X_predict, rf_model)

Random Forest estimates uncertainty by looking at how much the 1,000 different trees disagree with each other. High disagreement means high uncertainty.
When you train and use a Mean Function Model, here's the step-by-step process that occurs:
Let's break this down:

    # Conceptual pseudocode: the model studies your soil samples
    def learn_relationships(soil_samples, environmental_data):
        # Find patterns like: "High elevation → High carbon"
        # Or: "Steep slopes + High rainfall → Low carbon"
        trained_model = find_patterns(soil_samples, environmental_data)
        return trained_model

During training, the model examines each soil sample location and asks: "What environmental conditions led to this soil carbon level?"
For Bayesian Linear Regression:
    # BLR finds the best linear equation:
    # soil_carbon = a₁×elevation + a₂×rainfall + a₃×slope + constant
    def find_linear_relationships(X, y):
        coefficients = calculate_best_fit(X, y)  # Find a₁, a₂, a₃, etc.
        uncertainties = estimate_coefficient_uncertainty(coefficients)
        return coefficients, uncertainties

For Random Forest:
    # RF builds many decision trees
    def build_tree_ensemble(X, y):
        trees = []
        for i in range(1000):  # Build 1000 trees
            # Each tree learns different patterns
            # (in practice, each is fit to a different random resample of the data)
            tree = build_decision_tree(X, y)
            trees.append(tree)
        return trees

    # Prediction: apply the trained model at every new location
    def generate_predictions(trained_model, new_locations):
        predictions = []
        uncertainties = []
        for location in new_locations:
            # Get environmental conditions at this location
            elevation = location.elevation
            rainfall = location.rainfall
            slope = location.slope
            # Apply learned relationships
            prediction = trained_model.predict([elevation, rainfall, slope])
            uncertainty = trained_model.estimate_uncertainty([elevation, rainfall, slope])
            predictions.append(prediction)
            uncertainties.append(uncertainty)
        return predictions, uncertainties

This process generates soil property estimates for every location where you have environmental data.
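The per-location loop above is conceptual. With scikit-learn-style models, prediction is usually applied to whole arrays of locations at once, and a large grid can be processed in blocks to limit memory use. The following is a minimal sketch, assuming a model whose `predict` accepts `return_std=True` (as `BayesianRidge` does). `predict_in_chunks` and the chunk size are hypothetical choices for illustration, and the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

def predict_in_chunks(model, X, chunk_size=100_000):
    """Predict mean and standard deviation block by block to bound memory use."""
    means, stds = [], []
    for start in range(0, X.shape[0], chunk_size):
        chunk = X[start:start + chunk_size]
        mean, std = model.predict(chunk, return_std=True)
        means.append(mean)
        stds.append(std)
    return np.concatenate(means), np.concatenate(stds)

rng = np.random.default_rng(1)

# Synthetic training samples (illustrative only)
X_train = rng.normal(size=(50, 6))
true_coef = np.array([0.5, -0.3, 0.0, 0.2, 0.0, 0.1])
y_train = X_train @ true_coef + rng.normal(scale=0.1, size=50)
model = BayesianRidge().fit(X_train, y_train)

# Small stand-in for the full prediction grid
X_predict = rng.normal(size=(25_000, 6))
predictions, uncertainties = predict_in_chunks(model, X_predict, chunk_size=10_000)
print(predictions.shape, uncertainties.shape)
```

Chunking changes only how much data is held in memory at once; the results are the same as a single call to `predict` on the full array.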
The Mean Function Models are implemented in two specialized modules that handle the different modeling approaches. Here's how the core components work:
The BLR model is implemented in model_blr.py and provides interpretable linear relationships:
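    import numpy as np
    from sklearn.linear_model import BayesianRidge

    def blr_train(X_train, y_train, logspace=False):
        """Train Bayesian Linear Regression model (simplified listing)."""
        # Note: logspace handling is omitted in this simplified listing
        # Use scikit-learn's BayesianRidge for robust linear regression
        reg = BayesianRidge(tol=1e-6, fit_intercept=True, compute_score=True)
        reg.fit(X_train, y_train)
        # reg.sigma_ is the coefficient covariance matrix, so its diagonal
        # holds variances; the square root gives standard deviations
        coef_uncertainties = np.sqrt(np.diag(reg.sigma_))
        # Remove statistically insignificant features (below 3 standard deviations)
        significant_features = np.abs(reg.coef_) > 3 * coef_uncertainties
        reg.coef_[~significant_features] = 0.0
        return reg

This implementation automatically identifies which environmental variables have statistically significant relationships with your soil property. Coefficients that fail the test are set to zero in this listing.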
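As an illustration of this kind of significance check, the sketch below fits `BayesianRidge` to synthetic data in which only the first two of four covariates influence the target, then flags coefficients whose magnitude exceeds three posterior standard deviations. The data and the choice to apply the 3-sigma threshold to `sqrt(diag(sigma_))` are assumptions for this example; check `model_blr.py` for the exact rule AgReFed-ML applies.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)

# Synthetic data: covariates 0 and 1 matter, covariates 2 and 3 do not
X = rng.normal(size=(200, 4))
y = 1.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.2, size=200)

reg = BayesianRidge(tol=1e-6, fit_intercept=True, compute_score=True)
reg.fit(X, y)

# sigma_ is the posterior covariance of the coefficients;
# its diagonal holds variances, so take the square root
coef_std = np.sqrt(np.diag(reg.sigma_))
significant = np.abs(reg.coef_) > 3 * coef_std

for i, (c, s, keep) in enumerate(zip(reg.coef_, coef_std, significant)):
    print(f"covariate {i}: coef={c:+.3f} std={s:.3f} significant={keep}")
```

The two informative covariates have coefficients many standard deviations away from zero, so they pass the test easily. Uninformative covariates usually fail it, although a 3-sigma rule can occasionally flag one by chance.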
The RF model is implemented in model_rf.py and captures complex non-linear patterns:
    from sklearn.ensemble import RandomForestRegressor

    def rf_train(X_train, y_train):
        """Train Random Forest regression model"""
        # Use 1000 trees with fixed hyperparameters
        rf_reg = RandomForestRegressor(
            n_estimators=1000,    # Number of decision trees
            min_samples_leaf=2,   # At least 2 samples per leaf, to limit overfitting
            max_features=0.3,     # Consider 30% of features at each split
            random_state=42       # Reproducible results
        )
        rf_reg.fit(X_train, y_train)
        return rf_reg

The Random Forest creates an ensemble of 1000 decision trees. Each tree is trained on a bootstrap resample of your data and considers a random 30% of the features at each split.
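One way to inspect what a trained forest has learned is scikit-learn's `feature_importances_` attribute. The sketch below is illustrative only: it uses synthetic data in which the target jumps when the first covariate crosses a threshold, fewer trees (200) for speed, and covariate names attached purely as labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
names = ["elevation", "slope", "aspect", "ndvi", "rainfall", "temperature"]

# Synthetic, non-linear target: a jump when the first covariate exceeds 0
X = rng.normal(size=(200, 6))
y = np.where(X[:, 0] > 0.0, 3.0, 1.0) + rng.normal(scale=0.1, size=200)

rf_reg = RandomForestRegressor(
    n_estimators=200,     # fewer trees than the listing above, for speed
    min_samples_leaf=2,
    max_features=0.3,
    random_state=42,
)
rf_reg.fit(X, y)

# Rank covariates by impurity-based importance (values sum to 1)
importances = rf_reg.feature_importances_
ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name:12s} {importance:.3f}")
```

A linear model would struggle with a threshold effect like this one, whereas the forest attributes most of the importance to the covariate that actually drives the jump.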
Both models provide uncertainty estimates, but through different mechanisms:
BLR Uncertainty (from mathematical theory):
    def blr_predict(X_test, blr_reg):
        """Get predictions with theoretical uncertainties (simplified listing)."""
        # y_std comes from Bayesian theory about parameter uncertainty
        y_pred, y_std = blr_reg.predict(X_test, return_std=True)
        # Simplified: the earlier usage example unpacks a third return value,
        # which is omitted here
        return y_pred, y_std

RF Uncertainty (from ensemble disagreement):
    import numpy as np

    def pred_ints(model, X, percentile=95):
        """Calculate uncertainty from tree ensemble (simplified listing)."""
        # Note: percentile is not used in this simplified listing
        # Get predictions from every tree in the ensemble
        all_predictions = []
        for tree in model.estimators_:
            predictions = tree.predict(X)
            all_predictions.append(predictions)
        # Uncertainty = how much trees disagree (standard deviation across trees)
        stddev = np.std(all_predictions, axis=0)
        return stddev

This gives you two different perspectives on prediction uncertainty: mathematical theory (BLR) versus empirical variability (RF).
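The same tree-disagreement idea can be written more compactly by stacking the per-tree predictions into one array. This sketch uses synthetic data and a smaller forest (100 trees); `tree_spread` is a hypothetical helper name, not a function from `model_rf.py`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def tree_spread(model, X):
    """Return the ensemble mean and the standard deviation across trees."""
    # Shape: (n_trees, n_rows)
    per_tree = np.stack([tree.predict(X) for tree in model.estimators_])
    return per_tree.mean(axis=0), per_tree.std(axis=0)

rng = np.random.default_rng(0)

# Synthetic training data with a non-linear signal (illustrative only)
X_train = rng.uniform(-1.0, 1.0, size=(100, 2))
y_train = np.sin(3.0 * X_train[:, 0]) + rng.normal(scale=0.1, size=100)

rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=2, random_state=42)
rf.fit(X_train, y_train)

X_new = rng.uniform(-1.0, 1.0, size=(5, 2))
mean, stddev = tree_spread(rf, X_new)
print("mean:", np.round(mean, 3))
print("stddev:", np.round(stddev, 3))
```

The mean across trees matches `rf.predict`, since a random forest regressor averages its trees. The standard deviation measures spread between trees and should be read as a relative indicator rather than a calibrated confidence interval.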
The choice between BLR and RF depends on your specific needs:
    def choose_model(data_characteristics):
        if data_characteristics['sample_size'] < 100:
            return 'blr'  # Better for small datasets
        elif data_characteristics['need_interpretability']:
            return 'blr'  # Linear relationships are easy to explain
        elif data_characteristics['complex_terrain']:
            return 'rf'   # Better for complex patterns
        else:
            return 'rf'   # Generally more accurate for large datasets

Both models can be used as mean functions for Gaussian Process Models, which will refine these initial estimates with spatial modeling.
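Under the standard formulation of a Gaussian Process with a mean function, the GP models what the mean function leaves unexplained, that is, the residuals at the sample points. The sketch below only illustrates that hand-off with synthetic data and `BayesianRidge`. How AgReFed-ML actually passes the mean function to its GP is covered in the next chapter and is not assumed here.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)

# Synthetic samples: a linear trend plus a smooth component
# that a linear mean function cannot capture (illustrative only)
X_train = rng.uniform(0.0, 1.0, size=(50, 2))
trend = 2.0 + 1.5 * X_train[:, 0]
unexplained = 0.3 * np.sin(6.0 * X_train[:, 1])
y_train = trend + unexplained + rng.normal(scale=0.05, size=50)

# Stage 1: the mean function captures the main trend
mean_model = BayesianRidge().fit(X_train, y_train)
mean_at_samples = mean_model.predict(X_train)

# Residuals: the part of the signal a second-stage spatial model would work on
residuals = y_train - mean_at_samples
print(f"target std:   {y_train.std():.3f}")
print(f"residual std: {residuals.std():.3f}")
```

The residuals vary less than the raw measurements because the mean function has already accounted for the main trend, which leaves a smaller and smoother signal for the spatial stage.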
Mean Function Models provide essential capabilities for agricultural machine learning:
Mean Function Models are the essential first step in AgReFed-ML's two-stage prediction system. Like skilled sketch artists, they quickly capture the main relationships between soil properties and environmental factors, creating a solid foundation for more sophisticated spatial modeling.
Whether you choose the systematic approach of Bayesian Linear Regression or the pattern-recognition power of Random Forest, these models transform your limited soil samples into comprehensive baseline maps across your entire study area. They identify which environmental factors matter most and provide honest estimates of prediction uncertainty.
These baseline estimates become the foundation for the next stage of modeling sophistication. Ready to see how AgReFed-ML refines these initial sketches into detailed masterpieces? The next chapter covers Gaussian Process Models, where we'll explore how the system adds spatial relationships and fine-scale details to create the most accurate possible soil property maps.