Now that you've learned how the Spatial-Temporal Modeling Framework creates sophisticated 4D predictions that vary across space and time, you're ready to explore a crucial component for testing and validating these complex models: Synthetic Data Generation.
Think of this relationship like the difference between test-driving a car on public roads versus training in a professional driving simulator. The Spatial-Temporal Modeling Framework is like having a sophisticated sports car with advanced features - GPS navigation, adaptive cruise control, and terrain-sensing suspension. But before you take that car onto challenging mountain roads with real consequences, you want to test all its systems in a controlled environment where you know exactly what should happen. That's what Synthetic Data Generation provides: a professional testing laboratory for your soil models.
Imagine you've just developed a sophisticated soil organic carbon model using all the techniques from previous chapters - Mean Function Models, Gaussian Process Models, and Spatial-Temporal Modeling Framework. Now you need to answer critical questions:
Without synthetic data, you face these challenges:
- Unknown ground truth: You never know the "correct" answer to compare against
- Limited test scenarios: Real data only covers the conditions that happened to occur
- Expensive experimentation: Testing different sampling strategies requires expensive field work
- Unclear model behavior: No way to isolate and test specific model components
Synthetic Data Generation solves these problems by creating artificial soil datasets with known properties. It's like having a flight simulator where you can create any weather condition, terrain type, or emergency scenario to test how your "pilot" (model) performs when you know exactly what the right response should be.
Think of Synthetic Data Generation as a sophisticated laboratory that can create realistic but artificial soil datasets. This laboratory has three main capabilities:
The system lets you specify exactly how soil properties relate to environmental factors:
- "Soil carbon increases by 0.2% for every 100m of elevation gain"
- "High rainfall and moderate slope create optimal carbon conditions"
- "Spatial correlation extends 200 meters before properties become independent"
The system creates artificial landscapes with controllable characteristics:
- Random scattered sampling points (like real field sampling)
- Regular grid patterns (like drone or satellite data)
- Any geographic extent and coordinate system you need
The system lets you control how challenging the synthetic data is:
- Simple linear relationships vs. complex interactions
- Low measurement noise vs. realistic field measurement uncertainty
- Strong spatial patterns vs. highly variable conditions
Every synthetic dataset created by AgReFed-ML has four essential components:
These are the artificial environmental conditions that serve as predictors:
# Create synthetic environmental features
features = [
    'elevation',    # Artificial elevation values
    'slope',        # Simulated slope values
    'rainfall',     # Generated precipitation data
    'temperature'   # Synthetic temperature values
]
These features are mathematically generated but designed to behave like real environmental data.
You specify exactly how environmental features relate to soil properties:
# Define the true relationships (what your model should learn)
true_coefficients = {
    'elevation': 0.002,    # +0.2% carbon per 100m elevation
    'rainfall': 0.001,     # +0.1% carbon per 100mm rainfall
    'slope': -0.05,        # -0.05% carbon per degree slope
    'temperature': -0.01   # -0.01% carbon per degree temperature
}
These coefficients define the "ground truth" that your trained models should discover.
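As a small worked check of what these coefficients mean, the sketch below applies them as a plain weighted sum to two hypothetical sites that differ only by 100 m of elevation. The helper `carbon_from_features` and the site values are assumptions made for illustration; they are not part of AgReFed-ML.

```python
# Ground-truth coefficients copied from the example above
true_coefficients = {
    "elevation": 0.002,
    "rainfall": 0.001,
    "slope": -0.05,
    "temperature": -0.01,
}

def carbon_from_features(features, coefficients, baseline=0.0):
    """Return baseline + sum(coefficient * feature value) for one location."""
    return baseline + sum(coefficients[name] * features[name] for name in coefficients)

# Two hypothetical sites that differ only in elevation (300 m vs 400 m)
site_low = {"elevation": 300.0, "rainfall": 600.0, "slope": 2.0, "temperature": 15.0}
site_high = dict(site_low, elevation=400.0)

delta = (carbon_from_features(site_high, true_coefficients)
         - carbon_from_features(site_low, true_coefficients))
print(f"Carbon change for +100 m elevation: {delta:.3f}")
```

The difference comes out at 0.2, matching the "+0.2% carbon per 100m elevation" comment, because every other term cancels between the two sites.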
You can add realistic spatial correlations:
# Add spatial correlation (nearby locations are similar)
spatial_settings = {
    'correlation_length': 200,    # Correlation extends 200 meters
    'correlation_strength': 0.3   # 30% of variation is spatial
}
This makes the synthetic data behave like real soil data, where nearby locations tend to have similar properties.
Add realistic uncertainty to mimic real-world measurement limitations:
This ensures your synthetic data includes the same kind of variability you'd see in real soil measurements.
Let's walk through creating a synthetic soil organic carbon dataset to test your modeling pipeline. We'll create data with known relationships, then see if your models can discover those relationships.
# Settings for synthetic soil carbon data
settings = {
    'n_features': 6,               # Number of environmental variables
    'n_informative_features': 4,   # How many actually matter
    'n_samples': 200,              # Number of soil sample locations
    'model_order': 'quadratic',    # Include feature interactions
    'spatial_correlation': True    # Add realistic spatial patterns
}
This creates a dataset with 6 environmental features where 4 actually influence soil carbon, using 200 sample locations with quadratic relationships and spatial correlation.
# Define the synthetic landscape
landscape_settings = {
    'spatial_size': 1000,                # 1km x 1km study area
    'center': [500000, 4000000],         # UTM coordinates of center
    'coordinate_system': 'EPSG:32633',   # UTM Zone 33N
    'sampling_pattern': 'random'         # Random sample locations
}
This creates a 1-kilometer square study area with randomly distributed soil sample locations.
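To illustrate what these landscape settings describe, the following sketch draws random sample locations inside a square study area centered on the given UTM coordinates. The helper name `sample_random_locations` and the seeded random generator are assumptions made for this example, not functions provided by AgReFed-ML.

```python
import numpy as np

def sample_random_locations(n_samples, spatial_size, center, seed=0):
    """Draw n_samples (easting, northing) pairs uniformly inside a square.

    The square has side length spatial_size (in metres) and is centred on `center`.
    """
    rng = np.random.default_rng(seed)
    offsets = rng.uniform(-spatial_size / 2, spatial_size / 2, size=(n_samples, 2))
    return np.asarray(center, dtype=float) + offsets

coords = sample_random_locations(n_samples=200, spatial_size=1000,
                                 center=[500000, 4000000])
print(coords.shape)
print(coords.min(axis=0), coords.max(axis=0))
```

Every generated point lies within 500 m of the center along each axis, so the printed minimum and maximum should stay inside the 1 km square.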
# Set difficulty and realism levels
complexity_settings = {
    'noise_level': 0.15,           # 15% measurement uncertainty
    'correlation_length': 150,     # Spatial correlation extends 150m
    'correlation_amplitude': 0.2   # 20% of variation is spatial
}
This adds realistic measurement uncertainty and spatial patterns that extend about 150 meters.
from synthgen import gen_synthetic
# Create synthetic soil carbon data with known properties
synthetic_data, true_coefficients, feature_names = gen_synthetic(
    n_features=6,
    n_informative_features=4,
    n_samples=200,
    model_order='quadratic',
    noise=0.15,
    corr_length=150,
    corr_amp=0.2,
    spatialsize=1000
)
This single function call creates a complete synthetic dataset with environmental features, soil carbon values, and spatial coordinates.
Now you can test whether your modeling pipeline can recover the known relationships:
# Train your model on synthetic data
from soilmod_predict import main
# Use the same prediction workflow as with real data
main('settings_synthetic_test.yaml')
# Compare model predictions to known true values
model_accuracy = compare_predictions_to_truth(
    predictions=model_results,
    true_values=synthetic_data['true_carbon']
)
Since you know the true relationships, you can measure exactly how well your model performs.
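The idea of checking a model against known truth can be sketched without the full pipeline. The example below is a minimal stand-in, not the AgReFed-ML workflow: it generates linear synthetic data with NumPy, fits ordinary least squares, and measures how far the estimated coefficients are from the true ones. The coefficient values and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Known ground truth for four standardized features (illustrative values)
true_coefs = np.array([0.8, 0.4, -0.5, -0.2])

# Synthetic features and noisy soil carbon values
X = rng.normal(size=(200, 4))
y = X @ true_coefs + rng.normal(scale=0.15, size=200)

# Ordinary least squares as a simple stand-in for the modeling pipeline
estimated_coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

# Because the truth is known, the recovery error can be measured directly
max_abs_error = float(np.max(np.abs(estimated_coefs - true_coefs)))
print("estimated:", np.round(estimated_coefs, 3))
print("largest coefficient error:", round(max_abs_error, 3))
```

With 200 samples and this noise level, the estimated coefficients should land close to the true values. A large gap would point to a problem in the fitting step rather than in the data.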
When you generate synthetic data, here's the step-by-step process that occurs behind the scenes:
Let's break this down:
The system first creates realistic environmental features:
# Generate environmental features that behave like real data
def create_environmental_features(n_samples, n_features):
    # Create features with realistic ranges and correlations
    features = generate_realistic_features(
        n_samples=n_samples,
        feature_types=['elevation', 'slope', 'rainfall', 'temperature']
    )
    # Normalize features to standard ranges
    normalized_features = scale_to_realistic_ranges(features)
    return normalized_features
This creates environmental data that has the same statistical properties as real environmental data.
The system applies your specified relationships to create soil property values:
# Apply known relationships to generate soil properties
def apply_known_relationships(features, coefficients, model_order):
    if model_order == 'linear':
        # Simple linear relationships
        soil_values = features @ coefficients
    elif model_order == 'quadratic':
        # Include feature interactions
        interactions = create_feature_interactions(features)
        extended_features = combine(features, interactions)
        # coefficients must hold one entry per column of extended_features
        soil_values = extended_features @ coefficients
    else:
        raise ValueError(f"Unsupported model_order: {model_order}")
    return soil_values
This step creates soil property values using exactly the relationships you specified.
The system adds realistic spatial patterns:
# Add spatial correlation between nearby locations
def add_spatial_correlation(soil_values, coordinates, correlation_params):
    # Calculate distances between all sample locations
    distance_matrix = calculate_distances(coordinates)
    # Create spatial correlation kernel
    correlation_kernel = exponential_kernel(
        distances=distance_matrix,
        length_scale=correlation_params['length'],
        amplitude=correlation_params['amplitude']
    )
    # Apply spatial smoothing
    spatially_correlated = apply_kernel(soil_values, correlation_kernel)
    return spatially_correlated
This makes nearby locations have similar soil values, just like real soil data.
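The helpers in the outline above are not defined in this chapter, so the following is one possible runnable interpretation rather than the actual implementation. It assumes an exponential kernel with row-normalized weights, and it treats the amplitude as a blending factor between the original and the smoothed values. Both choices are assumptions made for this sketch.

```python
import numpy as np

def add_spatial_correlation(values, coordinates, length_scale, amplitude):
    """Blend each value with a distance-weighted average of all values.

    amplitude=0 returns the input unchanged; amplitude=1 returns the
    fully smoothed field.
    """
    # Pairwise Euclidean distances between all sample locations
    diffs = coordinates[:, None, :] - coordinates[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=-1))
    # Exponential kernel: weight decays with distance
    kernel = np.exp(-distances / length_scale)
    # Normalize rows so each smoothed value is a weighted average
    weights = kernel / kernel.sum(axis=1, keepdims=True)
    return (1 - amplitude) * values + amplitude * (weights @ values)

rng = np.random.default_rng(1)
coordinates = rng.uniform(0, 1000, size=(200, 2))
values = rng.normal(size=200)
smoothed = add_spatial_correlation(values, coordinates,
                                   length_scale=150, amplitude=0.3)
print(values.std(), smoothed.std())
```

Because each output is partly an average over neighbors, the smoothed field should show less point-to-point variation than the independent input values, and a constant field passes through unchanged.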
Finally, the system adds measurement uncertainty and formats the output:
# Add realistic measurement noise
def add_measurement_noise(clean_values, noise_level):
    noise = generate_random_noise(
        size=len(clean_values),
        standard_deviation=noise_level
    )
    noisy_values = clean_values + noise
    return noisy_values

# Format as standard soil dataset
def format_as_soil_dataset(features, soil_values, coordinates):
    # Combine everything into a standard data format
    dataset = create_dataframe(
        environmental_features=features,
        soil_measurements=soil_values,
        spatial_coordinates=coordinates
    )
    return dataset
This creates a synthetic dataset that looks exactly like real soil data but with known ground truth.
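A runnable version of the noise step might look like the sketch below. It assumes that `noise_level` is the absolute standard deviation of zero-mean Gaussian noise and adds a `seed` parameter for reproducibility. Both are assumptions for illustration, and the actual noise model in AgReFed-ML may differ.

```python
import numpy as np

def add_measurement_noise(clean_values, noise_level, seed=0):
    """Add zero-mean Gaussian noise with standard deviation noise_level."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=noise_level, size=len(clean_values))
    return clean_values + noise

# Noise-free values, then the same values with simulated measurement error
clean = np.linspace(1.0, 3.0, 5000)
noisy = add_measurement_noise(clean, noise_level=0.15)

# The spread of the residuals should be close to the requested noise level
residual_std = float(np.std(noisy - clean))
print(f"requested noise: 0.15, observed: {residual_std:.3f}")
```

Comparing the observed residual spread to the requested level is a quick sanity check that the noise was applied as intended.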
The Synthetic Data Generation system is implemented in the synthgen.py file with several sophisticated components working together. Here's how the core functionality works:
The gen_synthetic() function orchestrates the entire process:
from sklearn.datasets import make_regression

def gen_synthetic(n_features, n_informative_features=10, n_samples=200,
                  model_order='quadratic', correlated=False, noise=0.1,
                  corr_length=10, corr_amp=0.2, spatialsize=100):
    """Generate synthetic soil datasets with known properties"""
    # Generate base regression features
    # coef=True makes make_regression also return the true coefficients
    X_features, y_values, true_coefficients = make_regression(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=n_informative_features,
        noise=noise,
        coef=True,
        random_state=42
    )
    # ... soil-specific enhancements omitted in this excerpt ...
    return synthetic_dataset, true_coefficients, feature_names
This uses scikit-learn's regression data generator as a foundation, then adds soil-specific enhancements.
The system adds realistic spatial patterns using a squared-exponential kernel:
def create_kernel_expsquared(distance_matrix, gamma):
    """Create squared-exponential (Gaussian) kernel for spatial correlation"""
    # Correlation decays with the squared distance, scaled by gamma
    return np.exp(-distance_matrix**2 / (2 * gamma**2))

# Apply spatial correlation to soil values
if corr_amp > 0 and corr_length > 0:
    distances = calculate_pairwise_distances(coordinates)
    spatial_kernel = create_kernel_expsquared(distances, corr_length)
    # Add spatially correlated component
    spatial_component = spatial_kernel @ base_soil_values
    final_soil_values = base_soil_values + corr_amp * spatial_component
This creates realistic spatial patterns where nearby locations have similar soil properties.
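To get a feel for how the correlation length controls the kernel, the sketch below evaluates the same kernel formula at a few separation distances. The chosen distances and the 150 m correlation length are illustrative assumptions.

```python
import numpy as np

def create_kernel_expsquared(distance_matrix, gamma):
    """Squared-exponential kernel, same formula as in the excerpt above."""
    return np.exp(-distance_matrix**2 / (2 * gamma**2))

corr_length = 150.0
# Separation distances in metres: 0, 0.5, 1, 2, and 3 correlation lengths
distances = np.array([0.0, 75.0, 150.0, 300.0, 450.0])
weights = create_kernel_expsquared(distances, corr_length)

for d, w in zip(distances, weights):
    print(f"{d:6.0f} m -> weight {w:.3f}")
```

The weight is 1 at zero distance, about 0.61 at one correlation length, and about 0.01 at three correlation lengths, so locations several correlation lengths apart contribute almost nothing to each other.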
The system supports different relationship complexities:
import itertools

# Linear relationships: y = a₁×x₁ + a₂×x₂ + ...
if model_order == 'linear':
    soil_values = features @ coefficients
# Quadratic relationships: includes x₁×x₂, x₁², etc.
elif model_order == 'quadratic':
    # Add all pairwise interactions
    interactions = []
    for i, j in itertools.combinations(features.T, 2):
        interactions.append(i * j)
    # Stack the interaction columns into an (n_samples, n_pairs) array
    interactions = np.column_stack(interactions)
    # Add squared terms
    squared_terms = features**2
    # Combine all terms
    extended_features = np.hstack([features, interactions, squared_terms])
    soil_values = extended_features @ extended_coefficients
This allows you to test models against different levels of complexity.
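To make the size of the quadratic expansion concrete, here is a small self-contained sketch on a toy matrix. The helper name `expand_quadratic` is hypothetical, and the function assumes at least two feature columns.

```python
import itertools
import numpy as np

def expand_quadratic(features):
    """Return [original | pairwise products | squares] column-wise.

    Assumes features has shape (n_samples, n_features) with n_features >= 2.
    """
    pairs = itertools.combinations(range(features.shape[1]), 2)
    interactions = np.column_stack(
        [features[:, i] * features[:, j] for i, j in pairs]
    )
    return np.hstack([features, interactions, features ** 2])

# Toy input: 2 samples, 3 features
features = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])
extended = expand_quadratic(features)
print(extended.shape)
print(extended[0])
```

For f features the expansion has f + f(f-1)/2 + f columns: 9 for the three toy features here, and 27 for the six features used in the walkthrough. The coefficient vector must have the same length.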
The system supports both regular grids and random sampling patterns:
def create_sample_locations(n_samples, spatial_size, pattern='random'):
    """Create spatial coordinates for synthetic samples"""
    if pattern == 'grid':
        # Regular grid pattern
        grid_size = int(np.sqrt(n_samples))
        x, y = np.meshgrid(
            np.linspace(-spatial_size/2, spatial_size/2, grid_size),
            np.linspace(-spatial_size/2, spatial_size/2, grid_size)
        )
        coordinates = np.column_stack([x.flatten(), y.flatten()])
    elif pattern == 'random':
        # Random scattered points
        coordinates = np.random.uniform(
            -spatial_size/2, spatial_size/2,
            (n_samples, 2)
        )
    return coordinates
This flexibility lets you test how sampling patterns affect model performance.
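The two patterns can be compared side by side with a self-contained variant of the function. The `seed` parameter and the `ValueError` for unknown patterns are additions assumed for this sketch and are not taken from synthgen.py.

```python
import numpy as np

def create_sample_locations(n_samples, spatial_size, pattern="random", seed=0):
    """Create (x, y) coordinates centred on the origin."""
    half = spatial_size / 2
    if pattern == "grid":
        # Largest square grid that fits within n_samples points
        grid_size = int(np.sqrt(n_samples))
        axis = np.linspace(-half, half, grid_size)
        x, y = np.meshgrid(axis, axis)
        return np.column_stack([x.ravel(), y.ravel()])
    if pattern == "random":
        rng = np.random.default_rng(seed)
        return rng.uniform(-half, half, size=(n_samples, 2))
    raise ValueError(f"unknown pattern: {pattern!r}")

grid = create_sample_locations(200, 1000, pattern="grid")
scattered = create_sample_locations(200, 1000, pattern="random")
print(grid.shape, scattered.shape)
```

Note that the grid pattern rounds down to a square layout: requesting 200 samples gives a 14 x 14 grid of 196 points, while the random pattern returns exactly 200.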
Synthetic Data Generation provides essential capabilities for developing robust soil models:
Synthetic Data Generation transforms model development from educated guesswork into rigorous scientific testing. Like having a professional flight simulator for training pilots, it provides a controlled environment where you can test every aspect of your soil modeling pipeline with complete knowledge of what the correct results should be.
By creating artificial datasets with known relationships, realistic spatial patterns, and controllable complexity, the system enables systematic testing and validation of all the sophisticated techniques you've learned: Mean Function Models, Gaussian Process Models, Uncertainty Quantification, and Spatial-Temporal Modeling.
These rigorously tested models become the foundation for comprehensive performance evaluation on real data. Ready to explore how AgReFed-ML systematically evaluates model accuracy and reliability? The next chapter covers Model Evaluation and Cross-Validation, where you'll learn how the system provides honest assessments of model performance and helps you choose the best approaches for your specific agricultural applications.