# For working with shapefiles (packaged is called pyshp)
import shapefile
# For working with dataframes
import pandas as pd
Machine Learning (ML) From Scratch
Questions
- How can I use Python for machine learning?
- How do I wrangle my data to work within an ML context?
- How do I assess whether my models fit well?
Objectives
- Use scikit-learn to solve a machine learning problem
Most machine learning problems begin with a dataset, but before we can perform any kind of inference on that dataset we must create/wrangle/build it. This is often the most time-consuming and difficult part of a successful machine learning workflow. There is no set procedure here, as all data are different, but there are a few simple steps we can take to make a useful dataset.
Machine learning can be split into:
Supervised learning, where we are trying to predict:
- A numeric value such as a stock price, mineral quantity at a specific site etc. This is called “regression”.
- A categorical label, such as having/not having a specific disease, a response (such as “low”, “medium”, “high”) etc. This is called “classification”.
Unsupervised learning, where we are trying to find groups in our data without specifying a particular outcome to split the dataset along.
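As a concrete (and completely toy) illustration of the two supervised modes, the scikit-learn calls look the same in both cases; only the estimator and the type of target change. A minimal sketch with invented numbers, not the workshop data:

# Toy illustration of regression vs classification in scikit-learn
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # one predictor variable

# Regression: the target is a numeric value
y_numeric = np.array([1.1, 1.9, 3.2, 3.9])
print(LinearRegression().fit(X, y_numeric).predict([[5.0]]))

# Classification: the target is a categorical label
y_label = np.array([0, 0, 1, 1])
print(LogisticRegression().fit(X, y_label).predict([[5.0]]))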
We will be using data from a submitted manuscript (Butterworth and Barnett-Moore 2020), which was a finalist in the Unearthed ExploreSA: Gawler Challenge. You can visit the original repo here.
Building a dataset of “targets” and “predictor variables”
The targets in a supervised ML context can be a simple binary 1 or 0, or could be some category (classification), or the value of a particular parameter (regression problems). It is the property of the dataset that we want to learn to predict!
The “predictor/feature variables” are the qualities/parameters that may have some causal relationship with the “target”.
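In code terms, this means we are aiming for a table where each row is a location, the predictor columns describe it, and one column holds the target. A toy sketch of that layout (invented values, just to show the shape of things to come):

# Toy layout: predictor columns plus a target column
toy = pd.DataFrame({
    'gravity': [1.8, -12.7, -6.2],           # predictor variable
    'fault_distance': [0.01, 0.10, 0.66],    # predictor variable
    'deposit': [1, 0, 1]                     # target: 1 = deposit, 0 = no deposit
})
X = toy.drop(columns='deposit')   # the predictor/feature variables
y = toy['deposit']                # the target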
Step 1 - Determine our target variable
Let’s explore our main dataset.
Deposit locations - mine and mineral occurrences
The most important dataset for this workflow is the currently known locations of mineral occurrences. Using the data we already know about these mineral deposits we will build a model to predict where future occurrences will be.
# Set the filename
="../data/MinesMinerals/mines_and_mineral_occurrences_all.shp"
mineshape
# Set shapefile attributes and assign
= shapefile.Reader(mineshape)
sf = [x[0] for x in sf.fields][1:]
fields = sf.records()
records = [s.points for s in sf.shapes()]
shps
# Write into a dataframe for easy use
= pd.DataFrame(columns=fields, data=records) df
View the metadata of the South Australian all mines and mineral deposits dataset to get a better understanding of what features we could use as a target.
#See what the dataframe looks like
print(df.columns)
#For clean printing to html drop columns that contain annoying / and \ chars.
#And set max columns
pd.options.display.max_columns = 8
df.drop(columns=['REFERENCE','O_MAP_SYMB'])
Index(['MINDEP_NO', 'DEP_NAME', 'REFERENCE', 'COMM_CODE', 'COMMODS',
'COMMOD_MAJ', 'COMM_SPECS', 'GCHEM_ASSC', 'DISC_YEAR', 'CLASS_CODE',
'OPER_TYPE', 'MAP_SYMB', 'STATUS_VAL', 'SIZE_VAL', 'GEOL_PROV',
'DB_RES_RVE', 'DB_PROD', 'DB_DOC_IMG', 'DB_EXV_IMG', 'DB_DEP_IMG',
'DB_DEP_FLE', 'COX_CLASS', 'REG_O_CTRL', 'LOC_O_CTRL', 'LOC_O_COM',
'O_LITH_CDE', 'O_LITH01', 'O_STRAT_NM', 'H_LITH_CDE', 'H_LITH02',
'H_STRAT_NM', 'H_MAP_SYMB', 'EASTING', 'NORTHING', 'ZONE', 'LONGITUDE',
'LATITUDE', 'SVY_METHOD', 'HORZ_ACC', 'SRCE_MAP', 'SRCE_CNTRE',
'COMMENTS', 'O_MAP_SYMB'],
dtype='object')
MINDEP_NO | DEP_NAME | COMM_CODE | COMMODS | ... | HORZ_ACC | SRCE_MAP | SRCE_CNTRE | COMMENTS | |
---|---|---|---|---|---|---|---|---|---|
0 | 5219 | MOUNT DAVIES NO.2A | Ni | Nickel | ... | 2000.0 | 500k meis | ||
1 | 52 | ONE STONE | Ni | Nickel | ... | 500.0 | 71-385 | ||
2 | 8314 | HINCKLEY RANGE | Fe | Iron | ... | 500.0 | |||
3 | 69 | KALKA | V, ILM | Vanadium, Ilmenite | ... | 100.0 | 1 MILE | mgt polygon on digital map | |
4 | 65 | ECHIDNA | Ni | Nickel | ... | 20.0 | 50K GEOL | DH ECHIDNA PROSPECT | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8672 | 6937 | YARINGA | QTZE | Quartzite | ... | 200.0 | 50k moc | fenced yard | |
8673 | 4729 | WELCHS | SCHT | Schist | ... | 20.0 | 50k topo | ||
8674 | 4718 | ARCADIAN | CLAY | Clay | ... | 5.0 | Plan 1951-0327 | Pit | |
8675 | 1436 | MCDONALD | Au | Gold | ... | 200.0 | 50k moc | qz float | |
8676 | 8934 | FAIRFIELD FARM | SAND | Sand | ... | 20.0 | pit |
8677 rows × 41 columns
#We are building a model to target South Australia, so load in a map of it.
= "../data/SA/SA_STATE_POLYGON_shp"
gawlshape = shapefile.Reader(gawlshape)
shapeRead = shapeRead.shapes()
shapes
#Save the boundary xy pairs in arrays we will use throughout the workflow
= [x[0] for x in shapes[1].points]
xval = [x[1] for x in shapes[1].points] yval
# Subset the data, for a single Mineral target
commname = 'Mn'

#Pull out all the occurrences of the commodity and go from there
comm = df[df['COMM_CODE'].str.contains(commname)]
comm = comm.reset_index(drop=True)
print("Shape of " + commname, comm.shape)
# Can make further subsets of the data here if needed
#commsig=comm[comm.SIZE_VAL!="Low Significance"]
#comm=comm[comm.SIZE_VAL!="Low Significance"]
#comm=comm[comm.COX_CLASS == "Olympic Dam Cu-U-Au"]
#comm=comm[(comm.lon<max(xval)) & (comm.lon>min(xval)) & (comm.lat>min(yval)) & (comm.lat<max(yval))]
Shape of Mn (115, 43)
# For plotting
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(6,6))
ax = plt.axes()
ax.plot(df.LONGITUDE,df.LATITUDE,'b.',label="All Mineral Deposits")
ax.plot(comm.LONGITUDE,comm.LATITUDE,'yx',label=commname+" Deposits")
ax.plot(xval,yval,'grey',linestyle='--',linewidth=1,label='SA')
#ax.plot(comm.LONGITUDE, comm.LATITUDE, marker='o', linestyle='',markersize=5, color='y',label=commname+" Deposits")

plt.xlim(128.5,141.5)
plt.ylim(-38.5,-25.5)
plt.legend(loc=3)
plt.show()
Step 2 - Wrangle the geophysical and geological datasets (predictor variables)
Many geophysical datasets are available for South Australia that overlap our target mineral locations. We may presume that certain mineral occurrences are expressed as a particular combination of geology and geophysics. We can train an algorithm to learn these associations and then use the same algorithm to make predictions for where unknown occurrences may be found.
Here we load in the (slightly) pre-processed geophysical datasets and prepare them for further manipulations, data-mining, and machine learning. All of the full/raw datasets are available from https://map.sarig.sa.gov.au/. For this exercise we have simplified the datasets by reducing complexity and resolution. Grab additional processed datasets from https://github.com/natbutter/gawler-exploration/tree/master/ML-DATA
Resistivity xyz data
#Read in the data
data_res = pd.read_csv(
    "../data/AusLAMP_MT_Gawler_25.xyzr",
    sep=',', header=0, names=['lat','lon','depth','resistivity']
)
data_res
lat | lon | depth | resistivity | |
---|---|---|---|---|
0 | -27.363931 | 128.680796 | -25.0 | 2.0007 |
1 | -27.659362 | 128.662322 | -25.0 | 1.9979 |
2 | -27.886602 | 128.647965 | -25.0 | 1.9948 |
3 | -28.061394 | 128.636833 | -25.0 | 1.9918 |
4 | -28.195844 | 128.628217 | -25.0 | 1.9885 |
... | ... | ... | ... | ... |
11003 | -35.127716 | 142.399588 | -25.0 | 2.0079 |
11004 | -35.230939 | 142.408396 | -25.0 | 2.0084 |
11005 | -35.365124 | 142.419903 | -25.0 | 2.0085 |
11006 | -35.539556 | 142.434958 | -25.0 | 2.0076 |
11007 | -35.766303 | 142.454694 | -25.0 | 2.0049 |
11008 rows × 4 columns
This data gives the lat-lon spatial location and the value of the feature (resistivity) at each location.
fig = plt.figure(figsize=(6,6))
ax = plt.axes()
im = ax.scatter(data_res.lon,data_res.lat,s=4,c=data_res.resistivity,cmap="jet")
ax.plot(xval,yval,'grey',linestyle='--',linewidth=1,label='SA')
ax.plot(comm.LONGITUDE, comm.LATITUDE, marker='x', linestyle='',markersize=5, color='y',label=commname+" Deposits")

plt.xlim(128.5,141.5)
plt.ylim(-38.5,-25.5)
plt.legend(loc=3)

cbaxes = fig.add_axes([0.40, 0.18, 0.2, 0.015])
cbar = plt.colorbar(im, cax=cbaxes, orientation="horizontal", extend='both')
cbar.set_label('Resistivity $\Omega$.m', labelpad=10)
cbar.ax.xaxis.set_label_position('top')
plt.show()
Faults and dykes vector polylines
# For dealing with arrays
import numpy as np
#Get the Neoproterozoic fault data
faultshape = "../data/Faults/Faults.shp"
shapeRead = shapefile.Reader(faultshape)
shapes = shapeRead.shapes()
Nshp = len(shapes)

# Initialise an empty list
faultsNeo = []
for i in range(0,Nshp):
    for j in shapes[i].points:
        faultsNeo.append([j[0],j[1]])

#Convert list to array
faultsNeo = np.array(faultsNeo)
faultsNeo
array([[133.46269605, -27.41825034],
[133.46770683, -27.42062991],
[133.4723624 , -27.42259841],
...,
[138.44613353, -35.36560605],
[138.44160669, -35.36672662],
[138.43805501, -35.36793484]])
This data is just a set of lat-lon locations along each fault. Think about how we can use this in a model.
fig = plt.figure(figsize=(6,6))
ax = plt.axes()
plt.plot(faultsNeo[:,0],faultsNeo[:,1],'.',markersize=0.1,label="Neoproterozoic-Faults")
ax.plot(xval,yval,'grey',linestyle='--',linewidth=1,label='SA')
ax.plot(comm.LONGITUDE, comm.LATITUDE, marker='x', linestyle='',markersize=5, color='y',label=commname+" Deposits")

plt.xlim(128.5,141.5)
plt.ylim(-38.5,-25.5)
plt.legend(loc=3)
plt.show()
Netcdf formatted raster grids - geophysics
# For timing events
import time
# For making grids and reading netcdf data
import scipy
import scipy.io
#Define a function to read the netcdf files
def read_netcdf(filename):
    tic = time.time()
    data = scipy.io.netcdf_file(filename, 'r', mmap=False)
    x_data = data.variables['lon'][:]
    y_data = data.variables['lat'][:]
    z_data = np.array(data.variables['Band1'][:])
    data.close()

    toc = time.time()
    print("Loaded", filename, "in", f'{toc - tic:.2f}s')
    print("Spacing x", f'{x_data[2] - x_data[1]:.2f}',
          "y", f'{y_data[2] - y_data[1]:.2f}',
          "Shape:", np.shape(z_data), "Min x:", np.min(x_data), "Max x:", np.max(x_data),
          "Min y:", np.min(y_data), f'Max y {np.max(y_data):.2f}')

    data_dic = {
        'x_data': x_data,
        'y_data': y_data,
        'z_data': z_data,
        'origin_x': np.min(x_data),
        'origin_y': np.min(y_data),
        'pixel_x': x_data[2] - x_data[1],
        'pixel_y': y_data[2] - y_data[1]
    }
    return data_dic
# Digital Elevation Model
= read_netcdf("../data/sa-dem.nc")
data_dem # Total Magnetic Intensity
= read_netcdf("../data/sa-mag-tmi.nc")
data_mag # Gravity
= read_netcdf("../data/sa-grav.nc") data_grav
Loaded ../data/sa-dem.nc in 0.01s
Spacing x 0.01 y 0.01 Shape: (1208, 1201) Min x: 129.005 Max x: 141.005 Min y: -38.065 Max y -25.99
Loaded ../data/sa-mag-tmi.nc in 0.01s
Spacing x 0.01 y 0.01 Shape: (1208, 1201) Min x: 129.005 Max x: 141.005 Min y: -38.065 Max y -25.99
Loaded ../data/sa-grav.nc in 0.01s
Spacing x 0.01 y 0.01 Shape: (1208, 1201) Min x: 129.005 Max x: 141.005 Min y: -38.065 Max y -25.99
fig = plt.figure(figsize=(6,6))
ax = plt.axes()
im = plt.pcolormesh(data_dem["x_data"],data_dem["y_data"],data_dem["z_data"],cmap='Greys',shading='auto')
ax.plot(xval,yval,'grey',linestyle='--',linewidth=1,label='SA')
ax.plot(comm.LONGITUDE, comm.LATITUDE, marker='x', linestyle='',markersize=5, color='y',label=commname+" Deposits")

plt.xlim(128.5,141.5)
plt.ylim(-38.5,-25.5)
plt.legend(loc=3)

cbaxes = fig.add_axes([0.40, 0.18, 0.2, 0.015])
cbar = plt.colorbar(im, cax=cbaxes, orientation="horizontal", extend='both')
cbar.set_label('DEM (m)', labelpad=10)
cbar.ax.xaxis.set_label_position('top')
plt.show()
These data are raster grids. Essentially Lat-Lon-Value like the XYZ data, but represented in a different format.
Categorical Geology in vector polygons
#Archean basement geology
= shapefile.Reader("../data/Archaean_Early_Mesoprterzoic_polygons_shp/geology_archaean.shp")
geolshape = geolshape.records()
recsArch = geolshape.shapes() shapesArch
# Print the field names in the shapefile
for i,field in enumerate(geolshape.fields):
    print(i-1,field[0])
-1 DeletionFlag
0 MAJORSTRAT
1 SG_DESCRIP
2 MAPUNIT
3 SG_PROVINC
4 DOMAIN
5 AGE
6 SEQUSET
7 PRIMARYAGE
8 OROGENYAGE
9 INHERITAGE
10 STRATNO
11 STRATNAME
12 STRATDESC
13 GISCODE
14 SUBDIVNAME
15 SUBDIVSYMB
16 PROVINCE
17 MAXAGE
18 MAXMOD
19 MAXMETH
20 MINAGE
21 MINMOD
22 MINMETH
23 GLCODE
fig = plt.figure(figsize=(8,8))
ax = plt.axes()

#index of the geology unit #4 #10 #12
geoindex = 4

#Gather all the unique Major Geology unit numbers
labs = []
for i in recsArch:
    labs.append(i[geoindex])
geols = list(set(labs))

# Create a unique color for each geological unit label
color = plt.cm.tab20(np.linspace(0, 1, len(geols)))
cdict = {}
for i, geol in enumerate(geols):
    cdict.update({geol:color[i]})

#Plot each of the geology polygons
for i in range(len(shapesArch)):
    boundary = shapesArch[i].points
    xs = [x for x, y in shapesArch[i].points]
    ys = [y for x, y in shapesArch[i].points]
    c = cdict[recsArch[i][geoindex]]
    ax.fill(xs,ys,c=c,label=recsArch[i][geoindex])

#Plot the extra stuff
ax.plot(xval,yval,'grey',linestyle='--',linewidth=1,label='SA')

ax.plot(
    comm.LONGITUDE,
    comm.LATITUDE,
    marker='s',
    markeredgecolor='k',
    linestyle='',
    markersize=4,
    color='y',
    label=commname+" Deposits"
)

#Legend without duplicate values
handles, labels = ax.get_legend_handles_labels()
unique = [(h, l) for i, (h, l) in enumerate(zip(handles, labels)) if l not in labels[:i]]
ax.legend(*zip(*unique), bbox_to_anchor=(1.02, 1.01), ncol=3)

plt.xlim(128.5,141.5)
plt.ylim(-38.5,-25.5)
plt.show()
Take a moment to appreciate the various methods you have used just to load the data!
Now we need to think about what we actually want to achieve. What is our goal here? This will determine what kind of data analysis/manipulation we need to do. Consider the flow diagram for choosing the right machine learning method.
Step 3 - Assign geophys values to target locations
We need to assign the values of each of these geophysical datasets (predictor variables) to the target class (i.e. the mineral deposit locations). The assumption is that the occurrence of some mineral deposit (e.g. Cu) is a function of x1, x2, x3, x4, x5, x6, where x1 is the resistivity, x2 the distance to a Neoproterozoic fault, x3, x4, and x5 the DEM, magnetic TMI, and gravity values, and x6 the geological basement unit.
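Written as a formula, the assumption is simply that the chance of a deposit at a location is some unknown function of the predictors, $P(\mathrm{deposit}) \approx f(x_1, x_2, x_3, x_4, x_5, x_6)$, and the job of the ML model is to learn $f$ from the known deposit and non-deposit locations.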
# Make a Target DataFrame of the points we want to interrogate the features for
td1 = comm[['LONGITUDE', 'LATITUDE']].copy()
Resistivity
# For making KD Trees
import scipy.spatial
# Define a function which "coregisters" a point from a bunch of other points.
def coregPoint(tree,point,region,retval='index'):
    '''
    Finds the nearest neighbour to a point from a bunch of other points
    tree - a scipy cKDTree to search for the point over
    point - array([longitude,latitude])
    region - integer, same units as data
    '''
    dists, indexes = tree.query(point,k=1,distance_upper_bound=region)

    if retval=='index':
        return(indexes)
    elif retval=='dists':
        return(dists)
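To see what coregPoint returns, here is a tiny self-contained check on three made-up coordinates (not the real data): querying a point returns either the index of its nearest neighbour in the tree, or the distance to it.

# Toy check of coregPoint with invented coordinates
toy_points = np.array([[130.0, -30.0], [135.0, -32.0], [140.0, -35.0]])
toy_tree = scipy.spatial.cKDTree(toy_points)

print(coregPoint(toy_tree, np.array([134.0, -31.0]), region=10, retval='index'))  # nearest point is index 1
print(coregPoint(toy_tree, np.array([134.0, -31.0]), region=10, retval='dists'))  # about 1.41 degrees away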
# Find the values of the resistivity grid for each lat/lon deposit location.
# Make a search-tree of the point-pairs for fast lookup of nearest matches
treeres = scipy.spatial.cKDTree(np.c_[data_res.lon,data_res.lat])

# Perform the search for each point
indexes = td1.apply(
    lambda x: coregPoint(treeres,np.array([x.LONGITUDE, x.LATITUDE]),1,retval='index'), axis=1)
td1['res'] = data_res.loc[indexes].resistivity.values
td1
LONGITUDE | LATITUDE | res | |
---|---|---|---|
0 | 139.179436 | -29.877637 | 2.2135 |
1 | 138.808767 | -30.086296 | 2.3643 |
2 | 138.752281 | -30.445684 | 2.1141 |
3 | 138.530506 | -30.533225 | 2.2234 |
4 | 138.887019 | -30.565479 | 2.1982 |
... | ... | ... | ... |
110 | 136.059715 | -34.327929 | 3.4926 |
111 | 138.016821 | -35.733084 | 2.0868 |
112 | 139.250036 | -34.250155 | 1.9811 |
113 | 135.905480 | -34.425866 | 2.7108 |
114 | 135.835578 | -34.509779 | 3.1224 |
115 rows × 3 columns
Faults
#Same for the fault data
# but this time we get the "distance to the point", rather than the value at that point.
treefaults = scipy.spatial.cKDTree(faultsNeo)

dists = td1.apply(
    lambda x: coregPoint(treefaults,np.array([x.LONGITUDE, x.LATITUDE]),100,retval='dists'), axis=1)
td1['faults'] = dists
td1
LONGITUDE | LATITUDE | res | faults | |
---|---|---|---|---|
0 | 139.179436 | -29.877637 | 2.2135 | 0.010691 |
1 | 138.808767 | -30.086296 | 2.3643 | 0.103741 |
2 | 138.752281 | -30.445684 | 2.1141 | 0.006659 |
3 | 138.530506 | -30.533225 | 2.2234 | 0.013925 |
4 | 138.887019 | -30.565479 | 2.1982 | 0.007356 |
... | ... | ... | ... | ... |
110 | 136.059715 | -34.327929 | 3.4926 | 0.526835 |
111 | 138.016821 | -35.733084 | 2.0868 | 0.002451 |
112 | 139.250036 | -34.250155 | 1.9811 | 0.027837 |
113 | 135.905480 | -34.425866 | 2.7108 | 0.670323 |
114 | 135.835578 | -34.509779 | 3.1224 | 0.776152 |
115 rows × 4 columns
Geophysics
# Define a function which "coregisters" a point within a raster.
def get_coords_at_point(originx,originy,pixelx,pixely,lon,lat):
    '''
    Given a point in some coordinate reference (e.g. lat/lon)
    find the closest point to that in an array (e.g. a raster)
    and return the index location of that point in the raster.
    INPUTS
    (output from "gdal_data.GetGeoTransform()")
    originx: first point in first axis
    originy: first point in second axis
    pixelx: difference between x points
    pixely: difference between y points
    lon: x/row-coordinate of interest
    lat: y/column-coordinate of interest
    RETURNS
    col: x index value from the raster
    row: y index value from the raster
    '''
    row = int((lon - originx)/pixelx)
    col = int((lat - originy)/pixely)

    return (col, row)
# Pass an entire array of lat/lon points and the raster info through get_coords_at_point
def rastersearch(latlon,raster,originx,originy,pixelx,pixely):
    zlist = []
    for lon,lat in zip(latlon.LONGITUDE,latlon.LATITUDE):
        try:
            zlist.append(raster[get_coords_at_point(originx,originy,pixelx,pixely,lon,lat)])
        except:
            zlist.append(np.nan)

    return(zlist)
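As a quick worked example of the index arithmetic (using the DEM metadata printed earlier: origin roughly 129.005, -38.065 and ~0.01 degree spacing), a point at lon 130.0, lat -30.0 maps to row int((130.0 - 129.005)/0.01) = 99 and col int((-30.0 - -38.065)/0.01) = 806, so the lookup is approximately:

# Worked example of the raster index lookup (exact indices depend on the grid spacing)
idx = get_coords_at_point(data_dem["origin_x"], data_dem["origin_y"],
                          data_dem["pixel_x"], data_dem["pixel_y"], 130.0, -30.0)
print(idx)                      # roughly (806, 99)
print(data_dem["z_data"][idx])  # the DEM value in the grid cell nearest that point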
td1['dem'] = rastersearch(td1,data_dem["z_data"],data_dem["origin_x"],data_dem["origin_y"],data_dem["pixel_x"],data_dem["pixel_y"])
td1['mag'] = rastersearch(td1,data_mag["z_data"],data_mag["origin_x"],data_mag["origin_y"],data_mag["pixel_x"],data_mag["pixel_y"])
td1['grav'] = rastersearch(td1,data_grav["z_data"],data_grav["origin_x"],data_grav["origin_y"],data_grav["pixel_x"],data_grav["pixel_y"])

td1
LONGITUDE | LATITUDE | res | faults | dem | mag | grav | |
---|---|---|---|---|---|---|---|
0 | 139.179436 | -29.877637 | 2.2135 | 0.010691 | 187.297424 | -118.074890 | 1.852599 |
1 | 138.808767 | -30.086296 | 2.3643 | 0.103741 | 179.499237 | -209.410507 | -12.722121 |
2 | 138.752281 | -30.445684 | 2.1141 | 0.006659 | 398.336823 | -159.566422 | -6.249788 |
3 | 138.530506 | -30.533225 | 2.2234 | 0.013925 | 335.983429 | -131.176437 | -11.665316 |
4 | 138.887019 | -30.565479 | 2.1982 | 0.007356 | 554.278198 | -192.363297 | -1.025702 |
... | ... | ... | ... | ... | ... | ... | ... |
110 | 136.059715 | -34.327929 | 3.4926 | 0.526835 | 45.866119 | -244.067841 | 11.410070 |
111 | 138.016821 | -35.733084 | 2.0868 | 0.002451 | 145.452789 | -203.566940 | 18.458364 |
112 | 139.250036 | -34.250155 | 1.9811 | 0.027837 | 276.489319 | -172.889587 | -1.714886 |
113 | 135.905480 | -34.425866 | 2.7108 | 0.670323 | 162.431747 | 569.713684 | 15.066316 |
114 | 135.835578 | -34.509779 | 3.1224 | 0.776152 | 89.274399 | 64.385925 | 24.267015 |
115 rows × 7 columns
# Check we got it right.
# Plot a grid, and our interrogated points
fig = plt.figure(figsize=(6,6))
ax = plt.axes()
im = plt.pcolormesh(
    data_grav["x_data"],data_grav["y_data"],data_grav["z_data"],
    cmap='jet',shading='auto',vmin=min(td1.grav),vmax=max(td1.grav)
)
#ax.plot(xval,yval,'grey',linestyle='--',linewidth=1,label='SA')
#ax.plot(comm.LONGITUDE, comm.LATITUDE, marker='o', linestyle='',markersize=5, color='y',label=commname+" Deposits")
ax.scatter(td1.LONGITUDE, td1.LATITUDE, s=20, c=td1.grav,
           label=commname+" Gravity",cmap='jet',vmin=min(td1.grav),vmax=max(td1.grav),edgecolors='white')

plt.xlim(138,140)
plt.ylim(-32,-30)
plt.legend(loc=3)

cbaxes = fig.add_axes([0.40, 0.18, 0.2, 0.015])
cbar = plt.colorbar(im, cax=cbaxes, orientation="horizontal", extend='both')
cbar.set_label('Gravity (gal)', labelpad=10)
cbar.ax.xaxis.set_label_position('top')
plt.show()
Geology
# For dealing with shapefile components
from shapely.geometry import Point, shape
# Make all the polygons into a list for fast access in the function
shape_objects = [shape(shp) for shp in shapesArch]
#Define a function to find what polygon a point lives inside (speed improvements can be made here)
def shapeExplore(lon,lat,shape_objects,recs):
    record_of_interest_index = 4  #'record' is the column index you want returned
    for i, polygon in enumerate(shape_objects):
        if Point((lon,lat)).within(polygon):
            return(recs[i][record_of_interest_index])
    #if you have been through the loop with no result
    return('-9999')
%%time
td1['geol'] = td1.apply(lambda x: shapeExplore(x.LONGITUDE, x.LATITUDE, shape_objects,recsArch), axis=1)
CPU times: user 791 ms, sys: 4.59 ms, total: 795 ms
Wall time: 804 ms
td1
LONGITUDE | LATITUDE | res | faults | dem | mag | grav | geol | |
---|---|---|---|---|---|---|---|---|
0 | 139.179436 | -29.877637 | 2.2135 | 0.010691 | 187.297424 | -118.074890 | 1.852599 | Crustal element Muloorina |
1 | 138.808767 | -30.086296 | 2.3643 | 0.103741 | 179.499237 | -209.410507 | -12.722121 | Crustal element Adelaide |
2 | 138.752281 | -30.445684 | 2.1141 | 0.006659 | 398.336823 | -159.566422 | -6.249788 | Crustal element Adelaide |
3 | 138.530506 | -30.533225 | 2.2234 | 0.013925 | 335.983429 | -131.176437 | -11.665316 | Crustal element Adelaide |
4 | 138.887019 | -30.565479 | 2.1982 | 0.007356 | 554.278198 | -192.363297 | -1.025702 | Crustal element Adelaide |
... | ... | ... | ... | ... | ... | ... | ... | ... |
110 | 136.059715 | -34.327929 | 3.4926 | 0.526835 | 45.866119 | -244.067841 | 11.410070 | Cleve, Spencer, Olympic Domains |
111 | 138.016821 | -35.733084 | 2.0868 | 0.002451 | 145.452789 | -203.566940 | 18.458364 | Crustal element Kanmantoo SW |
112 | 139.250036 | -34.250155 | 1.9811 | 0.027837 | 276.489319 | -172.889587 | -1.714886 | Crustal element Kanmantoo Main |
113 | 135.905480 | -34.425866 | 2.7108 | 0.670323 | 162.431747 | 569.713684 | 15.066316 | Cleve Domain |
114 | 135.835578 | -34.509779 | 3.1224 | 0.776152 | 89.274399 | 64.385925 | 24.267015 | Cleve Domain |
115 rows × 8 columns
Congrats, you now have an ML dataset ready to go!
Almost… but what is the target? Let’s make a binary classifier.
Step 4 - Generate a “non-deposit” dataset
We have a set of locations where a certain mineral deposit occurs, along with the values of various geophysical parameters at those locations. To identify which values of the geophysics are associated with a mineral deposit, we need a representation of the “background noise” of those parameters, i.e. what the values are when there is no mineral deposit.
This step is important. There are numerous ways to generate our non-deposit set, each with different benefits and trade-offs. The randomisation of points throughout some domain appears to be robust. But you must think, is this domain a reasonable estimation of “background” geophysics/geology? Why are you picking these locations as non-deposits? Will they be over/under-representing actual deposits? Will they be over/under-representing actual non-deposits?
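One possible refinement (not used in this workflow) is to reject random points that fall outside the South Australian state boundary, so the "background" is drawn only from the area we are actually exploring. A sketch using the xval/yval boundary arrays from earlier and shapely (imported above):

# Sketch: rejection-sample random non-deposit locations inside the SA polygon
from shapely.geometry import Point, Polygon

sa_poly = Polygon(list(zip(xval, yval)))   # polygon built from the SA boundary points

rand_points = []
while len(rand_points) < len(comm):
    lat = np.random.uniform(low=min(df.LATITUDE), high=max(df.LATITUDE))
    lon = np.random.uniform(low=min(df.LONGITUDE), high=max(df.LONGITUDE))
    if Point(lon, lat).within(sa_poly):    # keep only points inside SA
        rand_points.append((lon, lat))
# lons_rand, lats_rand = np.array(rand_points).T   # would replace the simpler bounding-box version below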
#Now make a set of "non-deposits" using a random location within our exploration area
lats_rand = np.random.uniform(low=min(df.LATITUDE), high=max(df.LATITUDE), size=len(comm.LATITUDE))
lons_rand = np.random.uniform(low=min(df.LONGITUDE), high=max(df.LONGITUDE), size=len(comm.LONGITUDE))
print("Produced", len(lats_rand),len(lons_rand), "latitude-longitude pairs for non-deposits.")
Produced 115 115 latitude-longitude pairs for non-deposits.
# Where are these randomised "non deposits"
fig = plt.figure(figsize=(6,6))
ax = plt.axes()

ax.plot(xval,yval,'grey',linestyle='--',linewidth=1,label='SA')

ax.plot(lons_rand, lats_rand,
        marker='.', linestyle='',markersize=1, color='b',label="Random Samples")

ax.plot(td1.LONGITUDE, td1.LATITUDE,
        marker='x', linestyle='',markersize=5, color='y',label=commname+" Deposits")

plt.xlim(128.5,141.5)
plt.ylim(-38.5,-25.5)
plt.legend(loc=3)
plt.show()
We must do the same coregistration/interrogation of the different data layers for our randomised “non-deposit” data.
%%time
td2 = pd.DataFrame({'LONGITUDE': lons_rand, 'LATITUDE': lats_rand})

# Res
indexes = td2.apply(
    lambda x: coregPoint(treeres,np.array([x.LONGITUDE, x.LATITUDE]),10,retval='index'), axis=1)
td2['res'] = data_res.loc[indexes].resistivity.values

# Faults
td2['faults'] = td2.apply(
    lambda x: coregPoint(treefaults,np.array([x.LONGITUDE, x.LATITUDE]),100,retval='dists'), axis=1)

# Geophys
td2['dem'] = rastersearch(td2,data_dem["z_data"],data_dem["origin_x"],data_dem["origin_y"],data_dem["pixel_x"],data_dem["pixel_y"])
td2['mag'] = rastersearch(td2,data_mag["z_data"],data_mag["origin_x"],data_mag["origin_y"],data_mag["pixel_x"],data_mag["pixel_y"])
td2['grav'] = rastersearch(td2,data_grav["z_data"],data_grav["origin_x"],data_grav["origin_y"],data_grav["pixel_x"],data_grav["pixel_y"])

#Geology
td2['geol'] = td2.apply(lambda x: shapeExplore(x.LONGITUDE, x.LATITUDE, shape_objects,recsArch), axis=1)
CPU times: user 1.77 s, sys: 15.9 ms, total: 1.78 s
Wall time: 1.72 s
#Add flag indicating classification label
td1['deposit'] = 1
td2['deposit'] = 0
# Make the feature vector
fv = pd.concat([td1,td2],axis=0,ignore_index=True)
fv
LONGITUDE | LATITUDE | res | faults | ... | mag | grav | geol | deposit | |
---|---|---|---|---|---|---|---|---|---|
0 | 139.179436 | -29.877637 | 2.2135 | 0.010691 | ... | -118.074890 | 1.852599 | Crustal element Muloorina | 1 |
1 | 138.808767 | -30.086296 | 2.3643 | 0.103741 | ... | -209.410507 | -12.722121 | Crustal element Adelaide | 1 |
2 | 138.752281 | -30.445684 | 2.1141 | 0.006659 | ... | -159.566422 | -6.249788 | Crustal element Adelaide | 1 |
3 | 138.530506 | -30.533225 | 2.2234 | 0.013925 | ... | -131.176437 | -11.665316 | Crustal element Adelaide | 1 |
4 | 138.887019 | -30.565479 | 2.1982 | 0.007356 | ... | -192.363297 | -1.025702 | Crustal element Adelaide | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
225 | 133.735472 | -32.324423 | 0.5211 | 0.921400 | ... | 0.000000 | 0.000000 | Nuyts Domain | 0 |
226 | 131.269745 | -32.102380 | -0.4837 | 0.524201 | ... | 0.000000 | 0.000000 | Christie, Fowler, Wilgena, Nuyts, Mount Woods | 0 |
227 | 136.423004 | -28.064587 | 1.9746 | 0.411990 | ... | 177.662155 | -29.012423 | Noolyeana Domain | 0 |
228 | 131.174728 | -31.034882 | 1.8956 | 0.033631 | ... | -57.023735 | -34.558727 | Fisher Domain | 0 |
229 | 134.505847 | -31.003589 | 2.0723 | 1.642021 | ... | 9.525288 | -39.825821 | Harris Greenstone Domain | 0 |
230 rows × 9 columns
# Save all our hard work to a csv file for more hacking to come!
fv.to_csv('../data/fv.csv',index=False)
Exploratory Data Analysis
This is often the point at which you receive the data (if you are using a well-formed dataset). But in reality about 90% of the time is spent on weird data-wrangling steps like the ones we have just done. Another 9% is spent exploring your dataset and understanding it better: dealing with missing data, observing correlations. This is often an iterative process. Let's do some simple visualisations.
Note: the last 1% of time is actually applying the ML algorithm!
Note 2: These percentages are totally made up, but feel about right.
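Before plotting anything, a quick numeric summary is a handy first look (standard pandas, complementing the info() call below):

# Summary statistics for every numeric column
fv.describe()

# Compare the numeric predictors between deposits (1) and non-deposits (0)
fv.groupby('deposit')[['res', 'faults', 'dem', 'mag', 'grav']].mean()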
#Get information about index type and column types, non-null values and memory usage.
fv.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 230 entries, 0 to 229
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 LONGITUDE 230 non-null float64
1 LATITUDE 230 non-null float64
2 res 230 non-null float64
3 faults 230 non-null float64
4 dem 228 non-null float64
5 mag 228 non-null float64
6 grav 228 non-null float64
7 geol 230 non-null object
8 deposit 230 non-null int64
dtypes: float64(7), int64(1), object(1)
memory usage: 16.3+ KB
# For nice easy data vis plots
import seaborn as sns
missingNo = fv.isnull().sum(axis=0).sort_values(ascending=False)
missingNo = missingNo[missingNo.values > 0]
missingNo

sns.barplot(x=missingNo.values, y=missingNo.index);
import upsetplot
missing_cols = missingNo.index[:5].tolist()
missing_counts = (fv.loc[:, missing_cols]
                  .isnull()
                  .groupby(missing_cols)
                  .size())

upsetplot.plot(missing_counts);
# Plot histograms and scatter plots for each combination of features.
sns.pairplot(fv,hue='deposit',palette="Set1",diag_kind="auto")
/Users/nbutter/miniconda3/envs/geopy/lib/python3.9/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight
self._figure.tight_layout(*args, **kwargs)
#Plot a heatmap for how correlated each of the features are
corr = fv.select_dtypes(include=['float64', 'int64']).corr()

sns.heatmap(corr,
            cmap=plt.cm.BrBG,
            vmin=-0.5, vmax=0.5,
            square=True,
            xticklabels=True, yticklabels=True,
            );
for i in ['res', 'faults', 'dem', 'mag', 'grav']:
    ax = sns.boxplot(x='deposit',y=i, data=fv)
    plt.show()
Machine Learning
We now have a clean dataset that we know a bit about. Let's try to draw some inferences from it.
ML Classification
This is where the ML classifier is defined. We can substitute our favourite ML technique here, and tune model variables as desired. As always, the scikit-learn documentation is a great starting point to learn how these algorithms work.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
# Create the 'feature vector' and a 'target classification vector'
# Features: Drop the lat/lon, and drop the 'deposit'
features = fv.dropna().iloc[:,2:-1]
# Targets: just the 'deposit' 1/0 binary classified column
targets = fv.dropna().deposit

features.columns
Index(['res', 'faults', 'dem', 'mag', 'grav', 'geol'], dtype='object')
numfts = ['res', 'faults', 'dem', 'mag', 'grav']
catfts = ['geol']
#Create the ML classifier with numerical and categorical data
#Scale, and replace missing values
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(missing_values=-9999., strategy='median')),
    ('scaler', StandardScaler())])

#Encode categorical data and fill missing values with default 0
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

#Combine numerical and categorical data
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numfts),
    ('cat', categorical_transformer, catfts)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
rf = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier', RandomForestClassifier(random_state=1))])
                     #('classifier', MLPClassifier(solver='adam', alpha=0.001, max_iter=10000))])
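As the commented-out line above suggests, any scikit-learn estimator can be dropped into the final step of the pipeline. For example, a sketch of the same pipeline using the already-imported MLPClassifier (hyperparameters copied from that comment, not tuned):

# Sketch: same preprocessing, different classifier in the final pipeline step
nn = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier', MLPClassifier(solver='adam', alpha=0.001, max_iter=10000))])
# nn can then be fitted and cross-validated exactly as rf is below.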
# Train the classifier! That is it, we have made our ML model.
rf.fit(features,targets)
# Now see how well it performs with a "cross-fold validation"
scores = cross_val_score(rf, features,targets, cv=5)

print("RF 5-fold cross validation Scores:", scores)
print("SCORE Mean: %.2f" % np.mean(scores), "STD: %.2f" % np.std(scores), "\n")

plt.plot(targets.values,'bo',label='Target (expected)')
plt.plot(rf.predict(features),'r.',label='Prediction')
plt.xlabel("Feature set")
plt.ylabel("Target Prediction Class")
plt.legend(loc=7)
RF 5-fold cross validation Scores: [0.95652174 0.89130435 0.97826087 0.93333333 0.75555556]
SCORE Mean: 0.90 STD: 0.08
<matplotlib.legend.Legend at 0x16fa3a6a0>
print("Features:",np.shape(features),"Targets:",np.shape(targets))
print("RF CV-Scores: ",scores)
print("CV-SCORE Mean: %.2f" % np.mean(scores), "STD: %.2f" % np.std(scores))
#print("OOB score:",rf.steps[-1][1].oob_score_)
print("Targets (expected result):")
print(targets.values)
print("Prediction (actual result):")
print(rf.predict(features))
Features: (228, 6) Targets: (228,)
RF CV-Scores: [0.95652174 0.89130435 0.97826087 0.93333333 0.75555556]
CV-SCORE Mean: 0.90 STD: 0.08
Targets (expected result):
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0]
Prediction (actual result):
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0]
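Note that the predictions above are made on the same data the model was trained on, so they say little about how well it generalises; the cross-validation scores are the more honest numbers. Another common check (not part of the original workflow) is to hold out a test split and inspect a confusion matrix, sketched here with standard scikit-learn utilities and an arbitrary 80/20 split:

# Sketch: evaluate on a held-out split rather than the training data
from sklearn.base import clone
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

X_train, X_test, y_train, y_test = train_test_split(
    features, targets, test_size=0.2, stratify=targets, random_state=1)

rf_holdout = clone(rf)            # copy of the pipeline, so rf itself is untouched
rf_holdout.fit(X_train, y_train)
y_pred = rf_holdout.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))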
# Gather the importance measures
ft_imp = []
ft_lab = []

for i,lab in enumerate(np.append(numfts,rf['preprocessor'].transformers_[1][1]['onehot'].get_feature_names_out(catfts))):
    ft_imp.append(rf.steps[-1][1].feature_importances_[i])
    ft_lab.append(lab)
#Make the bar plot
ft_imps, ft_labs = (list(t) for t in zip(*sorted(zip(ft_imp,ft_lab))))
datalength = len(ft_imp)

#Create a new figure
fig,ax = plt.subplots(figsize=(4,10))

#Plot the bar graph
rects = ax.barh(np.arange(0, datalength, step=1),ft_imps)
ax.set_yticks(np.arange(0, datalength, step=1))
ax.set_yticklabels(ft_labs)
ax.set_xlabel('Feature Importance')

print("From the Random Forest ML algorithm\nthese are the most significant features for predicting the target bins.\n")
plt.show()
From the Random Forest ML algorithm
these are the most significant features for predicting the target bins.
Find where to dig!
Now that we have a trained model, we can pass it NEW data, that is, values for all the geophysical parameters, and it will give us a prediction for whether there is a deposit there or not. Simple.
res1 = [2.2,2.2]
faults1 = [0.01,2.2]
dem1 = [187,2.2]
mag1 = [-118,2.2]
grav = [1.8,2.2]
geol = ['Crustal element Adelaide','Crustal element Muloorina']

targets = pd.DataFrame({'res':res1,'faults':faults1,'dem':dem1,'mag':mag1,'grav':grav,'geol':geol})
targets_scaled = targets
# targets_scaled = rf.named_steps['preprocessor'].transform(targets)

print(targets)
print(targets_scaled)

rf.predict_proba(targets_scaled)
res faults dem mag grav geol
0 2.2 0.01 187.0 -118.0 1.8 Crustal element Adelaide
1 2.2 2.20 2.2 2.2 2.2 Crustal element Muloorina
res faults dem mag grav geol
0 2.2 0.01 187.0 -118.0 1.8 Crustal element Adelaide
1 2.2 2.20 2.2 2.2 2.2 Crustal element Muloorina
array([[0. , 1. ],
[0.78, 0.22]])
# Better - Make a grid of target locations over an entire area
#100x100 takes about 1 minute
grid_x, grid_y = np.mgrid[130:140:100j,-36:-26:100j]
# Now we want to get the geophys values for every single point on this grid
# Which we will then apply our model to!
%%time
tdf = pd.DataFrame({'LONGITUDE': grid_x.reshape(grid_x.size), 'LATITUDE': grid_y.reshape(grid_y.size)})
# tdf = rf['preprocessor'].transform(tdf)

# Res
indexes = tdf.apply(
    lambda x: coregPoint(treeres,np.array([x.LONGITUDE, x.LATITUDE]),10,retval='index'), axis=1)
tdf['res'] = data_res.loc[indexes].resistivity.values
print("Done res")

# Faults
tdf['faults'] = tdf.apply(
    lambda x: coregPoint(treefaults,np.array([x.LONGITUDE, x.LATITUDE]),100,retval='dists'), axis=1)
print("Done faults")

# Geophys
tdf['dem'] = rastersearch(tdf,data_dem["z_data"],data_dem["origin_x"],data_dem["origin_y"],data_dem["pixel_x"],data_dem["pixel_y"])
tdf['mag'] = rastersearch(tdf,data_mag["z_data"],data_mag["origin_x"],data_mag["origin_y"],data_mag["pixel_x"],data_mag["pixel_y"])
tdf['grav'] = rastersearch(tdf,data_grav["z_data"],data_grav["origin_x"],data_grav["origin_y"],data_grav["pixel_x"],data_grav["pixel_y"])
print("Done rasters")

# Geology
tdf['geol'] = tdf.apply(lambda x: shapeExplore(x.LONGITUDE, x.LATITUDE, shape_objects,recsArch), axis=1)
print("Done!")
Done res
Done faults
Done rasters
Done!
CPU times: user 1min 54s, sys: 466 ms, total: 1min 55s
Wall time: 1min 55s
# Save all our hard work to a csv file for more hacking to come!
tdf.to_csv('../data/tdf_100.csv',index=False)
# tdf = pd.read_csv('../data/tdf_100.csv') #Read the file in if you need
#Apply the trained ML to our gridded data to determine the probabilities at each of the points
print('RF...')
pRF_map = np.array(rf.predict_proba(tdf.iloc[:,2:]))
print("Done RF")
RF...
Done RF
# Explicitly import the interpolate submodule (importing scipy.io alone does not provide it)
import scipy.interpolate

#X, Y = np.meshgrid(xi, yi)
gridZ = scipy.interpolate.griddata((tdf.LONGITUDE, tdf.LATITUDE), pRF_map[:,1], (grid_x, grid_y),method='linear')
fig = plt.figure(figsize=(10,10))
ax = plt.axes()
im = plt.pcolormesh(grid_x,grid_y,gridZ,cmap='bwr',shading='auto')
ax.plot(xval,yval,'grey',linestyle='--',linewidth=1,label='SA')
ax.plot(fv[fv.deposit==1].LONGITUDE, fv[fv.deposit==1].LATITUDE,
        marker='x', linestyle='',markersize=5, color='y',label=commname+" Deposits")
ax.plot(fv[fv.deposit==0].LONGITUDE, fv[fv.deposit==0].LATITUDE,
        marker='.', linestyle='',markersize=1, color='b',label="Random Samples")

plt.xlim(128.5,141.5)
plt.ylim(-38.5,-25.5)
plt.legend(loc=3)

cbaxes = fig.add_axes([0.40, 0.18, 0.2, 0.015])
cbar = plt.colorbar(im, cax=cbaxes, orientation="horizontal")
cbar.set_label('Probability of Mineral Deposit', labelpad=10)
cbar.ax.xaxis.set_label_position('top')
plt.show()
All materials copyright Sydney Informatics Hub, University of Sydney