Python can deal with basically any type of data you throw at it. The open source python community has developed many packages that make things easy. Today we will look at pyshp
(for dealing with shapefiles), pandas
(great for tables and time series), lasio
(for las format well log data) and obspy
(a highly featured seismic data processing suite) packages.
Data for this exercised was downloaded from http://www.bom.gov.au/water/groundwater/explorer/map.shtml
Shapefiles are a very common file format for GIS data, the standard for which is developed and maintained by ESRI, the makers of the ArcGIS software. Shapefiles collect vectors of features, such as points, lines, polygons. The “file” is actually a misnomer - if you look at a single “shapefile” on your machine using a file explorer, you can see that it’s actually made up of several files, three of which are mandatory, and others which may/may not be there.
#Load the required modules
import shapefile
#NOTE: Weirdly and confusingly, this package is called "pyshp" but you call it via the name "shapefile"
#help(shapefile)
#Or check out the help pages https://github.com/GeospatialPython/pyshp
#Set the filename
boreshape='../data/shp_torrens_river/NGIS_BoreLine.shp'
#read in the file
shapeRead = shapefile.Reader(boreshape)
#And save out some of the shape file attributes
recs = shapeRead.records()
shapes = shapeRead.shapes()
fields = shapeRead.fields
Nshp = len(shapes)
print(Nshp) #print the Number of items in the shapefile
7635
fields #print the fields
[('DeletionFlag', 'C', 1, 0),
['HydroID', 'N', 10, 0],
['HydroCode', 'C', 30, 0],
['BoreID', 'N', 10, 0],
['TopElev', 'F', 19, 11],
['BottomElev', 'F', 19, 11],
['HGUID', 'N', 10, 0],
['HGUNumber', 'N', 10, 0],
['NafHGUNumb', 'N', 10, 0],
['SHAPE_Leng', 'F', 19, 11]]
recs[3] #print the first record, then this is a list that can be subscripted further
Record #3: [32002002, '652800645', 30027773, -147.26, -154.26, 31000045, 1044, 125005, 0.0]
shapes[1].points #print the point values of the first shape
[(591975.5150000006, -3816141.8817), (591975.5150000006, -3816141.8817)]
shapeRead.shapeTypeName
'POLYLINEZ'
rec= shapeRead.record(0)
rec['TopElev']
6.74
There are two coordinates. But they are duplicated.
rec= shapeRead.record(299)
rec['BottomElev']
#or
recs[299][4]
#Here is a slightly neater way to read in the data, but it looks confusing at first.
#But we will need it in this form for our next exercise.
#This type of assignment is known as "list comprehension"
#fields = [x[0] for x in shapeRead.fields][1:]
#Break this down line by line
#for x in shapeRead.fields:
# print(x)
#Now just print the 1st (0th) column of each list variable
#for x in shapeRead.fields:
# print(x[0])
#But we want to save these values in a list (not just print them out).
#fields=[]
#for x in shapeRead.fields:
# fields.append(x[0])
#And we don't want the DeletionFlag field, so we need to just get all the values except the first
#fields=fields[1:]
#fields = [x[0] for x in shapeRead.fields][1:]
#Do a list comprehnsion for the the other variable too
#shps = [s.points for s in shapeRead.shapes()]
Shapefiles are not a native python format, but the community have developed tools for exploring them. The package we have used “pyshp” imported with the name “shapefile” (for some non-consistent weird reason), is one example of working with shapefiles. Alternatives exist.
Pandas is one of the most useful packages (along with probably numpy and matplotlib). We will use it several times throughout the course for data handling and manipulation.
#As before, read in the shapefile
boreshape='../data/shp_torrens_river/NGIS_BoreLine.shp'
#Read the shapefile attributes to variables
shapeRead = shapefile.Reader(boreshape)
fields = [x[0] for x in shapeRead.fields][1:]
shps = [s.points for s in shapeRead.shapes()]
recs= shapeRead.records()
import pandas
#Now convert the variables to a pandas dataframe
df = pandas.DataFrame(columns=fields, data=recs)
#See more details at the docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
df
HydroID | HydroCode | BoreID | TopElev | BottomElev | HGUID | HGUNumber | NafHGUNumb | SHAPE_Leng | |
---|---|---|---|---|---|---|---|---|---|
0 | 32001999 | 652800645 | 30027773 | 6.74 | -74.26 | 31000043 | 1042 | 104005 | 0.0 |
1 | 32002000 | 652800645 | 30027773 | -74.26 | -125.26 | 31000109 | 1108 | 110002 | 0.0 |
2 | 32002001 | 652800645 | 30027773 | -125.26 | -147.26 | 31000045 | 1044 | 125005 | 0.0 |
3 | 32002002 | 652800645 | 30027773 | -147.26 | -154.26 | 31000045 | 1044 | 125005 | 0.0 |
4 | 32002003 | 652800645 | 30027773 | -154.26 | -168.26 | 31000045 | 1044 | 125005 | 0.0 |
… | … | … | … | … | … | … | … | … | … |
7630 | 32145557 | 662810075 | 30057044 | 102.62 | 90.89 | 31000139 | 1138 | 100001 | 0.0 |
7631 | 32145558 | 662810075 | 30057044 | 103.08 | 102.62 | 31000139 | 1138 | 100001 | 0.0 |
7632 | 32145559 | 662813065 | 30060034 | 535.08 | 451.08 | 31000026 | 1025 | 134001 | 0.0 |
7633 | 32145560 | 662813065 | 30060034 | 451.08 | 171.08 | 31000014 | 1013 | 134001 | 0.0 |
7634 | 32145561 | 662814687 | 30061656 | 444.30 | 432.30 | 31000014 | 1013 | 134001 | 0.0 |
7635 rows × 9 columns
#Add a new column called "coords" to the DataFrame, fill it with what is in "shps"
df['coords'] = shps
#Alternatively, you can use the "assign" method
#df = df.assign(coords=shps)
df
HydroID | HydroCode | BoreID | TopElev | BottomElev | HGUID | HGUNumber | NafHGUNumb | SHAPE_Leng | coords | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 32001999 | 652800645 | 30027773 | 6.74 | -74.26 | 31000043 | 1042 | 104005 | 0.0 | [(591975.5150000006, -3816141.8817), (591975.5… |
1 | 32002000 | 652800645 | 30027773 | -74.26 | -125.26 | 31000109 | 1108 | 110002 | 0.0 | [(591975.5150000006, -3816141.8817), (591975.5… |
2 | 32002001 | 652800645 | 30027773 | -125.26 | -147.26 | 31000045 | 1044 | 125005 | 0.0 | [(591975.5150000006, -3816141.8817), (591975.5… |
3 | 32002002 | 652800645 | 30027773 | -147.26 | -154.26 | 31000045 | 1044 | 125005 | 0.0 | [(591975.5150000006, -3816141.8817), (591975.5… |
4 | 32002003 | 652800645 | 30027773 | -154.26 | -168.26 | 31000045 | 1044 | 125005 | 0.0 | [(591975.5150000006, -3816141.8817), (591975.5… |
… | … | … | … | … | … | … | … | … | … | … |
7630 | 32145557 | 662810075 | 30057044 | 102.62 | 90.89 | 31000139 | 1138 | 100001 | 0.0 | [(605865.9246000014, -3830429.3729999997), (60… |
7631 | 32145558 | 662810075 | 30057044 | 103.08 | 102.62 | 31000139 | 1138 | 100001 | 0.0 | [(605865.9246000014, -3830429.3729999997), (60… |
7632 | 32145559 | 662813065 | 30060034 | 535.08 | 451.08 | 31000026 | 1025 | 134001 | 0.0 | [(612545.5916999988, -3832402.8148999996), (61… |
7633 | 32145560 | 662813065 | 30060034 | 451.08 | 171.08 | 31000014 | 1013 | 134001 | 0.0 | [(612545.5916999988, -3832402.8148999996), (61… |
7634 | 32145561 | 662814687 | 30061656 | 444.30 | 432.30 | 31000014 | 1013 | 134001 | 0.0 | [(635716.0604999997, -3816751.5364999995), (63… |
7635 rows × 10 columns
Pandas more frequently is used to directly read in tables. So let’s read in the csv data that came with shapefile (as this gives us some additional fields not stored in the shapefile that we can explore.
#read in the data
log_data=pandas.read_csv("../data/shp_torrens_river/NGIS_LithologyLog.csv",usecols=list(range(0,13)))
#What is the "usecols" variable equal to?
#Try reading the data without using the usecols option, can you solve the error?
log_data # print the first 30 and last 30 rows
OBJECTID | BoreID | HydroCode | RefElev | RefElevDesc | FromDepth | ToDepth | TopElev | BottomElev | MajorLithCode | MinorLithCode | Description | Source | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1769789 | 30062892 | 662815923 | 57.25 | NGS | 18.0 | 19.5 | 39.25 | 37.75 | CLYU | None | Clay | SAGeodata |
1 | 1769790 | 30062892 | 662815923 | 57.25 | NGS | 19.5 | 22.0 | 37.75 | 35.25 | ROCK | None | Rocks and sand | SAGeodata |
2 | 1769791 | 30062892 | 662815923 | 57.25 | NGS | 22.0 | 24.0 | 35.25 | 33.25 | CLYU | None | Clay | SAGeodata |
3 | 1770725 | 30141910 | 662816624 | 4.0 | NGS | 0.0 | 6.0 | 4.0 | -2.0 | None | None | No sample | SAGeodata |
4 | 1770726 | 30141910 | 662816624 | 4.0 | NGS | 6.0 | 15.0 | -2.0 | -11.0 | CLYU | None | Clay - orange-red grey, mottled; stiff-sticky…. | SAGeodata |
… | … | … | … | … | … | … | … | … | … | … | … | … | … |
70168 | 4120345 | 30304039 | 662829228 | None | UNK | 9.0 | 10.0 | None | None | CLYU | None | Sandy clay | SAGeodata |
70169 | 4120346 | 30304039 | 662829228 | None | UNK | 10.0 | 12.5 | None | None | SAND | None | Clay sand | SAGeodata |
70170 | 4120347 | 30304050 | 652802882 | None | UNK | 0.0 | 0.3 | None | None | FILL | None | Fill | SAGeodata |
70171 | 4120348 | 30304050 | 652802882 | None | UNK | 0.3 | 0.8 | None | None | SAND | None | Clayey sand | SAGeodata |
70172 | 4120349 | 30304050 | 652802882 | None | UNK | 0.8 | 3.5 | None | None | SAND | None | Sand | SAGeodata |
70173 rows × 13 columns
# add a new column as a function of existing columns
log_data['Thickness'] = log_data.ToDepth - log_data.FromDepth
type(log_data) # see what Python type the DataFrame is
pandas.core.frame.DataFrame
log_data.head(3) # print the first 3 rows
OBJECTID | BoreID | HydroCode | RefElev | RefElevDesc | FromDepth | ToDepth | TopElev | BottomElev | MajorLithCode | MinorLithCode | Description | Source | Thickness | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1769789 | 30062892 | 662815923 | 57.25 | NGS | 18.0 | 19.5 | 39.25 | 37.75 | CLYU | None | Clay | SAGeodata | 1.5 |
1 | 1769790 | 30062892 | 662815923 | 57.25 | NGS | 19.5 | 22.0 | 37.75 | 35.25 | ROCK | None | Rocks and sand | SAGeodata | 2.5 |
2 | 1769791 | 30062892 | 662815923 | 57.25 | NGS | 22.0 | 24.0 | 35.25 | 33.25 | CLYU | None | Clay | SAGeodata | 2.0 |
log_data.index # “the index” (aka “the labels”).
#Pandas is great for using timeseries data, where the index can be the timestamps
RangeIndex(start=0, stop=70173, step=1)
log_data.columns # column names (which is “an index”)
Index(['OBJECTID', 'BoreID', 'HydroCode', 'RefElev', 'RefElevDesc',
'FromDepth', 'ToDepth', 'TopElev', 'BottomElev', 'MajorLithCode',
'MinorLithCode', 'Description', 'Source', 'Thickness'],
dtype='object')
log_data.dtypes # data types of each column
OBJECTID int64
BoreID int64
HydroCode int64
RefElev object
RefElevDesc object
FromDepth float64
ToDepth float64
TopElev object
BottomElev object
MajorLithCode object
MinorLithCode object
Description object
Source object
Thickness float64
dtype: object
log_data.shape # number of rows and columns
(70173, 14)
log_data.values # underlying numpy array — df are stored as numpy arrays for efficiencies.
array([[1769789, 30062892, 662815923, ..., 'Clay', 'SAGeodata', 1.5],
[1769790, 30062892, 662815923, ..., 'Rocks and sand', 'SAGeodata',
2.5],
[1769791, 30062892, 662815923, ..., 'Clay', 'SAGeodata', 2.0],
...,
[4120347, 30304050, 652802882, ..., 'Fill', 'SAGeodata', 0.3],
[4120348, 30304050, 652802882, ..., 'Clayey sand', 'SAGeodata',
0.5],
[4120349, 30304050, 652802882, ..., 'Sand', 'SAGeodata', 2.7]],
dtype=object)
#log_data['MajorLithCode'] # select one column
##Equivalent to
#log_data.MajorLithCode
##and
#log_data.iloc[:,9]
##and
#log_data.loc[:,'MajorLithCode']
type(log_data['MajorLithCode']) # determine datatype of column (e.g., Series)
pandas.core.series.Series
#describe the data frame
log_data.describe(include='all')
OBJECTID | BoreID | HydroCode | RefElev | RefElevDesc | FromDepth | ToDepth | TopElev | BottomElev | MajorLithCode | MinorLithCode | Description | Source | Thickness | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 7.017300e+04 | 7.017300e+04 | 7.017300e+04 | 70173 | 70173 | 70173.000000 | 70173.000000 | 70173 | 70173 | 70173 | 70173 | 70173 | 70173 | 70173.000000 |
unique | NaN | NaN | NaN | 5068 | 4 | NaN | NaN | 27784 | 27883 | 81 | 42 | 33614 | 54 | NaN |
top | NaN | NaN | NaN | None | NGS | NaN | NaN | None | None | CLYU | None | Clay | SAGeodata | NaN |
freq | NaN | NaN | NaN | 18514 | 44947 | NaN | NaN | 18514 | 18514 | 25861 | 62812 | 4603 | 70119 | NaN |
mean | 2.505677e+06 | 3.018198e+07 | 6.624491e+08 | NaN | NaN | 24.938020 | 30.621160 | NaN | NaN | NaN | NaN | NaN | NaN | 5.683139 |
std | 9.276182e+05 | 8.069609e+04 | 2.130226e+06 | NaN | NaN | 45.431762 | 48.605931 | NaN | NaN | NaN | NaN | NaN | NaN | 9.942400 |
min | 1.769789e+06 | 3.002715e+07 | 6.528000e+08 | NaN | NaN | 0.000000 | 0.010000 | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 |
25% | 1.932741e+06 | 3.014557e+07 | 6.628129e+08 | NaN | NaN | 0.800000 | 3.960000 | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 |
50% | 1.999028e+06 | 3.018487e+07 | 6.628196e+08 | NaN | NaN | 7.000000 | 11.500000 | NaN | NaN | NaN | NaN | NaN | NaN | 2.800000 |
75% | 3.967146e+06 | 3.025487e+07 | 6.628248e+08 | NaN | NaN | 25.800000 | 34.700000 | NaN | NaN | NaN | NaN | NaN | NaN | 7.000000 |
max | 4.120349e+06 | 3.030405e+07 | 6.728042e+08 | NaN | NaN | 610.300000 | 620.160000 | NaN | NaN | NaN | NaN | NaN | NaN | 300.500000 |
# summarise a pandas Series
log_data.FromDepth.describe() # describe a single column
count 70173.000000
mean 24.938020
std 45.431762
min 0.000000
25% 0.800000
50% 7.000000
75% 25.800000
max 610.300000
Name: FromDepth, dtype: float64
#calculate mean of 6th column ("FromDepth")
log_data.iloc[:,5].mean()
24.93802017870121
#alternate method to calculate mean of FromDepth column (the 6th one)
log_data["FromDepth"].mean()
24.93802017870121
#Count how many Lith Codes there are
lithCounts=log_data.MajorLithCode.value_counts()
#Print the lithcodes, use .index or .values
lithCounts
CLYU 25861
SAND 12772
SLAT 4071
FILL 4020
SDST 3209
...
REGO 1
SANN 1
CALU 1
DUST 1
ALDG 1
Name: MajorLithCode, Length: 81, dtype: int64
#plot a bar chart of the lith codes
lithCounts.plot.bar(figsize=(15,5))
<AxesSubplot:>
#Plot a bar chart of the lith codes for the rarer lithologies
lithCounts[(lithCounts < 50)].plot.bar(rot=45,figsize=(15,5))
<AxesSubplot:>
import numpy as np
import matplotlib.pyplot as plt
#Pandas data
x = log_data.Thickness
mu = log_data.Thickness.mean()
sigma = log_data.Thickness.std()
# An Equivalent Numpy version
#x = log_data.Thickness.values
#mu = np.mean(x) # mean of distribution
#sigma = np.std(x) # standard deviation of distribution
# the histogram of the data
plt.hist(x, bins=np.arange(0,20,1), alpha=0.5)
plt.xlabel('Thickness (m)')
plt.ylabel('Count')
mystring="Histogram with a mean of "+ str(np.round(mu,2)) + " & SD " + str(np.round(sigma,2))
plt.title(mystring)
# Tweak spacing to prevent clipping of ylabel
#plt.subplots_adjust(left=0.15)
plt.show()
Plot a histogram of the thickness where the “MajorLithCode” is “FILL”.
Hint: to filter a pandas data frame by value use the following syntax:
df[df['Variable'] == "value"]
x = log_data[log_data['MajorLithCode'] == "FILL"]['Thickness'].values
#OR
#x = log_data[log_data.MajorLithCode == "FILL"].Thickness.values
plt.hist(x, bins=np.arange(0,20,1), alpha=0.5)
plt.show()
# import numpy as np
# cmap = plt.get_cmap('viridis')
# colors = cmap(np.linspace(0, 1, len(lithCounts.index)))
# colors
# for row in log_data.itertuples():
# boreid=row[3]
# for ind,value in enumerate(recs):
# try:
# value.index(boreid)
# print(recs)
# except:
# continue
# #(row[3])
#You can plot the location of the bores slowly
# for ind, value in enumerate(recs):
# #Get the lat lon value
# lon=df.coords[ind][0][0]
# lat=df.coords[ind][0][1]
# #Get the Lithology unit
# #value[]
# #Now add the point to the plot
# plt.plot(lon,lat,"|")
# plt.show()
# #or fast
# lons= [df.coords[i][0][0] for i in range(1,len(recs))]
# lats= [df.coords[i][0][1] for i in range(1,len(recs))]
# plt.plot(lons,lats,"|")
# plt.show()
Go to http://www.bom.gov.au/water/groundwater/explorer/map.shtml and pick another River Region. Download the dataset in “Shapefile” format (this will download the csv also). Once you have the data, follow the same routines as above and see what you can find out about the river region.
TODO Nate
Python has a wide range of packages/libraries to do specific tasks. You can often create your own tools for doing niche tasks, but often you will find that many already exist to make things simpler for you. We will explore libraries that work with borehole data (in .las format) with the lasio library.
This tutorial based off https://towardsdatascience.com/handling-big-volume-of-well-log-data-with-a-boosted-time-efficiency-with-python-dfe0319daf26
Original Data from: https://sarigbasis.pir.sa.gov.au/WebtopEw/ws/samref/sarig1/image/DDD/PEDP013LOGS.zip
Title: Cooper Basin selected well logs in LAS format.
Publication Date: November 20
Prepared by: Energy Resources Division, Department of the Premier and Cabinet
This Record URL: https://sarigbasis.pir.sa.gov.au/WebtopEw/ws/samref/sarig1/wci/Record?r=0&m=1&w=catno=2040037
#For plotting
import matplotlib.pyplot as plt
#Library specifically for "well data"
import lasio
#To read files
import glob
#For "regular expression manipulation"
import re
#Build a list of filenames to read
read_files = glob.glob("../data/WELL/*.las")
read_files
['../data/WELL\\Balnaves.las',
'../data/WELL\\Banyula.las',
'../data/WELL\\Beachport1.las',
'../data/WELL\\BeachportEast1.las',
'../data/WELL\\BiscuitFlat1.las',
'../data/WELL\\BoolLagoon1.las',
'../data/WELL\\Bungaloo1.las',
'../data/WELL\\Burrungule1.las']
Note: the possibility of Windows VS Unix character interpretations.
#Cut out just the name of the well from the filenames
well_names = []
for file in read_files:
print("FILE:", file)
#Split the filepath at a "/" OR a ".las" OR a "\"
well=re.split(r'/|\\|.las',file)
print("SPLIT:", well, "\n")
well_names.append(well[-2])
print("There are ", len(well_names), "wells.")
print(well_names)
FILE: ../data/WELL\Balnaves.las
SPLIT: ['..', 'data', 'WELL', 'Balnaves', '']
FILE: ../data/WELL\Banyula.las
SPLIT: ['..', 'data', 'WELL', 'Banyula', '']
FILE: ../data/WELL\Beachport1.las
SPLIT: ['..', 'data', 'WELL', 'Beachport1', '']
FILE: ../data/WELL\BeachportEast1.las
SPLIT: ['..', 'data', 'WELL', 'BeachportEast1', '']
FILE: ../data/WELL\BiscuitFlat1.las
SPLIT: ['..', 'data', 'WELL', 'BiscuitFlat1', '']
FILE: ../data/WELL\BoolLagoon1.las
SPLIT: ['..', 'data', 'WELL', 'BoolLagoon1', '']
FILE: ../data/WELL\Bungaloo1.las
SPLIT: ['..', 'data', 'WELL', 'Bungaloo1', '']
FILE: ../data/WELL\Burrungule1.las
SPLIT: ['..', 'data', 'WELL', 'Burrungule1', '']
There are 8 wells.
['Balnaves', 'Banyula', 'Beachport1', 'BeachportEast1', 'BiscuitFlat1', 'BoolLagoon1', 'Bungaloo1', 'Burrungule1']
#Now actually read in the log files to lasio
#The last cell was just to automatically make a nicely formatted list of well names!
lases = []
for files in read_files:
las = lasio.read(files)
lases.append(las)
#You can get an idea of what you can interogate using the help function
#help(lases)
#This is just a regular Python list! But the list contains
#in this case, special objects known as "LasFile(s)" or lasio.las object.
#Get some details using help again
#help(lases[1])
#From there we can get some info from each of the wells
j=0
for well in lases:
#e.g. pull out the varaibles availble from the wells
print("Wellid:", j, well_names[j])
j+=1
print(well.keys())
Wellid: 0 Balnaves
['DEPTH', 'CALI', 'DRHO', 'DT', 'GR', 'MINV', 'MNOR', 'NPHI', 'PEF', 'RDEP', 'RHOB', 'RMED', 'RMIC', 'SP']
Wellid: 1 Banyula
['DEPTH', 'CALI', 'DRHO', 'DT', 'GR', 'NPHI', 'RDEP', 'RHOB', 'RMED', 'SP']
Wellid: 2 Beachport1
['DEPTH', 'CALI', 'MINV', 'MNOR', 'RDEP', 'RMED', 'SP']
Wellid: 3 BeachportEast1
['DEPTH', 'GR', 'RDEP', 'RMED', 'SP']
Wellid: 4 BiscuitFlat1
['DEPTH', 'CALI', 'DRHO', 'DT', 'GR', 'MINV', 'MNOR', 'NPHI', 'PEF', 'RDEP', 'RHOB', 'RMED', 'RMIC', 'SP']
Wellid: 5 BoolLagoon1
['DEPTH', 'CALI', 'DRHO', 'DT', 'GR', 'NPHI', 'PEF', 'RDEP', 'RHOB', 'RMED', 'SP']
Wellid: 6 Bungaloo1
['DEPTH', 'CALI', 'DRHO', 'DT', 'DTS', 'GR', 'NPHI', 'PEF', 'RDEP', 'RHOB', 'RMED', 'RMIC', 'SP']
Wellid: 7 Burrungule1
['DEPTH', 'CALI', 'DT', 'GR', 'RDEP', 'RMED', 'SP']
#Set a wellid you want to explore more
wellid=1
#Make a plot of one of the wells
plt.plot(lases[wellid]['DRHO'],lases[wellid]['DEPTH'])
[<matplotlib.lines.Line2D at 0x232520c7908>]
You have just plotted the density (DRHO) at each measured depth point. You can clean this up and present it better in the next few cells.
#Get some more info out of the well data
print(lases[wellid].curves)
Mnemonic Unit Value Description
-------- ---- ----- -----------
DEPTH M Depth
CALI in Caliper CALI Edited, Spliced, BANYU001.G01.lis
DRHO g/cm3 DenCorr DRHO Edited, BANYU001.G01.lis
DT us/ft Sonic DT Edited, Spliced, BANYU001.G01.lis
GR gAPI GammaRay GR Spliced, BANYU001.G01.lis
NPHI dec Neutron NPHI Edited, BANYU001.G01.lis
RDEP ohmm DeepRes ILD Spliced, BANYU001.G01.lis
RHOB g/cm3 Density RHOB Edited, BANYU001.G01.lis
RMED ohmm MedRes ILM Spliced, BANYU001.G01.lis
SP mV SponPot SP Spliced, BANYU001.G01.lis
Run this bit of code. Then add additional mnemonic plots to the figure.
#Import additional packages we will need
import numpy as np
import pandas as pd
#Convert a data array to a pandas dataframe
#and find significant spikes in the data
#Return the spikes as a binary 1 or 0 array
def find_unc(data):
#Convert data to pandas
df=pd.DataFrame(data)
#Caluclate the rolling average
#(200 is somewhat arbitray value to take the rolling average over)
df_mean = df.rolling(200).mean()
#Calculate the percent change (i.e any points of change) of the data
df_change = df_mean.pct_change(periods=200)
#Convert large percent changes to 1 or 0
#0.5 (50%) is a somewhat arbitray number
#to set as the amount of change in the data
dfbin = ((df_change < -0.5) | (df_change > 0.5)).astype(int)
#Return the binaray array
return(dfbin)
#Define a function to make the plot and set parameters
def make_plot(i,var,colour):
#Set the data to a variable
data=lases[wellid][var]
#Find the spikes in the data
dfbin=find_unc(data)
#Now perform the plotting
top=min(lases[wellid]['DEPTH'])
bot=max(lases[wellid]['DEPTH'])
ax[i].plot(dfbin*np.nanmax(data), lases[wellid]['DEPTH'], color = 'black', linewidth = 0.5)
ax[i].plot(data, lases[wellid]['DEPTH'], color = colour, linewidth = 0.5)
ax[i].set_xlabel(var)
ax[i].xaxis.label.set_color(colour)
ax[i].set_xlim(np.nanpercentile(lases[wellid][var],0.5), np.nanpercentile(lases[wellid][var],99.5))
ax[i].tick_params(axis='x', colors=colour)
ax[i].title.set_color(colour)
ax[i].set_ylim(top,bot)
ax[i].invert_yaxis()
ax[i].tick_params(left=False,
bottom=True,
labelleft=False,
labelbottom=True)
#Make the figure
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(16,6))
#Add a plot of each mnemoic to your figure
make_plot(0,"GR","green")
make_plot(1,"RDEP","red")
#Quick way to get the list of keys
#lases[wellid].keys()
#Fix the details on the figure
plt.subplots_adjust(wspace=0.01)
ax[0].set_ylabel("Depth (m)")
ax[0].tick_params(left=True,
bottom=True,
labelleft=True,
labelbottom=True)
To solve the challenge you can change the ncols
varibale and then add new calls to the make_plot
function.
#Change the number of columns
fig, ax = plt.subplots(nrows=1, ncols=8, figsize=(16,6))
#Add additional calls to the make plot
make_plot(0,"CALI","green")
make_plot(1,"DRHO","red")
make_plot(2,"DT","blue")
make_plot(3,"GR","purple")
make_plot(4,"RDEP","cyan")
make_plot(5,"RHOB","pink")
make_plot(6,"RMED","brown")
make_plot(7,"SP","orange")
To learn more about the smoothing steps make a diagnostic plots at each step.
#Set an example dataset, i.e.
#wellid 1 and var is "GR"
data=lases[1]["GR"]
#Convert data to pandas
df=pd.DataFrame(data)
plt.plot(df)
plt.title("Raw data")
plt.show()
#Caluclate the rolling average
df_mean = df.rolling(200).mean()
plt.plot(df_mean)
plt.title("Smoothed data")
plt.show()
#Calculate the percent change (i.e any points of change) of the data
df_change = df_mean.pct_change(periods=200)
plt.plot(df_change)
plt.title("Percentage changes along the smoothed data")
plt.show()
#Convert large percent changes to 1 or 0
dfbin = ((df_change < -0.2) | (df_change > 0.2)).astype(int)
plt.plot(dfbin)
plt.title("'Binarised' version of large percentage changes")
plt.show()
from obspy.io.segy.segy import _read_segy
import matplotlib.pyplot as plt
import numpy as np
#Adapted from https://agilescientific.com/blog/2016/9/21/x-lines-of-python-read-and-write-seg-y
#See the notebooks here for more good examples
#https://github.com/agile-geoscience/xlines
#Set the filename of the segy data
filename="../data/james/james_1959_pstm_tvfk_gain.sgy"
#Data randomly chosen from here:
#Title: 2006 James 3D Seismic Survey.
#Author: White, A.
#Prepared by: Terrex Seismic Pty Ltd; Pioneer Surveys Pty Ltd; WestenGeco
#Tenement: PPL00182
#Operator: Santos Ltd
#https://sarigbasis.pir.sa.gov.au/WebtopEw/ws/samref/sarig1/wci/Record?r=0&m=1&w=catno=2035790
#This will take about 1 minute.
#When the [*] changes to [52] and the circle in the top right is clear, it has completed
stream = _read_segy(filename, headonly=True)
print(np.shape(stream.traces))
stream
(48832,)
48832 traces in the SEG Y structure.
#Look at a single trace
one_trace = stream.traces[10000]
#Print out details single trace
print(one_trace)
plt.figure(figsize=(16,2))
plt.plot(one_trace.data)
plt.show()
Trace sequence number within line: 10001
1001 samples, dtype=float32, 250.00 Hz
#Stack multiple traces into a single numpy array
data = np.stack([t.data for t in stream.traces[12320:12320+500]])
#What does the stacked data look like
data.shape
(500, 1001)
#Have a look at the data
plt.imshow(data.T, cmap="Greys", aspect='auto')
<matplotlib.image.AxesImage at 0x7f865805d460>
#Make a more informative plot
#Restrict the data to the 95th percentile
vm = np.percentile(data, 95)
print("The 95th percentile is {:.0f}; the max amplitude is {:.0f}".format(vm, data.max()))
#Make the plot
plt.figure(figsize=(16,8))
plt.imshow(data.T, cmap="RdBu", vmin=-vm, vmax=vm, aspect='auto')
plt.colorbar()
plt.show()
The 95th percentile is 4365; the max amplitude is 34148
#What else is in the data?
#Print out the segy headers
print(stream.textual_file_header.decode())
C 1 CLIENT SANTOS COMPANY CREW NO C 2 LINE 2000.00 AREA JAMES3D C 3 REEL NO DAY-START OF REEL YEAR OBSERVER C 4 INSTRUMENT MFG MODEL SERIAL NO C 5 DATA TRACES/RECORD 24569 AUXILIARY TRACES/RECORD 0 CDP FOLD 40 C 6 SAMPLE INTERVAL 4.00 SAMPLES/TRACE 1001 BITS/IN BYTES/SAMPLE 4 C 7 RECORDING FORMAT FORMAT THIS REEL SEG-Y MEASUREMENT SYSTEM METERS C 8 SAMPLE CODE FLOATING PT C09 JAMES 3D C10 WESTERNGECO C11 MARCH 2007 C12 VERSION : James3D_pstm_tvfk_gain C13 FILTERED TRIM PSTM STACK C14 C15 GEOMETRY APPLY-TAR-MINP- C16 NOISE REDUCTION - SWATT C17 SC DECON - SCAC C18 RESIDUAL_STATICS C19 TRIM_STATICS - INVERSE_TAR - SORT C20 PSTM - SORT - GAIN C21 TRIM_STATICS - STACK C22 SPECW_10-70HZ -TVF_10-75HZ-TRACE_BALANCE C23 C24 C25 C26 C27 C28 C29 C30 C31 C32 C33 C34 C35 C36 C37 C38 C39 C40 END EBCDIC
#And the header information for a particular trace
print(stream.traces[1013].header)
trace_sequence_number_within_line: 1014
trace_sequence_number_within_segy_file: 1014
original_field_record_number: 2004
trace_number_within_the_original_field_record: 1
energy_source_point_number: 10026
ensemble_number: 10026
trace_number_within_the_ensemble: 28
trace_identification_code: 1
number_of_vertically_summed_traces_yielding_this_trace: 1
number_of_horizontally_stacked_traces_yielding_this_trace: 13
data_use: 1
distance_from_center_of_the_source_point_to_the_center_of_the_receiver_group: 0
receiver_group_elevation: 0
surface_elevation_at_source: 0
source_depth_below_surface: 0
datum_elevation_at_receiver_group: 0
datum_elevation_at_source: 0
water_depth_at_source: 0
water_depth_at_group: 0
scalar_to_be_applied_to_all_elevations_and_depths: 1
scalar_to_be_applied_to_all_coordinates: 1
source_coordinate_x: 482760
source_coordinate_y: 7035836
group_coordinate_x: 482760
group_coordinate_y: 7035836
coordinate_units: 1
weathering_velocity: 0
subweathering_velocity: 0
uphole_time_at_source_in_ms: 0
uphole_time_at_group_in_ms: 0
source_static_correction_in_ms: 0
group_static_correction_in_ms: 0
total_static_applied_in_ms: -70
lag_time_A: 0
lag_time_B: 0
delay_recording_time: 0
mute_time_start_time_in_ms: 0
mute_time_end_time_in_ms: 20
number_of_samples_in_this_trace: 1001
sample_interval_in_ms_for_this_trace: 4000
gain_type_of_field_instruments: 0
instrument_gain_constant: 0
instrument_early_or_initial_gain: 0
correlated: 0
sweep_frequency_at_start: 0
sweep_frequency_at_end: 0
sweep_length_in_ms: 0
sweep_type: 0
sweep_trace_taper_length_at_start_in_ms: 0
sweep_trace_taper_length_at_end_in_ms: 0
taper_type: 0
alias_filter_frequency: 0
alias_filter_slope: 0
notch_filter_frequency: 0
notch_filter_slope: 0
low_cut_frequency: 0
high_cut_frequency: 0
low_cut_slope: 0
high_cut_slope: 0
year_data_recorded: 0
day_of_year: 0
hour_of_day: 0
minute_of_hour: 0
second_of_minute: 0
time_basis_code: 0
trace_weighting_factor: 0
geophone_group_number_of_roll_switch_position_one: 0
geophone_group_number_of_trace_number_one: 0
geophone_group_number_of_last_trace: 0
gap_size: 0
over_travel_associated_with_taper: 0
x_coordinate_of_ensemble_position_of_this_trace: 0
y_coordinate_of_ensemble_position_of_this_trace: 0
for_3d_poststack_data_this_field_is_for_in_line_number: 0
for_3d_poststack_data_this_field_is_for_cross_line_number: -4587520
shotpoint_number: 2004
scalar_to_be_applied_to_the_shotpoint_number: 0
trace_value_measurement_unit: 10026
transduction_constant_mantissa: 0
transduction_constant_exponent: 0
transduction_units: 0
device_trace_identifier: 0
scalar_to_be_applied_to_times: 1052
source_type_orientation: 0
source_energy_direction_mantissa: 0
source_energy_direction_exponent: 1607
source_measurement_mantissa: 0
source_measurement_exponent: 0
source_measurement_unit: 0
#You can automatically extract data you might need from the header
#Get the sample interval from the header info
dt = stream.traces[0].header.sample_interval_in_ms_for_this_trace / 1e6
dt
0.004
A single seismic section can be viewed with this snippet of code:
#Set number of xlines
n=262
#Set start iline
m=0
print(m,m*n,m*n+n)
data = np.stack(t.data for t in stream.traces[m*n:m*n+n])
vm = np.percentile(data, 95)
plt.figure(figsize=(14,4))
plt.imshow(data.T,cmap="RdBu", vmin=-vm, vmax=vm, aspect='auto')
plt.show()
Can you put this in a loop to show multiple sections at once?
…
#Set number of xlines
n=262
#Set start iline
m=0
while m < 10:
print(m,m*n,m*n+n)
data = np.stack(t.data for t in stream.traces[m*n:m*n+n])
vm = np.percentile(data, 95)
plt.figure(figsize=(14,4))
plt.imshow(data.T,cmap="RdBu", vmin=-vm, vmax=vm, aspect='auto')
plt.show()
m=m+1
Sometimes there are no good libraries or data-readers availble for your specific use-case. Very common if you get some specialty instrument with some unique data format. Often the documentation is a good place to start for figuring out how to interpret the data, and more-often-than not, the header of the data can give you all the information you need. Download this VELF file, containing some 3D seismic data. I could not find any good python libraries to handle this dataset, so we can just try out a few things to get the data into a format that is useful for us.
#Imports for plotting
import matplotlib.pyplot as plt
from matplotlib import cm
import pandas as pd
#Open the file for reading
f = open("../data/S3D_Vrms_StkVels_VELF.txt",'r')
#Read in all the lines in the file and save them to a variable
mylist = f.readlines()
#Close the file
f.close()
print("Done reading file.")
Done reading file.
#Print out the first 20 lines
print(mylist[0:20])
['Client: XXX \n', 'Project: YYY\n', 'Contractor: ZZZ\n', 'Date:\n', '\n', 'Velocity type: RMS Velocity in Time\n', '\n', '\n', 'Datum: GDA94, UTM Zone: UTM53, Central Meridian : \n', 'Statics: Two way time corrected to mean sea level: No\n', ' Gun and Cable statics applied: No\n', ' Tidal statics applied: No\n', '\n', '3D Grid details:\n', 'inline crossline X Y\n', '\n', '1000 5000 599413.78 7382223.37\n', '1000 5309 595633.30 7375486.63\n', '1448 5000 609180.96 7376742.28\n', '1448 5309 605400.48 7370005.55\n']
#Set up some empty lists to store each bit of data in
inline=[]
crossline=[]
X=[]
Y=[]
velf=[]
#Loop through all the lines in the file
for i,line in enumerate(mylist):
#First split the line up (by default, split by whitespace)
splitline=line.split()
#If we encounter certain lines, save some data
if i in [16,17,18,19]:
#Print out the lines (check we are doing the right thing)
print(splitline)
inline.append(int(splitline[0]))
crossline.append(int(splitline[1]))
X.append(float(splitline[2]))
Y.append(float(splitline[3]))
#This is where the actual data starts
#Now depending on the key word at the start of each line
#save the data to each particular list/array
#Read the data in again, this time with some thought about what we actually want to do
if i>49:
if splitline[0]=='LINE':
LINE = int(splitline[1])
if splitline[0]=='SPNT':
xline3d=int(splitline[1])
binx=float(splitline[2])
biny=float(splitline[3])
inline3d=int(splitline[4])
if splitline[0]=='VELF':
for j,val in enumerate(splitline[1:]):
#print(j,val)
#Counting from the 0th index of splitline[1:end]
if j%2==0:
t=int(val)
else:
vt=int(val)
velf.append([LINE,xline3d,binx,biny,inline3d,t,vt])
['1000', '5000', '599413.78', '7382223.37']
['1000', '5309', '595633.30', '7375486.63']
['1448', '5000', '609180.96', '7376742.28']
['1448', '5309', '605400.48', '7370005.55']
#Convert the python "list" type to Pandas dataframe
df=pd.DataFrame(velf)
#Set the names of the columns
df.columns=['LINE','xline3d','binx','biny','inline3d','t','vt']
df
LINE | xline3d | binx | biny | inline3d | t | vt | |
---|---|---|---|---|---|---|---|
0 | 1000 | 5080 | 598435.0 | 7380479.0 | 1000 | 0 | 3200 |
1 | 1000 | 5080 | 598435.0 | 7380479.0 | 1000 | 295 | 3300 |
2 | 1000 | 5080 | 598435.0 | 7380479.0 | 1000 | 598 | 4137 |
3 | 1000 | 5080 | 598435.0 | 7380479.0 | 1000 | 738 | 4537 |
4 | 1000 | 5080 | 598435.0 | 7380479.0 | 1000 | 1152 | 4500 |
… | … | … | … | … | … | … | … |
3082 | 1440 | 5280 | 605580.0 | 7370735.0 | 1440 | 2216 | 5259 |
3083 | 1440 | 5280 | 605580.0 | 7370735.0 | 1440 | 2861 | 5791 |
3084 | 1440 | 5280 | 605580.0 | 7370735.0 | 1440 | 3526 | 6294 |
3085 | 1440 | 5280 | 605580.0 | 7370735.0 | 1440 | 4697 | 7077 |
3086 | 1440 | 5280 | 605580.0 | 7370735.0 | 1440 | 5988 | 7748 |
3087 rows × 7 columns
#Plot the target area
plt.scatter(df.binx,df.biny,c=df.xline3d)
plt.colorbar()
plt.show()
#Plot it in 3d just becasue we can
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(df.binx,df.biny,df.inline3d,c=df.xline3d)
ax.view_init(30, 70)
plt.show()
#Now, make some plots...
#One way to do this, is to
#make a 'group' for each unique seismic line
groups=df.groupby(['LINE','xline3d','inline3d'])
#Make plots by certain groupings
#Add a value to spread out the data nicely
i=0
for name,grp in groups:
if name[2]==1280:
print(name)
plt.plot(grp.vt+i,-grp.t)
i+=500
(1280, 5040, 1280)
(1280, 5060, 1280)
(1280, 5080, 1280)
(1280, 5100, 1280)
(1280, 5120, 1280)
(1280, 5140, 1280)
(1280, 5160, 1280)
(1280, 5180, 1280)
(1280, 5200, 1280)
(1280, 5220, 1280)
(1280, 5240, 1280)
(1280, 5260, 1280)
(1280, 5280, 1280)
from scipy.interpolate import interp1d
import numpy as np
#Normal plots
%matplotlib inline
#Fancy intereactive plots
#%matplotlib notebook
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(projection='3d')
for name,grp in groups:
##Plot all the data
ax.plot(grp.binx+grp.vt,grp.biny,grp.t,'b-')
##Plot all the data with colors
# colors=cm.seismic(grp.vt/grp.vt.max())
# ax.scatter(grp.binx+grp.vt,grp.biny,grp.t,c=grp.vt/grp.vt.max(),cmap=cm.seismic)
#Interpolate the data and plot with colors
# x = grp.t
# y = grp.vt
# f = interp1d(x, y, kind='linear')
# num=50
# xnew = np.linspace(0,max(x),num)
# ynew = f(xnew)
# binx = np.linspace(min(grp.binx),max(grp.binx),num)
# biny = np.linspace(min(grp.biny),max(grp.biny),num)
# colours = cm.seismic(ynew/grp.vt.max())
# ax.scatter(binx+xnew,biny,xnew,c=colours)
plt.show()
For example: https://github.com/geological-survey-of-queensland/open-data-api
This is a general API, and can be interfaced with many types of software. Python can do it too! You can write an entire application around APIs provided by developers.
import requests
import json
# set the API 'endpoint'
api = 'https://geoscience.data.qld.gov.au/api/action/'
# construct our query using the rules set out in the documentation
query = api + 'package_search?q=quamby'
# make the get request and store it in the response object
response = requests.get(query)
# view the payload as JSON
json_response = response.json()
#What kind of data does the server return us?
type(json_response)
#json_response
dict
#Dig into the returned data and just pull out certain items we want
jr=json_response['result']['results'][1]['GeoJSONextent']
#Clean up the returned values
mysplit=jr.split(':')[-1].split('}')[0]
#And convert them into something more reasonable.
import ast
xy=ast.literal_eval(mysplit)[0]
#Now you have the data you need in a useful format you can do whatever you like!
xy
[[140.217847053169, -20.0651426032713],
[140.217846866281, -20.0818094534046],
[140.217846679393, -20.098476305538],
[140.217846494505, -20.1151431576712],
[140.217846308617, -20.1318100088045],
[140.20117986795, -20.1318102200127],
[140.201179682062, -20.1484770721459],
[140.184513241396, -20.1484772833552],
[140.18451305651, -20.1651441374885],
[140.184512871624, -20.1818109906216],
[140.184512686738, -20.1984778437548],
[140.184512502851, -20.2151446978879],
[140.184512318964, -20.2318115510211],
[140.201178757624, -20.2318113368118],
[140.201178572735, -20.2484781899449],
[140.201178390847, -20.265145043078],
[140.201178208958, -20.2818118962111],
[140.201178028069, -20.2984787493441],
[140.184511590415, -20.2984789665533],
[140.184511408527, -20.3151458196863],
[140.184511227638, -20.3318126738193],
[140.18451104575, -20.3484795289523],
[140.184510864861, -20.3651463820852],
[140.184510683973, -20.3818132362181],
[140.184510502084, -20.398480089351],
[140.184510219194, -20.4151469904839],
[140.184509936305, -20.4318138906168],
[140.201176401949, -20.4318136554078],
[140.217842867592, -20.4318134202],
[140.234509334236, -20.4318131849934],
[140.25117579888, -20.4318129497879],
[140.267842265524, -20.4318127145836],
[140.284508731169, -20.4318124793805],
[140.301175196813, -20.4318122431786],
[140.317841659457, -20.4318120809779],
[140.317841839357, -20.4151452358448],
[140.317842020257, -20.3984783917118],
[140.317842205157, -20.3818115465788],
[140.317842390056, -20.3651447014456],
[140.317842576955, -20.3484778563125],
[140.317842761854, -20.3318110101794],
[140.317842946753, -20.3151441640463],
[140.317843132652, -20.2984773189131],
[140.31784331755, -20.2818104737799],
[140.317843502448, -20.2651436286467],
[140.30117702579, -20.2651437498476],
[140.284510586133, -20.2651439660497],
[140.284510403237, -20.2818108161829],
[140.284510219342, -20.2984776663161],
[140.284510036446, -20.3151445164492],
[140.28450985255, -20.3318113665823],
[140.267843414897, -20.3318115837855],
[140.267843232003, -20.3484784349186],
[140.251176794352, -20.348478653123],
[140.251176612458, -20.365145505256],
[140.251176429564, -20.381812356389],
[140.234509992916, -20.3818125765945],
[140.234509810023, -20.3984794277275],
[140.217843374376, -20.3984796479342],
[140.217843556268, -20.3818127958012],
[140.21784373916, -20.3651459436682],
[140.217843920051, -20.3484790915352],
[140.217844101942, -20.3318122384023],
[140.217844284833, -20.3151453852692],
[140.234510722486, -20.3151451680625],
[140.234510904378, -20.2984783169294],
[140.234511087269, -20.2818114637963],
[140.234511270161, -20.2651446126632],
[140.234511451052, -20.24847776053],
[140.234511637943, -20.2318109083969],
[140.234511822834, -20.2151440572637],
[140.234512008724, -20.1984772051305],
[140.234512194615, -20.1818103539973],
[140.234512379505, -20.1651435028641],
[140.25117882117, -20.1651432906583],
[140.251179007062, -20.148476440525],
[140.251179192953, -20.1318095883917],
[140.251179378844, -20.1151427382584],
[140.251179564734, -20.098475887125],
[140.251179752625, -20.0818090379917],
[140.251179940515, -20.0651421878583],
[140.234513496842, -20.0651423950642],
[140.217847053169, -20.0651426032713]]