Important: This is a pre-release version of dataharvester which is undergoing active development. Please report any bugs in the issues tab.

dataharvester is an R interface to the Geodata-Harvester. Use dataharvester to preprocess, aggregate, visualise and download geospatial data from a range of Australian (and international) data sources.

With connectivity to the Earth Engine API, perform petabyte-scale operations which include temporal cloud/shadow masking and automatic calculation of spectral indices (e.g. NDVI) for the following collections:

  • Landsat 5 (TM), 7 (ETM+), 8 (OLI/TIRS) and 9 (OLI-2/TIRS-2)
  • Sentinel-2 (Surface Reflectance) and Sentinel-3 (Ocean and Land Colour Instrument)
  • MODIS (or Moderate Resolution Imaging Spectroradiometer) products

dataharvester does not officially support other Earth Engine collections, but users can still perform basic operations on them to filter, reduce, visualise and download data.

Demo

Below is the output produced when the following 3 lines of code are run on a configuration file that downloads 16 images from 7 API sources:

library(dataharvester)
initialise_harvester("r-harvester")
harvest("data/settings_harvest.yaml", plot = TRUE, contour = FALSE)

Installation

Install the development version of this package from GitHub using install_github() from remotes or devtools:

# install.packages("remotes") # uncomment and run this line if necessary
remotes::install_github("sydney-informatics-hub/dataharvester")

library(dataharvester) # load package

Examples

“Headless” run

Run initialise_harvester() after loading the package. The function initialises the package, verifies its dependencies and, optionally, connects to the Earth Engine API (when called with earthengine = TRUE). The default environment name is r-harvester, but you can provide a different name if that environment is already being used by other projects.

The first initialisation may take a few minutes if dependencies need to be installed.

library(dataharvester)
initialise_harvester("r-harvester", earthengine = TRUE)

Note: connecting to the Earth Engine API requires an existing Google account. Depending on your OS, a web browser popup may appear where you will be guided to connect to the API. Otherwise, a link will be provided, and you will need to follow the instructions to finalise the authentication process. More information can be found in the documentation.

Then, run harvest(), which parses a YAML config file:

harvest(path_to_config = "path/to/config.yaml")

An example config file and data are provided in data/settings_harvest.yaml.
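
Because harvest() is driven entirely by the YAML file, it can help to read the settings into R and inspect them before starting a run. The sketch below is only an illustration: inspect_config() is a hypothetical helper that is not part of dataharvester, and it assumes the yaml package is installed.

```r
# Hypothetical helper (not part of dataharvester): read a YAML settings file
# into an R list so it can be inspected before calling harvest().
# Assumes the 'yaml' package is installed.
inspect_config <- function(path) {
  if (!file.exists(path)) {
    stop("config file not found: ", path)
  }
  yaml::read_yaml(path)
}

# Usage (not run):
# settings <- inspect_config("data/settings_harvest.yaml")
# str(settings)
```

Failing early on a missing file gives a clearer message than an error raised part-way through a download.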

Manual downloads

The harvest() function is essentially a wrapper for all the API calls available in the Data-Harvester. Individual functions are available for each of the data sources. For example, the code below downloads Bulk Density data from SLGA for a specified region and saves it to the downloads folder in your working directory:

slga <- download_slga(
  layer = "Bulk_Density",
  bounding_box = c(149.769345, -30.335861, 149.949173, -30.206271),
  out_path = "downloads/",
  resolution = 6,
  depth_min = 0,
  depth_max = 5
)
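
The bounding_box values above appear to be ordered as minimum longitude, minimum latitude, maximum longitude, maximum latitude. That ordering is inferred from the example values, not from documented behaviour. A minimal sketch of a sanity check to run before a download, using a hypothetical helper check_bbox() that is not part of dataharvester:

```r
# Hypothetical helper (not part of dataharvester): sanity-check a bounding box
# assumed to be ordered as c(lon_min, lat_min, lon_max, lat_max).
check_bbox <- function(bbox) {
  if (!is.numeric(bbox) || length(bbox) != 4 || anyNA(bbox)) {
    stop("bounding box must be four non-missing numbers")
  }
  if (bbox[1] >= bbox[3] || bbox[2] >= bbox[4]) {
    stop("expected c(lon_min, lat_min, lon_max, lat_max)")
  }
  if (any(abs(bbox[c(1, 3)]) > 180) || any(abs(bbox[c(2, 4)]) > 90)) {
    stop("coordinates fall outside valid longitude/latitude ranges")
  }
  invisible(bbox)
}

# The region used in the SLGA example passes the check
bbox <- check_bbox(c(149.769345, -30.335861, 149.949173, -30.206271))
```

Swapped corners, for example c(150, -30, 149, -29), raise an error instead of silently requesting an empty region.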

For more examples, see tests/test_functions.R.

The Google Earth Engine functions aim to simplify the most common tasks for beginners. For example, the code below does all of the following (in 9 lines of code):

  • Accesses the Google Earth Engine Data Catalog
  • Filters by date and coordinates
  • Applies scaling, offsetting, and cloud and shadow masking
  • Automatically calculates Awesome Spectral Indices (e.g. NDVI)
  • Stretches pixel values for image preview
  • Maps the results
  • Downloads image(s) at a specified resolution and location
  • Automatically splits large images so that downloads can exceed the file size limit(s) set by the API (~32 MB)

# Make sure that GEE API is initialised
initialise_harvester("r-harvester", earthengine = TRUE)
# 1. Define dataset
img <- collect_ee(
  collection = "LANDSAT/LC08/C02/T1_L2",
  coords = c(149.799, -30.31, 149.80, -30.309),
  date_min = "2019-01-01",
  date_max = "2019-02-01"
)
# 2. Preprocess (cloud masking, scale and offsetting enabled by default)
img <- preprocess_ee(img, mask_clouds = TRUE, reduce = "median", spectral = "NDVI")
# 3. Visualise (optional)
img <- map_ee(img, bands = "NDVI_median")
# 4. Download at 100m/px resolution
img <- download_ee(img, bands = "NDVI", scale = 100, out_path = "downloads/")
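
For intuition about the scaling, offsetting and NDVI calculation in step 2, the arithmetic can be sketched in plain R. This is an illustration only, not the package's implementation: the scale factor (0.0000275) and offset (-0.2) are the published values for Landsat Collection 2 Level-2 surface reflectance, the digital numbers are made up, and scale_sr() and ndvi() are hypothetical helper names.

```r
# Illustration only: plain R arithmetic, not the package's implementation.
# Landsat Collection 2 Level-2 surface reflectance is stored as scaled
# integers: reflectance = DN * 0.0000275 - 0.2
scale_sr <- function(dn) dn * 0.0000275 - 0.2

# NDVI = (NIR - Red) / (NIR + Red)
ndvi <- function(nir, red) (nir - red) / (nir + red)

# Made-up digital numbers for three pixels
# (on Landsat 8, SR_B5 is the near-infrared band and SR_B4 is the red band)
nir_dn <- c(20000, 18000, 12000)
red_dn <- c(9000, 10000, 11000)

ndvi_values <- ndvi(scale_sr(nir_dn), scale_sr(red_dn))
print(round(ndvi_values, 3))
```

Pixels with a larger gap between near-infrared and red reflectance (typically denser vegetation) give NDVI values closer to 1.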

Note that collect_ee(), preprocess_ee(), map_ee() and download_ee() can be chained using the native pipe (|>) or the magrittr pipe (%>%, also available through dplyr).
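
A minimal sketch of the same four steps written as a single chain with the native pipe. It assumes that each function accepts the image object as its first argument, as in the step-by-step example, and it wraps the chain in a hypothetical function, ndvi_pipeline(), so that nothing runs until it is called:

```r
# Sketch only: calling ndvi_pipeline() requires dataharvester and an
# initialised Earth Engine session. The native pipe (|>) needs R >= 4.1.
ndvi_pipeline <- function(out_path = "downloads/") {
  dataharvester::collect_ee(
    collection = "LANDSAT/LC08/C02/T1_L2",
    coords = c(149.799, -30.31, 149.80, -30.309),
    date_min = "2019-01-01",
    date_max = "2019-02-01"
  ) |>
    dataharvester::preprocess_ee(mask_clouds = TRUE, reduce = "median", spectral = "NDVI") |>
    dataharvester::map_ee(bands = "NDVI_median") |>
    dataharvester::download_ee(bands = "NDVI", scale = 100, out_path = out_path)
}

# Usage (not run):
# initialise_harvester("r-harvester", earthengine = TRUE)
# img <- ndvi_pipeline()
```

Chaining avoids reassigning img at every step; the step-by-step form remains easier to debug when one stage fails.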

Alternatively, provide a config file (see, e.g., data/settings_harvest_gee.yaml) and run all steps automatically with:

img <- auto_ee("path/to/config.yaml")

These functions are not meant to cover the full functionality of the GEE API. At any time, you can take the underlying Earth Engine object and process it directly with the API, in this case through rgee in R.

# Extract ee.Image or ee.ImageCollection object
gee_obj <- img$ee_image

# Now use `gee_obj` anywhere that can process Earth Engine Object data...

Acknowledgements

Acknowledgements are an important way for us to demonstrate the value we bring to your research. Your research outcomes are vital for ongoing funding of the Sydney Informatics Hub. If you make use of this software for your research project, please include the following acknowledgement:

“This research was supported by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney, and the Agricultural Research Federation (AgReFed).”

Attribution

This software was developed by the Sydney Informatics Hub, a core research facility of the University of Sydney, as part of the Data Harvesting project for the Agricultural Research Federation (AgReFed). AgReFed is supported by the Australian Research Data Commons (ARDC) and the Australian Government through the National Collaborative Research Infrastructure Strategy (NCRIS).

We would also like to acknowledge the use of the following Python packages in dataharvester, as we have used their code in the Data-Harvester: geemap, wxee and eemont.

License

© 2023 The University of Sydney

This is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License (LGPL version 3) as published by the Free Software Foundation.