# Import dask dataframe module
import dask.dataframe as dd
#import dask
#dask.config.set({"visualization.engine": "cytoscape"})
# NOTE: to run this example (with diagrams) you will need to "pip install graphviz" and download Graphviz
# https://graphviz.org/download/
import os
#os.environ["PATH"] += os.pathsep + 'C:/APPS/Graphviz/bin'
Working with Big Data using Dask
Questions
- How can I use a modern Python library with elegant syntax for performance benefits?
- How do I deal with large, irregular data, and what are some real-world examples of using Dask?
Objectives
- Introduce Dask concepts and high-level data structures
- Use Dask DataFrames
- Use Dask delayed functions
- Deal with semi-structured and unstructured data in a memory-efficient and parallel manner
- Show examples of using Dask on large datasets
DASK
Dask is a flexible library for parallel computing in Python.
Dask is composed of two parts:
- Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
- "Big Data" collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of dynamic task schedulers (see the sketch after this list).
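As a minimal sketch of the collections idea (using only the public dask.array API), a Dask array mirrors the NumPy interface while splitting the data into chunks and deferring all work until compute() is called:
import dask.array as da

# 100 chunks of 1000x1000; nothing is allocated or computed yet
x = da.ones((10000, 10000), chunks=(1000, 1000))
result = (x + x.T).mean()  # still lazy: this only builds a task graph
print(result.compute())    # the scheduler now runs the chunked work in parallel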
Dask emphasizes the following virtues:
- Familiar: Provides parallelized NumPy array and Pandas DataFrame objects
- Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
- Native: Enables distributed computing in pure Python with access to the PyData stack.
- Fast: Operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms
- Scales up: Runs resiliently on clusters with 1000s of cores
- Scales down: Trivial to set up and run on a laptop in a single process
- Responsive: Designed with interactive computing in mind, it provides rapid feedback and diagnostics to aid humans
Dask provides high-level collections - these are Dask DataFrames, bags, and arrays. At a low level, Dask uses dynamic task schedulers to scale processes up or down, and represents parallel computations as task graphs. It provides an alternative to threading (IO-bound) and multiprocessing (CPU-bound) for scaling out tasks.
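To make the task-graph idea concrete, here is a minimal sketch using dask.delayed: each decorated call records a node in the graph instead of executing immediately, and nothing runs until compute() is called.
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(x, y):
    return x + y

a = inc(1)              # lazy: returns a Delayed object
b = inc(2)              # lazy: independent of a, so it can run in parallel
total = add(a, b)       # builds a three-node task graph
print(total.compute())  # the scheduler executes the graph -> 5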
A Dask DataFrame is a large parallel DataFrame composed of many smaller Pandas DataFrames, split along the index. These Pandas DataFrames may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster. One Dask DataFrame operation triggers many operations on the constituent Pandas DataFrames.
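Here is a minimal sketch of that idea, using a toy pandas DataFrame so the partitioning is visible:
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": range(8), "y": list("aabbccdd")})
ddf = dd.from_pandas(pdf, npartitions=4)  # 4 smaller pandas DataFrames, split along the index

# one Dask operation fans out across every underlying pandas partition
print(ddf.groupby("y").x.sum().compute())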
Common Use Cases: Dask DataFrame is used in situations where Pandas is commonly needed, usually when Pandas fails due to data size or computation speed:
- Manipulating large datasets, even when those datasets don't fit in memory
- Accelerating long computations by using many cores
- Distributed computing on large datasets with standard Pandas operations like groupby, join, and time series computations
Dask DataFrames may not be the best choice if:
- Your data fits comfortably in RAM - use Pandas only!
- You need a proper database.
- You need functions not implemented by Dask DataFrames - see Dask Delayed.
Dask Dataframes
We will load in some data to explore.
# Set up a parallel LocalCluster that makes use of all the cores and RAM we have on a single machine
from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
# explicitly connect to the cluster we just created
client = Client(cluster)
client
/Users/darya/miniconda3/envs/geopy/lib/python3.9/site-packages/distributed/node.py:183: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 54995 instead
  warnings.warn(

Client: Client-eac15e50-22ac-11ed-9bb4-fe453513c759
Connection method: Cluster object | Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:54995/status

LocalCluster: fbc990ef
Status: running | Using processes: True
Workers: 5 | Total threads: 10 | Total memory: 32.00 GiB
(each worker has 2 threads and 6.40 GiB of memory)
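The Dashboard link shown in the output can be opened in a browser to monitor the workers and watch tasks execute while computations run.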
All materials copyright Sydney Informatics Hub, University of Sydney