Polars
Polars is a lightning-fast DataFrame library. The key features of polars are:
Fast and Accessible: Written from scratch in Rust, designed close to the machine and without external dependencies. It also has Python and R bindings!
I/O: First class support for all common data storage layers: local, cloud storage & databases.
Intuitive API: Write your queries the way they were intended. Polars, internally, will determine the most efficient way to execute them using its query optimizer.
Out of Core: Handle datasets larger than RAM. The streaming API allows you to process your results without requiring all your data to be in memory at the same time.
Parallel: Utilises the power of your machine by dividing the workload among the available CPU cores without any additional configuration.
The philosophy of Polars is to provide a dataframe library that utilises the available cores, has an intuitive API and is performant - hence it adheres to a strict schema (data types should be known before running the query).
Polars concepts
DataFrames and Series are the primary data structures supported by polars. Polars describes the common operations available on them as Contexts and Expressions.
Expressions are essentially data transformations. Polars automatically parallelizes each expression - a main reason why polars is so quick. A single expression may itself contain multiple sub-expressions, each of which is parallelized. Polars performs query optimisations on every expression.
Contexts refer to the setting in which an expression is evaluated (a short sketch follows the list below). Specifically:
select : selecting columns
filter : filtering rows
with_columns : create new columns or transform existing ones
group_by : group by a factor and follow with
agg : aggregation
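To make contexts concrete, here is a minimal sketch (the toy DataFrame is illustrative only, not part of the original examples) showing the same expression evaluated in three different contexts:
import polars as pl

toy = pl.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 2.0, 3.0]})

toy.select(pl.col("value").mean())  # select context: evaluated over the whole frame
toy.with_columns(pl.col("value").mean().alias("mean_value"))  # with_columns: result broadcast to every row
toy.group_by("group").agg(pl.col("value").mean())  # group_by/agg: evaluated once per group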
# fetching some data ----------------------------------------------------------------
import polars as pl
from datetime import datetime
!pip install ucimlrepo
from ucimlrepo import fetch_ucirepo  # fetch iris dataset
import pandas as pd

iris = fetch_ucirepo(id=53)
X = iris.data.features
y = iris.data.targets
iris_df = pd.concat([X, y], axis=1)
iris_df.rename(columns={'class': 'species'}, inplace=True)
iris_df.to_csv("iris_data.csv", index=False)
# create a polars.DataFrame
df = pl.DataFrame(
    {
        "integer": [1, 2, 3, 4],
        "date": [
            datetime(2025, 1, 1),
            datetime(2025, 1, 2),
            datetime(2025, 1, 3),
            datetime(2025, 1, 3),
        ],
        "float": [4.0, 5.0, 6.0, 12],
        "string": ["a", "b", "c", "b"],
    }
)

# values in a Series need to be of the same type
sr = pl.Series([1, 2, 3, 4, 500])
# can specify the data type for better performance
sr = pl.Series([1, 2, 3, 4, 500], dtype=pl.Int64)
# To inspect the data
df.describe()   # summary statistics
df.schema       # column names and types
df.sample(n=2)  # random sample of n rows
df.head(3)      # first 3 rows
integer | date | float | string |
---|---|---|---|
i64 | datetime[μs] | f64 | str |
1 | 2025-01-01 00:00:00 | 4.0 | "a" |
2 | 2025-01-02 00:00:00 | 5.0 | "b" |
3 | 2025-01-03 00:00:00 | 6.0 | "c" |
Getting to know the syntax
We will cover some basic syntax. Check out the API documentation for more information.
Reading and Writing Data
Polars supports reading and writing many file types. It also supports reading from databases and cloud storage. See the IO Documentation for more information.
# writing data to disk
df.write_csv("output.csv")
df.write_parquet("output.parquet")

# reading back into a polars dataframe
new_df = pl.read_csv("output.csv")
new_df = pl.read_parquet("output.parquet")
new_df
# can convert back to pandas dataframe (or other types)
df.to_pandas()
 | integer | date | float | string |
---|---|---|---|---|
0 | 1 | 2025-01-01 | 4.0 | a |
1 | 2 | 2025-01-02 | 5.0 | b |
2 | 3 | 2025-01-03 | 6.0 | c |
3 | 4 | 2025-01-03 | 12.0 | b |
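The conversion also works in the other direction; assuming pandas is installed, pl.from_pandas builds a polars DataFrame from a pandas one (a quick sketch with a throwaway frame):
import pandas as pd

pd_df = pd.DataFrame({"x": [1, 2, 3]})  # a toy pandas DataFrame
pl_df = pl.from_pandas(pd_df)           # pandas.DataFrame -> polars.DataFrame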
# selecting columns ----------------------------------------------------------------
df.select(pl.col("float"))                # selecting a column
df.select(pl.col("date", "string"))       # selecting multiple columns
df.select(pl.col("*").exclude("string"))  # select all columns then exclude
df.select(pl.col("^(da|fl).*$"))          # supports regular expressions

# selectors are intuitive helper functions for selecting columns by name or type
import polars.selectors as cs
df.select(cs.integer() | cs.contains("ate"))  # select all columns that are integers or contain "ate"
# filtering rows ---------------------------------------------------------------------
df.filter(pl.col("integer") >= 2)  # filtering rows

# filtering rows with multiple conditions (| = or, & = and)
df.filter((pl.col("integer") >= 2) & (pl.col("float") == 5.0))
# creating / manipulating columns -----------------------------------------------------
df.with_columns((pl.col("integer") + 3).alias("new_column"))  # creating a column and naming it
# group by aggregations ----------------------------------------------------------------
df.group_by("string").agg(
    pl.col("integer").sum().alias("sum"),
    pl.col("date").sort().first().alias("earliest"),
    pl.col("float") / pl.col("integer"),
)
string | sum | earliest | float |
---|---|---|---|
str | i64 | datetime[μs] | list[f64] |
"b" | 6 | 2025-01-02 00:00:00 | [2.5, 3.0] |
"a" | 1 | 2025-01-01 00:00:00 | [4.0] |
"c" | 3 | 2025-01-03 00:00:00 | [2.0] |
# Can combine expressions for compactness
df3 = df.with_columns(
    (pl.col("float") * pl.col("integer")).alias("product")
).select(pl.all().exclude("integer"))
Links to documentation:
Data Transformations such as join, concatenation, pivot and unpivot (a quick sketch of the first two follows below).
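As a taste of those operations, a minimal sketch on two toy frames (illustrative only, not taken from the linked documentation):
left = pl.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pl.DataFrame({"id": [2, 3, 4], "score": [0.5, 0.7, 0.9]})

left.join(right, on="id", how="inner")   # join on a shared key
pl.concat([left, left], how="vertical")  # stack frames that share a schema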
Data types and casting
Most data types follow the Apache Arrow specification, with the exception of String, Categorical and Object types.
Categorical data represents string columns whose values come from a finite set (implemented differently from plain strings for performance). Polars supports both the Enum data type, where the categories are known up front, and the more flexible Categorical data type, where the values are not known beforehand. Conversion between them is trivial. Relying on polars to infer the categories with the Categorical type comes at a performance cost. See the Categorical page for more information.
Casting (changing the datatypes) is enabled by either specifying the dtype argument or applying the cast() function.
# Use Enum where categories are known up front
cat_types = pl.Enum(["polar", "panda", "teddy"])
animals = pl.Series(["polar", "polar", "teddy", "panda"], dtype=cat_types)
# Use Categorical otherwise
fictional_animals = pl.Series(["poobear", "minimouse", "teddy", "poobear"], dtype=pl.Categorical)

# casting columns to other data types with cast
df.cast({"integer": pl.Float32, "float": pl.UInt8})
integer | date | float | string |
---|---|---|---|
f32 | datetime[μs] | u8 | str |
1.0 | 2025-01-01 00:00:00 | 4 | "a" |
2.0 | 2025-01-02 00:00:00 | 5 | "b" |
3.0 | 2025-01-03 00:00:00 | 6 | "c" |
4.0 | 2025-01-03 00:00:00 | 12 | "b" |
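Circling back to the claim that converting between Categorical and Enum is trivial, a small sketch (this assumes every observed value appears in the Enum's categories; otherwise the cast raises an error):
# Categorical -> Enum, once the full set of categories is known
fictional_enum = fictional_animals.cast(pl.Enum(["poobear", "minimouse", "teddy"]))

# Enum -> Categorical, back to the flexible type
fictional_cat = fictional_enum.cast(pl.Categorical)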
Lazy / Eager and Streaming
Polars supports two modes of operation: lazy and eager. In the eager API the query is executed immediately while in the lazy API the query is only evaluated once it is ‘needed’. Deferring the execution to the last minute can have significant performance advantages.
An example of using the eager API is below. Every step is executed immediately returning the intermediate results. This can be very wasteful as we might do work or load extra data that is not being used.
df = pl.read_csv("iris_data.csv")  # read the iris dataset
df_small = df.filter(pl.col("sepal length") > 5)  # filter
df_agg = df_small.group_by("species").agg(pl.col("sepal width").mean())  # mean of the sepal width per species
If we instead used the lazy API and waited on execution until all the steps are defined then the query planner could perform various optimizations.
q = (
    pl.scan_csv("iris_data.csv")  # doesn't read the whole file before the other operations are performed
    .filter(pl.col("sepal length") > 5)
    .group_by("species").agg(pl.col("sepal width").mean())
)

q  # a LazyFrame

df_agg = q.collect()                # inform polars that you want to execute the query
df_agg = q.collect(streaming=True)  # streaming mode to process in batches
Streaming
One additional benefit of the lazy API is that it allows queries to be executed in a streaming manner. Instead of processing the data all-at-once Polars can execute the query in batches allowing you to process datasets that are larger-than-memory. See here for more info on streaming.
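When the result itself is too large for memory, a lazy query can also stream straight to disk; a minimal sketch, assuming the query is supported by the streaming engine (sink_parquet is part of the LazyFrame API):
# execute in streaming mode and write the result to disk,
# without materialising it in memory
(
    pl.scan_csv("iris_data.csv")
    .filter(pl.col("sepal length") > 5)
    .sink_parquet("filtered_iris.parquet")
)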
When to use Lazy versus Eager:
In general the lazy API should be preferred, unless you are either interested in the intermediate results or are doing exploratory work and don't know yet what your query is going to look like.
When using lazy mode, apply filters as early as possible, before reading the data, and only select the columns you need.
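You can check what the query optimizer actually does with explain(); in this sketch the column selection and the filter should be pushed down into the CSV scan, so unused data is never read:
q = (
    pl.scan_csv("iris_data.csv")
    .select(["species", "sepal width"])  # projection: only read the columns we need
    .filter(pl.col("sepal width") > 3)   # predicate: applied while scanning
)
print(q.explain())  # prints the optimized query plan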
Common Machine Learning Workflow
Given your new knowledge of polars, here is an example of how to integrate it into the usual pipeline consisting of data ingestion and manipulation, model preparation and prediction.
# loading libraries
!pip install ucimlrepo
!pip install scikit-learn==1.4
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from matplotlib import pylab as plt

# Fit a decision tree from polars dataframe objects
df = pl.read_csv("iris_data.csv")
data = df.select(['sepal length', 'sepal width', 'petal length', 'petal width'])
target = df.select(['species']).to_series()
clf = DecisionTreeClassifier()
model = clf.fit(data, target)
# Plot the decision tree
plot_tree(clf, filled=True, feature_names=data.columns)
plt.title("Decision tree trained on all the iris features")
plt.show()
# Add predictions to the polars dataframe
predict_df = pl.concat(
    [df, pl.DataFrame(model.predict(data), schema=["predict"])],
    how="horizontal",
)
print(predict_df.sample(8))
shape: (8, 6)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────────────┬─────────────────┐
│ sepal length ┆ sepal width ┆ petal length ┆ petal width ┆ species         ┆ predict         │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---             ┆ ---             │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str             ┆ str             │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════════════╪═════════════════╡
│ 6.0          ┆ 3.4         ┆ 4.5          ┆ 1.6         ┆ Iris-versicolor ┆ Iris-versicolor │
│ 6.3          ┆ 2.8         ┆ 5.1          ┆ 1.5         ┆ Iris-virginica  ┆ Iris-virginica  │
│ 5.7          ┆ 4.4         ┆ 1.5          ┆ 0.4         ┆ Iris-setosa     ┆ Iris-setosa     │
│ 5.8          ┆ 2.7         ┆ 5.1          ┆ 1.9         ┆ Iris-virginica  ┆ Iris-virginica  │
│ 6.6          ┆ 3.0         ┆ 4.4          ┆ 1.4         ┆ Iris-versicolor ┆ Iris-versicolor │
│ 6.9          ┆ 3.1         ┆ 5.4          ┆ 2.1         ┆ Iris-virginica  ┆ Iris-virginica  │
│ 6.7          ┆ 2.5         ┆ 5.8          ┆ 1.8         ┆ Iris-virginica  ┆ Iris-virginica  │
│ 4.8          ┆ 3.0         ┆ 1.4          ┆ 0.3         ┆ Iris-setosa     ┆ Iris-setosa     │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────────────┴─────────────────┘
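To sanity-check the fit, the in-sample accuracy can be computed with a single polars expression (a quick sketch; the model is evaluated on its own training data here, so this is not a proper validation):
# fraction of rows where the prediction matches the label
print(predict_df.select((pl.col("species") == pl.col("predict")).mean().alias("accuracy")))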
Ecosystem
On the Supported Polars Ecosystem page you can find a non-exhaustive list of libraries and tools that support Polars.
As the data ecosystem is evolving fast, more libraries will likely support Polars in the future.