Highlights of the Ibis 1.3 release

Ibis 1.3 has just been released, after 8 months of development work, with 104 new commits from 16 unique contributors. What is new? In this blog post we discuss some of the most important features of this new version!

First, if you are new to the Ibis framework, you can check out the blog post I wrote last year with some introductory information about it.

Some highlighted features of this new version are:

  • Addition of a PySpark backend
  • Improvement of geospatial support
  • Addition of JSON, JSONB and UUID data types
  • Initial support for Python 3.8 added and support for Python 3.5 dropped
  • Added new backends and geospatial methods to the documentation
  • Renamed the mapd backend to omniscidb

This blog post is divided into different sections:

  • OmniSciDB
  • PostgreSQL
  • PySpark
  • Geospatial support
  • Python versions support

In [1]:
import ibis
import pandas as pd

OmniSciDB

The mapd backend is now named omniscidb!

An important new feature of omniscidb is that you can now define whether the connection uses IPC (Inter-Process Communication), and you can also specify the GPU device ID you want to use (if you have an NVIDIA card supported by cudf).

IPC is used to share data between processes; OmniSciDB provides its IPC support through Apache Arrow.

In [2]:
con_omni = ibis.omniscidb.connect(
    host='localhost', 
    port='6274',
    user='admin',
    password='HyperInteractive',
    database='ibis_testing',
    ipc=False,
    gpu_device=None
)
con_omni.list_tables()
Out[2]:
['diamonds', 'batting', 'awards_players', 'functional_alltypes', 'geo']

You can also now pass ipc or gpu_device directly to the execute method:

In [3]:
t = con_omni.table('functional_alltypes')
expr = t[['id', 'bool_col']].head(5)
df = expr.execute(ipc=False, gpu_device=None)
df
Out[3]:
id bool_col
0 6690 True
1 6691 False
2 6692 True
3 6693 False
4 6694 True

As you can imagine, df is a pandas.DataFrame:

In [4]:
type(df)
Out[4]:
pandas.core.frame.DataFrame

But if you use gpu_device, the result will be a cudf DataFrame :)

Note: when ipc=True is used, the code needs to run on the same machine where the database is running.

Note: when gpu_device is used, 1) IPC is used as well (see the note above) and 2) an NVIDIA card supported by cudf is required.
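
For illustration, a GPU-enabled connection could look like the sketch below. It is not executed in this post; it assumes the database runs on this same machine, that an NVIDIA card supported by cudf is available, and the device ID 0 is just an example:

# hypothetical GPU-enabled connection, reusing the same credentials as above
con_omni_gpu = ibis.omniscidb.connect(
    host='localhost',
    port='6274',
    user='admin',
    password='HyperInteractive',
    database='ibis_testing',
    ipc=True,       # GPU results require IPC (data shared via Apache Arrow)
    gpu_device=0    # NVIDIA device ID
)
# con_omni_gpu.table('functional_alltypes').head(5).execute()  # -> cudf DataFrame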

Another interesting feature is that omniscidb now also supports shapefiles (input) and GeoPandas (output)!

Check out the Geospatial support section below to see more details!
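
For example, here is a quick sketch of the GeoPandas output (not shown in the original notebook output; it reuses the geo table we saw in list_tables() above):

# executing a table with geospatial columns should return a GeoDataFrame
df_geo_omni = con_omni.table('geo').head(2).execute()
type(df_geo_omni)  # expected: geopandas.geodataframe.GeoDataFrame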

The new version also adds translations for more window operations in the omniscidb backend, such as DenseRank, RowNumber, MinRank, Count, and PercentRank/CumeDist.

For more information about window operations, check the Window functions documentation section.
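
As a minimal sketch of one of these window operations on the omniscidb backend (reusing the functional_alltypes table from above; the choice of columns here is purely illustrative):

# rank rows within each string_col group, ordered by id
t = con_omni.table('functional_alltypes')
w = ibis.window(group_by=t.string_col, order_by=t.id)
expr = t[
    t.id,
    t.string_col,
    t.id.rank().over(w).name('rank_in_group'),
]
expr.head(5).execute()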

PostgreSQL

Among the highlighted features for the PostgreSQL backend are the newly included data types: JSON, JSONB and UUID.

In [5]:
from uuid import uuid4 
uuid_value = ibis.literal(uuid4(), type='uuid')
uuid_value == ibis.literal(uuid4(), type='uuid')
Out[5]:
In [6]:
import json
json_value = ibis.literal(json.dumps({"id": 1}), type='json')
json_value
Out[6]:
In [7]:
jsonb_value = ibis.literal(json.dumps({"id": 1}).encode('utf8'), type='jsonb')
jsonb_value
Out[7]:
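
Since these literals are created by passing the type as a string, the new data types are also understood by Ibis' data type parser. Here is a small sketch (with hypothetical column names) of declaring a schema that uses them:

# declare a schema using the new uuid/json/jsonb data types
sch = ibis.schema([
    ('user_id', 'uuid'),
    ('payload', 'json'),
    ('payload_bin', 'jsonb'),
])
sch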

Another important new feature of the PostgreSQL backend is support for new geospatial operations, such as the following (a short usage sketch comes after the list):

  • GeometryType
  • GeometryN
  • IsValid
  • LineLocatePoint
  • LineMerge
  • LineSubstring
  • OrderingEquals
  • Union

It also now supports two more geospatial data types: MULTIPOINT and MULTILINESTRING.

Check out the Geospatial support section below to see some usage examples of geospatial operations!
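
As a rough sketch of how a couple of these operations might be used: this assumes they are exposed as snake_case methods on geospatial columns, and it uses the geo table and the con_psql connection created in the Geospatial support section below:

# hypothetical usage; method names assumed from the operation names above
t_geo = con_psql.table('geo')
t_geo[
    t_geo.geo_linestring.line_merge().name('merged_line'),
    t_geo.geo_polygon.is_valid().name('polygon_is_valid'),
].execute()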

PySpark

This new version also includes support for a new backend: PySpark!

Let's take our first steps with this new backend, starting by creating a Spark session.

In [8]:
import os

import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.types as pt
from pathlib import Path

# spark session and pyspark connection
spark_session = SparkSession.builder.getOrCreate()
con_pyspark = ibis.pyspark.connect(session=spark_session)

We can use Spark or pandas to read the CSV file. In this example, we will use pandas (a Spark-based alternative is sketched after the output below).

In [9]:
data_directory = Path(
    os.path.join(
        os.path.dirname(ibis.__path__[0]),
        'ci',
        'ibis-testing-data'
    )
)

pd_df_alltypes = pd.read_csv(data_directory / 'functional_alltypes.csv')
pd_df_alltypes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7300 entries, 0 to 7299
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   index            7300 non-null   int64  
 1   Unnamed: 0       7300 non-null   int64  
 2   id               7300 non-null   int64  
 3   bool_col         7300 non-null   int64  
 4   tinyint_col      7300 non-null   int64  
 5   smallint_col     7300 non-null   int64  
 6   int_col          7300 non-null   int64  
 7   bigint_col       7300 non-null   int64  
 8   float_col        7300 non-null   float64
 9   double_col       7300 non-null   float64
 10  date_string_col  7300 non-null   object 
 11  string_col       7300 non-null   int64  
 12  timestamp_col    7300 non-null   object 
 13  year             7300 non-null   int64  
 14  month            7300 non-null   int64  
dtypes: float64(2), int64(11), object(2)
memory usage: 855.6+ KB
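
For reference, reading the same file directly with Spark would look roughly like this (a sketch, not used in the rest of this post):

# Spark-based alternative to the pandas read above
spark_df_alltypes = spark_session.read.csv(
    str(data_directory / 'functional_alltypes.csv'),
    header=True,
    inferSchema=True,
)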

Now we can create a Spark DataFrame and register a temporary view from it. We should also enforce the desired type for each column.

In [10]:
def pyspark_cast(df, col_types):
    for col, dtype in col_types.items():
        df = df.withColumn(col, df[col].cast(dtype))
    return df

ps_df_alltypes = spark_session.createDataFrame(pd_df_alltypes)

ps_df_alltypes = pyspark_cast(
    ps_df_alltypes, {
        'index': 'integer',
        'Unnamed: 0': 'integer',
        'id': 'integer',
        'bool_col': 'boolean',
        'tinyint_col': 'byte',
        'smallint_col': 'short',
        'int_col': 'integer',
        'bigint_col': 'long',
        'float_col': 'float',
        'double_col': 'double',
        'date_string_col': 'string',
        'string_col': 'string',
        'timestamp_col': 'timestamp',
        'year': 'integer',
        'month': 'integer'
    }
)

# use ``SparkSession`` to create a table
ps_df_alltypes.createOrReplaceTempView('functional_alltypes')
con_pyspark.list_tables()
Out[10]:
['functional_alltypes']

Let's check that all columns were created with the desired data types:

In [11]:
t = con_pyspark.table('functional_alltypes')
t
Out[11]:

Unlike a SQL backend, where compile returns a SQL statement, the PySpark backend's compile method returns a PySpark DataFrame:

In [12]:
expr = t.head()
expr_comp = expr.compile()
type(expr_comp)
Out[12]:
pyspark.sql.dataframe.DataFrame
In [13]:
expr_comp
Out[13]:
DataFrame[index: int, Unnamed: 0: int, id: int, bool_col: boolean, tinyint_col: tinyint, smallint_col: smallint, int_col: int, bigint_col: bigint, float_col: float, double_col: double, date_string_col: string, string_col: string, timestamp_col: timestamp, year: int, month: int]

To convert the compiled expression to a pandas DataFrame, you can use its toPandas method. The result should be the same as that returned by the execute method.

In [14]:
# check that both DataFrames match exactly
assert expr.execute().equals(expr_comp.toPandas())
In [15]:
expr.execute()
Out[15]:
index Unnamed: 0 id bool_col tinyint_col smallint_col int_col bigint_col float_col double_col date_string_col string_col timestamp_col year month
0 0 0 6690 True 0 0 0 0 0.0 0.0 11/01/10 0 2010-11-01 00:00:00.000 2010 11
1 1 1 6691 False 1 1 1 10 1.1 10.1 11/01/10 1 2010-11-01 00:01:00.000 2010 11
2 2 2 6692 True 2 2 2 20 2.2 20.2 11/01/10 2 2010-11-01 00:02:00.100 2010 11
3 3 3 6693 False 3 3 3 30 3.3 30.3 11/01/10 3 2010-11-01 00:03:00.300 2010 11
4 4 4 6694 True 4 4 4 40 4.4 40.4 11/01/10 4 2010-11-01 00:04:00.600 2010 11

To finish this section, we can play a little bit with some aggregation operations.

In [16]:
expr = t
expr = expr.groupby('string_col').aggregate(
    int_col_mean=t.int_col.mean(),
    int_col_sum=t.int_col.sum(),
    int_col_count=t.int_col.count(),
)
expr.execute()
Out[16]:
string_col int_col_count int_col_mean int_col_sum
0 7 730 7.0 5110
1 3 730 3.0 2190
2 8 730 8.0 5840
3 0 730 0.0 0
4 5 730 5.0 3650
5 6 730 6.0 4380
6 9 730 9.0 6570
7 1 730 1.0 730
8 4 730 4.0 2920
9 2 730 2.0 1460

Check out the PySpark Ibis backend API documentation and the tutorials for more details.

Geospatial support

Currently, ibis.omniscidb and ibis.postgres are the only Ibis backends that support geospatial features.

In this section we will check some geospatial features using the PostgreSQL backend.

In [17]:
con_psql = ibis.postgres.connect(
    host='localhost',
    port=5432,
    user='postgres',
    password='postgres',
    database='ibis_testing'
)
con_psql.list_tables()
Out[17]:
['array_types',
 'awards_players',
 'batting',
 'diamonds',
 'films',
 'functional_alltypes',
 'geo',
 'geography_columns',
 'geometry_columns',
 'intervals',
 'not_supported_intervals',
 'raster_columns',
 'raster_overviews',
 'spatial_ref_sys',
 'tzone']

Two important features are that it supports shapely objects as input and GeoPandas DataFrames as output!

So, let's import shapely to create a simple point and polygon.

In [18]:
import shapely

shp_point = shapely.geometry.Point((20, 10))
shp_point
Out[18]:
In [19]:
shp_polygon_1 = shapely.geometry.Polygon([(20, 10), (40, 30), (40, 20), (20, 10)])
shp_polygon_1
Out[19]:

Now, let's create an Ibis table expression for the "geo" table:

In [20]:
t_geo = con_psql.table('geo')
df_geo = t_geo.execute()
df_geo
Out[20]:
id geo_point geo_linestring geo_polygon geo_multipolygon
0 1 POINT (0.00000 0.00000) LINESTRING (0 0, 1 1) POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10)) (POLYGON ((30 20, 45 40, 10 40, 30 20)), POLYG...
1 2 POINT (1.00000 1.00000) LINESTRING (1 1, 2 2) POLYGON ((35 10, 45 45, 15 40, 10 20, 35 10), ... (POLYGON ((40 40, 20 45, 45 30, 40 40)), POLYG...
2 3 POINT (2.00000 2.00000) LINESTRING (2 2, 3 3) POLYGON ((2 2, 3 3, 4 4, 5 5, 5 2, 2 2)) (POLYGON ((2 2, 3 3, 4 4, 5 5, 5 2, 2 2)))
3 4 POINT (3.00000 3.00000) LINESTRING (3 3, 4 4) POLYGON ((3 3, 4 4, 5 5, 6 6, 6 3, 3 3)) (POLYGON ((3 3, 4 4, 5 5, 6 6, 6 3, 3 3)))
4 5 POINT (4.00000 4.00000) LINESTRING (4 4, 5 5) POLYGON ((4 4, 5 5, 6 6, 7 7, 7 4, 4 4)) (POLYGON ((4 4, 5 5, 6 6, 7 7, 7 4, 4 4)))

And the type of df_geo is ... a GeoPandas GeoDataFrame!

In [21]:
type(df_geo)
Out[21]:
geopandas.geodataframe.GeoDataFrame

So you can take advantage of GeoPandas features too!

In [22]:
df_geo.set_geometry('geo_multipolygon').head(1).plot();

Now, let's check whether any geo_multipolygon contains the point we just created.

In [23]:
t_geo[
    t_geo.geo_multipolygon, 
    t_geo['geo_multipolygon'].contains(shp_point).name('contains_point_1')
].execute()
Out[23]:
geo_multipolygon contains_point_1
0 MULTIPOLYGON (((30.00000 20.00000, 45.00000 40... True
1 MULTIPOLYGON (((40.00000 40.00000, 20.00000 45... True
2 MULTIPOLYGON (((2.00000 2.00000, 3.00000 3.000... False
3 MULTIPOLYGON (((3.00000 3.00000, 4.00000 4.000... False
4 MULTIPOLYGON (((4.00000 4.00000, 5.00000 5.000... False

So, as expected, only the first two multipolygons contain the point with coordinates (20, 10).
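
The polygon we created earlier (shp_polygon_1) can be used in the same way; here is a brief sketch, assuming shapely polygons are accepted as input just like points:

# which multipolygons contain the polygon built with shapely above?
t_geo[
    t_geo.geo_multipolygon,
    t_geo['geo_multipolygon'].contains(shp_polygon_1).name('contains_polygon_1')
].execute()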

For more examples of Geospatial Analysis with Ibis, check this nice tutorial written by Ian Rose!

Python versions support

Ibis 1.3 added initial support for Python 3.8 and dropped support for Python 3.5.

Note: currently, the OmniSciDB and PySpark backends are not supported on Python 3.8.

Final words

Do you want to play more with the Ibis framework?

You can install it from PyPI:

python -m pip install --upgrade ibis-framework==1.3.0

Or from conda-forge:

conda install ibis-framework=1.3.0 -c conda-forge

Check out the tutorials to help you get started with Ibis: https://docs.ibis-project.org/tutorial.html. If you are coming from the SQL world, the Ibis for SQL Programmers documentation section may also be helpful. Have fun!
