Using a columnstore index on a small TPC-H data set

Learn how to observe benefits of the columnstore index on a small TPC-H data set.

  • An instance of EPAS is up and running on the local host and accessible by the OS user postgres.
  • Swarm64 DA is installed and enabled.
  • The volume with the data directory has at least 30 GiB of free space available.
This tutorial is intended for laptop computers or small workstations with 16 GiB of main memory or less. On a machine with significantly more main memory the observed benefits might be less pronounced.
  1. Clone the s64da-benchmark-toolkit repository and check out the Git tag corresponding to your version of Swarm64 DA.

    Run the following command in a shell:

    git clone --branch v5.5.0 https://github.com/swarm64/s64da-benchmark-toolkit.git
  2. Change into the toolkit directory.

    Run the following command in a shell:

    cd s64da-benchmark-toolkit
  3. Set up a virtual environment for Python.

    Run the following command in a shell:

    python3 -m venv ./env && source ./env/bin/activate
  4. Install dependencies of s64da-benchmark-toolkit.

    Run the following command in the same shell as the previous one:

    pip3 install -r requirements.txt
  5. Ingest a TPC-H SF10 data set with the s64da_native schema.
    This schema will create regular B-tree indexes only, but no columnstore index. Run the following command in the same shell as the previous one:
    ./prepare_benchmark \
        --dsn postgresql://postgres@localhost/columnstore_test \
        --benchmark tpch \
        --scale-factor 10 \
        --schema s64da_native \
        --chunks 4 \
        --max-jobs 8
    The parameter --max-jobs should not exceed the number of CPU cores on your system.
  6. Connect to the database in a psql shell.

    Run the following command in a shell:

    psql -U postgres -h localhost columnstore_test
  7. Configure the psql shell.

    Enter the following two commands in the psql shell to disable paging and to measure the runtime of each query.

    \pset pager off
    \timing on
  8. Run a query with one of the WHERE clauses of TPC-H query 21.

    Type the following query into the psql shell:

    SELECT COUNT(*) FROM lineitem WHERE l_receiptdate > l_commitdate;
  9. Create a columnstore index

    Type the following command into the psql shell. This will create a columnstore index on table lineitem.

    CREATE INDEX col_lineitem ON lineitem USING columnstore(
        l_shipdate,
        l_commitdate,
        l_receiptdate);
  10. Check whether the columnstore feature is enabled.

    Enter the following two commands in the psql shell.

    SHOW swarm64da.enable_columnstore;
  11. Optional: Enable the columnstore feature

    If the result of the previous query was "off", run the following command in the psql shell.

    SET swarm64da.enable_columnstore TO on;
  12. Run the same query again and take note of the differences in runtime.

    Type the following query into the psql shell:

    SELECT COUNT(*) FROM lineitem WHERE l_receiptdate > l_commitdate;

    You should now see a reduced runtime for the query as the columnstore index allows for less data to be fetched from storage.

In this tutorial, you learnt how to use the Swarm64 DA columnstore index to accelerate SQL queries. You used the Swarm64 64 DA benchmark toolkit to ingest 10 GiB of TPC-H data into a test database (storage size ~30 GiB including indexes). You queried this data on range predicates using PostgreSQL-native B-tree indexes, then created a columnstore index and ran the same query again using columnstore acceleration. If your system memory is smaller than 16 GiB and therefore insufficient to hold the storage data, you should have observed a significant improvement in query execution time with the columnstore index versus without it.