Modern data analysis with Parquet

Three months have passed since the last post, and Apache Superset saw very little use, and even that was only in the beginning. Why? Because Postgres is not meant for OLAP, period.

My hobby project is about crawling https://meetup.com and exploring the data whenever I'm in the mood. As such, I don't care about the data being up to date and I don't need transactions, so Postgres doesn't add much value beyond enforcing the data schema. What I actually need is read-only columnar storage with as little overhead as possible, that is, Apache Parquet. At this scale there's no need for partitioning either.

I couldn't find any easy-to-use command-line tools, but a combination of two Go libraries makes the process very straightforward:

  • For reading data from the database, we can use Bun, a lightweight SQL client
  • For writing data, the library of choice is parquet-go by Twilio Segment

One gotcha I ran into is the compatibility of the exported files with PyArrow and Pola.rs. Parquet is an evolving format, and not all libraries support all encoding algorithms. I had the most success with the "zstd,plain" and "dict" field annotations for data compression:

import "github.com/uptrace/bun"

type Event struct {
	bun.BaseModel `bun:"table:event,alias:e"`

	Id       string    `bun:",pk" parquet:"id,zstd,plain"`
	MeetupId string    `parquet:"meetup_id,dict"`
	Time     time.Time `parquet:"time"`
}
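The split follows the data: meetup_id repeats across many rows, so dictionary encoding compresses it well, while the primary key is unique by definition and is better off with plain encoding plus zstd.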

For small-scale data that fits into RAM, there's no need to bother with cursors and batching, so the code looks like this:

import (
	"context"

	"github.com/segmentio/parquet-go"
	"github.com/uptrace/bun"
)

// exportEvents loads the entire event table into memory
// and writes it out as a single Parquet file.
func exportEvents(ctx context.Context, db *bun.DB, path string) error {
	var events []Event
	if err := db.NewSelect().Model(&events).Scan(ctx); err != nil {
		return err
	}

	return parquet.WriteFile(path, events)
}
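For completeness, here's a minimal sketch of how the export could be wired up, assuming a local Postgres instance; the DSN, the database name, and the output path are placeholders:

import (
	"context"
	"database/sql"
	"log"

	"github.com/uptrace/bun"
	"github.com/uptrace/bun/dialect/pgdialect"
	"github.com/uptrace/bun/driver/pgdriver"
)

func main() {
	// Placeholder DSN; point it at your own database.
	dsn := "postgres://user:pass@localhost:5432/meetup?sslmode=disable"
	sqldb := sql.OpenDB(pgdriver.NewConnector(pgdriver.WithDSN(dsn)))
	db := bun.NewDB(sqldb, pgdialect.New())
	defer db.Close()

	if err := exportEvents(context.Background(), db, "events.parquet"); err != nil {
		log.Fatal(err)
	}
}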

Once you have the Parquet files, get familiar with Pola.rs: you can run any query you want, loading just the columns you need, with the speed of Rust and the expressiveness of Python. I discovered this library only recently and can't recommend it enough. It takes a while to get used to if you've been working with Pandas for a long time, but overall it's very similar.