OSS Updates September and October 2024
This is a summary of the open source work I spent my time on throughout September and October 2024. This was a very busy period in my personal life and I didn't make much progress on my projects, but I did have more time than usual to think about things, which prompted many further thoughts. Keep reading for details :)
Sponsors
I always start these posts with a sincere thank you to my sponsors, whose generous ongoing support makes this work possible. I can't overstate how much I appreciate all of the support the community has given to my work, and I'd like to give a special thanks to Clojurists Together and Nubank for providing incredibly generous grants that allowed me to reduce my client work significantly and afford to spend more time on projects for the Clojure ecosystem for nearly a year.
If you find my work valuable, please share it with others and consider supporting it financially. There are details about how to do that on my GitHub sponsors page. On to the updates!
Personal update
I'll save the long version for the end but there is one important personal update that's worth mentioning up front: I go by Kira Howe now. I used to be known as Kira McLean, and all of my talks, writing, and commits up to this point use Kira McLean, but I'm still the same person! Just with a new name. I even updated my GitHub handle, which went remarkably smoothly.
Conj 2024
The main Clojure-related thing I did during this period was attend the Conj. It's always cool to meet people in person who you've only ever worked with online, and I finally got to meet so many of the wonderful people from Clojure Camp and Scicloj who I've had the pleasure of working with virtually. I also had the chance to meet some of my new co-workers, which was great. There were tons of amazing talks and, as always, insightful and inspiring conversations. I always leave conferences with tons of energy and ideas, then get back to reality and realize there's no time to implement them all :) But still, below are some of the main ideas I'm working through after a wonderful conference.
SVGs for visualizing graphics
Tim Pratley and Chris Houser gave a fun talk about SVGs, among other things, that made me realize using SVGs might be the perfect way to implement the "graphics" side of a grammar of graphics.
Some of you may be following the development of tableplot (formerly hanamicloth), in which Daniel Slutsky has been implementing an elegant, layered, grammar-of-graphics-inspired way to describe graphics in Clojure. The library takes this description of a graphic and translates it into a specification for one of the supported underlying JavaScript visualization libraries (currently vega-lite or plotly, via hanami). Another way to think about it is as the "grammar" part of a grammar of graphics: a way to declaratively transform an arbitrary dataset into a standardized set of instructions that a generic visualization library can turn into a graphic. This is the first half of what we need for a pure Clojure implementation of a grammar of graphics.
The second key piece we need is a Clojure implementation of the actual graphics rendering. Whether we adopt an underlying data representation similar to vega-lite's, plotly's, or something else entirely is less consequential at this stage. Currently we just "translate" our Clojure code into vega-lite or plotly specs and call it a day. What I want to implement is a Clojure library that can take some data and turn it into a visualization. There are many ways to implement such a thing, all with different trade-offs, but Tim and Chouser's talk made me realize SVGs might be a great tool for the job. They're fast, efficient, and simple to style and edit, and they offer potentially the most promising avenues toward making graphics accessible and interactive, since they're really just XML, which is semantic, supports ARIA labels, and is easy to work with in JS.
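To make the idea concrete, here's a rough sketch (in Python for brevity, and not based on any existing library) of how little machinery it takes to turn tabular data into an SVG scatter plot, ARIA labels included:

```python
# Illustrative only: a hand-rolled SVG scatter plot from a list of points.
points = [(1.0, 2.0), (2.0, 3.5), (3.0, 1.5)]

def svg_scatter(pts, width=200, height=200, pad=10):
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    def sx(x):  # scale x into the drawing area
        return pad + (x - min(xs)) / (max(xs) - min(xs)) * (width - 2 * pad)
    def sy(y):  # scale y, flipping the axis since SVG y grows downward
        return height - (pad + (y - min(ys)) / (max(ys) - min(ys)) * (height - 2 * pad))
    circles = "".join(
        f'<circle cx="{sx(x):.1f}" cy="{sy(y):.1f}" r="3" aria-label="({x}, {y})"/>'
        for x, y in pts
    )
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{width}" height="{height}">{circles}</svg>')

print(svg_scatter(points))
```

Because the output is plain XML, styling, accessibility attributes, and interactivity hooks can all be layered on with ordinary string or DOM manipulation.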
Humble UI also came up in a few conversations, which is a totally tangential concern, but it was interesting to start thinking about how all of this could come together into a really elegant, fully Clojure-based data visualization tool for people who don't write code.
A Clojurey way of working with data
I also had a super interesting conversation on my last night in Alexandria about Clojure's position in the broader data science ecosystem. It's fair to say that we have more or less achieved feature parity now for all the essential things a person working with data would need to do. Work is ongoing to organize these tools into a coherent and accessible stack (see noj), but the pieces are all there.
The main insight I left with, though, was that we shouldn't be aiming for mere feature parity. Parity is important, but if you're a working data scientist, doing everything you already do, just in Clojure, is only a marginal improvement, and it presents a very high switching cost for potentially not enough payoff. In short, it's a tough sell to someone who doesn't already have some prior reason to prefer Clojure.
What we should do is leverage Clojure's strengths to build tools that could leapfrog the existing solutions, rather than just providing better implementations of them. I.e. think about new ways to solve the fundamental problems in data science, rather than just offering better tools to work within the current dominant paradigm.
For example, a fundamental problem in science is reproducibility. The way data is prepared and managed in most data (and regular) science workflows is madness, and versioning is virtually non-existent. If you pick up any given scientific paper that does some sort of data analysis, the chances that you will be able to reproduce the results using the same tools the authors used are near zero. If you do manage to reproduce them, you will have had to use a different implementation, re-inventing wheels and reverse-engineering the authors' thought process along the way. The problem isn't that scientists are bad at working with data; it's the fundamental chaos of the underlying ecosystem that's impossible to fight.
If you've ever worked with Python code, you know that dependency management is a nightmare, never mind state management within a single program. Stateful objects are just a bad mental model for computing because they require us to hold more information in our heads in order to reason about a system than our brains can handle. And when your mental model for a small amount of local data is a stateful, mutable thing, the natural inclination is to scale that mental model to your entire system. Tracking data provenance, versions, and lineage at scale is impossible when you're thinking about your problem as one giant, mutable, interdependent pile of unorganized information.
Clojure allows for some really interesting ways of thinking about data that could offer novel solutions to problems like these, because we think of data as immutable and have the tools to make working with immutable data efficient. None of this is new, but somehow at this Conj, between some really interesting talks focused on ways of working with immutable data and the conversations that followed, it clicked for me. If we apply the same ways we think about data in the small, as in a given program, more broadly to an entire system or workflow, I think the benefits could be huge. It's basically applying the ideas from Rich Hickey's "Value of Values" talk, now over 10 years old, to a modern data science workflow.
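As a toy illustration of what "the value of values" could mean for data workflows, consider content-addressing dataset versions: identify each version of a dataset by a hash of its contents, so any analysis can name exactly the immutable value it was computed from. This is a hypothetical sketch in Python, not an existing tool:

```python
# Hypothetical sketch: content-addressed dataset versions.
import hashlib
import json

def dataset_version(rows):
    """Identify a dataset by a hash of its contents: same data, same id."""
    canonical = json.dumps(rows, sort_keys=True)  # stable serialization
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

v1 = [{"species": "Adelie", "body_mass_g": 3750}]
v2 = v1 + [{"species": "Gentoo", "body_mass_g": 5000}]  # a new value; v1 untouched

print(dataset_version(v1))  # a stable id for exactly this data
print(dataset_version(v2))  # any change produces a new version id
```

With ids like these, provenance stops being a bookkeeping chore: a result can simply record which dataset value it was derived from.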
Other problems that Clojure is well-placed to support are:
- Scalability – Current dominant data science tools are slow and inefficient. People try to work around it by implementing libraries in C, Rust, Java, etc. and using them from e.g. Python, but this can only get you so far and adds even more brittleness and dependency management problems to the mix.
- Tracking data and model drift – This problem shares an underlying cause with the reproducibility issue: a faulty mental model that treats data as a mutation machine.
- Testing and validation – Software engineering practices have not really permeated the data science community and as such most pipelines are fragile. Bringing a values-first and data-driven paradigm to pipeline development could make them much more robust and reliable.
Anyway, I'm not exactly sure what any of this will look like as software yet, but I know it will be written in Clojure and I know it will be super cool. It's what I'm thinking about and experimenting with now. And I think the key point is that thinking about higher-level problems, and how Clojure can be applied to them, is the right path toward introducing Clojure into the broader data science ecosystem.
Software engineers as designers
Alex Miller's keynote was all about designing software, and how a process similar to the one described in Rich Hickey's keynote from last year's Conj was applied to Clojure 1.12 (among other things). The main thing I took away from it was that the best use of an experienced software engineer's time is not programming. I've had the good fortune of working with a lot of really productive teams over the years, and this talk made me realize that one thing the best ones all had in common is that at least a couple of people with a lot of experience were not in the weeds writing code all the time. Conversely, a common thread among the worst teams I've been a part of is that team leads and managers were way too in the weeds, worrying too much about implementation details and not enough about what was being implemented.
I've come to believe that it's not possible to reason about systems at both levels simultaneously. My brain at least just can't handle both the intense attention to detail and very concrete, specific steps required to write software that actually works and the abstract, general conceptual type of thinking that's required to build systems that work. The same person can do both things at different times, but not at the same time, and the cost of switching between both contexts is high.
Following the process described by Rich and then Alex is a really great way to add structure and coherence to what can otherwise come across as just "thinking", but it requires that we admit that writing code is not always the best use of our time, which is a hard sell. I think if we let experienced software engineers spend more time thinking and less time coding we'd end up with much better software, but this requires the industry to find better ways to measure productivity.
Long version of personal updates
As most of you know or will have inferred by now, I got married in September! It was the best day ever and the subsequent vacation was wonderful, but it did more or less cancel my productivity for over a month. If you're into weddings or just want a glimpse into my personal life, we had a reel made of our wedding day that's available on Instagram via our wedding coordinator.
Immediately after I got back from my honeymoon I also started a new job at BroadPeak, which is going great so far, but it also means I have far less time than I used to for open source and community work. I'm back to strictly evening and weekend availability, and sadly (or happily, depending how you see it) I'm at a stage of my life where not all of that is free time I can spend programming anymore.
I appreciate everyone's patience and understanding as I took these last couple of months to focus on life priorities outside of this work. I'm working on figuring out what my involvement in the community will look like going forward, but there are definitely tons of interesting things I want to work on. I'm looking forward to rounding out this year with some progress on at least some of them, but no doubt the end of December will come before I know it and there will be an infinite list of things left to do.
Thanks for reading all of this. As always, feel free to reach out anytime, and hope to see you around the Clojureverse :)
Published: 2024-10-31
OSS Updates July and August 2024
This is a summary of the open source work I've spent my time on throughout July and August, 2024. There was a blog post, some library updates, and lots of community work.
Sponsors
This work is made possible by the generous ongoing support of my sponsors. I appreciate all of the support the community has given to my work and would like to give a special thanks to Clojurists Together and Nubank for providing grants generous enough that I can reduce my client work significantly and afford to spend more time on these projects.
If you find my work valuable, please share it with others and consider supporting it financially. There are details about how to do that on my GitHub sponsors page. On to the updates!
Blog post
At the beginning of the summer Daniel Slutsky and I were feeling very ambitious and thought we might be able to put together a course for data scientists coming to Clojure from other languages. For many reasons, this hasn't materialized yet, but in service of these plans I wrote a blog post comparing tablecloth to other common data processing tools, like dplyr, pandas, and polars. My goal was to put tablecloth in perspective, illustrating some of the key differences between it and other standard, more popular, data processing tools.
tcutils
I added a few more helpers to tcutils, like between and duplicate-rows, and also made a docs website for the project. I also had many interesting conversations with people in the community about how Clojure's data processing tools "feel" to work with, and how we might adopt APIs and conventions that are familiar to data scientists in the interest of making their transition to Clojure's ecosystem as smooth as possible.
Clojure Data Cookbook
This month I added a chapter about working with data from databases, starting with SQL, and also continued to work on the end-to-end example for the introductory section. Working with real data is very difficult and interesting, and it's a fun challenge to try to figure out the right balance between getting into the weeds and compromising on the final result. So much of data science is just cleaning up messy data from the world, but surprisingly often you have to make some assumptions about how you're going to use the data in order to make decisions about how to do the cleaning. And there are tons of different ways to "clean" data, but the strategies you use depend on what information you're after.
In the particular example of the housing dataset I'm working with there are many missing values to handle, and some questionable rows that look like duplicates but aren't exactly duplicates. There are also lots of illogical data points, like house sales from the future or multiple sales for the same property on a given date. Deciding how to handle these cases to build up a "clean" dataset to actually work with is a very interesting exercise in domain modelling and goal setting.
Scicloj mentoring program
This one is really mostly Daniel Slutsky's amazing work, but we collaborated on launching it and it's definitely worth mentioning. We put together a structured way for people to get involved in contributing to Clojure's open source data science ecosystem, and got an overwhelmingly positive response. Over 25 people reached out to express an interest in contributing their time to Scicloj projects. The structured parts of the program include having some help choosing a meaningful and impactful project to work on, and up to an hour per week of one-on-one time with a mentor to help things progress. Daniel is doing all the heavy lifting coordinating the mentors, but it's been great so far participating as one and meeting some very keen and smart people who are willing to help us move things forward.
Another big part of this is thinking of the projects to work on. We came up with a list of projects that would deliver high value to the community but remain small enough to tackle by a single developer. We also tried to come up with ones that would require a wide range of skills and interests to try to accommodate as many people as possible. I am super excited to see how things go over the next few months with all of these projects.
Other community connections
I'm still doing my weekly data-science drop-in streaming with Clojure Camp. I really enjoy connecting with other people who are interested in Clojure for data science, and I often get great suggestions and tips, too.
I also met with a couple of groups of people who are presenting at the Conj this year to help brainstorm some ideas for how to make the most of the talks. Daniel has amazing vision for the community and organized these calls that I was lucky enough to join. The goal is to connect all of the people who are giving data-related talks to optimize the overall messaging, like minimizing duplication across talks or drawing examples from each other's presentations. I love conference speaking and hope to do more of it in future years when my personal commitments allow for it, but in the meantime it's really amazing getting to connect with such cool people in the community to learn about their talks and brainstorm ideas for making them the best they can be. I'm hoping to attend the conference this year to see some of these great talks in person.
Personal Updates
This has been a really amazing year professionally, having had the opportunity to spend much more time than in the past on open source and public work for the Clojure community. I've been trying to make the most of it and it's been really rewarding. Over the next couple of months, there are some other parts of my life that will be taking precedence, however.
The main one is my relationship. I'm getting married in a couple of weeks and will be taking almost a month off between getting ready for the wedding, wrapping up all the loose ends afterward, and a nearly 3-week-long honeymoon. I've never taken this long off of work in my life, so I'm both excited and curious to see what it's like. For over a decade now my career has been taking up most of my time and energy. It's been well worth it and I'm really happy with my work now, but I'm also excited to be stepping into a new chapter of life where things can be more balanced.
Related to this, the other major update I have to share is that I've accepted a full time job with a company called BroadPeak which I will be starting as soon as I'm back from my honeymoon. It's a small fintech company built primarily with Clojure that handles trade data management, commodities transaction surveillance, regulatory compliance, and other things related to the behind-the-scenes of commodities trading. I think it's a perfect fit for my skills and interests, and I'm hoping to have a chance to build some bridges between a really exciting, growing company that uses Clojure for real-world financial data processing and the Clojure open source community. Initial conversations about how the engineering team there feels about open source and community involvement have been really promising, so I'm optimistic that it will work out well for everyone. I'm not sure yet what exactly my open source work will look like once this job starts, but at a minimum I will still be working on the various side projects, like I always was before I tried giving it a go full time.
No matter how things go, I'll be back in two more months with another update. Thanks for reading. As always, feel free to reach out, and hopefully see some of you at the Conj! :)
Published: 2024-08-31
Data Manipulation in Clojure Compared to R and Python
I spend a lot of time developing and teaching people about Clojure's open source tools for working with data. Almost everybody who wants to use Clojure for this kind of work is coming from another language ecosystem, usually R or Python. Together with Daniel Slutsky, I'm working on formalizing some of the common teachings into a course. Part of that is providing context for people coming from other ecosystems, including "translations" of how to accomplish data science tasks in Clojure.
As part of this development, I wanted to share an early preview in this blog post. The format is inspired by this great blog post I read a while ago comparing R and Polars side by side (where "R" here refers to the tidyverse, an opinionated collection of R libraries for data science, and realistically mostly dplyr specifically). I'm adding Pandas because it's among the most popular dataset manipulation libraries, and of course Clojure, specifically tablecloth, the primary data manipulation library in our ecosystem.
I'll use the same dataset as the original blog post, the Palmer Penguin dataset. For the sake of simplicity, I saved a copy of the dataset as a CSV file and made it available on this website. I will also refer to the data as a "dataset" throughout this post because that's what Clojure people call a tabular, column-major data structure, but it's the same thing that is variously referred to as a dataframe, data table, or just "data" in other languages. I'm also assuming you know how to install the packages required in the given ecosystems, but any necessary imports or requirements are included in the code snippets the first time they appear. Versions of all languages and libraries used in this post are listed at the end. Here we go!
Reading data
Reading data is straightforward in every language, but as a bonus we want to be able to indicate on the fly which values should be interpreted as "missing", whatever that means in the given libraries. In this dataset, the string "NA" means "missing", so we want to tell the dataset constructor this as soon as possible. Here's the comparison of how to accomplish that in various languages:
Tablecloth
(require '[tablecloth.api :as tc])
(def ds
(tc/dataset "https://codewithkira.com/assets/penguins.csv"))
Note that tablecloth interprets the string "NA" as missing (nil, in Clojure) by default.
R
In reality, in R you would get the dataset from the R package that contains the dataset. This is a fairly common practice in R. In order to compare apples to apples, though, here I'll show how to initialize the dataset from a remote CSV file, using the readr package's read_csv, which is part of the tidyverse:
library(tidyverse)
ds <- read_csv("https://codewithkira.com/assets/penguins.csv",
na = "NA")
Pandas
import pandas as pd
ds = pd.read_csv("https://codewithkira.com/assets/penguins.csv")
Note that pandas has a fairly long list of values it considers NaN already, so we don't need to specify what missing values look like in our case, since "NA" is already in that list.
Polars
import polars as pl
ds = pl.read_csv("https://codewithkira.com/assets/penguins.csv",
null_values="NA")
Basic commands to explore the dataset
The first thing people usually want to do with their dataset is see it and poke around a bit. Below is a comparison of how to accomplish basic data exploration tasks using each library.
Operation | tablecloth | dplyr |
---|---|---|
see first 10 rows | (tc/head ds 10) | head(ds, 10) |
see all column names | (tc/column-names ds) | colnames(ds) |
select column | (tc/select-columns ds "year") | select(ds, year) |
select multiple columns | (tc/select-columns ds ["year" "sex"]) | select(ds, year, sex) |
select rows | (tc/select-rows ds #(> (% "year") 2008)) | filter(ds, year > 2008) |
sort column | (tc/order-by ds "year") | arrange(ds, year) |
Operation | pandas | polars |
---|---|---|
see first 10 rows | ds.head(10) | ds.head(10) |
see all column names | ds.columns | ds.columns |
select column | ds[["year"]] | ds.select(pl.col("year")) |
select multiple columns | ds[["year", "sex"]] | ds.select(pl.col("year", "sex")) |
select rows | ds[ds["year"] > 2008] | ds.filter(pl.col("year") > 2008) |
sort column | ds.sort_values("year") | ds.sort("year") |
Note there are some differences in how different libraries sort missing values, for example in tablecloth and polars they are placed at the beginning (so they're at the top when a column is sorted in ascending order and last when descending), but dplyr and pandas place them last (regardless of whether ascending or descending order is specified).
As you can see, these commands are all pretty similar, with the exception of selecting rows in tablecloth. The #(...) form is a short-hand syntax for writing an anonymous function in Clojure, which is how rows are selected. Since Clojure is a functional language, functions are "first-class", which basically just means they are passed around as arguments willy-nilly, all over the place, all the time. In this case, the third argument to tablecloth's select-rows function is a predicate (a function that returns a boolean) that takes as its argument a dataset row as a map of column names to values. Don't worry, though, tablecloth doesn't process your entire dataset row-wise. Under the hood, datasets are highly optimized to perform column-wise operations as fast as possible.
Here's an example of what it looks like to string a couple of these basic dataset exploration operations together, in this case to get the bill_length_mm of all penguins with body_mass_g below 3800:
Tablecloth
(-> ds
(tc/select-rows #(and (% "body_mass_g")
(> (% "body_mass_g") 3800)))
(tc/select-columns "bill_length_mm"))
Note that in tablecloth we have to explicitly omit rows where the value we're filtering by is missing, unlike in other libraries. This is because tablecloth actually uses nil (as opposed to a library-specific construct) to indicate a missing value, and in Clojure nil is not treated as comparable to numbers. If we were to try to compare nil to a number, we would get an exception telling us that we're trying to compare incomparable types. Clojure is fundamentally dynamically typed, in that it only does type checking at runtime and bindings can refer to values of any type, but it is also strongly typed, as we see here, in the sense that it explicitly avoids implicit type coercion. For example, deciding whether 0 is greater or smaller than nil requires some assumptions, and these are intentionally not baked into the core of Clojure, or into tablecloth as a library, as is the case in some other languages and libraries.
This example also introduces Clojure's "thread-first" macro. The -> arrow is like R's |> operator or the Unix pipe, effectively passing the output of each function in the chain as input to the next. It comes in very handy for data processing code like this.
Here is the equivalent operation in the other libraries:
dplyr
ds |>
filter(body_mass_g < 3800) |>
select(bill_length_mm)
Pandas
ds[ds["body_mass_g"] < 3800]["bill_length_mm"]
Polars
ds.filter(pl.col("body_mass_g") < 3800).select(pl.col("bill_length_mm"))
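For Python readers, the threading idea itself is easy to mimic with a small (hypothetical) helper that folds a value through a sequence of functions, much like Clojure's -> macro or a Unix pipe:

```python
# Hypothetical helper mimicking Clojure's -> (thread-first) macro.
from functools import reduce

def thread_first(value, *fns):
    """Pass value through each function in turn, left to right."""
    return reduce(lambda acc, f: f(acc), fns, value)

masses = [3900, 3650, 4100]
result = thread_first(
    masses,
    lambda rows: [m for m in rows if m < 3800],  # roughly: tc/select-rows
    sorted,                                      # roughly: tc/order-by
)
print(result)  # [3650]
```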
More advanced filtering and selecting
Here is what some more complicated data wrangling looks like across the libraries.
Select all columns except for one
Library | Code |
---|---|
tablecloth | (tc/select-columns ds (complement #{"year"})) |
dplyr | select(ds, -year) |
pandas | ds.drop(columns=["year"]) |
polars | ds.select(pl.exclude("year")) |
Another property of functional languages in general, and especially Clojure, is that they really take advantage of the fact that a lot of things are functions that you might not be used to treating like functions. They also leverage function composition to combine multiple functions into a single operation. For example, a set (indicated with the #{} syntax in Clojure) is a special function that returns a boolean indicating whether the given argument is a member of the set or not. And complement is a function in clojure.core that effectively inverts the function given to it, so (complement #{"year"}) means "every value that is not in the set #{"year"}", which we can then use as our predicate column selector function to filter out certain columns.
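The same composition trick can be sketched outside Clojure, too. Here's a rough Python analogue (the complement helper is hypothetical, and since Python sets aren't callable we use their __contains__ method as the predicate):

```python
# Hypothetical Python analogue of (complement #{"year"}).
def complement(pred):
    """Invert a predicate, like clojure.core/complement."""
    return lambda x: not pred(x)

# A set's membership test stands in for Clojure's set-as-function.
keep = complement({"year"}.__contains__)

columns = ["year", "sex", "species"]
print([c for c in columns if keep(c)])  # ['sex', 'species']
```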
Select all columns that start with a given string
Library | Code |
---|---|
tablecloth | (tc/select-columns ds #(str/starts-with? % "bill")) |
dplyr | select(ds, starts_with("bill")) |
pandas | ds.filter(regex="^bill") |
polars | ds.select(cs.starts_with("bill")) |
Select only numeric columns
Library | Code |
---|---|
tablecloth | (tc/select-columns ds :type/numerical) |
dplyr | select(ds, where(is.numeric)) |
pandas | ds.select_dtypes(include='number') |
polars | ds.select(cs.numeric()) |
The symbol :type/numerical in Clojure here is a magic keyword that tablecloth knows about and can accept as a column selector. This list of magic keywords that tablecloth knows about is not (yet) documented anywhere, but it is available in the source code.
Filter rows for range of values
Library | Code |
---|---|
tablecloth | (tc/select-rows ds #(< 3500 (% "body_mass_g" 0) 4000)) |
dplyr | filter(ds, between(body_mass_g, 3500, 4000)) |
pandas | ds[ds["body_mass_g"].between(3500, 4000)] |
polars | ds.filter(pl.col("body_mass_g").is_between(3500, 4000)) |
Note here we handle the missing values in the body_mass_g column differently than above, by specifying a default value for the map lookup. We're explicitly telling tablecloth to treat missing values as 0 in this case, which can then be compared to other numbers. This is probably the better way to handle this case, but the method above works, too, plus it gave me the opportunity to soapbox about Clojure types for a moment.
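For readers more familiar with Python, the default-value lookup here works like dict.get; a plain-Python sketch of the same missing-value handling (illustrative only, not tied to any particular library):

```python
# A row with a missing body_mass_g, modeled as a plain map.
row = {"species": "Adelie"}

# (% "body_mass_g" 0) in Clojure is analogous to dict.get with a default:
mass = row.get("body_mass_g", 0)
print(3500 < mass < 4000)  # False: the missing value defaults to 0
```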
Reshaping the dataset
Tablecloth
(tc/pivot->longer ds
["bill_length_mm" "bill_depth_mm"
"flipper_length_mm" "body_mass_g"]
{:target-columns "measurement" :value-column-name "value"})
dplyr
ds |>
pivot_longer(cols = c(bill_length_mm, bill_depth_mm,
flipper_length_mm, body_mass_g),
names_to = "measurement",
values_to = "value")
Pandas
pd.melt(
ds,
id_vars=ds.columns.drop(["bill_length_mm", "bill_depth_mm",
"flipper_length_mm", "body_mass_g"]),
var_name="measurement",
value_name="value"
)
Polars
ds.unpivot(
index=set(ds.columns) - set(["bill_length_mm",
"bill_depth_mm",
"flipper_length_mm",
"body_mass_g"]),
variable_name="measurement",
value_name="value")
Creating and renaming columns
Adding columns based on some other existing columns
There are many reasons you might want to add columns, and often new columns are combinations of other ones. Here's how you'd generate a new column based on the values in some other columns in each library:
Library | Code |
---|---|
tablecloth | (tc/map-columns ds "ratio" ["bill_length_mm" "flipper_length_mm"] /) |
dplyr | mutate(ds, ratio = bill_length_mm / flipper_length_mm) |
pandas | ds["ratio"] = ds["bill_length_mm"] / ds["flipper_length_mm"] |
polars | ds.with_columns((pl.col("bill_length_mm") / pl.col("flipper_length_mm")).alias("ratio")) |
Note that this is where the wheels start to come off if you're not working in a functional way with immutable data structures. Clojure data structures (including tablecloth datasets) are immutable, which is not the case with Pandas. The Pandas code above mutates the dataset in place, so as soon as you do any mutating operations like these, you have to keep mental track of the state of your dataset, which can quickly lead to high cognitive overhead and lots of incidental complexity.
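The difference is easy to demonstrate even without dataframes; here's a plain-Python sketch contrasting the two styles (illustrative only):

```python
# Mutating style: the object changes underneath every reference to it.
ds_mut = {"bill_length_mm": [39.1], "flipper_length_mm": [181.0]}
ds_mut["ratio"] = [b / f for b, f in zip(ds_mut["bill_length_mm"],
                                         ds_mut["flipper_length_mm"])]

# Value style: derive a new dataset and leave the original untouched.
ds_orig = {"bill_length_mm": [39.1], "flipper_length_mm": [181.0]}
ds_new = {**ds_orig,
          "ratio": [b / f for b, f in zip(ds_orig["bill_length_mm"],
                                          ds_orig["flipper_length_mm"])]}

print("ratio" in ds_mut)   # True: the original was changed in place
print("ratio" in ds_orig)  # False: the original is still the value it always was
print("ratio" in ds_new)   # True
```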
Renaming columns
Library | Code |
---|---|
tablecloth | (tc/rename-columns ds {"bill_length_mm" "bill_length"}) |
dplyr | rename(ds, bill_length = bill_length_mm) |
pandas | ds.rename(columns={"bill_length_mm": "bill_length"}) |
polars | ds.rename({"bill_length_mm": "bill_length"}) |
Again beware, the Pandas implementation shown here mutates the dataset in place. Also manually specifying every column name transformation you want to do is one way to accomplish the task, but sometimes that can be tedious if you want to apply the same transformation to every column name, which is fairly common.
Transforming column names
Here's how you would upper case all column names:
Library | Code |
---|---|
tablecloth | (tc/rename-columns ds :all str/upper-case) |
dplyr | rename_with(ds, toupper) |
pandas | ds.columns = ds.columns.str.upper() |
polars | ds.select(pl.all().name.to_uppercase()) |
Like the other libraries, tablecloth's rename-columns accepts both types of arguments – a simple mapping of old -> new column names, or any column selector and any transformation function. For example, removing the units from each column name would look like this in each language:
Library | Code |
---|---|
tablecloth | (tc/rename-columns ds #".+_(mm|g)" #(str/replace % #"(.+)_(mm|g)" "$1")) |
dplyr | rename_with(ds, ~ str_replace(.x, "^(.+)_(mm|g)$", "\\1")) |
pandas | ds.rename(columns=lambda c: re.sub(r"^(.+)_(mm|g)$", r"\1", c)) |
polars | ds.rename(lambda c: re.sub(r"^(.+)_(mm|g)$", r"\1", c)) |
Grouping and aggregating
Grouping behaves somewhat unconventionally in tablecloth. Datasets can be grouped by a single column name or a sequence of column names, like in other libraries, but grouping can also be done using any arbitrary function. Grouping in tablecloth also returns a new dataset, similar to dplyr, rather than an abstract intermediate object (as in pandas and polars). Grouped datasets have three columns: the name of the group, the group id, and a column containing a new dataset with the grouped data. Once a dataset is grouped, the group values can be aggregated in a variety of ways. Here are a few examples, with comparisons between libraries:
Summarizing counts
To get the count of each penguin by species:
Tablecloth
(-> ds
(tc/group-by ["species"])
(tc/aggregate {"count" tc/row-count}))
dplyr
ds |>
group_by(species) |>
summarise(count = n())
Pandas
ds.groupby("species").agg(count=("species", "count"))
Polars
ds.group_by("species").agg(pl.count().alias("count"))
Find the penguin with the lowest body mass by species
Tablecloth
(-> ds
(tc/group-by ["species"])
(tc/aggregate {"lowest_body_mass_g" #(->> (% "body_mass_g")
tcc/drop-missing
(apply tcc/min))}))
dplyr
ds |>
group_by(species) |>
summarize(lowest_body_mass_g = min(body_mass_g, na.rm = TRUE))
Pandas
ds.groupby("species").agg(
lowest_body_mass_g=("body_mass_g", lambda x: x.min(skipna=True))
).reset_index()
Polars
ds.group_by("species").agg(
pl.col("body_mass_g").min().alias("lowest_body_mass_g")
)
Conclusions
As you can see, all of these libraries are perfectly suitable for accomplishing common data manipulation tasks. Choosing a language and library can impact code readability, maintainability, and performance, though, so understanding the differences between available toolkits can help us make better choices.
Clojure's tablecloth emphasizes functional programming concepts and immutability, which can lead to more predictable and reusable code, at the cost of adopting a potentially new paradigm. Hopefully this comparison serves not only as a translation guide, but as an intro to the different philosophies underpinning these common data science tools.
Thanks for reading :)
Versions
The code in this post works with the following language and library versions:
Tool | Version |
---|---|
MacOS | Sonoma 14.5 |
JVM | 21.0.2 |
Clojure | 1.11.1 |
Tablecloth | 7.021 |
R | 4.4.1 |
Tidyverse | 2.0.0 |
Python | 3.12.3 |
Pandas | 2.1.4 |
Polars | 1.1.0 |
Published: 2024-07-18