Blog
Notebooks weren't built for the modern data stack
Our tools for exploration and analysis are living in the past.

Computational notebooks might seem like a recent phenomenon, but they have actually been around for many years. The first data notebook, Mathematica, was developed in the 1980s, and Jupyter (then iPython) was first introduced in December 2011.

In the intervening decades, however, a lot has changed in the data world. Unfortunately, data notebooks havenât kept up with these changes, and today they feel like relics of a bygone age, despite being used for analysis and modeling workflows by many thousands of data practitioners. The rest of our tools have evolved into a "modern data stack", but most notebooks are moving in the past.
The shortcomings of notebooks will be familiar to anyone using them day-to-day, and can be broken down into 3 big buckets:
- Scale - notebooks canât handle the size of data weâre storing in the cloud
- State - notebooks have confusing and frustrating compute models
- Sharing - notebooks werenât built for teams
Scale: data notebooks weren't built for the cloud
Probably the biggest story in the last 10 years has been, well, big data. Cloud data warehouses and data lakes can now flexibly handle all of an organizationâs information, at reasonable cost. Terabytes of data can be stored in easy-to-query tables, and a bevy of new tools for extraction, transformation, and observability makes it simple to process. With this âmodern data stack'', organizations can finally have complete, clean, reliable data assets available for analysis and modeling.

Notebooks, however, werenât built for this world. In a traditional notebook, computation occurs in an in-memory kernel, which contains all the data youâre working with. These kernels can be fast, but are inherently limited by the memory capacity of whatever machine youâre running them on. This means that if you run too big of a query, youâll blow up your kernel, lose your work, and have to re-build your workflows to use smaller samples.

If youâre still running notebooks locally, the data limit is the RAM installed in your laptop. Today, many data notebooks are run âin the cloudâ, but now youâre just constrained by the size of the virtual machine youâre on. The only way to scale up your data workloads is with brute force, upping the memory for your kernel to 16, or 32, or 64GB, and so on.
This is expensive and inefficient: youâre keeping a bunch of memory wired up to store your data, but only occasionally running computations on it*.* In between operations, thereâs a bunch of time where youâre thinking, or typing, or looking at charts, or playing Wordle, but youâre still paying for that big memory footprint.

Most of us just want to run queries without having to also reason about memory management or running up huge compute bills. Itâs 2022, and I have a cloud warehouse with all my data, but I canât just use it? I can run select * from big_table in my cloud console, but in my notebook it crashes my whole workspace? I donât worry about table sizes in my warehouse⊠why am I sweating dataframe sizes in my notebook?

State: the notebook execution model is broken
The way notebooks handle compute and memory state is a massive source of frustration, especially for beginners. This angst is most infamously documented by Joel Grus's "I don't like notebooks" talk at JupyterCon (video, slides).
The issue, in short: data notebooks are backed by a kernel, i.e., an in-memory process with mutable state. The commands run in each cell of a notebook modify that state, such as creating or reassigning a variable. You can run anything in a cell, and execute those cells in any order, meaning that the current kernel state can be difficult to reason about and reproduce.
Check out this basic example: quick, whatâs the current value of x?

At first you might think 2, given it's the last cell. But you're a grizzled Notebook Veteran, so you look to the [execution order numbers] and see that the first cell was run most recently, and reason that it's actually 1. Ok, let's check:

Oops, turns out the correct answer is 100, because the cell 1 was edited after it was previously run with x = 100. Better luck next time!
This is a really simple example, but as projects grow more complex, itâs very easy to tie yourself in awful state knots. You run a few cells out of order to test some things, or avoid running an expensive chunk of logic, and suddenly youâre seeing weird, inconsistent outputs. There isnât a great solution to this, other than building up a complex state map in your head, or frequently restarting your kernel and running everything from scratch.
This causes three major issues:
- Interpretability: itâs hard to reason about whatâs happening in a data notebook, whether it's your own or someone elseâs.
- Reproducibility: because cells can be run in any arbitrary order, reproducing a current state can be difficult without frequently re-running all the logic from scratch.
- Performance: re-running everything from scratch can be wasteful and time-consuming. This is true for any notebook, but unfortunately it's especially painful in Hex: published Data Apps run all of their logic top to bottom every time, explicitly to avoid weird state issues. While elegant, this has obvious performance implications if cells contain expensive logic like a query or model re-compute.
Sharing: notebooks werenât built for teamwork
Traditional notebooks do basically nothing when it comes to making it easy to work with others, as co-editors or consumers. They were built for an older era, when most data work was done locally on one individualâs laptop, and feel out of place in the era of modern cloud productivity software.
Trying to work together on a notebook is somewhere between infuriating and impossible: youâre sending back-and-forth a hard-to-diff JSON file that includes serialized outputs. Good luck seeing what changed! And even if you figure that out, youâll inevitably run into issues with environments, database credentials, and package versions.

Worse, perhaps, is the inability to share with stakeholders. 99% of people canât open an .ipynb, so the the default way to share outputs from notebooks is sending screenshots and slides, which are severed from the underlying data, not interactive, and certainly not secure.
Notebooks have so much more potential!
Ok weâre a few hundred words into trash-talking notebooks, and why they arenât keeping up in the modern data world, but itâs really a shame because notebooks actually rock! The notebook format is great for iterative, exploratory data work, but is only accessible to people who can figure out how to get it set up and fight through the issues above.
At Hex, we set out to fix all of this, and empower millions of people who want to do more with data, together. If you're curious about what a fully-collaborative, built-for-the-cloud, perfect-state notebook solution looks like, check out what we're up to or get started now.