Practical Data Modeling


Ch 9 - Counting and Aggregation: Controlling the Grain

Chapter 9 of Mixed Model Arts

Joe Reis
Mar 21, 2026

Here’s Chapter 9, which covers aggregation and counting. I made some pretty major revisions from the draft I dropped last year, and it took quite a few cycles to get this one right. There’s still some work to do on the artwork, but I figured I’d launch the chapter anyway, as I’m in the middle of editing the last few chapters of Part 2, as well as editing Part 3.

Expect chapters to start dropping more frequently throughout next week, as they’re mostly edited and just need some fine-tuning. Also, I’m introducing hand drawings into the artwork, which is fun to produce and gives the book a nice touch. Artwork also takes a while to get right (you’ll notice some temporary artwork in this chapter), but I’d rather get the chapter out than have you wait for the final version, which will be in the book :)

My cadence right now: Parts 2 and 3 edits completed by early/mid next week. Then that’s it. There might be some slight revisions to Part 1 (Chapters 1–3), but the manuscript is mostly done with development and line edits. There will be some copy editing and new artwork, and I’ll drop the revised chapters as replacements in their current locations on Substack. Then comes formatting, proofing, and printing. The final round is approaching!

Thanks and have a great weekend!

Joe


Conditioning day. The coach has everyone doing rounds on the bag—three minutes on, one minute off—and tracking your output. After the session, she reads the numbers back. “You averaged 47 strikes per round.” You feel pretty good about that. Then she breaks it down: round one, 98 strikes. Round five, 23. “Your average doesn’t describe your strike count. You start off strong, but by round five you’re cooked,” she says. “I need to know when you fall off, not some number in the middle.” Your conditioning needs some work.

The U.S. Air Force learned about averages the hard way. In the 1950s, the Air Force designed its cockpits around the “average” pilot—average arm length, average torso height, average leg reach—based on measurements of over 4,000 airmen. Build for the middle, the thinking went, and most pilots will fit. But when researcher Lt. Gilbert S. Daniels checked how many pilots fell within the average range on all ten key dimensions, the answer was zero. Not one pilot out of 4,000 was average in everything. The cockpit built for the average pilot fit nobody. The Air Force abandoned the concept and moved to adjustable seats, pedals, and harnesses, designing for variability rather than the mean.

Aggregation is the process of combining many individual instances into a single summary value—taking detail and compressing it, sometimes for speed, sometimes for meaning, sometimes to communicate across scale. When you count, sum, average, or calculate a percentile of a group of records, you’re aggregating. You’re trading granularity for simplicity. The art of data modeling is knowing when that trade-off reveals a signal and when it destroys it.

If you can’t aggregate it, you haven’t modeled it. Or, more accurately, if you don’t know exactly what happens when you try to aggregate it, you don’t understand your own grain.

That’s the thesis of this chapter. Aggregation isn’t just a calculation you run after the model is built. It’s a constraint you build into the model itself. Aggregation forces you to define precisely what exists, how things relate, and how they sum up. A model that supports clean, reproducible aggregation has well-defined grains, stable boundaries, disjoint groupings, and attributes with predictable mathematical behavior. A model that produces contradictory or irreconcilable aggregates reveals deeper issues with identity, structure, or relationships. No amount of downstream analytics, computation, or fancy AI will fix those inconsistencies.

We’ve covered identifying things, relating them, and figuring out grain. Now we get to the hard part. Both counting and aggregation work on top of the grain. Get the grain right, and everything flows. Get it wrong, and your aggregations won’t make sense. This applies to any data shape: tables, text, graphs, events, images, streaming data—it doesn’t matter.

Let’s start with counting.

What Does It Mean to Count Something?

You’ve been counting all your life. Seems simple enough, right? Open a tab in your SQL editor, write SELECT COUNT(*), and hit execute. Done. Except it’s not that simple. The moment you decide to count something, you’ve already made implicit assumptions. And those assumptions will break if you don’t think them through.
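Even that one-liner hides assumptions. Here’s a minimal sketch, using Python’s built-in sqlite3 and an invented orders table, where the same five rows produce two different “counts” depending on what you decide “one” means:

```python
import sqlite3

# Hypothetical orders table: COUNT(*) and COUNT(DISTINCT ...) answer
# different questions about the exact same rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "ana"), (2, "ana"), (3, "bob"), (4, "bob"), (5, "cho")],
)

rows = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
customers = conn.execute(
    "SELECT COUNT(DISTINCT customer_id) FROM orders"
).fetchone()[0]

print(rows)       # 5, counting order rows
print(customers)  # 3, counting distinct customers
```

Both queries are “a count,” and both are correct. They just count different things.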

Four things must be true before a count means anything:

You need to know what you’re counting. Refer back to our earlier discussion on instances and identity. Are we counting rows? Words or characters in a text document? Events? Pixels in an image? Two items can only be counted separately if we can distinguish them.

You need to establish existence and cardinality. Does the thing exist at all, and if so, how many unique instances are there? “Does this customer exist in our system? And if so, how many orders does she have?” In data modeling, your counting basis is strictly defined by the boundaries of the dataset you’ve chosen to frame.

You need to determine discreteness or continuity. Is what we’re counting discrete or continuous? Some things are counted as whole units (orders, users, word count). Others are measured continuously (temperature, pixel intensity, precipitation). A data model must specify which is which.

You need to account for context and scope. Counts depend on the conditions of the question being asked. “How many orders does she have?” only makes sense within a context: within what time range? Under what definition of “customer”? From which system?

Let’s make this concrete. Suppose someone asks: “How many active users do we have?”

That sounds like a simple question with a simple count. It’s not. Start with identity: what is a “user”? Is it defined by an email address, a device ID, or an account ID? If a person has two accounts, are they one user or two? If they browse on their phone and their laptop, do you count them once or twice? Next, existence and cardinality: what does “active” mean? Logged in at least once? Performed a specific action? Opened the app? The definition changes the number dramatically—I’ve seen “active user” counts vary by a factor of 3 within the same company, depending on which team defined “active.” Then, the context: active over what period? Last 7 days? Last 30? Calendar month? Rolling window? Each answer produces a different count from the same underlying data.
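To see how much the definition moves the number, here’s a small sketch over a hypothetical event log (the account IDs, event types, and dates are invented). Three reasonable definitions of “active” give three different answers from the same data:

```python
from datetime import date

# Invented event log: (account_id, event_type, day).
events = [
    ("a1", "login",    date(2026, 3, 1)),
    ("a1", "purchase", date(2026, 3, 10)),
    ("a2", "login",    date(2026, 2, 2)),
    ("a3", "app_open", date(2026, 3, 18)),
    ("a3", "app_open", date(2026, 3, 20)),
]

as_of = date(2026, 3, 21)

def active_users(events, predicate):
    """Distinct accounts with at least one event matching the predicate."""
    return {acct for acct, etype, day in events if predicate(etype, day)}

# Definition 1: any event in the last 30 days
last_30 = active_users(events, lambda e, d: (as_of - d).days <= 30)
# Definition 2: a purchase in the last 30 days
buyers_30 = active_users(events,
                         lambda e, d: e == "purchase" and (as_of - d).days <= 30)
# Definition 3: any event in the last 7 days
last_7 = active_users(events, lambda e, d: (as_of - d).days <= 7)

print(len(last_30), len(buyers_30), len(last_7))  # 2 1 1
```

None of these numbers is wrong. The model just has to pin down which one “active users” means before anyone reports it.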

All of this ties back to defining what “one” means, how many such “ones” can exist, and under what conditions. Counting is technically a form of aggregation—a COUNT function is still an aggregate—but it’s doing something different from summing or averaging. Counting proves something exists. It establishes identity and cardinality. Get counting wrong, and every aggregation downstream inherits the confusion.

Now that we understand “one,” let’s talk about “many.”

Aggregations: From Many to One

Here’s the gear shift. Counting establishes that things exist and tells you how many there are. Aggregation takes those things and compresses them. It’s the difference between “there are 10,000 orders” and “total revenue is $2.3 million.” Counting is a census. Aggregation is an operation that transforms multiple values into a single value.

Aggregation allows us to perform mathematical operations—counting, summing, averaging (mean, mode, median), or finding minimum or maximum values—on a set of values to derive a new summary value. Aggregates always lose their original information. You can’t compress many things into one without throwing something away. Every aggregation is a trade-off: you gain simplicity and speed, but you lose granularity and detail. Your job is to know what you’re trading and whether that trade makes sense.
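The coach’s complaint from the opening can be sketched in a few lines. Two invented strike-count series share the same average, but the mean erases exactly the fade she cares about:

```python
from statistics import mean

# Two invented fighters with identical average strikes per round.
steady = [47, 48, 46, 47, 47]   # consistent output
fading = [98, 60, 40, 22, 15]   # strong start, steep fade

print(mean(steady), mean(fading))  # 47 47, the averages match
print(fading[0] - fading[-1])      # 83, the drop-off the mean hides
```

Same summary value, very different conditioning. The aggregate traded away the one detail that mattered.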

Aggregation goes way beyond tabular data and GROUP BY statements. Compression looks completely different depending on your domain:

  • Machine learning engineers use pooling layers to extract the sharpest edge of an image and intentionally drop the rest.

  • Graph architects roll up chaotic, spiderweb relationships into simplified entity properties.

  • Streaming engineers slap artificial time windows onto an infinite river of events just to make the math work.

  • Text processing relies on word frequencies and topic modeling to compress thousands of pages into a handful of numerical vectors.

None of these approaches is right or wrong. They just rely on completely different assumptions about grain, identity, and what “many” actually means. To effectively mix these models, you can’t just memorize the syntax for each tool. You have to understand the underlying structural compromise. You need to know exactly which paradigm to deploy, and more importantly, exactly what information you are choosing to destroy.
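As one illustration of aggregation outside GROUP BY, here’s a toy 2×2 max-pooling pass over an invented 4×4 “image.” Each output cell keeps only the largest value in its window and deliberately discards the other three:

```python
# Invented 4x4 grid of pixel intensities.
image = [
    [1, 3, 2, 0],
    [4, 6, 1, 2],
    [0, 1, 9, 8],
    [2, 3, 7, 5],
]

def max_pool_2x2(img):
    """Collapse each non-overlapping 2x2 window to its maximum value."""
    return [
        [max(img[r][c], img[r][c + 1], img[r + 1][c], img[r + 1][c + 1])
         for c in range(0, len(img[0]), 2)]
        for r in range(0, len(img), 2)
    ]

print(max_pool_2x2(image))  # [[6, 2], [3, 9]]
```

Sixteen values become four. The pooled grid remembers where the strongest signals were and forgets everything else, which is the same trade a SUM or AVG makes over rows.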

Grouping is how you change grain. You take high-grain data and collapse it into larger buckets. A transaction becomes a session, which becomes a customer, which then becomes a cohort.

Transaction → Session → Customer → Cohort

Each step is a new grain level. Each step loses a different kind of information. You need to understand what each loss means for your use case. And that aggregated result is a real thing now, a new entity with its own grain and its own rules. You can aggregate it again, but the further you go, the more detail you lose.
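The rollup chain above can be sketched with a few invented transactions. Each grouping step produces a new, coarser grain and throws away the detail of the grain below it:

```python
from collections import defaultdict

# Invented transaction-grain records.
transactions = [
    {"customer": "ana", "cohort": "2025-Q1", "amount": 20.0},
    {"customer": "ana", "cohort": "2025-Q1", "amount": 35.0},
    {"customer": "bob", "cohort": "2025-Q1", "amount": 10.0},
    {"customer": "cho", "cohort": "2025-Q2", "amount": 50.0},
]

# Grain: customer. Individual transaction amounts are gone.
per_customer = defaultdict(float)
for t in transactions:
    per_customer[t["customer"]] += t["amount"]

# Grain: cohort. Individual customers are gone.
per_cohort = defaultdict(float)
for t in transactions:
    per_cohort[t["cohort"]] += t["amount"]

print(dict(per_customer))  # {'ana': 55.0, 'bob': 10.0, 'cho': 50.0}
print(dict(per_cohort))    # {'2025-Q1': 65.0, '2025-Q2': 50.0}
```

Each result is a real entity at its own grain, and each can be aggregated again; you just can’t go back down without the original records.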

Structural Principles of Safe Aggregation

Aggregation seems simple enough, but poorly constructed aggregations have ruined many data models. There are structural rules that determine whether an aggregation is safe and valid—and these rules hold across all forms of data: tables, documents, graphs, events, vectors, images, and more. If these rules aren’t respected, aggregation produces contradictions, double counting, or mathematically invalid results. Data modeling is about making data that’s useful and trustworthy. These principles are how you guarantee that.

Identify and Align the Data Grain

The most fundamental concept governing aggregation is the data’s grain, which we covered in Chapter 8. Data can only be successfully aggregated when it shares the same grain. Combine datasets with mismatched grains—user event sessions from a mobile game mixed with IoT readings of hummingbird flight patterns—and the result is silly at best and structurally unsound in any case.

Mismatched grains cause three specific failures:

Double counting. A low-grain item (such as a transaction) gets mistakenly combined with a high-grain item (such as a customer) without a proper many-to-one relationship, causing the high-grain item to be counted multiple times.

Confusing aggregates. Aggregating a high-grain item (like a session) with a low-grain item (like a minute-by-minute heartbeat ping) obscures or averages out critical detail in the lower-grain data.

Ambiguous interpretation. The resulting metric has no clear, traceable definition. Combining “daily unique visitors” (grain: day-user) with “page views” (grain: page view event) without proper grouping yields a number that’s neither an accurate count of users nor a precise count of views.
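The double-counting failure is easy to reproduce. In this sketch (invented customers and orders, via Python’s sqlite3), a one-to-many join silently repeats each customer row once per order, so summing a customer-grain column after the join inflates the total:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id TEXT, credit_limit REAL)")
conn.execute("CREATE TABLE orders (customer_id TEXT, amount REAL)")
conn.execute("INSERT INTO customers VALUES ('ana', 100.0), ('bob', 200.0)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("ana", 10.0), ("ana", 20.0), ("bob", 5.0)])

# Correct: sum at the grain the column lives at (customer).
true_total = conn.execute(
    "SELECT SUM(credit_limit) FROM customers").fetchone()[0]

# Wrong: the join changes the grain to order, repeating ana's row.
joined_total = conn.execute("""
    SELECT SUM(c.credit_limit)
    FROM customers c JOIN orders o ON o.customer_id = c.id
""").fetchone()[0]

print(true_total)    # 300.0 at customer grain
print(joined_total)  # 400.0, ana counted once per order
```

The grain changed as a side effect of the join, not as a deliberate step, which is exactly the failure mode the rules below exist to prevent.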

A robust data model must enforce three rules to guarantee grain integrity:

Each instance has a defined grain. Every table or dataset must have its grain explicitly documented and understood by all users. This eliminates ambiguity about what each data instance represents.

Every aggregate operates on instances of that grain. When calculating a sum, average, or count, the operation must be performed on records that share the same grain as the final metric. If an aggregate requires a grain change (e.g., counting daily sessions to get monthly unique users), the change must be an explicit, deliberate operation (e.g., using COUNT(DISTINCT user_id)).

Grain changes are explicit, never accidental. Any transformation that alters the underlying grain—such as joining two tables or performing a grouping operation—must be a conscious, documented step. Grain alteration should never be a side effect of a poorly constructed query or an ill-conceived join.

Violations here are the root cause of most incorrect aggregates. Inaccurate revenue figures, skewed user metrics, and misleading performance reports are almost always traceable to a failure to manage grain. Proper grain management is a prerequisite to good data modeling.

Grouping and Disjointness

Once you’ve established the correct grain, you need to group items. Grouping implies aggregation, and the critical question is whether your groups are disjoint or overlapping.

Disjointness means every single instance in a dataset belongs to exactly one group—mutually exclusive buckets. When you drop a data point into a group, it cannot exist in any other group simultaneously.
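A tiny sketch with invented user segments shows why disjointness matters. When one instance lands in two groups, the group totals no longer reconcile with the population:

```python
# Invented users, each tagged with one or more segments.
users = [
    {"id": 1, "segments": {"mobile"}},
    {"id": 2, "segments": {"web"}},
    {"id": 3, "segments": {"mobile", "web"}},  # belongs to two groups
]

group_sizes = {
    "mobile": sum(1 for u in users if "mobile" in u["segments"]),
    "web":    sum(1 for u in users if "web" in u["segments"]),
}

print(group_sizes)                # {'mobile': 2, 'web': 2}
print(sum(group_sizes.values()))  # 4, but there are only 3 users
```

With overlapping groups, adding the buckets counts user 3 twice. Either make the groups mutually exclusive or make the overlap explicit in the model, so nobody sums the buckets expecting a total.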

Figure 9-1: Disjoint sets A and B
