My favorite data(base) fields

Fields and columns that could save you lots of time

The other kind of favorite columns

I’ll share my two tips for database structure based on “what I wish I had” when working with — and maintaining legacy applications. I’ve experienced the pain of not following the recommendations below — fixing it required lots of time and a deliberate approach. As it usually is with tips like these, one should think about where and when to use them and even more importantly, when to not use them.

They apply for both relational and nonrelational databases, but I’ll use simple Postgres as a reference. Examples are written in Scala3 but should be applicable in other typed languages.

Use Enumerations instead of Booleans

You have a User table and you wish to track whether the user is active or not. The first temptation might be having a separate specific boolean column to track whether the user is active or not.

CREATE TABLE users (
... # your columns
is_active BOOLEAN NOT NULL
);

being represented as:

case class User(..., isActive: Boolean)if user.isActive then
...

According to this tip, you’d create them in a different form — with status as a string (or anyform supporting encoding of enums), where

CREATE TABLE users (
... # your columns
user_status CHARACTER VARYING(255) NOT NULL
);

and representation in the code as an enum:

case class User(..., status: UserStatus)enum UserStatus:
case Active, Inactive

Instead of if user.isActive then now we pattern match in code:

user.status match { 
case UserStatus.Active => ...
case UserStatus.Inactive => ...
}

So far, both solutions yield equal capabilities — arguably with the enum solution being more readable. Later down the road, when we need to support further requirements, updating enum is simpler. If we need to add Banned and Deleted statuses, we just update the enum definition. A compiler will guide you through places where you need to update your implementation to handle new cases.

enum UserStatus:
case Active, Inactive, Banned, Deleted

You’d get a warning from a type checker on every place where you were matching, forcing you to be explicit about desired behavior — e.g. making sure you don’t allow Banned users to post messages.

user.status match { 
case UserStatus.Active => ...
case UserStatus.Inactive => ...
<-----error: missing implementations for cases...help!
}

If we stick to boolean flags — an approach that I don’t recommend (I have seen this way too often in the past), we could add “isDeleted” and “isBanned” flags.

The following code is brittle, painful to manage and maintain, and most importantly, makes me sad.

CREATE TABLE users (
... # your columns
is_active BOOLEAN NOT NULL
is_deleted BOOLEAN NOT NULL
is_banned BOOLEAN NOT NULL
);

Our class would be the following:

case class User(..., isActive: Boolean, isDeleted: Boolean, isBanned: Boolean)

We should strive to make impossible states unrepresentable in our data layer. The above violates that — a user should not be able to be active and deleted at the same time. We now have to actively and carefully search the codebase for usages of these flags and maintain boolean puzzles such as below. We also get no warnings about places that could be affected by these new flags.

if user.isActive && !user.isDeleted && !user.isBanned then
...

Storing enums instead of booleans may end up costing you few more bytes of database storage, but I’d say it’s a small price to pay to avoid maintenance burden.

I’d say this approach useful even for nonpersistent data modeling. I’m explicitly mentioning database context because once the actual data is stored (or is missing), the maintenance burden increases.

Track source of the data

Data today can come from various sources. The obvious one is that the user had created it. But it could have been imported, or some system value. Or maybe some machine-learning overnight job generated it with some confidence. I recommend tracking the source of data, especially if there are multiple possible ones.

CREATE TABLE eshop_products (
... # your columns
created_by CHARACTER VARYING(255) NOT NULL
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
);

with the created_by being represented in the code as an enum:

enum DataSource:
case ImportJob, UserInput, MachineLearningGenerated

This comes in very handy when you discover a bug (i.e. let’s say your import job had a bug, or your machine learning algorithm generated something non-sensible), and you need to find, delete or update it in your storage. We could easily embed also precise source (let’s say we used JSONB representation as a column).

enum DataSource:
case ImportJob(jobId: Int) extends DataSource
case UserInput(userId: UUID) extends DataSource
case MachineLearningGenerated(version: Long) extends DataSource

In a similar fashion, you could track updates to the row, as well as time when the creation/update was performed.

Enjoyed this post? Subscribe to the Machine Learnings newsletter for an easy understanding of AI advances shaping our world.