At Trilliant Health, we maintain dozens of reference datasets from various sources for use in our data pipelines. Some come from government agencies, others from industry standards bodies, some are extracted by us from raw claims data and others from our own team's careful research and curation. They cover various domains: procedures, diagnoses, providers, addresses, geographies, demographic statistics. Each dataset has its own format, its own quirks, its own schema and its own update schedule.
Reference data is the foundation of our analytics products. If it isn't solid, nothing downstream of it can be relied on: not our machine learning models, business logic or analyses, and not the products our customers build on top of them.
We have to keep all of these datasets current and error-free to ensure that our products are built on the most timely and high-quality information. Anyone integrating many disparate external data sources into their business can probably understand the challenges we face. To integrate this data efficiently, we need to:
It's like trying to maintain a library full of books written in multiple languages, printed in multiple formats – paperback, hardcover, spiral-bound – that are constantly being updated with new editions (sometimes with the introduction of typos!). You want to maintain a high-quality, crisp hardcover English translation of every book in the library, in the latest edition, free of typos. And you want to keep your card catalog current so that your patrons can find the books they're looking for and easily determine who the original authors were, what languages they were writing in and when the latest edition was published.
To solve this problem, we created tabularasa (Latin for "blank slate"), a tool that helps us start fresh and bring order to this chaos. The name is fitting – we want to wipe the slate clean and build a uniform interface for versioning, checking, accessing, documenting and maintaining our tabular datasets.
In the library analogy above, tabularasa is the library management system that keeps the card catalog current and ensures those hardcover English versions are consistent in style, up-to-date and typo-free.
tabularasa takes an opinionated view that, before writing any production code for using your data, you should be able to specify declaratively, in detail, what that data should look like. In other words, you should know and document explicitly things like:
You should also be able to explain in natural language what the data means, from a business perspective.
tabularasa accomplishes this by defining a minimal schema language in which all of these things can be specified in one place for any project that version-controls tabular datasets. We call this the schema file. In practice it’s just a subset of YAML that conforms to a “meta-schema” – the schema for specifying the schema of your data. It’s like the “information schema” that you would find in a relational database which describes which tables are present as well as their columns, data types and indexes – but it also provides further information such as documentation and specification of raw source files and derivations.
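For illustration, here is a minimal sketch of what such a schema file might look like. The field names below are hypothetical, chosen to convey the idea rather than to reproduce tabularasa's actual meta-schema:

```yaml
# Hypothetical schema file sketch – the keys here are illustrative,
# not the actual tabularasa meta-schema.
tables:
  procedure_codes:
    doc: Standardized procedure codes with our internal groupings.
    source:
      path: raw/procedure_codes_2024.csv
      format: csv
    columns:
      code:
        type: string
        doc: The procedure code as published by the standards body.
        primary_key: true
      description:
        type: string
        doc: Human-readable description of the procedure.
      category:
        type: string
        enum: [surgical, diagnostic, evaluation]
        doc: Our internal grouping of the procedure.
```

The point is that names, types, constraints, documentation and provenance all live together in one declarative artifact, rather than being scattered across parsing code, docs and tribal knowledge.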
Importantly, the schema file constitutes one central artifact from which a great many useful things can be derived. It serves as a “lingua franca” or “Rosetta Stone” for data definition and documentation. Let’s say you’re a developer maintaining an internal view of some 3rd party datasets with custom derived columns, and you need to:
These sound like five distinct tasks, each of which might require some hand-written code to accomplish. However, with tabularasa, they're nearly complete as soon as you define your schema file. You will have to write the code that performs any custom derivations you want (tabularasa can't read your mind), but even this requires only defining a function which takes certain structured inputs representing the raw source data (a sketch of such a function follows the command list below). And if you don't need a custom derivation, just a conversion from a standard tabular format into clean, version-controlled Parquet, you don't even need that – tabularasa will handle parsing, column renaming and data type casting for you! It's that simple. With the schema file in hand you can accomplish the above tasks with a few simple commands:
- `tabularasa datagen` to have all your final derived tables generated and the hashes pinning their new versions updated in the schema file
- `tabularasa push` to have the derived tables published as versioned Parquet files to a shared blob store
- `tabularasa init-sqlite` to get the data into a SQLite archive format complete with indexes for fast lookups
- `tabularasa codegen` to generate Python code for accessing the data in multiple ways, such as pandas data frames
- `tabularasa docgen` to generate publishable web documentation for non-technical users, encompassing data definitions, types, constraints and provenance

Of course these can be scripted or automated further with the tools of your choosing – CI, pre-commit, etc.
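To give a feel for the custom-derivation step mentioned above, here is a minimal sketch of what a derivation function might look like. The function name, signature and pandas-based interface are assumptions for illustration, not tabularasa's actual API:

```python
# Hypothetical derivation function – the signature and the idea that
# raw sources arrive as pandas DataFrames are assumptions for
# illustration, not tabularasa's actual API.
import pandas as pd


def derive_procedure_codes(
    raw_codes: pd.DataFrame, raw_groupings: pd.DataFrame
) -> pd.DataFrame:
    """Join raw procedure codes to our internal groupings and normalize text."""
    derived = raw_codes.merge(raw_groupings, on="code", how="left")
    derived["description"] = derived["description"].str.strip().str.title()
    # The declared schema (types, enums, uniqueness constraints) is what
    # actually validates this output – the function only expresses the logic.
    return derived
```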
As of the time of writing, we maintain five internal libraries which define their own tabularasa schemas and publish their own tabular data artifacts. Among those five libraries, there are 99 tables defined, of which 51 are published internally, and the remainder are “transient,” used only for internal derivations (but still equally strictly constrained, so that we can also check our assumptions at each derivation step).
In these libraries collectively, there are ~5,000 lines of tabularasa schema files, from which are generated ~14,000 lines of Python schemas and accessor code, ~750 lines of SQL (not all projects, or all tables defined therein, require a SQL interface) and ~3,700 lines of reStructuredText documentation. That means that for every line of declarative schema we maintain, we get almost four lines of Python, SQL and reStructuredText maintained for free, correct by construction. To a first approximation, this allows our team to move about four times as fast at keeping those datasets up-to-date and well documented, while producing fewer errors. But the speedup is probably greater: by catching issues early and legibly – sometimes right down to the specific data row that failed a constraint check – we also save a lot of time on tedious debugging.
At Trilliant Health, we have a value of “surfacing the bad news quickly.” If we’ve made a mistake, we want to know as soon as possible, so we can fix it quickly. In software engineering, there’s an analogous virtue of “failing fast.” tabularasa embodies these values in a few ways:
- Your schema file is validated against the meta-schema before every `tabularasa` operation, so you can't accidentally misspecify your schema (e.g. by introducing a typo in a field name or referencing a file that doesn't exist).
- Generated code is as strictly typed as possible: a column constrained to a fixed set of values is typed not as `str` but `Literal["one", "of", "these", "values"]`.

In addition to the "fail fast" and "correct by construction" features above, tabularasa also provides visibility on the schema, data and changes to either – allowing developers to quickly understand the state of a project and the impact of their changes.
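As a hypothetical illustration of that last point, generated accessor code might expose a row type like the one below. The class and field names are invented for illustration, not actual tabularasa output:

```python
# Hypothetical sketch of strictly-typed generated code – the names are
# invented for illustration, not actual tabularasa output.
from dataclasses import dataclass
from typing import Literal


@dataclass
class ProcedureCodeRow:
    code: str
    description: str
    # The enum constraint declared in the schema file becomes a Literal
    # type, so a type checker catches invalid values before runtime.
    category: Literal["surgical", "diagnostic", "evaluation"]
```

With that in place, a misspelled category value is a type error in your editor rather than a data bug in production.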
With `tabularasa dag`, you can quickly visualize the flow of source data through a series of complex derivations. Here's the output for our internal reference data table on procedure codes, for example – one of our most complex derivations:
With `tabularasa data-diff` and `tabularasa schema-diff`, you can quickly understand the impact of updates to source data or derivation logic. When these produce output, we automatically post comments on pull requests to flag that there are changes worth reviewing and to provide a high-level summary of those changes. For a data-driven organization like Trilliant Health, reviewing data changes is as important as reviewing code changes! Here's a comment that was posted for review on a recent pull request:
## `pt_organization_type`

### Key Changes:

|              |   count |    percent |
|:-------------|--------:|-----------:|
| dropped_rows |      14 | 0.00511954 |
| added_rows   |    2116 |    0.76788 |

### Metadata Changes:

|          |   before |   after |
|:---------|---------:|--------:|
| num_rows |   273462 |  275564 |

### Value Changes:

| column                                 |   updated |   updated_percent |
|:---------------------------------------|----------:|------------------:|
| rendering_organization_ownership_type |      1579 |          0.577441 |
This tells the reviewer that a few rows were dropped, and a couple thousand added, in the `pt_organization_type` table, and that the `rendering_organization_ownership_type` column was updated on 1,579, or roughly 0.58%, of rows. A helpful note at the bottom of the comment provides a command that can be run locally to produce more detail about the changes. Some results could signal the need for a deeper review and possible remediation. For example, if any null values were introduced, we would also see a `nulled` column, and we might want to check for a bad join or a decrease in fill rate in one of our 3rd-party datasets. Similarly, a very high rate of updates in a column could be cause for concern: perhaps a data provider changed the format or case convention of a string column?
What started as a solution to our own data management headaches has become an essential tool in our data infrastructure. It's not just about making developers' lives easier (though it definitely does that); it's also about creating a foundation of trust in our data, and knowing that when we build healthcare analytics products, we're building them on solid ground. tabularasa helps us build trust in our data by:
- Tracking and versioning changes to our data, the way `git` helps us do with code

In the world of healthcare data, where accuracy and reliability are paramount, having a tool like tabularasa isn't just a convenience – it's a necessity. By automating the tedious and error-prone aspects of reference data management, we've freed up our team to focus on what really matters: building better healthcare data products.
To that end, we're releasing tabularasa to the public. You can install it today with `pip install thds.tabularasa`. We invite you to try the package out and see how it can help your teams manage datasets.