At Trilliant Health, we maintain dozens of reference datasets from various sources for use in our data pipelines. Some come from government agencies, others from industry standards bodies, some are extracted by us from raw claims data and others from our own team's careful research and curation. They cover various domains: procedures, diagnoses, providers, addresses, geographies, demographic statistics. Each dataset has its own format, its own quirks, its own schema and its own update schedule.
Reference data is the foundation of our analytics products. If it isn't solid, nothing downstream of it can be relied on: not our machine learning models, business logic or analyses, and not the products our customers build on top of them.
We have to keep all of these datasets current and error-free to ensure that our products are built on the most timely and high-quality information. Anyone integrating many disparate external data sources into their business can probably understand the challenges we face. To integrate this data efficiently, we need to:
It's like trying to maintain a library full of books written in multiple languages, printed in multiple formats – paperback, hardcover, spiral-bound – that are constantly being updated with new editions (sometimes with the introduction of typos!). You want to maintain a high-quality, crisp hardcover English translation of every book in the library, in the latest edition, free of typos. And you want to keep your card catalog current so that your patrons can find the books they're looking for and easily determine who the original authors were, what languages they were writing in and when the latest edition was published.
To solve this problem, we created tabularasa (Latin for "blank slate"), a tool that helps us start fresh and bring order to this chaos. The name is fitting – we want to wipe the slate clean and build a uniform interface for versioning, checking, accessing, documenting and maintaining our tabular datasets.
In the library analogy above, tabularasa is the library management system that keeps the card catalog current and ensures those hardcover English versions are consistent in style, up-to-date and typo-free.
tabularasa takes an opinionated view that, before writing any production code for using your data, you should be able to specify declaratively, in detail, what that data should look like. In other words, you should know and document explicitly things like:
You should also be able to explain in natural language what the data means, from a business perspective.
tabularasa accomplishes this by defining a minimal schema language in which all of these things can be specified in one place for any project that version-controls tabular datasets. We call this the schema file. In practice it’s just a subset of YAML that conforms to a “meta-schema” – the schema for specifying the schema of your data. It’s like the “information schema” that you would find in a relational database which describes which tables are present as well as their columns, data types and indexes – but it also provides further information such as documentation and specification of raw source files and derivations.
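For illustration, here is a minimal sketch of what such a schema file might look like. The field names below are hypothetical, chosen to convey the idea rather than to reproduce tabularasa's actual meta-schema:

```yaml
# Hypothetical schema file sketch – the keys here are illustrative,
# not the actual tabularasa meta-schema.
tables:
  procedure_codes:
    doc: Standardized procedure codes with our internal groupings.
    source:
      path: raw/procedure_codes_2024.csv
      format: csv
    columns:
      code:
        type: string
        doc: The procedure code as published by the standards body.
        primary_key: true
      description:
        type: string
        doc: Human-readable description of the procedure.
      category:
        type: string
        enum: [surgical, diagnostic, evaluation]
        doc: Our internal grouping of the procedure.
```

The point is that names, types, constraints, documentation and provenance all live together in one declarative artifact, rather than being scattered across parsing code, docs and tribal knowledge.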
Importantly, the schema file constitutes one central artifact from which a great many useful things can be derived. It serves as a “lingua franca” or “Rosetta Stone” for data definition and documentation. Let’s say you’re a developer maintaining an internal view of some 3rd party datasets with custom derived columns, and you need to:
These sound like five distinct tasks, each of which might require some hand-written code to accomplish. However, with tabularasa, they're nearly complete as soon as you define your schema file. You will have to write the code that performs any custom derivations you want (tabularasa can't read your mind), but even this requires only defining a function which takes certain structured inputs representing the raw source data (a sketch of such a function follows the command list below). And if you don't need a custom derivation, just a conversion from a standard tabular format into clean, version-controlled Parquet, you don't even need that – tabularasa will handle parsing, column renaming and data type casting for you! It's that simple. With the schema file in hand you can accomplish the above tasks with a few simple commands:
- `tabularasa datagen` to have all your final derived tables generated and the hashes pinning their new versions updated in the schema file
- `tabularasa push` to have the derived tables published as versioned Parquet files to a shared blob store
- `tabularasa init-sqlite` to get the data into a SQLite archive format complete with indexes for fast lookups
- `tabularasa codegen` to generate Python code for accessing the data in multiple ways, such as pandas data frames
- `tabularasa docgen` to generate publishable web documentation for non-technical users, encompassing data definitions, types, constraints and provenance

Of course these can be scripted or automated further with the tools of your choosing – CI, pre-commit, etc.
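To give a feel for the custom-derivation step mentioned above, here is a minimal sketch of what a derivation function might look like. The function name, signature and pandas-based interface are assumptions for illustration, not tabularasa's actual API:

```python
# Hypothetical derivation function – the signature and the idea that
# raw sources arrive as pandas DataFrames are assumptions for
# illustration, not tabularasa's actual API.
import pandas as pd


def derive_procedure_codes(
    raw_codes: pd.DataFrame, raw_groupings: pd.DataFrame
) -> pd.DataFrame:
    """Join raw procedure codes to our internal groupings and normalize text."""
    derived = raw_codes.merge(raw_groupings, on="code", how="left")
    derived["description"] = derived["description"].str.strip().str.title()
    # The declared schema (types, enums, uniqueness constraints) is what
    # actually validates this output – the function only expresses the logic.
    return derived
```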
As of the time of writing, we maintain five internal libraries which define their own tabularasa schemas and publish their own tabular data artifacts. Among those five libraries, there are 99 tables defined, of which 51 are published internally, and the remainder are “transient,” used only for internal derivations (but still equally strictly constrained, so that we can also check our assumptions at each derivation step).
In these libraries collectively, there are ~5,000 lines of tabularasa schema files, from which are generated ~14,000 lines of Python schemas and accessor code, ~750 lines of SQL (not all projects, or all tables defined therein, require a SQL interface) and ~3,700 lines of reStructuredText documentation. That means that for every line of declarative schema we maintain, we get almost four lines of Python, SQL and reStructuredText maintained for free, correct by construction. To a first approximation, this allows our team to move about four times as fast at keeping those datasets up-to-date and well documented, while producing fewer errors. But the speedup is probably greater: by catching issues early and legibly – sometimes right down to the specific data row that failed a constraint check – we also save a lot of time on tedious debugging.
At Trilliant Health, we have a value of “surfacing the bad news quickly.” If we’ve made a mistake, we want to know as soon as possible, so we can fix it quickly. In software engineering, there’s an analogous virtue of “failing fast.” tabularasa embodies these values in a few ways:
- Your schema file is validated against the meta-schema before every `tabularasa` operation, so you can't accidentally misspecify your schema (e.g. by introducing a typo in a field name or referencing a file that doesn't exist).
- Generated code is as strictly typed as possible: a column constrained to a fixed set of values is typed not as `str` but `Literal["one", "of", "these", "values"]`.

In addition to the "fail fast" and "correct by construction" features above, tabularasa also provides visibility on the schema, data and changes to either – allowing developers to quickly understand the state of a project and the impact of their changes.
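As a hypothetical illustration of that last point, generated accessor code might expose a row type like the one below. The class and field names are invented for illustration, not actual tabularasa output:

```python
# Hypothetical sketch of strictly-typed generated code – the names are
# invented for illustration, not actual tabularasa output.
from dataclasses import dataclass
from typing import Literal


@dataclass
class ProcedureCodeRow:
    code: str
    description: str
    # The enum constraint declared in the schema file becomes a Literal
    # type, so a type checker catches invalid values before runtime.
    category: Literal["surgical", "diagnostic", "evaluation"]
```

With that in place, a misspelled category value is a type error in your editor rather than a data bug in production.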
With `tabularasa dag`, you can quickly visualize the flow of source data through a series of complex derivations. Here's the output for our internal reference data table on procedure codes, for example – one of our most complex derivations:
With `tabularasa data-diff` and `tabularasa schema-diff`, you can quickly understand the impact of updates to source data or derivation logic. When these produce output, we automatically post comments on pull requests to flag that there are changes worth reviewing and to provide a high-level summary of those changes. For a data-driven organization like Trilliant Health, reviewing data changes is as important as reviewing code changes! Here's a comment that was posted for review on a recent pull request:
## `pt_organization_type`

### Key Changes:

|              |   count |    percent |
|:-------------|--------:|-----------:|
| dropped_rows |      14 | 0.00511954 |
| added_rows   |    2116 |    0.76788 |

### Metadata Changes:

|          |   before |   after |
|:---------|---------:|--------:|
| num_rows |   273462 |  275564 |

### Value Changes:

| column                                 |   updated |   updated_percent |
|:---------------------------------------|----------:|------------------:|
| rendering_organization_ownership_type |      1579 |          0.577441 |
This tells the reviewer that a few rows were dropped, and a couple thousand added, in the `pt_organization_type` table, and that the `rendering_organization_ownership_type` column was updated on 1,579, or roughly 0.58%, of rows. A helpful note at the bottom of the comment provides a command that can be run locally to produce more detail about the changes. Some results could signal the need for a deeper review and possible remediation. For example, if any null values were introduced, we would also see a `nulled` column, and we might want to check for a bad join or a decrease in fill rate in one of our 3rd-party datasets. Similarly, a very high rate of updates in a column could be cause for concern: perhaps a data provider changed the format or case convention of a string column?
What started as a solution to our own data management headaches has become an essential tool in our data infrastructure. It's not just about making developers' lives easier (though it definitely does that); it's also about creating a foundation of trust in our data, and knowing that when we build healthcare analytics products, we're building them on solid ground. tabularasa helps us build trust in our data by:
- Tracking and versioning changes to our data, the way `git` helps us do with code

In the world of healthcare data, where accuracy and reliability are paramount, having a tool like tabularasa isn't just a convenience – it's a necessity. By automating the tedious and error-prone aspects of reference data management, we've freed up our team to focus on what really matters: building better healthcare data products.
To that end, we're releasing tabularasa to the public. You can install it today with `pip install thds.tabularasa`. We invite you to try the package out and see how it can help your teams manage datasets.