
The Unexpected Impacts of Pure Functions

Written by Peter Gaultney | June 12, 2025 at 5:26 PM

Modern software development has been transformed by tools that automate what was once manual coordination. Version control eliminated the need for developers to manually track the state of the code: which version they're working with, who changed what, and how to collaborate on and share contributions. More recently, various tools have emerged to do something similar for data assets, tracking the state of datasets, model files and their lineage through systems like DVC, Git LFS or cloud-native solutions.

But there's a third category of state that remains largely implicit in most organizations: the state of work itself. Who has run which analyses? What inputs did they use? Are the results I'm depending on still current, or do they need to be regenerated? In any non-trivial endeavor, answering these questions eventually requires a web of Slack messages, meetings and tribal knowledge. In fact, this exhibits characteristics of an N² coordination problem similar to the one that version control has long been helping us to avoid.

Just as git fundamentally changed how we think about code collaboration, the right approach to tracking computable work can eliminate entire categories of organizational overhead. This is the story of how our team discovered that a seemingly technical choice — representing components of our work as pure functions — began to transform not just our codebase, but the way our Machine Learning department collaborates and produces the valuable analytical insights that drive our business.

Why pure functions matter: a foundation of confidence

The solution to this coordination problem lies in a foundational shift that we touched on in our article introducing mops: representing every meaningful chunk of work as a pure function. If 'pure function' seems like an oddly technical phrase for describing an organizational transformation, let's revisit what makes these functions special.

A pure function represents a unique result that we can arrive at reproducibly via the application of some well-defined logic to fully-specified inputs. Unlike most code, which might read from databases, check the current time or depend on files that change, a pure function's output is completely determined by its inputs. Consider the difference between get_user_count(), which queries a database and might return different results each time, and calculate_risk_score(patient_data, model_weights), which will always return the same score given the same inputs. The latter is a pure function.
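
To make that contrast concrete, here is a minimal sketch; the database handle and the scoring arithmetic are hypothetical placeholders, not our production code:

```python
import sqlite3

def get_user_count(db: sqlite3.Connection) -> int:
    # Impure: the result depends on hidden, changing state,
    # namely whatever rows happen to be in the database right now.
    return db.execute("SELECT count(*) FROM users").fetchone()[0]

def calculate_risk_score(patient_data: dict, model_weights: dict) -> float:
    # Pure: the output is fully determined by the arguments.
    # Calling it twice with the same inputs yields the same score.
    return sum(model_weights[feature] * value
               for feature, value in patient_data.items())
```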

This simple but significant constraint unlocks powerful capabilities:

Reproducibility and Provenance: We can track inputs and outputs explicitly, providing confidence in the provenance of our work. Knowing precisely what went into a process allows us to understand and trust its result.

Self-Verifying Systems: The question "Should I re-run this?" becomes trivial to answer automatically. Instead of spending time asking around about whether something has been run, or with what data, or re-doing work "just in case," the system can simply check: have any of the inputs changed since the last time this function was executed? And the "checking" code is the same as the code that orchestrates the system; there is only a single system, operating in the most efficient mode at each step (see the sketch after this list).

When every step in your system is expressed as a pure function, dependency chains are fully explicit and automatically managed. 
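
What that check can look like in practice: below is a minimal sketch of input-keyed caching, assuming JSON-serializable inputs. The fingerprinting scheme and in-memory cache are our own simplifications for illustration, not mops internals.

```python
import functools
import hashlib
import json

_cache: dict = {}  # fingerprint of inputs -> previously computed result

def pure_step(fn):
    """Run fn only if its inputs have changed since the last execution."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Fingerprint the function name plus its fully-specified inputs.
        payload = json.dumps([fn.__name__, args, kwargs],
                             sort_keys=True, default=str)
        fingerprint = hashlib.sha256(payload.encode()).hexdigest()
        if fingerprint not in _cache:      # inputs are new or have changed
            _cache[fingerprint] = fn(*args, **kwargs)
        return _cache[fingerprint]         # otherwise reuse the prior result
    return wrapper

@pure_step
def calculate_risk_score(patient_data: dict, model_weights: dict) -> float:
    return sum(model_weights[k] * v for k, v in patient_data.items())
```

Note that the orchestration and the "should I re-run?" decision live in the same place: the wrapper either computes or reuses, and nothing else needs to be consulted.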

Unified Directory: self-coordination emerges

Over the past year, we migrated one of our most complex pipelines, Unified Directory, to use mops end to end.

The old system exemplified the difficulty of N² coordination. Running the full pipeline required careful orchestration across multiple team members. Someone needed to track down the appropriate inputs and verify their status, then shepherd the computation and finally manage downstream handoffs. Senior developers and subject matter experts would spend days coordinating these runs, becoming bottlenecks for the entire team.

Running our legacy pipeline

Figure 1: Seeing results from changes to a model in the legacy pipeline could take a week of coordinating with other teammates.

The Directed Acyclic Graph (DAG) of the new system tells a dramatically different story. The new Unified Directory system now runs as a collection of pure functions, with all inputs to each one made explicit. When a developer wants to integrate changes into the system, they import and call the Python function that produces what they need, then plug their result into the few functions that depend on it. Behind the scenes, the system automatically determines what needs to be computed, what can be reused from cache and what order operations should run in.
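
In spirit, integration looks like ordinary function composition. This is a hypothetical sketch with invented step names, not mops' actual API:

```python
# Each step is a pure function of explicit inputs.
def normalize_records(raw_records: list[dict]) -> list[dict]:
    return [{**r, "name": r["name"].strip().lower()} for r in raw_records]

def deduplicate(records: list[dict]) -> list[dict]:
    seen: set[str] = set()
    unique = []
    for r in records:
        if r["name"] not in seen:
            seen.add(r["name"])
            unique.append(r)
    return unique

def build_directory(records: list[dict]) -> dict:
    return {r["name"]: r for r in records}

# A developer plugs their step's result into whatever consumes it;
# a framework like mops can then cache each step and skip any step
# whose inputs have not changed.
raw = [{"name": "  Ada  "}, {"name": "ada"}, {"name": "Grace"}]
directory = build_directory(deduplicate(normalize_records(raw)))
```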

But the transformation revealed something we didn't anticipate: the real limitation wasn't just the time cost – it was how this overhead had constrained our ambitions. The mental energy required to coordinate a complex pipeline meant we had settled for easier approaches, even when we knew more sophisticated analysis would create better business outcomes. The new DAG reveals this plainly by the number of new nodes and the proliferation of edges between them.

Unified Directory DAG post-mops

Figure 2: The DAG of our new Unified Directory pipeline, generated by mops, with some formatting updates done by hand.

The cherry on top? The new system documents itself. Because every dependency is explicit and tracked, we can generate a visualization of the entire pipeline structure on demand: not as an abstract idea of what is supposed to happen, but dynamically, directly from an actual run of the code. The system's architecture emerges from its constituent parts rather than requiring separate documentation that inevitably falls out of sync.
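
For instance, once dependency edges are recorded, rendering them is a simple traversal. Here is a minimal sketch that emits Graphviz DOT text; the dependency map is hand-written for illustration, whereas in the real system it would be captured from an actual run:

```python
# step name -> names of the steps whose outputs it consumes
dependencies: dict[str, list[str]] = {
    "normalize_records": [],
    "deduplicate": ["normalize_records"],
    "build_directory": ["deduplicate"],
}

def to_dot(deps: dict[str, list[str]]) -> str:
    """Render a recorded dependency map as Graphviz DOT text."""
    lines = ["digraph pipeline {"]
    for step, inputs in deps.items():
        for upstream in inputs:
            lines.append(f'  "{upstream}" -> "{step}";')
    lines.append("}")
    return "\n".join(lines)

print(to_dot(dependencies))  # paste into any DOT viewer to draw the DAG
```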

This shift is powerful even as it verges on paradoxical: we built a system with a far more finely grained structure, but with far less coordination overhead. The new Unified Directory pipeline handles more data sources, performs more sophisticated transformations and produces more comprehensive outputs than its predecessor, yet any team member can trigger a partial or complete run with a single script, confident that all of the necessary work, and only the necessary work, will be performed. Meanwhile, the team members who are the experts on the various parts of the system are building new solutions, undistracted by the need to consult on or shepherd the work that's already been done.

A path forward for any organization

To be clear: when we created mops, we didn't set out to change how our Machine Learning department collaborates. We needed to be able to run code across distributed memory, and we wanted fault tolerance across the components of our data pipelines.

But what we got was a dramatic reduction in coordination overhead and operational uncertainty that has freed us to focus on communicating about and answering questions that are valuable to our customers. It wasn't just time that we got back; it was the ability to think strategically about the bigger picture without getting lost in the weeds. The transformation was bigger than we expected because coordination overhead is pervasive yet sneaky: it's the hidden tax that every growing team pays.

Critically, you don't need to change everything at once. We started with our most painful coordination points and gradually expanded. Today, we still have systems that don't use this approach, and we can perform individual cost-benefit analyses about migrating pieces as our business evolves.

The engineering effort to build mops was low. It was a part-time project for one developer over roughly a year, driven mainly by first-principles thinking about pure functions. Applying it to our use case took just a few weeks. Whether you achieve this by adopting mops itself, building your own tool in-house or adapting existing tools to enforce similar patterns, the core insight remains accessible.

When teams spend significant mental energy on coordination mechanics rather than essential business problems, that friction is often more addressable than it appears, and the opportunity cost of continuing with business as usual may be higher than you've realized. The key is recognizing that explicit dependency tracking and automatic coordination aren't just technical features; they're what enable teams to tackle the ambitious challenges that create real impact for your customers.