11/7/2025

Correction for "GDP for Data Science"

If my blog were Stack Overflow, I think this post would earn me an Excavator badge.

Excavator badge on SO

(Or maybe a Necromancer if this post got a little traction.)

In the course of revitalizing this site and reviewing some historical material (a process I documented in the previous post), I found my coverage of Matt Noonan’s Ghosts of Departed Proofs.

At a previous job, I undertook a research project investigating how to apply GDP to data science (a project, I should add, that I greatly enjoyed). The idea was to encode domain knowledge about the pre- and post-conditions of data science and machine learning algorithms into the type system to catch mistakes earlier in the research process and give better, more domain-relevant error messages in the process.

I contrived an example about a preprocessing pipeline that, as it worked, accumulated proof that each processing step had been performed. These proofs were later discharged to satisfy the preconditions of the library’s clustering and classification algorithms (not pictured):

data Normalized
normalize :: df -> df ::: Normalized

data NoMissingVals
dropMissingVals :: df -> df ::: NoMissingVals
fillMissingVals :: df -> df ::: NoMissingVals

(.|) :: (a -> a ::: p) -> (a -> a ::: q) -> a -> a ::: p && q

preprocess :: df -> df ::: NoMissingVals && Normalized
preprocess = normalize .| fillMissingVals

With the benefit of hindsight, I realized that this formulation has a subtle but critical mistake.

The proven properties must be linked to the data they describe by a name, or else the proofs could be “conjured” away from the original data and attached to another term that did not undergo the preprocessing pipeline:

import Data.Refined ((...), conjure)

dfGood :: df ::: NoMissingVals && Normalized
dfGood = preprocess someData

dfBad :: df ::: NoMissingVals && Normalized
dfBad = otherData ...(conjure dfGood)

Our algorithms would accept dfBad even though it doesn’t actually meet their preconditions.

The fix

Bridge the earthly realm of data and phantom realm of proofs using a name. Something like:

import Theory.Named (Defn)

newtype Normalized df = Normalized Defn
normalize :: df ~~ name -> df ::: Normalized name

newtype NoMissingVals df = NoMissingVals Defn
dropMissingVals :: df ~~ name -> df ::: NoMissingVals name
fillMissingVals :: df ~~ name -> df ::: NoMissingVals name

preprocess :: df ~~ name -> df ::: NoMissingVals name && Normalized name

I still need to work through the details and implementation in a real example, but this is where I’d start.

The naming process in GDP leverages the ST trick, using a rank-2 signature to protect names from leaking to the wrong terms. Unlike in the original formulation, here there’s no well-typed way for me to steal the proof of processing from dfGood and attach it to dfBad; the processing proof would only apply to dfGood’s name, which will never be the same as dfBad’s name thanks to the ST trick.

Having written the original post five years ago, I don’t remember if this mistake was an intentional elision for the sake of brevity in a blog post, or a genuine oversight. In any case, it was fun to trawl the archives, and satisfying that my recollection of GDP was still just sharp enough to notice the error!

Previous: Rebuilding the site with Astro

Next: "Welcome to Basilisk"