11/26/2025
Correction for "GDP for Data Science"
If my blog were Stack Overflow, I think this post would earn me an Excavator badge.

(Or maybe a Necromancer if this post got a little traction.)
In the course of revitalizing this site and reviewing some historical material (a process I documented in the previous post), I found my coverage of Matt Noonan’s Ghosts of Departed Proofs.
At a previous job, I undertook a research project investigating how to apply GDP to data science (a project, I should add, that I greatly enjoyed). The idea was to encode domain knowledge about the pre- and post-conditions of data science and machine learning algorithms into the type system to catch mistakes earlier in the research process and give better, more domain-relevant error messages in the process.
I contrived an example about a preprocessing pipeline that, as it worked, accumulated proof that each processing step had been performed. These proofs were later discharged to satisfy the preconditions of the library’s clustering and classification algorithms (not pictured):
data Normalized
normalize :: df -> df ::: Normalized
data NoMissingVals
dropMissingVals :: df -> df ::: NoMissingVals
fillMissingVals :: df -> df ::: NoMissingVals
(.|) :: (a -> a ::: p) -> (a -> a ::: q) -> a -> a ::: p && q
preprocess :: df -> df ::: NoMissingVals && Normalized
preprocess = normalize .| fillMissingVals
With the benefit of hindsight, I realized that this formulation has a subtle but critical mistake.
The proven properties must be linked to the data they describe by a name, or else the proofs could be “conjured” away from the original data and attached to another term that did not undergo the preprocessing pipeline:
import Data.Refined ((...), conjure)
dfGood :: df ::: NoMissingVals && Normalized
dfGood = preprocess someData
dfBad :: df ::: NoMissingVals && Normalized
dfBad = otherData ...(conjure dfGood)
Our algorithms would accept dfBad even though it doesn’t actually meet their
preconditions.
The fix
Bridge the earthly realm of data and phantom realm of proofs using a name. Something like:
import Theory.Named (Defn)
newtype Normalized df = Normalized Defn
normalize :: df ~~ name -> df ::: Normalized name
newtype NoMissingVals df = NoMissingVals Defn
dropMissingVals :: df ~~ name -> df ::: NoMissingVals name
fillMissingVals :: df ~~ name -> df ::: NoMissingVals name
preprocess :: df ~~ name -> df ::: NoMissingVals name && Normalized name
I still need to work through the details and implementation in a real example, but this is where I’d start.
The naming process in GDP leverages the ST trick, using a rank-2 signature to protect
names from leaking to the wrong terms. Unlike in the original formulation, here there’s
no well-typed way for me to steal the proof of processing from dfGood and attach it to
dfBad; the processing proof would only apply to dfGood’s name, which will never be
the same as dfBad’s name thanks to the ST trick.
Having written the original post five years ago, I don’t remember if this mistake was an intentional elision for the sake of brevity in a blog post, or a genuine oversight. In any case, it was fun to trawl the archives, and satisfying that my recollection of GDP was still just sharp enough to notice the error!