The high-level concept is basically this:

a) Versioning is obviously pretty useful. Developers have been versioning their code forever, but the reality is that right now most companies' data science teams version their data sets by naming them with today's date, or something equally ad hoc (if they version them at all).

b) Versioning data sets by dating them isn't actually terrible if each of your data pipelines consumes only one data set and produces one output. But if you have, e.g., a pipeline P consuming data sets A and B, and you want to understand the output of P (e.g. to debug it), you need to know what the state of both A and B was at the time P produced that output (see the sketch below).

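To make that concrete, here's a minimal sketch of what running P has to record for its output to stay debuggable. This is plain Python, not Pachyderm's API; transform() and the version format are hypothetical stand-ins:

    import hashlib

    def version_of(data: bytes) -> str:
        # Content-address the bytes so "version" means "exact contents",
        # not whatever date happened to be in the filename.
        return hashlib.sha256(data).hexdigest()[:12]

    def transform(a: bytes, b: bytes) -> bytes:
        # Stand-in for P's real logic; concatenation keeps the sketch runnable.
        return a + b

    def run_p(a: bytes, b: bytes, p_code: bytes) -> dict:
        # Record every version the output depends on. A dated filename
        # captures at most one of these three; debugging needs all of them.
        return {
            "output": transform(a, b),
            "provenance": {
                "A": version_of(a),       # which A was read
                "B": version_of(b),       # which B was read at the same time
                "P": version_of(p_code),  # which implementation of P ran
            },
        }

    print(run_p(b"rows of A", b"rows of B", b"source of P")["provenance"])
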
In other words, it's not enough to look at yesterday's version of A; you have to know what the version of B was at the same time, and if your implementation of P is changing (e.g. it's a model and you keep adjusting hyperparameters) then you need to know the version of P too. So PFS tracks the versions of A and B, and PPS tracks the dependence of P on A and B, so you retain the ability to go from "output of P" back to the exact data in A and B that produced it.

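The payoff is that reverse lookup. Here's the idea as a toy in-memory index (not how PFS/PPS actually store this): given an output version, get back the exact input versions behind it:

    # Toy provenance index: output version -> the input versions behind it.
    provenance_index: dict = {}

    def record(output_version: str, inputs: dict) -> None:
        provenance_index[output_version] = inputs

    def explain(output_version: str) -> dict:
        # Go from "output of P" back to the exact A, B, and P that produced it.
        return provenance_index[output_version]

    record("out-7f3a", {"A": "a-91c2", "B": "b-04ee", "P": "p-d8b1"})
    print(explain("out-7f3a"))  # {'A': 'a-91c2', 'B': 'b-04ee', 'P': 'p-d8b1'}
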
Tracking this metadata also means that P won't run if A and B are generated by other pipelines that haven't finished yet (a problem in a lot of Hadoop clusters that run pipelines using essentially cron), that you don't garbage collect data you're not done with, and that if you're copying data out of PFS into a database you serve from in production, you can easily copy yesterday's data back in if today's is bad, and so on.
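Here's a rough sketch of the scheduling half of that, i.e. only running P against finished input commits instead of firing on a cron tick. The Commit type and the skip/run logic are made up for illustration, not Pachyderm's actual model:

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Commit:
        id: str
        finished: bool  # has the upstream pipeline closed this commit?

    def latest_finished(commits: List[Commit]) -> Optional[Commit]:
        for c in reversed(commits):
            if c.finished:
                return c
        return None  # everything is still being written

    def maybe_run_p(a_commits: List[Commit], b_commits: List[Commit]) -> None:
        a, b = latest_finished(a_commits), latest_finished(b_commits)
        if a is None or b is None:
            # Unlike a cron job, we refuse to read a half-written data set.
            print("waiting: an input has no finished commit yet")
            return
        print(f"run P on A@{a.id} + B@{b.id}")  # and record provenance as above

    maybe_run_p([Commit("a1", False)], [Commit("b1", True)])  # waits on A
    maybe_run_p([Commit("a1", True)], [Commit("b1", True)])   # runs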