We were delighted to welcome Professor Paulo Missier, who hosted the last seminar of the 2023 series. The past few years have seen the emergence of what the AI community calls “Data-centric AI”, namely the recognition that some of the limiting factors in AI performance lie in the data used to train the models as much as in the expressiveness and complexity of the models themselves. One analogy is that of a powerful engine that will only run as fast as the quality of its fuel allows. A growing body of recent literature explores the connection between data and models in depth, and startups now offer “data engineering for AI” services. Some concepts are well known to the data engineering community, including incremental data cleaning, multi-source integration, and data bias control; others are more specific to AI applications, for instance the realisation that some samples in the training space are “easier to learn from” than others.
In this “position talk”, Paulo suggested that, from an infrastructure perspective, there is an opportunity to efficiently support patterns of complex pipelines where data and model improvements are entangled in a series of iterations. He focused in particular on end-to-end tracking of data and model versions, as a way to support MLDev and MLOps engineers as they navigate through a complex decision space.
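As a rough illustration of what end-to-end tracking of data and model versions might look like, here is a minimal Python sketch. It is not taken from the talk: the names `fingerprint` and `LineageLog`, the toy dataset, and the parameter dictionaries are all hypothetical, and a content hash stands in for whatever versioning scheme a real infrastructure would use.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(obj) -> str:
    """Short content hash of a JSON-serialisable object (hypothetical versioning scheme)."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass
class LineageLog:
    """Append-only log pairing each model version with the data version it was trained on."""
    entries: list = field(default_factory=list)

    def record(self, step: str, data, model_params) -> dict:
        # Store one entry per iteration, whether the data, the model, or both changed.
        entry = {
            "iteration": len(self.entries),
            "step": step,
            "data_version": fingerprint(data),
            "model_version": fingerprint(model_params),
        }
        self.entries.append(entry)
        return entry

    def models_trained_on(self, data_version: str) -> list:
        # Answer the question: which model versions depend on this data version?
        return [e["model_version"] for e in self.entries
                if e["data_version"] == data_version]


log = LineageLog()

# Iteration 0: baseline model on the raw data (toy values, assumed for illustration).
raw = [{"x": 1.0, "y": 0}, {"x": None, "y": 1}, {"x": 3.0, "y": 1}]
first = log.record("baseline", raw, {"weights": [0.1], "lr": 0.01})

# Iteration 1: a data improvement (drop rows with a missing feature), then retrain.
cleaned = [row for row in raw if row["x"] is not None]
second = log.record("drop rows with missing x", cleaned, {"weights": [0.4], "lr": 0.01})

# Iteration 2: a model improvement on the same cleaned data.
third = log.record("lower learning rate", cleaned, {"weights": [0.5], "lr": 0.001})

for e in log.entries:
    print(e["iteration"], e["step"], e["data_version"], e["model_version"])
```

With a log of this kind, an engineer could ask which models were trained on a given data version, here via `log.models_trained_on(second["data_version"])`, and so trace whether a change in performance followed a data change or a model change.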