Git is one of the fundamental technologies that nearly every software stack depends on. The ability to version code and control the flow of development is a concern shared by virtually every software project. We take for granted that everyone working in the industry can version code properly.
In today’s era of Big Data and ML systems, practitioners have to handle not only high-quality code, but also data… a lot of data! As systems evolve, data accumulates, and ML models start to drift in production, a solid data strategy becomes essential. We should treat data the same way we treat code: carefully, with a versioning system, reviews, and PRs, and with pipelines that can reproduce the state of complex data-driven systems.
In this talk, we’ll explore what data versioning means and how we can borrow methodologies from software engineering to better manage our data.