Machine Learning Governance is an investment in the present and the future

Giancarlo Cobino

Governing the Machine Learning lifecycle

Machine Learning models are now widely used, at many levels, in many organizations. They are implemented to recommend products, recognize images, detect fraud, and do many other cool things.

Up until now, the approach has been very naïve, pretending that Data Science and Machine Learning are different from any other software development process and that Data Scientists operate in some sort of gray area where anything goes.

We have finally reached a point where this is no longer true. Machine Learning is becoming a mature, widely used discipline, and it therefore requires precise structures in place to deliver reliable outputs.

Feed Feeding Fed

One of the key aspects of enabling Machine Learning is the constant availability of data to feed the models and continuously train them, in order to obtain the most effective performance. However, to be truly effective it is important to create an organized process, relying on Data Governance principles, that validates and cleans the data before serving it to the next layer.
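To make this concrete, here is a minimal sketch of what such a cleaning step could look like, assuming pandas; the column names and rules are hypothetical, and in practice they would come from your Data Governance policies:

import pandas as pd

def clean_for_training(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply basic governance-driven cleaning rules before the data
    reaches the Machine Learning layer. Rules here are illustrative."""
    df = raw.drop_duplicates().copy()
    # Enforce expected types (hypothetical columns).
    df["customer_id"] = df["customer_id"].astype(str)
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    # Drop rows where mandatory fields are missing.
    df = df.dropna(subset=["customer_id", "amount"])
    # Keep only plausible values, as a governance policy might dictate.
    df = df[df["amount"] >= 0]
    return df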

There are many controls worth putting in place, either by adopting an enterprise platform or (although I strongly suggest not going this way) by building a custom solution. There are several vendors in this space, such as Informatica, Collibra and Tibco: choose the one that fits your budget.

Controls in Data Governance

Nonetheless, it is important to set a clear strategy for managing the data at the proper level, so as not to be exposed to potential hiccups caused by inconsistent checks or error-prone manual tasks that let erroneous, missing or misleading data flow through. From the collection of data from sources (batch, streaming, real-time, near real-time, it does not really matter), through ingestion and processing, to storage and visualization, every step needs to be followed and properly addressed, in order to be sure that the data used in Machine Learning is secure and certified.
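As an illustration, the sketch below shows the kind of automated checks that can replace inconsistent manual ones and gate each stage of the pipeline; the threshold and the check set are assumptions, not a prescription:

from typing import List
import pandas as pd

def run_quality_checks(df: pd.DataFrame, required_columns: List[str],
                       max_null_ratio: float = 0.05) -> List[str]:
    """Return a list of failed checks; an empty list means the
    dataset can be certified for the next stage."""
    failures = []
    # Schema check: all mandatory columns must be present.
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        failures.append(f"missing columns: {missing}")
    # Completeness check: null ratio per column within tolerance.
    for col in df.columns:
        ratio = df[col].isna().mean()
        if ratio > max_null_ratio:
            failures.append(f"{col}: null ratio {ratio:.2%} above threshold")
    # Freshness, uniqueness and range checks would follow the same pattern.
    return failures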

At the end of the day, everything comes down to relying on the information provided and certifying that the outputs of Machine Learning models are themselves reliable. That is the only way to make the business believe in Machine Learning magic, isn't it?

From start to end

Continuous Delivery of cleaned Data

The question now is how we can continuously serve our Machine Learning layer with clean and certified data.

Especially in large organizations, the dispersion of data across several sources in different shapes can be a serious problem. Add the fact that these heterogeneous sources must be normalized and cleaned, and you can see how difficult the process becomes. The situation has also been made worse by years of project-driven culture, where data has been stored in silos, used for a single project and then forgotten.

We now need a paradigm shift, where the data used in Machine Learning is as rich and as broad as possible. Collecting data for one single purpose is no longer an option, and for this reason it becomes necessary to plan a strategy that defines what data will be available where, accommodating the largest possible pool of users: Data Analysts, Data Scientists, Data Engineers, business users, or whoever might be interested in analyzing the collected data.
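One common way to support such a strategy is a data catalog. Purely as a sketch of what could be tracked (names, fields and paths are hypothetical), a minimal catalog record might look like this:

from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    """Illustrative metadata a data catalog could track so that
    analysts, scientists and engineers know what lives where."""
    name: str
    location: str            # e.g. a table or object-store path
    owner: str               # accountable team or person
    description: str
    intended_audiences: List[str] = field(default_factory=list)
    pii: bool = False        # flags data needing extra governance

entry = CatalogEntry(
    name="transactions_curated",
    location="s3://datalake/curated/transactions/",  # hypothetical path
    owner="data-engineering",
    description="Cleaned transactions, refreshed daily",
    intended_audiences=["data-science", "analytics", "business"],
)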

The challenge, of course, is to guarantee a constant flow of data in a form that is usable by both Data Scientists and Data Engineers, who are in turn in charge of generating cleaned, transformed and manipulated data for business users. That data has to be checked, verified and governable, so that everyone relying on it has the peace of mind to work with the right confidence.
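A simple way to give consumers that confidence is to attach a certification record to every published dataset version. The following is a minimal sketch of what such a record could contain; the fields are illustrative:

from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class DatasetCertification:
    """Illustrative record attached to each published dataset version
    so downstream users can verify what they are relying on."""
    dataset: str
    version: str
    produced_at: datetime
    checks_passed: List[str]
    checks_failed: List[str]

    @property
    def certified(self) -> bool:
        # A version is certified only if every quality check passed.
        return not self.checks_failed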

Continuous delivery of transformed data

One of the key aspects, once the data is cleaned, is its transformation and manipulation, which is essential to generate a baseline useful for Machine Learning, analytics, dashboards and reporting. This means creating a process that steadily ingests data, transforms it and serves it to the next layer. It is crucial that all these steps are triggered automatically, and only after the successful execution of the previous step. For example, the modelling step can be triggered only when the data preparation tasks have completed successfully, and so forth, until models are deployed and ready for monitoring, as shown in the Figure below.

Machine Learning lifecycle
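As a sketch of such an orchestration, the snippet below uses Apache Airflow (2.x import paths), purely as one example of a scheduler with dependency management; the step callables are hypothetical placeholders, and each is expected to raise on failure so downstream tasks are not triggered:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical step implementations for this sketch.
from ml_pipeline import (ingest_data, prepare_data, train_model,
                         deploy_model, monitor_model)

with DAG(dag_id="ml_lifecycle",
         start_date=datetime(2020, 3, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_data)
    prepare = PythonOperator(task_id="prepare", python_callable=prepare_data)
    train = PythonOperator(task_id="train", python_callable=train_model)
    deploy = PythonOperator(task_id="deploy", python_callable=deploy_model)
    monitor = PythonOperator(task_id="monitor", python_callable=monitor_model)

    # Each step runs only after the previous one succeeds.
    ingest >> prepare >> train >> deploy >> monitor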

Organizations need a paradigm shift in the way Data Scientists and ML Engineers usually work, which is typically a manual process of building and deploying ML models. This might work in research, but it is far from optimal in a production environment. When data changes and external conditions mutate rapidly, models have to be retrained constantly, and a manual process is impossible to sustain. The entire Machine Learning team has to be restructured and trained to embrace a DevOps strategy applied to Machine Learning.
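One concrete ingredient of that strategy is triggering retraining automatically instead of by hand. The sketch below uses a deliberately crude drift signal (a mean shift measured in training standard deviations); a real system would use a proper statistical test:

import numpy as np

def needs_retraining(train_feature: np.ndarray,
                     live_feature: np.ndarray,
                     threshold: float = 0.5) -> bool:
    """Crude drift check: flag retraining when the live mean drifts
    more than `threshold` training standard deviations away."""
    shift = abs(live_feature.mean() - train_feature.mean())
    scale = train_feature.std() or 1.0  # avoid division by zero
    return shift / scale > threshold

# Hypothetical usage inside a monitoring job:
# if needs_retraining(reference["amount"], latest_batch["amount"]):
#     trigger_training_pipeline()  # e.g. kick off the DAG above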

This is not something you can wait on any longer. Either you embrace it, or it will kill your dream Machine Learning projects.

Thanks to Virginia Becchetti, Martina Trojani, Ivo Guerra, Paolo Bassini, Anna Impedovo and Desantila Sulcaj.

Stay tuned… we have more for you!

