Five Trends In Machine Learning Ops: Takeaways From The First Operational ML Conference

I recently co-chaired the first conference on Machine Learning Ops, USENIX OpML 2019. It was an energetic gathering of experts, practitioners, and researchers who came together for one day in Santa Clara, CA to talk about the problems, practices, new tools, and cutting-edge research on production machine learning in industries including finance, insurance, healthcare, security, manufacturing, and web-scale services.

While there were many great presentations, papers, panels, and posters (too many to cover individually; check out all the details here), several emergent trends and themes stood out. I expect each of these to expand and become even more prominent over the next several years as more organizations push ML into production and use machine learning ops practices to scale it.

Agile Methodologies meet Machine Learning

Many practitioners emphasized that iteration and continuous improvement are central to successful production ML. Much like software, machine learning improves through iteration and regular production releases. Those who run ML at scale make a point of recommending that projects start with either no ML or simple ML to establish a baseline (see the sketch below). As one practitioner put it, you don’t want to spend a year investing in a complex deep learning solution, only to find out after deployment that a simpler non-ML method can outperform it!
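
As a concrete illustration of that advice, here is a minimal sketch (my own, not code from the talks; the dataset and models are placeholders) using scikit-learn’s DummyClassifier as the non-ML baseline that a candidate model has to beat:

```python
# Minimal sketch: establish a non-ML baseline before investing in a complex model.
# The dataset and models here are illustrative placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Non-ML baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Candidate ML model.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

print(f"baseline accuracy: {baseline.score(X_test, y_test):.3f}")
print(f"model accuracy:    {model.score(X_test, y_test):.3f}")
# Ship the model only if it clearly beats the baseline on the metrics that matter.
```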

Bringing agility to ML also requires infrastructure optimized to support agile rollouts. In practice, successful production ML infrastructure includes automated deployment, modularity, and microservices, and avoids excessive fine-grained optimization early on.

ML bugs differ from software bugs and need ML-specific production diagnostics

Various presentations provided memorable examples of how ML errors not only bypass conventional production checks but can actually look like better production performance. For example, an ML model that fails and silently generates a default output can show up as a performance boost!

Detecting ML bugs in production requires specialized techniques like model performance predictors, comparisons with non-ML baselines, visual debugging tools, and metric-driven design of the operational ML infrastructure. Facebook, Uber, and other organizations experienced with large-scale production machine learning emphasized the importance of ML-specific production metrics, ranging from health checks to ML-specific resource utilization metrics (such as GPU utilization).
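
To make this concrete, here is a minimal sketch of one such ML-specific production check; it is my illustration rather than any presenter’s code, and the threshold and data sources are assumptions. It compares the distribution of live model scores against a reference window with a two-sample Kolmogorov-Smirnov test, so a model silently falling back to a default output trips an alert even when conventional health checks stay green:

```python
# Minimal sketch of an ML-specific production check: alert when the live
# score distribution drifts from a reference window. The alpha threshold
# and data sources below are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

def scores_have_drifted(reference_scores, live_scores, alpha=0.01):
    """Two-sample KS test between reference and live model score samples."""
    statistic, p_value = ks_2samp(reference_scores, live_scores)
    return p_value < alpha, statistic

# A failed model that falls back to a constant default output looks healthy
# to conventional checks but fails this distribution check immediately.
rng = np.random.default_rng(0)
reference = rng.beta(2, 5, size=10_000)   # scores logged during validation (assumed)
live = np.full(10_000, 0.5)               # degenerate "default output" scores

drifted, stat = scores_have_drifted(reference, live)
print(f"drift detected: {drifted} (KS statistic = {stat:.3f})")
```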

Rich Open Source ecosystem for all aspects of Machine Learning Ops

The rich open source ecosystem for model development (with TensorFlow, scikit-learn, Spark, PyTorch, R, etc.) is well known. OpML showcased how the open source ecosystem for Machine Learning Ops is growing rapidly, with powerful publicly available tooling used by large and small companies alike. Examples include Apache Atlas for governance and compliance, Kubeflow for machine learning ops on Kubernetes, MLflow for lifecycle management, and TensorFlow tracing for monitoring. Classic enterprise vendors are starting to integrate these open source packages to provide full solutions for their customers; one example is Cisco’s support of Kubeflow. Furthermore, web-scale companies are open-sourcing the core infrastructure that drives their production ML, such as the ML orchestration tool TonY from LinkedIn.
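
As a small taste of that tooling, here is a hedged sketch of MLflow’s tracking API (the run name, parameters, and metric value are placeholders, not a production configuration), which records each training run so it can be compared and managed across the model lifecycle:

```python
# Minimal sketch of MLflow experiment tracking; the parameter names and
# metric value below are illustrative placeholders.
import mlflow

with mlflow.start_run(run_name="demo-run"):
    # Record the hyperparameters of this training run.
    mlflow.log_param("model_type", "random_forest")
    mlflow.log_param("n_estimators", 100)

    # ... train and evaluate the model here ...

    # Record evaluation metrics so runs can be compared over time.
    mlflow.log_metric("validation_auc", 0.91)
```

Runs logged this way appear in the MLflow tracking UI, which is one way the lifecycle-management pieces of the stack tie together.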

As these tools become more prominent, practitioners are also documenting end-to-end use cases, creating design patterns that can be used as best practices by others. 

Cloud-based Services and SaaS make production ML easier 

For a team deploying ML in production for the first time, the process can be daunting, even with open source tools available for each stage. The cloud offers an alternative: since resource management (machine provisioning, auto-scaling, elasticity, etc.) is handled by the cloud backend, cloud deployments can be simpler. When accelerators (GPUs, TPUs, etc.) are involved, production resource management is especially challenging, and cloud services offer a way to get started by leveraging the investments cloud providers have made in optimizing accelerator usage.

Cloud deployment can also create a ramp-up path for an IT organization to try ML deployment without a large in-house infrastructure rollout. Even on-premise enterprise deployments are moving to a self-service production ML model similar to that of a cloud service, enabling an IT organization to serve the production ML needs of multiple teams and business units.

Expertise Leverage: From Web-Scale ML Operations to the Enterprise

At-scale experts like LinkedIn, Facebook, Google, Airbnb, Uber, and others, who were the first ML adopters, had to build from scratch all of the infrastructure and practices needed to extract monetary value from ML. These experts are now sharing not only their code but also their practical experiences and hard-won lessons, all of which can be adopted for the benefit of enterprises. As the experts panel at OpML pointed out, the best practices these organizations follow for ML infrastructure (from team composition and reliability engineering to resource management) contain powerful insights that enterprises can apply as they seek to expand their production ML footprint. Experiences from at-scale ML deployments at Microsoft and others can show enterprises how to deliver performant machine learning in their business applications.

Other end-to-end experiences from at-scale companies showed how business metrics can be translated into ML solutions, and how those solutions can then be iteratively improved for business benefit. Finally, organizations facing the unique challenges that edge deployment places on Machine Learning Ops can learn from the at-scale deployments already in place.

Summary

A great op-ed by Michael Jordan on Medium, “Artificial Intelligence: The Revolution Hasn’t Happened Yet,” highlighted the need for an AI engineering practice. OpML 2019, the first Machine Learning Ops conference, illustrated how the ML and AI industry is maturing in this direction, with more and more organizations either struggling with the operational and lifecycle-management aspects of production machine learning or pushing to scale ML operations and develop operational best practices. This is great news for the AI industry, since it is a further step toward generating real ROI from AI investments. Trends like those above should help realize the long-awaited potential of AI-generated business value.