Whether you’re building recommender systems, risk and fraud detection systems, or anything in between, MLOps is critical to successfully delivering production ML applications at scale.
At our most recent virtual event, apply(ops), speakers from Uber, HelloFresh, Riot Games, Remitly, and more shared their best practices and lessons learned around the platforms and architectures that most effectively drive production ML projects.
In this post, we’ll share the four major takeaways that any team looking to deploy ML apps into production should keep in mind as they build, scale, or modify their approach.
1. Effective data handling is foundational to successful MLOps
Several speakers at apply(ops) touched on the fact that, while ML is often perceived as primarily model-driven, in reality it’s a complex amalgamation of data handling, MLOps, and model innovation.
According to Aayush Mudgal, Senior ML Engineer at Pinterest, “Machine learning is not really about just training models … machine learning is about collecting data, verifying it, extracting the features, analyzing it, managing training jobs, [and] serving the infrastructure, to name a few.” His talk, “Evolution of the Ads Ranking System at Pinterest,” focused on how standardized storage of features was a foundational step that enabled many downstream tools. These tools, which let the team standardize and manipulate data, gave Pinterest’s MLOps infrastructure the speed it needed to iterate on production models and data and stay competitive in its ads product.
Good MLOps enables effective data management, which lets teams move faster, according to Federico Bassetto, MLOps Engineer at Prima, a leading auto insurance agency in Europe. In his Lightning Talk on building real-time pricing applications, he pointed out the importance of dataset curation for iterating quickly on Prima’s real-time pricing pipeline. As Prima grew rapidly, and without the right tools, dataset creation became an expensive, months-long process. Adopting a feature platform cut dataset creation from months to days, allowing data scientists to iterate much more quickly on model development and improving both the quality of Prima’s quotes and the speed at which it can generate them.
The relationship between effective data handling and good business outcomes was also front and center for Benjamin Bertincourt, Senior Manager of ML Engineering at HelloFresh. In the talk, “Building an MLOps Strategy at the World’s Largest Food Solutions Company,” Benjamin explained that, “As a business interacting with customers, the only way we learn from that experience is if we collect data and collect the right data.” Investing in data quality continues to pay off as HelloFresh is able to create information-rich features from customer signals and use that personalization to drive more engagement and better outcomes for customers.
2. Unified MLOps infrastructure is on the rise
We also heard from organizations that started down the MLOps path at different times and with different goals. Yet one thing seemed true for every team: as they’ve matured, they’ve moved from fragmented tools towards a single, unified infrastructure.
Michael Johnson, Director of AI & ML at HelloFresh, recalled the process of moving from “MVP data science” to production platform components. Motivated by the need to free up engineers from being bogged down with maintenance, the team used the mantra of “everything that is standard should be automated … whatever cannot be automated easily should be easy to do.” By both adopting some external tools (like Tecton) and creating standard internal automation, they were able to lower the barrier of entry to productionizing ML.
Pinterest’s experience was similar. According to Aayush, “Each team had independently evolved to build specific solutions … This led to a lot of custom infrastructure for each use case with a large maintenance cost and incomplete tuning.”
Working on ML is iterative, collaborative, and complex, which makes a unified system all the more powerful. In his talk, “Journey of the AI Platform at Uber,” Distinguished Engineer Min Cai explained: “A centralized end-to-end platform like [Uber’s] Michelangelo is very, very useful … The ML lifecycle involves lots of personnel.” In this case, the value of the unified system is manifold: it enables easier collaboration across teams and removes unnecessary complexity by introducing standards, like feature definitions, which are shared across the entire development process, from data science exploration to production.
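To make the idea of a shared feature definition concrete, here is a minimal, hypothetical sketch in plain Python. The names and structure are invented for illustration and are not the actual API of Michelangelo, Tecton, or any other platform; the point is that one definition serves both training and serving, eliminating train/serve skew.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FeatureDefinition:
    """Hypothetical shared feature definition (illustrative only)."""
    name: str            # canonical feature name shared by all teams
    entity: str          # the entity the feature describes (e.g. "user")
    transform: Callable  # logic applied identically offline and online

def avg_order_value(orders: list[float]) -> float:
    """Same transformation reused for training data and live serving."""
    return sum(orders) / len(orders) if orders else 0.0

user_avg_order = FeatureDefinition(
    name="user_avg_order_value",
    entity="user",
    transform=avg_order_value,
)

# The offline (training) and online (serving) paths call the same
# definition, so both compute exactly the same value.
training_value = user_avg_order.transform([20.0, 30.0, 40.0])
serving_value = user_avg_order.transform([20.0, 30.0, 40.0])
```

Because every team references the same named definition, the feature means one thing everywhere, from a data scientist’s notebook to the production serving path.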
As evidenced by Uber, HelloFresh, and Pinterest, the demands on ML teams are scaling up—and so are the size and multidisciplinary nature of those teams. This has pushed industry veterans towards production-tested platforms that abstract away the critical infrastructural pieces of running an effective MLOps platform. The shift has been driven by a few factors: ML’s growing centrality to companies’ core value propositions, its increasingly collaborative nature, and the rapid iteration required to stay competitive in ML-enabled software and experiences.
3. “Build vs. buy” is one of the most important MLOps decisions
The decision to build MLOps tools in-house versus buying vendor-based solutions is often a critical strategic choice for many organizations, influenced by factors like the need for customization, control, and scalability, as well as the economics of core competencies.
Benjamin and Michael from HelloFresh faced this dilemma in the “rich, very [quickly] growing” ecosystem of MLOps tools. Their decision-making process balanced the need for skill specialization and quality with the practicality of tool selection. For Michael, the decision to lean toward open-source solutions for transparency and flexibility ended up making the most sense, but he clarified that depending on the decision criteria for your organization, the right call might be something else. (Read more about HelloFresh’s decision-making process.)
In the fireside chat, “LLMs, Real-Time, and Other Trends in the Production ML Space,” Tecton CEO Mike Del Balso reflected on his time at Uber, describing the shift in strategy from “fully build” to “buy”: “I don’t think we ever purchased any vendor tool ever when I was on the AI team at Uber. And we’re seeing that change a decent amount.” There’s a trend of enterprises doubling down on their core competencies and otherwise sourcing best-in-class tools, all to deliver the best possible end-user experience.
Trevor Allen, Software Developer on the Remitly ML platform team, added a practical perspective in his Lightning Talk. In the past, he explained, feature infrastructure at Remitly was largely one-off, and the team was quickly outgrowing the ability to maintain it. They needed a way to onboard new features onto models even faster and adopted Tecton as a feature platform, which reduced their maintenance overhead and increased their delivery speed.
In all these cases, as the organizations grew, the scale and complexity of MLOps grew with them. That growth necessitated adopting some external solutions, and oftentimes integrating them with software components that needed to stay in-house, all to ensure scalability and efficiency.
4. Context is king, especially when working with LLMs
Personalization is becoming a cornerstone in the world of ML, and with the arrival of paradigm-shifting Large Language Models (LLMs), the value of personalization only increases.
In his talk, “Personalized Retrieval for RAGs with a Feature Platform,” Tecton CEO Mike Del Balso described applying this paradigm to personalized travel recommendations with an LLM. A model with no context might give any user the same generic recommendation: “You should visit Paris.” But with the right context, LLMs can provide incredibly tailored, valuable experiences. Given the appropriate signals, an LLM can produce a complete, ready-to-go Paris itinerary: it might know there’s a last-minute opening at a nearby chef’s workshop that aligns with your personal interests because you’ve blogged about similar food, and it might even suggest walking there because the weather is nice.
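The mechanics behind an experience like this can be sketched simply: look up the user’s fresh contextual features, then inject them into the prompt before calling the LLM. Everything below is hypothetical and illustrative; `fetch_user_features` stands in for a real online feature lookup, which in practice would hit a feature platform’s low-latency store.

```python
def fetch_user_features(user_id: str) -> dict:
    # Stand-in for a real online feature lookup (hypothetical data).
    return {
        "interests": ["food", "cooking"],
        "city": "Paris",
        "weather_now": "sunny",
        "nearby_event": "chef's workshop, one opening at 6pm",
    }

def build_prompt(user_id: str, question: str) -> str:
    """Assemble an LLM prompt enriched with real-time user context."""
    feats = fetch_user_features(user_id)
    context = "\n".join(f"- {k}: {v}" for k, v in feats.items())
    return (
        "You are a travel assistant. Use this real-time user context:\n"
        f"{context}\n\n"
        f"User question: {question}"
    )

prompt = build_prompt("user_123", "What should I do this evening?")
```

The LLM itself is unchanged; all of the personalization comes from the freshness and richness of the context assembled at request time.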
Beyond Mike’s example, LLMs and contextual data will have numerous profound impacts on user experience, according to Databricks CEO Ali Ghodsi. In the “LLMs, Real-Time, and Other Trends in the Production ML Space” fireside chat, he described a future where the necessity of writing code or using complex interfaces is replaced by more intuitive, natural language interactions, reflecting a broader trend where all software, not just AI-focused tools, will need to be infused with AI or entirely redeveloped to integrate intelligent capabilities. And doing so will, critically, involve pulling in the right contextual information, both about the person using the software and about the environment the software exists in and can interact with.
Alongside these paradigm-shifting capabilities comes the data infrastructure to support them. And data infrastructure gets more difficult the more context you try to bring in, regardless of the use case. “In higher degrees, this personalization unlocks a lot of value. But as we saw [in the travel example], there’s a lot of data engineering behind the scenes, and that makes this harder to build,” Mike explained.
Organizations can take different approaches to this data complexity: use systems that support only the level of context needed today (at the cost of flexibility), omit the context altogether, or adopt systems built from the ground up to accommodate both the context and the flexibility needed to scale modern ML workflows. Feature platforms, like Tecton, are purpose-built for the latter: they help automate complex data engineering tasks and integrate with existing infrastructure to manage real-time data processing, allowing for rapid, context-aware responses with less operational overhead.
To close out the conference, Matt Bleifer, Group Product Manager at Tecton, ran the “From Idea to Inference with Python Only: A Live Tecton Demo” workshop, where he showed attendees how they can build a real-time AI application to detect fraud entirely with Python, using Tecton’s new Rift compute engine. The talk covered everything from defining and testing features entirely locally to productionizing them for real-time inference using MLOps best practices.
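To give a flavor of the kind of real-time feature logic such a demo computes, here is a minimal plain-Python sketch of a classic fraud signal: a user’s transaction count over the last hour. This is not the Tecton or Rift API; a feature platform would express this declaratively and handle the streaming computation, backfills, and serving for you.

```python
from datetime import datetime, timedelta

def txn_count_last_hour(events: list[tuple[str, datetime]],
                        user_id: str, now: datetime) -> int:
    """Count a user's transactions within the trailing one-hour window."""
    cutoff = now - timedelta(hours=1)
    return sum(1 for uid, ts in events if uid == user_id and ts > cutoff)

now = datetime(2024, 1, 1, 12, 0)
events = [
    ("u1", now - timedelta(minutes=5)),
    ("u1", now - timedelta(minutes=50)),
    ("u1", now - timedelta(hours=2)),    # outside the window, excluded
    ("u2", now - timedelta(minutes=10)), # different user, excluded
]
count = txn_count_last_hour(events, "u1", now)
```

A sudden spike in this count relative to a user’s baseline is exactly the kind of fresh, low-latency signal a fraud model needs at inference time.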
MLOps is a highly complex, rapidly changing field with many exciting opportunities and significant challenges. Your team can prepare to face them by learning from experts and using the right mix of tools to enable seamless scaling and rapid iteration. To get insights and practical knowledge from our speakers, be sure to watch all the talks from apply(ops).