Real-world machine learning is always messier than we expect at the start of a project—many challenges aren’t apparent up front, no matter how much planning you do beforehand. Tecton’s apply() conference is for machine learning and data teams to discuss the practical data engineering challenges faced when building real-time machine learning systems. If you missed the last one, we have another on the calendar: The next apply() event is happening December 6 and the theme is all about recommender systems!
Not sure if apply() is the right conference for you? In this post, we’ll summarize six themes from the May 2022 apply() event so you can get a peek into the content, learnings, and best practices. Want to watch these talks instead? You can access the apply() video archive anytime.
1. Productionization is (still) the hardest part of machine learning—and how the machine learning flywheel can help
Several speakers talked about productionization as the most difficult problem to solve when scaling large machine learning platforms. In his talk “Why is Machine Learning Hard?” Tal Shaked, a Machine Learning Architect for Snowflake, highlighted the need for faster iteration cycles and better ways to manage data for production models.
He gave an example from his time at Google Ads, where optimizing model quality took three months to do offline (which was the easy part). Data quality is much messier outside a controlled environment, where interactions with other models, the ad auction, and much more are at play. It took six more months of tuning and debugging to turn a 10% improvement in offline metrics into a 1.5% gain in live metrics.
Then, there was still another year’s worth of work doing productionization. Implementing and maintaining systems like serving infrastructure, monitoring, and machine resource management takes up time that would ideally be spent improving models. But, as Tal put it, “good models in production are better than great models in a laptop.”
What can we do to get faster at putting good (and great) models into production? Mike Del Balso, Co-founder and CEO of Tecton, gave a talk about the ML data flywheel, a framework that successful teams use to tackle these problems. This flywheel consists of four stages: data collection, data organization, training/model iteration, and prediction. It’s an ongoing cycle where feedback from predictions is fed back into the flywheel to improve models on a continuous basis.
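The four-stage cycle can be rendered as a simple feedback loop. This is purely illustrative—every stage function below is a placeholder supplied by the caller, not any real Tecton API or pipeline step:

```python
def run_flywheel(collect, organize, train, predict, feedback, iterations=3):
    """Toy rendering of the four-stage ML data flywheel. Each stage is a
    caller-supplied placeholder function, not a real pipeline component."""
    model = None
    for _ in range(iterations):
        data = collect(feedback)       # 1. data collection
        dataset = organize(data)       # 2. data organization
        model = train(dataset, model)  # 3. training / model iteration
        feedback = predict(model)      # 4. predictions feed back into the loop
    return model, feedback
```

The point of the structure is the return edge: each iteration's predictions become input to the next round of data collection, which is what makes the application self-improving.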
The hard part about starting a flywheel at most organizations is that different teams use different ML tools. “The challenge for the ML flywheel is having [a team’s tools] all work together coherently to support a self-improving application,” Mike said. He recommended establishing a unified data model that maintains consistency across all your infrastructure. You can read more about strategies to kickstart your machine learning data flywheel in this blog post.
2. Data architectures for machine learning applications
In his talk, “Lakehouse: A New Class of Platforms for Data and AI Workloads,” Matei Zaharia, Chief Technologist and Co-founder of Databricks, charted the evolution of data management systems. As far back as the 1980s, companies used data warehouses to unify data across operational data stores and optimize that data for analytical queries. The 2010s saw the rise of data lakes, which reduced storage costs and expanded compatibility to more data formats. But they didn’t eliminate the need for structured data management in a data warehouse, so companies required a two-tier architecture to handle both use cases.
Data lakehouses simplify this architecture into a single system that contains both a data lake and a layer for data management and performance optimizations. Downstream applications like business intelligence tools can interact with a lakehouse using SQL while machine learning tools have direct access to raw data. According to Matei, “Today’s world with many separate data systems doesn’t have to be that way. I think, over time, we’ll converge on ways to do very high-quality SQL workloads, data engineering, and ML on the same system that is based on low-cost cloud storage.”
3. Machine learning in real time
Chip Huyen, Co-founder and CEO of Claypot AI, spoke about the industry shift toward online predictions and away from batch predictions. Because batch predictions are computed periodically for every user (before any request arrives) while online predictions are computed only when requested, online predictions adapt to fresher data and avoid wasting computation on predictions that are never used. But they also need to be fast to avoid slowing down the product or the end-user experience.
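The contrast can be sketched in a few lines—this is a toy illustration where `model` is just a placeholder scoring function, not a real serving system:

```python
def serve_batch(model, all_user_ids):
    # Batch: precompute predictions for every user on a schedule,
    # including users who may never show up to consume them.
    return {user_id: model(user_id) for user_id in all_user_ids}

def serve_online(model, request_user_id):
    # Online: compute a single prediction only when a request arrives,
    # so it can use the freshest features available at that moment.
    return model(request_user_id)
```

The batch path trades staleness and wasted work for simplicity; the online path demands low-latency infrastructure in exchange for freshness.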
Feature stores and feature platforms make online predictions possible because they’re optimized to handle data streams as inputs (in addition to batch data). They also offer machine learning monitoring functionality by detecting data distribution shifts that could affect model quality. Chip’s view is that feature stores are still an active area of innovation and that they will be central in the shift to online learning.
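As a rough illustration of the idea—a toy in-memory sketch, not Tecton’s or any feature store’s actual API:

```python
class ToyFeatureStore:
    """Toy sketch of a feature store that merges batch-computed features
    with fresher values arriving on a stream. Not a real API."""

    def __init__(self, batch_features):
        # Features materialized by a periodic batch job (e.g. nightly).
        self.features = {k: dict(v) for k, v in batch_features.items()}

    def apply_stream_event(self, entity_id, updates):
        # Fresh stream-derived values overwrite stale batch values.
        self.features.setdefault(entity_id, {}).update(updates)

    def get_online_features(self, entity_id):
        # Low-latency lookup at prediction time.
        return self.features.get(entity_id, {})
```

A model serving online predictions would call `get_online_features` on each request, seeing batch features and stream updates through one interface.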
At the May 2022 apply(), we got to see exciting case studies of successful real-time machine learning applications. Meenal Chhabra and Austin Mackillop from the engineering team at CashApp gave an overview of the CashApp machine learning ranking system underpinning many personalization experiences. Real-time ML ranking is used across the app, for features like presenting the most relevant user rewards or helping users search for their intended recipient efficiently.
As part of the team’s philosophy of producing “good-enough results quickly as opposed to the best results slowly,” CashApp’s engineers took several steps to optimize their ML system for low latency. Their strategies include concurrency tuning, keeping hot data in memory instead of on disk, network optimizations, and a technique called “request hedging,” which reduces tail latency by sending a duplicate request when the original is taking too long.
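Request hedging can be sketched in a few lines. This is a simplified illustration using Python’s asyncio, not CashApp’s implementation, and the 50 ms hedge threshold is an arbitrary example value:

```python
import asyncio

async def hedged_request(make_request, hedge_after=0.05):
    """Fire a backup request if the primary hasn't answered within
    `hedge_after` seconds, and return whichever finishes first."""
    primary = asyncio.create_task(make_request())
    try:
        # shield() keeps the primary running even if the timeout fires.
        return await asyncio.wait_for(asyncio.shield(primary), timeout=hedge_after)
    except asyncio.TimeoutError:
        backup = asyncio.create_task(make_request())
        done, pending = await asyncio.wait(
            {primary, backup}, return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            task.cancel()  # discard the loser
        return done.pop().result()
```

The trade-off is extra load: every hedged call can cost a second request, so the threshold is usually set near the tail (e.g. the p95/p99 latency) rather than the median.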
4. Rapid model deployment in healthcare
ERAdvisor is a product that uses machine learning to automatically provide wait time estimates and status updates to emergency room visitors. Felix Brann, Head of Data Science at the parent company Vital, shared how the company approached scaling ERAdvisor across hospitals and addressed the cold-start problem in machine learning. What makes this problem particularly challenging to solve is that emergency department wait times vary widely and hospitals expect ERAdvisor to give accurate estimates as early as possible.
Felix and his team devised a facility-agnostic ML model to sidestep this cold-start problem by leveraging existing hospital data when onboarding a new customer. Rather than predicting an absolute wait time at a given facility, ERAdvisor’s model predicts a wait time percentile, which can then be projected onto the distribution of wait times at any particular facility. Vital uses Tecton to automate its real-time ML pipeline and unify the developer experience for engineering and data science.
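The percentile-projection idea can be sketched like this—the function, the sample data, and the numbers are illustrative assumptions, not Vital’s actual model:

```python
import numpy as np

def project_wait_time(predicted_percentile, facility_wait_times):
    """Map a model-predicted percentile (0-100) onto one facility's
    historical wait-time distribution (minutes)."""
    return np.percentile(facility_wait_times, predicted_percentile)

# A newly onboarded hospital with a small sample of historical waits, in minutes.
history = [12, 25, 30, 45, 60, 90, 120]
median_estimate = project_wait_time(50, history)
```

Because the model outputs a percentile rather than minutes, the same model works at any facility from day one; only the facility’s own wait-time distribution is needed to turn the percentile into a concrete estimate.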
5. Best practices for scaling a company’s data team
Several apply() speakers shared best practices for scaling data teams as their companies and products mature. Daniele Perito, Co-founder and Chief Data Officer at Faire, spoke about how his company empowers small businesses with machine learning. In 5 years, Faire grew to over 1,000 employees and became the largest online independent wholesale community in the world. Even with this massive growth, Faire still uses five out of their six original data tools today, and Daniele noted that “data infrastructure decisions can be very sticky.”
Daniele emphasized several big-picture learnings from Faire’s growth:
- Establish a decision-making rubric for architectural decisions early on
- Consider the cost of keeping data consistency when using external data platforms
- Have a project tracker around big data infrastructure changes
- Have seasoned technical leaders who understand when it’s time to invest in the future
6. Improving collaboration between engineering and data teams
One of our most popular panels at apply() featured data scientists discussing how to nurture collaboration between data science and engineering teams. A recurring theme in the panel was that the relationship between the two functions changes as a company grows. For startups, employees might be wearing multiple hats out of necessity, whereas in a larger company the separation of responsibilities may be clearer and it’s more efficient for people to specialize.
Additionally, to improve cross-functional collaboration, it’s important that different functions understand each other’s roles and how they fit into the broader product process. To help with this, panelist Mark Freeman, Founder of On the Mark Data, suggested that companies ask the question, “Where in the data lifecycle do people sit?” For example, software engineers tend to create the surfaces where data gets collected and later presented to users. Data scientists fit in between those steps by organizing that data and generating insights from it.
Join us for apply(recsys) in December!
We’ve summarized some of the learnings from the last apply() event in this post, but if you’d like to watch all the sessions in full, you can find them on the apply() event page.
Couldn’t make it to the last apply()? Then join us for apply(recsys) on December 6! Speakers from Slack, ByteDance, Feast, and more will share best practices and other learnings about machine learning recommender systems at this free, half-day virtual event. Registration is now open—sign up to get updates on the agenda, speakers, and topics!