Data and machine learning shape Faire’s marketplace – and as a company that serves small business owners, our primary goal is to increase sales for both brands and retailers using our platform. During this session, we’ll discuss the machine learning and data-related lessons and challenges we’ve encountered over the last 5 years on Faire’s journey to empowering entrepreneurs to chase their dreams.
Dagger, or Data Aggregator, is an easy-to-use, configuration-over-code, cloud-native framework built on top of Apache Flink for stateful processing of real-time streaming data. With Dagger, you don’t need to write custom applications or manage resources to process data in real time. Instead, you can write SQL queries to process and analyze streaming data.
At Gojek, the Data Platform team uses Dagger for feature engineering on real-time features. Computed features are then ingested into Feast for model training and serving. Dagger powers more than 200 real-time features at Gojek. This talk covers the end-to-end architecture and how Dagger and Feast work together to provide a cohesive feature engineering workflow.
Declarative Machine Learning Systems are a new trend that marries the flexibility of DIY machine learning infrastructure and the simplicity of AutoML solutions. In this talk we will discuss Ludwig, the open source declarative deep learning framework, and Predibase, an enterprise-grade solution based on it.
Each of us has a different answer to “Why is machine learning so hard?” And how long you have been working on ML will drastically influence your answer.
I’ll share what I learned over the past 20 years, implementing everything from scratch for 1 model in web search ranking, 100s of models for Sybil and 1000s of models for TFX. You’ll see why I’m convinced that data and software engineering are critical for successful data science – more so than models. Regardless of your experience, I’ll share some tips that will help you overcome the hard parts of machine learning.
We walk you through how we adopted Feast at Adyen. We’ll discuss the decisions we made because of infra and tech constraints, and the customizations we added, in particular for our open source project, spark-offline-store, which was adopted into the main Feast repo. We hope our journey can help you reason about adopting Feast into your stack.
Microservices are stream processing; whether you’re using Redis, Kafka, or gRPC, you continuously handle events and manage consistency. And given that these are some of the most challenging problems in databases, you’re probably not doing a very good job at it.
But that’s not your fault: these problems are hard! Just like you wouldn’t implement your own database for every web service, you shouldn’t be building your own stream processor for every new product feature.
Today’s stream processors have failed to gain widespread adoption outside niche use cases because they put themselves at the forefront. They force you to think about streaming when building your application and when deploying to production.
In my talk, I will argue that “streaming” is the right tool but the wrong abstraction. Of course, we all want the benefits – increased speed, stronger consistency – but they need to meet developers where they are. Only when streaming becomes an implementation detail can it gain widespread adoption and bring forth the benefits it has promised for so long.
Once models go to production, observability becomes key to ensuring reliable performance over time. But what’s the difference between “ML Observability” and “Data Observability”, and how can ML Engineering teams apply them to maintain model performance? Get fast, practical answers in this lightning talk by Uber’s former leader of data operations tooling and founder of the data observability company Bigeye.
[Open Source] Hamilton, a micro framework for creating dataframes, and its application at Stitch Fix
At Stitch Fix, we have 130+ “Full Stack Data Scientists” who, in addition to doing data science work, are also expected to engineer and own data pipelines for their production models. One data science team, the Forecasting, Estimation, and Demand team, was in a bind. Their data generation process was slowing iteration and causing operational frustration in delivering time-series forecasts for the business. In this talk I’ll present Hamilton, a novel open source Python micro framework that solved their pain points by changing their working paradigm.
Specifically, Hamilton enables a simpler paradigm for Data Science & Data Engineering teams to create, maintain, and execute code for generating dataframes, especially when there are lots of inter-column dependencies. Hamilton does this by building a DAG of dependencies directly from Python functions defined in a special manner, which also makes unit testing and documentation easy; tune in to the talk to find out how. I’ll also cover our experience migrating to it, our best practices in using it in production for over two years, along with planned extensions to make it a general purpose framework.
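To make the idea of building a DAG from Python functions concrete, here is a minimal sketch in plain Python. It is not Hamilton's actual API or implementation: it only assumes the general convention that a function's name is the column it produces and its parameter names are the columns it depends on. The `resolve` helper, the column names, and the sample data are all hypothetical, and plain lists stand in for dataframe columns.

```python
import inspect

# Assumed convention (illustrative only): the function name is the output
# column, and each parameter name is an upstream column it depends on.
def spend_per_signup(spend, signups):
    return [s / n for s, n in zip(spend, signups)]

def spend_zero_mean(spend):
    mean = sum(spend) / len(spend)
    return [s - mean for s in spend]

def spend_zero_mean_per_signup(spend_zero_mean, signups):
    # Depends on another derived column, so the graph has more than one level.
    return [s / n for s, n in zip(spend_zero_mean, signups)]

def resolve(name, functions, inputs, cache=None):
    """Compute column `name`, recursively computing its dependencies first."""
    cache = {} if cache is None else cache
    if name in cache:
        return cache[name]
    if name in inputs:
        value = inputs[name]
    else:
        fn = functions[name]
        # The dependency edges are read straight from the function signature.
        deps = inspect.signature(fn).parameters
        value = fn(**{d: resolve(d, functions, inputs, cache) for d in deps})
    cache[name] = value
    return value

functions = {
    f.__name__: f
    for f in (spend_per_signup, spend_zero_mean, spend_zero_mean_per_signup)
}
inputs = {"spend": [10.0, 20.0, 30.0], "signups": [1, 2, 5]}

result = resolve("spend_zero_mean_per_signup", functions, inputs)
print(result)
```

Because each column is an ordinary function with named inputs, it can be unit tested in isolation by calling it directly with small lists, which is one reason this style lends itself to testing and documentation.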
The ML Engineer’s life has become significantly easier over the past few years, but ML projects are still too tedious and complex. Feature stores have recently emerged as an important product category within the MLOps ecosystem. They solve part of the data problem for ML by automating feature processing and serving.
But feature stores are not enough. What data teams need is a platform that automates the complete lifecycle of ML features. This platform must provide integrations with the modern DevOps and data ecosystems, including the Modern Data Stack. It should provide excellent support for advanced use cases like Recommender Systems and Fraud. And it should automate the data feedback loop, abstracting away tasks like data logging and training dataset generation. In this talk, Mike will cover his vision for the evolution of the feature store into this complete feature platform for ML.