Tecton

Empowering Small Businesses with the Power of Tech, Data, and Machine Learning

apply(conf) - May '22 - 30 minutes

Data and machine learning shape Faire’s marketplace – and as a company that serves small business owners, our primary goal is to increase sales for both brands and retailers using our platform. During this session, we’ll discuss the machine learning and data-related lessons and challenges we’ve encountered over the last 5 years on Faire’s journey to empowering entrepreneurs to chase their dreams.

Daniele Perito:

Welcome everybody. My name is Daniele Perito. I’m one of the founders at Faire. Today I would like to tell you about how we grew the tech stack at Faire to serve our community of customers over the past five years. Can you all see my screen?

Demetrios:

Yep.

Daniele Perito:

Thank you. I hope that the talk is not going to be too self-indulgent. I have been known to like to tell a good story about the good old times. I hope that if you’re joining a smaller startup, if you are thinking about starting your company or you just started your company, maybe you’ll gather some pointers on things to think about as you’re scaling your team, you’re scaling your data machine learning stacks. Here we go. First, let me tell you a little bit more about Faire. Faire is a wholesale B2B marketplace founded in 2017 to help independent brands and retailers connect.

Daniele Perito:

Just to give you a little more of a vivid picture of what we do: if you’re in San Francisco, imagine one of those stores on Valencia Street; anywhere else, imagine one of the stores on Main Street. Those are curated stores, and they need to buy merchandise to sell. Traditionally, they would buy their merchandise from sales reps or at a trade show, but with Faire, they can buy it on our marketplace. We also serve online stores, but I just wanted to give you a picture of who we are. A little bit of numbers, just to give you some context on the type of journey we went on: right now Faire serves over 450,000 retailers around the world, and we have 70,000 brands selling on Faire. On the buyer side, the retailer’s side, the website is available in nearly 20 countries.

Daniele Perito:

I’ll tell you more about our stack, but I think it’s also good to have a sense of scale in terms of the number of employees: we now have over a thousand employees, and we are hiring. Let me get into it, and please ask any questions you like; I’ll try to answer them towards the end. As a marketplace, we have a variety of data challenges. I think marketplace businesses naturally lend themselves to many data optimization and machine learning problems. Starting from the bread-and-butter search and discovery ranking: from when someone types something in the search bar to when they land on the homepage, we need to show them something, and that’s driven by a variety of models. We need to manage risk and fraud in the marketplace; that’s just the cost of doing business. And we need to enable our sellers to be successful on the platform.

Daniele Perito:

For that, we need to find the best brands that exist around the world, score them so that we can prioritize which ones should be onboarded first, and then maybe give them a helping hand as they’re joining the platform to optimize their profile and so on. All of that is done by several machine learning models. We naturally have incentives as a marketplace, and we need to optimize those incentives; we have folks working on that. We also have other data and machine learning challenges, but this is a representative set. Then on the platform side, we need to manage things like our experimentation platform, our analytics platform, our machine learning platform, and so on and so forth.

Daniele Perito:

The structure of this talk is that I’ll go year by year through some of the main things we worked on each year (we’re a five-year-old company, give or take) and some of the learnings that I’ve seen, and that we’ve seen, as we grew the company. Starting with year one: the size of the company was roughly four people at the beginning, going to 12 people towards the end. The size of the data team was one; it was me. I had some expertise in data because I was working in machine learning at Square prior to starting the company, so I was the person that had to make all the decisions, I guess. In that first year, we had to make several decisions. We had to decide on what online database to use. We started replicating data to the warehouse, so we also had to choose the data warehouse.

Daniele Perito:

We stood up an event recording framework to understand what customers were doing on the website. We stood up a BI tool, and then we had our first real-time feature store, which at that time was used for fraud. Just a quick callout here: out of these six systems that we built back then, five of them are still in place, either as they were built or as an evolution of what was originally built. One of the first things that I’d like you to take home is that data decisions, data infrastructure decisions, can be very sticky, because it’s not easy to migrate things over. So think those decisions through thoroughly. But let’s dive a little deeper into the decision on the online database. Just to set the stage and give you a picture of where we were: we had just started the company in January 2017.

Daniele Perito:

We didn’t really have a product. We didn’t have customers, but we had a deadline three months later to make it to demo day at Y Combinator, which is an accelerator program. We really needed to get out of the gate fast. My co-founder, Marcelo, wanted to use MongoDB, primarily because it was really, really fast to set up and get off the ground. I don’t want to get too much into the pros and cons of each technology; I think this is a topic that has been debated at length. But the primary reason we ended up choosing MySQL was that it was so much easier to replicate the data into a format that would allow us to do analytics. I think that was one of the primary deciding factors in that decision.

Daniele Perito:

To summarize some of the learnings of that first year: first, you’re going to be making a lot of important and far-reaching decisions about your data and ML organization. Make sure that even if you’re looking at engineering-only decisions, you keep an eye on your data stack and how it’s going to evolve based on those decisions. You may not have data or machine learning expertise in that first year; maybe your company is doing something else, maybe it’s a SaaS company focusing on something different. But at some point you’ll face challenges like getting data into a data warehouse. Make sure that from the very beginning you have advisors, employees, or co-founders who have worked with data and machine learning systems and can give you some solid advice.

Daniele Perito:

In our second year, the company went from roughly 12 people to roughly 45 people, and the size of the data team grew from one to three. In that second year, we were doing many things, including improving our search system, starting to pay attention to orchestration, and starting to standardize our core tables for analytics. We had to make a decision on our orchestration. Just to give you some context, at that time we were using our BI tool, which was Mode Analytics and still is, and scheduling cron jobs in Mode Analytics in order to compute nightly things. Obviously that was a bit of a hack solution that got us off the ground, but we needed to start orchestrating workflows in order to do all sorts of things. We didn’t have much expertise in either Kubeflow or Airflow.

Daniele Perito:

Again, here I’m not trying to take sides on any of those technologies. I think they’re all capable technologies; your mileage may vary, and your particular circumstances might be different from ours, but at that time we didn’t know much about either. I think we ended up going with Kubeflow because there was the allure of the entire ML orchestration, an entire end-to-end system for ML that we thought we would build in Kubeflow. Eventually we had to revert back to Airflow a year or two later. Looking back to those days, I wish we had thought a little bit more about our specific use case. Both of those are capable technologies, but they’re different tools that are appropriate for different use cases.

Daniele Perito:

As you’re starting to scale your data organization and your company, keep in mind that when you don’t have direct expertise with something, you should really start using rigorous decision making. That might be a rubric where you evaluate the pros and cons of a certain solution and identify exactly what problems you’re trying to solve with it, so that your batting average, your average decision, is going to be a little bit more right, and that’s going to be a long-term advantage for you. In year three, we went from a company of roughly 45 to a company of roughly 150, and the data team doubled in size from three to six. The main decisions we made that year were that we started unifying our feature store: we had a couple of ways to compute features, and we started putting them all together in one unified way.

Daniele Perito:

We migrated back to Airflow. We started really building data volume monitoring to make sure that tables wouldn’t go stale, that features wouldn’t start drifting, and that model scores wouldn’t start drifting. Crucially, we started trying to see whether or not we needed to get a new experimentation platform, and that’s the thing I wanted to focus on a little bit more. At the time, we had a homegrown experimentation system, and it was extremely basic. When customers were assigned to either control or treatment, we would just write SQL queries to compute some metrics on both customer sets and then display them in a Mode Analytics dashboard. We knew that the system was limited, so we started evaluating options for a replacement. We started talking with a company that provides experimentation as a platform, and we actually signed a contract with them. The first thing I realized is that the project took longer than we expected, just because we had to synchronize state between our own data warehouse and the external experimentation platform.
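As a rough illustration of the kind of query that early system relied on (the table names, columns, and experiment name below are hypothetical, not Faire’s actual schema), a basic control-vs-treatment readout might look something like this:

```sql
-- Hypothetical sketch of an early experiment readout: compare 7-day order
-- volume and GMV between control and treatment. Names are illustrative only.
SELECT
    a.variant,                                            -- 'control' or 'treatment'
    COUNT(DISTINCT a.retailer_id)                         AS retailers,
    COUNT(o.order_id)                                     AS orders,
    SUM(o.order_total)                                    AS gmv,
    SUM(o.order_total) / COUNT(DISTINCT a.retailer_id)    AS gmv_per_retailer
FROM experiment_assignments a
LEFT JOIN orders o
    ON o.retailer_id = a.retailer_id
   AND o.created_at BETWEEN a.assigned_at AND DATEADD(day, 7, a.assigned_at)
WHERE a.experiment_name = 'new_ranking_model'
GROUP BY a.variant;
```

A query like this is easy to write but leaves out everything a real experimentation platform adds: significance testing, variance reduction, guardrail metrics, and consistent metric definitions across experiments.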

Daniele Perito:

I think looking back, this is a cost that we hadn’t factored in properly. What do I mean here specifically? If you have your own warehouse, you have your own dashboards, and you have your own metrics, you have a definition of every important fact in your business. Maybe it’s going to be orders and visits, active customers, add to cart events, and all of those things. Those things are going to be defined in your warehouse. When you’re integrating with something like an experimentation platform, they will have their own data model typically. There is a real cost in understanding how to map your data model into their data model and make sure that they’re close enough to one another. Ultimately, the effort failed for lack of use. So we went back to using our own homegrown solution, which by then had become a lot more sophisticated. It was a proper experimentation platform with lots of analytics and variance reduction and all sorts of cool tricks.

Daniele Perito:

A couple of important things that I’ve learned. When considering external providers for data platforms or anything else (and by the way, I think you should, because technologies are maturing very rapidly and we’ve benefited a ton from using external providers), do factor in the fact that keeping data consistent between your own data warehouse and metrics definitions and their metrics definitions is going to be a real cost. Secondly, as you’re evaluating solutions, also consider the specific composition of your team. What I mean is that for the experimentation platform, the external provider we were considering was very capable, but our team tends to be quite literate in terms of SQL. They always tended to have follow-up questions. They said, “Oh, okay. Treatment is winning on the primary metric, but can we cut it by this? Can we cut it by that?” They really wanted to be able to do that, and the outside platform was a little bit lacking in that regard.

Daniele Perito:

If you have a company that is more SQL-literate, or less SQL-literate, consider that as you’re choosing your tools. In year four, we went from roughly 150 people to roughly 300 people, and the size of the data team doubled again from six to 12. We started working on a more sophisticated version of our search and discovery algorithm, with real-time ranking and low-latency inference. We migrated from Redshift to Snowflake, and we Terraformed everything in our data world. Again, in this talk I’m not advocating for either solution; Redshift served us really, really well for several years. But at some point it became clear to us that for our own use case, the speed and scalability of Snowflake were going to be a big advantage. So we embarked on the process, which I know a lot of you also went through, of migrating thousands of tables, thousands of definitions, thousands of dashboards from one data warehouse to another. It’s a very painful process that took us over six months.

Daniele Perito:

There are a couple of things that I wish I had known before embarking on that process. Number one, I wish we had really kept a tighter ship in terms of project management. Big data infrastructure changes are hard and require a lot of coordination: you have to migrate many, many different things, and you have to bring the whole company along with you as you’re doing it. I wish we had kept a little bit of a better project management stack in order to do that. Another important learning I had that year is that none of these migrations can ever be perfect. When you’re migrating the definitions of hundreds of tables and ETLs and thousands of dashboards, inevitably there are going to be small inconsistencies. That’s not because of a lack of rigor or precision; it’s just because some functions are going to be slightly different, or there is going to be timestamp misalignment between when metrics were computed in the different warehouses, for example.
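One way to keep those inconsistencies bounded, sketched here with hypothetical table names and a made-up 0.5% tolerance, is a reconciliation query over daily aggregates exported from both warehouses into one place, surfacing only the days that disagree by more than the agreed threshold:

```sql
-- Hypothetical migration reconciliation: compare a daily metric computed in
-- the old (Redshift) and new (Snowflake) warehouses, and flag only the days
-- whose relative difference exceeds an agreed threshold (0.5% here).
SELECT
    rs.metric_date,
    rs.daily_gmv                                                   AS gmv_redshift,
    sf.daily_gmv                                                   AS gmv_snowflake,
    ABS(rs.daily_gmv - sf.daily_gmv) / NULLIF(rs.daily_gmv, 0)    AS rel_diff
FROM redshift_daily_gmv  rs
JOIN snowflake_daily_gmv sf
    ON sf.metric_date = rs.metric_date
WHERE ABS(rs.daily_gmv - sf.daily_gmv) / NULLIF(rs.daily_gmv, 0) > 0.005
ORDER BY rel_diff DESC;
```

Anything below the threshold is declared close enough and the migration of that table is considered done, which is exactly the point of the next learning.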

Daniele Perito:

As you’re doing large-scale data migrations, make sure to define an accuracy threshold explicitly so that people can actually finish the job, because otherwise you might be stuck in an infinite loop of saying, oh, we’re going to make this better, we’re going to make this better. Moving on to year five, which for us is last year, 2021: the company went from roughly 300 people to roughly 800, and the size of the data team grew again from 12 to 30. This is still a work in progress, but we embarked on some big pieces of work. For example, we are starting to unify our ML platform, and by ML platform I mean feature store, model training pipelines, model registry, model monitoring, and model serving. It is a little scattered at the moment; multiple teams are doing things slightly differently, and we want to bring everything under one umbrella.

Daniele Perito:

Then analytics unification, meaning that many teams at Faire have defined their own metrics, their own summary tables, dimension tables, and fact tables, and we are working on unifying all those definitions to make sure that everybody is looking at the world in the same way. The main learning for me here is that as your organization evolves and grows, you have to try to establish some type of technical committee to understand when it’s time to make big investments, because inertia is a hell of a drug here. If you don’t make an explicit effort to collect feedback from data scientists, machine learning engineers, and data engineers about their jobs, where they’re spending most of their time, and whether there is a lot of toil in their workflows, then you might not understand that it’s time to make some big change, whatever that big change might be. For us, it was investing in a machine learning platform, investing in a unified analytics system, or investing in a Snowflake migration.

Daniele Perito:

Those projects are really, really important for the future of your company and your organization, so spend some time and set up a system so that you can gather feedback from all of your org and understand when it’s time to make the next big change. In summary: make sure that you have expertise in the early days of your company so that you can make the right decisions early on, because those are going to be very impactful. Make sure that the engineering side of the org and the data side of the org speak the same language, especially in the early days; otherwise, you might have to replicate data stores that are hard to replicate, or figure out things that are hard to figure out. As you grow, make sure you establish rubrics to make decisions about your data infrastructure so that you have a higher batting average. And consider the team composition as you’re making data choices, like who are the people this solution is going to be serving.

Daniele Perito:

I think that concludes my talk. To learn more about Faire, you can visit these links. Thank you all. I think I’ll be taking on some questions.

Demetrios:

Yeah, dude. I got a few questions for you, man. There’s some questions coming through Slack. First thing I want to say, though, I called you Italian. I hear your accent. It doesn’t sound Italian. You aren’t Italian, are you?

Daniele Perito:

No, I’m born and raised Italian. I’ve been in the United States for 12 years, and my accent is a little bit funky.

Demetrios:

Oh, all right. Cool. I was afraid that I misspoke there. Let’s get into some questions. Can you give concrete examples of a rubric for making data decisions?

Daniele Perito:

Yeah. Most recently we had to decide on a data catalog, and part of the rubric was: does the data catalog integrate with our BI tool? Because ideally, the data catalog should show contextual information about the tables and columns that appear wherever you write your queries. So does it appear in our BI tool, and is that a non-negotiable? Can it automatically derive lineage from the way tables have been created? Does it have some way for people to follow tables? I think one powerful idea in this whole data catalog space is that when people follow tables, they’re going to be kept abreast of when tables are changed, or they’re going to be tapped, because if you’re following a table, maybe you’re implicitly more of an expert on that table. People can see who follows that table, and therefore they can ask questions like, “Hey, do you know how this table works exactly?”

Daniele Perito:

We had a few things that we wanted a data catalog tool to do. We divided them into must-haves and nice-to-haves, and then we took all the data catalog tools that we wanted to potentially use and put them in that table. So first we decided on the rubric; that’s very important. You have to decide on the rubric first, and then you evaluate the solutions. You want to make sure to set a consistent foundation, a level playing field, first, so that the evaluation makes sense.

Demetrios:

Yeah. Yeah. What data are you using? Do you see transactions from customers? What data do you buy from the marketplace?

Daniele Perito:

Well, we are a marketplace. The best way to think about us is that we are an eCommerce marketplace, not unlike Etsy, for example. People come to our website and they buy products; retailers buy products to sell in their stores. The vast, vast majority of the data that we use to make decisions or to build models is data collected from the behavior of customers: customers placing orders, customers favoriting things, customers visiting pages. That’s the bulk of the data that we use. We do also try to understand the landscape outside of Faire a little bit. We try to see, for example, what the website of a brand looks like, so we get some of that data from publicly available sources.

Demetrios:

What triggered your decision to move from Redshift to Snowflake, and how did you identify the right time to make that move?

Daniele Perito:

Yeah. I think we had a long outage, and there was a little bit of an uproar. We started having people complain about the fact that their queries were taking longer and longer. We spent a good amount of time trying to make Redshift work. Again, Redshift is very capable; it took us from zero until about a year and a half ago, so it’s been amazing. But at some point we realized that the amount of effort we were putting into trying to optimize Redshift, trying to scale it, sniping bad queries, optimizing large queries, was getting harder and harder, and yet query times were getting longer and longer and there was no end in sight. So we started talking to a bunch of companies, we started getting feedback on what they were using, and Snowflake ended up being the solution we ran with.

Demetrios:

Yeah. Yeah. Another one that is good from Ash, this is a follow-up question. You mentioned in the slides analytics platform and ML platform. What are the differences between the two?

Daniele Perito:

Yeah. An ML platform, the way I think about it (and I think people also refer to it as MLOps), is a unified feature store that all data teams can operate on. It is a unified machine learning model training pipeline. It is a model registry where previously trained models can be stored and then analyzed for later use. It is model serving, which could be batch model serving or real-time model serving. And then model monitoring: are our models continuing to work as well as they did when we trained them? That’s an ML platform. An analytics platform is really just making sure of: how are we computing all our dimensions and facts in the business? What is the system that does that? What is our data catalog that allows folks to know where stuff is? How are we adding definitions, and who is adding them? Those are the general questions, and then there are specific questions: what are the orders? When did an order happen? What was the revenue for that order? What was the freight cost for that order?

Daniele Perito:

All of these things need to be defined somewhere, and ideally they need to be defined once, so that everybody in the organization knows exactly where they’re defined and how to query them. To me, that’s the analytics platform: a unified system where every single fact of the business is defined once, and everybody knows where it is and everybody uses it.
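A minimal sketch of what “defined once” can look like in practice, with hypothetical table and column names: a single shared view owns the definition of an order fact, and every dashboard and feature pipeline queries that view instead of re-deriving it.

```sql
-- Hypothetical canonical fact view: the one place where "an order", its
-- revenue, and its freight cost are defined. Downstream dashboards and
-- feature pipelines query fct_orders rather than re-deriving these columns.
CREATE OR REPLACE VIEW analytics.fct_orders AS
SELECT
    o.order_id,
    o.retailer_id,
    o.brand_id,
    o.created_at                     AS order_placed_at,    -- when the order happened
    o.subtotal_usd                   AS revenue_usd,         -- revenue for the order
    COALESCE(s.freight_cost_usd, 0)  AS freight_cost_usd     -- freight cost for the order
FROM raw.orders o
LEFT JOIN raw.shipments s
    ON s.order_id = o.order_id
WHERE o.status <> 'canceled';
```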

Demetrios:

Yeah. Then the machine learning platform just goes a little bit deeper down the rabbit hole for the machine learning use cases.

Daniele Perito:

Yeah. I mean, it definitely needs to make use of these things. A lot of the features for your models are going to be built from those summary facts and fact tables, which can then be used to train models and so on.

Demetrios:

Yep. John is asking about why you went from Kubeflow to Airflow. He is going through that same situation right now, trying to choose between the two.

Daniele Perito:

This is a little bit out of my depth, because I wasn’t much in the weeds of that. But I think there were some issues with the cleanliness of the logs. When a pipeline broke in Kubeflow, it was always very hard for the team to figure out exactly where it broke; parsing the logs was very hard, and similarly the UI. Airflow is a lot more integrated, so it makes sure you understand exactly where a certain task broke and allows you to really go in quickly and debug it. I think those were some of the things that didn’t fully work for us.

Demetrios:

We’re hitting up on time. There’s a few more questions and I’ll try and ask them fast and rapid fire it. What technologies are you using for the online feature store or real time feature store?

Daniele Perito:

That’s all homegrown today. Actually, it’s still split between ranking and fraud: the ranking system is real time and homegrown, and the fraud system is different. I am very interested in a [inaudible 00:30:40] to this, to the organizers here, I am very interested in a unified real-time and batch feature store. I think the technology is getting there. We have something that works for us, and I’m interested in understanding when to potentially make a switch. An issue that is top of mind for me is feature backfilling, which is really hard to do, especially for real-time feature computation; it’s really, really hard, and we don’t support it well right now. Getting to that point, I think, might be something where we’ll be looking to external providers. But the cost that I was talking about of syncing state with an external provider is very real, and I think for feature stores that’s something we’re going to have to consider.
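To make the backfilling pain concrete, here is a rough sketch (hypothetical tables and feature) of the point-in-time join you end up writing when backfilling a feature for training: for each historical example you need the feature value as it would have been computed at that moment, not as it is today.

```sql
-- Hypothetical point-in-time backfill: for each historical order, compute the
-- retailer's trailing 30-day order count as of the moment the order was placed.
-- This "as-of" logic is easy to get subtly wrong, and it is even harder to
-- reproduce faithfully for features that are computed in real time.
SELECT
    o.order_id,
    o.retailer_id,
    o.created_at,
    COUNT(prev.order_id) AS retailer_orders_30d_asof
FROM orders o
LEFT JOIN orders prev
    ON prev.retailer_id = o.retailer_id
   AND prev.created_at <  o.created_at
   AND prev.created_at >= DATEADD(day, -30, o.created_at)
GROUP BY o.order_id, o.retailer_id, o.created_at;
```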

Demetrios:

The organizers just clipped that soundbite, and you’re going to be all over their social networks. I had one on the experiment tracking that you were talking about: you mentioned the database that you were using, and then the database that the experiment tracking company was using, and how it was so hard to make those play nicely together. Is there a way, or have you thought about… I guess this is what you’re doing at the end of the day: you just brought the experiment tracking into your database. Did you not find anything from a third-party vendor that would integrate into your database and do that for you?

Daniele Perito:

Well, it was a little bit of an impedance mismatch, and that’s what I was trying to get to. The way that we think about our metrics is that we replicate our SQL tables from production databases. We then take those tables, we compute the [inaudible 00:32:26] on them, and now we have facts about our orders, about our customers, and things like that. The way that the experimentation platform thought about the world was events: every time an add to cart happened, every time an order happened, it was an event with some tags on it, whether the customer was from the United States or Canada, whether it was a tenured customer, whether it was a new customer. They thought about things in events. Whenever we needed to do analysis in their system, we needed to make sure that whatever piece of data we wanted to do the analysis on, is it a tenured customer or not, was being published in those events. So that’s the main mismatch that you need to account for when thinking about integrating an external platform.

Daniele Perito:

It’s not just, “Hey, give them a giant copy of your data warehouse,” or give them access to your warehouse; most likely they’ll be thinking about how to organize data in a completely different way than you do. So you’re going to have to find a way to remap whatever you do to whatever they’re doing.
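As an illustration of that remapping cost (again with hypothetical table and column names), pushing warehouse facts into an event-shaped vendor model means rebuilding each fact as an event, with every dimension you might later want to slice by baked in as a tag at publish time:

```sql
-- Hypothetical remapping from a warehouse fact table to a vendor's event
-- model: each order becomes an "order_placed" event, and dimensions like
-- country and tenure must be published as tags on the event, or they won't
-- be available to cut by in the vendor's analysis UI.
SELECT
    'order_placed'                                      AS event_name,
    o.order_placed_at                                   AS event_timestamp,
    o.retailer_id                                       AS user_id,
    r.country                                           AS tag_country,
    CASE WHEN r.first_order_at < DATEADD(month, -6, o.order_placed_at)
         THEN 'tenured' ELSE 'new' END                  AS tag_tenure
FROM analytics.fct_orders o
JOIN analytics.dim_retailers r
    ON r.retailer_id = o.retailer_id;
```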

Demetrios:

Last one for you. Daniele, and this has been really cool. I appreciate your experience and the evolution of what you’ve seen at Faire. What are some of the biggest challenges Faire has faced as the platform grows and expands internationally?

Daniele Perito:

Oh, yeah. International expansion has been really fun, just because every single currency and every single language changed, and localization and internationalization fan out to a hundred things that need to change in the data and analytics world. It’s like, oh, you thought it was easy to sum up orders: what’s the total order amount over the last 30 days? It used to be an easy question, but once you start thinking about different currencies, now you have to think about, okay, do I store the original currency in a separate column? Do I snapshot the currency? At what time do I snapshot the currency conversion? Is that when we recognize the revenue? It’s a whole [inaudible 00:34:33]. It really impacts your data systems a lot.
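A sketch of one common way to handle this (the columns below are hypothetical, not Faire’s actual schema): store the original currency amount, the currency code, and a conversion rate snapshotted at a well-defined moment, so the “total orders over the last 30 days” question has one reproducible answer.

```sql
-- Hypothetical multi-currency order table: keep the original amount and
-- currency, plus an exchange rate snapshotted at a well-defined time
-- (here, when revenue is recognized), so reported totals are reproducible.
CREATE TABLE orders_multicurrency (
    order_id         BIGINT,
    retailer_id      BIGINT,
    created_at       TIMESTAMP,
    amount_original  DECIMAL(18, 2),   -- amount in the currency the buyer paid in
    currency_code    CHAR(3),          -- e.g. 'EUR', 'GBP', 'CAD'
    fx_rate_to_usd   DECIMAL(18, 8),   -- rate snapshotted at revenue recognition
    amount_usd       DECIMAL(18, 2)    -- amount_original * fx_rate_to_usd
);

-- Total order amount over the last 30 days, in USD, with no ambiguity about
-- which conversion rate was used.
SELECT SUM(amount_usd) AS gmv_usd_30d
FROM orders_multicurrency
WHERE created_at >= DATEADD(day, -30, CURRENT_TIMESTAMP);
```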

Demetrios:

Yeah. I imagine you also have a lot of fun with GDPR and all that good stuff [inaudible 00:34:45]

Daniele Perito:

Oh yeah, that’s for sure. Though I think we were ready because of CCPA in California, and they’re very compatible, so I think that was a little bit less work for us just because we had already done the work for California.


Daniele Perito

Co-Founder & Chief Data Officer

Faire

Daniele Perito is the co-founder and Chief Data Officer of Faire. Daniele oversees all data and risk management to enable the company and its customers to make data-driven decisions whenever possible. Prior to Faire, Daniele was Director of Security and Risk for Square Cash, where he worked on building secure, fast, and easy-to-use products. Daniele led the integration of products that make it easier for individuals to collect payments, including Snapcash and Twitter political campaign donations via Square Cash. He holds an MSc in Computer Network Security from Sapienza Universita di Roma and a Ph.D. in computer science from INRIA Grenoble, and completed a postdoc in computer science at the University of California, Berkeley.
