Tecton

Access Control in ML Feature Platforms

apply(conf) - May '23 - 10 minutes

Contemporary machine learning platforms connect a variety of data sources, tools, and applications, each with their own data governance requirements, constraints, and capabilities. In this talk, we’ll explore the role of the machine learning platform in integrating these elements, empowering data producers and feature engineers to own access control for their data and features.

Cooper Stimson:

Hi everyone. I’m Cooper, and I work on the foundational engineering team here at Block, focused on our platform for machine learning features. Today I’m going to be talking about access control in the context of ML feature platforms.

Cooper Stimson:

So at a high level, what are our goals? Specifically, we want to be able to connect feature engineers to the source data they use, and then connect the features they produce to the model engineers who use them to train and run models. It’s important to note that these are not necessarily the same people, and therefore the user stories are distinct. At Block, our machine learning systems handle a lot of data, including sensitive and financial data, so it’s critical that we nail this access control story across the board. And machine learning infrastructure presents some interesting challenges for managing access control. At a high level, a lot of it just comes down to heterogeneity: these teams and systems have heterogeneous surface areas.

Cooper Stimson:

So what do I mean by that? We have a variety of user teams using these systems, with different requirements for access control and different authorization levels. We have a variety of data sets and data sources, each of which can have its own data governance policies and regulatory requirements, things like PCI or GDPR. And we also just have a large suite of tools. This is the part that shifts most frequently, particularly right now with the rise of LLMs; we’re seeing new tools all the time. All of these feature stores, data storage systems, and ML toolchains can have different access control models and different identity management and authorization mechanisms, and many of them don’t plug into each other out of the box.

Cooper Stimson:

So let’s get into a little more detail on some of the aspects that can differ between these systems. The first thing I want to talk about is granularity. When we grant access to data, what do we actually mean? The bluntest approach is granting access to an entire database: you give your user access and they can see everything. This is actually sometimes appropriate, either if the data is not sensitive or if you have a small enough team that everybody is working on everything. But at enterprise scale this doesn’t work. As individuals specialize and you have more kinds of data, you need to be able to block access by default.

Cooper Stimson:

So then we move to a data set level of granularity. This might be a table, or an S3 bucket, where you are managing access to a collection of records of the same type. Already you might be working with databases or feature stores that don’t support this granularity level, and you may need to start thinking about building an intermediation layer. But once you have a large number of data sets, and I’ve worked on systems that manage literally tens of thousands of feature tables, managing access control at the data set level is no longer practical. So then you start moving into concepts that I’m going to lump under workspaces, though many systems have different names for this, where you group together a bunch of data sets and treat them as a unit for the purposes of access control. This is actually what our host, Tecton, does: you can define a workspace, define a bunch of features inside it, and then grant or deny access to that workspace as a single action.
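To make the workspace idea concrete, here is a minimal, hypothetical sketch of grouping data sets into a workspace and granting access to the whole group in one action. The class and every name in it are illustrative assumptions, not any particular feature store’s API.

```python
# Hypothetical sketch of workspace-scoped access control: datasets are grouped
# into a workspace, and a single grant covers everything inside it.
from dataclasses import dataclass, field


@dataclass
class Workspace:
    name: str
    datasets: set[str] = field(default_factory=set)
    readers: set[str] = field(default_factory=set)  # principals granted read access

    def grant_read(self, principal: str) -> None:
        self.readers.add(principal)

    def can_read(self, principal: str, dataset: str) -> bool:
        # One decision for the whole workspace instead of per-dataset grants.
        return dataset in self.datasets and principal in self.readers


fraud = Workspace("fraud-features", datasets={"card_txn_counts_7d", "chargeback_rate_30d"})
fraud.grant_read("model-eng-risk")

assert fraud.can_read("model-eng-risk", "card_txn_counts_7d")
assert not fraud.can_read("analytics-intern", "card_txn_counts_7d")
```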

Cooper Stimson:

But sometimes you need even more granularity than that, and that’s where we get to the key or row level. Sometimes certain records need their access controlled specially, for example if a user invokes their right to be forgotten under GDPR. At the same time, you also have regulatory requirements around know-your-customer and record keeping, so you may end up with records that should only be accessed under very specific, regulatorily required circumstances.

Cooper Stimson:

There are a few ways you could do this, and frankly, most tools don’t support it out of the box, so you need to build some type of layer on top. This could be a gating system where you introspect return values and block responses when forbidden rows would be returned. You could also implement it by duplicating feature values under the hood into redacted and unredacted sets, which is what I’ve seen be most successful. I’ve worked on systems that handled this under the hood, where the accumulators for each aggregation were maintained in both redacted and unredacted versions and only allow-listed users were served the unredacted version.
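Here is a minimal sketch of that second approach, assuming a hypothetical feature server that keeps each aggregate in both a redacted and an unredacted variant and resolves which one to serve at read time. All names and values are invented for illustration.

```python
# Sketch of the "duplicate under the hood" approach: each aggregate is stored
# in a redacted and an unredacted variant, and only allow-listed callers are
# served the unredacted one. Redaction is resolved at serving time.
from dataclasses import dataclass


@dataclass
class DualAggregate:
    unredacted: float  # computed over all rows, including restricted ones
    redacted: float    # computed with restricted rows excluded


class FeatureServer:
    def __init__(self, allow_list: set[str]):
        self.allow_list = allow_list  # principals cleared for restricted rows
        self.store: dict[str, DualAggregate] = {}

    def write(self, key: str, agg: DualAggregate) -> None:
        self.store[key] = agg

    def read(self, principal: str, key: str) -> float:
        agg = self.store[key]
        return agg.unredacted if principal in self.allow_list else agg.redacted


server = FeatureServer(allow_list={"compliance-svc"})
server.write("user:42:txn_sum_90d", DualAggregate(unredacted=1250.0, redacted=980.0))

print(server.read("compliance-svc", "user:42:txn_sum_90d"))  # 1250.0
print(server.read("fraud-model", "user:42:txn_sum_90d"))     # 980.0
```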

Cooper Stimson:

So at this point we have a lot of pairwise relationships between users and resources. How do we represent those? Well, as grants. I’m going to talk about three models, which roughly correspond to organizational size and complexity. The first is a distributed model: each tool, each feature store, each data storage system probably has its own access control system built in, and you just use that. When you want to grant somebody access to an S3 bucket, you write an IAM policy, et cetera.
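As an illustration of the distributed model, this sketch grants a team read access to an S3-backed dataset by writing an ordinary bucket policy through that tool’s own mechanism. The bucket name, role ARN, and prefix are placeholders, and in practice this kind of policy is often managed through infrastructure-as-code rather than an ad hoc script.

```python
# Distributed model: use each tool's native access control directly. Here, an
# S3 bucket policy grants one role read access to one dataset prefix.
import json

import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "FeatureTeamRead",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/feature-eng"},
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-feature-bucket/payments/*",
        }
    ],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket="example-feature-bucket", Policy=json.dumps(policy))
```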

Cooper Stimson:

This works with small teams and low-complexity data ecosystems. But as you grow and your users become more specialized, that web of pairwise relationships just gets too complex, so you need to move to a centralized model. There are roughly two approaches to this. The first is what I’ll call, and I’m inserting an opinion here, a bottleneck model. This is where you centralize all the grants in one place controlled by your ML platform team. I’ve worked on systems that did this, even systems that used a single config file to manage access control for a variety of resources and a variety of teams.

Cooper Stimson:

Under a system like that, data owners would submit PRs to update their sections of the config, and our team would vet that only the appropriate users were updating the appropriate grants. This put us in the loop and slowed down customers, because if a data source owner wanted to grant another team access to their data, they had to go through us. That made us a bottleneck as use of our platform grew.
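For concreteness, here is a toy slice of what such a single central grants file can look like, expressed as a Python literal. The resource and team names are invented for illustration; in the bottleneck model, every change to this structure flows through a platform-team review.

```python
# Hypothetical central grants config: one file covers every resource and every
# team, and each data owner edits only their own section via PR.
GRANTS = {
    "payments_features": {
        "owner_team": "payments-data",
        "readers": ["payments-data", "risk-models"],
    },
    "support_tickets": {
        "owner_team": "customer-support",
        "readers": ["customer-support"],
    },
}
```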

Cooper Stimson:

So we moved to a delegated system, and this is what I think most organizations will tend toward over time. In this system, you still have a centralized place, or a small number of centralized places, for grants, but you’ve now delegated management of them to an owner role. For a given resource, you grant an owner role, and that owner can manage grants for that resource or set of resources. This completely removes the platform team from the loop and allows feature owners and model engineers to interface directly and use your platform to get the job done.
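A minimal sketch, assuming a grants layout like the toy config above, of how an owner role makes the authorization check mechanical so the platform team no longer reviews each change. All functions and names here are hypothetical.

```python
# Delegated grant management: only the team that owns a resource may change
# who can read it, so no central review step is needed.
def authorize_change(grants: dict, resource: str, actor_team: str) -> bool:
    """Return True if actor_team owns the resource and may edit its grants."""
    entry = grants.get(resource)
    return entry is not None and entry["owner_team"] == actor_team


def add_reader(grants: dict, resource: str, actor_team: str, new_reader: str) -> None:
    """Add a reader to a resource, enforced by the owner-role check above."""
    if not authorize_change(grants, resource, actor_team):
        raise PermissionError(f"{actor_team} does not own {resource}")
    grants[resource]["readers"].append(new_reader)


grants = {
    "payments_features": {"owner_team": "payments-data", "readers": ["payments-data"]},
}
# The owning team grants another team read access with no platform-team review.
add_reader(grants, "payments_features", "payments-data", "fraud-models")
print(grants["payments_features"]["readers"])  # ['payments-data', 'fraud-models']
```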

Cooper Stimson:

A second benefit of that is auditability, and I’m talking both in the sense of incident forensics and formal audit obligations, but also just the baseline understandability of your systems, which pays off when you’re designing new features or upgrading things. We deal with a lot of systems in machine learning, like neural nets and LLMs, which can be somewhat black boxes; access control cannot be that.

Cooper Stimson:

When I’m talking about auditability, the most important part to me is that you support point-in-time visibility into the state of your access control system: for any point between the beginning of your system and now, you can understand what resources were available to which users. There are two sides of this. The first is understanding what grants were active at any given point. The second is the actual access history. Depending on your use case, both of those are going to be important.
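A minimal sketch of the first side, point-in-time grant visibility, assuming grants are recorded as an append-only event log that can be replayed up to any timestamp. The event shape and names are illustrative only.

```python
# Point-in-time reconstruction: replay an append-only log of grant/revoke
# events to recover the grants that were active at a past timestamp.
from datetime import datetime, timezone

# (timestamp, action, principal, resource)
events = [
    (datetime(2023, 1, 5, tzinfo=timezone.utc), "grant", "risk-models", "payments_features"),
    (datetime(2023, 2, 1, tzinfo=timezone.utc), "grant", "growth-team", "payments_features"),
    (datetime(2023, 3, 10, tzinfo=timezone.utc), "revoke", "growth-team", "payments_features"),
]


def grants_at(log, as_of: datetime) -> set[tuple[str, str]]:
    """Replay the log up to `as_of` and return the active (principal, resource) pairs."""
    active: set[tuple[str, str]] = set()
    for ts, action, principal, resource in sorted(log):
        if ts > as_of:
            break
        if action == "grant":
            active.add((principal, resource))
        else:
            active.discard((principal, resource))
    return active


print(grants_at(events, datetime(2023, 2, 15, tzinfo=timezone.utc)))
# {('risk-models', 'payments_features'), ('growth-team', 'payments_features')}
```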

Cooper Stimson:

Okay, so stitching it all together: given all of this heterogeneity, how do we actually build a platform that supports our users’ needs? This is my high-level workflow for approaching it. First, gather requirements from all users and all data owners. Second, gather the capabilities and constraints of each tool. Given those two sets of information, design end-to-end user stories to understand how users will actually be using your platform. Given those stories and the constraints of your tools, you’ll quickly identify gaps, and that’s where you come in as a platform team and build glue. And once you’ve built sufficient glue to support those user stories with your system components, step back and let your machine learning experts own their own access control.

Cooper Stimson:

Thanks.

 

Cooper Stimson

Software Engineer, Machine Learning Platform

Block
