Building a Fraud Detection Model

Tecton is powered by Spark, making it easy to run complex transformations on your data and layer in real-time data from your Kafka and Kinesis streams. Extracting a training dataset is the only step that needs to run in a Spark environment, such as Databricks or AWS EMR.

Preview Materialized Features

Let's take a look at the user data we have stored in Tecton.
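For example, we can pull a sample of materialized feature values straight into a Spark dataframe. A minimal sketch, assuming a workspace named prod and a feature view named user_transaction_counts (both names are placeholders for illustration):

```python
from datetime import datetime

import tecton

# Look up a feature view in the workspace. The workspace and
# feature view names here are assumptions for illustration.
ws = tecton.get_workspace("prod")
fv = ws.get_feature_view("user_transaction_counts")

# Read back a window of materialized feature values and preview them.
fv.get_historical_features(
    start_time=datetime(2022, 1, 1),
    end_time=datetime(2022, 2, 1),
).to_spark().show(5)
```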

Generate a Training Dataset

We're generating a prediction context dataframe: a dataframe from your data store that Tecton will join with features from the feature service we created. At a minimum, we need to pass one or more join keys (the fields Tecton will use to match each row with features in its feature store) and a timestamp, so Tecton knows at what point in time the label column (isfraud in our case) was valid.

We can optionally pass additional columns that we expect to receive at inference time (below, the amount and type columns) so the model can train on them and use them for predictions.

In this case, we're querying our transactions database for the join key, the event timestamp, the isfraud label, and the amount and type columns.
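A minimal sketch of that query, assuming a Spark table named transactions and hypothetical column names (substitute your own schema):

```python
# Build the prediction context ("spine") dataframe.
# Table and column names here are assumptions; substitute your own schema.
context_df = spark.sql("""
    SELECT
        user_id,    -- join key
        timestamp,  -- when the isfraud label was valid
        amount,     -- extra column available at inference time
        type,       -- extra column available at inference time
        isfraud     -- label
    FROM transactions
""")
```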

Let's take a look at the resulting context dataframe, then hand it to Tecton to build the training set.
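A quick preview, followed by the point-in-time join against the feature service; a minimal sketch, assuming the service is named fraud_detection_feature_service in a workspace named prod (both names are assumptions):

```python
import tecton

# Preview the context (spine) dataframe built above.
context_df.show(5)

# Fetch the feature service we registered earlier.
# The workspace and feature service names are assumptions for illustration.
ws = tecton.get_workspace("prod")
fs = ws.get_feature_service("fraud_detection_feature_service")

# Join the spine against the feature store. Tecton performs a
# point-in-time correct join using the timestamp column.
training_df = fs.get_historical_features(
    spine=context_df,
    timestamp_key="timestamp",
).to_spark()
```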

Now we have a Spark dataframe that we can write out to the destination of our choice, including flat files, database tables, etc. Here we write to a Parquet file on S3; the command below assumes you've mounted an S3 bucket at /mnt/tecton using dbutils.fs.mount() or equivalent.
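A minimal sketch, assuming the training dataframe from the step above and a bucket mounted at /mnt/tecton (the file name is a placeholder):

```python
# Write the training dataset to the mounted S3 bucket as Parquet.
# The mount point and file name are assumptions; adjust to your setup.
training_df.write.mode("overwrite").parquet("/mnt/tecton/training_data.parquet")
```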

Next Step: Training

That's it! Check out training.py to see how you can load this training dataset in Databricks or in the Python environment of your choice, including Jupyter notebooks running on your local machine or SageMaker.
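Outside of Spark, the Parquet file can be loaded with plain pandas; a minimal sketch, assuming the file has been copied locally and pyarrow (or fastparquet) is installed:

```python
import pandas as pd

# Load the training dataset written earlier; the path is an assumption.
training_data = pd.read_parquet("training_data.parquet")
print(training_data.head())
```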