Train a Model

In this notebook, we train an XGBoost classifier to predict whether a transaction is fraudulent. You can run it in any Python notebook environment: your local machine, SageMaker, or Databricks. The training dataframe combines stored Tecton features about our users with the transactional information we added from our fraud_transactions_pq table.

We use pure Python here, so there is no need to run this in a Spark environment.

We'll assume you've already downloaded the training dataframe you generated in step 1 to a suitable location on your local machine, but you can just as easily load it from S3 using the s3fs library.

First, we'll load the training dataframe we created from Tecton in step 1.
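As a minimal sketch of the loading step, the helper below reads the dataframe with pandas. The function name `load_training_df` and the file formats are assumptions, since the format you saved in step 1 is up to you. pandas passes `s3://` URLs through fsspec, so the same call works for S3 once s3fs is installed.

```python
import pandas as pd


def load_training_df(path: str) -> pd.DataFrame:
    """Load the training dataframe from a local path or an s3:// URL."""
    # Hypothetical helper: assumes the step 1 output was saved as CSV or Parquet.
    # s3:// URLs are resolved through fsspec, which requires s3fs to be installed.
    if path.endswith(".csv"):
        return pd.read_csv(path)
    # Reading Parquet needs an engine such as pyarrow.
    return pd.read_parquet(path)
```

Call it with wherever you stored the file, for example `load_training_df("training_data.parquet")`, where the file name is a placeholder.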

We'll log our model to MLflow (bundled with Databricks) to make it easier to deploy to production, but you can use any model registry you like.

Scoring

We'll score our model against the test set: data it hasn't seen during training but for which we still know the correct label. No surprises here: the model does very well at predicting ok transactions but not fraudulent ones. There are a couple of likely reasons: ok transactions may look more similar to each other than fraudulent transactions do, and there are comparatively few fraudulent transactions in the training dataset (a class imbalance problem).
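To see this gap between the two classes, it helps to compute precision and recall for each class separately instead of relying on overall accuracy. The sketch below does that in plain Python. The helper names and the toy labels are invented for illustration and are not results from this model. The second helper computes the ratio of negatives to positives, which is a commonly suggested starting value for XGBoost's `scale_pos_weight` parameter when classes are imbalanced.

```python
from collections import Counter


def per_class_report(y_true, y_pred, labels=(0, 1)):
    """Return {label: (precision, recall)} computed from paired label lists."""
    pairs = Counter(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = pairs[(label, label)]
        predicted = sum(n for (_, p), n in pairs.items() if p == label)
        actual = sum(n for (t, _), n in pairs.items() if t == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        report[label] = (precision, recall)
    return report


def imbalance_ratio(y_true, positive=1):
    """Number of negatives per positive example."""
    positives = sum(1 for t in y_true if t == positive)
    negatives = len(y_true) - positives
    return negatives / positives if positives else float("inf")


# Toy labels, invented for illustration: 1 = fraudulent, 0 = ok.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
report = per_class_report(y_true, y_pred)
```

In the toy data, one of the two fraudulent transactions is missed, so recall for the fraud class is only half even though nine of ten predictions are correct.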