Anomaly detection with Layer

Nov 12th 2021

Brain John

Data Scientist

Anomaly detection is a data mining process that identifies unusual patterns that don’t conform to expected behavior. These non-conforming patterns have numerous equivalent terms such as anomalies, outliers, noise, novelties, exceptions, discordant observations, and aberrations.

This article will look at how you can build and deploy an anomaly detection model with Layer.

Anomaly detection with Layer

Layer is a declarative MLOps platform that helps data teams build machine learning applications from code. Models trained with Layer can be deployed and used to run predictions immediately. Layer also stores the models and the features used to build them, which makes it easier to reuse both in future projects.

Connecting your data

The first step in building models with Layer is to connect to your external data source. In Layer, this is done using integrations. You can access integrations in the organization settings of your Layer account.

Currently, Layer supports two data source integrations, namely:

  • Snowflake
  • BigQuery

Ethereum fraud detection model with Layer

Since the emergence of Bitcoin in 2009, the cryptocurrency market has grown in popularity and drawn much attention from investors, innovators, and the general public. Due to the decentralized and anonymous nature of cryptocurrency exchanges, they are vulnerable to fraudulent activities.

Cryptocurrency transactions are instant, portable, and international in reach, which has allowed fraud to proliferate in various forms, such as Ponzi schemes, tax avoidance, money laundering, and bribery. This use case builds a predictive system that identifies unethical and fraudulent behavior in Ethereum transactions.

We will use the BigQuery integration to build our Layer fraud detection project.

Loading data into BigQuery

Data can be loaded into BigQuery in several ways. The dataset on Kaggle is in CSV format, and we will load it via Google Cloud Storage (specifically, a GCS bucket). The procedure is as follows:

  1. Create a Google Cloud project.
  2. Set access control for the project using IAM by creating a service account.
  3. Create and download the service account key (JSON).
  4. Create a GCS storage bucket to organize and control access to your data.
  5. Upload the use case dataset to the storage bucket.
  6. Navigate to BigQuery on the GCP console.
  7. In the Explorer panel showing your project, create a BigQuery dataset.
  8. Create a BigQuery table (in our case, from Google Cloud Storage) and set a unique table name (enable auto-detect schema and set the input parameters).

Note: If you followed the steps properly, you should now see the new table under your BigQuery dataset in the Explorer panel (the only difference will be the naming).
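If you prefer to script step 8 rather than use the console, the same load can be done with the google-cloud-bigquery Python client. This is a minimal sketch; the BigQuery dataset name and the bucket URI below are placeholders:

# pip install google-cloud-bigquery
from google.cloud import bigquery

# Placeholder identifiers -- replace the dataset name and bucket URI with your own
table_id = "layer-ethereum-fraud-detection.your_dataset.transactions_table"  # project.dataset.table
gcs_uri = "gs://your-bucket/transaction_dataset.csv"

client = bigquery.Client()  # Uses GOOGLE_APPLICATION_CREDENTIALS for authentication

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # Skip the CSV header row
    autodetect=True,       # Let BigQuery infer the schema
)

load_job = client.load_table_from_uri(gcs_uri, table_id, job_config=job_config)
load_job.result()  # Wait for the load job to finish

table = client.get_table(table_id)
print(f"Loaded {table.num_rows} rows into {table_id}")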


Configuring the Layer integration

Layer integrations define how external data is connected to Layer. Navigate to the Layer integrations UI to start.


Note: There is a default Snowflake integration for the open-source Layer examples.

We will be creating our integration for this project using BigQuery as follows:

  • Select the BigQuery option.


  • Configure the BigQuery connection. The connection configuration requires two values, `Project` and `Dataset`, which correspond to the GCP project name and the BigQuery dataset that holds the table created from our GCS bucket.


  • Next, paste the contents of the service account key JSON file generated in step 3 of the "Loading data into BigQuery" section. This completes the connection to BigQuery. The service key JSON file contains key-value pairs as shown below (with dummy values):

{
  "type": "service_account",
  "project_id": "layer-ethereum-fraud-detection",
  "private_key_id": "********dummykey",
  "private_key": "-----BEGIN PRIVATE KEY-----\n********dummykey\n-----END PRIVATE KEY-----\n",
  "client_email": "layer-ethereum-fraud-detection@appspot.gserviceaccount.com",
  "client_id": "******dummyid",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/layer-ethereum-fraud-detection%40appspot.gserviceaccount.com"
}


  • Finally, name the integration so it can be referenced unambiguously from our Layer project.


Note: Avoid spaces in the required naming fields; use underscores or hyphens instead.


Create your Layer anomaly detection project

A Layer project is a directory containing YAML configuration files (for the project, data, featuresets, and model), model definitions (currently Python only), and feature definitions (currently Python or SQL only).
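For orientation, here is roughly how the project will be laid out by the end of this article. The file names come from the sections below; the grouping folder for the featureset is illustrative:

anomaly_detection
├── .layer
│   └── project.yaml        # created by `layer init` (see below)
├── fraud_detection_features
│   ├── dataset.yaml
│   └── *.sql               # one file per SQL feature
└── models
    └── fraud_detection_model
        ├── model.py
        ├── model.yaml
        └── requirements.txt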

First, we need to create an isolated environment for the dependencies unique to this project.

Create a new development folder. In your terminal, run:

mkdir anomaly_detection

Create a new Python virtual environment. If you are using Anaconda, you can run the following command:

conda create -n env python=3.8

Activate the environment using:

conda activate env

If you are using a standard distribution of Python, create a new virtual environment by running the command below:

python -m venv env

To activate the new environment on a Mac or Linux computer, run:

source env/bin/activate

Regardless of the method you used to create and activate the virtual environment, your prompt should now look like the following:

(env) $
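With the environment active, install the Layer SDK and log in so that the `layer` commands below will work. The package name and login command reflect the Layer v1 SDK at the time of writing and are assumptions to verify against the current docs:

pip install layer-sdk
layer login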

Next, to create a new Layer project, there are two approaches:

  1. Initialize an empty directory: Upon successful installation of and login with the Layer SDK, run the following command in an empty directory to create a new Layer project:

layer init

This command creates a hidden `.layer` subdirectory in the current working directory, which contains all the necessary metadata of the Layer project in a `project.yaml` configuration file.

  2. Clone the empty Layer project repository: Layer provides an empty project with a recommended folder structure. It can be cloned by running:

layer clone https://github.com/layerml/empty

The content of our `project.yaml` configuration file is:

# Project configuration file
apiVersion: 1

# Name of your project
name: "Layer Anomaly Detection (Ethereum fraud detection)"

Defining your datasets and features

Now, we need to configure our datasets via a `dataset.yaml` file as follows:

# Fraud detection use case

apiVersion: 1 # Optional field, 1 by default.

type: dataset

# Required field. Unique name used within this project to refer to this dataset.
name: "fraud_detection_data"

# Required field that points to the integration where this dataset lives.
materialization:
  target: fraud-detection-dataset # Name of the integration
  table_name: transactions_table # Name of the BigQuery table containing the dataset

# Optional field. Reference description of the use case dataset for this project.
description: "Ethereum fraud detection"
Next, we need to define our featureset in a `dataset.yaml` configuration file.

apiVersion: 1

type: featureset

# Required field. Name of the featureset, used to identify it in the Data Catalog.
name: "fraud_detection_features"

# Optional field.
description: "Ethereum transaction features of the fraud detection dataset"

# Required field listing all the desired features with their respective name, description, and source code.
features:
  - name: flag
    description: "Whether the transaction is flagged as fraud or not"
    source: flag.sql
  - name: avg_time_sent_txn
    description: "Average time between sent transactions for the account, in minutes"
    source: avg_time_sent_txn.sql
  - name: avg_time_rcv_txn
    description: "Average time between received transactions for the account, in minutes"
    source: avg_time_rcv_txn.sql
  - name: time_diff
    description: "Time difference between the first and last transaction"
    source: time_diff.sql
  - name: sent_txn
    description: "Total number of sent normal transactions"
    source: sent_txn.sql
  - name: rcv_txn
    description: "Total number of received normal transactions"
    source: rcv_txn.sql
  - name: no_created_contracts
    description: "Total number of created contract transactions"
    source: no_created_contracts.sql
  - name: max_val_rcv
    description: "Maximum value in Ethereum ever received"
    source: max_val_rcv.sql
  - name: avg_val_rcv
    description: "Average value in Ethereum ever received"
    source: avg_val_rcv.sql
  - name: avg_val_sent
    description: "Average value of Ethereum ever sent"
    source: avg_val_sent.sql
  - name: total_eth_sent
    description: "Total Ethereum sent from the account address"
    source: total_eth_sent.sql
  - name: total_eth_balance
    description: "Total Ethereum balance following enacted transactions"
    source: total_eth_balance.sql
  - name: ERC20_total_eth_rcv
    description: "Total ERC20 token received transactions in Ethereum"
    source: ERC20_total_eth_rcv.sql
  - name: ERC20_total_eth_sent
    description: "Total ERC20 token sent transactions in Ethereum"
    source: ERC20_total_eth_sent.sql
  - name: ERC20_total_eth_tfr
    description: "Total ERC20 token transfers to other contracts in Ether"
    source: ERC20_total_eth_tfr.sql
  - name: ERC20_unique_sent_addr
    description: "Number of ERC20 token transactions sent to unique account addresses"
    source: ERC20_unique_sent_addr.sql
  - name: ERC20_unique_tok_rcv
    description: "Number of unique ERC20 tokens received"
    source: ERC20_unique_tok_rcv.sql

materialization:
  target: fraud-detection-dataset

With the featureset configured, the next step is to define the features themselves. Features can be defined in two ways:

  • SQL features: You can use SQL queries to define the transformations on your dataset that extract features.
  • Python features: For advanced feature extraction, you can write Python scripts with the help of Python libraries (a sketch follows this list).
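This project only uses SQL features, but for reference, a Python feature is a script referenced from the featureset YAML via `source:`. The sketch below is a rough illustration only; the `build_feature` entry point, the `Dataset` import, and the derived feature are assumptions about the Layer v1 SDK and this dataset, and may differ from the actual API:

# Hypothetical Python feature (not used in this project).
# Assumes the Layer SDK passes the source dataset into a build_feature entry point.
from typing import Any

from layer import Dataset  # assumption: Dataset is importable from the layer package


def build_feature(sdf: Dataset("fraud_detection_data")) -> Any:
    # Convert the source dataset to a pandas DataFrame
    df = sdf.to_pandas()

    # Example derived feature: ratio of received to sent transactions per account
    df["rcv_to_sent_ratio"] = df["Received_Tnx"] / df["Sent_tnx"].replace(0, 1)

    # Keep the entity key (INDEX) so Layer can join the feature back into the featureset
    return df[["INDEX", "rcv_to_sent_ratio"]]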

In this project, we use the SQL approach to extract the features. The featureset directory contains the following files:

├── ERC20_total_eth_rcv.sql
├── ERC20_total_eth_sent.sql
├── ERC20_total_eth_tfr.sql
├── ERC20_unique_sent_addr.sql
├── ERC20_unique_tok_rcv.sql
├── avg_time_rcv_txn.sql
├── avg_time_sent_txn.sql
├── avg_val_rcv.sql
├── avg_val_sent.sql
├── dataset.yaml
├── flag.sql
├── max_val_rcv.sql
├── no_created_contracts.sql
├── rcv_txn.sql
├── sent_txn.sql
├── time_diff.sql
├── total_eth_balance.sql
└── total_eth_sent.sql

Let’s take a look at one of the SQL files:

For `ERC20_total_eth_rcv.sql`:

SELECT INDEX,
       _ERC20_total_Ether_received as ERC20_total_eth_rcv
FROM fraud_detection_data

The other feature definitions can be seen on my GitHub repo.

Model development in Layer

Machine learning models are first-class entities in Layer; they are built as an integral part of a Layer project, and the different versions of a model are stored in the Layer Model Catalog.

The directory structure for the model is shown below:

models
└── fraud_detection_model
    ├── model.py
    ├── model.yaml
    └── requirements.txt

The `model.yaml` file is used for the model configuration. The content is shown below:

apiVersion: 1

# Required.
name: fraud_detection_model
type: model

# Optional.
description: "Random forest model to detect fraudulent transactions"

# Required. Determines how to train this model.
training:
  name: "fraud_detection_model_training"
  description: "Fraud Detection Model Training"

  # The source model definition file
  entrypoint: model.py

  # File listing the required Python libraries with their correct versions
  environment: requirements.txt

The `model.py` file is the model training script referenced by the `entrypoint` key in `model.yaml`. The training code is outlined below:

# Importing libraries
from typing import Any

import pandas as pd  # For data wrangling
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split  # For splitting data
from sklearn.metrics import roc_auc_score  # Performance metric
from sklearn.ensemble import RandomForestClassifier  # Machine learning model to be utilized
from layer import Featureset, Train


def train_model(train: Train, tf: Featureset("fraud_detection_features")) -> Any:
    data_df = tf.to_pandas()

    X = data_df.drop(["INDEX", "flag"], axis=1)
    y = data_df["flag"]

    # Split data and log parameters
    random_state = 45
    test_size = 0.2
    # Save the parameters of the training run so they are viewable in the Layer Model Catalog UI
    train.log_parameter("random_state", random_state)
    train.log_parameter("test_size", test_size)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)

    # Scaling the training features
    sc = StandardScaler()
    X_train_sc = sc.fit_transform(X_train)
    X_train_sc = pd.DataFrame(X_train_sc, columns=X_train.columns)

    # For simpler naming, rename the scaled training set
    X_train = X_train_sc

    # Register the model signature, which can then be used to determine the data lineage of this model
    train.register_input(X_train)
    train.register_output(y_train)

    # Instantiating the machine learning model
    rf_clf = RandomForestClassifier()
    # Fitting the model to the data
    rf_clf.fit(X_train, y_train)

    # Transform the test features
    X_test_sc = sc.transform(X_test)
    X_test = X_test_sc

    # Making predictions
    y_pred = rf_clf.predict(X_test)

    # Track performance
    score = roc_auc_score(y_test, y_pred)

    # Save the metric of the training run so it is viewable in the Layer Model Catalog UI
    train.log_metric("roc_auc_score", score)

    # Return the model
    return rf_clf

Next, the libraries used in the training script have to be listed in the `requirements.txt` file so that Layer can install them when preparing the environment for the training job. In this project, the requirements file contains:

scikit-learn>=0.24
pandas>=1.3.4

Let’s run our Layer project:

layer start


Once the project has executed successfully, let's inspect the Layer Data Catalog and Model Catalog.


From the data catalog UI, the data features and the data profile can be seen.


From the Model Catalog UI, the performance, model attributes, and signature can be seen. Recall that in the training script, the ROC AUC score was chosen as the metric because of the class imbalance in the dataset.

We can assess the performance of the model by the area under the ROC curve (AUC).

As a rule of thumb:

  • 1.0–0.9 = Excellent
  • 0.9–0.8 = Good
  • 0.8–0.7 = Fair
  • 0.7–0.6 = Poor
  • 0.6–0.5 = Fail

Our model yields a score of approximately 0.93.
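As a small illustration of the rule of thumb above (the thresholds and labels are exactly the ones listed), a helper could map a score to its qualitative band:

def auc_band(score: float) -> str:
    """Map an ROC AUC score to the qualitative bands listed above."""
    if score >= 0.9:
        return "excellent"
    if score >= 0.8:
        return "good"
    if score >= 0.7:
        return "fair"
    if score >= 0.6:
        return "poor"
    return "fail"

print(auc_band(0.93))  # -> "excellent", matching our model's score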

Deploy the anomaly detection model

Deployment is required before the model can serve predictions. Click the `+ Deploy` button at the top right of the Model Catalog. This deploys your model to the Layer cloud and might take a minute or two. Once deployed, the button turns green. Click the 🔗 icon to copy the deployment URL.


Making predictions on the anomaly detection model

The model has now been deployed for real-time inference. Let's test it with a sample input (make sure to replace $MODEL_DEPLOYMENT_URL with the URL you copied):

curl --header "Content-Type: application/json; format=pandas-records" \
  --request POST \
  --data '[{
    "Avg_min_between_sent_tnx": 1,
    "Avg_min_between_received_tnx": 5000,
    "Time_Diff_between_first_and_last__Mins_": 199949,
    "Sent_tnx": 0,
    "Received_Tnx": 40,
    "Number_of_Created_Contracts": 1,
    "max_value_received_": 1.10,
    "avg_val_received": 0.24,
    "avg_val_sent": 0.1,
    "total_Ether_sent": 0,
    "total_ether_balance": 20.75,
    "_ERC20_total_Ether_received": 0.45,
    "_ERC20_total_ether_sent": 0,
    "_ERC20_total_Ether_sent_contract": 0,
    "_ERC20_uniq_sent_addr": 1,
    "_ERC20_uniq_rec_token_name": 0
  }]' \
  $MODEL_DEPLOYMENT_URL

The API returns either a [1] or a [0], predicting whether a transaction is fraudulent or not.
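Equivalently, you can call the deployed endpoint from Python with the `requests` library; the payload mirrors the curl example above, and the deployment URL is a placeholder:

import requests

MODEL_DEPLOYMENT_URL = "https://..."  # placeholder: paste the URL copied from the Model Catalog

payload = [{
    "Avg_min_between_sent_tnx": 1,
    "Avg_min_between_received_tnx": 5000,
    "Time_Diff_between_first_and_last__Mins_": 199949,
    "Sent_tnx": 0,
    "Received_Tnx": 40,
    "Number_of_Created_Contracts": 1,
    "max_value_received_": 1.10,
    "avg_val_received": 0.24,
    "avg_val_sent": 0.1,
    "total_Ether_sent": 0,
    "total_ether_balance": 20.75,
    "_ERC20_total_Ether_received": 0.45,
    "_ERC20_total_ether_sent": 0,
    "_ERC20_total_Ether_sent_contract": 0,
    "_ERC20_uniq_sent_addr": 1,
    "_ERC20_uniq_rec_token_name": 0,
}]

response = requests.post(
    MODEL_DEPLOYMENT_URL,
    json=payload,
    headers={"Content-Type": "application/json; format=pandas-records"},
)
print(response.json())  # e.g. [0] for a normal transaction, [1] for a fraudulent one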

Import the features into a Jupyter notebook

We can also use the generated features, datasets, and models in a local notebook. Let's look at how you can reuse the features after training a machine learning model with Layer:
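Below is a rough sketch of what this looks like. The `layer.get_featureset` and `layer.get_model` helpers (and the `get_train()` accessor) are assumptions based on the Layer v1 SDK of the time, so check the current documentation for the exact API:

# Hypothetical notebook usage (API names are assumptions)
import layer

layer.login()  # authenticate against your Layer account

# Pull the featureset built in this project into a pandas DataFrame
features_df = layer.get_featureset("fraud_detection_features").to_pandas()
features_df.head()

# Fetch the trained model object from the Model Catalog for local reuse
rf_clf = layer.get_model("fraud_detection_model").get_train()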


Conclusion

In this article, you have seen how to train and deploy machine learning models in Layer. Specifically, we have covered:

  • Setting up your Layer project.
  • Creating features in SQL.
  • Defining models in Layer.
  • Deploying and making inferences.