Creating SQL features with Layer

Eric Gitonga

Oct 13th 2021

Layer enables you to extract features from your dataset. These features are then fed into a machine learning model. There are two ways of extracting features in Layer:

  1. SQL
  2. Python

In this article, we will focus on using SQL.

We will look at:

  • Creating a Layer project
  • Connecting your data to Layer
  • Defining your dataset
  • Developing your features
  • Defining and training a model
  • Passing feature and model source code
  • Using generated entities in a Jupyter notebook

Before proceeding to these steps, first install Layer:

pip install layer-sdk

Then log into the system:

layer login

This will bring you to https://beta.layer.co, where you can sign in.

Now you can run the various Layer commands.

Creating a Layer Project

The first step is to set up the project directory. Clone the empty project from https://github.com/layerml/empty, then rename the `empty` directory to `fraud-detection`.

layer clone https://github.com/layerml/empty 
mv empty fraud-detection 
cd fraud-detection

That directory has the following file structure:

. 
├── .layer 
│ └── project.yml # Main configuration of Layer Project 
├── data 
│ ├── dataset 
│ │ ├── dataset.yaml # Declares where the source data is 
│ └── features 
│ └── dataset.yaml # Definition of the various features 
├── models 
│ └── model 
│ ├── model.py # Definition of our model 
│ ├── model.yaml # Training directives of our model 
│ └── requirements.txt # Environment config file 
└── README.md # Optional markdown file describing the project

Configuring your project

To configure your Layer project, go into the `.layer` directory and edit the `project.yml` configuration file so that it contains the information shown in the code below.

# Fraud Detection Project Example
# Project configuration file

apiVersion: 1

name: "Fraud Detection Project"

This file has the name of the project. In our case, the name is `Fraud Detection Project`.

Defining your Dataset

With the project configuration done, we can now go ahead and define our datasets. This is a two-stage process.

  1. Setting up a YAML file to integrate source data
  2. Setting up a YAML file to list the various features

Setting Up a YAML File to Integrate Source Data

Navigate to the `data` directory and rename the `dataset` directory to `transactions`.

cd data 
mv dataset transactions 
cd transactions

Edit the `dataset.yaml` file to resemble what is shown in the code below:

# Fraud Detection Project Example
# In this file, we will integrate a source table from the Layer Demo
# Snowflake database. This source table is the Synthetic Financial
# Datasets For Fraud Detection from Kaggle.
# See here: https://www.kaggle.com/ntnu-testimon/paysim1

apiVersion: 1

# Unique name of this dataset, used throughout this project to
# refer to this dataset
name: "transactions"

type: source

materialization:
  target: layer-public-datasets
  table_name: "transactions_log"

Under the `materialization` key, the `target` sub-key's value, `layer-public-datasets`, references the name of our Snowflake data source integration.


The `type` key indicates that this file is configuring a data `source`.

The `name` key with the value `transactions` indicates what name to use when referring to the data source during feature selection.

Setting Up a YAML File to List the Various Features

Navigate back to the `data` directory, rename the `features` directory to `transaction_features`, and go into it.

cd ../ 
mv features transaction_features 
cd transaction_features

From here, edit the `dataset.yaml` file to have the information in the code block below:

# Fraud Detection Project Example
# Any directory that includes a `dataset.yaml` will be treated as a dataset
# project. In this `yaml` file, we define the attributes of our
# featureset.

apiVersion: 1

type: featureset

name: "transaction_features"
description: "Transaction features extracted from transaction logs"

features:
  - name: error_balance_dest
    description: "Errors in the destination accounts"
    source: error_balance_dest.sql
  - name: error_balance_orig
    description: "Errors in the originating accounts"
    source: error_balance_orig.sql
  - name: is_fraud
    description: "Identifies whether the transaction is flagged as fraud or not"
    source: is_fraud.sql
  - name: new_balance_dest
    description: "Destination balance status after the transaction"
    source: new_balance_dest.sql
  - name: new_balance_orig
    description: "Originating balance status after the transaction"
    source: new_balance_orig.sql
  - name: old_balance_dest
    description: "Destination balance status before the transaction"
    source: old_balance_dest.sql
  - name: old_balance_orig
    description: "Originating balance status before the transaction"
    source: old_balance_orig.sql
  - name: type
    description: "Type of the transaction (TRANSFER=0, CASH_OUT=1, CASH_IN=2, DEBIT=3, PAYMENT=4)"
    source: type.sql

materialization:
  target: layer-public-datasets

The `type` key indicates that this file configures a `featureset`, the set of features to be used in the model.

We give the `featureset` a name, in this case `transaction_features`. You can also give it an optional description.

Next is the `features` key, under which we list all the features to be used in the model. Each feature is defined with the following sub-keys:

  • `name`: the name of the feature
  • `description`: a description of the feature
  • `source`: points to the SQL file that defines this feature
  • `sql`: holds an actual SQL query as its value; ideal for simple queries that do not need their own file

For each feature, you can use either the `source` or the `sql` sub-key, but not both.

Next is the `schema` key, which has a `primary_keys` sub-key. The primary key needs to be included in all the defined features, as it is used to join them. In this example, its value is `transactionId`.
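
To illustrate why every feature query must select the primary key, here is a minimal pandas sketch (purely illustrative, not how Layer assembles featuresets internally) of two per-feature tables being joined on `transactionId`:

import pandas as pd

# Hypothetical outputs of two of the feature queries
is_fraud = pd.DataFrame({"transactionId": [1, 2, 3], "is_fraud": [0, 1, 0]})
tx_type = pd.DataFrame({"transactionId": [1, 2, 3], "type": [0, 1, 4]})

# The featureset is assembled by joining the per-feature tables on the primary key
featureset = is_fraud.merge(tx_type, on="transactionId")
print(featureset)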

Finally, there is the `materialization` key with a `target` sub-key, here set to `layer-public-datasets`. This indicates where the features will be materialized.

By materialization, we mean writing the features out to a data source. The target name is just an indirection to the data source integration linked through the project configuration. In this case we write to a Snowflake data source, but you can also write to a BigQuery data source.

Having connected to the data source and defined the dataset, we can now go ahead and work on the features.

Developing Features

For this article, we shall write the features in ANSI-standard SQL.

Set Up Each SQL File

The next step is to set up each feature that will be used in the model. For the fraud detection example, there are seven features that we'll use to predict whether a particular transaction is fraudulent. The target variable is the column that flags a transaction as fraudulent or not.

Five of these features already exist in the data table. We'll have to generate the remaining two features from the existing balance features and one other column.

Of the five existing features, one is categorical, so we have to encode it into numerical values, since the model we will use only accepts numerical input.

We'll also modify the four numerical features slightly to make the data more meaningful for the model.

(Screenshot: a preview of the source `transactions` table.)

From the table above, the five features we shall use are:

  • NEWBALANCEDEST
  • NEWBALANCEORIG
  • OLDBALANCEDEST
  • OLDBALANCEORG
  • TYPE

The `ISFRAUD` column is what shall be used as the target variable.

To modify the four numerical features and generate the two additional features, we’ll use the `AMOUNT` column.

In all the files, the two main SQL clauses used are `SELECT` and `FROM`. The keyword `AS` is also used.

  • `SELECT` identifies the data to be retrieved from the data source.
  • `FROM` identifies which data source to retrieve the data from.
  • `AS` names the new column into which the retrieved data will be placed.

Let us now set up the SQL files for each of the five features already in place, as well as the target variable. These files should be created in the same directory as the `dataset.yaml` file we walked through in the previous section, in this instance `data/transaction_features/`.

ISFRAUD

For the target variable, we can just have the SQL file return the values in the data source as is, since it is already in an appropriate format. An entry of `1` signifies a fraudulent transaction. An entry of `0` signifies a regular transaction.

The SQL code to be executed is:

SELECT transactionId,
       isFraud AS is_fraud
FROM transactions

This selects the `transactionId` and the fraud flag of each row from the transactions table and returns a new table with two columns, `transactionId` and `is_fraud`, containing the values drawn from the transactions table.

Create a new SQL file with the code above and save it as `is_fraud.sql`.

Each of the SQL files we create needs to include `transactionId` in its `SELECT` statement.

Existing Numerical Features

The next set of features to set up will be the four numerical features we identified before. The structure for these is similar, so we’ll list the code for all four files then describe in general what it does.

SELECT transactionId,
       IF(oldbalanceDest == 0 AND newbalanceDest == 0 AND amount != 0, -1,
          newbalanceDest) AS new_balance_dest
FROM transactions

SELECT transactionId,
IF(oldbalanceOrg == 0 AND newbalanceOrig == 0 AND amount != 0, -1,
newbalanceOrig) AS new_balance_orig
FROM transactions

SELECT transactionId,
IF(oldbalanceDest == 0 AND newbalanceDest == 0 AND amount != 0, -1,
oldbalanceDest) AS old_balance_dest
FROM transactions

SELECT transactionId,
IF(oldbalanceOrg == 0 AND newbalanceOrig == 0 AND amount != 0, -1,
oldbalanceOrg) AS old_balance_orig
FROM transactions

In each of these four cases, we select `transactionId` as usual, together with the specific balance column. But instead of recording the value in that column directly, we modify zero values slightly in certain cases, as explained below.

Take `NEWBALANCEDEST` as an example. The second line:

IF(oldbalanceDest == 0 AND newbalanceDest == 0 AND amount != 0, -1,
newbalanceDest) AS new_balance_dest

runs a test to determine what value to assign to that column (`-1` or the actual value, in this case `newbalanceDest`). The test is premised on the fact that if both the old and new destination balances are zero, yet the amount is not zero, then chances are the transaction is fraudulent. It might not necessarily be so, and once the other features come into play, that transaction may still end up being marked as genuine.

To make it more useful for a machine-learning algorithm to detect fraud, we replace the `0` with `-1`.

If the condition is not met, the entry is assigned the actual value obtained from the transactions table.
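
For local sanity-checking, the same logic can be expressed in pandas. This is only an illustrative sketch, assuming the raw PaySim column names used above:

import numpy as np
import pandas as pd

def new_balance_dest(df: pd.DataFrame) -> pd.Series:
    # Both destination balances are zero, yet money supposedly moved
    suspicious = (df["oldbalanceDest"] == 0) & (df["newbalanceDest"] == 0) & (df["amount"] != 0)
    # Replace the uninformative 0 with -1 in those rows, otherwise keep the real value
    return pd.Series(np.where(suspicious, -1, df["newbalanceDest"]),
                     index=df.index, name="new_balance_dest")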

For each of those four features, save the code into the following files:

  • `new_balance_dest.sql`
  • `new_balance_orig.sql`
  • `old_balance_dest.sql`
  • `old_balance_orig.sql`

Generated Numerical Features

Next, we generate two additional numerical features from existing columns. The code below does this:

SELECT transactionId,
       oldbalanceDest + amount - newbalanceDest AS errorBalanceDest
FROM transactions

SELECT transactionId,
newbalanceOrig + amount - oldbalanceOrg AS errorBalanceOrig
FROM transactions

Each of these generates a value recorded as an error value for the destination and the originating account of each transaction, respectively. This provides additional features the machine learning algorithm can use to determine whether a transaction is genuine or not.

In this example, we see that SQL column expressions can include arithmetic: addition and subtraction are applied to values from different columns to produce the values of a new column.
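
The same arithmetic can be sanity-checked locally in pandas; again, an illustrative sketch assuming the raw PaySim column names:

import pandas as pd

def error_balances(df: pd.DataFrame) -> pd.DataFrame:
    # Same arithmetic as the two SQL queries above
    return pd.DataFrame({
        "errorBalanceDest": df["oldbalanceDest"] + df["amount"] - df["newbalanceDest"],
        "errorBalanceOrig": df["newbalanceOrig"] + df["amount"] - df["oldbalanceOrg"],
    })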

For each of those two generated columns, save the code into the following files:

  • `error_balance_dest.sql`
  • `error_balance_orig.sql`

TYPE

Finally we have the `TYPE` column. This is a categorical column with five different categories. We encode these categories into numerical values using the code below:

SELECT transactionId,
       CASE
           WHEN type = "TRANSFER" THEN 0
           WHEN type = "CASH_OUT" THEN 1
           WHEN type = "CASH_IN" THEN 2
           WHEN type = "DEBIT" THEN 3
           WHEN type = "PAYMENT" THEN 4
           ELSE -1
       END AS type
FROM transactions

The `CASE ... WHEN ... THEN` construct tests a series of conditions and, when one is met, assigns the corresponding value. In our example, the first condition checks whether `type` equals `TRANSFER`; if so, the value `0` is assigned to that entry in the table. If not, the next condition is tested.

If none of the conditions are satisfied, the `ELSE` branch is executed and the value `-1` is assigned to that entry. If there were no `ELSE` part and all the conditions were false, the entry would be set to `NULL`.

`END` closes the `CASE` expression.

For each row, once a condition evaluates to true, the remaining conditions are skipped and evaluation moves on to the next row.
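
For comparison, the same encoding can be expressed in Python; a small illustrative sketch, not part of the Layer project itself:

import pandas as pd

# Category-to-code mapping, mirroring the CASE expression above
TYPE_CODES = {"TRANSFER": 0, "CASH_OUT": 1, "CASH_IN": 2, "DEBIT": 3, "PAYMENT": 4}

def encode_type(types: pd.Series) -> pd.Series:
    # Unknown categories fall through to -1, like the ELSE branch
    return types.map(TYPE_CODES).fillna(-1).astype(int)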

Save that code into the file `type.sql`.

This marks the end of setting up the features from the source data using SQL files. The following screenshot shows the table generated after executing the code in these SQL files.

(Screenshot: the resulting feature table.)

Defining and Training a Model

The next step is to define the model and then train it. Navigate to the `models` directory, then rename the directory inside it to something more descriptive.

cd ../../models 
mv model fraud_detection_model 
cd fraud_detection_model

In the `fraud_detection_model` directory, edit the `model.yaml` file. This is the file that defines the model.

# Fraud Detection Project Example
# Any directory that includes a `model.yaml` will be treated as an ML model
# project. In this `yaml` file, we define the attributes of our model.

apiVersion: 1

# Name and description of our model
name: "fraud_detection_model"
description: "Detects fraudulent transactions"

training:
  name: "fraud_detection_model_training"
  description: "Fraud Detection Model Training"

  # The source model definition file with a `train_model` method
  entrypoint: model.py

  # File listing the required Python libraries with their correct versions
  environment: requirements.txt

The `name` key is required, as this is what is used to refer to the model.

The `description` key is optional. The description will appear on the Layer Model Catalog.

The `training` key is required. This describes how the model should be trained. Within this key, there are the following sub-keys:

  • `name`
  • `description`
  • `entrypoint` – This defines the script that executes the model training. For this instance, it points to `model.py`.
  • `environment` – This points to the `requirements.txt` file which lists the required libraries for this model training.

The next file is the `model.py` file, which contains the actual training script. This is where you develop, train, and evaluate your model.

"""Fraud Detection Project Example

This file demonstrates how we can develop and train our model by using the

`transaction_features` we've developed earlier. Every ML model project

should have a definition file like this one.

"""

from typing import Any

from sklearn.model_selection import train_test_split

from sklearn.metrics import average_precision_score

import xgboost as xgb

from layer import Featureset, Train

def train_model(train: Train, tf: Featureset("transaction_features")) -> Any:
"""Model train function

Returns:
model: A trained model object
"""

# We create the training and label data
train_df = tf.to_pandas()
X = train_df.drop(["transactionId", "is_fraud"], axis=1)
Y = train_df["is_fraud"]

random_state = 13
test_size = 0.2
train.log_parameter("random_state", random_state)
train.log_parameter("test_size", test_size)
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=test_size,
random_state=random_state)

# Here we register input & output of the train. Layer will use
# this registers to extract the signature of the model and calculate
# the drift
train.register_input(trainX)
train.register_output(trainY)

max_depth = 3
objective = 'binary:logistic'
train.log_parameter("max_depth", max_depth)
train.log_parameter("objective", objective)

# Train model
param = {'max_depth': max_depth, 'objective': objective}
dtrain = xgb.DMatrix(trainX, label=trainY)
model_xg = xgb.train(param, dtrain)

dtest = xgb.DMatrix(testX)
preds = model_xg.predict(dtest)

# Model Evaluation
auprc = average_precision_score(testY, preds)
train.log_metric("auprc", auprc)

# We return the model
return model_xg

This Python code uses the Layer SDK to define a `train_model` function that takes a `train` argument as well as one or more `Featureset` arguments. The featureset used here is the one we defined using the SQL files.
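
Since the booster is trained with the `binary:logistic` objective, its predictions are fraud probabilities. As a rough, purely illustrative sketch (outside Layer), scoring new rows with a booster like the one returned above could look like this, assuming `rows` holds the same feature columns used in training:

import pandas as pd
import xgboost as xgb

def score_rows(model_xg: xgb.Booster, rows: pd.DataFrame) -> pd.Series:
    # rows must contain the same feature columns used during training
    dmatrix = xgb.DMatrix(rows)
    return pd.Series(model_xg.predict(dmatrix), index=rows.index, name="fraud_probability")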

The last file of interest is the `requirements.txt` file, which lists the required packages and their versions.

scikit-learn>=0.18
xgboost>=1.2.0

Everything is now in place to generate the required entities.

Passing Feature and Model Source Code

To build the entities, first log into the platform, then execute the `layer start` command.

layer login
layer start

Doing this sets off the generation of the entities.


Once completed, go to the Layer dashboard to deploy the model.


You can now copy the deployed API’s URL to use in your ML application. Use `curl` to access the API, feed in the values for the various features, and let it make the prediction for you.

curl --header "Content-Type: application/json; format=pandas-records" \
     --request POST \
     --data '[{"errorBalanceDest":181, "errorBalanceOrig":0, "new_balance_dest":-1, "new_balance_orig":0, "old_balance_dest":1, "old_balance_orig":0, "type":0}]' \
     $MODEL_DEPLOYMENT_URL

Running this returns a number between `0` and `1`, which is the probability that the transaction is fraudulent. You can set a threshold that divides the two classes. In this instance, we set the threshold to `0.5`, so that values greater than that indicate a fraudulent transaction, and values equal to or lower than it indicate a genuine transaction.

The values in the `curl` example above return a probability of 0.96, so this specific transaction is flagged as fraudulent.
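
As a quick sketch of that decision rule in Python (the names here are illustrative, not part of the Layer SDK):

FRAUD_THRESHOLD = 0.5  # the cut-off discussed above

def label_transaction(fraud_probability: float) -> str:
    # Probabilities above the threshold are treated as fraud
    return "Fraudulent Transaction" if fraud_probability > FRAUD_THRESHOLD else "Genuine Transaction"

print(label_transaction(0.96))  # -> Fraudulent Transaction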

You can also explore the model as well as the Data Catalog to see what they contain.


Using Generated Entities in a Jupyter Notebook

The various entities that we have generated can all be viewed in a Jupyter notebook.

To do this, first import `layer`, then log in to the platform:

import layer

layer.login()

You can now load the raw dataset and examine it.

df = layer.get_dataset("transactions")
df = df.to_pandas()
df.head()


The various entities are accessed as follows:

Featureset

layer.get_featureset("transaction_features")

Features

layer.get_features(["transaction_features"])


When loading features, make sure to pass the name in a list, as shown in the code snippet above. Not doing so will result in an error.

Model

layer.get_model("fraud_detection_model")

API Implementation

To finish, we'll look at a small client for the deployed API, which returns whether a transaction is fraudulent or genuine. The values are hard-coded here for demonstration purposes; a production system would accept the values as input dynamically and produce predictions as needed.

import requests
from requests.structures import CaseInsensitiveDict
import json
import re


def get_transaction_details(errorBalanceDest,
                            errorBalanceOrig,
                            new_balance_dest,
                            new_balance_orig,
                            old_balance_dest,
                            old_balance_orig,
                            t_type):
    """
    This function sets up the dictionary of features as
    keys and their values.
    """
    data_dict = {
        "errorBalanceDest": errorBalanceDest,
        "errorBalanceOrig": errorBalanceOrig,
        "new_balance_dest": new_balance_dest,
        "new_balance_orig": new_balance_orig,
        "old_balance_dest": old_balance_dest,
        "old_balance_orig": old_balance_orig,
        "type": t_type
    }
    return data_dict


url = "https://c7d0aa0f-5e83-4140-958a-7f895286708a.inferences.beta.layer.co/invocations"

headers = CaseInsensitiveDict()
headers["Content-Type"] = "application/json; format=pandas-records"

data = json.dumps([get_transaction_details(8, 0, -1, 0, -1, 0, 0)])

resp = requests.post(url, headers=headers, data=data)

# Pull the first floating-point number out of the response body
NUMBER_REGEX = re.compile(r"(?!\])-*[0-9]*\.[0-9]*")
response = float(re.findall(NUMBER_REGEX, resp.text[1:])[0])

if response > 0.5:
    print("Fraudulent Transaction", response)
else:
    print("Genuine Transaction", response)

Conclusion

In this article, we explored the use of SQL in developing features to be fed into a machine learning model. We worked through:

  • setting up a Layer project,
  • defining the data sources,
  • defining and developing required features using SQL,
  • developing and deploying the model, and
  • displaying generated entities in a Jupyter notebook.

We finished off by using the generated model to make a prediction.

Now you should be able to set up your own machine learning project to do the same.
