Feature engineering guide

Oct 13th 2021


Kurtis Pykes

Machine Learning Engineer

Machine learning is a branch of artificial intelligence that focuses primarily on building software that can learn automatically and improve from experience without explicit programming. For machine learning (ML) algorithms to begin learning, data must first be available as input since that is what the algorithm uses to generate the output.

The input data consist of features, which are characteristics that describe a phenomenon (e.g., the number of rooms is a characteristic of a house), typically represented as structured columns. To produce models that generate accurate and reliable predictions, practitioners use feature engineering to extract relevant features that can be used to train machine learning models.

Note: Features may be referred to by different names, such as variables, attributes, or predictors.

What is feature engineering?

Feature engineering is the process of extracting features from raw data to find features that better represent the underlying problem so they may be used as inputs to a predictive model.

For example, at the beginning of any machine learning project, the raw data will almost always be messy and therefore unsuitable for training a model that generalizes well to new instances. The first step towards training an effective model involves exploring the data and cleaning it up (e.g., changing data types, removing values, imputing missing values). By the end of this procedure, data scientists will have gained a deeper understanding of the data and, with it, insights into how to prepare data that can train a useful model.

Given this context, we can say that feature engineering poses the question, “How can we best represent the data to learn a solution to the problem?” It’s fair to say that machine learning problems are generally representation problems, and if you feed garbage into a model, you should expect to receive garbage out of the model. However, knowing which representation to use is rarely possible without conducting many experiments, which is why many regard feature engineering as an art.

Nevertheless, feature engineering aims to find the best representation to use to train a machine learning model. More informative features will allow the predictive model to learn a function that better maps the input data to the output, resulting in a predictive model that generalizes better on instances that it hadn’t seen during training.

What is feature learning?

Feature engineering typically consists of manually processing data to extract features. While this is one way to find representative data, another way to do this is by feature learning. Feature learning (a.k.a representation learning) can either be supervised or unsupervised. It encompasses several techniques that automatically discover the representations required for feature detection or classification from raw data.

Supervised feature learning

Supervised feature learning involves learning features from labeled data. Consequently, the true label will be used to compute an error term which can then be used as a measure to determine to which degree the system fails to predict the correct label. Feedback from this procedure is then leveraged to minimize the error during the learning process.

Unsupervised feature learning

In contrast, unsupervised feature learning involves learning features from unlabeled data. It is generally performed to discover low-dimensional features that capture some form of structure within the high-dimensional input data. Features learned in an unsupervised manner can then be used to improve the performance of a supervised task, an approach known as semi-supervised learning.
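As a rough illustration of discovering low-dimensional features from unlabeled data, here is a minimal sketch that uses principal component analysis (PCA). PCA is only one simple technique of this kind and is my choice for the example, not something the article prescribes; the synthetic data and the variable names are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 samples whose 10 observed columns are driven by 2 hidden factors
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

# Learn a 2-dimensional representation without using any labels
pca = PCA(n_components=2)
X_low = pca.fit_transform(X)

print(X_low.shape)                           # shape of the learned features
print(pca.explained_variance_ratio_.sum())   # share of variance the 2 features retain
```

The two learned columns in `X_low` could then be fed to a supervised model in place of the ten original ones.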

The importance of feature engineering & selection

Feature engineering attempts to achieve two primary goals:

  • Creating a more representative data set to use as input for training a machine learning algorithm, and
  • Improving model performance.

Another technique we have not yet touched upon is feature selection.

When working with data, some of the features available in the data set will be more influential in the model’s accuracy than others. We typically use feature selection to simplify a machine learning problem by reducing the dimensionality of the data. This is done by removing redundant or irrelevant features.

Reducing the dimensions of the data through feature selection simplifies the problem, which reduces the possibility of our model overfitting the data. It also means that we would require less memory and computational power to train our model, resulting in shorter training times.
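To make the idea of removing less useful features concrete, here is a minimal sketch using scikit-learn's `SelectKBest` with an ANOVA F-test on the wine dataset. The choice of scoring function and of `k=5` is an assumption made for illustration, not a recommendation from the article.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif

wine = load_wine()
X, y = wine.data, wine.target

# Keep the 5 features with the highest univariate F-score (k=5 is illustrative)
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

# Names of the features that were kept
kept = [name for name, keep in zip(wine.feature_names, selector.get_support()) if keep]

print(X.shape, "->", X_selected.shape)
print(kept)
```

Univariate scoring like this ignores interactions between features, so it is best treated as a quick first pass.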

Overall, effective use of feature engineering and feature selection allows us to:

  • Be more flexible with our model selection since our features are more representative.
  • Build simpler models, which means shorter compute times.
  • Achieve better results since our models will find it easier to unearth the inherent structure within the data when it is represented well.

How to do feature engineering

Understanding the process used when approaching machine learning problems helps illustrate how feature engineering is done in industry.

[Figure: the CRISP-DM process diagram]

Cross-Industry Standard Process for Data Mining (CRISP-DM) is one of the most recognized Data Science workflow frameworks.

The idea of “Data Preparation” includes transforming data from a raw state to a state that is:

  • consumable by a machine learning model and
  • representative of the problem that is to be solved.

Before we can truly know how to prepare our data, we have the “Data Understanding” phase. In the data understanding phase, we attempt to understand the data better using techniques such as data visualization and exploratory data analysis (EDA).

You may have noticed that arrows point from “Data Preparation” to “Modeling” and back. This is to illustrate the iterative nature of machine learning tasks. Once a model has been evaluated, practitioners are expected to go back and improve the features by tackling the scenarios in which the predictive model made the most errors. This is done in hopes of making the data more representative to improve the model’s accuracy.

Common feature engineering techniques in Python

Although each project would have different feature engineering requirements, some common techniques are used in the real world. These techniques include:

Handling missing values

The absence of values within a dataset is one of the most common problems in machine learning tasks. There are several reasons why values may be missing within a dataset. They can be classified into three main types:

  • Missing Completely at Random (MCAR) – The missingness of the data is unrelated to the person (or thing) being studied (e.g., a questionnaire is lost in the post, or a blood sample is damaged in the laboratory).
  • Missing at Random (MAR) – The missingness is related to other observed information about the person (or thing) being studied, but not to the missing values themselves, so it can be accounted for by considering the other information provided.
  • Missing Not at Random (MNAR) – The missingness is related to the values that are missing. For instance, a person may decide to skip a drug test if they know they’ve taken drugs.

Regardless of the cause, the absence of values can have adverse effects on a predictive model.

Note: I’ve used the Titanic Dataset for this example

The code below simply retrieves data from a local directory and prints out information about the schema.

import pandas as pd

# retrieving the data
df = pd.read_csv("../data/titanic_raw/train.csv")

# information on the columns
df.info()

>>>> <class 'pandas.core.frame.DataFrame'> 
RangeIndex: 891 entries, 0 to 890 
Data columns (total 12 columns): 
# Column Non-Null Count Dtype 
--- ------ -------------- ----- 
0 PassengerId 891 non-null int64 
1 Survived 891 non-null int64 
2 Pclass 891 non-null int64 
3 Name 891 non-null object 
4 Sex 891 non-null object 
5 Age 714 non-null float64 
6 SibSp 891 non-null int64 
7 Parch 891 non-null int64 
8 Ticket 891 non-null object 
9 Fare 891 non-null float64 
10 Cabin 204 non-null object 
11 Embarked 889 non-null object 
dtypes: float64(2), int64(5), object(5) 
memory usage: 83.7+ KB

One way to handle missing values is to delete the instances or features with missing values.

# create a copy of data
df2 = df.copy()

# check the shape before
print("Before removing rows with missing values")
print(df2.shape)

# delete rows that contain any missing value
df2.dropna(inplace=True)
print("After removing rows with missing values")
print(df2.shape)

Before removing rows with missing values 
(891, 12) 
After removing rows with missing values 
(183, 12)

The deletion method could result in the loss of important information (as seen above by the number of rows we lost), which will affect the performance of the predictive model. Another way would be to use imputation methods such as filling missing values with the mean.

# imputing all rows with missing age with the mean age
# (assigning the result back avoids relying on in-place modification of a column selection)
df["Age"] = df["Age"].fillna(df["Age"].mean())
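The mean is only one possible fill value. As a hedged sketch of an alternative, the snippet below imputes with the median using scikit-learn's `SimpleImputer` and keeps a flag column recording which rows were originally missing. The toy frame and the `Age_missing` column name are hypothetical stand-ins, not part of the Titanic example above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame standing in for the Titanic "Age" column
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0, np.nan, 35.0]})

# Record which rows were missing before imputing, so the model can still see that signal
toy["Age_missing"] = toy["Age"].isna().astype(int)

# Median imputation is less sensitive to outliers than the mean
imputer = SimpleImputer(strategy="median")
toy["Age"] = imputer.fit_transform(toy[["Age"]]).ravel()

print(toy)
```

In a real project the imputer would be fitted on the training split only and then applied to validation and test data.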

Categorical encoding

Categorical encoding converts categorical features into a numeric format that a machine learning model can process. One-hot encoding does this by representing each category as its own binary indicator. When there is no ordinal relationship within the values of a feature (e.g., there is no order to “male” and “female”), practitioners typically use one-hot encoding to transform the feature.

Code Source: https://towardsdatascience.com/a-guide-to-encoding-text-in-python-ef783e50f09e 
import numpy as np 
import pandas as pd 
from sklearn.preprocessing import OneHotEncoder 
from tensorflow.keras.preprocessing.sequence import pad_sequences 

# define the documents 
corpus = ["i cant wait to get out of lockdown", 
"the uk is soon going to be free soon", 
"linkedin is social media"] 

# converting text to integers 
token_docs = [doc.split() for doc in corpus] 
all_tokens = set([word for sentence in token_docs for word in sentence]) 
word_to_idx = {token:idx+1 for idx, token in enumerate(all_tokens)} 

# converting the docs to their token ids 
token_ids = np.array([[word_to_idx[token] for token in token_doc]
                      for token_doc in token_docs], dtype=object)
# pad every document to the same length (0 is the padding id)
token_ids_padded = pad_sequences(token_ids, padding="post")

# convert the token ids to one hot representation 
one_hot = OneHotEncoder() 
X = one_hot.fit_transform(token_ids_padded) 

# converting to dataframe 
X_df = pd.DataFrame(X.toarray()) 


Source: A Guide To Encoding Text in Python
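For an ordinary categorical column such as the “male”/“female” example mentioned earlier, a much shorter route is `pandas.get_dummies`. The sketch below uses a hypothetical toy column; it is not the Titanic data itself.

```python
import pandas as pd

# Hypothetical toy column mirroring the "male"/"female" example
people = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# One binary indicator column is created per category
encoded = pd.get_dummies(people, columns=["Sex"], dtype=int)

print(encoded)
```

Each row has exactly one indicator set to 1, so no artificial order is imposed on the categories.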

Feature scaling (standardization and normalization)

Feature scaling techniques are used to change the scale of features to ensure all features within the data are on a similar scale. The performance of machine learning models that are sensitive to scale, such as linear models and neural networks, will be severely affected if the dataset features differ significantly in scale. There are two common ways practitioners perform feature scaling.

Note: We will use the same dataset for both examples.

import pandas as pd 
from sklearn.datasets import load_wine 
from sklearn.preprocessing import StandardScaler, MinMaxScaler 

# retrieving data 
wine_json= load_wine() 
df = pd.DataFrame(data=wine_json["data"], columns=wine_json["feature_names"]) 
df["Target"] = wine_json["target"]

Normalization (min-max scaling) scales the features into a fixed range, typically between 0 and 1.

# before minmax scaling
print(f"Original:\n{df.loc[:2, ['alcohol', 'malic_acid']]}\n")

# minmax scaling
minmax_scaler = MinMaxScaler().fit(df[["alcohol", "malic_acid"]])
df_minmax = pd.DataFrame(
    data=minmax_scaler.transform(df[["alcohol", "malic_acid"]]),
    columns=["alcohol", "malic_acid"]
)

# after minmax scaling
print(f"Scaled:\n{df_minmax.loc[:2, ['alcohol', 'malic_acid']]}")


Standardization rescales the feature values so that each feature has a mean of 0 and a standard deviation of 1. It shifts and rescales the data but does not change the shape of its distribution.

# before standardization
print(f"Original:\n{df.loc[:2, ['alcohol', 'malic_acid']]}\n")

# standardization scaling
std_scaler = StandardScaler().fit(df[["alcohol", "malic_acid"]])
df_std = pd.DataFrame(
    data=std_scaler.transform(df[["alcohol", "malic_acid"]]),
    columns=["alcohol", "malic_acid"]
)

# after standardization
print(f"Scaled:\n{df_std.loc[:2, ['alcohol', 'malic_acid']]}")
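To show what the two scalers compute, here is a small sketch that applies both formulas by hand with NumPy to a made-up array. The array values are arbitrary and chosen only so the result is easy to check.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Min-max scaling: (x - min) / (max - min), which maps the values into [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, using the population standard deviation
x_std = (x - x.mean()) / x.std()

print(x_minmax)
print(x_std.mean(), x_std.std())
```

Both transformations are linear, so the ordering and relative spacing of the values are preserved.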


Power transforms

Power transformation techniques are effective for transforming numerical features to have a more Gaussian-like probability distribution. According to Wikipedia, we may describe power transformation techniques as “techniques used to stabilize variance, make data more normal distribution-like, improve the validity of measure of association such as the Pearson correlation between the features and for other stabilization procedures” [Source: Wikipedia]. An example of a power transform is the Log transformation.

Code Source: https://towardsdatascience.com/feature-engineering-for-numerical-data-e20167ec18 
import numpy as np 
import pandas as pd 
from scipy import stats 
import plotly.graph_objects as go 
from plotly.subplots import make_subplots 

df = pd.read_csv("../data/raw/train.csv") 

# applying various transformations 
x_log = np.log(df["GrLivArea"].copy()) # log 
x_square_root = np.sqrt(df["GrLivArea"].copy()) # square root 
x_boxcox, _ = stats.boxcox(df["GrLivArea"].copy()) # boxcox 
x = df["GrLivArea"].copy() # original data 

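# creating the figures
fig = make_subplots(rows=2, cols=2,
                    subplot_titles=("Original Data",
                                    "Log Transformation",
                                    "Square root transformation",
                                    "Boxcox Transformation"))

# drawing the plots (histograms are assumed here, one per subplot)
fig.add_traces(
    [go.Histogram(x=x, showlegend=False),
     go.Histogram(x=x_log, showlegend=False),
     go.Histogram(x=x_square_root, showlegend=False),
     go.Histogram(x=x_boxcox, showlegend=False)],
    rows=[1, 1, 2, 2],
    cols=[1, 2, 1, 2]
)

fig.update_layout(title=dict(text="GrLivArea with various Power Transforms"))
fig.show() # display figure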


Source: Feature Engineering for Numerical Data
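Without plotting anything, the effect of a power transform can also be checked numerically by comparing skewness before and after. The sketch below does this on synthetic log-normal data, which is an assumed stand-in for a right-skewed feature such as GrLivArea.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic right-skewed data (log-normal), standing in for a skewed numeric feature
x = rng.lognormal(mean=7.0, sigma=0.5, size=1000)

x_log = np.log(x)                           # log transform
x_boxcox, fitted_lambda = stats.boxcox(x)   # Box-Cox with lambda estimated from the data

# Skewness close to 0 indicates a more symmetric, Gaussian-like shape
print("original:", stats.skew(x))
print("log:     ", stats.skew(x_log))
print("box-cox: ", stats.skew(x_boxcox))
```

Note that both `np.log` and `stats.boxcox` require strictly positive inputs.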

Feature grouping

We refer to data as “tidy” when each feature is separated into its own column, each instance has its own row, and each type of observational unit is a table. It’s not uncommon to have instances spread over several rows; hence a technique called feature grouping is used to connect the rows into a single one. The tricky part is deciding what aggregation to group the features by.

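Code Source: https://rubikscode.net/2020/11/15/top-9-feature-engineering-techniques/
import pandas as pd

# `data` is assumed to be a DataFrame that has already been loaded and contains
# the 'species', 'culmen_length_mm', and 'culmen_depth_mm' columns

# group data by a feature
grouped_data = data.groupby('species')

# calculate sum and mean (selecting several columns requires a list, hence the double brackets)
sums_data = grouped_data[['culmen_length_mm', 'culmen_depth_mm']].sum().add_suffix('_sum')
avgs_data = grouped_data[['culmen_length_mm', 'culmen_depth_mm']].mean().add_suffix('_mean')

# add mean and sum calculations to dataframe
sumed_averaged = pd.concat([sums_data, avgs_data], axis=1)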


Source: Top 9 Feature Engineering Techniques with Python
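To illustrate the case described earlier, where one instance is spread over several rows, here is a self-contained sketch that collapses a hypothetical transaction log into one row per customer. The column names and the chosen aggregations are assumptions made for the example.

```python
import pandas as pd

# Hypothetical transaction log: several rows per customer
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount": [10.0, 30.0, 5.0, 15.0, 25.0, 40.0],
})

# Collapse to one row per customer using named aggregations
customer_features = (
    transactions.groupby("customer_id")["amount"]
    .agg(amount_sum="sum", amount_mean="mean", n_transactions="count")
    .reset_index()
)

print(customer_features)
```

Which aggregations to keep depends on the problem; sum, mean, and count are simply common starting points.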


Quantization

Quantization is a technique used to limit the scale of data by grouping different values into bins. Another way to think of quantization is as a mapping in which a continuous value is mapped to a discrete one, i.e., to a position in an ordered sequence of bins.

import numpy as np

#15 random integers from the "discrete uniform" distribution 
ages = np.random.randint(0, 100, 15)

#evenly spaced bins 
ages_binned = np.floor_divide(ages, 10)

print(f"Ages: {ages} \nAges Binned: {ages_binned}")

>>>> Ages: [97 56 43 73 89 68 67 15 18 36 4 97 72 20 35] 
Ages Binned: [9 5 4 7 8 6 6 1 1 3 0 9 7 2 3]
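Bins do not have to be evenly spaced. As a sketch of custom-width bins, the example below uses `pandas.cut` with bin edges and labels that are purely illustrative assumptions.

```python
import pandas as pd

ages = pd.Series([4, 15, 18, 36, 43, 67, 89, 97])

# Custom bin edges; right=False makes each bin include its left edge:
# [0, 18), [18, 65), [65, 120)
age_group = pd.cut(
    ages,
    bins=[0, 18, 65, 120],
    labels=["child", "adult", "senior"],
    right=False,
)

print(age_group.tolist())
```

The result is an ordered categorical column, which can then be encoded like any other categorical feature.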

Feature engineering in computer vision

Computer vision (CV) is an interdisciplinary field concerned with giving computers a high-level understanding of visual content such as images and videos, so that tasks the human visual system performs can be automated. In a CV context, features can be thought of as parts of an object within an image that help the machine identify what the image contains. For instance, a triangle has three edges; we can call these edges features because they help us identify it.

The introduction of deep learning has changed how we perform feature engineering in CV (see Feature Engineering in Deep Learning section of this article for more). Nonetheless, some traditional techniques to detect features in CV include:

Harris Corner Detection

Chris Harris and Mike Stephens first introduced Harris Corner detection in 1988 as a tool to extract corners and infer features of an image.

Code Source: https://docs.opencv.org/3.4/dc/d0d/tutorial_py_features_harris.html 
import numpy as np 
import cv2 as cv 

filename = 'chessboard.png' 

img = cv.imread(filename) 
gray = cv.cvtColor(img,cv.COLOR_BGR2GRAY) 

gray = np.float32(gray) 
dst = cv.cornerHarris(gray,2,3,0.04) 

#result is dilated for marking the corners, not important 
dst = cv.dilate(dst,None) 

# Threshold for an optimal value, it may vary depending on the image.
img[dst > 0.01 * dst.max()] = [0, 0, 255]

cv.imshow('dst', img)
if cv.waitKey(0) & 0xff == 27:
    cv.destroyAllWindows()


Source: Harris Corner Detection

Shi-Tomasi Corner Detector

Shi-Tomasi is also a corner detector algorithm. It was introduced in 1994 and added a slight modification to how corners were detected, which showed better results compared to Harris Corner Detector – the interested reader may read Good Features to Track.

Code Source: https://docs.opencv.org/master/d4/d8c/tutorial_py_shi_tomasi.html 

import numpy as np 
import cv2 as cv 
from matplotlib import pyplot as plt 

img = cv.imread('blox.jpg') 
gray = cv.cvtColor(img,cv.COLOR_BGR2GRAY) 

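corners = cv.goodFeaturesToTrack(gray,25,0.01,10)
corners = corners.astype(int)  # np.int0 is not available in NumPy 2.0 and later

# draw a small filled circle on every detected corner
for i in corners:
    x,y = i.ravel()
    cv.circle(img,(int(x),int(y)),3,255,-1)

plt.imshow(img), plt.show()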


Source: Shi-Tomasi Corner Detection

Feature engineering in NLP

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and natural language. NLP engineers employ various techniques to analyze, model, and ultimately allow computers to understand human language.

As with CV, feature engineering in NLP has changed since the introduction of deep learning. However, some common features that can be extracted from text include:

The number of unique words

def unique_words_count(text):
    return len(set(text.split()))

The number of capital characters

def count_capital_chars(text):
    count = 0
    for i in text:
        if i.isupper():
            count += 1
    return count

The number of punctuation marks used

import string 

def punctuation_counts(text):
    d = dict()
    # one entry per punctuation character, e.g. '!_count'
    for i in string.punctuation:
        d[str(i)+'_count'] = text.count(i)
    return d

The number of stopwords

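import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# requires the NLTK stopwords corpus and tokenizer data to be downloaded first (via nltk.download)
def count_stopwords(text):
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text)
    list_stopwords = [w for w in tokens if w in stop_words]
    return len(list_stopwords)

Bag Of Words (BoW)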

BoW represents a text as the bag of its words without regard for grammar or word order.

Code Source: https://towardsdatascience.com/a-guide-to-encoding-text-in-python-ef783e50f09e 
import pandas as pd 
from sklearn.feature_extraction.text import CountVectorizer

# define the documents 
corpus = ["i cant wait to get out of lockdown", "the uk is soon going to be free soon", "linkedin is social media"]

# implementing BoW (the vectorizer must be fitted before it can transform)
bow = CountVectorizer()
X = bow.fit_transform(corpus)

# converting to dataframe 
X_df = pd.DataFrame(X.toarray(), columns=sorted(bow.vocabulary_)) 


Source: A Guide To Encoding Text in Python

Term Frequency – Inverse Document Frequency (TF-IDF)

Every language contains very common words, often called stopwords, that tend to dominate most text (e.g., “the”, “a”). Since such words rarely help distinguish one document from another, TF-IDF down-weights terms that appear in many documents and gives more weight to terms that are frequent in a document but rare across the corpus.

Code Source: https://towardsdatascience.com/a-guide-to-encoding-text-in-python-ef783e50f09e 
import pandas as pd 
from sklearn.feature_extraction.text import TfidfVectorizer 
# define the documents 
corpus = ["i cant wait to get out of lockdown", "the uk is soon going to be free soon", "linkedin is social media"] 
# implement tfidf (the vectorizer must be fitted before it can transform)
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)
# convert to dataframe 
X_df = pd.DataFrame(X.toarray(), columns=sorted(tfidf.vocabulary_)) 


Source: A Guide to Encoding Text in Python
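To see why a word that appears in more documents receives a lower weight, here is a minimal pure-Python sketch of the inverse document frequency term. It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is my understanding of scikit-learn's default; the full vectorizer additionally multiplies by the term frequency and normalizes each row. The helper name `smoothed_idf` is hypothetical.

```python
import math

corpus = ["i cant wait to get out of lockdown",
          "the uk is soon going to be free soon",
          "linkedin is social media"]

docs = [doc.split() for doc in corpus]
n_docs = len(docs)

def smoothed_idf(term):
    # number of documents that contain the term at least once
    doc_freq = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

print(smoothed_idf("to"))        # "to" appears in 2 of the 3 documents
print(smoothed_idf("lockdown"))  # "lockdown" appears in 1 of the 3 documents
```

The rarer word gets the larger value, which is the behaviour the paragraph above describes.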

Word embeddings

Word embeddings are representations of words as real-valued vectors that are learned from text. The vectors encode the meaning of a word such that words with similar meanings lie closer together in vector space.

Code Source: https://towardsdatascience.com/a-guide-to-encoding-text-in-python-ef783e50f09e 
from gensim.models import Word2Vec

# define the documents 
corpus = ["i cant wait to get out of lockdown", "the uk is soon going to be free soon", "linkedin is social media"]

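# Word2Vec model (only min_count survives from the original listing; other settings use gensim defaults)
w2v = Word2Vec(min_count=1)

# Word2Vec expects tokenized sentences (lists of words), not raw strings
tokenized_corpus = [doc.split() for doc in corpus]

# building the vocab
w2v.build_vocab(tokenized_corpus)

# training the model
w2v.train(tokenized_corpus, total_examples=w2v.corpus_count, epochs=10)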
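“Closer together in vector space” is usually measured with cosine similarity. The sketch below shows the computation on three made-up vectors; the numbers are hypothetical and are not the output of any trained model.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented purely for illustration
embeddings = {
    "king":  np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.85, 0.90, 0.15]),
    "apple": np.array([0.10, 0.20, 0.95]),
}

def cosine_similarity(a, b):
    # cosine of the angle between the two vectors: 1.0 means same direction
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))
print(cosine_similarity(embeddings["king"], embeddings["apple"]))
```

With real embeddings the vectors typically have tens to hundreds of dimensions, but the similarity computation is the same.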

Automated feature engineering

Feature engineering is an art that depends on intuition, data manipulation skills, and domain knowledge. The process of engineering features can be extremely tedious, and the resulting features are limited by the time available and by the subjectivity of the person creating them.

Automated feature engineering aims to alleviate some of the barriers encountered when conducting feature engineering manually. By automatically creating many candidate features from a dataset, it frees data scientists to be more productive with other tasks, since they do not have to spend time and mental effort developing features by hand.

Some useful frameworks for automated feature engineering include:


Featuretools is an open-source Python framework for performing automated feature engineering. It’s one of the most popular frameworks designed to speed up the feature generation process.


AutoFeat is an open-source Python framework that permits users to perform automated feature engineering and feature selection. The framework also has built-in linear models such as `AutoFeatRegressor` and `AutoFeatClassifier`. These models have a similar interface to that of Scikit-Learn.


TSFresh performs automated feature engineering and feature selection, specifically for time-series data. The framework contains over 60 different feature extractors such as Standard Deviation, Global maximum, Fast Fourier transform, and many more.

Feature engineering in Deep Learning

“The feature engineering approach was the dominant approach until recently when deep learning techniques started demonstrating recognition performance better than the carefully crafted feature detectors. Deep learning shifts the burden of feature design also to the underlying learning system along with classification learning typical of earlier multiple layer neural network learning. From this perspective, a deep learning system is a fully trainable system beginning from raw input, for example, image pixels, to the final output of recognized objects” [Source: Deep Learning and Feature Engineering].

The description of deep learning promised to make obsolete the days of manually engineering features with claims that the models are powerful enough to determine the features for themselves. Although true to some extent, these claims do not paint the full picture.

Deep learning models can indeed learn features, but this hasn’t completely removed the need for feature engineering. A better way to describe the transformation of feature engineering with the introduction of deep learning is that the process has been simplified in many cases. However, this has not come without trade-offs:

  • Architectures of the models have become more complex to make up for the lack of representativeness within data.
  • Model architectures are usually specific to a given task in the same way feature engineering may be described.

Stephen Merity was once quoted as saying, “Any time a human being forces an architectural decision that couldn’t be learned, we’re essentially hard coding a feature” [Source: Smerity.com].

The Convolutional Neural Network (CNN) is a popular deep learning architecture that is generally used for computer vision tasks (although it has also been used in other domains, such as speech recognition). This architecture is a prime example of the full picture not being shown when people claim that deep learning has eradicated feature engineering. Indeed, edge detection is not manually coded by a human when a CNN architecture is used. However, humans are the ones who decide the size of a convolution, since the model has no way of determining for itself whether a convolution should be larger or smaller.
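For contrast, this is roughly what manually coded edge detection looks like: a hand-designed 3x3 Sobel kernel slid over a tiny synthetic image with NumPy. In a CNN the kernel weights would be learned from data, while the kernel size (3x3 here) would still be chosen by a person. The image and the helper `correlate2d_valid` are assumptions made for this sketch.

```python
import numpy as np

# Tiny grayscale image: dark left half, bright right half (a vertical edge)
image = np.array([[0, 0, 0, 1, 1, 1]] * 6, dtype=float)

# Hand-designed 3x3 Sobel kernel that responds to vertical edges
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

def correlate2d_valid(img, kernel):
    # Slide the kernel over every position where it fits entirely inside the image
    kh, kw = kernel.shape
    out_h = img.shape[0] - kh + 1
    out_w = img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

edges = correlate2d_valid(image, sobel_x)
print(edges)
```

The response is non-zero only where the window straddles the boundary between the dark and bright halves.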

Feature engineering best practices

As previously stated, at the beginning of a project, there is no way of knowing what solution will work best for a problem. Rigorous experiments ought to be carried out to better understand what may work and what won’t. However, over time, practitioners have developed some best practices based on what has worked in most cases.

Start simple & generate lots of features

Practitioners should attempt to generate as many simple features as possible once the modeling phase begins. In this context, simplicity refers to the ease of coding the feature generation technique. For instance, when working with text, practitioners might start with a bag-of-words (BoW) model, which generates thousands of features with minimal code, rather than with a Word2Vec model. Since there is no clear way to identify which features will be the most influential for a problem, starting with anything measurable will clarify what may be a good direction to move forward in.

Reduce the cardinality of features when possible

Consider the scenario where we have a categorical feature that contains many unique values (more than 12). This feature should only be used if we want to alter our model’s behavior based on that feature. For example, if a feature records which US state a person resides in, it would have 50 unique values. By using this feature as an input to the model, the model’s behavior will vary depending on the state in which an instance resides. If this is not desired, then reducing the cardinality is the better option.
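One common way to reduce cardinality is to fold rare categories into a single catch-all value. The sketch below is only an illustration of that idea; the data, the `min_count` threshold, and the label "Other" are all assumptions.

```python
import pandas as pd

# Hypothetical state column with a few frequent and several rare values
states = pd.Series(["CA", "CA", "CA", "TX", "TX", "TX", "NY", "NY", "WY", "VT", "AK"])

min_count = 2  # illustrative threshold: categories seen fewer times are merged

counts = states.value_counts()
frequent = counts[counts >= min_count].index

# Keep frequent categories as they are; replace everything else with "Other"
states_reduced = states.where(states.isin(frequent), other="Other")

print(states.nunique(), "->", states_reduced.nunique())
```

Other options include grouping values by a domain-specific hierarchy, such as mapping states to regions.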

Do feature selection when necessary

Feature selection should be done only when it’s necessary. Here are some reasons that justify performing feature selection:

  • The model needs to be explainable
  • There’s a constraint on hardware requirements
  • There’s limited time to conduct numerous experiments and/or to rebuild the model for a production environment
  • There’s an expected distribution shift between multiple model training rounds.

Conduct error analysis

Error analysis is a term used to describe the process of analyzing misclassified instances made by a predictive model. It’s a manual process in which practitioners scan a set of misclassified samples to detect patterns of where the predictive model went wrong. This can provide immense insights into how inputs can be better engineered to provide a predictive model with more representative input data.
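Part of this manual scan can be supported with a few lines of pandas: flag the misclassified rows, then check whether errors cluster in a particular slice of the data. The validation results and the `segment` column below are hypothetical.

```python
import pandas as pd

# Hypothetical validation results with one descriptive feature per instance
results = pd.DataFrame({
    "segment": ["new", "new", "new", "returning", "returning", "returning"],
    "y_true":  [1, 0, 1, 0, 1, 0],
    "y_pred":  [0, 1, 1, 0, 1, 0],
})

# Flag the instances the model got wrong
results["error"] = results["y_true"] != results["y_pred"]

# Error rate per segment, worst first
error_rate_by_segment = (
    results.groupby("segment")["error"].mean().sort_values(ascending=False)
)

# The misclassified rows themselves, for manual inspection
misclassified = results[results["error"]]

print(error_rate_by_segment)
print(misclassified)
```

A slice with a much higher error rate than the rest is a natural place to look for missing or poorly represented features.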

Be cautious when using counts

When counts remain roughly within the same bounds as time goes on, using them is fine. This is the case for bag-of-words (BoW) features, provided the document length does not grow or shrink over time. Counts become an issue when this is not the case. For example, suppose a telephone network dataset records the number of calls made by users since their contract began. Long-term customers would likely have made significantly more calls than a user who signed up a week or so ago. Because such counts keep growing over time, the distribution of the feature shifts, so constant reevaluation is crucial.
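One possible remedy, which is my suggestion rather than something the article prescribes, is to convert a cumulative count into a rate so that customers with different tenures become comparable. The numbers and column names below are hypothetical.

```python
import pandas as pd

# Hypothetical usage data: cumulative calls since the contract began
usage = pd.DataFrame({
    "customer": ["long_term", "new"],
    "total_calls": [3650, 35],
    "days_since_contract_start": [730, 7],
})

# Normalize the raw count by tenure to obtain a rate that stays bounded over time
usage["calls_per_day"] = usage["total_calls"] / usage["days_since_contract_start"]

print(usage)
```

Although the raw counts differ by two orders of magnitude, both customers make the same number of calls per day.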

Feature engineering with Layer

Layer is a Declarative MLOps (DM) platform, meaning you can define what you’d like to accomplish rather than describe how to achieve it – you can focus on what is most important to you. Essentially, the platform’s purpose is to aid teams in producing scalable machine learning applications based on code.

For feature engineering, Layer provides Featuresets, which it treats as first-class entities: groups of calculated features that provide a high-level interface to access individual features. They differ from static datasets or ordinary database tables because they provide the capability to time-travel to get point-in-time values of their underlying features.

Ultimately, there are three ways that features can be defined in Layer:

  • SQL Features: SQL queries can be used to define the transformations on your dataset to extract features.
  • Python Features: Some machine learning projects may require more advanced feature extraction techniques. Python has extensive tooling for that.
  • Spark features: Layer also allows you to create features using Spark. This is very handy when working with big data.

Final thoughts

Feature engineering describes the process of transforming a machine learning model’s input features to make them more representative, as this will make it easier for the model to identify the underlying pattern within data. However, since this process requires skill, intuition, domain expertise, and plenty of time, researchers have been looking for ways to try and automate this process.

This led to the introduction of automated feature engineering, which allows data scientists to be more productive with other tasks. Another way researchers have attempted to eradicate the need for feature engineering is by leveraging deep learning. Technically, feature engineering has not been made entirely obsolete by the introduction of deep learning, but it has most definitely been simplified.
