Machine learning predictive modeling performance is only as good as your data, and your data is only as good as the way you prepare it for modeling.
The most common approach to data preparation is to study a dataset and review the expectations of a machine learning algorithm, then carefully choose the most appropriate data preparation techniques to transform the raw data to best meet the expectations of the algorithm. This is slow, expensive, and requires a vast amount of expertise.
An alternative approach to data preparation is to apply a suite of common and commonly useful data preparation techniques to the raw data in parallel and combine the results of all of the transforms together into a single large dataset from which a model can be fit and evaluated.
This is an alternative philosophy for data preparation that treats data transforms as a way to extract salient features from raw data and expose the structure of the problem to the learning algorithms. It requires learning algorithms that are capable of weighting input features and using those input features that are most relevant to the target that is being predicted.
This approach requires less expertise, is computationally efficient compared to a full grid search of data preparation methods, and can aid in the discovery of unintuitive data preparation solutions that achieve good or best performance for a given predictive modeling problem.
In this tutorial, you will discover how to use feature extraction for data preparation with tabular data.
After completing this tutorial, you will know:
Feature extraction provides an alternate approach to data preparation for tabular data, where all data transforms are applied in parallel to raw input data and combined together to create one large dataset.
How to use the feature extraction method for data preparation to improve model performance over a baseline for a standard classification dataset.
How to add feature selection to the feature extraction modeling pipeline to give a further lift in modeling performance on a standard dataset.
Let’s get started.
How to Use Feature Extraction on Tabular Data for Data Preparation
Photo by Nicolas Valdes, some rights reserved.
Tutorial Overview
This tutorial is divided into three parts; they are:
Feature Extraction Technique for Data Preparation
Dataset and Performance Baseline
    Wine Classification Dataset
    Baseline Model Performance
Feature Extraction Approach to Data Preparation
Feature Extraction Technique for Data Preparation
Data preparation can be challenging.
The approach that is most often prescribed and followed is to analyze the dataset, review the requirements of the algorithms, and transform the raw data to best meet the expectations of the algorithms.
This can be effective, but it is also slow and can require deep expertise in both data analysis and machine learning algorithms.
An alternative approach is to treat the preparation of input variables as a hyperparameter of the modeling pipeline and to tune it along with the choice of algorithm and algorithm configuration.
This too can be an effective approach, exposing unintuitive solutions and requiring very little expertise, although it can be computationally expensive.
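To make the cost of this approach concrete, here is a minimal sketch of tuning the data preparation step as a hyperparameter. It assumes scikit-learn is installed and uses the copy of the wine dataset bundled with it (load_wine) purely for illustration; the candidate scalers and the values of C are arbitrary choices, not recommendations from this tutorial.

```python
# Sketch: treat the data preparation step as a hyperparameter of the pipeline.
# Assumptions: scikit-learn is available; load_wine stands in for the dataset;
# the candidate transforms and C values are illustrative only.
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X, y = load_wine(return_X_y=True)

# The "prep" step is a placeholder that the grid search will swap out.
pipeline = Pipeline([
    ("prep", StandardScaler()),
    ("model", LogisticRegression(max_iter=5000)),
])

# Every combination of preparation method and model configuration is evaluated.
param_grid = {
    "prep": ["passthrough", MinMaxScaler(), StandardScaler(), RobustScaler()],
    "model__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=5)
search.fit(X, y)

print("Candidates evaluated:", len(search.cv_results_["params"]))
print("Best configuration:", search.best_params_)
print("Best mean accuracy: %.3f" % search.best_score_)
```

Even this small grid of 4 preparation options and 3 model settings gives 12 candidates, each fit once per cross-validation fold. The number of fits grows multiplicatively as more transforms or chained transforms are added, which is where the computational expense comes from.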
An approach that seeks a middle ground between these two approaches to data preparation is to treat the transformation of input data as a feature engineering or feature extraction procedure. This involves applying a suite of common or commonly useful data preparation techniques to the raw data, aggregating all of the resulting features together to create one large dataset, then fitting and evaluating a model on this data.
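The procedure can be sketched with scikit-learn's FeatureUnion, which applies several transforms to the same input and concatenates their outputs column-wise. This is a minimal illustration under stated assumptions: the bundled load_wine dataset is used for convenience, and the particular transforms, their settings, and the logistic regression model are example choices rather than a prescribed configuration.

```python
# Sketch: apply several data transforms in parallel and stack the results.
# Assumptions: scikit-learn is available; load_wine stands in for the dataset;
# the transforms and their settings are illustrative only.
from numpy import mean, std
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import (
    KBinsDiscretizer,
    MinMaxScaler,
    QuantileTransformer,
    RobustScaler,
    StandardScaler,
)

X, y = load_wine(return_X_y=True)

# Each transform sees the same raw input; outputs are concatenated column-wise.
union = FeatureUnion([
    ("mms", MinMaxScaler()),
    ("ss", StandardScaler()),
    ("rs", RobustScaler()),
    ("qt", QuantileTransformer(n_quantiles=100, output_distribution="normal")),
    ("kbd", KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="uniform")),
    ("pca", PCA(n_components=7)),
])

# Five transforms keep all 13 input columns and PCA adds 7: 13 * 5 + 7 = 72.
X_extracted = union.fit_transform(X)
print("Extracted feature matrix shape:", X_extracted.shape)

# Fit and evaluate a simple linear model on the combined features.
# The union sits inside the pipeline so it is refit on each training fold.
pipeline = Pipeline([
    ("features", union),
    ("model", LogisticRegression(max_iter=5000)),
])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv)
print("Accuracy: %.3f (%.3f)" % (mean(scores), std(scores)))
```

Placing the union inside the Pipeline matters: the transforms are then fit only on the training portion of each fold, which avoids leaking information from the held-out data into the extracted features.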
The philosophy of the approach treats each data preparation technique as a transform that extracts salient features from raw data to be presented to the learning algorithm. Ideally, such transforms untangle complex relationships and compound input variables, in turn allowing the use of simpler modeling algorithms, such as linear machine learning techniques.
For lack of a better name, we will refer to this as the “Feature Engineering Method” or the “Feature Extraction Method” for configuring data preparation for a predictive modeling project.
It allows data analysis and algorithm expertise to be used in the selection of data preparation methods and allows unintuitive solutions to be found, but at a much lower computational cost than tuning data preparation as a hyperparameter.
The explosion in the number of input features can also be explicitly addressed through the use of feature selection techniques that attempt to rank the importance or value of the vast number of extracted features and select only a small subset of those most relevant to predicting the target variable.
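As a rough sketch of how feature selection can be placed after feature extraction, the example below inserts recursive feature elimination (RFE) between a FeatureUnion and the final model. The dataset (scikit-learn's bundled load_wine), the four transforms, and the choice to keep 15 features are all assumptions made for illustration; other selection methods could be swapped in.

```python
# Sketch: rank the extracted features and keep only a small subset.
# Assumptions: scikit-learn is available; load_wine stands in for the dataset;
# RFE and n_features_to_select=15 are illustrative choices.
from sklearn.datasets import load_wine
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import (
    MinMaxScaler,
    QuantileTransformer,
    RobustScaler,
    StandardScaler,
)

X, y = load_wine(return_X_y=True)

# Four transforms of 13 columns each give 52 extracted features.
union = FeatureUnion([
    ("mms", MinMaxScaler()),
    ("ss", StandardScaler()),
    ("rs", RobustScaler()),
    ("qt", QuantileTransformer(n_quantiles=100, output_distribution="normal")),
])

# RFE repeatedly fits the estimator and drops the weakest feature
# until only the requested number remain.
selector = RFE(
    estimator=LogisticRegression(max_iter=5000),
    n_features_to_select=15,
)

pipeline = Pipeline([
    ("features", union),
    ("select", selector),
    ("model", LogisticRegression(max_iter=5000)),
])

# Evaluate the whole pipeline so selection is repeated within each fold.
scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=5)
print("Mean accuracy: %.3f" % scores.mean())

# Fit once on all data to inspect how many features were extracted and kept.
pipeline.fit(X, y)
n_extracted = pipeline.named_steps["features"].transform(X).shape[1]
n_kept = int(pipeline.named_steps["select"].support_.sum())
print("Extracted features:", n_extracted)
print("Selected features:", n_kept)
```

The number of features to keep is itself a setting that could be tuned; 15 is used here only to show the mechanics.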
We can explore this approach to data preparation with a worked example.
Before we dive into a worked example, let’s first select a standard dataset and develop a baseline in performance.
Source- https://machinelearningmastery.com/feature-extraction-on-tabular-data/