Updated: Nov 30, 2018
Since the dawn of the computer age, scientists and engineers have wondered how to give computers the ability to learn as humans do. Alan Turing was among the first scientists to posit a theory of intelligence that envisaged computers one day reaching parity with human intelligence. Since then, a number of giant leaps have pushed the field of Machine Learning forward. In many cases Machine Learning now matches or beats specific human cognitive faculties: ResNet (a deep Residual Network architecture) has surpassed human performance in image recognition, and Microsoft's speech transcription system has almost reached human-level performance.
One of the biggest benefits of Machine Learning is that it can be applied to almost any problem that humanity faces today. That benefit comes with challenges, however. Machine Learning algorithms need to be configured and tuned for every real-world scenario, which is labour intensive and takes a huge amount of time from the humans supervising development. The manual process is also error-prone, inefficient, and difficult to manage, and the expertise needed to configure and tune different types of algorithms is scarce. If configuration, tuning, and model selection are automated, deployment becomes more efficient and humans can focus on more important tasks such as model interpretability, ethics, and business outcomes. Automating the Machine Learning model building process is therefore of practical significance.
Enter Automated Machine Learning
Note: In the definition of Automated Machine Learning we include:
automated feature engineering
automated model selection and hyperparameter tuning
automated neural network architecture selection
This blog post explores the frameworks available today for each of the automated processes listed above, to give the reader an understanding of what is currently possible with automated Machine Learning. Before exploring each process, let us briefly discuss the end-to-end Machine Learning pipeline and map out where each process takes place in that pipeline.
It is evident from the figure above that the Machine Learning pipeline includes more than just the modelling phase. It also includes problem definition, data collection, and deployment. This blog post focuses on the 'Modelling' and 'Deployment' phases, which are the ones we want to explore from an automation perspective. If the Modelling and Deployment phases can be automated, the expert can focus more on problem definition and data understanding, on abiding by ethical standards, and on ensuring that the deployed model generates impactful insights for the business without raising ethical concerns.
For each part of the Modelling and Deployment phases we will explore frameworks from the Open Source community, from vendors such as Google, Microsoft, and Amazon, and from other niche players. This blog post was inspired by the overview of machine learning operations tools outlined by Alejandro Saucedo on his GitHub account.
Automated Feature Engineering
It is often the case that the performance of a Machine Learning algorithm depends largely on the quality of the features used by the model. Feature engineering is a very manual and labour-intensive task for Data Scientists: it involves a lot of trial and error, deep domain knowledge, and something that machines are not good at (for now), namely intuition. Automated feature engineering aims to create new feature sets iteratively until the ML model achieves a satisfactory accuracy score. Let us now frame the process we are trying to automate.
A feature engineering process typically goes like this: a data set is collected, for instance from an e-commerce website that records customers' behaviour. As a Data Scientist you would typically want to create new features, if they are not already in the data, such as:
"the frequency the customer makes an order"
"the number of days or hours from last purchase"
"the type of items a customer usually purchases"
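To make the three example features concrete, here is a minimal hand-written sketch of how they might be computed for a single customer. The order log, its column layout, and the function name `customer_features` are all hypothetical and chosen only for illustration. This is the kind of manual work that automated feature engineering tries to take over.

```python
from collections import Counter
from datetime import date

# Hypothetical order log: (customer_id, order_date, item_type).
ORDERS = [
    ("c1", date(2018, 11, 1), "books"),
    ("c1", date(2018, 11, 15), "books"),
    ("c1", date(2018, 11, 29), "toys"),
    ("c2", date(2018, 10, 5), "garden"),
]


def customer_features(orders, customer_id, today):
    """Derive three behavioural features for one customer.

    Assumes the customer has at least one order in the log.
    """
    rows = sorted((d, item) for cid, d, item in orders if cid == customer_id)
    dates = [d for d, _ in rows]

    # Order frequency: average number of days between consecutive orders.
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    avg_gap = sum(gaps) / len(gaps) if gaps else None

    # Recency: days elapsed since the most recent purchase.
    days_since_last = (today - dates[-1]).days

    # Preference: the item type this customer bought most often.
    favourite = Counter(item for _, item in rows).most_common(1)[0][0]

    return {
        "avg_days_between_orders": avg_gap,
        "days_since_last_order": days_since_last,
        "favourite_item_type": favourite,
    }


features = customer_features(ORDERS, "c1", today=date(2018, 11, 30))
print(features)
```

Every feature here encodes a decision a human made about what might matter, such as which gap to measure and which column to count.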
The aim is to create an algorithm that generates or synthesises these types of features from the data automatically. We will now list and briefly describe a few frameworks for automated feature engineering. Note that in the specialized form of Machine Learning called Deep Learning, features are typically extracted from images, text, and videos automatically by the successive matrix transformations in the layers of a Deep Learning model. The type of feature engineering discussed in this blog post primarily addresses structured transactional and relational data sets, although we will briefly talk about feature engineering in Deep Learning as well.
Data Science Machine is a research project undertaken by Max Kanter and Kalyan Veeramachaneni at MIT. Their research paper outlines the inner workings of the Deep Feature Synthesis algorithm, which uses the concept of primitives to generate features for an entity (a unique observation in the data) and for relationships between entities. The primitives are, in essence, mathematical functions applied to the data (sum, mean, max, min, etc.) which return a case-agnostic numerical result that a human can interpret to mean different things. In the case of our e-commerce example, sum can be used to calculate the total dollars spent across all orders for a particular customer. In the case of an airplane ticketing platform it can be used to total the number of flight tickets a customer has purchased during the year. Different use cases, but the same mathematical primitive. The algorithm has been open sourced as the Featuretools Python library, which you can download and experiment with. Featuretools was developed by Feature Labs, which operationalized the work from the Data Science Machine research paper. Feature Labs is a company founded by Max and Kalyan, the creators of the Data Science Machine.
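The idea of primitives can be illustrated with a small sketch in plain Python. This is not the Featuretools API and not the full Deep Feature Synthesis algorithm (it handles a single parent-child relationship and does not stack primitives). The function `synthesize_features`, the `PRIMITIVES` table, and the toy customers and orders are assumptions made for illustration.

```python
from statistics import mean

# Aggregation primitives: case-agnostic functions over a list of numbers.
PRIMITIVES = {"SUM": sum, "MEAN": mean, "MAX": max, "MIN": min, "COUNT": len}


def synthesize_features(parents, children, parent_key, value_column):
    """Apply every primitive to the child rows of each parent entity.

    Returns {parent_id: {"SUM(amount)": ..., "MEAN(amount)": ..., ...}}.
    Parents without child rows get None for every feature.
    """
    features = {}
    for parent_id in parents:
        values = [row[value_column] for row in children
                  if row[parent_key] == parent_id]
        features[parent_id] = {
            f"{name}({value_column})": func(values) if values else None
            for name, func in PRIMITIVES.items()
        }
    return features


# Hypothetical e-commerce data: customers (parents) and their orders (children).
customers = ["alice", "bob"]
orders = [
    {"customer": "alice", "amount": 20.0},
    {"customer": "alice", "amount": 35.0},
    {"customer": "bob", "amount": 12.5},
]

table = synthesize_features(customers, orders, "customer", "amount")
for customer, row in table.items():
    print(customer, row)
```

The same function would work unchanged on a flight-ticket data set. Only the human reading of "SUM(amount)" changes, which is what makes the primitives case-agnostic.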
DataRobot achieves automated feature engineering using a concept called model blueprints, which stack different pre-processing steps in a machine learning pipeline. The feature engineering part does not leverage the concept of primitives as Featuretools does. However, it does apply some standard pre-processing techniques to the data, chosen according to the ML algorithm being used (e.g. Random Forest, Logistic Regression, etc.), such as one-hot encoding, imputation, category counts, n-gram token occurrences in free text columns, ratios, etc.
H2O's Driverless AI is a platform for automatic machine learning. It can be used to automate feature engineering, model validation, model tuning, model selection, and model deployment. In this part we will cover only the automatic feature engineering part of Driverless AI. Driverless AI supports a whole range of what it calls 'transformers', which can be applied to a data set. The following are some of the transformers available on the platform: Dates Transformer, Cross Validation Categorical to Numeric Encoding, Text Transformer (using TF-IDF or count), One Hot Encoding Transformer, Ewma Lags Transformer, Text Cluster Distance Transformer, and many more.
tsfresh is a Python library for calculating and extracting characteristics from time series data. It extracts features such as median, mean, sample entropy, quantile, skewness, variance, value_count, number of peaks, etc. It does not generalize to all types of data sets, since it is targeted at time series data. However, it can be used in conjunction with the other tools above.
Automating feature engineering remains a difficult task. There are also arguments against automating feature engineering, as it can produce incorrect results or classify observations using a wrong label in a non-transparent way. Therefore, automated feature engineering is treated with caution, especially in highly regulated environments such as Financial Services, where explainability and accountability are of paramount importance in every decision making process.
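Two of the standard pre-processing steps named above, imputation and one-hot encoding, can be sketched in a few lines of plain Python. This is a generic illustration and not DataRobot's implementation. The function names and the toy columns are hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def one_hot(values):
    """Encode a categorical column as one 0/1 column per category."""
    categories = sorted(set(values))
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows


# Hypothetical columns from a customer table.
ages = impute_mean([30, None, 50])
columns, encoded = one_hot(["red", "blue", "red"])

print(ages)
print(columns, encoded)
```

Steps like these reshape existing columns so that an algorithm can consume them. They do not combine columns into new ones, which is the distinction from primitive-based feature synthesis drawn above.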
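To give a feel for what extracting characteristics from a time series means, the sketch below computes a handful of the features listed above using only the Python standard library. It does not use tsfresh, and its definitions (for example of a peak, or of skewness as a population moment ratio) are simplifying assumptions that may differ from the ones tsfresh implements.

```python
from statistics import mean, median, pvariance


def count_peaks(series):
    """Count points strictly greater than both immediate neighbours."""
    triples = zip(series, series[1:], series[2:])
    return sum(1 for prev, cur, nxt in triples if cur > prev and cur > nxt)


def skewness(series):
    """Population skewness: third central moment over variance ** 1.5."""
    mu = mean(series)
    var = pvariance(series, mu)
    if var == 0:
        return 0.0
    third_moment = mean([(x - mu) ** 3 for x in series])
    return third_moment / var ** 1.5


def extract_features(series):
    """Summarise one time series as a flat dictionary of features."""
    return {
        "mean": mean(series),
        "median": median(series),
        "variance": pvariance(series),
        "skewness": skewness(series),
        "number_of_peaks": count_peaks(series),
    }


# Hypothetical signal, e.g. daily order counts for one customer.
signal = [1, 3, 2, 5, 4, 4, 6, 1]
features = extract_features(signal)
print(features)
```

Each series collapses into one row of numbers, so the output can be joined with features produced by the other tools described above.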
Automated Model Selection and Hyper-Parameter Tuning
Once the features have been pre-processed you need to find a Machine Learning algorithm to train on the observations for those features, so that it can predict a target value for new observations. Unlike feature engineering, model selection offers an abundance of choices. There are clustering models, classification and regression models, models based on neural networks, association-rule-based models, and many more. Each algorithm is suitable for a certain class of problems, and with automated model selection we can filter this model space by running through all suitable models for the task at hand and selecting the one that scores best on a chosen criterion, for example the lowest AIC or the lowest error (e.g. RMSE). It is understood that no single Machine Learning algorithm performs best on all datasets, and that some algorithms will require hyperparameter tuning. In fact, during model selection we tend to try different variables, different coefficients, or different hyperparameters. In regression problems, there is an approach that automates the choice of predictive variables used in the final model using criteria such as F-tests, t-tests, adjusted R-squared, etc. This approach is called stepwise regression. However, please refer to the following Stack Overflow thread to understand why this can be error prone and should be avoided in most cases.
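The selection loop described above can be sketched in miniature: fit every candidate, score each one on held-out data, and keep the one with the lowest RMSE. The two candidate models, the interleaved three-fold split, and the toy data are assumptions made for illustration. Real frameworks search far larger model spaces.

```python
from math import sqrt
from statistics import mean


def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    mu = mean(ys)
    return lambda x: mu


def fit_line(xs, ys):
    """Candidate: ordinary least squares for y = a + b * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def rmse(model, xs, ys):
    """Root mean squared error of a fitted model on (xs, ys)."""
    return sqrt(mean([(model(x) - y) ** 2 for x, y in zip(xs, ys)]))


def cross_validated_rmse(fit, xs, ys, folds=3):
    """Average held-out RMSE over interleaved folds."""
    scores = []
    for k in range(folds):
        train = [i for i in range(len(xs)) if i % folds != k]
        test = [i for i in range(len(xs)) if i % folds == k]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(rmse(model, [xs[i] for i in test], [ys[i] for i in test]))
    return mean(scores)


def select_model(candidates, xs, ys):
    """Return the name of the candidate with the lowest cross-validated RMSE."""
    scores = {name: cross_validated_rmse(fit, xs, ys)
              for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores


# Hypothetical data, roughly y = 1 + 2x with a little noise.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 17.1]

candidates = {"mean_baseline": fit_mean, "linear_regression": fit_line}
best, scores = select_model(candidates, xs, ys)
print(best, scores)
```

Scoring on held-out folds rather than on the training data is what keeps the comparison honest. A more flexible model would otherwise win simply by memorising the observations.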
Frameworks for automated model selection:
auto-sklearn is a Python library created by Matthias Feurer, Aaron Klein, Katharina Eggensperger et al. This library essentially tackles two core processes in Machine Learning: algorithm selection from a wide list of classification and regression algorithms, and hyperparameter optimization. It does not perform feature engineering in the sense of combining data set features to create new ones using mathematical primitives, as Featuretools does. Auto-sklearn is comparable to Auto-WEKA and Hyperopt-sklearn. The following are some of the classifiers that auto-sklearn can select from: Decision Trees, Gaussian Naive Bayes, Gradient Boosting, kNN, LDA, SVM, Random Forest, and Linear Classifier (SGD), to name a few. In terms of preprocessing steps it supports the following: kernel PCA, select percentile, select rates, one-hot encoding, imputation, balancing, scaling, and feature agglomeration, to name a few. Again, these are not feature engineering steps in the sense of enriching the data set by combining existing features.
There are algorithms that automatically go through a series of different variable configurations with the aim of optimizing some metric. This is similar to finding variable importance. Typically, humans do this very well by understanding the context and domain in which a variable exists. For instance: 'Sales increase during the summer season' or 'Most expensive purchases are from West London residents'. These are relationships that a human domain expert can infer naturally. However, there is another way to understand the importance of a variable, and that is by looking at how statistically important that variable is. Algorithms such as Decision Trees do this automatically (using the so-called Gini Index or Information Gain). Random Forests also do this, but unlike a single Decision Tree, a Random Forest builds many decision trees, each with some randomness introduced.
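As a small illustration of statistical variable importance, the sketch below computes the Gini impurity of a set of labels and the impurity reduction obtained by splitting on a variable. A decision tree prefers the split with the largest reduction. The toy customer rows, the thresholds, and the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(rows, labels, feature, threshold):
    """Impurity reduction from splitting on rows[i][feature] <= threshold."""
    left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
    right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


# Hypothetical customers: did they make an expensive purchase?
rows = [
    {"spend": 10, "visits": 3},
    {"spend": 20, "visits": 1},
    {"spend": 80, "visits": 3},
    {"spend": 90, "visits": 1},
]
labels = ["no", "no", "yes", "yes"]

gain_spend = gini_gain(rows, labels, "spend", threshold=50)
gain_visits = gini_gain(rows, labels, "visits", threshold=2)
print(gain_spend, gain_visits)
```

In this toy data, splitting on spend separates the two classes perfectly while splitting on visits separates nothing, so spend would be ranked as the more important variable.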
For time series data we tend to talk about the auto.arima function from the forecast package in R, which uses AIC as the optimization metric (by default the small-sample corrected variant, AICc). The algorithm that auto.arima uses in the background is called Hyndman-Khandakar, and it is explained at length in the following OTexts book.
The caret package in R can search the hyperparameter space to determine the optimal configuration for a given model. For instance, it can look for the optimal number of hidden layers in a neural network, the number of trees and random variables in a random forest, and many more. For a complete list of the hyperparameters that can be tuned using caret, have a look at the following extensive list of models it supports.
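For reference, the general textbook definition of AIC and of its small-sample correction AICc is given below. Here k is the number of estimated parameters, n is the number of observations, and \hat{L} is the maximized likelihood of the model. These are the standard general forms and not necessarily the exact parameterization used inside auto.arima.

```latex
\begin{align}
\mathrm{AIC}  &= 2k - 2\ln\hat{L} \\
\mathrm{AICc} &= \mathrm{AIC} + \frac{2k(k+1)}{n-k-1}
\end{align}
```

Lower values are better: the likelihood term rewards goodness of fit, while the 2k term penalizes every additional parameter, which discourages needlessly complex models.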
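A hyperparameter search of this kind can be sketched as a plain grid search. The example below is written in Python rather than R and does not use caret. It tunes the number of neighbours k of a tiny one-dimensional k-nearest-neighbours classifier using leave-one-out accuracy. The data, the grid, and the function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def knn_predict(train, query, k):
    """Majority label among the k training points nearest to query (1-D)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_accuracy(data, k):
    """Leave-one-out accuracy of k-NN on a list of (x, label) pairs."""
    hits = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, x, k) == label
    return hits / len(data)


def grid_search(data, grid):
    """Score every combination in the grid; return (best_score, best_params)."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((loo_accuracy(data, **params), params))
    # max() keeps the first of any tied combinations.
    return max(results, key=lambda result: result[0])


# Hypothetical data: two well-separated clusters of three points each.
data = [(1.0, "a"), (1.1, "a"), (1.2, "a"), (5.0, "b"), (5.1, "b"), (5.2, "b")]

best_score, best_params = grid_search(data, {"k": [1, 3, 5]})
print(best_score, best_params)
```

With k = 5 every held-out point is outvoted by the other cluster, so the search discards that setting. The cost of a grid search grows multiplicatively with each extra hyperparameter, which is one motivation for the random and Bayesian strategies mentioned later in this post.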
H2O Driverless AI, as discussed earlier, can be used for automated feature engineering. It can also train multiple algorithms automatically at the same time. In H2O's open source platform this capability is exposed through the h2o.automl function, which trains models on your data using multiple algorithms with different parameters, such as GLM, XGBoost, Random Forest, Deep Learning, and Ensemble models, to name a few.
DataRobot can also be used to automatically train multiple algorithms at the same time. It does so using models that have been tuned by DataRobot's scientists, and as such it is able to run dozens of models with preset hyperparameters. It eventually selects the algorithm that results in the highest accuracy. It also allows the Data Scientist to intervene manually and tweak the models in order to improve accuracy.
Microsoft announced in September its own toolkit around Automated Machine Learning. The product itself is called Automated ML and falls under the Azure Machine Learning offering. Microsoft's Automated ML leverages collaborative filtering and Bayesian optimization to search a space of machine learning pipelines. By ML pipeline Microsoft refers to the combination of data pre-processing steps, learning algorithms, and hyperparameter configurations. In many of the model selection techniques discussed above, the part of the ML process that typically gets automated is the hyperparameter settings. Researchers at Microsoft have found that tuning only hyperparameters is sometimes comparable to random search, and that the entire end-to-end ML pipeline should therefore ideally be automated.
Automated ML is currently in Preview mode and can be tested by going to this link.
Google has also innovated in this space with its Google Cloud AutoML. In Cloud AutoML Google gives Data Scientists the ability to train models for Computer Vision, Natural Language Processing, and Translation by taking only the labeled data from the user and then building and training the algorithms automatically.
TPOT is a Python library for automated machine learning that leverages genetic programming to optimize machine learning pipelines. The ML pipeline includes Data Cleansing, Feature Selection, Feature Preprocessing, Feature Construction, Model Selection, and Parameter Optimization. The TPOT library leverages the Machine Learning libraries available in scikit-learn.
Amazon SageMaker offers capabilities for model building, training, and deployment. It can automatically tune an algorithm, and to do so it uses a technique called Bayesian optimization.
HyperDrive is a Microsoft product built for comprehensive hyperparameter exploration. The hyperparameter search space can be covered using Random Search, Grid Search, or Bayesian optimization. It implements a list of schedulers which you can choose from to induce early termination of the exploration phase by jointly optimizing on quality and cost.
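The combination of random search and early termination can be sketched on a toy problem. This is not HyperDrive's API and not its scheduling algorithm. The `learning_curve` function is a made-up stand-in for the validation accuracy of a training run, and the stopping rule (abandon a trial as soon as it falls behind the best trial so far at the same step) is a deliberately simple assumption.

```python
import random


def learning_curve(lr, step):
    """Toy stand-in for validation accuracy after `step` epochs.

    Peaks for a learning rate near 0.1 and rises with training time.
    """
    quality = max(0.0, 1.0 - abs(lr - 0.1) * 5)
    return quality * step / (step + 1)


def random_search(trials, max_steps, seed=0):
    """Random search over the learning rate with early termination."""
    rng = random.Random(seed)
    best = {"score": -1.0, "lr": None}
    best_curve = []  # intermediate scores of the best trial so far
    steps_used = 0
    for _ in range(trials):
        lr = 10 ** rng.uniform(-3, 0)  # log-uniform sample in [0.001, 1]
        curve = []
        for step in range(1, max_steps + 1):
            curve.append(learning_curve(lr, step))
            steps_used += 1
            # Early termination: stop once behind the incumbent at this step.
            if best_curve and curve[-1] < best_curve[step - 1]:
                break
        else:
            # Ran to completion, so it matched or beat the incumbent throughout.
            best = {"score": curve[-1], "lr": lr}
            best_curve = curve
    return best, steps_used


best, steps_used = random_search(trials=20, max_steps=5)
print(best, "training steps used:", steps_used, "of", 20 * 5)
```

In this toy setting the ranking of two configurations never changes between steps, so stopping early cannot discard the eventual winner. Real learning curves can cross, which is why production schedulers trade off quality against cost more carefully.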
Automated Neural Network Architecture Selection
One of the most tedious tasks in the world of Machine Learning is designing and building neural network architectures. Typically humans will spend hours or days iterating through different neural network architectures with different hyperparameters in order to optimize an objective function for the task at hand. This is very time consuming and often prone to errors. Google introduced the idea of Neural Architecture Search, employing evolutionary algorithms and reinforcement learning to design and find optimal neural network architectures. In essence, the search learns to design a layer and then stacks those layers to create a Deep Neural Network architecture. This area of research has drawn a lot of attention recently and a number of research papers have been published. Here is an up to date list of all research papers in this area: http://www.ml4aad.org/automl/literature-on-neural-architecture-search/. A few notable research papers worth mentioning are:
NASNet - Learning Transferable Architectures for Scalable Image Recognition
AmoebaNet - Regularized Evolution for Image Classifier Architecture Search
ENAS - Efficient Neural Architecture Search
A lot of the focus of the Machine Learning community is directed towards learning algorithm development and not so much towards what is probably the most important part of an end-to-end machine learning pipeline: ML model deployment and productionization. There are many challenges inherent in deploying machine learning models to production. We would not do the topic justice if we did not mention the excellent list of challenges outlined by Seldon in their GitHub account, which you can read here: https://github.com/SeldonIO/seldon-core/blob/master/docs/challenges.md
There are companies and open source projects which are trying to automate this process and make it as painless as possible for the Data Scientist who does not necessarily have DevOps skills. The following is a list of frameworks and companies that are working in this space:
Seldon - provides ways to wrap your model built in R, Python, Java, or NodeJS and deploy it to a Kubernetes cluster. It provides integration with Kubeflow, IBM's Fabric for Deep Learning, NVIDIA TensorRT and DL Inference Server, TensorFlow Serving, etc.
Redis-ML - is a module in Redis (an in-memory distributed key-value database) that allows models to be deployed into production. It currently supports only the following algorithms: Random Forests (classification and regression), Linear Regression, and Logistic Regression.
Model Server for Apache MXNet is used for serving deep learning models exported from MXNet or the Open Neural Network Exchange (ONNX).
Microsoft Machine Learning Service allows you to deploy a model on a scalable Kubernetes cluster, where it can be called as a web service.
Amazon SageMaker can be used to deploy a model to an HTTPS endpoint that an application can use for inferencing / predictions on new data observations.
Google Cloud ML also supports model deployment and inferencing through HTTP calls to the web service hosting the model. By default it limits the size of the model to 250 MB.
H2O supports deployment of models through Java MOJOs (Model ObJect, Optimized). MOJOs are supported for AutoML, Deep Learning, DRF, GBM, GLM, GLRM, K-Means, Stacked Ensembles, SVM, Word2vec, and XGBoost models. They are tightly integrated with Java environments. For models built outside Java (for example in R or Python), the model can be saved as a serialized object and loaded at inference time.
TensorFlow Serving is used to deploy TensorFlow models into production. In a few lines of code you can serve a TensorFlow model as an API for prediction.
Openscoring - if your model has been trained and exported into the PMML format, then Openscoring can help you serve those PMML models as REST APIs for inferencing.
GraphPipe has been created with the aim of decoupling ML model deployment from framework-specific model implementations (e.g. TensorFlow, Caffe2, ONNX).
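What most of the tools above have in common is that they expose a trained model behind an HTTP endpoint. The sketch below shows that idea in its barest form using only the Python standard library. It does not represent the API of any of the products listed. The fixed `WEIGHTS` stand in for a real trained model, and the JSON field names `instances` and `predictions` are hypothetical choices.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical "trained model": fixed linear coefficients stand in for a real one.
WEIGHTS = {"bias": 1.0, "spend": 0.5, "visits": 2.0}


def predict(features):
    """Score one observation given as a {feature_name: value} mapping."""
    return WEIGHTS["bias"] + sum(
        WEIGHTS[name] * value for name, value in features.items()
    )


def handle_request(body):
    """Turn a JSON request body into a JSON response body."""
    payload = json.loads(body)
    scores = [predict(row) for row in payload["instances"]]
    return json.dumps({"predictions": scores})


class PredictHandler(BaseHTTPRequestHandler):
    """Answers POST requests with model predictions."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        response = handle_request(self.rfile.read(length)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(port=8080):
    """Block forever, serving predictions on the given port."""
    HTTPServer(("", port), PredictHandler).serve_forever()


# Exercising the request handling logic directly, without starting a server.
print(handle_request('{"instances": [{"spend": 10, "visits": 2}]}'))
```

Calling serve() starts the endpoint, after which any HTTP client can POST observations to it. Everything that production frameworks add on top, such as scaling, versioning, monitoring, and rollbacks, is exactly the DevOps burden these tools aim to take off the Data Scientist.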
 J. M. Kanter and K. Veeramachaneni, “Deep feature synthesis: Towards automating data science endeavors,” in IEEE International Conference on Data Science and Advanced Analytics, 2015, pp. 1–10.
 The Dangers of Automated Model Selection http://www.learnbymarketing.com/743/dangers-of-auto-model-select/
 Finding Important Variables in Your Data http://www.learnbymarketing.com/603/variable-importance/
 Model Tuning https://www.datarobot.com/wiki/tuning/
 Probabilistic Matrix Factorization for Automated Machine Learning https://arxiv.org/pdf/1705.05355.pdf
 Everything You need to Know about AutoML and Neural Network Architecture Search https://towardsdatascience.com/everything-you-need-to-know-about-automl-and-neural-architecture-search-8db1863682bf
 Awesome Machine Learning Operations https://github.com/EthicalML/awesome-machine-learning-operations
 HyperDrive: Exploring Hyperparameters with POP Scheduling https://www.microsoft.com/en-us/research/publication/hyperdrive-exploring-hyperparameters-pop-scheduling/