
Machine Learning and Data Science Blueprints for Finance
From Building Trading Strategies to Robo-Advisors Using Python
Hariom Tatsat, Sahil Puri, and Brad Lookabaugh
Beijing Boston Farnham Sebastopol Tokyo
978-1-492-07305-5
Machine Learning and Data Science Blueprints for Finance
by Hariom Tatsat, Sahil Puri, and Brad Lookabaugh
Copyright © 2021 Hariom Tatsat, Sahil Puri, and Brad Lookabaugh. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://oreilly.com). For more information, contact our corporate/institu‐
tional sales department: 800-998-9938 or corporate@oreilly.com.
Acquisitions Editor: Michelle Smith
Development Editor: Jeff Bleiel
Production Editor: Christopher Faucher
Copyeditor: Piper Editorial, LLC
Proofreader: Arthur Johnson
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea
October 2020: First Edition
Revision History for the First Edition
2020-09-29: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781492073055 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Machine Learning and Data Science
Blueprints for Finance, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors, and do not represent the publisher’s views.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
Table of Contents
Preface ix
Part I. The Framework
1. Machine Learning in Finance: The Landscape 1
Current and Future Machine Learning Applications in Finance 2
Algorithmic Trading 2
Portfolio Management and Robo-Advisors 2
Fraud Detection 3
Loans/Credit Card/Insurance Underwriting 3
Automation and Chatbots 3
Risk Management 4
Asset Price Prediction 4
Derivative Pricing 4
Sentiment Analysis 5
Trade Settlement 5
Money Laundering 5
Machine Learning, Deep Learning, Artificial Intelligence, and Data Science 5
Machine Learning Types 7
Supervised 7
Unsupervised 8
Reinforcement Learning 9
Natural Language Processing 10
Chapter Summary 11
2. Developing a Machine Learning Model in Python 13
Why Python? 13
Python Packages for Machine Learning 14
Python and Package Installation 15
Steps for Model Development in Python Ecosystem 15
Model Development Blueprint 16
Chapter Summary 29
3. Artificial Neural Networks 31
ANNs: Architecture, Training, and Hyperparameters 32
Architecture 32
Training 34
Hyperparameters 36
Creating an Artificial Neural Network Model in Python 40
Installing Keras and Machine Learning Packages 40
Running an ANN Model Faster: GPU and Cloud Services 43
Chapter Summary 45
Part II. Supervised Learning
4. Supervised Learning: Models and Concepts 49
Supervised Learning Models: An Overview 51
Linear Regression (Ordinary Least Squares) 52
Regularized Regression 55
Logistic Regression 57
Support Vector Machine 58
K-Nearest Neighbors 60
Linear Discriminant Analysis 62
Classification and Regression Trees 63
Ensemble Models 65
ANN-Based Models 71
Model Performance 73
Overfitting and Underfitting 73
Cross Validation 74
Evaluation Metrics 75
Model Selection 79
Factors for Model Selection 79
Model Trade-off 81
Chapter Summary 82
5. Supervised Learning: Regression (Including Time Series Models) 83
Time Series Models 86
Time Series Breakdown 87
Autocorrelation and Stationarity 88
Traditional Time Series Models (Including the ARIMA Model) 90
Deep Learning Approach to Time Series Modeling 92
Modifying Time Series Data for Supervised Learning Models 95
Case Study 1: Stock Price Prediction 95
Blueprint for Using Supervised Learning Models to Predict a Stock Price 97
Case Study 2: Derivative Pricing 114
Blueprint for Developing a Machine Learning Model for Derivative
Pricing 115
Case Study 3: Investor Risk Tolerance and Robo-Advisors 125
Blueprint for Modeling Investor Risk Tolerance and Enabling a Machine
Learning–Based Robo-Advisor 127
Case Study 4: Yield Curve Prediction 141
Blueprint for Using Supervised Learning Models to Predict the Yield
Curve 142
Chapter Summary 149
Exercises 150
6. Supervised Learning: Classification 151
Case Study 1: Fraud Detection 153
Blueprint for Using Classification Models to Determine Whether a
Transaction Is Fraudulent 153
Case Study 2: Loan Default Probability 166
Blueprint for Creating a Machine Learning Model for Predicting Loan
Default Probability 167
Case Study 3: Bitcoin Trading Strategy 179
Blueprint for Using Classification-Based Models to Predict Whether to
Buy or Sell in the Bitcoin Market 180
Chapter Summary 190
Exercises 191
Part III. Unsupervised Learning
7. Unsupervised Learning: Dimensionality Reduction 195
Dimensionality Reduction Techniques 197
Principal Component Analysis 198
Kernel Principal Component Analysis 201
t-distributed Stochastic Neighbor Embedding 202
Case Study 1: Portfolio Management: Finding an Eigen Portfolio 202
Blueprint for Using Dimensionality Reduction for Asset Allocation 203
Case Study 2: Yield Curve Construction and Interest Rate Modeling 217
Blueprint for Using Dimensionality Reduction to Generate a Yield Curve 218
Case Study 3: Bitcoin Trading: Enhancing Speed and Accuracy 227
Blueprint for Using Dimensionality Reduction to Enhance a Trading
Strategy 228
Chapter Summary 236
Exercises 236
8. Unsupervised Learning: Clustering 237
Clustering Techniques 239
k-means Clustering 239
Hierarchical Clustering 240
Affinity Propagation Clustering 242
Case Study 1: Clustering for Pairs Trading 243
Blueprint for Using Clustering to Select Pairs 244
Case Study 2: Portfolio Management: Clustering Investors 259
Blueprint for Using Clustering for Grouping Investors 260
Case Study 3: Hierarchical Risk Parity 267
Blueprint for Using Clustering to Implement Hierarchical Risk Parity 268
Chapter Summary 277
Exercises 277
Part IV. Reinforcement Learning and Natural Language Processing
9. Reinforcement Learning 281
Reinforcement Learning—Theory and Concepts 283
RL Components 284
RL Modeling Framework 288
Reinforcement Learning Models 293
Key Challenges in Reinforcement Learning 298
Case Study 1: Reinforcement Learning–Based Trading Strategy 298
Blueprint for Creating a Reinforcement Learning–Based Trading Strategy 300
Case Study 2: Derivatives Hedging 316
Blueprint for Implementing a Reinforcement Learning–Based Hedging
Strategy 317
Case Study 3: Portfolio Allocation 334
Blueprint for Creating a Reinforcement Learning–Based Algorithm for
Portfolio Allocation 335
Chapter Summary 344
Exercises 345
10. Natural Language Processing 347
Natural Language Processing: Python Packages 349
NLTK 349
TextBlob 349
spaCy 350
Natural Language Processing: Theory and Concepts 350
1. Preprocessing 351
2. Feature Representation 356
3. Inference 360
Case Study 1: NLP and Sentiment Analysis–Based Trading Strategies 362
Blueprint for Building a Trading Strategy Based on Sentiment Analysis 363
Case Study 2: Chatbot Digital Assistant 383
Blueprint for Creating a Custom Chatbot Using NLP 385
Case Study 3: Document Summarization 393
Blueprint for Using NLP for Document Summarization 394
Chapter Summary 400
Exercises 400
Index 401
Preface
The value of machine learning (ML) in finance is becoming more apparent each day.
Machine learning is expected to become crucial to the functioning of financial mar‐
kets. Analysts, portfolio managers, traders, and chief investment officers should all be
familiar with ML techniques. For banks and other financial institutions striving to
improve financial analysis, streamline processes, and increase security, ML is becom‐
ing the technology of choice. The use of ML in institutions is an increasing trend, and
its potential for improving various systems can be observed in trading strategies,
pricing, and risk management.
Although machine learning is making significant inroads across all verticals of the
financial services industry, there is a gap between the ideas and the implementation
of machine learning algorithms. A plethora of material is available on the web in
these areas, yet very little is organized. Additionally, most of the literature is limited
to trading algorithms only. Machine Learning and Data Science Blueprints for Finance
fills this void and provides a machine learning toolbox customized for the financial
markets that allows readers to be part of the machine learning revolution. This
book is not limited to investing or trading strategies; it focuses on leveraging the art
and craft of building ML-driven algorithms that are crucial in the finance industry.
Implementing machine learning models in finance is easier than commonly believed.
There is also a misconception that big data is needed for building machine learning
models. The case studies in this book span almost all areas of machine learning and
aim to dispel such misconceptions. This book not only will cover the theory and case
studies related to using ML in trading strategies but also will delve deep into other
critical “need-to-know” concepts such as portfolio management, derivative pricing,
fraud detection, corporate credit ratings, robo-advisor development, and chatbot
development. It will address real-life problems faced by practitioners and provide sci‐
entifically sound solutions supported by code and examples.
The Python codebase for this book on GitHub will be useful and serve as a starting
point for industry practitioners working on their projects. The examples and case
studies shown in the book demonstrate techniques that can easily be applied to a
wide range of datasets. The futuristic case studies, such as reinforcement learning for
trading, building a robo-advisor, and using machine learning for instrument pricing,
inspire readers to think outside the box and motivate them to make the best of the
models and data available.
Who This Book Is For
The format of the book and the list of topics covered make it suitable for professio‐
nals working in hedge funds, investment and retail banks, and fintech firms. They
may have titles such as data scientist, data engineer, quantitative researcher, machine
learning architect, or software engineer. Additionally, the book will be useful for
those professionals working in support functions, such as compliance and risk.
Whether a quantitative trader in a hedge fund is looking for ideas in using reinforce‐
ment learning for trading cryptocurrency or an investment bank quant is looking for
machine learning–based techniques to improve the calibration speed of pricing mod‐
els, this book will add value. The theory, concepts, and codebase mentioned in the
book will be extremely useful at every step of the model development lifecycle, from
idea generation to model implementation. Readers can use the shared codebase and
test the proposed solutions themselves, allowing for a hands-on reader experience.
The readers should have a basic knowledge of statistics, machine learning, and
Python.
How This Book Is Organized
This book provides a comprehensive introduction to how machine learning and data
science can be used to design models across different areas in finance. It is organized
into four parts.
Part I: The Framework
The first part provides an overview of machine learning in finance and the building
blocks of machine learning implementation. These chapters serve as the foundation
for the case studies covering different machine learning types presented in the rest of
the book.
The chapters under the first part are as follows:
Chapter 1, Machine Learning in Finance: The Landscape
This chapter provides an overview of applications of machine learning in finance
and provides a brief overview of several types of machine learning.
Chapter 2, Developing a Machine Learning Model in Python
This chapter looks at the Python-based ecosystem for machine learning. It also
covers the steps for machine learning model development in the Python frame‐
work.
Chapter 3, Artificial Neural Networks
Given that an artificial neural network (ANN) is a primary algorithm used across
all types of machine learning, this chapter looks at the details of ANNs, followed
by a detailed implementation of an ANN model using Python libraries.
Part II: Supervised Learning
The second part covers fundamental supervised learning algorithms and illustrates
specific applications and case studies.
The chapters under the second part are as follows:
Chapter 4, Supervised Learning: Models and Concepts
This chapter provides an introduction to supervised learning techniques (both
classification and regression). Given that a lot of models are common between
classification and regression, the details of those models are presented together
along with other concepts such as model selection and evaluation metrics for
classification and regression.
Chapter 5, Supervised Learning: Regression (Including Time Series Models)
Supervised learning-based regression models are the most commonly used
machine learning models in finance. This chapter covers the models from basic
linear regression to advance deep learning. The case studies covered in this sec‐
tion include models for stock price prediction, derivatives pricing, and portfolio
management.
Chapter 6, Supervised Learning: Classification
Classification is a subcategory of supervised learning in which the goal is to pre‐
dict the categorical class labels of new instances, based on past observations. This
section discusses several case studies based on classification techniques,
such as logistic regression, support vector machines, and random forests.
Part III: Unsupervised Learning
The third part covers the fundamental unsupervised learning algorithms and offers
applications and case studies.
The chapters under the third part are as follows:
Chapter 7, Unsupervised Learning: Dimensionality Reduction
This chapter describes the essential techniques to reduce the number of features
in a dataset while retaining most of their useful and discriminatory information.
It also discusses the standard approach to dimensionality reduction via principal
component analysis and covers case studies in portfolio management, trading
strategy, and yield curve construction.
Chapter 8, Unsupervised Learning: Clustering
This chapter covers the algorithms and techniques related to clustering and iden‐
tifying groups of objects that share a degree of similarity. The case studies utiliz‐
ing clustering in trading strategies and portfolio management are covered in this
chapter.
Part IV: Reinforcement Learning and Natural Language Processing
The fourth part covers the reinforcement learning and natural language processing
(NLP) techniques.
The chapters under the fourth part are as follows:
Chapter 9, Reinforcement Learning
This chapter covers concepts and case studies on reinforcement learning, which
have great potential for application in the finance industry. Reinforcement learn‐
ing’s main idea of “maximizing the rewards” aligns perfectly with the core moti‐
vation of several areas within finance. Case studies related to trading strategies,
portfolio optimization, and derivatives hedging are covered in this chapter.
Chapter 10, Natural Language Processing
This chapter describes the techniques in natural language processing and dis‐
cusses the essential steps to transform textual data into meaningful representa‐
tions across several areas in finance. Case studies related to sentiment analysis,
chatbots, and document interpretation are covered.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program ele‐
ments such as variable or function names, databases, data types, environment
variables, statements, and keywords.
This element signifies a tip or suggestion.
This element signifies a general note.
This element indicates a warning or caution.
This element indicates a blueprint.
Using Code Presented in the Book
All code in this book (case studies and master template) is available at the GitHub
directory: https://github.com/tatsath/fin-ml. The code is hosted on a cloud platform,
so every case study can be run without installing a package on a local machine by
clicking https://mybinder.org/v2/gh/tatsath/fin-ml/master.
This book is here to help you get your job done. In general, if example code is offered,
you may use it in your programs and documentation. You do not need to contact us
for permission unless you’re reproducing a significant portion of the code. For exam‐
ple, writing a program that uses several chunks of code from this book does not
require permission. Selling or distributing examples from O’Reilly books does require
permission. Answering a question by citing this book and quoting example code does
not require permission. Incorporating a significant amount of example code from
this book into your product’s documentation does require permission.
We appreciate, but generally do not require, attribution. An attribution usually
includes the title, author, publisher, and ISBN. For example: Machine Learning and
Data Science Blueprints for Finance by Hariom Tatsat, Sahil Puri, and Brad Looka‐
baugh (O’Reilly, 2021), 978-1-492-07305-5.
If you feel your use of code examples falls outside fair use or the permission given
above, feel free to contact us at permissions@oreilly.com.
Python Libraries
The book uses Python 3.7. Installing the Conda package manager is recommended in
order to create a Conda environment to install the requisite libraries. Installation
instructions are available on the GitHub repo’s README file.
O’Reilly Online Learning
For more than 40 years, O’Reilly Media has provided technol‐
ogy and business training, knowledge, and insight to help
companies succeed.
Our unique network of experts and innovators share their knowledge and expertise
through books, articles, and our online learning platform. O’Reilly’s online learning
platform gives you on-demand access to live training courses, in-depth learning
paths, interactive coding environments, and a vast collection of text and video from
O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at https://oreil.ly/ML-and-data-science-blueprints.
Email bookquestions@oreilly.com to comment or ask technical questions about this
book.
For news and information about our books and courses, visit http://oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://youtube.com/oreillymedia
Acknowledgments
We want to thank all those who helped to make this book a reality. Special thanks to
Jeff Bleiel for honest, insightful feedback, and for guiding us through the entire pro‐
cess. We are incredibly grateful to Juan Manuel Contreras, Chakri Cherukuri, and
Gregory Bronner, who took time out of their busy lives to review our book in so
much detail. The book benefited from their valuable feedback and suggestions. Many
thanks as well to O’Reilly’s fantastic staff, in particular Michelle Smith for believing in
this project and helping us define its scope.
Special Thanks from Hariom
I would like to thank my wife, Prachi, and my parents for their love and support. Spe‐
cial thanks to my father for encouragingme in all of my pursuits and being a contin‐
uous source of inspiration.
Special Thanks from Sahil
Thanks to my family, who always encouraged and supported me in all endeavors.
Special Thanks from Brad
Thank you to my wife, Megan, for her endless love and support.
PART I
The Framework
CHAPTER 1
Machine Learning in Finance:
The Landscape
Machine learning promises to shake up large swathes of finance
—The Economist (2017)
There is a new wave of machine learning and data science in finance, and the related
applications will transform the industry over the next few decades.
Currently, most financial firms, including hedge funds, investment and retail banks,
and fintech firms, are adopting and investing heavily in machine learning. Going for‐
ward, financial institutions will need a growing number of machine learning and data
science experts.
Machine learning in finance has become more prominent recently due to the availa‐
bility of vast amounts of data and more affordable computing power. The use of data
science and machine learning is exploding exponentially across all areas of finance.
The success of machine learning in finance depends upon building efficient infra‐
structure, using the correct toolkit, and applying the right algorithms. The concepts
related to these building blocks of machine learning in finance are demonstrated and
utilized throughout this book.
In this chapter, we provide an introduction to the current and future application of
machine learning in finance, including a brief overview of different types of machine
learning. This chapter and the two that follow serve as the foundation for the case
studies presented in the rest of the book.
Current and Future Machine Learning Applications
in Finance
Let’s take a look at some promising machine learning applications in finance. The
case studies presented in this book cover all the applications mentioned here.
Algorithmic Trading
Algorithmic trading (or simply algo trading) is the use of algorithms to conduct trades
autonomously. With origins going back to the 1970s, algorithmic trading (sometimes
called Automated Trading Systems, which is arguably a more accurate description)
involves the use of automated preprogrammed trading instructions to make
extremely fast, objective trading decisions.
Machine learning stands to push algorithmic trading to new levels. Not only can
more advanced strategies be employed and adapted in real time, but machine learn‐
ing–based techniques can offer even more avenues for gaining special insight into
market movements. Most hedge funds and financial institutions do not openly dis‐
close their machine learning–based approaches to trading (for good reason), but
machine learning is playing an increasingly important role in calibrating trading
decisions in real time.
Portfolio Management and Robo-Advisors
Asset and wealth management firms are exploring potential artificial intelligence (AI)
solutions for improving their investment decisions and making use of their troves of
historical data.
One example of this is the use of robo-advisors, algorithms built to calibrate a finan‐
cial portfolio to the goals and risk tolerance of the user. Additionally, they provide
automated financial guidance and service to end investors and clients.
A user enters their financial goals (e.g., to retire at age 65 with $250,000 in savings),
age, income, and current financial assets. The advisor (the allocator) then spreads
investments across asset classes and financial instruments in order to reach the user’s
goals.
The system then calibrates to changes in the user’s goals and real-time changes in the
market, aiming always to find the best fit for the user’s original goals. Robo-advisors
have gained significant traction among consumers who do not need a human advisor
to feel comfortable investing.
Fraud Detection
Fraud is a massive problem for financial institutions and one of the foremost reasons
to leverage machine learning in finance.
There is currently a significant data security risk due to high computing power, fre‐
quent internet use, and an increasing amount of company data being stored online.
While previous financial fraud detection systems depended heavily on complex and
robust sets of rules, modern fraud detection goes beyond following a checklist of risk
factors—it actively learns and calibrates to new potential (or real) security threats.
Machine learning is ideally suited to combating fraudulent financial transactions.
This is because machine learning systems can scan through vast datasets, detect
unusual activities, and flag them instantly. Given the incalculably high number of
ways that security can be breached, genuine machine learning systems will be an
absolute necessity in the days to come.
Loans/Credit Card/Insurance Underwriting
Underwriting could be described as a perfect job for machine learning in finance, and
indeed there is a great deal of worry in the industry that machines will replace a large
swath of underwriting positions that exist today.
Especially at large companies (big banks and publicly traded insurance firms),
machine learning algorithms can be trained on millions of examples of consumer
data and financial lending or insurance outcomes, such as whether a person defaulted
on their loan or mortgage.
Underlying financial trends can be assessed with algorithms and continuously ana‐
lyzed to detect trends that might influence lending and underwriting risk in the
future. Algorithms can perform automated tasks such as matching data records,
identifying exceptions, and calculating whether an applicant qualifies for a credit or
insurance product.
Automation and Chatbots
Automation is patently well suited to finance. It reduces the strain that repetitive,
low-value tasks put on human employees. It tackles the routine, everyday processes,
freeing up teams to finish their high-value work. In doing so, it drives enormous time
and cost savings.
Adding machine learning and AI into the automation mix adds another level of sup‐
port for employees. With access to relevant data, machine learning and AI can pro‐
vide an in-depth data analysis to support finance teams with difficult decisions. In
some cases, it may even be able to recommend the best course of action for employ‐
ees to approve and enact.
AI and automation in the financial sector can also learn to recognize errors, reducing
the time wasted between discovery and resolution. This means that human team
members are less likely to be delayed in providing their reports and are able to com‐
plete their work with fewer errors.
AI chatbots can be implemented to support finance and banking customers. With the
rise in popularity of live chat software in banking and finance businesses, chatbots are
the natural evolution.
Risk Management
Machine learning techniques are transforming how we approach risk management.
All aspects of understanding and controlling risk are being revolutionized through
the growth of solutions driven by machine learning. Examples range from deciding
how much a bank should lend a customer to improving compliance and reducing
model risk.
Asset Price Prediction
Asset price prediction is considered the most frequently discussed and most sophisti‐
cated area in finance. Predicting asset prices allows one to understand the factors that
drive the market and speculate on asset performance. Traditionally, asset price predic‐
tion was performed by analyzing past financial reports and market performance to
determine what position to take for a specific security or asset class. However, with
a tremendous increase in the amount of financial data, the traditional approaches for
analysis and stock-selection strategies are being supplemented with ML-based
techniques.
Derivative Pricing
Recent machine learning successes, as well as the fast pace of innovation, indicate
that ML applications for derivatives pricing should become widely used in the com‐
ing years. The world of Black-Scholes models, volatility smiles, and Excel spreadsheet
models should wane as more advanced methods become readily available.
The classic derivative pricing models are built on several impractical assumptions to
reproduce the empirical relationship between the underlying input data (strike price,
time to maturity, option type) and the price of the derivatives observed in the market.
Machine learning methods do not rely on several assumptions; they just try to esti‐
mate a function between the input data and price, minimizing the difference between
the results of the model and the target.
The faster deployment times achieved with state-of-the-art ML tools are just one of
the advantages that will accelerate the use of machine learning in derivatives pricing.
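As a hedged sketch of this idea (the inputs, prices, and model choice below are illustrative assumptions, not the book's derivative pricing case study, which appears in Chapter 5), a model can be fit directly on observed inputs against market prices, minimizing the difference between model output and target:

from sklearn.ensemble import RandomForestRegressor
import numpy as np

# Hypothetical inputs: strike price, time to maturity (years), option type (1=call, 0=put)
X = np.array([[100, 0.50, 1],
              [100, 1.00, 1],
              [110, 0.50, 0],
              [ 90, 1.00, 0],
              [105, 0.25, 1]])
y = np.array([6.2, 8.9, 12.1, 2.4, 3.1])  # illustrative observed market prices

# Estimate the input-to-price function directly from the data,
# with no assumptions about its functional form
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(model.predict([[100, 0.75, 1]]))  # price an unseen input combination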
Sentiment Analysis
Sentiment analysis involves the perusal of enormous volumes of unstructured data,
such as videos, transcriptions, photos, audio files, social media posts, articles, and
business documents, to determine market sentiment. Sentiment analysis is crucial for
all businesses in today’s workplace and is an excellent example of machine learning in
finance.
The most common use of sentiment analysis in the financial sector is the analysis of
financial news—in particular, predicting the behaviors and possible trends of mar‐
kets. The stock market moves in response to myriad human-related factors, and the
hope is that machine learning will be able to replicate and enhance human intuition
about financial activity by discovering new trends and telling signals.
However, much of the future applications of machine learning will be in understand‐
ing social media, news trends, and other data sources related to predicting the senti‐
ments of customers toward market developments. It will not be limited to predicting
stock prices and trades.
Trade Settlement
Trade settlement is the process of transferring securities into the account of a buyer
and cash into the seller’s account following a transaction of a financial asset.
Despite the majority of trades being settled automatically, and with little or no inter‐
action by human beings, about 30% of trades need to be settled manually.
The use of machine learning not only can identify the reason for failed trades, but it
also can analyze why the trades were rejected, provide a solution, and predict which
trades may fail in the future. What usually would take a human being five to ten
minutes to fix, machine learning can do in a fraction of a second.
Money Laundering
A United Nations report estimates that the amount of money laundered worldwide
per year is 2%–5% of global GDP. Machine learning techniques can analyze internal,
publicly existing, and transactional data from a client’s broader network in an
attempt to spot money laundering signs.
Machine Learning, Deep Learning, Artificial Intelligence,
and Data Science
For the majority of people, the terms machine learning, deep learning, artificial intelli‐
gence, and data science are confusing. In fact, a lot of people use one term inter‐
changeably with the others.
Figure 1-1 shows the relationships between AI, machine learning, deep learning, and
data science. Machine learning is a subset of AI that consists of techniques that
enable computers to identify patterns in data and to deliver AI applications. Deep
learning, meanwhile, is a subset of machine learning that enables computers to solve
more complex problems.
Data science isn’t exactly a subset of machine learning, but it uses machine learning,
deep learning, and AI to analyze data and reach actionable conclusions. It combines
machine learning, deep learning, and AI with other disciplines such as big data ana‐
lytics and cloud computing.
Figure 1-1. AI, machine learning, deep learning, and data science
The following is a summary of the details about artificial intelligence, machine learn‐
ing, deep learning, and data science:
Artificial intelligence
Artificial intelligence is the field of study by which a computer (and its systems)
develop the ability to successfully accomplish complex tasks that usually require
human intelligence. These tasks include, but are not limited to, visual perception,
speech recognition, decision making, and translation between languages. AI is
usually defined as the science of making computers do things that require intelli‐
gence when done by humans.
Machine learning
Machine learning is an application of artificial intelligence that provides the AI
system with the ability to automatically learn from the environment and apply
those lessons to make better decisions. There are a variety of algorithms that
machine learning uses to iteratively learn, describe and improve data, spot pat‐
terns, and then perform actions on these patterns.
Deep learning
Deep learning is a subset of machine learning that involves the study of algo‐
rithms related to artificial neural networks that contain many blocks (or layers)
stacked on each other. The design of deep learning models is inspired by the bio‐
logical neural network of the human brain. It strives to analyze data with a logical
structure similar to how a human draws conclusions.
Data science
Data science is an interdisciplinary field similar to data mining that uses scien‐
tific methods, processes, and systems to extract knowledge or insights from data
in various forms, either structured or unstructured. Data science is different from
ML and AI because its goal is to gain insight into and understanding of the data
by using different scientific tools and techniques. However, there are several
tools and techniques common to both ML and data science, some of which are
demonstrated in this book.
Machine Learning Types
This section will outline all types of machine learning that are used in different case
studies presented in this book for various financial applications. The three types of
machine learning, as shown in Figure 1-2, are supervised learning, unsupervised
learning, and reinforcement learning.
Figure 1-2. Machine learning types
Supervised
The main goal in supervised learning is to train a model from labeled data that allows
us to make predictions about unseen or future data. Here, the term supervised refers
to a set of samples where the desired output signals (labels) are already known. There
are two types of supervised learning algorithms: classification and regression.
Classification
Classification is a subcategory of supervised learning in which the goal is to predict
the categorical class labels of new instances based on past observations.
Regression
Regression is another subcategory of supervised learning used in the prediction of
continuous outcomes. In regression, we are given a number of predictor (explana‐
tory) variables and a continuous response variable (outcome or target), and we try to
find a relationship between those variables that allows us to predict an outcome.
An example of regression versus classification is shown in Figure 1-3. The chart on
the left shows an example of regression. The continuous response variable is return,
and the observed values are plotted against the predicted outcomes. On the right, the
outcome is a categorical class label, whether the market is bull or bear, and is an
example of classification.
Figure 1-3. Regression versus classification
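To make the distinction concrete, the following is a minimal, hypothetical sketch using scikit-learn (introduced in Chapter 2). The toy data, variable names, and model choices are illustrative assumptions only: the same predictor is used once to fit a regression model for a continuous return and once to fit a classification model for a bull/bear label.

from sklearn.linear_model import LinearRegression, LogisticRegression
import numpy as np

# Synthetic data: one predictor variable
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
returns = np.array([0.01, 0.03, 0.02, 0.05, 0.06])  # continuous outcome (regression)
labels = np.array([0, 0, 0, 1, 1])                  # categorical label: 0=bear, 1=bull

# Regression: predict a continuous return
reg = LinearRegression().fit(X, returns)
print(reg.predict([[6.0]]))

# Classification: predict a categorical class label
clf = LogisticRegression().fit(X, labels)
print(clf.predict([[6.0]]))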
Unsupervised
Unsupervised learning is a type of machine learning used to draw inferences from
datasets consisting of input data without labeledresponses. There are two types of
unsupervised learning: dimensionality reduction and clustering.
Dimensionality reduction
Dimensionality reduction is the process of reducing the number of features, or vari‐
ables, in a dataset while preserving information and overall model performance. It is
a common and powerful way to deal with datasets that have a large number of
dimensions.
Figure 1-4 illustrates this concept, where the dimension of data is converted from two
dimensions (X1 and X2) to one dimension (Z1). Z1 conveys similar information
embedded in X1 and X2 and reduces the dimension of the data.
Figure 1-4. Dimensionality reduction
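As a hedged illustration of this conversion (the data here is a synthetic assumption, not an example from the book), scikit-learn's PCA can reduce two correlated features to a single component:

from sklearn.decomposition import PCA
import numpy as np

# Two correlated features (X1, X2): most of the information lies along one direction
X = np.array([[1, 2.0], [2, 4.1], [3, 5.9], [4, 8.2], [5, 10.1]])

# Convert the data from two dimensions to a single component Z1
pca = PCA(n_components=1)
Z1 = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # fraction of the information retained by Z1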
Clustering
Clustering is a subcategory of unsupervised learning techniques that allows us to dis‐
cover hidden structures in data. The goal of clustering is to find a natural grouping in
data so that items in the same cluster are more similar to each other than to those
from different clusters.
An example of clustering is shown in Figure 1-5, where we can see the entire data
clustered into two distinct groups by the clustering algorithm.
Figure 1-5. Clustering
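A minimal sketch of this idea, assuming synthetic two-dimensional data and scikit-learn's k-means implementation (covered in Chapter 8), might look as follows:

from sklearn.cluster import KMeans
import numpy as np

# Two natural groupings of points
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

# Ask k-means to find two clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster assignment for each point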
Reinforcement Learning
Learning from experiences, and the associated rewards or punishments, is the core
concept behind reinforcement learning (RL). It is about taking suitable actions to
maximize reward in a particular situation. The learning system, called an agent, can
observe the environment, select and perform actions, and receive rewards (or penal‐
ties in the form of negative rewards) in return, as shown in Figure 1-6.
Reinforcement learning differs from supervised learning in this way: In supervised
learning, the training data has the answer key, so the model is trained with the correct
answers available. In reinforcement learning, there is no explicit answer. The learning
system (agent) decides what to do to perform the given task and learns whether that
was a correct action based on the reward. The algorithm determines the answer key
through its experience.
Figure 1-6. Reinforcement learning
The steps of reinforcement learning are as follows:
1. First, the agent interacts with the environment by performing an action.
2. Then the agent receives a reward based on the action it performed.
3. Based on the reward, the agent receives an observation and understands whether
the action was good or bad. If the action was good—that is, if the agent received a
positive reward—then the agent will prefer performing that action. If the reward
was less favorable, the agent will try performing another action to receive a posi‐
tive reward. It is basically a trial-and-error learning process, as the sketch below illustrates.
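The following is a minimal, hypothetical sketch of this trial-and-error loop. The two-action environment, its reward probabilities, and the exploration rate are illustrative assumptions, not a finance application (those appear in Chapter 9):

import random

# Hypothetical environment: action 1 pays a positive reward more often than action 0
def step(action):
    win_probability = 0.8 if action == 1 else 0.2
    return 1 if random.random() < win_probability else -1

value = {0: 0.0, 1: 0.0}   # the agent's running value estimate for each action
counts = {0: 0, 1: 0}

for episode in range(1000):
    # 1. The agent interacts with the environment by performing an action
    #    (explore randomly 10% of the time, otherwise pick the best-known action)
    if random.random() < 0.1:
        action = random.choice([0, 1])
    else:
        action = max(value, key=value.get)
    # 2. The agent receives a reward based on the action it performed
    reward = step(action)
    # 3. Based on the reward, the agent updates its estimate of the action,
    #    so actions with positive rewards become preferred over time
    counts[action] += 1
    value[action] += (reward - value[action]) / counts[action]

print(value)  # the estimate for action 1 should end up higher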
Natural Language Processing
Natural language processing (NLP) is a branch of AI that deals with the problems of
making a machine understand the structure and the meaning of natural language as
used by humans. Several techniques of machine learning and deep learning are used
within NLP.
NLP has many applications in the finance sectors in areas such as sentiment analysis,
chatbots, and document processing. A lot of information, such as sell side reports,
earnings calls, and newspaper headlines, is communicated via text message, making
NLP quite useful in the financial domain.
Given the extensive application of NLP algorithms based on machine learning in
finance, there is a separate chapter of this book (Chapter 10) dedicated to NLP and
related case studies.
Chapter Summary
Machine learning is making significant inroads across all the verticals of the financial
services industry. This chapter covered different applications of machine learning in
finance, from algorithmic trading to robo-advisors. These applications will be cov‐
ered in the case studies later in this book.
Next Steps
In terms of the platforms used for machine learning, the Python ecosystem is grow‐
ing and is one of the most dominant programming languages for machine learning.
In the next chapter, we will learn about the model development steps, from data
preparation to model deployment in a Python-based framework.
CHAPTER 2
Developing a Machine Learning
Model in Python
In terms of the platforms used for machine learning, there are many algorithms and
programming languages. However, the Python ecosystem is one of the most domi‐
nant and fastest-growing programming languages for machine learning.
Given the popularity and high adoption rate of Python, we will use it as the main
programming language throughout the book. This chapter provides an overview of a
Python-based machine learning framework. First, we will review the details of
Python-based packages used for machine learning, followed by the model develop‐
ment steps in the Python framework.
The steps of model development in Python presented in this chapter serve as the
foundation for the case studies presented in the rest of the book. The Python frame‐
work can also be leveraged while developing any machine learning–based model in
finance.
Why Python?
Some reasons for Python’s popularity are as follows:
• High-level syntax (compared to lower-level languages such as C, Java, and C++).
Applications can be developed by writing fewer lines of code, making Python
attractive to beginners and advanced programmers alike.
• Efficient development lifecycle.
• Large collection of community-managed, open-source libraries.
• Strong portability.
The simplicity of Python has attracted many developers to create new libraries for
machine learning, leading to strong adoption of Python.
Python Packages for Machine Learning
The main Python packages used for machine learning are highlighted in Figure 2-1.
Figure 2-1. Python packages
Here is a brief summary of each of these packages:
NumPy
Provides support for large, multidimensional arrays as well as an extensive col‐
lection of mathematical functions.
Pandas
A library for data manipulation and analysis. Among other features, it offers data
structures to handle tables and the tools to manipulate them.
Matplotlib
A plotting library that allows the creation of 2D charts and plots.
SciPy
A library of routines for mathematics, science, and engineering. Together with
NumPy, Pandas, and Matplotlib, it forms the core of what is generally referred to
as the SciPy ecosystem.
Scikit-learn (or sklearn)
A machine learning library offering a wide range of algorithms and utilities.
StatsModels
A Python module that provides classes and functions for the estimation of many
different statistical models, as well as for conducting statistical tests and statistical
data exploration.
TensorFlow and Theano
Dataflow programming libraries that facilitate working with neural networks.
Keras
An artificial neural network library that can act as a simplified interface to
TensorFlow/Theano packages.
Seaborn
A data visualization library based on Matplotlib. It provides a high-level interface
for drawing attractive and informative statistical graphics.
pip and Conda
These are Python package managers. pip is a package manager that facilitates
installation, upgrade, and uninstallation of Python packages. Conda is a package
manager that handles Python packages as well as library dependencies outside of
the Python packages.
Python and Package Installation
There are different ways of installing Python. However, it is strongly recommended
that you install Python through Anaconda. Anaconda contains Python, SciPy, and
Scikit-learn.
After installing Anaconda, a Jupyter server can be started locally by opening the
machine’s terminal and typing in the following code:
$ jupyter notebook
All code samples in this book use Python 3 and are presented in
Jupyter notebooks. Several Python packages, especially Scikit-learn
and Keras, are extensivelyused in the case studies.
Steps for Model Development in Python Ecosystem
Working through machine learning problems from end to end is critically important.
Applied machine learning will not come alive unless the steps from beginning to end
are well defined.
Figure 2-2 provides an outline of the simple seven-step machine learning project
template that can be used to jump-start any machine learning model in Python. The
first few steps include exploratory data analysis and data preparation, which are typi‐
cal data science–based steps aimed at extracting meaning and insights from data.
These steps are followed by model evaluation, fine-tuning, and finalizing the model.
Figure 2-2. Model development steps
All the case studies in this book follow the standard seven-step
model development process. However, there are a few case studies
in which some of the steps are skipped, renamed, or reordered
based on the appropriateness and intuitiveness of the steps.
Model Development Blueprint
The following section covers the details of each model development step with sup‐
porting Python code.
1. Problem definition
The first step in any project is defining the problem. Powerful algorithms can be used
for solving the problem, but the results will be meaningless if the wrong problem is
solved.
The following framework should be used for defining the problem:
1. Describe the problem informally and formally. List assumptions and similar
problems.
2. List the motivation for solving the problem, the benefits a solution provides, and
how the solution will be used.
3. Describe how the problem would be solved using the domain knowledge.
2. Loading the data and packages
The second step gives you everything needed to start working on the problem. This
includes loading libraries, packages, and individual functions needed for the model
development.
2.1. Load libraries. A sample code for loading libraries is as follows:
# Load libraries
import pandas as pd
from matplotlib import pyplot
The details of the libraries and modules for specific functionalities are defined further
in the individual case studies.
2.2. Load data. The following items should be checked and removed before loading
the data:
• Column headers
• Comments or special characters
• Delimiter
There are many ways of loading data. Some of the most common ways are as follows:
Load CSV files with Pandas
from pandas import read_csv
filename = 'xyz.csv'
names = ['col1', 'col2']  # illustrative column names for the file
data = read_csv(filename, names=names)
Load file from URL
from pandas import read_csv
url = 'https://goo.gl/vhm1eU'
names = ['age', 'class']
data = read_csv(url, names=names)
Load file using pandas_datareader
import pandas_datareader.data as web
stk_tickers = ['MSFT', 'IBM', 'GOOGL']  # illustrative stock tickers
ccy_tickers = ['DEXJPUS', 'DEXUSUK']
idx_tickers = ['SP500', 'DJIA', 'VIXCLS']
stk_data = web.DataReader(stk_tickers, 'yahoo')
ccy_data = web.DataReader(ccy_tickers, 'fred')
idx_data = web.DataReader(idx_tickers, 'fred')
3. Exploratory data analysis
In this step, we look at the dataset.
3.1. Descriptive statistics. Understanding the dataset is one of the most important
steps of model development. The steps to understanding data include:
1. Viewing the raw data.
2. Reviewing the dimensions of the dataset.
3. Reviewing the data types of attributes.
4. Summarizing the distribution, descriptive statistics, and relationship among the
variables in the dataset.
These steps are demonstrated below using sample Python code:
Viewing the data
from pandas import set_option
set_option('display.width', 100)
dataset.head(1)
Output
Age Sex Job Housing SavingAccounts CheckingAccount CreditAmount Duration Purpose Risk
0 67 male 2 own NaN little 1169 6 radio/TV good
Reviewing the dimensions of the dataset
dataset.shape
Output
(284807, 31)
The results show the dimensions of the dataset: 284,807 rows and 31 columns.
Reviewing the data types of the attributes in the data
# types
set_option('display.max_rows', 500)
dataset.dtypes
Summarizing the data using descriptive statistics
# describe data
set_option('precision', 3)
dataset.describe()
Output
Age Job CreditAmount Duration
count 1000.000 1000.000 1000.000 1000.000
mean 35.546 1.904 3271.258 20.903
std 11.375 0.654 2822.737 12.059
min 19.000 0.000 250.000 4.000
25% 27.000 2.000 1365.500 12.000
50% 33.000 2.000 2319.500 18.000
75% 42.000 2.000 3972.250 24.000
max 75.000 3.000 18424.000 72.000
3.2. Data visualization. The fastest way to learn more about the data is to visualize it.
Visualization involves independently understanding each attribute of the dataset.
Some of the plot types are as follows:
Univariate plots
Histograms and density plots
Multivariate plots
Correlation matrix plot and scatterplot
The Python code for univariate plot types is illustrated with examples below:
Univariate plot: histogram
from matplotlib import pyplot
dataset.hist(sharex=False, sharey=False, xlabelsize=1, ylabelsize=1,\
figsize=(10,4))
pyplot.show()
Univariate plot: density plot
from matplotlib import pyplot
dataset.plot(kind='density', subplots=True, layout=(3,3), sharex=False,\
legend=True, fontsize=1, figsize=(10,4))
pyplot.show()
Figure 2-3 illustrates the output.
Figure 2-3. Histogram (top) and density plot (bottom)
The Python code for multivariate plot types is illustrated with examples below:
Multivariate plot: correlation matrix plot
from matplotlib import pyplot
import seaborn as sns
correlation = dataset.corr()
pyplot.figure(figsize=(5,5))
pyplot.title('Correlation Matrix')
sns.heatmap(correlation, vmax=1, square=True, annot=True, cmap='cubehelix')
Multivariate plot: scatterplot matrix
from pandas.plotting import scatter_matrix
scatter_matrix(dataset)
Figure 2-4 illustrates the output.
Figure 2-4. Correlation (left) and scatterplot (right)
4. Data preparation
Data preparation is a preprocessing step in which data from one or more sources is
cleaned and transformed to improve its quality prior to its use.
4.1. Data cleaning. In machine learning modeling, incorrect data can be costly. Data
cleaning involves checking the following:
Validity
The data type, range, etc.
Accuracy
The degree to which the data is close to the true values.
Completeness
The degree to which all required data is known.
Uniformity
The degree to which the data is specified using the same unit of measure.
The different options for performing data cleaning include:
Dropping “NA” values within data
dataset.dropna(axis=0)
Filling “NA” with 0
dataset.fillna(0)
Filling NAs with the mean of the column
dataset['col'] = dataset['col'].fillna(dataset['col'].mean())
1 Feature selection is more relevant for supervised learning models and is described in detail in the individual
case studies in Chapters 5 and 6.
2 Overfitting is covered in detail in Chapter 4.
4.2. Feature selection. The data features used to train the machine learning models
have a huge influence on the performance. Irrelevant or partially relevant features
can negatively impact model performance. Feature selection¹ is a process in which
features in data that contribute most to the prediction variable or output are auto‐
matically selected.
The benefits of performing feature selection before modeling the data are:
Reduces overfitting²
Less redundant data means fewer opportunities for the model to make decisions
based on noise.
Improves performance
Less misleading data means improved modeling performance.Reduces training time and memory footprint
Less data means faster training and lower memory footprint.
The following sample code demonstrates selecting the two best features using the
SelectKBest function from sklearn. The SelectKBest function scores the features
using an underlying scoring function and then removes all but the k highest-scoring
features:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
bestfeatures = SelectKBest( k=5)
fit = bestfeatures.fit(X,Y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
print(featureScores.nlargest(2,'Score')) #print 2 best features
Output
 Specs Score
2 Variable1 58262.490
3 Variable2 321.031
When features are irrelevant, they should be dropped. Dropping the irrelevant fea‐
tures is illustrated in the following sample code:
#dropping the old features
dataset.drop(['Feature1','Feature2','Feature3'],axis=1,inplace=True)
4.3. Data transformation. Many machine learning algorithms make assumptions about
the data. It is good practice to prepare the data in a way that exposes it to the
machine learning algorithms in the best possible manner. This can be accomplished
through data transformation.
The different data transformation approaches are as follows:
Rescaling
When data comprises attributes with varying scales, many machine learning
algorithms can benefit from rescaling all the attributes to the same scale.
Attributes are often rescaled in the range between zero and one. This is useful for
optimization algorithms used in the core of machine learning algorithms, and it
also helps to speed up the calculations in an algorithm:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = pd.DataFrame(scaler.fit_transform(X))
Standardization
Standardization is a useful technique to transform attributes to a standard nor‐
mal distribution with a mean of zero and a standard deviation of one. It is most
suitable for techniques that assume the input variables represent a normal distri‐
bution:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X)
StandardisedX = pd.DataFrame(scaler.transform(X))
Normalization
Normalization refers to rescaling each observation (row) to have a length of one
(called a unit norm, or a vector of length one). This preprocessing method can be useful for
sparse datasets of attributes of varying scales when using algorithms that weight
input values:
from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X)
NormalizedX = pd.DataFrame(scaler.transform(X))
5. Evaluate models
To estimate how an algorithm will perform on unseen data, we evaluate it on a
dataset that was not used for training. Once we are satisfied with its estimated
performance, we can retrain the final algorithm on the entire training dataset and
get it ready for operational use. Different
machine learning techniques require different evaluation metrics. Other than model
performance, several other factors such as simplicity, interpretability, and training
time are considered when selecting a model. The details regarding these factors are
covered in Chapter 4.
5.1. Training and test split. The simplest method we can use to evaluate the perfor‐
mance of a machine learning algorithm is to use different training and testing data‐
sets. We can take our original dataset and split it into two parts: train the algorithm
on the first part, make predictions on the second part, and evaluate the predictions
against the expected results. The size of the split can depend on the size and specifics
of the dataset, although it is common to use 80% of the data for training and the
remaining 20% for testing. The differences in the training and test datasets can result
in meaningful differences in the estimate of accuracy. The data can easily be split into
the training and test sets using the train_test_split function available in sklearn:
# split out validation dataset for the end
from sklearn.model_selection import train_test_split
validation_size = 0.2
seed = 7
X_train, X_validation, Y_train, Y_validation =\
train_test_split(X, Y, test_size=validation_size, random_state=seed)
5.2. Identify evaluation metrics. Choosing which metric to use to evaluate machine
learning algorithms is very important. An important aspect of evaluation metrics is
the capability to discriminate among model results. Different types of evaluation
metrics used for different kinds of ML models are covered in detail across several
chapters of this book.
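As a brief illustration, the following sketch computes one commonly used metric for
each problem type with the sklearn.metrics module. The arrays here are small,
hypothetical values chosen only for demonstration:
from sklearn.metrics import accuracy_score, mean_squared_error
import numpy as np
# hypothetical true and predicted values, for illustration only
y_true_class = np.array([0, 1, 1, 0])
y_pred_class = np.array([0, 1, 0, 0])
y_true_reg = np.array([1.0, 2.0, 3.0])
y_pred_reg = np.array([1.1, 1.9, 3.3])
# classification: fraction of correctly predicted labels
print(accuracy_score(y_true_class, y_pred_class))
# regression: root mean squared error (zero is perfect)
print(np.sqrt(mean_squared_error(y_true_reg, y_pred_reg)))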
5.3. Compare models and algorithms. Selecting a machine learning model or algorithm
is both an art and a science. There is no one solution or approach that fits all. There
are several factors over and above the model performance that can impact the deci‐
sion to choose a machine learning algorithm.
Let’s understand the process of model comparison with a simple example. We define
two variables, X and Y, and try to build a model to predict Y using X. As a first step,
the data is divided into training and test split as mentioned in the preceding section:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
validation_size = 0.2
seed = 7
X = 2 - 3 * np.random.normal(0, 1, 20)
Y = X - 2 * (X ** 2) + 0.5 * (X ** 3) + np.exp(-X)+np.random.normal(-3, 3, 20)
# transforming the data to include another axis
X = X[:, np.newaxis]
Y = Y[:, np.newaxis]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,\
test_size=validation_size, random_state=seed)
We have no idea which algorithms will do well on this problem. Let’s design our test
now. We will use two models, a linear regression and a polynomial regression, to
fit Y against X. We will evaluate the algorithms using the Root Mean
3 It should be noted that the difference in RMSE is small in this case and may not replicate with a different split
of the train/test data.
4 Hyperparameters are the external characteristics of the model, can be considered the model’s settings, and are
not estimated based on data-like model parameters.
Squared Error (RMSE) metric, which is one of the measures of the model perfor‐
mance. RMSE will give a gross idea of how wrong all predictions are (zero is perfect):
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures
model = LinearRegression()
model.fit(X_train, Y_train)
Y_pred = model.predict(X_train)
rmse_lin = np.sqrt(mean_squared_error(Y_train,Y_pred))
r2_lin = r2_score(Y_train,Y_pred)
print("RMSE for Linear Regression:", rmse_lin)
polynomial_features= PolynomialFeatures(degree=2)
x_poly = polynomial_features.fit_transform(X_train)
model = LinearRegression()
model.fit(x_poly, Y_train)
Y_poly_pred = model.predict(x_poly)
rmse = np.sqrt(mean_squared_error(Y_train,Y_poly_pred))
r2 = r2_score(Y_train,Y_poly_pred)
print("RMSE for Polynomial Regression:", rmse)
Output
RMSE for Linear Regression: 6.772942423315028
RMSE for Polynomial Regression: 6.420495127266883
We can see that the RMSE of the polynomial regression is slightly better than that of
the linear regression.3 With the former having the better fit, it is the preferred model
in this step.
6. Model tuning
Finding the best combination of hyperparameters of a model can be treated as a
search problem.4 This searching exercise is often known as model tuning and is one of
the most important steps of model development. It is achieved by searching for the
best parameters of the model by using techniques such as a grid search. In a grid
search, you create a grid of all possible hyperparametercombinations and train
the model using each one of them. Besides a grid search, there are several other
techniques for model tuning, including randomized search, Bayesian optimization,
and Hyperband.
In the case studies presented in this book, we focus primarily on grid search for
model tuning.
Continuing from the preceding example, with the polynomial regression as the best
model, we next run a grid search by refitting the polynomial regression with
different degrees and comparing the RMSE results across the models:
Deg = [1, 2, 3, 6, 10]
results = []
names = []
for deg in Deg:
 polynomial_features= PolynomialFeatures(degree=deg)
 x_poly = polynomial_features.fit_transform(X_train)
 model = LinearRegression()
 model.fit(x_poly, Y_train)
 Y_poly_pred = model.predict(x_poly)
 rmse = np.sqrt(mean_squared_error(Y_train,Y_poly_pred))
 r2 = r2_score(Y_train,Y_poly_pred)
 results.append(rmse)
 names.append(deg)
plt.plot(names, results,'o')
plt.suptitle('Algorithm Comparison')
Output
The RMSE decreases when the degree increases, and the lowest RMSE is for the
model with degree 10. However, models with degrees lower than 10 performed very
well, and the test set will be used to finalize the best model.
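As an aside, the same degree search can also be expressed with the GridSearchCV
class by wrapping the polynomial expansion and the regression in a pipeline. The
following is a minimal sketch continuing the preceding example (X_train and Y_train
as defined above); the parameter name polynomialfeatures__degree follows sklearn's
pipeline naming convention:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
# pipeline: polynomial feature expansion followed by a linear fit
pipeline = make_pipeline(PolynomialFeatures(), LinearRegression())
param_grid = {'polynomialfeatures__degree': [1, 2, 3, 6, 10]}
grid = GridSearchCV(pipeline, param_grid=param_grid, \
 scoring='neg_mean_squared_error', cv=5)
grid_result = grid.fit(X_train, Y_train)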
While the generic set of input parameters for each algorithm provides a starting point
for analysis, it may not have the optimal configurations for the particular dataset and
business problem.
7. Finalize the model
Here, we perform the final steps for selecting the model. First, we run predictions on
the test dataset with the trained model. Then we try to understand the model intu‐
ition and save it for further usage.
7.1. Performance on the test set. The model selected during the training steps is further
evaluated on the test set. The test set allows us to compare different models in an
unbiased way, by basing the comparisons on data that was not used in any part of the
training. The test results for the model developed in the previous step are shown in
the following example:
Deg = [1, 2, 3, 6, 8, 10]
results_test = []
names_test = []
for deg in Deg:
 polynomial_features= PolynomialFeatures(degree=deg)
 x_poly = polynomial_features.fit_transform(X_train)
 model = LinearRegression()
 model.fit(x_poly, Y_train)
 x_poly_test = polynomial_features.transform(X_test)
 Y_poly_pred_test = model.predict(x_poly_test)
 rmse = np.sqrt(mean_squared_error(Y_test,Y_poly_pred_test))
 r2 = r2_score(Y_test,Y_poly_pred_test)
 results_test.append(rmse)
 names_test.append(deg)
plt.plot(names_test, results_test,'o')
plt.suptitle('Algorithm Comparison')
Output
In the training set we saw that the RMSE decreases with an increase in the degree of
polynomial model, and the polynomial of degree 10 had the lowest RMSE. However,
as shown in the preceding output for the polynomial of degree 10, although the train‐
ing set had the best results, the results in the test set are poor. For the polynomial of
degree 8, the RMSE in the test set is relatively higher. The polynomial of degree 6
shows the best result in the test set (although the difference is small compared to
other lower-degree polynomials in the test set) as well as good results in the training
set. For these reasons, this is the preferred model.
In addition to the model performance, there are several other factors to consider
when selecting a model, such as simplicity, interpretability, and training time. These
factors will be covered in the upcoming chapters.
7.2. Model/variable intuition. This step involves considering a holistic view of the
approach taken to solve the problem, including the model’s limitations as it relates to
the desired outcome, the variables used, and the selected model parameters. Details
on model and variable intuition regarding different types of machine learning models
are presented in the subsequent chapters and case studies.
7.3. Save/deploy. After finding an accurate machine learning model, it should be
saved so that it can be loaded and used later.
Pickle is one of the packages for saving and loading a trained model in Python. Using
pickle operations, trained machine learning models can be saved in the serialized for‐
mat to a file. Later, this serialized file can be loaded to de-serialize the model for its
usage. The following sample code demonstrates how to save the model to a file and
load it to make predictions on new data:
# Save Model Using Pickle
from pickle import dump
from pickle import load
# save the model to disk
filename = 'finalized_model.sav'
dump(model, open(filename, 'wb'))
# load the model from disk; the loaded model can then make predictions
# on new data, e.g., loaded_model.predict(X_new), where X_new holds new inputs
loaded_model = load(open(filename, 'rb'))
In recent years, frameworks such as AutoML have been built to
automate the maximum number of steps in a machine learning
model development process. Such frameworks allow the model
developers to build ML models with high scale, efficiency, and pro‐
ductivity. Readers are encouraged to explore such frameworks.
Chapter Summary
Given its popularity, rate of adoption, and flexibility, Python is often the preferred
language for machine learning development. There are many available Python pack‐
ages to perform numerous tasks, including data cleaning, visualization, and model
development. Some of these key packages are Scikit-learn and Keras.
The seven steps of model development mentioned in this chapter can be leveraged
while developing any machine learning–based model in finance.
Next Steps
In the next chapter, we will cover the key algorithm for machine learning—the artifi‐
cial neural network. The artificial neural network is another building block of
machine learning in finance and is used across all types of machine learning and deep
learning algorithms.
CHAPTER 3
Artificial Neural Networks
There are many different types of models used in machine learning. However, one
class of machine learning models that stands out is artificial neural networks (ANNs).
Given that artificial neural networks are used across all machine learning types, this
chapter will cover the basics of ANNs.
ANNs are computing systems based on a collection of connected units or nodes
called artificial neurons, which loosely model the neurons in a biological brain. Each
connection, like the synapses in a biological brain, can transmit a signal from one
artificial neuron to another. An artificial neuron that receives a signal can process it
and then signal additional artificial neurons connected to it.
Deep learning involves the study of complex ANN-related algorithms. The complex‐
ity is attributed to elaborate patterns of how information flows throughout the
model. Deep learning has the ability to represent the world as a nested hierarchy of
concepts, with each concept defined in relation to a simpler concept. Deep learning
techniques are extensively used in reinforcement learning and natural language pro‐
cessing applications that we will look at in Chapters 9 and 10.
1 Readers are encouraged to refer to the book Deep Learning by Aaron Courville, Ian Goodfellow, and Yoshua
Bengio (MIT Press) for more details on ANN and deep learning.
2 Activation functions are described in detail later in this chapter.
We will review detailed terminology and processes used in the field of ANNs1 and
cover the following topics:
• Architecture of ANNs: Neurons and layers
• Training an ANN: Forward propagation, backpropagation and gradient descent
• Hyperparameters of ANNs: Number of layers and nodes, activation function,
loss function, learning rate, etc.
• Defining and training a deep neural network–based model in Python
• Improving the training speed of ANNs and deep learning models
ANNs: Architecture, Training, and Hyperparameters
ANNs contain multiple neurons arranged inlayers. An ANN goes through a training
phase by comparing the modeled output to the desired output, where it learns to rec‐
ognize patterns in data. Let us go through the components of ANNs.
Architecture
An ANN architecture comprises neurons, layers, and weights.
Neurons
The building blocks for ANNs are neurons (also known as artificial neurons, nodes,
or perceptrons). Neurons have one or more inputs and one output. It is possible to
build a network of neurons to compute complex logical propositions. Activation
functions in these neurons create complicated, nonlinear functional mappings
between the inputs and the output.2
As shown in Figure 3-1, a neuron takes an input (x1, x2…xn), applies the learning
parameters to generate a weighted sum (z), and then passes that sum to an activation
function (f) that computes the output f(z).
Figure 3-1. An artificial neuron
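In code, the computation in Figure 3-1 reduces to a weighted sum followed by an
activation. The following is a minimal NumPy sketch with illustrative inputs,
weights, a bias term, and a sigmoid activation (all values here are hypothetical):
import numpy as np
x = np.array([0.5, -1.2, 3.0])  # inputs x1..xn
w = np.array([0.4, 0.1, -0.6])  # learning parameters (weights)
b = 0.2  # bias term
z = np.dot(w, x) + b  # weighted sum z
output = 1 / (1 + np.exp(-z))  # activation function f(z)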
Layers
The output f(z) from a single neuron (as shown in Figure 3-1) will not be able to
model complex tasks. So, in order to handle more complex structures, we have multi‐
ple layers of such neurons. As we keep stacking neurons horizontally and vertically,
the class of functions we can get becomes increasingly complex. Figure 3-2 shows an
architecture of an ANN with an input layer, an output layer, and a hidden layer.
Figure 3-2. Neural network architecture
Input layer. The input layer takes input from the dataset and is the exposed part of
the network. A neural network is often drawn with an input layer of one neuron per
input value (or column) in the dataset. The neurons in the input layer simply pass the
input value through to the next layer.
Hidden layers. Layers after the input layer are called hidden layers because they are
not directly exposed to the input. The simplest network structure is to have a single
neuron in the hidden layer that directly outputs the value.
A multilayer ANN is capable of solving more complex machine learning–related
tasks due to its hidden layer(s). Given increases in computing power and efficient
libraries, neural networks with many layers can be constructed. ANNs with many
hidden layers (more than three) are known as deep neural networks. Multiple hidden
layers allow deep neural networks to learn features of the data in a so-called feature
hierarchy, because simple features recombine from one layer to the next to form
more complex features. ANNs with many layers pass input data (features) through
more mathematical operations than do ANNs with few layers and are therefore more
computationally intensive to train.
Output layer. The final layer is called the output layer; it is responsible for outputting
a value or vector of values that correspond to the format required to solve the
problem.
Neuron weights
A neuron weight represents the strength of the connection between units and
measures the influence the input will have on the output. If the weight from neuron
one to neuron two has greater magnitude, it means that neuron one has a greater
influence over neuron two. Weights near zero mean changing this input will not
change the output. Negative weights mean increasing this input will decrease the
output.
Training
Training a neural network basically means calibrating all of the weights in the ANN.
This optimization is performed using an iterative approach involving forward propa‐
gation and backpropagation steps.
Forward propagation
Forward propagation is a process of feeding input values to the neural network and
getting an output, which we call predicted value. When we feed the input values to the
neural network’s first layer, the data passes through without any operations. The second layer takes
values from the first layer and applies multiplication, addition, and activation opera‐
tions before passing this value to the next layer. The same process repeats for any
subsequent layers until an output value from the last layer is received.
3 There are many available loss functions discussed in the next section. The nature of our problem dictates our
choice of loss function.
Backpropagation
After forward propagation, we get a predicted value from the ANN. Suppose the
desired output of a network is Y and the predicted value of the network from forward
propagation is Y′. The difference between the predicted output and the desired out‐
put (Y–Y′ ) is converted into the loss (or cost) function J(w), where w represents the
weights in ANN.3 The goal is to optimize the loss function (i.e., make the loss as small
as possible) over the training set.
The optimization method used is gradient descent. The goal of the gradient descent
method is to find the gradient of J(w) with respect to w at the current point and take a
small step in the direction of the negative gradient until the minimum value is
reached, as shown in Figure 3-3.
Figure 3-3. Gradient descent
In an ANN, the function J(w) is essentially a composition of multiple layers, as
explained in the preceding text. So, if layer one is represented as function p(), layer
two as q(), and layer three as r(), then the overall function is J(w)=r(q(p())). w consists
of all weights in all three layers. We want to find the gradient of J(w) with respect to
each component of w.
Skipping the mathematical details, the above essentially implies that the gradient of a
component w in the first layer would depend on the gradients in the second and third
layers. Similarly, the gradients in the second layer will depend on the gradients in the
third layer. Therefore, we start computing the derivatives in the reverse direction,
starting with the last layer, and use backpropagation to compute gradients of the pre‐
vious layer.
Overall, in the process of backpropagation, the model error (difference between pre‐
dicted and desired output) is propagated back through the network, one layer at a
time, and the weights are updated according to the amount they contributed to the
error.
Almost all ANNs use gradient descent and backpropagation. Backpropagation is one
of the cleanest and most efficient ways to find the gradient.
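As a highly simplified sketch of these two steps, the following code performs one
forward pass and one gradient descent update for a single sigmoid neuron with a
squared-error loss. Real networks repeat this layer by layer over many iterations,
and all values here are illustrative:
import numpy as np
x, y = np.array([0.5, -0.3]), 1.0  # one training example and its label
w, b, lr = np.zeros(2), 0.0, 0.1  # weights, bias, learning rate
# forward propagation: weighted sum, then sigmoid activation
z = np.dot(w, x) + b
y_hat = 1 / (1 + np.exp(-z))  # predicted value
# backpropagation for the loss J(w) = 0.5 * (y_hat - y) ** 2
dz = (y_hat - y) * y_hat * (1 - y_hat)  # chain rule through the sigmoid
w -= lr * dz * x  # gradient descent step on the weights
b -= lr * dz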
Hyperparameters
Hyperparameters are the variables that are set before the training process, and they
cannot be learned during training. ANNs have a large number of hyperparameters,
which makes them quite flexible. However, this flexibility makes the model tuning
process difficult. Understanding the hyperparameters and the intuition behind them
helps give an idea of what values are reasonable for each hyperparameter so we can
restrict the search space. Let’s start with the number of hidden layers and nodes.
Number of hidden layers and nodes
More hidden layers or nodes per layer means more parameters in the ANN, allowing
the model to fit more complex functions. To have a trained network that generalizes
well, we need to pick an optimal number of hidden layers, as well as an optimal number of nodes in
each hidden layer. Too few nodes and layers will lead to high errors for the system, as
the predictive factors might be too complex for a small number of nodes to capture.
Too many nodes and layers will overfit to the training data and not generalize well.
There is no hard-and-fast rule to decide the number of layers and nodes.
The number of hidden layers primarily depends on the complexity of the task. Very
complex tasks, such as large image classification or speech recognition, typically
require networks with dozens of layers and a huge amount of training data. For the
majority of the problems, we can start with just one or two hidden layers and then
gradually ramp up the number of hidden layers until we start overfitting the training
set.
The number of hidden nodes should have a relationship to the number of input and
output nodes, the amount of training data available, and the complexity of the func‐
tion being modeled. As a rule of thumb, the number of hidden nodesin each layer
should be somewhere between the size of the input layer and the size of the output
layer, ideally the mean. The number of hidden nodes shouldn’t exceed twice the
number of input nodes in order to avoid overfitting.
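For instance, applying these rules of thumb to a network with 10 input nodes and 1
output node suggests starting with roughly five or six hidden nodes per layer (close
to the mean of the two) and using no more than 20.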
Learning rate
When we train ANNs, we use many iterations of forward propagation and backpro‐
pagation to optimize the weights. At each iteration we calculate the derivative of the
loss function with respect to each weight and subtract it from that weight. The learn‐
ing rate determines how quickly or slowly we want to update our weight (parameter)
values. This learning rate should be high enough so that it converges in a reasonable
amount of time. Yet it should be low enough so that it finds the minimum value of
the loss function.
Activation functions
Activation functions (as shown in Figure 3-1) refer to the functions used over the
weighted sum of inputs in ANNs to get the desired output. Activation functions
allow the network to combine the inputs in more complex ways, and they provide a
richer capability in the relationship they can model and the output they can produce.
They decide which neurons will be activated—that is, what information is passed to
further layers.
Without activation functions, ANNs lose the bulk of their representation learning
power. There are several activation functions. The most widely used are as follows:
Linear (identity) function
Represented by the equation of a straight line (i.e., f (x) = mx + c), where activa‐
tion is proportional to the input. If we have many layers, and all the layers are
linear in nature, then the final activation function of the last layer is the same as
the linear function of the first layer. The range of a linear function is –inf to +inf.
Sigmoid function
Refers to a function that is projected as an S-shaped graph (as shown in
Figure 3-4). It is represented by the mathematical equation f (x) = 1 / (1 + e^(–x))
and ranges from 0 to 1. A large positive input results in an output close to one; a
large negative input results in an output close to zero. It is also referred to as
the logistic activation function.
Tanh function
Similar to sigmoid activation function with a mathematical equation
Tanh (x) = 2Sigmoid(2x) – 1, where Sigmoid represents the sigmoid function
discussed above. The output of this function ranges from –1 to 1, with an equal
mass on both sides of the zero-axis, as shown in Figure 3-4.
ReLU function
ReLU stands for the Rectified Linear Unit and is represented as
f (x) = max(x, 0). So, if the input is a positive number, the function returns the
number itself, and if the input is a negative number, then the function returns
zero. It is the most commonly used function because of its simplicity.
4 Deriving a regression or classification output by changing the activation function of the output layer is
described further in Chapter 4.
Figure 3-4 shows a summary of the activation functions discussed in this section.
Figure 3-4. Activation functions
There is no hard-and-fast rule for activation function selection. The decision com‐
pletely relies on the properties of the problem and the relationships being modeled.
We can try different activation functions and select the one that helps provide faster
convergence and a more efficient training process. The choice of activation function
in the output layer is strongly constrained by the type of problem that is modeled.4
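For reference, each of these activation functions can be expressed in a few lines;
the following is a minimal NumPy sketch:
import numpy as np
def linear(x): return x  # identity, range -inf to +inf
def sigmoid(x): return 1 / (1 + np.exp(-x))  # S-shaped, range 0 to 1
def tanh(x): return np.tanh(x)  # zero-centered, range -1 to 1
def relu(x): return np.maximum(x, 0)  # zero for negative inputs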
Cost functions
Cost functions (also known as loss functions) are a measure of the ANN perfor‐
mance, measuring how well the ANN fits empirical data. The two most common cost
functions are:
Mean squared error (MSE)
This is the cost function used primarily for regression problems, where output is
a continuous value. MSE is measured as the average of the squared difference
5 Refer to https://oreil.ly/FSt-8 for more details on optimization.
between predictions and actual observation. MSE is described further in
Chapter 4.
Cross-entropy (or log loss)
This cost function is used primarily for classification problems, where output is a
probability value between zero and one. Cross-entropy loss increases as the pre‐
dicted probability diverges from the actual label. A perfect model would have a
cross-entropy of zero.
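Both cost functions are straightforward to compute directly; the following is a
minimal NumPy sketch with illustrative values:
import numpy as np
y_true = np.array([1.0, 0.0, 1.0])  # actual observations/labels
y_pred = np.array([0.9, 0.2, 0.7])  # model predictions
mse = np.mean((y_true - y_pred) ** 2)  # mean squared error
log_loss = -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))  # cross-entropy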
Optimizers
Optimizers update the weight parameters to minimize the loss function.5 Cost func‐
tion acts as a guide to the terrain, telling the optimizer if it is moving in the right
direction to reach the global minimum. Some of the common optimizers are as
follows:
Momentum
The momentum optimizer looks at previous gradients in addition to the current
step. It will take larger steps if the previous updates and the current update move
the weights in the same direction (gaining momentum). It will take smaller steps
if the direction of the gradient is opposite. A clever way to visualize this is to
think of a ball rolling down a valley—it will gain momentum as it approaches the
valley bottom.
AdaGrad (Adaptive Gradient Algorithm)
AdaGrad adapts the learning rate to the parameters, performing smaller updates
for parameters associated with frequently occurring features, and larger updates
for parameters associated with infrequent features.
RMSProp
RMSProp stands for Root Mean Square Propagation. In RMSProp, the learning
rate gets adjusted automatically, and it chooses a different learning rate for each
parameter.
Adam (Adaptive Moment Estimation)
Adam combines the best properties of the AdaGrad and RMSProp algorithms to
provide an optimization and is one of the most popular gradient descent optimi‐
zation algorithms.
6 The steps and Python code related to implementing deep learning models using Keras, as demonstrated in
this section, are used in several case studies in the subsequent chapters.
Epoch
One round of updating the network for the entire training dataset is called an epoch.
A network may be trained for tens, hundreds, or many thousands of epochs depend‐
ing on the data size and computational constraints.
Batch size
The batch size is the number of training examples in one forward/backward pass. A
batch size of 32 means that 32 samples from the training dataset will be used to esti‐
mate the error gradient before the model weights are updated. The higher the batch
size, the more memory space is needed.
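For example, with 1,000 training samples and a batch size of 32, one epoch consists
of 32 batches (31 full batches of 32 samples plus a final batch of 8), so the
weights are updated 32 times per epoch.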
Creating an Artificial Neural Network Model in Python
In Chapter 2 we discussed the steps for end-to-end model development in Python. In
this section, we dig deeper into the steps involved in building an ANN-based model
in Python.
Our first step will be to look at Keras, the Python package specifically built for ANN
and deep learning.
Installing Keras and Machine Learning Packages
There are several Python libraries that allow building ANN and deep learning models
easily and quickly without getting into the details of underlying algorithms. Keras is
one of the most user-friendly packages that enables an efficient numerical computa‐
tion related to ANNs. Using Keras, complex deep learning models can be defined and
implemented in a few lines of code. We will primarily be using Keras packages for
implementing deep learning models in several of the book’s case studies.
Keras is simply a wrapper around more complex numerical computation engines
such as TensorFlow and Theano. In order to install Keras, TensorFlow or Theano
needs to be installed first.
This section describes the steps to define and compile an ANN-based model in Keras,
with a focus on the following steps.6
Importing the packages
Before you can start to build an ANN model, you need to import two modules from
the Keras package: Sequential and Dense:
from keras.models import Sequential
from keras.layers import Dense
import numpyas np
Loading data
This example makes use of the random module of NumPy to quickly generate some
data and labels to be used by the ANN that we build in the next step. Specifically,
a data array of size (1000, 10) is first constructed. Next, we create a labels
array of zeros and ones with size (1000, 1):
data = np.random.random((1000,10))
Y = np.random.randint(2,size= (1000,1))
Model construction—defining the neural network architecture
A quick way to get started is to use the Keras Sequential model, which is a linear stack
of layers. We create a Sequential model and add layers one at a time until the network
topology is finalized. The first thing to get right is to ensure the input layer has the
right number of inputs. We can specify this when creating the first layer: we use a
dense (fully connected) layer and indicate the number of inputs via the input_dim
argument.
We add a layer to the model with the add() function, and the number of nodes in
each layer is specified. Finally, another dense layer is added as an output layer.
The architecture for the model shown in Figure 3-5 is as follows:
• The model expects rows of data with 10 variables (the input_dim=10 argument).
• The first hidden layer has 32 nodes and uses the relu activation function.
• The second hidden layer has 32 nodes and uses the relu activation function.
• The output layer has one node and uses the sigmoid activation function.
7 A detailed discussion of the evaluation metrics for classification models is presented in Chapter 4.
Figure 3-5. An ANN architecture
The Python code for the network in Figure 3-5 is shown below:
model = Sequential()
model.add(Dense(32, input_dim=10, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
Compiling the model
With the model constructed, it can be compiled with the help of the compile() func‐
tion. Compiling the model leverages the efficient numerical libraries in the Theano or
TensorFlow packages. When compiling, it is important to specify the additional
properties required when training the network. Training a network means finding
the best set of weights to make predictions for the problem at hand. So we must spec‐
ify the loss function used to evaluate a set of weights, the optimizer used to search
through different weights for the network, and any optional metrics we would like to
collect and report during training.
In the following example, we use the cross-entropy loss function, which is defined in
Keras as binary_crossentropy. We will also use the adam optimizer, which is the
default option. Finally, because it is a classification problem, we will collect and
report the classification accuracy as the metric.7 The Python code follows:
model.compile(loss='binary_crossentropy', optimizer='adam', \
 metrics=['accuracy'])
Fitting the model
With our model defined and compiled, it is time to execute it on data. We can train
or fit our model on our loaded data by calling the fit() function on the model.
The training process will run for a fixed number of iterations (epochs) through the
dataset, specified using the epochs argument. We can also set the number of
instances that are evaluated before a weight update in the network is performed. This
is set using the batch_size argument. For this problem we will run a small number
of epochs (10) and use a relatively small batch size of 32. Again, these can be chosen
experimentally through trial and error. The Python code follows:
model.fit(data, Y, epochs=10, batch_size=32)
Evaluating the model
We have trained our neural network on the entire dataset and can evaluate the per‐
formance of the network on the same dataset. This will give us an idea of how well we
have modeled the dataset (e.g., training accuracy) but will not provide insight on how
well the algorithm will perform on new data. For this, we separate the data into
training and test datasets. The model is evaluated on the test dataset using the
evaluate() function. This will generate a prediction for each input and output pair and
collect scores, including the average loss and any metrics configured, such as accu‐
racy. The Python code follows:
scores = model.evaluate(X_test, Y_test)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
Running an ANN Model Faster: GPU and Cloud Services
For training ANNs (especially deep neural networks with many layers), a large
amount of computation power is required. Available CPUs, or Central Processing
Units, are responsible for processing and executing instructions on a local machine.
Since CPUs have a limited number of cores and process jobs sequentially, they
cannot do rapid matrix computations for the large number of matrices required for
training deep learning models. Hence, the training of deep learning models can be
extremely slow on the CPUs.
The following alternatives are useful for running ANNs that generally require a sig‐
nificant amount of time to run on a CPU:
• Running notebooks locally on a GPU.
• Running notebooks on Kaggle Kernels or Google Colaboratory.
• Using Amazon Web Services.
GPU
A GPU is composed of hundreds of cores that can handle thousands of threads
simultaneously. Running ANNs and deep learning models can be accelerated by the
use of GPUs.
GPUs are particularly adept at processing complex matrix operations. The GPU cores
are highly specialized, and they massively accelerate processes such as deep learning
training by offloading the processing from CPUs to the cores in the GPU subsystem.
All the Python packages related to machine learning, including Tensorflow, Theano,
and Keras, can be configured for the use of GPUs.
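As a quick sanity check (assuming a TensorFlow 2.x installation), the following
snippet lists the GPU devices visible to TensorFlow; an empty list means training
will fall back to the CPU:
import tensorflow as tf
# list the GPU devices TensorFlow can see; an empty list means CPU-only
print(tf.config.list_physical_devices('GPU'))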
Cloud services such as Kaggle and Google Colab
If you have a GPU-enabled computer, you can run ANNs locally. If you do not, we
recommend you use a service such as Kaggle Kernels, Google Colab, or AWS:
Kaggle
A popular data science website owned by Google that hosts a Jupyter notebook
service referred to as Kaggle Kernels. Kaggle Kernels are free to use and come with
the most frequently used packages preinstalled. You can connect a kernel to any
dataset hosted on Kaggle, or alternatively, you can just upload a new dataset on
the fly.
Google Colaboratory
A free Jupyter Notebook environment provided by Google where you can use
free GPUs. The features of Google Colaboratory are similar to those of Kaggle Kernels.
Amazon Web Services (AWS)
AWS Deep Learning provides an infrastructure to accelerate deep learning in the
cloud, at any scale. You can quickly launch AWS server instances preinstalled
with popular deep learning frameworks and interfaces to train sophisticated, cus‐
tom AI models, experiment with new algorithms, or learn new skills and techni‐
ques. These web servers can run longer than Kaggle Kernels. So for big projects,
it might be worth using AWS instead of a kernel.
Chapter Summary
ANNs comprise a family of algorithms used across all types of machine learning.
These models are inspired by the biological neural networks containing neurons and
layers of neurons that constitute animal brains. ANNs with many layers are referred
to as deep neural networks. Several steps, including forward propagation and back‐
propagation, are required for training these ANNs. Python packages such as Keras
make the training of these ANNs possible in a few lines of code. The training of these
deep neural networks requires more computational power, and CPUs alone might not
be enough. Alternatives include using a GPU or cloud service such as Kaggle Kernels,
Google Colaboratory, or Amazon Web Services for training deep neural networks.
Next Steps
As a next step, we will be going intothe details of the machine learning concepts for
supervised learning, followed by case studies using the concepts covered in this
chapter.
PART II
Supervised Learning
CHAPTER 4
Supervised Learning: Models and Concepts
Supervised learning is an area of machine learning where the chosen algorithm tries
to fit a target using the given input. A set of training data that contains labels is sup‐
plied to the algorithm. Based on a massive set of data, the algorithm will learn a rule
that it uses to predict the labels for new observations. In other words, supervised
learning algorithms are provided with historical data and asked to find the relation‐
ship that has the best predictive power.
There are two varieties of supervised learning algorithms: regression and classifica‐
tion algorithms. Regression-based supervised learning methods try to predict outputs
based on input variables. Classification-based supervised learning methods identify
which category a set of data items belongs to. Classification algorithms are
probability-based, meaning the predicted outcome is the category to which the
algorithm assigns the highest probability. Regression algorithms, in contrast,
estimate the outcome of problems that have an infinite number of solutions (a
continuous set of possible outcomes).
In the context of finance, supervised learning models represent one of the most-used
class of machine learning models. Many algorithms that are widely applied in algo‐
rithmic trading rely on supervised learning models because they can be efficiently
trained, they are relatively robust to noisy financial data, and they have strong links
to the theory of finance.
Regression-based algorithms have been leveraged by academic and industry research‐
ers to develop numerous asset pricing models. These models are used to predict
returns over various time periods and to identify significant factors that drive asset
returns. There are many other use cases of regression-based supervised learning in
portfolio management and derivatives pricing.
Classification-based algorithms, on the other hand, have been leveraged across many
areas within finance that require predicting a categorical response. These include
fraud detection, default prediction, credit scoring, directional forecast of asset price
movement, and Buy/Sell recommendations. There are many other use cases of
classification-based supervised learning in portfolio management and algorithmic
trading.
Many use cases of regression-based and classification-based supervised machine
learning are presented in Chapters 5 and 6.
Python and its libraries provide methods and ways to implement these supervised
learning models in a few lines of code. Some of these libraries were covered in Chap‐
ter 2. With easy-to-use machine learning libraries like Scikit-learn and Keras, it is
straightforward to fit different machine learning models on a given predictive model‐
ing dataset.
In this chapter, we present a high-level overview of supervised learning models. For a
thorough coverage of the topics, the reader is referred to Hands-On Machine Learn‐
ing with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, by Aurélien Géron
(O’Reilly).
The following topics are covered in this chapter:
• Basic concepts of supervised learning models (both regression and classification).
• How to implement different supervised learning models in Python.
• How to tune the models and identify the optimal parameters of the models using
grid search.
• Overfitting versus underfitting and bias versus variance.
• Strengths and weaknesses of several supervised learning models.
• How to use ensemble models, ANN, and deep learning models for both regres‐
sion and classification.
• How to select a model on the basis of several factors, including model
performance.
• Evaluation metrics for classification and regression models.
• How to perform cross validation.
Supervised Learning Models: An Overview
Classification predictive modeling problems are different from regression predictive
modeling problems, as classification is the task of predicting a discrete class label and
regression is the task of predicting a continuous quantity. However, both share the
same concept of utilizing known variables to make predictions, and there is a signifi‐
cant overlap between the two models. Hence, the models for classification and regres‐
sion are presented together in this chapter. Figure 4-1 summarizes the list of the
models commonly used for classification and regression.
Some models can be used for both classification and regression with small modifica‐
tions. These are K-nearest neighbors, decision trees, support vector, ensemble bag‐
ging/boosting methods, and ANNs (including deep neural networks), as shown in
Figure 4-1. However, some models, such as linear regression and logistic regression,
cannot (or cannot easily) be used for both problem types.
Figure 4-1. Models for regression and classification
This section contains the following details about the models:
• Theory of the models.
• Implementation in Scikit-learn or Keras.
• Grid search for different models.
• Pros and cons of the models.
In finance, a key focus is on models that extract signals from previ‐
ously observed data in order to predict future values for the same
time series. This family of time series models predicts continuous
output and is more aligned with the supervised regression models.
Time series models are covered separately in the supervised regres‐
sion chapter (Chapter 5).
Linear Regression (Ordinary Least Squares)
Linear regression (Ordinary Least Squares Regression or OLS Regression) is perhaps
one of the most well-known and best-understood algorithms in statistics and
machine learning. Linear regression is a linear model, e.g., a model that assumes a
linear relationship between the input variables (x) and the single output variable (y).
The goal of linear regression is to train a linear model to predict a new y given a pre‐
viously unseen x with as little error as possible.
Our model will be a function that predicts y given x1, x2...xi:
y = β0 + β1x1 + ... + βi xi
where β0 is called the intercept and β1...βi are the coefficients of the regression.
Implementation in Python
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, Y)
In the following section, we cover the training of a linear regression model and grid
search of the model. However, the overall concepts and related approaches are appli‐
cable to all other supervised learning models.
Training a model
As we mentioned in Chapter 3, training a model basically means retrieving the model
parameters by minimizing the cost (loss) function. The two steps for training a linear
regression model are:
Define a cost function (or loss function)
Measures how inaccurate the model’s predictions are. The sum of squared residu‐
als (RSS) as defined in Equation 4-1 measures the squared sum of the difference
between the actual and predicted value and is the cost function for linear
regression.
Equation 4-1. Sum of squared residuals
RSS = ∑_{i=1}^{n} ( yi – β0 – ∑_{j=1}^{p} βj xij )²
In this equation, β0 is the intercept; β1, ..., βp are the coefficients of the
regression; and xij represents the jth variable of the ith observation.
Find the parameters that minimize loss
That is, make our model as accurate as possible. Graphically, in two dimen‐
sions, this results in a line of best fit as shown in Figure 4-2. In higher dimen‐
sions, we would have higher-dimensional hyperplanes. Mathematically, we look
at the difference between each real data point (y) and our model’s prediction (ŷ).
Square these differences to avoid negative numbers and penalize larger differ‐
ences, and then add them up and take the average.This is a measure of how well
our data fits the line.
Figure 4-2. Linear regression
Grid search
The overall idea of the grid search is to create a grid of all possible hyperparameter
combinations and train the model using each one of them. Hyperparameters are the
external characteristics of the model, can be considered the model’s settings, and are
not estimated based on data-like model parameters. These hyperparameters are
tuned during grid search to achieve better model performance.
1 Cross validation will be covered in detail later in this chapter.
Due to its exhaustive search, a grid search is guaranteed to find the optimal parame‐
ter within the grid. The drawback is that the size of the grid grows exponentially with
the addition of more parameters or more considered values.
The GridSearchCV class in the model_selection module of the sklearn package facil‐
itates the systematic evaluation of all combinations of the hyperparameter values that
we would like to test.
The first step is to create a model object. We then define a dictionary where the key‐
words name the hyperparameters and the values list the parameter settings to be
tested. For linear regression, the hyperparameter is fit_intercept, which is a
boolean variable that determines whether or not to calculate the intercept for this
model. If set to False, no intercept will be used in calculations:
model = LinearRegression()
param_grid = {'fit_intercept': [True, False]}
The second step is to instantiate the GridSearchCV object and provide the estimator
object and parameter grid, as well as a scoring method and cross validation choice, to
the initialization method. Cross validation is a resampling procedure used to evaluate
machine learning models, and the scoring parameter specifies the evaluation metric of the
model:1
With all settings in place, we can fit GridSearchCV:
from sklearn.model_selection import GridSearchCV, KFold
kfold = KFold(n_splits=10)  # an illustrative cross validation choice
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='r2', \
 cv=kfold)
grid_result = grid.fit(X, Y)
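Once fitted, the best hyperparameter combination and its cross-validated score can
be read from the resulting object:
print(grid_result.best_params_)  # e.g., {'fit_intercept': True}
print(grid_result.best_score_)  # best cross-validated r2 score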
Advantages and disadvantages
In terms of advantages, linear regression is easy to understand and interpret. How‐
ever, it may not work well when there is a nonlinear relationship between predicted
and predictor variables. Linear regression is prone to overfitting (which we will dis‐
cuss in the next section) and when a large number of features are present, it may not
handle irrelevant features well. Linear regression also requires the data to follow cer‐
tain assumptions, such as the absence of multicollinearity. If the assumptions fail,
then we cannot trust the results obtained.
Regularized Regression
When a linear regression model contains many independent variables, their coeffi‐
cients will be poorly determined, and the model will have a tendency to fit extremely
well to the training data (data used to build the model) but fit poorly to testing data
(data used to test how good the model is). This is known as overfitting or high
variance.
One popular technique to control overfitting is regularization, which involves the
addition of a penalty term to the error or loss function to discourage the coefficients
from reaching large values. Regularization, in simple terms, is a penalty mechanism
that applies shrinkage to model parameters (driving them closer to zero) in order to
build a model with higher prediction accuracy and interpretation. Regularized regres‐
sion has two advantages over linear regression:
Prediction accuracy
The performance of the model working better on the testing data suggests that
the model is trying to generalize from training data. A model with too many
parameters might try to fit noise specific to the training data. By shrinking or set‐
ting some coefficients to zero, we trade off the ability to fit complex models
(higher bias) for a more generalizable model (lower variance).
Interpretation
A large number of predictors may complicate the interpretation or communica‐
tion of the big picture of the results. It may be preferable to sacrifice some detail
to limit the model to a smaller subset of parameters with the strongest effects.
The common ways to regularize a linear regression model are as follows:
L1 regularization or Lasso regression
Lasso regression performs L1 regularization by adding a factor of the sum of the
absolute value of coefficients in the cost function (RSS) for linear regression, as
mentioned in Equation 4-1. The equation for lasso regularization can be repre‐
sented as follows:
CostFunction = RSS + λ * ∑_{j=1}^{p} |βj|
L1 regularization can lead to zero coefficients (i.e., some of the features are com‐
pletely neglected for the evaluation of output). The larger the value of λ, the more
features are shrunk to zero. This can eliminate some features entirely and give us
a subset of predictors, reducing model complexity. So Lasso regression not only
helps in reducing overfitting, but also can help in feature selection. Predictors not
shrunk toward zero signify that they are important, and thus L1 regularization
allows for feature selection (sparse selection). The regularization parameter (λ)
can be controlled, and a lambda value of zero produces the basic linear regression
equation.
A lasso regression model can be constructed using the Lasso class of the sklearn
package of Python, as shown in the code snippet that follows:
from sklearn.linear_model import Lasso
model = Lasso()
model.fit(X, Y)
L2 regularization or Ridge regression
Ridge regression performs L2 regularization by adding a factor of the sum of the
square of coefficients in the cost function (RSS) for linear regression, as men‐
tioned in Equation 4-1. The equation for ridge regularization can be represented
as follows:
CostFunction = RSS + λ * ∑_{j=1}^{p} βj²
Ridge regression puts a constraint on the coefficients. The penalty term (λ) regu‐
larizes the coefficients such that if the coefficients take large values, the optimiza‐
tion function is penalized. So ridge regression shrinks the coefficients and helps
to reduce the model complexity. Shrinking the coefficients leads to a lower var‐
iance and a lower error value. Therefore, ridge regression decreases the complex‐
ity of a model but does not reduce the number of variables; it just shrinks their
effect. When λ is closer to zero, the cost function becomes similar to the linear
regression cost function. So the lower the constraint (low λ) on the features, the
more the model will resemble the linear regression model.
A ridge regression model can be constructed using the Ridge class of the sklearn
package of Python, as shown in the code snippet that follows:
from sklearn.linear_model import Ridge
model = Ridge()
model.fit(X, Y)
Elastic net
Elastic nets add regularization terms to the model, which are a combination of
both L1 and L2 regularization, as shown in the following equation:
CostFunction = RSS + λ * ((1 – α) / 2 * ∑_{j=1}^{p} βj² + α * ∑_{j=1}^{p} |βj|)
In addition to setting and choosing a λ value, an elastic net also allows us to tune
the alpha parameter, where α = 0 corresponds to ridge and α = 1 to lasso. There‐
fore, we can choose an α value between 0 and 1 to optimize the elastic net. Effec‐
tively, this will shrink some coefficients and set some to 0 for sparse selection.
An elastic net regression model can be constructed using the ElasticNet class of
the sklearn package of Python, as shown in the following code snippet:
from sklearn.linear_model import ElasticNet
model = ElasticNet()
model.fit(X, Y)
2 See the activation function section of Chapter 3 for details on the sigmoid function.
3 MLE is a method of estimating the parameters of a probability distribution so that under the assumed statisti‐
cal model the observed data is most probable.
For all regularized regressions, λ is the key parameter to tune during grid search in
Python. In anelastic net, α can be an additional parameter to tune.
Logistic Regression
Logistic regression is one of the most widely used algorithms for classification. The
logistic regression model arises from the desire to model the probabilities of the out‐
put classes given a function that is linear in x, at the same time ensuring that output
probabilities sum up to one and remain between zero and one as we would expect
from probabilities.
If we train a linear regression model on several examples where Y = 0 or 1, we might
end up predicting some probabilities that are less than zero or greater than one,
which doesn’t make sense. Instead, we use a logistic regression model (or logit
model), which is a modification of linear regression that makes sure to output a prob‐
ability between zero and one by applying the sigmoid function.2
Equation 4-2 shows the equation for a logistic regression model. Similar to linear
regression, input values (x) are combined linearly using weights or coefficient values
to predict an output value (y). The output coming from Equation 4-2 is a probability
that is transformed into a binary value (0 or 1) to get the model prediction.
Equation 4-2. Logistic regression equation
y = exp(β0 + β1x1 + ... + βixi) / (1 + exp(β0 + β1x1 + ... + βixi))
where y is the predicted output, β0 is the bias or intercept term, and β1, ..., βi
are the coefficients for the input values (x1, ..., xi). Each column in the input
data has an associated β coefficient (a constant real value) that must be learned
from the training data.
In logistic regression, the cost function is basically a measure of how often we
predicted one when the true answer was zero, or vice versa. Training the logistic
regression coefficients is done using techniques such as maximum likelihood estima‐
tion (MLE) to predict values close to 1 for the default class and close to 0 for the other
class.3
2 See the activation function section of Chapter 3 for details on the sigmoid function.
3 MLE is a method of estimating the parameters of a probability distribution so that under the assumed statistical model the observed data is most probable.
A logistic regression model can be constructed using the LogisticRegression class
of the sklearn package of Python, as shown in the following code snippet:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, Y)
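As a quick illustration of the probability output (a sketch assuming the model fitted above), predict_proba returns the sigmoid probabilities, which predict converts into binary labels using a 0.5 threshold by default:

# Probability of each class for the first five samples; the second
# column corresponds to the probability of class 1 from Equation 4-2.
print(model.predict_proba(X)[:5])

# Binary predictions obtained by thresholding the probabilities.
print(model.predict(X)[:5])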
Hyperparameters
Regularization (penalty in sklearn)
Similar to linear regression, logistic regression can have regularization, which
can be L1, L2, or elasticnet. The values in the sklearn library are [l1, l2, elasticnet].
Regularization strength (C in sklearn)
This parameter controls the regularization strength. Good values of the penalty
parameters can be [100, 10, 1.0, 0.1, 0.01].
Advantages and disadvantages
In terms of the advantages, the logistic regression model is easy to implement, has
good interpretability, and performs very well on linearly separable classes. The out‐
put of the model is a probability, which provides more insight and can be used for
ranking. The model has a small number of hyperparameters. Although there may be a risk of overfitting, this may be addressed using L1/L2 regularization, similar to the way we addressed overfitting for the linear regression models.
In terms of disadvantages, the model may overfit when provided with large numbers
of features. Logistic regression can only learn linear functions and is less suitable to
complex relationships between features and the target variable. Also, it may not han‐
dle irrelevant features well, especially if the features are strongly correlated.
Support Vector Machine
The objective of the support vector machine (SVM) algorithm is to maximize the mar‐
gin (shown as shaded area in Figure 4-3), which is defined as the distance between
the separating hyperplane (or decision boundary) and the training samples that are
closest to this hyperplane, the so-called support vectors. The margin is calculated as
the perpendicular distance from the line to only the closest points, as shown in
Figure 4-3. Hence, SVM calculates a maximum-margin boundary that leads to a
homogeneous partition of all data points.
Figure 4-3. Support vector machine
In practice, the data is messy and cannot be separated perfectly with a hyperplane.
The constraint of maximizing the margin of the line that separates the classes must be
relaxed. This change allows some points in the training data to violate the separating
line. An additional set of coefficients is introduced that give the margin wiggle room
in each dimension. A tuning parameter is introduced, simply called C, that defines
the magnitude of the wiggle allowed across all dimensions. The larger the value of C,
the more violations of the hyperplane are permitted.
In some cases, it is not possible to find a hyperplane or a linear decision boundary,
and kernels are used. A kernel is just a transformation of the input data that allows
the SVM algorithm to treat/process the data more easily. Using kernels, the original
data is projected into a higher dimension to classify the data better.
SVM is used for both classification and regression. We achieve this by converting the
original optimization problem into a dual problem. For regression, the trick is to
reverse the objective. Instead of trying to fit the largest possible street between two
classes while limiting margin violations, SVM regression tries to fit as many instances
as possible on the street (shaded area in Figure 4-3) while limiting margin violations.
The width of the street is controlled by a hyperparameter.
The SVM regression and classification models can be constructed using the sklearn
package of Python, as shown in the following code snippets:
Regression
from sklearn.svm import SVR
model = SVR()
model.fit(X, Y)
Classification
from sklearn.svm import SVC
model = SVC()
model.fit(X, Y)
Hyperparameters
The following key parameters are present in the sklearn implementation of SVM and
can be tweaked while performing the grid search:
Kernels (kernel in sklearn)
The choice of kernel controls the manner in which the input variables will be
projected. There are many kernels to choose from, but linear and RBF are the
most common.
Penalty (C in sklearn)
The penalty parameter tells the SVM optimization how much you want to avoid
misclassifying each training example. For large values of the penalty parameter,
the optimization will choose a smaller-margin hyperplane. Good values might be
a log scale from 10 to 1,000.
Advantages and disadvantages
In terms of advantages, SVM is fairly robust against overfitting, especially in higher
dimensional space. It handles the nonlinear relationships quite well, with many ker‐
nels to choose from. Also, there is no distributional requirement for the data.
In terms of disadvantages, SVM can be inefficient to train and memory-intensive to
run and tune. It doesn’t perform well with large datasets. It requires the feature scal‐
ing of the data. There are also many hyperparameters, and their meanings are often
not intuitive.
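Because SVM requires feature scaling, a common pattern is to chain the scaler and the model in a single pipeline so that the scaling learned on the training data is reapplied at prediction time. A minimal sketch, assuming X and Y as before:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Standardize the features, then fit an RBF-kernel classifier.
model = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
model.fit(X, Y)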
K-Nearest Neighbors
K-nearest neighbors (KNN) is considered a “lazy learner,” as there is no learning
required in the model. For a new data point, predictions are made by searching
through the entire training set for the K most similar instances (the neighbors) and
summarizing the output variable for those K instances.
To determine which of the K instances in the training dataset are most similar to a
new input, a distance measure is used. The most popular distance measure is Euclidean distance, which is calculated as the square root of the sum of the squared differences between a point a and a point b across all input attributes i, and which is represented as $d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$. Euclidean distance is a good distance measure to use if the input variables are similar in type.
Another distance metric is Manhattan distance, in which the distance between point a and point b is represented as $d(a, b) = \sum_{i=1}^{n} |a_i - b_i|$. Manhattan distance is a good measure to use if the input variables are not similar in type.
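As a quick check of the two formulas, here is a sketch with two illustrative points:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# Euclidean: square root of the sum of squared differences.
print(np.sqrt(np.sum((a - b) ** 2)))  # 3.7416...

# Manhattan: sum of absolute differences.
print(np.sum(np.abs(a - b)))  # 6.0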
The steps of KNN can be summarized as follows:
1. Choose the number of K and a distance metric.
2. Find the K-nearest neighbors of the sample that we want to classify.
3. Assign the class label by majority vote.
KNN regression and classification models can be constructed using the sklearn pack‐
age of Python, as shown in the following code:
Classification
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(X, Y)
Regression
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor()
model.fit(X, Y)
Hyperparameters
The following key parameters are present in the sklearn implementation of KNN and
can be tweaked while performing the grid search:
Number of neighbors (n_neighbors in sklearn)
The most important hyperparameter for KNN is the number of neighbors
(n_neighbors). Good values are between 1 and 20.
Distance metric (metric in sklearn)
It may also be interesting to test different distance metrics for choosing the com‐
position of the neighborhood. Good values are euclidean and manhattan.
Advantages and disadvantages
In terms of advantages, no training is involved and hence there is no learning phase.
Since the algorithm requires no training before making predictions, new data can be
added seamlessly without impacting the accuracy of the algorithm. It is intuitive and
easy to understand. The model naturally handles multiclass classification and can learn complex decision boundaries. KNN is effective if the training data is large.
In terms of the disadvantages, the distance metric to choose is not obvious and is difficult to justify in many cases. KNN performs poorly on high-dimensional datasets. It is expensive and slow to predict new instances because the distance to all neighbors must be recalculated. KNN is sensitive to noise in the dataset; we need to manually impute missing values and remove outliers. Also, feature scaling (standardization and normalization) is required before applying the KNN algorithm to any dataset; otherwise, KNN may generate wrong predictions.
Linear Discriminant Analysis
The objective of the linear discriminant analysis (LDA) algorithm is to project the
data onto a lower-dimensional space in a way that the class separability is maximized
and the variance within a class is minimized.4
4 The approach of projecting data is similar to the PCA algorithm discussed in Chapter 7.
During the training of the LDA model, the statistical properties (i.e., mean and cova‐
riance matrix) of each class are computed. The statistical properties are estimated on
the basis of the following assumptions about the data:
• Data is normally distributed, so that each variable is shaped like a bell curve
when plotted.
• Each attribute has the same variance, and the values of each variable vary around
the mean by the same amount on average.
To make a prediction, LDA estimates the probability that a new set of inputs belongs
to every class. The output class is the one that has the highest probability.
Implementation in Python and hyperparameters
The LDA classification model can be constructed using the sklearn package of
Python, as shown in the following code snippet:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
model = LinearDiscriminantAnalysis()
model.fit(X, Y)
The key hyperparameter for the LDA model is number of components for dimen‐
sionality reduction, which is represented by n_components in sklearn.
Advantages and disadvantages
In terms of advantages, LDA is a relatively simple model that is fast and easy to implement. In terms of disadvantages, it requires feature scaling and involves complex matrix operations.
Classification and Regression Trees
In the most general terms, the purpose of an analysis via tree-building algorithms is
to determine a set of if–then logical (split) conditions that permit accurate prediction
or classification of cases. Classification and regression trees (or CART or decision tree
classifiers) are attractive models if we care about interpretability. We can think of this
model as breaking down our data and making a decision based on asking a series of
questions. This algorithm is the foundation of ensemble methods such as random
forest and gradient boosting method.
Representation
The model can be represented by a binary tree (or decision tree), where each node is
an input variable x with a split point and each leaf contains an output variable y for
prediction.
Figure 4-4 shows an example of a simple classification tree to predict whether a per‐
son is a male or a female based on two inputs of height (in centimeters) and weight
(in kilograms).
Figure 4-4. Classification and regression tree example
Learning a CART model
Creating a binary tree is actually a process of dividing up the input space. A greedy
approach called recursive binary splitting is used to divide the space. This is a numeri‐
cal procedure in which all the values are lined up and different split points are tried
and tested using a cost (loss) function. The split with the best cost (lowest cost,
because we minimize cost) is selected. All input variables and all possible split points
are evaluated and chosen in a greedy manner (e.g., the very best split point is chosen
each time).
For regression predictive modeling problems, the cost function that is minimized to
choose split points is the sum of squared errors across all training samples that fall
within the rectangle:
$\sum_{i=1}^{n} (y_i - prediction_i)^2$
where yi is the output for the training sample and prediction is the predicted output
for the rectangle. For classification, the Gini cost function is used; it provides an indi‐
cation of how pure the leaf nodes are (i.e., how mixed the training data assigned to
each node is) and is defined as:
$G = \sum_{k=1}^{K} p_k (1 - p_k)$
where G is the Gini cost over all K classes and $p_k$ is the proportion of training instances of class k in the rectangle of interest. A node that has all classes of the same type
(perfect class purity) will have G = 0, while a node that has a 50–50 split of classes for
a binary classification problem (worst purity) will have G = 0.5.
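A small sketch of this calculation (gini here is our own illustrative helper, not a library function):

def gini(proportions):
    # Sum of p_k * (1 - p_k) over the class proportions in a node.
    return sum(p * (1 - p) for p in proportions)

print(gini([1.0, 0.0]))  # perfect purity: 0.0
print(gini([0.5, 0.5]))  # worst purity for binary classification: 0.5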
Stopping criterion
The recursive binary splitting procedure described in the preceding section needs to
know when to stop splitting as it works its way down the tree with the training data.
The most common stopping procedure is to use a minimum count on the number of
training instances assigned to each leaf node. If the count is less than some minimum,
then the split is not accepted and the node is taken as a final leaf node.
Pruning the tree
The stopping criterion is important as it strongly influences the performance of the
tree. Pruning can be used after learning the tree to further lift performance. The com‐
plexity of a decision tree is defined as the number of splits in the tree. Simpler trees
are preferred as they are faster to run and easy to understand, consume less memory
during processing and storage, and are less likely to overfit the data. The fastest and
simplest pruning method is to work through each leaf node in the tree and evaluate
the effect of removing it using a test set. A leaf node is removed only if doing so
results in a drop in the overall cost function on the entire test set. The removal of
nodes can be stopped when no further improvements can be made.
Implementation in Python
CART regression and classification models can be constructed using the sklearn
package of Python, as shown in the following code snippet:
Classification
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X, Y)
Regression
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
model.fit(X, Y)
Hyperparameters
CART has many hyperparameters. However, the key hyperparameter is the maximum depth of the tree, which is represented by max_depth in the sklearn package. Good values can range from 2 to 30 depending on the number of features in the data.
Advantages and disadvantages
In terms of advantages, CART is easy to interpret and can adapt to learn complex
relationships. It requires little data preparation, and data typically does not need to be
scaled. Feature importance is built in due to the way decision nodes are built. It per‐
forms well on large datasets. It works for both regression and classification problems.
In terms of disadvantages, CART is prone to overfitting unless pruning is used. It can
be very nonrobust, meaning that small changes in the training dataset can lead to
quite major differences in the hypothesis function that gets learned. CART generally
has worse performance than ensemble models, which are covered next.
Ensemble Models
The goal of ensemble models is to combine different classifiers into a meta-classifier
that has better generalization performance than each individual classifier alone. For
example, assuming that we collected predictions from 10 experts, ensemble methods
would allow us to strategically combine their predictions to come up with a predic‐
tion that is more accurate and robust than the experts’ individual predictions.
The two most popular ensemble methods are bagging and boosting. Bagging (or boot‐
strap aggregation) is an ensemble technique of training several individual models in a
parallel way. Each model is trained by a random subset of the data. Boosting, on the
other hand, is an ensemble technique of training several individual models in a
sequential way. This is done by building a model from the training data and then
creating a second model that attempts to correct the errors of the first model. Models
are added until the training set is predicted perfectly or a maximum number of mod‐
els is added. Each individual model learns from mistakes made by the previous
model. Just like the decision trees themselves, bagging and boosting can be used for
classification and regression problems.
By combining individual models, the ensemble model tends to be more flexible (less
bias) and less data-sensitive (less variance).5 Ensemble methods combine multiple,
simpler algorithms to obtain better performance.
5 Bias and variance are described in detail later in this chapter.
In this section we will cover random forest, AdaBoost, the gradient boosting method,
and extra trees, along with their implementation using the sklearn package.
Random forest. Random forest is a tweaked version of bagged decision trees. In order
to understand a random forest algorithm, let us first understand the bagging algo‐
rithm. Assuming we have a dataset of one thousand instances, the steps of bagging
are:
1. Create many (e.g., one hundred) random subsamples of our dataset.
2. Train a CART model on each sample.
3. Given a new dataset, make a prediction with each model and combine the predictions: average them for regression, or assign the final label by majority vote for classification (see the sketch following this list).
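A minimal sketch of these steps, using sklearn's BaggingClassifier, which performs the subsampling and vote aggregation internally (X and Y assumed as before):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# One hundred CART models, each trained on a bootstrap sample of the
# data; predict() aggregates the trees' outputs by majority vote.
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)
model.fit(X, Y)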
A problem with decision trees like CART is that they are greedy. They choose the
variable to split by using a greedy algorithm that minimizes error. Even after bagging,
the decision trees can have a lot of structural similarities and result in high correla‐
tion in their predictions. Combining predictions from multiple models in ensembles
works better if the predictions from the submodels are uncorrelated, or at best are
weakly correlated. Random forest changes the learning algorithm in such a way that
the resulting predictions from all of the subtrees have less correlation.
In CART, when selecting a split point, the learning algorithm is allowed to look
through all variables and all variable values in order to select the most optimal split
point. The random forest algorithm changes this procedure such that each subtree
can access only a random sample of features when selecting the split points. The
number of features that can be searched at each split point (m) must be specified as a
parameter to the algorithm.
As the bagged decision trees are constructed, we can calculate how much the error
function drops for a variable at each split point. In regression problems, this may be
the drop in sum squared error, and in classification, this might be the Gini cost. The
bagged method can provide feature importance by calculating and averaging the
error function drop for individual variables.
Implementation in Python. Random forest regression and classification models can be
constructed using the sklearn package of Python, as shown in the following code:
Classification
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X, Y)
Regression
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X, Y)
Hyperparameters. Some of the main hyperparameters that are present in the sklearn
implementation of random forest and that can be tweaked while performing the grid
search are:
Maximum number of features (max_features in sklearn)
This is the most important parameter. It is the number of random features to
sample at each split point. You could try a range of integer values, such as 1 to 20,
or 1 to half the number of input features.
Number of estimators (n_estimators in sklearn)
This parameter represents the number of trees. Ideally, this should be increased
until no further improvement is seen in the model. Good values might be a log
scale from 10 to 1,000.
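Once fitted, the model also exposes the averaged error-function drop per feature discussed above; a quick sketch:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_features='sqrt')
model.fit(X, Y)

# Impurity-based importance per feature, averaged over all trees
# and normalized to sum to one.
print(model.feature_importances_)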
Advantages and disadvantages. The random forest algorithm (or model) has gained
huge popularity in ML applications during the last decade due to its good perfor‐
mance, scalability, and ease of use. It is flexible and naturally assigns feature impor‐
tance scores, so it can handle redundant feature columns. It scales to large datasets
and is generally robust to overfitting. The algorithm doesn’t need the data to be
scaled and can model a nonlinear relationship.
In terms of disadvantages, random forest can feel like a black box approach, as we
have very little control over what the model does, and the results may be difficult to
interpret. Although random forest does a good job at classification, it may not be good for regression problems, as it does not give precise continuous predictions. In the case of regression, it doesn't predict beyond the range of the training data and may overfit datasets that are particularly noisy.
Extra trees
Extra trees, otherwise known as extremely randomized trees, is a variant of a random
forest; it builds multiple trees and splits nodes using random subsets of features simi‐
lar to random forest. However, unlike random forest, where observations are drawn
with replacement, the observations are drawn without replacement in extra trees. So
there is no repetition of observations.
Additionally, random forest selects the best split to convert the parent into the two
most homogeneous child nodes.6 However, extra trees selects a random split to divide
the parent node into two random child nodes. In extra trees, randomness doesn’t
come from bootstrapping the data; it comes from the random splits of all
observations.
6 Split is the process of converting a nonhomogeneous parent node into two homogeneous child nodes (best possible).
In real-world cases, performance is comparable to an ordinary random forest, some‐
times a bit better. The advantages and disadvantages of extra trees are similar to those
of random forest.
Implementation in Python. Extra trees regression and classificationmodels can be con‐
structed using the sklearn package of Python, as shown in the following code snippet.
The hyperparameters of extra trees are similar to random forest, as shown in the pre‐
vious section:
Classification
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X, Y)
Regression
from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor()
model.fit(X, Y)
Adaptive Boosting (AdaBoost)
Adaptive Boosting or AdaBoost is a boosting technique in which the basic idea is to
try predictors sequentially, and each subsequent model attempts to fix the errors of
its predecessor. At each iteration, the AdaBoost algorithm changes the sample distri‐
bution by modifying the weights attached to each of the instances. It increases the
weights of the wrongly predicted instances and decreases the ones of the correctly
predicted instances.
The steps of the AdaBoost algorithm are:
1. Initially, all observations are given equal weights.
2. A model is built on a subset of data, and using this model, predictions are made
on the whole dataset. Errors are calculated by comparing the predictions and
actual values.
3. While creating the next model, higher weights are given to the data points that
were predicted incorrectly. Weights can be determined using the error value. For
instance, the higher the error, the more weight is assigned to the observation.
4. This process is repeated until the error function does not change, or until the
maximum limit of the number of estimators is reached.
Implementation in Python. AdaBoost regression and classification models can be con‐
structed using the sklearn package of Python, as shown in the following code snippet:
Classification
from sklearn.ensemble import AdaBoostClassifier
model = AdaBoostClassifier()
model.fit(X, Y)
Regression
from sklearn.ensemble import AdaBoostRegressor
model = AdaBoostRegressor()
model.fit(X, Y)
Hyperparameters. Some of the main hyperparameters that are present in the sklearn
implementation of AdaBoost and that can be tweaked while performing the grid
search are as follows:
Learning rate (learning_rate in sklearn)
Learning rate shrinks the contribution of each classifier/regressor. It can be con‐
sidered on a log scale. The sample values for grid search can be 0.001, 0.01, and
0.1.
Number of estimators (n_estimators in sklearn)
This parameter represents the number of trees. Ideally, this should be increased
until no further improvement is seen in the model. Good values might be a log
scale from 10 to 1,000.
Advantages and disadvantages. In terms of advantages, AdaBoost has a high degree of
precision. AdaBoost can achieve similar results to other models with much less
tweaking of parameters or settings. The algorithm doesn’t need the data to be scaled
and can model a nonlinear relationship.
In terms of disadvantages, the training of AdaBoost is time consuming. AdaBoost can be sensitive to noisy data and outliers, and data imbalance leads to a decrease in classification accuracy.
Gradient boosting method
Gradient boosting method (GBM) is another boosting technique similar to AdaBoost,
where the general idea is to try predictors sequentially. Gradient boosting works by sequentially adding new predictors to the ensemble, each one fitted to correct the errors made by its predecessors.
The following are the steps of the gradient boosting algorithm:
1. A model (which can be referred to as the first weak learner) is built on a subset of
data. Using this model, predictions are made on the whole dataset.
2. Errors are calculated by comparing the predictions and actual values, and the loss
is calculated using the loss function.
3. A new model is created using the errors of the previous step as the target vari‐
able. The objective is to find the best split in the data to minimize the error. The
predictions made by this new model are combined with the predictions of the
previous ones. New errors are calculated using the predicted and actual values.
4. This process is repeated until the error function does not change or until the
maximum limit of the number of estimators is reached.
Contrary to AdaBoost, which tweaks the instance weights at every iteration, this
method tries to fit the new predictor to the residual errors made by the previous
predictor.
Implementation in Python and hyperparameters. Gradient boosting method regression
and classification models can be constructed using the sklearn package of Python, as
shown in the following code snippet. The hyperparameters of gradient boosting
method are similar to AdaBoost, as shown in the previous section:
Classification
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
model.fit(X, Y)
Regression
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor()
model.fit(X, Y)
Advantages and disadvantages. In terms of advantages, gradient boosting method is
robust to missing data, highly correlated features, and irrelevant features in the same
way as random forest. It naturally assigns feature importance scores, with slightly
better performance than random forest. The algorithm doesn’t need the data to be
scaled and can model a nonlinear relationship.
In terms of disadvantages, it may be more prone to overfitting than random forest, as
the main purpose of the boosting approach is to reduce bias and not variance. It has
many hyperparameters to tune, so model development may not be as fast. Also, fea‐
ture importance may not be robust to variation in the training dataset.
ANN-Based Models
In Chapter 3 we covered the basics of ANNs, along with the architecture of ANNs
and their training and implementation in Python. The details provided in that chap‐
ter are applicable across all areas of machine learning, including supervised learning.
However, there are a few additional details from the supervised learning perspective,
which we will cover in this section.
A neural network reduces to a classification or regression model depending on the activation function of the node in the output layer. In the case of a regression problem, the
output node has linear activation function (or no activation function). A linear func‐
tion produces a continuous output ranging from -inf to +inf. Hence, the output
layer will be the linear function of the nodes in the layer before the output layer, and
it will be a regression-based model.
In the case of a classification problem, the output node has a sigmoid or softmax acti‐
vation function. A sigmoid or softmax function produces an output ranging from
zero to one to represent the probability of the target value. The softmax function can also be used for multiclass classification.
ANN using sklearn
ANN regression and classification models can be constructed using the sklearn pack‐
age of Python, as shown in the following code snippet:
Classification
from sklearn.neural_network import MLPClassifier
model = MLPClassifier()
model.fit(X, Y)
Regression
from sklearn.neural_network import MLPRegressor
model = MLPRegressor()
model.fit(X, Y)
Hyperparameters
As we saw in Chapter 3, ANN has many hyperparameters. Some of the hyperparame‐
ters that are present in the sklearn implementation of ANN and can be tweaked while
performing the grid search are:
Hidden Layers (hidden_layer_sizes in sklearn)
It represents the number of layers and nodes in the ANN architecture. In the sklearn implementation of ANN, the ith element represents the number of neurons in
the ith hidden layer. A sample value for grid search in the sklearn implementa‐
tion can be [(20,), (50,), (20, 20), (20, 30, 20)].
Activation Function (activation in sklearn)
It represents the activation function of a hidden layer. Some of the activationfunctions defined in Chapter 3, such as sigmoid, relu, or tanh, can be used.
Deep neural network
ANNs with more than a single hidden layer are often called deep networks. We pre‐
fer using the library Keras to implement such networks, given the flexibility of the
library. The detailed implementation of a deep neural network in Keras was shown in
Chapter 3. Similar to MLPClassifier and MLPRegressor in sklearn for classification
and regression, Keras has modules called KerasClassifier and KerasRegressor that
can be used for creating classification and regression models with deep network.
A popular problem in finance is time series prediction, which is predicting the next
value of a time series based on a historical overview. Some of the deep neural net‐
works, such as recurrent neural network (RNN), can be directly used for time series
prediction. The details of this approach are provided in Chapter 5.
Advantages and disadvantages
The main advantage of an ANN is that it captures the nonlinear relationship between
the variables quite well. ANN can more easily learn rich representations and is good
with a large number of input features with a large dataset. ANN is flexible in how it
can be used. This is evident from its use across a wide variety of areas in machine
learning and AI, including reinforcement learning and NLP, as discussed in
Chapter 3.
The main disadvantage of ANN is the interpretability of the model, which is a draw‐
back that often cannot be ignored and is sometimes the determining factor when
choosing a model. ANN is not good with small datasets and requires a lot of tweaking
and guesswork. Choosing the right topology/algorithms to solve a problem is diffi‐
cult. Also, ANN is computationally expensive and can take a lot of time to train.
Using ANNs for supervised learning in finance
If a simple model such as linear or logistic regression perfectly fits your problem,
don’t bother with ANN. However, if you are modeling a complex dataset and feel a
need for better prediction power, give ANN a try. ANN is one of the most flexible
models in adapting itself to the shape of the data, and using it for supervised learning
problems can be an interesting and valuable exercise.
Model Performance
In the previous section, we discussed grid search as a way to find the right hyperparameters to achieve better performance. In this section, we will expand on that process
by discussing the key components of evaluating the model performance, which are
overfitting, cross validation, and evaluation metrics.
Overfitting and Underfitting
A common problem in machine learning is overfitting, which is defined by learning a
function that perfectly explains the training data that the model learned from but
doesn’t generalize well to unseen test data. Overfitting happens when a model over‐
learns from the training data to the point that it starts picking up idiosyncrasies that
aren’t representative of patterns in the real world. This becomes especially problem‐
atic as we make our models increasingly more complex. Underfitting is a related issue
in which the model is not complex enough to capture the underlying trend in the
data. Figure 4-5 illustrates overfitting and underfitting. The left-hand panel of
Figure 4-5 shows a linear regression model; a straight line clearly underfits the true
function. The middle panel shows that a high degree polynomial approximates the
true relationship reasonably well. On the other hand, a polynomial of a very high
degree fits the small sample almost perfectly, and performs best on the training data,
but this doesn’t generalize, and it would do a horrible job at explaining a new data
point.
The concepts of overfitting and underfitting are closely linked to bias-variance trade-
off. Bias refers to the error due to overly simplistic assumptions or faulty assumptions
in the learning algorithm. Bias results in underfitting of the data, as shown in the left-
hand panel of Figure 4-5. A high bias means our learning algorithm is missing
important trends among the features. Variance refers to the error due to an overly
complex model that tries to fit the training data as closely as possible. In high var‐
iance cases, the model’s predicted values are extremely close to the actual values from
the training set. High variance gives rise to overfitting, as shown in the right-hand
panel of Figure 4-5. Ultimately, in order to have a good model, we need low bias and
low variance.
Figure 4-5. Overfitting and underfitting
There can be two ways to combat overfitting:
Using more training data
The more training data we have, the harder it is to overfit the data by learning
too much from any single training example.
Using regularization
Adding a penalty in the loss function for building a model that assigns too much
explanatory power to any one feature, or allows too many features to be taken
into account.
The concept of overfitting and the ways to combat it are applicable across all the
supervised learning models. For example, regularized regressions address overfitting
in linear regression, as discussed earlier in this chapter.
Cross Validation
One of the challenges of machine learning is training models that are able to general‐
ize well to unseen data (overfitting versus underfitting or a bias-variance trade-off).
The main idea behind cross validation is to split the data one time or several times so
that each split is used once as a validation set and the remainder is used as a training
set: part of the data (the training sample) is used to train the algorithm, and the
remaining part (the validation sample) is used for estimating the risk of the algo‐
rithm. Cross validation allows us to obtain reliable estimates of the model’s generali‐
zation error. It is easiest to understand it with an example. When doing k-fold cross
validation, we randomly split the training data into k folds. Then we train the model
using k-1 folds and evaluate the performance on the kth fold. We repeat this process
k times and average the resulting scores.
Figure 4-6 shows an example of cross validation, where the data is split into five sets
and in each round one of the sets is used for validation.
Figure 4-6. Cross validation
A potential drawback of cross validation is the computational cost, especially when
paired with a grid search for hyperparameter tuning. Cross validation can be per‐
formed in a couple of lines using the sklearn package; we will perform cross valida‐
tion in the supervised learning case studies.
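For example, a minimal sketch of the five-fold case (with a model and data defined as in the earlier snippets):

from sklearn.model_selection import cross_val_score

# Train on four folds, evaluate on the fifth, rotate through all
# five splits, and average the resulting scores.
scores = cross_val_score(model, X, Y, cv=5)
print(scores.mean(), scores.std())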
In the next section, we cover the evaluation metrics for the supervised learning mod‐
els that are used to measure and compare the models’ performance.
Evaluation Metrics
The metrics used to evaluate the machine learning algorithms are very important.
The choice of metrics to use influences how the performance of machine learning
algorithms is measured and compared. The metrics influence both how you weight
the importance of different characteristics in the results and your ultimate choice of
algorithm.
The main evaluation metrics for regression and classification are illustrated in
Figure 4-7.
Figure 4-7. Evaluation metrics for regression and classification
Let us first look at the evaluation metrics for supervised regression.
Mean absolute error
The mean absolute error (MAE) is the sum of the absolute differences between pre‐
dictions and actual values. The MAE is a linear score, which means that all the indi‐
vidual differences are weighted equally in the average. It gives an idea of how wrong
the predictions were. The measure gives an idea of the magnitude of the error, but no
idea of the direction (e.g., over- or underpredicting).
Mean squared error
The mean squared error (MSE) is the average of the squared differences between predicted values and observed values (called residuals). This is
much like the mean absolute error in that it provides a gross idea of the magnitude of
the error. Taking the square root of the mean squared error converts the units back to
the original units of the output variable and can be meaningful for description and
presentation. This is called the root mean squared error (RMSE).
R² metric
The R² metric provides an indication of the “goodness of fit” of the predictions to
actual value. In statistical literature this measure is called the coefficient of determi‐
nation. This is a value between zero and one, for no-fit and perfect fit, respectively.
Adjusted R² metric
Just like R², adjusted R² also shows how well terms fit a curve or line but adjusts for
the number of terms in a model. It is given in the following formula:
$R_{adj}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$
where n is the total number of observations and k is the number of predictors.
Adjusted R² will always be less than or equal to R².
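These regression metrics can be computed as follows (a sketch: y_true and y_pred are assumed to hold actual and predicted values, and the adjusted R² helper is our own, since sklearn does not provide one):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

# Adjusted R² from the formula above: n observations, k predictors.
n, k = len(y_true), X.shape[1]
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)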
Selecting an evaluation metric for supervised regression
In terms of a preference among these evaluation metrics, if the main goal is predictive
accuracy, then RMSE is best. It is computationally simple and is easily differentiable.
The loss is symmetric, but larger errors weigh more in the calculation. MAE is also symmetric but does not weight larger errors more. R² and adjusted R² are often used for
explanatory purposes by indicating how well the selected independent variable(s)
explains the variability in the dependent variable(s).
Now let us look at the evaluation metrics for supervised classification.
Classification
For simplicity, we will mostly discuss things in terms of a binary classification prob‐
lem (i.e., only two outcomes, such as true or false); some common terms are:
True positives (TP)
Predicted positive and are actually positive.
False positives (FP)
Predicted positive and are actually negative.
True negatives (TN)
Predicted negative and are actually negative.
False negatives (FN)
Predicted negative and are actually positive.
The difference between three commonly used evaluation metrics for classification,
accuracy, precision, and recall, is illustrated in Figure 4-8.
Figure 4-8. Accuracy, precision, and recall
Accuracy
As shown in Figure 4-8, accuracy is the number of correct predictions made as a ratio
of all predictions made. This is the most common evaluation metric for classification
problems and is also the most misused. It is most suitable when there are an equal
number of observations in each class (which is rarely the case) and when all predic‐
tions and the related prediction errors are equally important, which is often not the
case.
Precision
Precision is the percentage of positive instances out of the total predicted positive
instances. Here, the denominator is the model prediction done as positive from the
whole given dataset. Precision is a good measure to determine when the cost of false
positives is high (e.g., email spam detection).
Recall
Recall (or sensitivity or true positive rate) is the percentage of positive instances out of
the total actual positive instances. Therefore, the denominator (true positive + false
negative) is the actual number of positive instances present in the dataset. Recall is a
good measure when there is a high cost associated with false negatives (e.g., fraud
detection).
In addition to accuracy, precision, and recall, some of the other commonly used eval‐
uation metrics for classification are discussed in the following sections.
Area under ROC curve
Area under ROC curve (AUC) is an evaluation metric for binary classification prob‐
lems. The ROC curve plots the true positive rate against the false positive rate across classification thresholds, and the AUC represents the degree of separability. It tells how much the model is capable of distinguishing between classes. The
higher the AUC, the better the model is at predicting zeros as zeros and ones as ones.
An AUC of 0.5 means that the model has no class separation capacity whatsoever.
The probabilistic interpretation of the AUC score is that if you randomly choose a
positive case and a negative case, the probability that the positive case outranks the
negative case according to the classifier is given by the AUC.
Confusion matrix
A confusion matrix lays out the performance of a learning algorithm. The confusion
matrix is simply a square matrix that reports the counts of the true positive (TP), true
negative (TN), false positive (FP), and false negative (FN) predictions of a classifier,
as shown in Figure 4-9.
Figure 4-9. Confusion matrix
The confusion matrix is a handy presentation of the accuracy of a model with two or
more classes. The table presents predictions on the x-axis and accuracy outcomes on
the y-axis. The cells of the table are the number of predictions made by the model.
For example, a model can predict zero or one, and each prediction may actually have
been a zero or a one. Predictions for zero that were actually zero appear in the cell for
prediction = 0 and actual = 0, whereas predictions for zero that were actually one
appear in the cell for prediction = 0 and actual = 1.
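These classification metrics can be computed with sklearn as sketched below (y_true, y_pred, and the positive-class probabilities y_prob are assumed):

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)

print(confusion_matrix(y_true, y_pred))  # counts of TN, FP, FN, TP
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(roc_auc_score(y_true, y_prob))     # AUC uses scores/probabilities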
Selecting an evaluation metric for supervised classification
The evaluation metric for classification depends heavily on the task at hand. For
example, recall is a good measure when there is a high cost associated with false nega‐
tives such as fraud detection. We will further examine these evaluation metrics in the
case studies.
Model Selection
Selecting the perfect machine learning model is both an art and a science. Looking at
machine learning models, there is no one solution or approach that fits all. There are
several factors that can affect your choice of a machine learning model. The main cri‐
teria in most of the cases is the model performance that we discussed in the previous
section. However, there are many other factors to consider while performing model
selection. In the following section, we will go over all such factors, followed by a dis‐
cussion of model trade-offs.
Factors for Model Selection
The factors considered for the model selection process are as follows:
Simplicity
The degree of simplicity of the model. Simplicity usually results in quicker, more
scalable, and easier to understand models and results.
Training time
Speed, performance, memory usage, and overall time taken for model training.
Handle nonlinearity in the data
The ability of the model to handle the nonlinear relationship between the
variables.
Robustness to overfitting
The ability of the model to handle overfitting.
Size of the dataset
The ability of the model to handle a large number of training examples in the dataset.
Number of features
The ability of the model to handle high dimensionality of the feature space.
Model interpretation
How explainable is the model? Model interpretability is important because it
allows us to take concrete actions to solve the underlying problem.
Feature scaling
Does the model require variables to be scaled or normally distributed?
Figure 4-10 compares the supervised learning models on the factors mentioned pre‐
viously and outlines a general rule-of-thumb to narrow down the search for the best
machine learning algorithm7 for a given problem. The table is based on the advan‐
tages and disadvantages of different models discussed in the individual model section
in this chapter.
7 In this table we do not include AdaBoost and extra trees, as their overall behavior across all the parameters is similar to gradient boosting and random forest, respectively.
Figure 4-10. Model selection
We can see from the table that relatively simple models include linear and logistic
regression, and as we move toward the ensembles and ANN, the complexity increases.
In terms of the training time, the linear models and CART are relatively faster to
train as compared to ensemble methods and ANN.
Linear and logistic regression can’t handle nonlinear relationships,while all other
models can. SVM can handle the nonlinear relationship between dependent and
independent variables with nonlinear kernels.
SVM and random forest tend to overfit less as compared to the linear regression,
logistic regression, gradient boosting, and ANN. The degree of overfitting also
depends on other parameters, such as size of the data and model tuning, and can be
checked by looking at the results of the test set for each model. Also, the boosting
methods such as gradient boosting have higher overfitting risk compared to the bag‐
ging methods, such as random forest. Recall the focus of gradient boosting is to mini‐
mize the bias and not variance.
Linear and logistic regressions are not able to handle large datasets and a large number of features well. However, CART, ensemble methods, and ANN are capable of handling large datasets and many features quite well. The linear and logistic regression models generally perform better than other models when the size of the dataset is small.
Application of variable reduction techniques (shown in Chapter 7) enables the linear
models to handle large datasets. The performance of ANN increases with an increase
in the size of the dataset.
Given linear regression, logistic regression, and CART are relatively simpler models,
they have better model interpretation as compared to the ensemble models and
ANN.
Model Trade-off
Often, it’s a trade-off between different factors when selecting a model. ANN, SVM,
and some ensemble methods can be used to create very accurate predictive models,
but they may lack simplicity and interpretability and may take a significant amount
of resources to train.
In terms of selecting the final model, models with lower interpretability may be pre‐
ferred when predictive performance is the most important goal, and it’s not necessary
to explain how the model works and makes predictions. In some cases, however,
model interpretability is mandatory.
Interpretability-driven examples are often seen in the financial industry. In many
cases, choosing a machine learning algorithm has less to do with the optimization or
the technical aspects of the algorithm and more to do with business decisions. Sup‐
pose a machine learning algorithm is used to accept or reject an individual’s credit
card application. If the applicant is rejected and decides to file a complaint or take
legal action, the financial institution will need to explain how that decision was made.
While that can be nearly impossible for ANN, it’s relatively straightforward for deci‐
sion tree–based models.
Different classes of models are good at modeling different types of underlying pat‐
terns in data. So a good first step is to quickly test out a few different classes of mod‐
els to know which ones capture the underlying structure of the dataset most
efficiently. We will follow this approach while performing model selection in all our
supervised learning–based case studies.
Chapter Summary
In this chapter, we discussed the importance of supervised learning models in
finance, followed by a brief introduction to several supervised learning models,
including linear and logistic regression, SVM, decision trees, ensemble, KNN, LDA,
and ANN. We demonstrated training and tuning of these models in a few lines of
code using sklearn and Keras libraries.
We discussed the most common error metrics for regression and classification mod‐
els, explained the bias-variance trade-off, and illustrated the various tools for manag‐
ing the model selection process using cross validation.
We introduced the strengths and weaknesses of each model and discussed the factors
to consider when selecting the best model. We also discussed the trade-off between
model performance and interpretability.
In the following chapter, we will dive into the case studies for regression and classifi‐
cation. All case studies in the next two chapters leverage the concepts presented in
this chapter and in the previous two chapters.
CHAPTER 5
Supervised Learning: Regression
(Including Time Series Models)
Supervised regression–based machine learning is a predictive form of modeling in
which the goal is to model the relationship between a target and the predictor vari‐
able(s) in order to estimate a continuous set of possible outcomes. These are the most
used machine learning models in finance.
One of the focus areas of analysts in financial institutions (and finance in general) is
to predict investment opportunities, typically predictions of asset prices and asset
returns. Supervised regression–based machine learning models are inherently suit‐
able in this context. These models help investment and financial managers under‐
stand the properties of the predicted variable and its relationship with other variables,
and help them identify significant factors that drive asset returns. This helps investors
estimate return profiles, trading costs, technical and financial investment required in
infrastructure, and thus ultimately the risk profile and profitability of a strategy or
portfolio.
With the availability of large volumes of data and processing techniques, supervised
regression–based machine learning isn’t just limited to asset price prediction. These
models are applied to a wide range of areas within finance, including portfolio man‐
agement, insurance pricing, instrument pricing, hedging, and risk management.
In this chapter we cover three supervised regression–based case studies that span
diverse areas, including asset price prediction, instrument pricing, and portfolio
management. All of the case studies follow the standardized seven-step model devel‐
opment process presented in Chapter 2; those steps include defining the problem,
loading the data, exploratory data analysis, data preparation, model evaluation, and
model tuning.1 The case studies are designed not only to cover a diverse set of topics
from the finance standpoint but also to cover multiple machine learning and model‐
ing concepts, including models from basic linear regression to advanced deep learn‐
ing that were presented in Chapter 4.
1 There may be reordering or renaming of the steps or substeps based on the appropriateness and intuitiveness of the steps/substeps.
A substantial amount of asset modeling and prediction problems in the financial
industry involve a time component and estimation of a continuous output. As such,
it is also important to address time series models. In its broadest form, time series
analysis is about inferring what has happened to a series of data points in the past and
attempting to predict what will happen to it in the future. There have been a lot of
comparisons and debates in academia and the industry regarding the differences
between supervised regression and time series models. Most time series models are
parametric (i.e., a known function is assumed to represent the data), while the major‐
ity of supervised regression models are nonparametric. Time series models primarily
use historical data of the predicted variables for prediction, and supervised learning
algorithms use exogenous variables as predictor variables.2 However, supervised
regression can embed the historical data of the predicted variable through a time-
delay approach (covered later in this chapter), and a time series model (such as ARI‐
MAX, also covered later in this chapter) can use exogenous variables for prediction.
Hence, time series and supervised regression models are similar in the sense that both
can use exogenous variables as well as historical data of the predicted variable for
forecasting. In terms of the final output, both supervised regression and time series
models estimate a continuous set of possible outcomes of a variable.
2 An exogenous variable is one whose value is determined outside the model and imposed on the model.
In Chapter 4, we covered the concepts of modelsthat are common between super‐
vised regression and supervised classification. Given that time series models are more
closely aligned with supervised regression than supervised classification, we will go
through the concepts of time series models separately in this chapter. We will also
demonstrate how we can use time series models on financial data to predict future
values. Comparison of time series models against the supervised regression models
will be presented in the case studies. Additionally, some machine learning and deep
learning models (such as LSTM) can be directly used for time series forecasting, and
those will be discussed as well.
In “Case Study 1: Stock Price Prediction” on page 95, we demonstrate one of the most
popular prediction problems in finance, that of predicting stock returns. In addition
to predicting future stock prices accurately, the purpose of this case study is to dis‐
cuss the machine learning–based framework for general asset class price prediction in
finance. In this case study we will discuss several machine learning and time series
concepts, along with focusing on visualization and model tuning.
In “Case Study 2: Derivative Pricing” on page 114, we will delve into derivative pricing
using supervised regression and show how to deploy machine learning techniques in
the context of traditional quant problems. As compared to traditional derivative pric‐
ing models, machine learning techniques can lead to faster derivative pricing without
relying on the several impractical assumptions. Efficient numerical computation
using machine learning can be increasingly beneficial in areas such as financial risk
management, where a trade-off between efficiency and accuracy is often inevitable.
In “Case Study 3: Investor Risk Tolerance and Robo-Advisors” on page 125, we illus‐
trate supervised regression–based framework to estimate the risk tolerance of invest‐
ors. In this case study, we build a robo-advisor dashboard in Python and implement
this risk tolerance prediction model in the dashboard. We demonstrate how such
models can lead to the automation of portfolio management processes, including the
use of robo-advisors for investment management. The purpose is also to illustrate
how machine learning can efficiently be used to overcome the problem of traditional
risk tolerance profiling or risk tolerance questionnaires that suffer from several
behavioral biases.
In “Case Study 4: Yield Curve Prediction” on page 141, we use a supervised regression–
based framework to forecast different yield curve tenors simultaneously. We demon‐
strate how we can predict multiple tenors at the same time to model the yield curve
using machine learning models.
In this chapter, we will learn about the following concepts related to supervised
regression and time series techniques:
• Application and comparison of different time series and machine learning
models.
• Interpretation of the models and results. Understanding the potential overfitting
and underfitting and intuition behind linear versus nonlinear models.
• Performing data preparation and transformations to be used in machine learning
models.
• Performing feature selection and engineering to improve model performance.
• Using data visualization and data exploration to understand outcomes.
3 These models are discussed later in this chapter.
4 There may be reordering or renaming of the steps or substeps based on the appropriateness and intuitiveness
of the steps/substeps.
• Algorithm tuning to improve model performance. Understanding, implement‐
ing, and tuning time series models such as ARIMA for prediction.
• Framing a problem statement related to portfolio management and behavioral
finance in a regression-based machine learning framework.
• Understanding how deep learning–based models such as LSTM can be used for
time series prediction.
The models used for supervised regression were presented in Chapters 3 and 4. Prior
to the case studies, we will discuss time series models. We highly recommend readers
turn to Time Series Analysis and Its Applications, 4th Edition, by Robert H. Shumway
and David S. Stoffer (Springer) for a more in-depth understanding of time series con‐
cepts, and to Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow,
2nd Edition, by Aurélien Géron (O’Reilly) for more on concepts in supervised regres‐
sion models.
This Chapter’s Code Repository
A Python-based master template for supervised regression, a time
series model template, and the Jupyter notebook for all case studies
presented in this chapter are included in the folder Chapter 5 - Sup.
Learning - Regression and Time Series models of the code repository
for this book.
For any new supervised regression–based case study, use the com‐
mon template from the code repository, modify the elements spe‐
cific to the case study, and borrow the concepts and insights from
the case studies presented in this chapter. The template also
includes the implementation and tuning of the ARIMA and LSTM
models.3 The templates are designed to run on the cloud (i.e., Kag‐
gle, Google Colab, and AWS). All the case studies have been
designed on a uniform regression template.4
Time Series Models
A time series is a sequence of numbers that are ordered by a time index.
In this section we will cover the following aspects of time series models, which we
further leverage in the case studies:
• The components of a time series
• Autocorrelation and stationarity of time series
• Traditional time series models (e.g., ARIMA)
• Use of deep learning models for time series modeling
• Conversion of time series data for use in a supervised learning framework
Time Series Breakdown
A time series can be broken down into the following components:
Trend Component
A trend is a consistent directional movement in a time series. These trends can
be either deterministic or stochastic. The former allows us to provide an underly‐
ing rationale for the trend, while the latter is a random feature of a series that we
will be unlikely to explain. Trends often appear in financial series, and many
trading models use sophisticated trend identification algorithms.
Seasonal Component
Many time series contain seasonal variation. This is particularly true in series
representing business sales or climate levels. In quantitative finance we often see
seasonal variation, particularly in series related to holiday seasons or annual tem‐
perature variation (such as natural gas).
We can write the components of a time series $y_t$ as

$$y_t = S_t + T_t + R_t$$

where $S_t$ is the seasonal component, $T_t$ is the trend component, and $R_t$ represents the remainder component of the time series not captured by the seasonal or trend components.
The Python code for breaking down a time series (Y) into its components is as follows:

import statsmodels.api as sm
sm.tsa.seasonal_decompose(Y, freq=52).plot()
Figure 5-1 shows the time series broken down into trend, seasonality, and remainder
components. Breaking down a time series into these components may help us better
understand the time series and identify its behavior for better prediction.
The three time series components are shown separately in the bottom three panels.
These components can be added together to reconstruct the actual time series shown
in the top panel (shown as “observed”). Notice that the time series shows a trending
component after 2017. Hence, the prediction model for this time series should incor‐
porate the information regarding the trending behavior after 2017. In terms of sea‐
sonality there is some increase in the magnitude in the beginning of the calendar
year. The residual component shown in the bottom panel is what is left over when
theseasonal and trend components have been subtracted from the data. The residual
component is mostly flat with some spikes and noise around 2018 and 2019. Also,
each of the plots is on a different scale, and the trend component has the maximum
range, as shown by the scale on its plot.
Figure 5-1. Time series components
Autocorrelation and Stationarity
When we are given one or more time series, it is relatively straightforward to decom‐
pose the time series into trend, seasonality, and residual components. However, there
are other aspects that come into play when dealing with time series data, particularly
in finance.
Autocorrelation
There are many situations in which consecutive elements of a time series exhibit cor‐
relation. That is, the behavior of sequential points in the series affects the others in a
dependent manner. Autocorrelation is the similarity between observations as a func‐
tion of the time lag between them. Such relationships can be modeled using an
autoregression model. The term autoregression indicates that it is a regression of the
variable against itself.
In an autoregression model, we forecast the variable of interest using a linear combi‐
nation of past values of the variable.
5 A white noise process is a random process of random variables that are uncorrelated and have a mean of zero
and a finite variance.
Thus, an autoregressive model of order p can be written as

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \epsilon_t$$

where $\epsilon_t$ is white noise.5 An autoregressive model is like a multiple regression but with lagged values of $y_t$ as predictors. We refer to this as an AR(p) model, an autoregressive model of order p. Autoregressive models are remarkably flexible at handling a wide range of different time series patterns.
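As a minimal sketch (assuming Y is a pandas Series of observations and a recent version of statsmodels), an AR(p) model can be fit with the AutoReg class:

from statsmodels.tsa.ar_model import AutoReg

# Fit an AR(2) model: regress y_t on its two most recent lagged values
ar_fit = AutoReg(Y, lags=2).fit()
print(ar_fit.params)  # the intercept c and the coefficients phi_1 and phi_2

# One-step-ahead predictions over the sample
ar_pred = ar_fit.predict(start=2, end=len(Y) - 1)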
Stationarity
A time series is said to be stationary if its statistical properties do not change over
time. Thus a time series with trend or with seasonality is not stationary, as the trend
and seasonality will affect the value of the time series at different times. On the other
hand, a white noise series is stationary, as it does not matter when you observe it; it
should look similar at any point in time.
Figure 5-2 shows some examples of nonstationary series.
Figure 5-2. Nonstationary plots
In the first plot, we can clearly see that the mean varies (increases) with time, result‐
ing in an upward trend. Thus this is a nonstationary series. For a series to be classi‐
fied as stationary, it should not exhibit a trend. Moving on to the second plot, we
certainly do not see a trend in the series, but the variance of the series is a function of
time. A stationary series must have a constant variance; hence this series is a nonsta‐
tionary series as well. In the third plot, the spread narrows as time increases, which
implies that the covariance is a function of time. The three examples
shown in Figure 5-2 represent nonstationary time series. Now look at a fourth plot, as
shown in Figure 5-3.
Figure 5-3. Stationary plot
In this case, the mean, variance, and covariance are constant with time. This is what a
stationary time series looks like. Predicting future values using this fourth plot would
be easier. Most statistical models require the series to be stationary to make effective
and precise predictions.
The two major reasons behind nonstationarity of a time series are trend and season‐
ality, as shown in Figure 5-2. In order to use time series forecasting models, we gener‐
ally convert any nonstationary series to a stationary series, making it easier to model
since statistical properties don’t change over time.
Differencing
Differencing is one of the methods used to make a time series stationary. In this
method, we compute the difference of consecutive terms in the series. Differencing is
typically performed to get rid of the varying mean. Mathematically, differencing can
be written as:
$$y'_t = y_t - y_{t-1}$$

where $y_t$ is the value at time $t$.
When the differenced series is white noise, the original series is referred to as a nonstationary series of degree one (i.e., integrated of order one).
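As a minimal sketch (again assuming Y is a pandas Series), we can difference a series and check for stationarity with the augmented Dickey-Fuller (ADF) test from statsmodels:

from statsmodels.tsa.stattools import adfuller

# First-order differencing: y'_t = y_t - y_{t-1}
Y_diff = Y.diff().dropna()

# ADF test: a small p-value (e.g., below 0.05) suggests the differenced series is stationary
adf_stat, p_value = adfuller(Y_diff)[:2]
print('ADF statistic: %.3f, p-value: %.3f' % (adf_stat, p_value))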
Traditional Time Series Models (Including the ARIMA Model)
There are many ways to model a time series in order to make predictions. Most of the
time series models aim at incorporating the trend, seasonality, and remainder com‐
ponents while addressing the autocorrelation and stationarity embedded in the time
series. For example, the autoregressive (AR) model discussed in the previous section
addresses the autocorrelation in the time series.
One of the most widely used models in time series forecasting is the ARIMA model.
ARIMA
If we combine differencing (to achieve stationarity) with autoregression and a moving average model (dis‐
cussed further on in this section), we obtain an ARIMA model. ARIMA is an acro‐
nym for AutoRegressive Integrated Moving Average, and it has the following
components:
AR(p)
It represents autoregression, i.e., regression of the time series onto itself, as dis‐
cussed in the previous section, with an assumption that current series values
depend on its previous values with some lag (or several lags). The maximum lag
in the model is referred to as p.
I(d)
It represents order of integration. It is simply the number of differences needed
to make the series stationary.
MA(q)
It represents moving average. Without going into detail, it models the error of
the time series; again, the assumption is that current error depends on the previ‐
ous with some lag, which is referred to as q.
The moving average equation is written as:

$$y_t = c + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \cdots + \theta_q \epsilon_{t-q}$$

where $\epsilon_t$ is white noise. We refer to this as an MA(q) model of order q.
Combining all the components, the full ARIMA model can be written as:

$$y'_t = c + \phi_1 y'_{t-1} + \cdots + \phi_p y'_{t-p} + \theta_1 \epsilon_{t-1} + \cdots + \theta_q \epsilon_{t-q} + \epsilon_t$$

where $y'_t$ is the differenced series (it may have been differenced more than once). The predictors on the right-hand side include both the lagged values of $y'_t$ and the lagged errors.
We call this an ARIMA(p,d,q) model, where p is the order of the autoregressive part,
d is the degree of first differencing involved, and q is the order of the moving average
part. The same stationarity and invertibility conditions that are used for autoregres‐
sive and moving average models also apply to an ARIMA model.
The Python code to fit an ARIMA model of the order (1,0,0) is shown in the following:

from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(endog=Y_train, order=[1, 0, 0])
model_fit = model.fit()
The family of ARIMA models has several variants; some of them are as follows (a brief sketch using statsmodels appears after this list):
ARIMAX
ARIMA models with exogenous variables included. We will be using this model
in case study 1.
SARIMA
“S” in this model stands for seasonal, and this model is targeted at modeling the
seasonality component embedded in the time series, along with other
components.
VARMA
This is the extension of the model to the multivariate case, when there are many vari‐
ables to be predicted simultaneously. We predict many variables simultaneously
in “Case Study 4: Yield Curve Prediction” on page 141.
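As a brief sketch of these variants (the variable names Y_train and X_train are assumed to be defined as in the case studies), both the seasonal and the exogenous-variable cases can be handled by the SARIMAX class in statsmodels:

import statsmodels.api as sm

# ARIMAX: an ARIMA(1, 0, 0) model with exogenous regressors
arimax_fit = sm.tsa.statespace.SARIMAX(Y_train, exog=X_train,
                                       order=(1, 0, 0)).fit(disp=False)

# SARIMA: an ARIMA(1, 0, 0) model plus a seasonal ARIMA(1, 0, 0) term of period 52
sarima_fit = sm.tsa.statespace.SARIMAX(Y_train, order=(1, 0, 0),
                                       seasonal_order=(1, 0, 0, 52)).fit(disp=False)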
Deep Learning Approach to Time Series Modeling
The traditional time series models such as ARIMA are well understood and effective
on many problems. However, these traditional methods also suffer from several limi‐
tations. Traditional time series models are linear functions, or simple transformations
of linear functions, and they require manually diagnosed parameters, such as time
dependence, and don’t perform well with corrupt or missing data.
If we look at the advancements in the field of deep learning for time series prediction,
we see that recurrent neural network (RNN) has gained increasing attention in recent
years. These methods can identify structure and patterns such as nonlinearity, can
seamlessly model problems with multiple input variables, and are relatively robust to
missingdata. The RNN models can retain state from one iteration to the next by
using their own output as input for the next step. These deep learning models can be
referred to as time series models, as they can make future predictions using the data
points in the past, similar to traditional time series models such as ARIMA. There‐
fore, there are a wide range of applications in finance where these deep learning mod‐
els can be leveraged. Let us look at the deep learning models for time series
forecasting.
RNNs
Recurrent neural networks (RNNs) are called “recurrent” because they perform the
same task for every element of a sequence, with the output being dependent on the
previous computations. RNN models have a memory, which captures information
about what has been calculated so far. As shown in Figure 5-4, a recurrent neural net‐
work can be thought of as multiple copies of the same network, each passing a mes‐
sage to a successor.
6 A detailed explanation of LSTM models can be found in this blog post by Christopher Olah: https://oreil.ly/4PDhr.
Figure 5-4. Recurrent Neural Network
In Figure 5-4:
• $X_t$ is the input at time step t.
• $O_t$ is the output at time step t.
• $S_t$ is the hidden state at time step t. It is the memory of the network, calculated based on the previous hidden state and the input at the current step.
The main feature of an RNN is this hidden state, which captures some information
about a sequence and uses it accordingly whenever needed.
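In one commonly used formulation (a sketch; implementations vary in the choice of activation function), the hidden state and output at each time step are computed as

$$S_t = \tanh(U X_t + W S_{t-1}), \qquad O_t = V S_t$$

where $U$, $W$, and $V$ are weight matrices shared across all time steps.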
Long short-term memory
Long short-term memory (LSTM) is a special kind of RNN explicitly designed to
avoid the long-term dependency problem. Remembering information for long peri‐
ods of time is practically default behavior for an LSTM model.6 These models are
composed of a set of cells with features to memorize the sequence of data. These cells
capture and store the data streams. Further, the cells interconnect one module of the
past to another module of the present to convey information from several past time
instants to the present one. Due to the use of gates in each cell, data in each cell can
be discarded, filtered, or added before being passed to the next cell.
The gates, based on artificial neural network layers, enable the cells to optionally let
data pass through or be discarded. Each layer yields numbers in the range of zero to
one, depicting the amount of each segment of data that ought to be let through in
each cell. More precisely, a value of zero implies "let nothing pass
through," and a value of one indicates "let everything pass through." Three types
of gates are involved in each LSTM, with the goal of controlling the state of each cell:
7 An ARIMA model and a Keras-based LSTM model will be demonstrated in one of the case studies.
Forget Gate
Outputs a number between zero and one, where one shows “completely keep
this” and zero implies “completely ignore this.” This gate conditionally decides
whether the past should be forgotten or preserved.
Input Gate
Chooses which new data needs to be stored in the cell.
Output Gate
Decides what will yield out of each cell. The yielded value will be based on the
cell state along with the filtered and newly added data.
Keras wraps the efficient numerical computation libraries and functions and allows
us to define and train LSTM neural network models in a few short lines of code. In
the following code, the LSTM module from keras.layers is used to implement the
LSTM network. The network is trained with the variable X_train_LSTM. The network
has a hidden layer with 50 LSTM blocks or neurons and an output layer that makes a
single value prediction. Also refer to Chapter 3 for a more detailed description of all
the terms (i.e., sequential, learning rate, momentum, epoch, and batch size).
A sample Python code for implementing an LSTM model in Keras is shown below:
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.layers import LSTM

def create_LSTMmodel(learn_rate=0.01, momentum=0):
    # create model
    model = Sequential()
    model.add(LSTM(50, input_shape=(X_train_LSTM.shape[1],
                                    X_train_LSTM.shape[2])))
    # More cells can be added if needed
    model.add(Dense(1))
    optimizer = SGD(lr=learn_rate, momentum=momentum)
    # compile with the SGD optimizer defined above
    model.compile(loss='mse', optimizer=optimizer)
    return model

LSTMModel = create_LSTMmodel(learn_rate=0.01, momentum=0)
LSTMModel_fit = LSTMModel.fit(X_train_LSTM, Y_train_LSTM,
                              validation_data=(X_test_LSTM, Y_test_LSTM),
                              epochs=330, batch_size=72, verbose=0, shuffle=False)
In terms of both learning and implementation, LSTM provides considerably more
options for fine-tuning compared to ARIMA models. Although deep learning models
have several advantages over traditional time series models, deep learning models are
more complicated and difficult to train.7
Modifying Time Series Data for Supervised Learning Models
A time series is a sequence of numbers that are ordered by a time index. Supervised
learning is where we have input variables (X) and an output variable (Y). Given a
sequence of numbers for a time series dataset, we can restructure the data into a set of
predictor and predicted variables, just like in a supervised learning problem. We can
do this by using previous time steps as input variables and using the next time step as
the output variable. Let’s make this concrete with an example.
We can restructure a time series shown in the left table in Figure 5-5 as a supervised
learning problem by using the value at the previous time step to predict the value at
the next time step. Once we’ve reorganized the time series dataset this way, the data
would look like the table on the right.
Figure 5-5. Modifying time series for supervised learning models
We can see that the previous time step is the input (X) and the next time step is the
output (Y) in our supervised learning problem. The order between the observations is
preserved and must continue to be preserved when using this dataset to train a super‐
vised model. We will delete the first and last row while training our supervised model
as we don’t have values for either X or Y.
In Python, the main function to help transform time series data into a supervised
learning problem is the shift() function from the Pandas library. We will demon‐
strate this approach in the case studies. The use of prior time steps to predict the next
time step is called the sliding window, time delay, or lag method.
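A minimal sketch of this transformation on a small hypothetical series is shown below:

import pandas as pd

# Hypothetical time series
series = pd.Series([100, 110, 108, 115, 120])

# The previous time step (lag 1) becomes the input X; the current step is the output Y
supervised = pd.DataFrame({'X': series.shift(1), 'Y': series}).dropna()
print(supervised)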
Having discussed all the concepts of supervised learning and time series models, let
us move to the case studies.
Case Study 1: Stock Price Prediction
One of the biggest challenges in finance is predicting stock prices. However, with the
onset of recent advancements in machine learning applications, the field has been
evolving to utilize nondeterministic solutions that learn what is going on in order to
make more accurate predictions. Machine learning techniques naturally lend
themselves to stock price prediction based on historical data. Predictions can be
made for a single time point ahead or for a set of future time points.
As a high-level overview, other than the historical price of the stock itself, the features
that are generally useful for stock price prediction are as follows:
Correlated assets
An organization depends on and interacts with many external factors, including
its competitors, clients, the global economy, the geopolitical situation, fiscal and
monetary policies, access to capital, and so on. Hence, its stock price may be cor‐
related not only with the stock price of other companies but also with other
assets such as commodities, FX, broad-based indices, or even fixed income
securities.
Technical indicators
Many investors follow technical indicators. Moving average, exponential mov‐
ingaverage, and momentum are the most popular indicators.
Fundamental analysis
Two primary data sources to glean features that can be used in fundamental
analysis include:
Performance reports
Annual and quarterly reports of companies can be used to extract or deter‐
mine key metrics, such as ROE (Return on Equity) and P/E (Price-to-
Earnings).
News
News can indicate upcoming events that can potentially move the stock price
in a certain direction.
In this case study, we will use various supervised learning–based models to predict
the stock price of Microsoft using correlated assets and its own historical data. By the
end of this case study, readers will be familiar with a general machine learning
approach to stock prediction modeling, from gathering and cleaning data to building
and tuning different models.
In this case study, we will focus on:
• Looking at various machine learning and time series models, ranging in com‐
plexity, that can be used to predict stock returns.
• Visualization of the data using different kinds of charts (i.e., density, correlation,
scatterplot, etc.)
• Using deep learning (LSTM) models for time series forecasting.
8 Refer to “Case Study 3: Bitcoin Trading Strategy” on page 179 presented in Chapter 6 and “Case Study 1: NLP
and Sentiment Analysis–Based Trading Strategies” on page 362 presented in Chapter 10 to understand the
usage of technical indicators and news-based fundamental analysis as features in the price prediction.
9 Equity markets have trading holidays, while currency markets do not. However, the alignment of the dates
across all the time series is ensured before any modeling or analysis.
• Implementation of the grid search for time series models (i.e., ARIMA model).
• Interpretation of the results and examining potential overfitting and underfitting
of the data across the models.
Blueprint for Using Supervised Learning Models to Predict a
Stock Price
1. Problem definition
In the supervised regression framework used for this case study, the weekly return of
Microsoft stock is the predicted variable. We need to understand what affects Micro‐
soft stock price and incorporate as much relevant information as possible into the model. Out of correla‐
ted assets, technical indicators, and fundamental analysis (discussed in the section
before), we will focus on correlated assets as features in this case study.8
For this case study, other than the historical data of Microsoft, the independent vari‐
ables used are the following potentially correlated assets:
Stocks
IBM (IBM) and Alphabet (GOOGL)
Currency9
USD/JPY and GBP/USD
Indices
S&P 500, Dow Jones, and VIX
The dataset used for this case study is extracted from Yahoo Finance and the FRED
website. In addition to predicting the stock price accurately, this case study will also
demonstrate the infrastructure and framework for each step of time series and super‐
vised regression–based modeling for stock price prediction. We will use the daily
closing price of the last 10 years, from 2010 onward.
2. Getting started—loading the data and Python packages
2.1. Loading the Python packages. The list of the libraries used for data loading, data
analysis, data preparation, model evaluation, and model tuning are shown below. The
packages used for different purposes have been segregated in the Python code that
follows. The details of most of these packages and functions were provided in Chap‐
ters 2 and 4. The use of these packages will be demonstrated in different steps of the
model development process.
Function and modules for the supervised regression models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.neural_network import MLPRegressor
Function and modules for data analysis and model evaluation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_regression
Function and modules for deep learning models
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.layers import LSTM
from keras.wrappers.scikit_learn import KerasRegressor
Function and modules for time series models
from statsmodels.tsa.arima_model import ARIMA
import statsmodels.api as sm
Function and modules for data preparation and visualization
# pandas, pandas_datareader, numpy and matplotlib
import numpy as np
import pandas as pd
import pandas_datareader.data as web
from matplotlib import pyplot
10 In different case studies across the book we will demonstrate loading the data through different sources (e.g.,
CSV files and external websites like Quandl).
from pandas.plotting import scatter_matrix
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from statsmodels.graphics.tsaplots import plot_acf
2.2. Loading the data. One of the most important steps in machine learning and pre‐
dictive modeling is gathering good data. The following steps demonstrate the loading
of data from the Yahoo Finance and FRED websites using the Pandas DataReader
function:10
stk_tickers = ['MSFT', 'IBM', 'GOOGL']
ccy_tickers = ['DEXJPUS', 'DEXUSUK']
idx_tickers = ['SP500', 'DJIA', 'VIXCLS']
stk_data = web.DataReader(stk_tickers, 'yahoo')
ccy_data = web.DataReader(ccy_tickers, 'fred')
idx_data = web.DataReader(idx_tickers, 'fred')
Next, we define our dependent (Y) and independent (X) variables. The predicted
variable is the weekly return of Microsoft (MSFT). The number of trading days in a
week is assumed to be five, and we compute the return using five trading days. For
independent variables we use the correlated assets and the historical return of MSFT
at different frequencies.
The variables used as independent variables are the lagged five-day returns of stocks
(IBM and GOOGL), currencies (USD/JPY and GBP/USD), and indices (S&P 500, Dow
Jones, and VIX), along with the lagged 5-day, 15-day, 30-day, and 60-day returns of
MSFT.
The lagged five-day variables embed the time series component by using a time-delay
approach, where the lagged variable is included as one of the independent variables.
This step is reframing the time series data into a supervised regression–based model
framework.
return_period = 5
Y = np.log(stk_data.loc[:, ('Adj Close', 'MSFT')]).diff(return_period).\
shift(-return_period)
Y.name = Y.name[-1]+'_pred'
X1 = np.log(stk_data.loc[:, ('Adj Close', ('GOOGL', 'IBM'))]).diff(return_period)
X1.columns = X1.columns.droplevel()
X2 = np.log(ccy_data).diff(return_period)
X3 = np.log(idx_data).diff(return_period)
X4 = pd.concat([np.log(stk_data.loc[:, ('Adj Close', 'MSFT')]).diff(i) \
for i in [return_period, return_period*3,\
return_period*6, return_period*12]], axis=1).dropna()
X4.columns = ['MSFT_DT', 'MSFT_3DT', 'MSFT_6DT', 'MSFT_12DT']
X = pd.concat([X1, X2, X3, X4], axis=1)
# keep every fifth row so that the weekly observations do not overlap
dataset = pd.concat([Y, X], axis=1).dropna().iloc[::return_period, :]
Y = dataset.loc[:, Y.name]
X = dataset.loc[:, X.columns]
3. Exploratory data analysis
We will look at descriptive statistics, data visualization, and time series analysis in
this section.
3.1. Descriptivestatistics. Let’s have a look at the dataset we have:
dataset.head()
Output
The variable MSFT_pred is the return of Microsoft stock and is the predicted vari‐
able. The dataset contains the lagged series of other correlated stocks, currencies, and
indices. Additionally, it also consists of the lagged historical returns of MSFT.
3.2. Data visualization. The fastest way to learn more about the data is to visualize it.
The visualization involves independently understanding each attribute of the dataset.
We will look at the scatterplot and the correlation matrix. These plots give us a sense
of the interdependence of the data. Correlation can be calculated and displayed for
each pair of the variables by creating a correlation matrix. Hence, besides the rela‐
tionship between independent and dependent variables, it also shows the correlation
among the independent variables. This is useful to know because some machine
learning algorithms like linear and logistic regression can have poor performance if
there are highly correlated input variables in the data:
correlation = dataset.corr()
pyplot.figure(figsize=(15,15))
pyplot.title('Correlation Matrix')
sns.heatmap(correlation, vmax=1, square=True,annot=True,cmap='cubehelix')
Output
Looking at the correlation plot (full-size version available on GitHub), we see some
correlation of the predicted variable with the lagged 5-day, 15-day, 30-day, and 60-
day returns of MSFT. Also, we see a higher negative correlation of many asset returns
versus VIX, which is intuitive.
Next, we can visualize the relationship between all the variables in the regression
using the scatterplot matrix shown below:
pyplot.figure(figsize=(15,15))
scatter_matrix(dataset,figsize=(12,12))
pyplot.show()
Output
Looking at the scatterplot (full-size version available on GitHub), we see some linear
relationship of the predicted variable with the lagged 15-day, 30-day, and 60-day
returns of MSFT. Otherwise, we do not see any special relationship between our pre‐
dicted variable and the features.
3.3. Time series analysis. Next, we delve into the time series analysis and look at the
decomposition of the time series of the predicted variable into trend and seasonality
components:
res = sm.tsa.seasonal_decompose(Y,freq=52)
fig = res.plot()
fig.set_figheight(8)
fig.set_figwidth(15)
pyplot.show()
11 The time series is not the stock price but stock return, so the trend is mild compared to the stock price series.
Output
We can see that for MSFT there has been a general upward trend in the return series.
This may be due to the large run-up of MSFT in the recent years, causing more posi‐
tive weekly return data points than negative.11 The trend may show up in the con‐
stant/bias terms in our models. The residual (or white noise) term is relatively small
over the entire time series.
4. Data preparation
This step typically involves data processing, data cleaning, looking at feature impor‐
tance, and performing feature reduction. The data obtained for this case study is rela‐
tively clean and doesn’t require further processing. Feature reduction might be useful
here, but given the relatively small number of variables considered, we will keep all of
them as is. We will demonstrate data preparation in some of the subsequent case
studies in detail.
5. Evaluate models
5.1. Train-test split. As described in Chapter 2, it is a good idea
to partition the original dataset into a training set and a test set. The test set is a sam‐
ple of the data that we hold back from our analysis and modeling. We use it right at
the end of our project to confirm the performance of our final model. It is the final
test that gives us confidence on our estimates of accuracy on unseen data. We will use
80% of the dataset for modeling and use 20% for testing. With time series data, the
sequence of values is important. So we do not distribute the dataset into training and
test sets in random fashion, but we select an arbitrary split point in the ordered list of
observations and create two new datasets:
validation_size = 0.2
train_size = int(len(X) * (1-validation_size))
X_train, X_test = X[0:train_size], X[train_size:len(X)]
Y_train, Y_test = Y[0:train_size], Y[train_size:len(X)]
5.2. Test options and evaluation metrics. To optimize the various hyperparameters of
the models, we use ten-fold cross validation (CV) and recalculate the results ten times
to account for the inherent randomness in some of the models and the CV process.
We will evaluate algorithms using the mean squared error metric. This metric gives
an idea of the performance of the supervised regression models. All these concepts,
including cross validation and evaluation metrics, have been described in Chapter 4:
num_folds = 10
scoring = 'neg_mean_squared_error'
5.3. Compare models and algorithms. Now that we have completed the data loading and
designed the test harness, we need to choose a model.
5.3.1. Machine learning models from Scikit-learn. In this step, the supervised regression
models are implemented using the sklearn package:
Regression and tree regression algorithms
models = []
models.append(('LR', LinearRegression()))
models.append(('LASSO', Lasso()))
models.append(('EN', ElasticNet()))
models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor()))
models.append(('SVR', SVR()))
Neural network algorithms
models.append(('MLP', MLPRegressor()))
Ensemble models
# Boosting methods
models.append(('ABR', AdaBoostRegressor()))
models.append(('GBR', GradientBoostingRegressor()))
# Bagging methods
models.append(('RFR', RandomForestRegressor()))
models.append(('ETR', ExtraTreesRegressor()))
Once we have selected all the models, we loop over each of them. First, we run the k-
fold analysis. Next, we run the model on the entire training and testing dataset.
All the algorithms use default tuning parameters. We will calculate the mean and
standard deviation of the evaluation metric for each algorithm and collect the results
for model comparison later:
names = []
kfold_results = []
test_results = []
train_results = []
seed = 7  # a fixed random seed for reproducibility (illustrative value)
for name, model in models:
    names.append(name)
    # k-fold analysis:
    kfold = KFold(n_splits=num_folds, random_state=seed)
    # convert mean squared error to positive; the lower the better
    cv_results = -1 * cross_val_score(model, X_train, Y_train, cv=kfold,
                                      scoring=scoring)
    kfold_results.append(cv_results)
    # Full training period
    res = model.fit(X_train, Y_train)
    train_result = mean_squared_error(res.predict(X_train), Y_train)
    train_results.append(train_result)
    # Test results
    test_result = mean_squared_error(res.predict(X_test), Y_test)
    test_results.append(test_result)
Let’s compare the algorithms by looking at the cross validation results:
Cross validation results
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison: Kfold results')
ax = fig.add_subplot(111)
pyplot.boxplot(kfold_results)
ax.set_xticklabels(names)
fig.set_size_inches(15,8)
pyplot.show()
Output
Although the results of a couple of the models look good, we see that linear
regression and the regularized regressions, lasso (LASSO) and
elastic net (EN), seem to perform best. This indicates a strong linear relationship
between the dependent and independent variables. Going back to the exploratory
analysis, we saw a good correlation and linear relationship of the target variables with
the different lagged MSFT variables.
Let us look at the errors of the test set as well:
Trainingand test error
# compare algorithms
fig = pyplot.figure()
ind = np.arange(len(names)) # the x locations for the groups
width = 0.35 # the width of the bars
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.bar(ind - width/2, train_results, width=width, label='Train Error')
pyplot.bar(ind + width/2, test_results, width=width, label='Test Error')
fig.set_size_inches(15,8)
pyplot.legend()
ax.set_xticks(ind)
ax.set_xticklabels(names)
pyplot.show()
Output
Examining the training and test error, we still see a stronger performance from the
linear models. Some of the algorithms, such as the decision tree regressor (CART),
overfit on the training data and produced very high error on the test set. Ensemble
models such as gradient boosting regression (GBR) and random forest regression
(RFR) have low bias but high variance. We also see that the artificial neural network
algorithm (shown as MLP in the chart) shows higher errors in both the training and
test sets. This is perhaps due to the ANN not capturing the linear relationship of the
variables accurately, improper hyperparameters, or insufficient training of the model.
Our original intuition from the cross validation results and the scatterplots also seem
to demonstrate a better performance of linear models.
We now look at some of the time series and deep learning models that can be used.
Once we are done creating these, we will compare their performance against that of
the supervised regression–based models. Due to the nature of time series models, we
are not able to run a k-fold analysis. We can still compare our results to the other
models based on the full training and testing results.
5.3.2. Time series–based models: ARIMA and LSTM. The models used so far already
embed the time series component by using a time-delay approach, where the lagged
variable is included as one of the independent variables. However, for the time ser‐
ies–based models we do not need the lagged variables of MSFT as the independent
variables. Hence, as a first step we remove MSFT’s previous returns for these models.
We use all other variables as the exogenous variables in these models.
Let us first prepare the dataset for the ARIMA model by keeping only the correlated
variables as exogenous variables:
X_train_ARIMA=X_train.loc[:, ['GOOGL', 'IBM', 'DEXJPUS', 'SP500', 'DJIA', \
'VIXCLS']]
X_test_ARIMA=X_test.loc[:, ['GOOGL', 'IBM', 'DEXJPUS', 'SP500', 'DJIA', \
'VIXCLS']]
tr_len = len(X_train_ARIMA)
te_len = len(X_test_ARIMA)
to_len = len(X)
We now configure the ARIMA model with the order (1,0,0) and use the independent
variables as the exogenous variables in the model. The version of the ARIMA model
where the exogenous variables are also used is known as the ARIMAX model, where
"X" represents exogenous variables:
modelARIMA=ARIMA(endog=Y_train,exog=X_train_ARIMA,order=[1,0,0])
model_fit = modelARIMA.fit()
Now we evaluate the fitted ARIMA model on the training set and the test set:
error_Training_ARIMA = mean_squared_error(Y_train, model_fit.fittedvalues)
predicted = model_fit.predict(start = tr_len -1 ,end = to_len -1, \
 exog = X_test_ARIMA)[1:]
error_Test_ARIMA = mean_squared_error(Y_test,predicted)
error_Test_ARIMA
Output
0.0005931919240399084
The error of this ARIMA model is reasonable.
Now let’s prepare the dataset for the LSTM model. We need the data in the form of
arrays of all the input variables and the output variables.
The logic behind the LSTM is that data is taken from the previous day (the data of all
the other features for that day—correlated assets and the lagged variables of MSFT)
and we try to predict the next day. Then we slide the one-day window forward by one day
and again predict the next day. We iterate like this over the whole dataset (of course
in batches). The code below will create a dataset in which X is the set of independent
variables at a given time (t) and Y is the target variable at the next time (t + 1):
seq_len = 2  # length of the sequence for the LSTM
Y_train_LSTM, Y_test_LSTM = np.array(Y_train)[seq_len-1:], np.array(Y_test)
# Reshape the inputs into the 3-D array (samples, seq_len, features) expected by an LSTM
X_train_LSTM = np.zeros((X_train.shape[0]+1-seq_len, seq_len, X_train.shape[1]))
X_test_LSTM = np.zeros((X_test.shape[0], seq_len, X.shape[1]))
for i in range(seq_len):
    X_train_LSTM[:, i, :] = np.array(X_train)[i:X_train.shape[0]+i+1-seq_len, :]
    X_test_LSTM[:, i, :] = np.array(X)[X_train.shape[0]+i-1:X.shape[0]+i+1-seq_len, :]
In the next step, we create the LSTM architecture. As we can see, the input of the
LSTM is in X_train_LSTM, which goes into 50 hidden units in the LSTM layer and
then is transformed to a single output—the stock return value. The hyperparameters
(i.e., learning rate, optimizer, activation function, etc.) were discussed in Chapter 3 of
the book:
# LSTM Network
def create_LSTMmodel(learn_rate=0.01, momentum=0):
    # create model
    model = Sequential()
    model.add(LSTM(50, input_shape=(X_train_LSTM.shape[1],
                                    X_train_LSTM.shape[2])))
    # More cells can be added if needed
    model.add(Dense(1))
    optimizer = SGD(lr=learn_rate, momentum=momentum)
    # compile with the SGD optimizer defined above
    model.compile(loss='mse', optimizer=optimizer)
    return model

LSTMModel = create_LSTMmodel(learn_rate=0.01, momentum=0)
LSTMModel_fit = LSTMModel.fit(X_train_LSTM, Y_train_LSTM,
                              validation_data=(X_test_LSTM, Y_test_LSTM),
                              epochs=330, batch_size=72, verbose=0, shuffle=False)
Having fit the LSTM model to the data, we now look at the change in the model
performance metric over the training epochs in both the training set and the test set:
pyplot.plot(LSTMModel_fit.history['loss'], label='train', )
pyplot.plot(LSTMModel_fit.history['val_loss'], '--',label='test',)
pyplot.legend()
pyplot.show()
Output
error_Training_LSTM = mean_squared_error(Y_train_LSTM,\
 LSTMModel.predict(X_train_LSTM))
predicted = LSTMModel.predict(X_test_LSTM)
error_Test_LSTM = mean_squared_error(Y_test,predicted)
Now, in order to compare the time series and the deep learning models, we append
the result of these models to the results of the supervised regression–based models:
test_results.append(error_Test_ARIMA)
test_results.append(error_Test_LSTM)
train_results.append(error_Training_ARIMA)
train_results.append(error_Training_LSTM)
names.append("ARIMA")
names.append("LSTM")
Output
Looking at the chart, we find the time series–based ARIMA model comparable to the
linear supervised regression models: linear regression (LR), lasso regression (LASSO),
and elastic net (EN). This can primarily be due to the strong linear relationship as
discussed before. The LSTM model performs decently; however, the ARIMA model
outperforms the LSTM model in the test set. Hence, we select the ARIMA model for
model tuning.
6. Model tuning and grid search
Let us perform the model tuning of the ARIMA model.
Model Tuning for the Supervised Learning or Time Series Models
The detailed implementation of grid search for all the supervised
learning–based models, along with the ARIMA and LSTM models,
is provided in the Regression-Master template under the GitHub
repository for this book. For the grid search of the ARIMA and
LSTM models, refer to the “ARIMA and LSTM Grid Search” sec‐
tion of the Regression-Master template.
The ARIMA model is generally represented as an ARIMA(p,d,q) model, where p is the
order of the autoregressive part, d is the degree of first differencing involved, and q is
the order of the moving average part. The order of the ARIMA model was set to
(1,0,0). So we perform a grid search with different p, d, and q combinations in the
ARIMA model’s order and select the combination that minimizes the fitting error:
def evaluate_arima_model(arima_order):
    modelARIMA = ARIMA(endog=Y_train, exog=X_train_ARIMA, order=arima_order)
    model_fit = modelARIMA.fit()
    error = mean_squared_error(Y_train, model_fit.fittedvalues)
    return error

# evaluate combinations of p, d, and q values for an ARIMA model
def evaluate_models(p_values, d_values, q_values):
    best_score, best_cfg = float("inf"), None
    for p in p_values:
        for d in d_values:
            for q in q_values:
                order = (p, d, q)
                try:
                    mse = evaluate_arima_model(order)
                    if mse < best_score:
                        best_score, best_cfg = mse, order
                    print('ARIMA%s MSE=%.7f' % (order, mse))
                except:
                    continue
    print('Best ARIMA%s MSE=%.7f' % (best_cfg, best_score))

# evaluate parameters
import warnings

p_values = [0, 1, 2]
d_values = range(0, 2)
q_values = range(0, 2)
warnings.filterwarnings("ignore")
evaluate_models(p_values, d_values, q_values)
Output
ARIMA(0, 0, 0) MSE=0.0009879
ARIMA(0, 0, 1) MSE=0.0009721
ARIMA(1, 0, 0) MSE=0.0009696
ARIMA(1, 0, 1) MSE=0.0009685
ARIMA(2, 0, 0) MSE=0.0009684
ARIMA(2, 0, 1) MSE=0.0009683
Best ARIMA(2, 0, 1) MSE=0.0009683
We see that the ARIMA model with the order (2,0,1) is the best performer out of all
the combinations tested in the grid search, although there isn’t a significant differ‐
ence in the mean squared error (MSE) with other combinations. This means that the
model with the autoregressive lag of two and moving average of one yields the best
result. Note that the other exogenous variables in the model also influence the order
of the best ARIMA model.
7. Finalize the model
In the last step we will check the finalized model on the test set.
7.1. Results on the test dataset. 
# prepare model
modelARIMA_tuned = ARIMA(endog=Y_train, exog=X_train_ARIMA, order=[2, 0, 1])
model_fit_tuned = modelARIMA_tuned.fit()

# estimate accuracy on the test set
predicted_tuned = model_fit_tuned.predict(start=tr_len - 1,
                                          end=to_len - 1, exog=X_test_ARIMA)[1:]
print(mean_squared_error(Y_test, predicted_tuned))
Output
0.0005970582461404503
The MSE of the model on the test set looks good and is actually less than that of the
training set.
In the last step, we will visualize the output of the selected model and compare the
modeled data against the actual data. In order to visualize the chart, we convert the
return time series to a price time series. We also assume the price at the beginning of
the test set to be one for the sake of simplicity. Let us look at the plot of actual versus
predicted data:
# plot the actual data versus the predicted data
predicted_tuned.index = Y_test.index
# convert weekly returns back to a price series (starting price normalized to one)
pyplot.plot(np.exp(Y_test).cumprod(), 'r', label='actual')
pyplot.plot(np.exp(predicted_tuned).cumprod(), 'b--', label='predicted')
pyplot.legend()
pyplot.rcParams["figure.figsize"] = (8, 5)
pyplot.show()
Looking at the chart, we clearly see that the trend has been captured well by the
model. The predicted series is less volatile compared to the actual time series, and it
aligns with the actual data for the first few months of the test set. A point to note is
that the purpose of the model is to compute the next day’s return given the data
observed up to the present day, and not to predict the stock price several days in the
future given the current data. Hence, a deviation from the actual data is expected as
we move away from the beginning of the test set. The model seems to perform well
for the first few months, with deviation from the actual data increasing six to seven
months after the beginning of the test set.
Conclusion
We can conclude that simple models—linear regression, regularized regression (i.e.,
Lasso and elastic net)—along with the time series models, such as ARIMA, are prom‐
ising modeling approaches for stock price prediction problems. This approach helps
us deal with overfitting and underfitting, which are some of the key challenges in
prediction problems in finance.
We should also note that we can use a wider set of indicators, such as P/E ratio, trad‐
ing volume, technical indicators, or news data, which might lead to better results. We
will demonstrate this in some of the future case studies in the book.
Overall, we created a supervised-regression and time series modeling framework that
allows us to perform stock price prediction using historical data. This framework
generates results to analyze risk and profitability before risking any capital.
Case Study 2: Derivative Pricing
In computational finance and risk management, several numerical methods (e.g.,
finite differences, Fourier methods, and Monte Carlo simulation) are commonly used
for the valuation of financial derivatives.
The Black-Scholes formula is probably one of the most widely cited and used models
in derivative pricing. Numerous variations and extensions of this formula are used to
price many kinds of financial derivatives. However, the model is based on several
assumptions. It assumes a specific form of movement for the underlying asset price,
namely geometric Brownian motion (GBM). It also assumes a conditional payment at
maturity of the option and economic constraints, such as no-arbitrage. Several other
derivative pricing models have similarly impractical model assumptions. Finance
practitioners are well aware that these assumptions are violated in practice, and prices
from these models are further adjusted using practitioner judgment.
Another aspect of the many traditional derivative pricing models is model calibra‐
tion, which is typically done not by historical asset prices but by means of derivative
prices (i.e., by matching the market prices of heavily traded options to the derivative
prices from the mathematical model). In the process of model calibration, thousands
of derivative prices need to be determined in order to fit the parameters of the model,
and the overall process is time consuming. Efficient numerical computation is
increasingly important in financial risk management, especially when we deal with
real-time risk management (e.g., high frequency trading). However, due to the
requirement of a highly efficient computation, certain high-quality asset models and
methodologies are discarded during model calibration of traditional derivative pric‐
ing models.
Machine learning can potentially be used to tackle these drawbacks related to imprac‐
tical model assumptions and inefficient model calibration. Machine learning algo‐
rithms have the ability to capture more nuances with very few theoretical assumptions
and can be effectively used for derivative pricing, even in a world with frictions. With
the advancements in hardware, we can train machine learning models on high per‐
formance CPUs, GPUs, and other specialized hardware to achieve a speed increase of
several orders of magnitude as compared to the traditional derivative pricing models.
Additionally, market data is plentiful, so it is possible to train a machine learning
algorithm to learn the function that is collectively generating derivative prices in the
market. Machine learning models can capture subtle nonlinearities in the data that
are not obtainable through other statistical approaches.
In this case study, we look at derivative pricing from a machine learning standpoint
and use a supervised regression–based model to price an option from simulated data.
The main idea here is to come up with a machine learning framework for derivative
pricing. Achieving a machine learning model with high accuracy would mean that we
12 The predicted variable, which is the option price, should ideally be obtained directly from the market. Given
that this case study is more for demonstration purposes, we use model-generated option prices for the sake of
convenience.
can leverage the efficient numerical calculation of machine learning for derivative
pricing with fewer underlying model assumptions.
In this case study, we will focus on:
• Developing a machine learning–based framework for derivative pricing.
• Comparison of linear and nonlinear supervised regression models in the context
of derivative pricing.
Blueprint for Developing a Machine Learning Model for
Derivative Pricing
1. Problem definition
In the supervised regression framework we used for this case study, the predicted
variable is the price of the option, and the predictor variables are the market data
used as inputs to the Black-Scholes option pricing model.
The variables selected to estimate the market price of the option are stock price, strike
price, time to expiration, volatility, interest rate, and dividend yield. The predicted
variable for this case study was generated by feeding random inputs into the
well-known Black-Scholes model.12
The price of a call option per the Black-Scholes option pricing model is defined in
Equation 5-1.
Equation 5-1. Black-Scholes equation for call option
$$S e^{-q\tau} \Phi(d_1) - e^{-r\tau} K \Phi(d_2)$$

with

$$d_1 = \frac{\ln(S/K) + (r - q + \sigma^2/2)\tau}{\sigma\sqrt{\tau}}$$

and

$$d_2 = \frac{\ln(S/K) + (r - q - \sigma^2/2)\tau}{\sigma\sqrt{\tau}} = d_1 - \sigma\sqrt{\tau}$$
where we have stock price $S$; strike price $K$; risk-free rate $r$; annual dividend yield $q$; time to maturity $\tau = T - t$ (represented as a unitless fraction of one year); and volatility $\sigma$.
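As a minimal sketch of Equation 5-1 (the function name and the example inputs below are illustrative), the call price can be computed directly using scipy:

import numpy as np
from scipy.stats import norm

def black_scholes_call(S, K, r, q, sigma, tau):
    # d1 and d2 as defined in Equation 5-1
    d1 = (np.log(S / K) + (r - q + sigma**2 / 2) * tau) / (sigma * np.sqrt(tau))
    d2 = d1 - sigma * np.sqrt(tau)
    return S * np.exp(-q * tau) * norm.cdf(d1) - np.exp(-r * tau) * K * norm.cdf(d2)

# Example: an at-the-money one-year call
print(black_scholes_call(S=100, K=100, r=0.05, q=0.0, sigma=0.2, tau=1.0))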
To make the logic simpler, we define moneyness as $M = K/S$ and look at the prices per unit of current stock price. We also set $q$ to 0.

This simplifies the formula to the following:

$$\Phi\left(\frac{-\ln(M) + (r + \sigma^2/2)\tau}{\sigma\sqrt{\tau}}\right) - e^{-r\tau} M \Phi\left(\frac{-\ln(M) + (r - \sigma^2/2)\tau}{\sigma\sqrt{\tau}}\right)$$
Looking at the equation above, the parameters that feed into the Black-Scholes option
pricing model are moneyness, risk-free rate, volatility, and time to maturity.
The parameter that plays the central role in the derivatives market is volatility, as it is
directly related to the movement of stock prices. As volatility increases, the range of
share price movements becomes much wider than that of a low-volatility stock.
In the options market, there isn’t a single volatility used to price all the options. This
volatility depends on the option moneyness and time to maturity. In general, the vol‐
atility increases with higher time to maturity and with moneyness. This behavior is
referred to as volatility smile/skew. We often derive the volatility from the price of the
options existing in the market, and this volatility is referred to as “implied” volatility.
In this exercise, we assume the structure of the volatility surface and use the function
in Equation 5-2, in which volatility depends on the option moneyness and time to
maturity, to generate the option volatility surface.
Equation 5-2. Equation for volatility

$$\sigma(M, \tau) = \sigma_0 + \alpha\tau + \beta(M - 1)^2$$
2. Getting started—loading the data and Python packages
2.1. Loading the Python packages. The loading of Python packages is similar to case
study 1 in this chapter. Please refer to the Jupyter notebook of this case study for
more details.
2.2. Defining functions and parameters. To generate the dataset, we need to simulate the
input parameters and then create the predicted variable.
13 When the spot price is equal to the strike price, the option is said to be at-the-money.
As a first step, we define the constant parameters required for the volatility surface.
These parameters are not expected to have a significant impact on the option price,
so they are set to some reasonable values:
true_alpha = 0.1
true_beta = 0.1
true_sigma0 = 0.2
The risk-free rate, which is an input to the Black-Scholes option pricing model, is
defined as follows:
risk_free_rate = 0.05
Volatility and option pricing functions. In this step we define the functions to compute
the volatility and the price of a call option as per Equations 5-1 and 5-2:
def option_vol_from_surface(moneyness, time_to_maturity):
 return true_sigma0 + true_alpha * time_to_maturity +\
 true_beta * np.square(moneyness - 1)
def call_option_price(moneyness, time_to_maturity, option_vol):
    d1=(np.log(1/moneyness)+(risk_free_rate+np.square(option_vol)/2)*\
        time_to_maturity)/(option_vol*np.sqrt(time_to_maturity))
    d2=(np.log(1/moneyness)+(risk_free_rate-np.square(option_vol)/2)*\
        time_to_maturity)/(option_vol*np.sqrt(time_to_maturity))
    N_d1 = norm.cdf(d1)
    N_d2 = norm.cdf(d2)
    # price per unit of current stock price, with dividend yield q = 0
    return N_d1 - moneyness * np.exp(-risk_free_rate*time_to_maturity) * N_d2
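As a quick, illustrative sanity check (not part of the original code), we can price an
at-the-money, one-year option under this surface using the constants defined above:

vol = option_vol_from_surface(1.0, 1.0)   # 0.2 + 0.1*1 + 0.1*0 = 0.30
price = call_option_price(1.0, 1.0, vol)  # roughly 0.14 per unit of stock price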
2.3. Data generation. We generate the input and output variables in the following
steps:
• Time to maturity (Ts) is generated using the np.random.random function, which
generates a uniform random variable between zero and one.
• Moneyness (Ks) is generated using the np.random.randn function, which gener‐
ates a normally distributed random variable. The random number multiplied by
0.25 generates the deviation of the strike from the spot price,13 and the overall
equation keeps the moneyness centered around one (and positive in almost all
draws).
• Volatility (sigma) is generated as a function of time to maturity and moneyness
using Equation 5-2.
• The option price is generated using Equation 5-1 for the Black-Scholes option
price.
14 Refer to the Jupyter notebook of this case study to go through other charts such as histogram plot and corre‐
lation plot.
In total we generate 10,000 data points (N):
N = 10000
Ks = 1+0.25*np.random.randn(N)
Ts = np.random.random(N)
Sigmas = np.array([option_vol_from_surface(k,t) for k,t in zip(Ks,Ts)])
Ps = np.array([call_option_price(k,t,sig) for k,t,sig in zip(Ks,Ts,Sigmas)])
Now we create the predicted and predictor variables:
Y = Ps
X = np.concatenate([Ks.reshape(-1,1), Ts.reshape(-1,1), Sigmas.reshape(-1,1)], \
axis=1)
dataset = pd.DataFrame(np.concatenate([Y.reshape(-1,1), X], axis=1),
 columns=['Price', 'Moneyness', 'Time', 'Vol'])
3. Exploratory data analysis
Let’s have a look at the dataset we have.
3.1. Descriptive statistics. 
dataset.head()
Output
Price Moneyness Time Vol
0 1.390e-01 0.898 0.221 0.223
1 3.814e-06 1.223 0.052 0.210
2 1.409e-01 0.969 0.391 0.239
3 1.984e-01 0.950 0.628 0.263
4 2.495e-01 0.914 0.810 0.282
The dataset contains price—which is the price of the option and is the predicted vari‐
able—along with moneyness (the ratio of strike and spot price), time to maturity, and
volatility, which are the features in the model.
3.2. Data visualization. In this step we look at the scatterplot matrix to understand the
interactions between the different variables:14
pyplot.figure(figsize=(15,15))
scatter_matrix(dataset,figsize=(12,12))
pyplot.show()
Output (scatterplot matrix of price, moneyness, time to maturity, and volatility)
The scatterplot reveals very interesting dependencies and relationships between the
variables. Let us look at the first row of the chart to see the relationship of price to
different variables. We observe that as moneyness decreases (i.e., strike price decrea‐
ses as compared to the stock price), there is an increase in the price, which is in line
with the rationale described in the previous section. Looking at the price versus time
to maturity, we see an increase in the option price. The price versus volatility chart
also shows an increase in the price with the volatility. However, option price seems to
exhibit a nonlinear relationship with most of the variables. This means that we expect
our nonlinear models to do a better job than our linear models.
Another interesting relationship is between volatility and moneyness: the further we
deviate from a moneyness of one, the higher the volatility we observe. This behavior
results from the volatility function we defined before and illustrates the volatility
smile/skew.
4. Data preparation and analysis
We performed most of the data preparation steps (i.e., getting the dependent and
independent variables) in the preceding sections. In this step we look at the feature
importance.
4.1. Univariate feature selection. We start by looking at each feature individually and,
using a single-variable regression fit as the criterion, identify the most important
variables:
bestfeatures = SelectKBest(k='all', score_func=f_regression)
fit = bestfeatures.fit(X,Y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(['Moneyness', 'Time', 'Vol'])
#concat two dataframes for better visualization
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score'] #naming the dataframe columns
featureScores.nlargest(10,'Score').set_index('Specs')
Output
Moneyness : 30282.309
Vol : 2407.757
Time : 1597.452
We observe that the moneyness is the most important variable for the option price,
followed by volatility and time to maturity. Given there are only three predictor vari‐
ables, we retain all the variables for modeling.
5. Evaluate models
5.1. Train-test split and evaluation metrics. First, we separate the training set and test
set:
validation_size = 0.2
train_size = int(len(X) * (1-validation_size))
X_train, X_test = X[0:train_size], X[train_size:len(X)]
Y_train, Y_test = Y[0:train_size], Y[train_size:len(X)]
We use the prebuilt sklearn models to run a k-fold analysis on our training data. We
then train the model on the full training data and use it for prediction of the test data.
We will evaluate algorithms using the mean squared error metric. The parameters for
the k-fold analysis and evaluation metrics are defined as follows:
num_folds = 10
seed = 7
scoring = 'neg_mean_squared_error'
5.2. Compare models and algorithms. Now that we have completed the data loading and
have designed the test harness, we need to choose a model out of the suite of the
supervised regression models.
Linear models and regression trees
models = []
models.append(('LR', LinearRegression()))
models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor()))
models.append(('SVR', SVR()))
Artificial neural network
models.append(('MLP', MLPRegressor()))
Boosting and bagging methods
# Boosting methods
models.append(('ABR', AdaBoostRegressor()))
models.append(('GBR', GradientBoostingRegressor()))
# Bagging methods
models.append(('RFR', RandomForestRegressor()))
models.append(('ETR', ExtraTreesRegressor()))
Once we have selected all the models, we loop over each of them. First, we run the
k-fold analysis; then we train the model on the full training set and evaluate it on the
test set. The algorithms use default tuning parameters. We calculate the mean and
standard deviation of the error metric for each model and save the results for later use.
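The following is a minimal sketch of this loop, assuming the models list and the
parameters defined above (the notebook's version also stores the results for plotting);
note that recent versions of scikit-learn require shuffle=True when a random_state is
passed to KFold:

from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error

names, kfold_results, test_results = [], [], []
for name, model in models:
    # k-fold analysis on the training data
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_results = -1 * cross_val_score(model, X_train, Y_train, cv=kfold,
                                      scoring=scoring)
    kfold_results.append(cv_results)
    names.append(name)
    # fit on the full training set and score the held-out test set
    res = model.fit(X_train, Y_train)
    test_results.append(mean_squared_error(Y_test, res.predict(X_test)))
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))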
Output (cross-validation results and training/test error for each model; see the notebook)
The Python code for the k-fold analysis step is similar to that used in case study 1.
Readers can also refer to the Jupyter notebook of this case study in the code reposi‐
tory for more details. Let us look at the performance of the models in the training set.
We see clearly that the nonlinear models, including classification and regression tree
(CART), ensemble models, and the artificial neural network (represented by MLP in
the chart above), perform a lot better than the linear algorithms. This is intuitive
given the nonlinear relationships we observed in the scatterplot.
Artificial neural networks (ANN) have the natural ability to model any function with
fast experimentation and deployment times (definition, training, testing, inference).
ANN can effectively be used in complex derivative pricing situations. Hence, out of
all the models with good performance, we choose ANN for further analysis.
6. Model tuning and finalizing the model
Determining the proper number of nodes for the middle layer of an ANN is more of
an art than a science, as discussed in Chapter 3. Too many nodes in the middle layer,
and thus too many connections, produce a neural network that memorizes the input
data and lacks the ability to generalize. Therefore, increasing the number of nodes in
the middle layer will improve performance on the training set, while decreasing the
number of nodes in the middle layer will improve performance on a new dataset.
As discussed in Chapter 3, the ANN model has several other hyperparameters, such as
learning rate, momentum, activation function, number of epochs, and batch size. All
of these hyperparameters can be tuned during the grid search process. However, in
this step, for simplicity we stick to performing grid search on the number and size of
the hidden layers. The approach for performing grid search on the other
hyperparameters is the same as in the following code snippet:
'''
hidden_layer_sizes : tuple, length = n_layers - 2, default (100,)
 The ith element represents the number of neurons in the ith
 hidden layer.
'''
param_grid={'hidden_layer_sizes': [(20,), (50,), (20,20), (20, 30, 20)]}
model = MLPRegressor()
kfold = KFold(n_splits=num_folds, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, \
 cv=kfold)
grid_result = grid.fit(X_train, Y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
 print("%f (%f) with: %r" % (mean, stdev, param))
Output
Best: -0.000024 using {'hidden_layer_sizes': (20, 30, 20)}
-0.000580 (0.000601) with: {'hidden_layer_sizes': (20,)}
-0.000078 (0.000041) with: {'hidden_layer_sizes': (50,)}
-0.000090 (0.000140) with: {'hidden_layer_sizes': (20, 20)}
-0.000024 (0.000011) with: {'hidden_layer_sizes': (20, 30, 20)}
The best model has three hidden layers, with 20, 30, and 20 nodes, respectively.
Hence, we prepare a model with this configuration and check its performance on the
test set. This is a crucial step, because a greater number of layers may lead to
overfitting and poor performance on the test set.
# prepare model
model_tuned = MLPRegressor(hidden_layer_sizes=(20, 30, 20))
model_tuned.fit(X_train, Y_train)
# estimate accuracy on validation set
# transform the validation dataset
predictions = model_tuned.predict(X_test)
print(mean_squared_error(Y_test, predictions))
Output
3.08127276609567e-05
We see that the mean squared error (MSE) on the test set is 3.08e–5, corresponding
to a root mean squared error of roughly 0.0055 per unit of stock price, which is less
than one cent. Hence, the ANN model does an excellent job of fitting the Black-Scholes
option pricing model. A greater number of layers and tuning of other hyperparameters
may enable the ANN model to capture the complex relationships and nonlinearity in
the data even better. Overall, the results suggest that an ANN may be used to train an
option pricing model that matches market prices.
7. Additional analysis: removing the volatility data
As an additional analysis, we make the process harder by trying to predict the price
without the volatility data. If the model performance is good, we will eliminate the
need to have a volatility function as described before. In this step, we further compare
the performance of the linear and nonlinear models. In the following code snippet,
we remove the volatility variable from the dataset of the predictor variable and define
the training set and test set again:
X = X[:, :2]  # keep only moneyness and time to maturity
validation_size = 0.2
train_size = int(len(X) * (1-validation_size))
X_train, X_test = X[0:train_size], X[train_size:len(X)]
Y_train, Y_test = Y[0:train_size], Y[train_size:len(X)]
Next, we run the suite of the models (except the regularized regression model) with
the new dataset, with the same parametersand similar Python code as before. The
performance of all the models after removing the volatility data is as follows:
Output (cross-validation and test error comparison after removing the volatility data; see the notebook)
Looking at the result, we reach a similar conclusion as before: the linear regression
performs poorly, while the ensemble and ANN models perform well. The linear
regression now does an even worse job than before. However, the performance of the
ANN and the other ensemble models does not deviate much from their previous
performance. This implies that the information in the volatility is likely captured in
the other variables, such as moneyness and time to maturity. Overall, this is good
news, as it means that fewer variables might be needed to achieve the same
performance.
Conclusion
We know that derivative pricing is a nonlinear problem. As expected, our linear
regression model did not do as well as our nonlinear models, and the nonlinear
models had very good overall performance. We also observed that removing the
volatility increases the difficulty of the prediction problem for the linear regression.
However, the nonlinear models such as ensemble models and ANN are still able to
do well at the prediction process. This does indicate that one might be able to side‐
step the development of an option volatility surface and achieve a good prediction
with a smaller number of variables.
We saw that an artificial neural network (ANN) can reproduce the Black-Scholes
option pricing formula for a call option to a high degree of accuracy, meaning we can
leverage efficient numerical calculation of machine learning in derivative pricing
without relying on the impractical assumptions made in the traditional derivative
pricing models. The ANN and the related machine learning architecture can easily be
extended to pricing derivatives in the real world, with no knowledge of the theory of
derivative pricing. The use of machine learning techniques can lead to much faster
derivative pricing compared to traditional derivative pricing models. The price we
might have to pay for this extra speed is some loss of accuracy. However, this reduced
accuracy is often well within reasonable limits and acceptable from a practical point
of view. New technology has commoditized the use of ANN, so it might be worth‐
while for banks, hedge funds, and financial institutions to explore these models for
derivative pricing.
Case Study 3: Investor Risk Tolerance and Robo-Advisors
The risk tolerance of an investor is one of the most important inputs to the portfolio
allocation and rebalancing steps of the portfolio management process. There is a wide
variety of risk profiling tools that take varied approaches to understanding the risk
tolerance of an investor. Most of these approaches include qualitative judgment and
involve significant manual effort. In most cases, the risk tolerance of an investor is
decided based on a risk tolerance questionnaire.
Several studies have shown that these risk tolerance questionnaires are prone to
error, as investors suffer from behavioral biases and are poor judges of their own risk
perception, especially during stressed markets. Also, given that these questionnaires
must be manually completed by investors, they eliminate the possibility of automat‐
ing the entire investment management process.
So can machine learning provide a better understanding of an investor’s risk profile
than a risk tolerance questionnaire can? Can machine learning contribute to auto‐
mating the entire portfolio management process by cutting the client out of the loop?
Could an algorithm be written to develop a personality profile for the client that
would be a better representation of how they would deal with different market
scenarios?
The goal of this case study is to answer these questions. We first build a supervised
regression–based model to predict the risk tolerance of an investor. We then build a
robo-advisor dashboard in Python and implement the risk tolerance prediction
model in the dashboard. The overall purpose is to demonstrate the automation of the
manual steps in the portfolio management process with the help of machine learning.
This can prove to be immensely useful, specifically for robo-advisors.
A dashboard is one of the key features of a robo-advisor as it provides access to
important information and allows users to interact with their accounts free of any
human dependency, making the portfolio management process highly efficient.
Figure 5-6 provides a quick glance at the robo-advisor dashboard built for this case
study. The dashboard performs end-to-end asset allocation for an investor, embed‐
ding the machine learning–based risk tolerance model constructed in this case study.
Figure 5-6. Robo-advisor dashboard
This dashboard has been built in Python and is described in detail in an additional
step in this case study. Although it has been built in the context of robo-advisors, it
can be extended to other areas in finance and can embed the machine learning mod‐
els discussed in other case studies, providing finance decision makers with a graphi‐
cal interface for analyzing and interpreting model results.
In this case study, we will focus on:
• Feature elimination and feature importance/intuition.
• Using machine learning to automate the manual steps involved in the portfolio
management process.
• Using machine learning to quantify and model the behavioral bias of investors/
individuals.
• Embedding machine learning models into user interfaces or dashboards using
Python.
15 Given that the primary purpose of the model is to be used in the portfolio management context, the individ‐
ual is also referred to as investor in the case study.
Blueprint for Modeling Investor Risk Tolerance and Enabling
a Machine Learning–Based Robo-Advisor
1. Problem definition
In the supervised regression framework used for this case study, the predicted vari‐
able is the “true” risk tolerance of an individual,15 and the predictor variables are
demographic, financial, and behavioral attributes of an individual.
The data used for this case study is from the Survey of Consumer Finances (SCF),
which is conducted by the Federal Reserve Board. The survey includes responses
about household demographics, net worth, and financial and nonfinancial assets for the
same set of individuals in 2007 (precrisis) and 2009 (postcrisis). This enables us to see
how each household’s allocation changed after the 2008 global financial crisis. Refer
to the data dictionary for more information on this survey.
2. Getting started—loading the data and Python packages
2.1. Loading the Python packages. The details on loading the standard Python packages
were presented in the previous case studies. Refer to the Jupyter notebook for this
case study for more details.
2.2. Loading the data. In this step we load the data from the Survey of Consumer
Finances and look at the data shape:
# load dataset
dataset = pd.read_excel('SCFP2009panel.xlsx')
Let us look at the size of the data:
dataset.shape
Output
(19285, 515)
As we can see, the dataset has a total of 19,285 observations with 515 columns. The
number of columns represents the number of features.
16 There can potentially be several ways of computing the risk tolerance. In this case study, we use an intuitive
way to measure the risk tolerance of an individual.
3. Data preparation and feature selection
In this step we prepare the predicted and predictor variables to be used for modeling.
3.1. Preparing the predicted variable. In the first step, we prepare the predicted variable,
which is the true risk tolerance.
The steps to compute the true risk tolerance are as follows:
1. Computethe risky assets and the risk-free assets for all the individuals in the sur‐
vey data. Risky and risk-free assets are defined as follows:
Risky assets
Investments in mutual funds, stocks, and bonds.
Risk-free assets
Checking and savings balances, certificates of deposit, and other cash balan‐
ces and equivalents.
2. Take the ratio of risky assets to total assets (where total assets is the sum of risky
and risk-free assets) of an individual and consider that as a measure of the indi‐
vidual’s risk tolerance.16 From the SCF, we have the data of risky and risk-free
assets for the individuals in 2007 and 2009. We use this data and normalize the
risky assets with the price of a stock index (S&P500) in 2007 versus 2009 to get
risk tolerance.
3. Identify the “intelligent” investors. Some literature describes an intelligent
investor as one who does not change their risk tolerance during changes in the
market. So we consider the investors who changed their risk tolerance by less
than 10% between 2007 and 2009 as the intelligent investors. Of course, this is a
qualitative judgment, and there can be several other ways of defining an intelli‐
gent investor. However, as mentioned before, beyond coming up with a precise
definition of true risk tolerance, the purpose of this case study is to demonstrate
the usage of machine learning and provide a machine learning–based framework
in portfolio management that can be further leveraged for more detailed analysis.
Let us compute the predicted variable. First, we get the risky and risk-free assets and
compute the risk tolerance for 2007 and 2009 in the following code snippet:
# Compute the risky assets and risk-free assets for 2007
dataset['RiskFree07']= dataset['LIQ07'] + dataset['CDS07'] + dataset['SAVBND07']\
 + dataset['CASHLI07']
dataset['Risky07'] = dataset['NMMF07'] + dataset['STOCKS07'] + dataset['BOND07']
# Compute the risky assets and risk-free assets for 2009
dataset['RiskFree09']= dataset['LIQ09'] + dataset['CDS09'] + dataset['SAVBND09']\
+ dataset['CASHLI09']
dataset['Risky09'] = dataset['NMMF09'] + dataset['STOCKS09'] + dataset['BOND09']
# Compute the risk tolerance for 2007
dataset['RT07'] = dataset['Risky07']/(dataset['Risky07']+dataset['RiskFree07'])
#Average stock index for normalizing the risky assets in 2009
Average_SP500_2007=1478
Average_SP500_2009=948
# Compute the risk tolerance for 2009
dataset['RT09'] = dataset['Risky09']/(dataset['Risky09']+dataset['RiskFree09'])*\
 (Average_SP500_2009/Average_SP500_2007)
Let us look at the details of the data:
dataset.head()
Output (first five rows of the dataset, showing a subset of its columns)
The data above displays some of the columns out of the 521 columns of the dataset.
Let us compute the percentage change in risk tolerance between 2007 and 2009:
dataset['PercentageChange'] = np.abs(dataset['RT09']/dataset['RT07']-1)
Next, we drop the rows containing “NA” or “NaN”:
# Drop the rows containing NA
dataset=dataset.dropna(axis=0)
dataset=dataset[~dataset.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
Let us investigate the risk tolerance behavior of individuals in 2007 versus 2009. First
we look at the risk tolerance in 2007:
sns.distplot(dataset['RT07'], hist=True, kde=False,
 bins=int(180/5), color = 'blue',
 hist_kws={'edgecolor':'black'})
Output (histogram of risk tolerance in 2007)
Looking at the risk tolerance in 2007, we see that a significant number of individuals
had a risk tolerance close to one, meaning investments were skewed more toward the
risky assets. Now let us look at the risk tolerance in 2009:
sns.distplot(dataset['RT09'], hist=True, kde=False,
 bins=int(180/5), color = 'blue',
 hist_kws={'edgecolor':'black'})
Output (histogram of risk tolerance in 2009)
Clearly, the behavior of the individuals reversed after the crisis. Overall risk tolerance
decreased, which is shown by the outsized proportion of households having risk tol‐
erance close to zero in 2009. Most of the investments of these individuals were in
risk-free assets.
In the next step, we pick the intelligent investors whose change in risk tolerance
between 2007 and 2009 was less than 10%, as described in “3.1. Preparing the predic‐
ted variable” on page 128:
dataset3 = dataset[dataset['PercentageChange']<=.1]
We assign the true risk tolerance as the average risk tolerance of these intelligent
investors between 2007 and 2009:
dataset3['TrueRiskTolerance'] = (dataset3['RT07'] + dataset3['RT09'])/2
This is the predicted variable for this case study.
Let us drop other labels that might not be needed for the prediction:
dataset3.drop(labels=['RT07', 'RT09'], axis=1, inplace=True)
dataset3.drop(labels=['PercentageChange'], axis=1, inplace=True)
3.2. Feature selection—limit the feature space. In this section, we will explore ways to
condense the feature space.
3.2.1. Feature elimination. To filter the features further, we check the description in
the data dictionary and keep only the features that are relevant.
Looking at the entire data, we have more than 500 features in the dataset. However,
academic literature and industry practice indicate risk tolerance is heavily influenced
by investor demographic, financial, and behavioral attributes, such as age, current
income, net worth, and willingness to take risk. All these attributes were available in
the dataset and are summarized in the following section. These attributes are used as
features to predict investors’ risk tolerance.
In the dataset, each of the columns contains a numeric value corresponding to the
value of the attribute. The details are as follows:
AGE
There are six age categories, where 1 represents age less than 35 and 6 represents
age more than 75.
EDUC
There are four education categories, where 1 represents no high school and 4
represents college degree.
MARRIED
There are two categories to represent marital status, where 1 represents married
and 2 represents unmarried.
OCCU
This represents occupation category. A value of 1 represents managerial status
and 4 represents unemployed.
KIDS
Number of children.
WSAVED
This represents the individual’s spending versus income, split into three cate‐
gories. For example, 1 represents spending exceeded income.
NWCAT
This represents net worth category. There are five categories, where 1 represents
net worth less than the 25th percentile and 5 represents net worth more than the
90th percentile.
INCCL
This represents income category. There are five categories, where 1 represents
income less than $10,000 and 5 represents income more than $100,000.
RISK
This represents the willingness to take risk on a scale of 1 to 4, where 1 represents
the highest level of willingness to take risk.
We keep only the intuitive features as of 2007 and remove all the intermediate fea‐
tures and features related to 2009, as the variables of 2007 are the only ones required
for predicting the risk tolerance:
keep_list2 = ['AGE07','EDCL07','MARRIED07','KIDS07','OCCAT107','INCOME07',\
'RISK07','NETWORTH07','TrueRiskTolerance']
drop_list2 = [col for col in dataset3.columns if col not in keep_list2]
dataset3.drop(labels=drop_list2, axis=1, inplace=True)
Now let us look at the correlation among the features:
# correlation
correlation = dataset3.corr()
plt.figure(figsize=(15,15))
plt.title('Correlation Matrix')
sns.heatmap(correlation, vmax=1, square=True,annot=True,cmap='cubehelix')
Output (correlation matrix heatmap)
Looking at the correlation chart (full-size version available on GitHub), net worth
and income are positively correlated with risk tolerance. With a greater number of
kids and with marriage, risk tolerance decreases. As the willingness to take risk
decreases, risk tolerance decreases. Age has a positive relationship with risk tolerance.
As per Hui Wang and Sherman Hanna's paper "Does Risk Tolerance Decrease with
Age?," risk tolerance increases as people age (i.e., the proportion of net wealth
invested in risky assets increases as people age) when other variables are held constant.
So in summary, the relationship of these variables with risk tolerance seems intuitive.
17 We could have chosen RMSE as the evaluation metric; however, R2 was chosen given
that we already used RMSE as the evaluation metric in the previous case studies.
4. Evaluate models
4.1. Train-test split. Let us split the data into training and test set:
Y= dataset3["TrueRiskTolerance"]
X = dataset3.loc[:, dataset3.columns != 'TrueRiskTolerance']
validation_size = 0.2
seed = 3
X_train, X_validation, Y_train, Y_validation = \
train_test_split(X, Y, test_size=validation_size, random_state=seed)
4.2. Test options and evaluation metrics. We use R2 as the evaluation metric and select
10 as the number of folds for cross validation.17
num_folds = 10
scoring = 'r2'
4.3. Compare models and algorithms. Next, we select the suite of the regression model
and perform the k-folds cross validation.
Regression Models
# spot-check the algorithms
models = []
models.append(('LR', LinearRegression()))
models.append(('LASSO', Lasso()))
models.append(('EN', ElasticNet()))
models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor()))
models.append(('SVR', SVR()))
#Ensemble Models
# Boosting methods
models.append(('ABR', AdaBoostRegressor()))
models.append(('GBR', GradientBoostingRegressor()))
# Bagging methods
models.append(('RFR', RandomForestRegressor()))
models.append(('ETR', ExtraTreesRegressor()))
The Python code for the k-fold analysis step is similar to that of previous case studies.
Readers can also refer to the Jupyter notebook of this case study in the code reposi‐
tory for more details. Let us look at the performance of the models in the training set.
The nonlinear models perform better than the linear models, which means that there
is a nonlinear relationship between the risk tolerance and the variables used to
predict it. Given that random forest regression is one of the best-performing methods,
we use it for the grid search that follows.
5. Model tuning and grid search
As discussed in Chapter 4, random forest has many hyperparameters that can be
tweaked while performing the grid search. However, we will confine our grid search
to the number of estimators (n_estimators), as it is one of the most important
hyperparameters. It represents the number of trees in the random forest model.
Ideally, this should be increased until no further improvement is seen in the model:
# 8. Grid search : RandomForestRegressor
'''
n_estimators : integer, optional (default=10)
 The number of trees in the forest.
'''
param_grid = {'n_estimators': [50,100,150,200,250,300,350,400]}
model = RandomForestRegressor()
kfold = KFold(n_splits=num_folds, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, \
 cv=kfold)
grid_result = grid.fit(X_train, Y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
Output
Best: 0.738632 using {'n_estimators': 250}
Random forest with 250 estimators is the best model after the grid search.
6. Finalize the model
Let us look at the results on the test dataset and check the feature importance.
6.1. Results on the test dataset. We prepare the random forest model with 250
estimators:
model = RandomForestRegressor(n_estimators = 250)
model.fit(X_train, Y_train)
Let us look at the performance in the training set:
from sklearn.metrics import r2_score
predictions_train = model.predict(X_train)
print(r2_score(Y_train, predictions_train))
Output
0.9640632406817223
The R2 of the training set is 96%, which is a good result. Now let us look at the perfor‐
mance in the test set:
predictions = model.predict(X_validation)
print(mean_squared_error(Y_validation, predictions))
print(r2_score(Y_validation, predictions))
Output
0.007781840953471237
0.7614494526639909
Given the small mean squared error and the R2 of 76% shown above for the test set,
the random forest model does a good job of fitting the risk tolerance.
6.2. Feature importance and feature intuition
Let us look into the feature importance of the variables within the random forest
model:
import pandas as pd
import numpy as np
model = RandomForestRegressor(n_estimators=250, n_jobs=-1)
model.fit(X_train, Y_train)
# use the feature_importances_ attribute of tree-based models and
# plot a bar chart of the feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
Output (feature importance bar chart)
In the chart, the x-axis represents the magnitude of the importance of a feature.
Hence, income and net worth, followed by age and willingness to take risk, are the
key variables in determining risk tolerance.
6.3. Save model for later use. In this step we save the model for later use. The saved
model can be used directly for prediction given a set of input variables. The model
is saved as finalized_model.sav using the dump function of the pickle package. This
saved model can be loaded back using the load function.
Let’s save the model as the first step:
# Save Model Using Pickle
from pickle import dump
from pickle import load
# save the model to disk
filename = 'finalized_model.sav'
dump(model, open(filename, 'wb'))
Now let’s load the saved model and use it for prediction:
# load the model from disk
loaded_model = load(open(filename, 'rb'))
# estimate accuracy on validation set
predictions = loaded_model.predict(X_validation)
result = mean_squared_error(Y_validation, predictions)
print(r2_score(Y_validation, predictions))
print(result)
Output
0.7683894847939692
0.007555447734714956
7. Additional step: robo-advisor dashboard
We mentioned the robo-advisor dashboard at the beginning of this case study. The
robo-advisor dashboard automates the portfolio management process and aims to
overcome the problems of traditional risk tolerance profiling.
Python Code for Robo-Advisor Dashboard
This robo-advisor dashboard is built in Python using the plotly
dash package. Dash is a productive Python framework for building
web applications with good user interfaces. The code for the robo-
advisor dashboard is added to the code repository for this book.
The code is in a Jupyter notebook called “Sample Robo-advisor”. A
detailed description of the code is outside the scope of this case
study. However, the codebase can be leveraged for creation of any
new machine learning–enabled dashboard.
The dashboard has two panels:
• Inputs for investor characteristics
• Asset allocation and portfolio performance
7.1. Input for investor characteristics. Figure 5-7 shows the input panel for the investor
characteristics. This panel takes all the inputs regarding the investor's demographic,
financial, and behavioral attributes. These inputs correspond to the predictor
variables used in the risk tolerance model created in the preceding steps. The
interface is designed to input the categorical and continuous variables in the correct
format.
Once the inputs are submitted, we leverage the model saved in “6.3. Save model for
later use” on page 137. This model takes all the inputs and produces the risk tolerance
of an investor (referto the predict_riskTolerance function of the “Sample Robo-
advisor” Jupyter notebook in the code repository for this book for more details). The
risk tolerance prediction model is embedded in this dashboard and is triggered once
the “Calculate Risk Tolerance” button is pressed after submitting the inputs.
Figure 5-7. Robo-advisor input panel
7.2. Asset allocation and portfolio performance. Figure 5-8 shows the "Asset Allocation
and Portfolio Performance" panel, which performs the following functions:
• Once the risk tolerance is computed using the model, it is displayed on the top of
this panel.
• In the next step, we pick the assets for our portfolio from the dropdown.
• Once the list of assets is submitted, the traditional mean-variance portfolio allo‐
cation model is used to allocate the portfolio among the selected assets. Risk
tolerance is one of the key inputs for this process. (Refer to the get_asset_allo
cation function of the "Sample Robo-advisor" Jupyter notebook in the code
repository for this book for more details; a simplified sketch follows Figure 5-8.)
• The dashboard also shows the historical performance of the allocated portfolio
for an initial investment of $100.
Figure 5-8. Robo-advisor asset allocation and portfolio performance panel
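As a simplified sketch of the allocation step, the function below illustrates one way a
risk-tolerance-aware mean-variance allocation could look. The actual
get_asset_allocation function in the notebook is more elaborate; the prices
DataFrame, the closed-form solution, and the no-short-selling heuristic here are all
assumptions for illustration:

import numpy as np
import pandas as pd

def get_asset_allocation_sketch(risk_tolerance, prices):
    # prices: DataFrame of historical prices for the selected assets
    returns = prices.pct_change().dropna()
    mu = returns.mean().values          # expected returns
    cov = returns.cov().values          # covariance matrix
    # maximizing mu'w - (1/(2*lambda)) * w'Cov w gives w* = lambda * inv(Cov) mu,
    # with lambda playing the role of the risk tolerance
    weights = risk_tolerance * np.linalg.solve(cov, mu)
    weights = np.clip(weights, 0, None)  # heuristic: no short positions
    return pd.Series(weights / weights.sum(), index=prices.columns)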
Although the dashboard is a basic version of the robo-advisor dashboard, it performs
end-to-end asset allocation for an investor and provides the portfolio view and his‐
torical performance of the portfolio over a selected period. There are several potential
enhancements to this prototype in terms of the interface and underlying models
used. The dashboard can be enhanced to include additional instruments and incor‐
porate additional features such as real-time portfolio monitoring, portfolio rebalanc‐
ing, and investment advisory. In terms of the underlying models used for asset
allocation, we have used the traditional mean-variance optimization method, but it
can be further enhanced to use the allocation algorithms based on machine learning
techniques such as eigen-portfolio, hierarchical risk parity, or reinforcement learn‐
ing–based models, described in Chapters 7, 8 and 9, respectively. The risk tolerance
model can be further enhanced by using additional features or using the actual data
of the investors rather than using data from the Survey of Consumer Finances.
Conclusion
In this case study, we introduced the regression-based algorithm applied to compute
an investor’s risk tolerance, followed by a demonstration of the model in a robo-
advisor setup. We showed that machine learning models might be able to objectively
analyze the behavior of different investors in a changing market and attribute these
changes to variables involved in determining risk appetite. With an increase in the
volume of investors’ data and the availability of rich machine learning infrastructure,
such models might prove to be more useful than existing manual processes.
We saw that there is a nonlinear relationship between the variables and the risk toler‐
ance. We analyzed the feature importance and found that results of the case study are
quite intuitive. Income and net worth, followed by age and willingness to take risk,
are the key variables to deciding risk tolerance. These variables have been considered
key variables to model risk tolerance across academic and industry literature.
Through the robo-advisor dashboard powered by machine learning, we demon‐
strated an effective combination of data science and machine learning implementa‐
tion in wealth management. Robo-advisors and investment managers could leverage
such models and platforms to enhance the portfolio management process with the
help of machine learning.
Case Study 4: Yield Curve Prediction
A yield curve is a line that plots yields (interest rates) of bonds having equal credit
quality but differing maturity dates. This yield curve is used as a benchmark for other
debt in the market, such as mortgage rates or bank lending rates. The most frequently
reported yield curve compares 3-month, 2-year, 5-year, 10-year, and 30-year U.S.
Treasury debt.
The yield curve is the centerpiece in a fixed income market. Fixed income markets
are important sources of finance for governments, national and supranational insti‐
tutions, banks, and private and public corporations. In addition, yield curves are very
important to investors in pension funds and insurance companies.
The yield curve is a key representation of the state of the bond market. Investors
watch the bond market closely as it is a strong predictor of future economic activity
and levels of inflation, which affect prices of goods, financial assets, and real estate.
The slope of the yield curve is an important indicator of short-term interest rates and
is followed closely by investors.
Hence, an accurate yield curve forecasting is of critical importance in financial appli‐
cations. Several statistical techniques and tools commonly used in econometrics and
finance have been applied to model the yield curve.
In this case study we will use supervised learning–based models to predict the yield
curve. This case study is inspired by the paper "Artificial Neural Networks in Fixed
Income Markets for Yield Curve Forecasting" by Manuel Nunes et al. (2018).
In this case study, we will focus on:
• Simultaneous modeling (producing multiple outputs at the same time) of the
interest rates.
• Comparison of neural network versus linear regression models.
• Modeling a time series in a supervised regression–based framework.
• Understanding the variable intuition and feature selection.
Overall, the case study is similar to the stock price prediction case study presented
earlier in this chapter, with the following differences:
• We predict multiple outputs simultaneously, rather than a single output.
• The predicted variable in this case study is not the return variable.
• Given that we already covered time series models in case study 1, we focus on
artificial neural networks for prediction in this case study.
Blueprint for Using Supervised Learning Models to Predict
the Yield Curve
1. Problem definition
In the supervised regression framework used for this case study, three tenors (1M,
5Y, and 30Y) of the yield curve are the predicted variables. These tenors represent
short-term, medium-term, and long-term tenors of the yield curve.
We need to understand what affects the movement of the yield curve and hence
incorporate as much information into our model as we can. As a high-level overview,
other than the historical price of the yield curve itself, we look at other correlated
variables that can influence the yield curve. The independent or predictor variables
we consider are:
• Previous value of the treasury curve for different tenors. The tenors used are 1-
month, 3-month, 1-year, 2-year, 5-year, 7-year, 10-year, and 30-year yields.
• Percentage of the federal debt held by the public, foreign governments, and the
federal reserve.
• Corporate spread on Baa-rated debt relative to the 10-year treasury rate.
The federal debt and corporate spread are correlated variables and can be potentially
useful in modeling the yield curve. The dataset used for this case study is extracted
from Yahoo Finance and FRED. We will use the daily data of the last 10 years, from
2010 onward.
By the end of this case study, readers will be familiar with a general machine learning
approach to yield curve modeling, from gathering and cleaning data to building and
tuning different models.
2. Getting started—loading the data and Python packages
2.1. Loading the Python packages. The loading of Python packagesis similar to other
case studies in this chapter. Refer to the Jupyter notebook of this case study for more
details.
2.2. Loading the data. The following steps demonstrate the loading of data using Pan‐
das’s DataReader function:
# Get the data using pandas-datareader (FRED)
from pandas_datareader import data as web
tsy_tickers = ['DGS1MO', 'DGS3MO', 'DGS1', 'DGS2', 'DGS5', 'DGS7', 'DGS10',
 'DGS30',
 'TREAST', # Treasury securities held by the Federal Reserve ($MM)
 'FYGFDPUN', # Federal Debt Held by the Public ($MM)
 'FDHBFIN', # Federal Debt Held by International Investors ($BN)
 'GFDEBTN', # Federal Debt: Total Public Debt ($BN)
 'BAA10Y', # Baa Corporate Bond Yield Relative to Yield on 10-Year
 ]
tsy_data = web.DataReader(tsy_tickers, 'fred').dropna(how='all').ffill()
tsy_data['FDHBFIN'] = tsy_data['FDHBFIN'] * 1000
tsy_data['GOV_PCT'] = tsy_data['TREAST'] / tsy_data['GFDEBTN']
tsy_data['HOM_PCT'] = tsy_data['FYGFDPUN'] / tsy_data['GFDEBTN']
tsy_data['FOR_PCT'] = tsy_data['FDHBFIN'] / tsy_data['GFDEBTN']
Next, we define our dependent (Y) and independent (X) variables. The predicted
variables are the rates for three tenors of the yield curve (i.e., 1M, 5Y, and 30Y), as
mentioned before. The number of trading days in a week is assumed to be five, and
we compute the lagged versions of the variables mentioned in the problem definition
section as independent variables, using a five-trading-day lag.
The lagged five-day variables embed the time series component by using a time-delay
approach, where the lagged variable is included as one of the independent variables.
This step reframes the time series data into a supervised regression–based model
framework.
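A minimal sketch of this construction is shown below, assuming the tsy_data frame
built above (the notebook's version is equivalent; the _lag suffix is illustrative):

lag = 5  # five trading days

# predicted variables: current 1M, 5Y, and 30Y rates
Y = tsy_data.loc[:, ['DGS1MO', 'DGS5', 'DGS30']]

# predictor variables: all twelve series, lagged by five trading days
X_cols = ['DGS1MO', 'DGS3MO', 'DGS1', 'DGS2', 'DGS5', 'DGS7', 'DGS10',
          'DGS30', 'GOV_PCT', 'HOM_PCT', 'FOR_PCT', 'BAA10Y']
X = tsy_data.loc[:, X_cols].shift(lag)
X.columns = [col + '_lag' for col in X_cols]

dataset = pd.concat([Y, X], axis=1).dropna()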
3. Exploratory data analysis
We will look at descriptive statistics and data visualization in this section.
3.1. Descriptive statistics. Let us look at the shape and the columns in the dataset:
dataset.shape
Output
(505, 15)
The data contains around 500 observations with 15 columns.
3.2. Data visualization. Let us first plot the predicted variables and see their behavior:
Y.plot(style=['-','--',':'])
Output (time series plot of the 1M, 5Y, and 30Y rates)
In the plot, we see that the deviation among the short-term, medium-term, and long-
term rates was higher in 2010 and has been decreasing since then. There was a drop
in the long-term and medium-term rates during 2011, and they also have been
declining since then. The order of the rates has been in line with the tenors. However,
for a few months in recent years, the 5Y rate has been lower than the 1M rate. In the
time series of all the tenors, we can see that the mean varies with time, resulting in an
upward trend. Thus these series are nonstationary time series.
In some cases, a linear regression on such nonstationary dependent variables might
not be valid. However, we are using lagged variables, which are also nonstationary, as
independent variables. So we are effectively modeling one nonstationary time series
against another, which might still be valid.
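One way to confirm this nonstationarity (not shown in the original text) is an
augmented Dickey-Fuller test, assuming the statsmodels package and the dataset
frame sketched above:

from statsmodels.tsa.stattools import adfuller

adf_stat, p_value = adfuller(dataset['DGS1MO'])[:2]
print('ADF p-value for the 1M rate: %.3f' % p_value)
# a large p-value means we cannot reject the unit root (nonstationary)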
Next, we look at the scatterplots (a correlation plot is skipped for this case study as it
has a similar interpretation to that of a scatterplot). We can visualize the relationship
between all the variables in the regression using the scatter matrix shown below:
# Scatterplot Matrix
pyplot.figure(figsize=(15,15))
scatter_matrix(dataset,figsize=(15,16))
pyplot.show()
Output (scatterplot matrix of all the variables)
Looking at the scatterplot (full-size version available on GitHub), we see a significant
linear relationship of the predicted variables with their lags and with the other tenors
of the yield curve. There is also a linear relationship, with a negative slope, between
the 1M and 5Y rates and both the corporate spread and changes in foreign
government purchases. The 30Y rate shows a linear relationship with these variables
as well, again with a negative slope. Overall, we see a lot of linear relationships, and
we expect the linear models to perform well.
4. Data preparation and analysis
We performed most of the data preparation steps (i.e., getting the dependent and
independent variables) in the preceding steps, and so we’ll skip this step.
5. Evaluate models
In this step we evaluate the models. The Python code for this step is similar to that
in case study 1, and some of the repetitive code is skipped. Readers can also refer to
the Jupyter notebook of this case study in the code repository for this book for more
details.
5.1. Train-test split and evaluation metrics. We will use 80% of the dataset for modeling
and use 20% for testing. We will evaluate algorithms using the mean squared error
metric. All the algorithms use default tuning parameters.
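A minimal sketch of this split, assuming the dataset frame built in the sketch earlier
(the notebook's version is analogous):

Y = dataset[['DGS1MO', 'DGS5', 'DGS30']]
X = dataset.drop(columns=['DGS1MO', 'DGS5', 'DGS30'])

validation_size = 0.2
train_size = int(len(X) * (1 - validation_size))
X_train, X_test = X[0:train_size], X[train_size:]
Y_train, Y_test = Y[0:train_size], Y[train_size:]

num_folds = 10
seed = 7
scoring = 'neg_mean_squared_error'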
5.2. Compare models and algorithms. In this case study, the primary purpose is to com‐
pare the linear models with the artificial neural network in yield curve modeling. So
we stick to the linear regression (LR), regularized regression (LASSO and EN), and
artificial neural network (shown as MLP). We also include a few other models such as
KNN and CART, as these models are simpler with good interpretation, and if there is
a nonlinear relationship between the variables, the CART and KNN models will be
able to capture it and provide a good comparison benchmark for ANN.
Looking at the training and test error, we see a good performance of the linear regres‐
sion model. We see that lasso and elastic net perform poorly. These are regularized
regression models, and they reduce the number of variables in case they are not
important. A decrease in the number of variables might have caused a loss of infor‐
mation leading to poor model performance. KNN and CART are good, but looking
closely, we see that the test errors are higher than the training error. We also see that
the performance of the artificial neural network (MLP) algorithm is comparable to
the linear regression model. Despite its simplicity, the linear regression is a tough
benchmark to beat for one-step-ahead forecasting when there is a significant linear
relationship between the variables.
Output (training and test error comparison across the models; see the notebook)
6. Model tuning and grid search.
Similar to case study 2 of this chapter, we perform a grid search of the ANN model
with different combinations of hidden layers. Several other hyperparameters such as
learning rate, momentum, activation function, number of epochs, and batch size can
be tuned during the grid search process, similar to the steps mentioned below.
'''
hidden_layer_sizes : tuple, length = n_layers - 2, default (100,)
 The ith element represents the number of neurons in the ith
 hidden layer.
'''
param_grid={'hidden_layer_sizes': [(20,), (50,), (20,20), (20, 30, 20)]}
model = MLPRegressor()
kfold = KFold(n_splits=num_folds, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, \
 cv=kfold)
grid_result = grid.fit(X_train, Y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
 print("%f (%f) with: %r" % (mean, stdev, param))
Output
Best: -0.018006 using {'hidden_layer_sizes': (20, 30, 20)}
-0.036433 (0.019326) with: {'hidden_layer_sizes': (20,)}
-0.020793 (0.007075) with: {'hidden_layer_sizes': (50,)}
-0.026638 (0.010154) with: {'hidden_layer_sizes': (20, 20)}
-0.018006 (0.005637) with: {'hidden_layer_sizes': (20, 30, 20)}
The best model has three hidden layers, with 20, 30, and 20 nodes, respectively.
Hence, we prepare a model with this configuration and check its performance on the
test set. This is a crucial step, as a greater number of layers may lead to overfitting
and poor performance on the test set.
Prediction comparison. In the last step we look at the prediction plot of actual data
versus the prediction from both linear regression and ANN models. Refer to the
Jupyter notebook of this case study for the Python code of this section.
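A hedged sketch of that comparison is given below, assuming the tuned configuration
from the grid search and the plotting conventions used earlier in the chapter (the
exact code in the notebook may differ):

model_tuned = MLPRegressor(hidden_layer_sizes=(20, 30, 20))
model_tuned.fit(X_train, Y_train)
pred_ann = model_tuned.predict(X_test)

lr = LinearRegression()
lr.fit(X_train, Y_train)
pred_lr = lr.predict(X_test)

Y_test_arr = np.asarray(Y_test)
for i, tenor in enumerate(['1M', '5Y', '30Y']):
    pyplot.figure(figsize=(10, 4))
    pyplot.plot(Y_test_arr[:, i], label='Actual')
    pyplot.plot(pred_lr[:, i], label='Linear regression')
    pyplot.plot(pred_ann[:, i], label='ANN')
    pyplot.title('Prediction comparison: ' + tenor)
    pyplot.legend()
    pyplot.show()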
Output (prediction versus actual charts for the 1M, 5Y, and 30Y tenors; see the notebook)
Looking at the charts above, we see that the predictions of the linear regression and
the ANN are comparable. For the 1M tenor, the ANN fit is slightly poorer than that
of the regression. However, for the 5Y and 30Y tenors, the ANN performs as well as
the regression model.
Conclusion
In this case study, we applied supervised regression to the prediction of several tenors
of the yield curve. The linear regression model, despite its simplicity, is a tough
benchmark to beat for such one-step-ahead forecasting, given how strongly the next
value depends on the last available value of the variable being predicted. The ANN
results in this case study are comparable to those of the linear regression model. An
additional benefit of ANN is that it is more flexible to changing market conditions.
Also, ANN models can be enhanced by performing grid search on several other
hyperparameters and by incorporating recurrent neural networks, such as LSTM.
Overall, we built a machine learning–based model using ANN with an encouraging
outcome, in the context of fixed income instruments. This allows us to perform pre‐
dictions using historical data to generate results and analyze risk and profitability
before risking any actual capital in the fixed income market.
Chapter Summary
In “Case Study 1: Stock Price Prediction” on page 95, we covered a machine learning
and time series–based framework for stock price prediction. We demonstrated the
significance of visualization and compared time series against the machine learning
models. In “Case Study 2: Derivative Pricing” on page 114, we explored the use of
machine learning for a traditional derivative pricing problem and demonstrated a
high model performance. In “Case Study 3: Investor Risk Tolerance and Robo-
Advisors” on page 125, we demonstrated how supervised learning models can be
used to model the risk tolerance of investors, which can lead to automation of the
portfolio management process. “Case Study 4: Yield Curve Prediction” on page 141
was similar to the stock price prediction case study, providing another example of
comparison of linear and nonlinear models in the context of fixed income markets.
We saw that time series and linear supervised learning models worked well for asset
price prediction problems (i.e., case studies 1 and 4), where the predicted variable
had a significant linear relationship with its lagged component. However, in deriva‐
tive pricing and risk tolerance prediction, where there are nonlinear relationships,
ensemble and ANN models performed better. Readers who are interested in imple‐
menting a case study using supervised regression or time series models are encour‐
aged to understand the nuances in the variable relationships and model intuition
before proceeding to model selection.
Overall, the concepts in Python, machine learning, time series, and finance presented
in this chapter through the case studies can be used as a blueprint for any other
supervised regression–based problem in finance.
Exercises
• Using the concepts and framework of machine learning and time series models
specified in case study 1, develop a predictive model for another asset class—cur‐
rency pair (EUR/USD, for example) or bitcoin.
• In case study 1, add some technical indicators, such as trend or momentum, and
check the enhancement in the model performance. Some of the ideas of the tech‐
nical indicators can be borrowed from “Case Study 3: Bitcoin Trading Strategy”
on page 179 in Chapter 6.
• Using the concepts in “Case Study 2: Derivative Pricing” on page 114, develop a
machine learning–based model to price American options.
• Incorporate multivariate time series modeling using a variant of the ARIMA
model, such as VARMAX, for rates prediction in the yield curve prediction case
study and compare the performance against the machine learning–based models.
• Enhance the robo-advisor dashboard presented in “Case Study 3: Investor Risk
Tolerance and Robo-Advisors” on page 125 to incorporate instruments other
than equities.
CHAPTER 6
Supervised Learning: Classification
Here are some of the key questions that financial analysts attempt to solve:
• Is a borrower going to repay their loan or default on it?
• Will the instrument price go up or down?
• Is this credit card transaction a fraud or not?
All of these problem statements, in which the goal is to predict the categorical class
labels, are inherently suitable for classification-based machine learning.
Classification-based algorithms have been used across many areas within finance that
require predicting a qualitative response. These include fraud detection, default pre‐
diction, credit scoring, directional forecasting of asset price movement, and buy/sell
recommendations. There are many other use cases of classification-based supervised
learning in portfolio management and algorithmic trading.
In this chapter we cover three such classification-based case studies that span a
diverse set of areas, including fraud detection, loan default probability, and formulat‐
ing a trading strategy.
In “Case Study 1: Fraud Detection” on page 153, we use a classification-based algorithm
to predict whether a transaction is fraudulent. The focus of this case study is also to
deal with an unbalanced dataset, given that the fraud dataset is highly unbalanced
with a small number of fraudulent observations.
In “Case Study 2: Loan Default Probability” on page 166, we use a classification-based
algorithm to predict whether a loan will default. The case study focuses on various
techniques and concepts of data processing, feature selection, and exploratory
analysis.
In “Case Study 3: Bitcoin Trading Strategy” on page 179, we use classification-based
algorithms to predict whether the current trading signal of bitcoin is to buy or sell
depending on the relationship between the short-term and long-term price. We pre‐
dict the trend of bitcoin’s price using technical indicators. The prediction model can
easily be transformed into a trading bot that can perform buy, sell, or hold actions
without human intervention.
In addition to focusing on different problem statements in finance, these case studies
will help you understand:
• How to develop new features such as technical indicators for an investment strat‐
egy using feature engineering, and how to improve model performance.
• How to use data preparation and data transformation, and how to perform fea‐
ture reduction and use feature importance.
• How to use data visualization and exploratory data analysis for feature reduction
and to improve model performance.
• How to use algorithm tuning and grid search across various classification-based
models to improve model performance.
• How to handle unbalanced data.
• How to use the appropriate evaluation metrics for classification.
This Chapter’s Code Repository
A Python-based master template for supervised classification
model, along with the Jupyter notebook for the case studies presented in this chapter, is included in the folder Chapter 6 - Sup. Learning - Classification models in the code repository for this book (https://oreil.ly/y19Yc). All
of the case studies presented in this chapter use the standardized
seven-step model development process presented in Chapter 2 (steps or substeps may be reordered or renamed where that makes them more appropriate or intuitive).
For any new classification-basedproblem, the master template
from the code repository can be modified with the elements spe‐
cific to the problem. The templates are designed to run on cloud
infrastructure (e.g., Kaggle, Google Colab, or AWS). In order to
run the template on the local machine, all the packages used within
the template must be installed successfully.
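For instance, a minimal setup sketch, run from a Jupyter notebook cell (the package list is illustrative and may vary by case study):
# install the core packages used by the templates (illustrative list)
!pip install numpy pandas scikit-learn matplotlib seaborn keras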
Case Study 1: Fraud Detection
Fraud is one of the most significant issues the finance sector faces. It is incredibly
costly. According to one study, it is estimated that the typical organization loses 5%
of its annual revenue to fraud each year. When applied to the 2017 estimated Gross
World Product of $79.6 trillion, this translates to potential global losses of up to $4
trillion.
Fraud detection is a task inherently suitable for machine learning, as machine learn‐
ing–based models can scan through huge transactional datasets, detect unusual activ‐
ity, and identify all cases that might be prone to fraud. Also, the computations of
these models are faster compared to traditional rule-based approaches. By collecting
data from various sources and then mapping them to trigger points, machine learn‐
ing solutions are able to discover the rate of defaulting or fraud propensity for each
potential customer and transaction, providing key alerts and insights for the financial
institutions.
In this case study, we will use various classification-based models to detect whether a
transaction is a normal payment or a fraud.
The focuses of this case study are:
• Handling unbalanced data by downsampling/upsampling the data.
• Selecting the right evaluation metric, given that one of the main goals is to
reduce false negatives (cases in which fraudulent transactions incorrectly go
unnoticed).
Blueprint for Using Classification Models to Determine
Whether a Transaction Is Fraudulent
1. Problem definition
In the classification framework defined for this case study, the response (or target)
variable has the column name “Class.” This column has a value of 1 in the case of
fraud and a value of 0 otherwise.
The dataset used is obtained from Kaggle (https://oreil.ly/CeFRs). This dataset holds transactions by European cardholders that occurred over two days in September 2013, with 492 cases of fraud out of 284,807 transactions.
The dataset has been anonymized for privacy reasons. Given that certain feature
names are not provided (i.e., they are called V1, V2, V3, etc.), the visualization and
feature importance will not give much insight into the behavior of the model.
By the end of this case study, readers will be familiar with a general approach to fraud
modeling, from gathering and cleaning data to building and tuning a classifier.
2. Getting started—loading the data and Python packages
2.1. Loading the Python packages. The list of the libraries used for data loading, data
analysis, data preparation, model evaluation, and model tuning are shown below. The
packages used for different purposes have been separated in the Python code below.
The details of most of these packages and functions have been provided in Chapter 2
and Chapter 4:
Packages for data loading, data analysis, and data preparation
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot
from pandas import read_csv, set_option
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
Packages for model evaluation and classification models
from sklearn.model_selection import train_test_split, KFold,\
 cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import classification_report, confusion_matrix,\
 accuracy_score
Packages for deep learning models
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.optimizers import SGD
Packages for saving the model
from pickle import dump
from pickle import load
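2.2. Loading the data. The dataset itself can be loaded with pandas; a minimal sketch, assuming the Kaggle CSV has been saved locally as creditcard.csv (the filename is illustrative):
# load dataset (filename assumed; adjust to your local copy of the Kaggle data)
dataset = read_csv('creditcard.csv')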
3. Exploratory data analysis
The following sections walk through some high-level data inspection.
3.1. Descriptive statistics. The first thing we must do is gather a basic sense of our data.
Remember, except for the transaction time and amount, we do not know the names of the other columns. The only thing we know is that the values of those columns have been scaled. Let's look at the shape and columns of the data:
# shape
dataset.shape
Output
(284807, 31)
#peek at data
set_option('display.width', 100)
dataset.head(5)
Output
5 rows × 31 columns
As shown, the variable names are nondescript (V1, V2, etc.). Also, the data type for
the entire dataset is float, except Class, which is of type integer.
How many are fraud and how many are not fraud? Let us check:
class_names = {0:'Not Fraud', 1:'Fraud'}
print(dataset.Class.value_counts().rename(index = class_names))
Output
Not Fraud 284315
Fraud 492
Name: Class, dtype: int64
Notice the stark imbalance of the data labels. Most of the transactions are nonfraud.
If we use this dataset as the base for our modeling, most models will not place enough
emphasis on the fraud signals; the nonfraud data points will drown out any weight
the fraud signals provide. As is, we may encounter difficulties modeling the predic‐
tion of fraud, with this imbalance leading the models to simply assume all transac‐
tions are nonfraud. This would be an unacceptable result. We will explore some ways
of dealing with this issue in the subsequent sections.
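As an aside, many scikit-learn classifiers also accept a class_weight argument, which reweights the training loss instead of resampling; a minimal sketch of this alternative (not pursued in this case study):
# 'balanced' weights classes inversely proportional to their frequencies
weighted_model = LogisticRegression(class_weight='balanced')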
3.2. Data visualization. Since the feature descriptions are not provided, visualizing the
data will not lead to much insight. This step will be skipped in this case study.
4. Data preparation
This data is from Kaggle and is already in a cleaned format without any empty rows
or columns. Data cleaning or categorization is unnecessary.
5. Evaluate models
Now we are ready to split the data and evaluate the models.
5.1. Train-test split and evaluation metrics. As described in Chapter 2, it is a good idea
to partition the original dataset into training and test sets. The test set is a sample of
the data that we hold back from our analysis and modeling. We use it at the end of
our project to confirm the accuracy of our final model. It is the final test that gives us
confidence in our estimates of accuracy on unseen data. We will use 80% of the data‐
set for model training and 20% for testing:
Y= dataset["Class"]
X = dataset.loc[:, dataset.columns != 'Class']
validation_size = 0.2
seed = 7
X_train, X_validation, Y_train, Y_validation =\
train_test_split(X, Y, test_size=validation_size, random_state=seed)
5.2. Checking models. In this step, we will evaluate different machine learning models.
To optimize the various hyperparameters of the models, we use ten-fold cross valida‐
tion and recalculate the results ten times to account for the inherent randomness in
some of the models and the CV process. All of these models, including cross valida‐
tion, are described in Chapter 4.
Let us design our test harness. We will evaluate algorithms using the accuracy metric.
This is a gross metric that will give us a quick idea of how correct a given model is. It
is useful on binary classification problems.
# test options for classification
num_folds = 10
scoring = 'accuracy'
Let’s create a baseline of performance for this problem and spot-check a number of
differentalgorithms. The selected algorithms include:
Linear algorithms
Logistic regression (LR) and linear discriminant analysis (LDA).
Nonlinear algorithms
Classification and regression trees (CART) and K-nearest neighbors (KNN).
There are good reasons for selecting these models. They are simpler, faster models that remain interpretable on problems with large datasets. CART and KNN will be able to discern any nonlinear relationships between the variables. The
key problem here is using an unbalanced sample. Unless we resolve that, more com‐
plex models, such as ensemble and ANNs, will have poor prediction. We will focus
on addressing this later in the case study and then will evaluate the performance of
these types of models.
# spot-check basic Classification algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
All the algorithms use default tuning parameters. We will display the mean and stan‐
dard deviation of accuracy for each algorithm as we calculate and collect the results
for use later.
results = []
names = []
for name, model in models:
 kfold = KFold(n_splits=num_folds, random_state=seed)
 cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, \
 scoring=scoring)
 results.append(cv_results)
 names.append(name)
 msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
 print(msg)
Output
LR: 0.998942 (0.000229)
LDA: 0.999364 (0.000136)
KNN: 0.998310 (0.000290)
CART: 0.999175 (0.000193)
# compare algorithms
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
fig.set_size_inches(8,4)
pyplot.show()
The overall accuracy is quite high. But let us check how well it predicts the fraud cases. Choosing one of the models, CART, from the results above, we look at its performance on the test set:
# prepare model
model = DecisionTreeClassifier()
model.fit(X_train, Y_train)
# estimate accuracy on validation set
predictions = model.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
Output
0.9992275552122467
 precision recall f1-score support
 0 1.00 1.00 1.00 56862
 1 0.77 0.79 0.78 100
 accuracy 1.00 56962
 macro avg 0.89 0.89 0.89 56962
weighted avg 1.00 1.00 1.00 56962
And producing the confusion matrix yields:
df_cm = pd.DataFrame(confusion_matrix(Y_validation, predictions), \
columns=np.unique(Y_validation), index = np.unique(Y_validation))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
sns.heatmap(df_cm, cmap="Blues", annot=True,annot_kws={"size": 16})
Overall accuracy is strong, but the confusion matrix tells a different story. Despite the high accuracy level, 21 out of 100 instances of fraud are missed and incorrectly predicted as nonfraud. The false negative rate is substantial.
The intention of a fraud detection model is to minimize these false negatives. To do
so, the first step would be to choose the right evaluation metric.
In Chapter 4, we covered the evaluation metrics, such as accuracy, precision, and
recall, for a classification-based problem. Accuracy is the number of correct predic‐
tions made as a ratio of all predictions made. Precision is the number of items cor‐
rectly identified as positive out of total items identified as positive by the model.
Recall is the total number of items correctly identified as positive out of total true
positives.
For this type of problem, we should focus on recall, the ratio of true positives to the
sum of true positives and false negatives. So if false negatives are high, then the value
of recall will be low.
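As a quick illustration, recall can also be computed directly from the confusion-matrix counts; a minimal sketch using the CART predictions from above:
tn, fp, fn, tp = confusion_matrix(Y_validation, predictions).ravel()
print(tp / (tp + fn))  # matches sklearn.metrics.recall_score(Y_validation, predictions)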
In the next step, we perform model tuning, select the model using the recall metric,
and perform under-sampling.
6. Model tuning
The purpose of the model tuning step is to perform the grid search on the model
selected in the previous step. However, since we encountered poor model perfor‐
mance in the previous section due to the unbalanced dataset, we will focus our atten‐
tion on that. We will analyze the impact of choosing the correct evaluation metric
and see the impact of using an adjusted, balanced sample.
6.1. Model tuning by choosing the correct evaluation metric. As mentioned in the preced‐
ing step, if false negatives are high, then the value of recall will be low. Models are
ranked according to this metric:
scoring = 'recall'
Let us spot-check some basic classification algorithms for recall:
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
Running cross validation:
results = []
names = []
for name, model in models:
 kfold = KFold(n_splits=num_folds, random_state=seed)
 cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, \
 scoring=scoring)
 results.append(cv_results)
 names.append(name)
 msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
 print(msg)
Output
LR: 0.595470 (0.089743)
LDA: 0.758283 (0.045450)
KNN: 0.023882 (0.019671)
CART: 0.735192 (0.073650)
We see that the LDA model has the best recall of the four models. We continue by
evaluating the test set using the trained LDA:
# prepare model
model = LinearDiscriminantAnalysis()
model.fit(X_train, Y_train)
# estimate accuracy on validation set
predictions = model.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
Output
0.9995435553526912
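The per-class detail behind the observations below can be printed in the same way as for CART:
print(classification_report(Y_validation, predictions))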
LDA performs better, missing only 18 out of 100 cases of fraud, and produces fewer false positives. However, there is still room for improvement.
6.2. Model tuning—balancing the sample by random under-sampling. The current data
exhibits a significant class imbalance, where there are very few data points labeled
“fraud.” The issue of such class imbalance can result in a serious bias toward the
majority class, reducing the classification performance and increasing the number of
false negatives.
One of the remedies to handle such situations is to under-sample the data. A simple
technique is to under-sample the majority class randomly and uniformly. This might
lead to a loss of information, but it may yield strong results by modeling the minority
class well.
Next, we will implement random under-sampling, which consists of removing data
to have a more balanced dataset. This will help ensure that our models avoid
overfitting.
The steps to implement random under-sampling are:
1. First, we determine the severity of the class imbalance by using value_counts()
on the class column. We determine how many instances are considered fraud
transactions (fraud = 1).
2. We bring the nonfraud transaction observation count to the same amount as
fraud transactions. Assuming we want a 50/50 ratio, this will be equivalent to 492
cases of fraud and 492 cases of nonfraud transactions.
3. We now have a subsample of our dataframe with a 50/50 ratio with regards to
our classes. We train the models on this subsample. Then we perform this itera‐
tion again to shuffle the nonfraud observations in the training sample. We keep
track of the model performance to see whether our models can maintain a cer‐
tain accuracy every time we repeat this process:
df = pd.concat([X_train, Y_train], axis=1)
# the fraud class has 492 rows; take the same number of nonfraud rows
fraud_df = df.loc[df['Class'] == 1]
non_fraud_df = df.loc[df['Class'] == 0][:492]
normal_distributed_df = pd.concat([fraud_df, non_fraud_df])
# Shuffle dataframe rows
df_new = normal_distributed_df.sample(frac=1, random_state=42)
# split out the balanced training dataset
Y_train_new = df_new["Class"]
X_train_new = df_new.loc[:, df_new.columns != 'Class']
Let us look at the distribution of the classes in the dataset:
print('Distribution of the Classes in the subsample dataset')
print(df_new['Class'].value_counts()/len(df_new))
sns.countplot('Class', data=df_new)
pyplot.title('Equally Distributed Classes', fontsize=14)
pyplot.show()
Output
Distribution of the Classes in the subsample dataset
1 0.5
0 0.5
Name: Class, dtype: float64
The data is now balanced, with close to 1,000 observations. We will train all the mod‐
els again, including an ANN. Now that the data is balanced, we will focus on accuracy
as our main evaluation metric, since it considers both false positives and false nega‐
tives. Recall can always be produced if needed:
#setting the evaluation metric
scoring='accuracy'
# spot-check the algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
#Neural Network
models.append(('NN', MLPClassifier()))
# Ensemble Models
# Boosting methods
models.append(('AB', AdaBoostClassifier()))
models.append(('GBM', GradientBoostingClassifier()))
# Bagging methods
models.append(('RF', RandomForestClassifier()))
models.append(('ET', ExtraTreesClassifier()))
The steps to define and compile an ANN-based deep learning model in Keras, along
with all the terms (neurons, activation, momentum, etc.) mentioned in the following
code, have been described in Chapter 3. This code can be leveraged for any deep
learning–based classification model.
Keras-based deep learning model:
# Function to create model, required for KerasClassifier
def create_model(neurons=12, activation='relu', learn_rate = 0.01, momentum=0):
 # create model
 model = Sequential()
 model.add(Dense(X_train.shape[1], input_dim=X_train.shape[1], \
 activation=activation))
 model.add(Dense(32,activation=activation))
 model.add(Dense(1, activation='sigmoid'))
 # Compile model
 optimizer = SGD(lr=learn_rate, momentum=momentum)
 model.compile(loss='binary_crossentropy', optimizer='adam', \
 metrics=['accuracy'])
 return model
models.append(('DNN', KerasClassifier(build_fn=create_model,\
epochs=50, batch_size=10, verbose=0)))
Running the cross validation on the new set of models results in the following:
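The loop is the same one used in section 5.2, now trained on the balanced sample (a minimal sketch):
results = []
names = []
for name, model in models:
 kfold = KFold(n_splits=num_folds, random_state=seed)
 cv_results = cross_val_score(model, X_train_new, Y_train_new, cv=kfold, \
 scoring=scoring)
 results.append(cv_results)
 names.append(name)
 print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))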
Although a couple of models, including random forest (RF) and logistic regression
(LR), perform well, GBM slightly edges out the other models. We select this for fur‐
ther analysis. Note that the result of the deep learning model using Keras (i.e.,
“DNN”) is poor.
A grid search is performed for the GBM model by varying the number of estimators
and maximum depth. The details of the GBM model and the parameters to tune for
this model are described in Chapter 4.
# Grid Search: GradientBoosting Tuning
n_estimators = [20,180,1000]
max_depth= [2, 3,5]
param_grid = dict(n_estimators=n_estimators, max_depth=max_depth)
model = GradientBoostingClassifier()
kfold = KFold(n_splits=num_folds, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, \
 cv=kfold)
grid_result = grid.fit(X_train_new, Y_train_new)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
Output
Best: 0.936992 using {'max_depth': 5, 'n_estimators': 1000}
In the next step, the final model is prepared, and the result on the test set is checked:
# prepare model
model = GradientBoostingClassifier(max_depth= 5, n_estimators = 1000)
model.fit(X_train_new, Y_train_new)
# estimate accuracy on Original validation set
predictions = model.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
Output
0.9668199852533268
The accuracy of the model is high. Let’s look at the confusion matrix:
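The matrix is produced by reusing the heatmap snippet from section 5.2, now with the tuned GBM's predictions:
df_cm = pd.DataFrame(confusion_matrix(Y_validation, predictions), \
columns=np.unique(Y_validation), index = np.unique(Y_validation))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
sns.heatmap(df_cm, cmap="Blues", annot=True, annot_kws={"size": 16})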
Output
The results on the test set are impressive, with a high accuracy and, importantly, no
false negatives. However, we see that an outcome of using our under-sampled data is
a propensity for false positives—cases in which nonfraud transactions are misclassi‐
fied as fraudulent. This is a trade-off the financial institution would have to consider.
There is an inherent trade-off between the operational overhead (and possible customer experience impact) of processing false positives and the financial loss resulting from fraud cases missed through false negatives.
Conclusion
In this case study, we performed fraud detection on credit card transactions. We
illustrated how different classification machine learning models stack up against each
other and demonstrated that choosing the right metric can make an important differ‐
ence in model evaluation. Under-sampling was shown to lead to a significant
improvement, as all fraud cases in the test set were correctly identified after applying
under-sampling. This came with a trade-off, though. The reduction in false negatives
came with an increase in false positives.
Overall, by using different machine learning models, choosing the right evaluation
metrics, and handling unbalanced data, we demonstrated how the implementation of
a simple classification-based model can produce robust results for fraud detection.
Case Study 2: Loan Default Probability
Lending is one of the most important activities of the finance industry. Lenders pro‐
vide loans to borrowers in exchange for the promise of repayment with interest. That
means the lender makes a profit only if the borrower pays off the loan. Hence, the
two most critical questions in the lending industry are:
1. How risky is the borrower?
2. Given the borrower’s risk, should we lend to them?
Default prediction could be described as a perfect job for machine learning, as the
algorithms can be trained on millions of examples of consumer data. Algorithms can
perform automated tasks such as matching data records, identifying exceptions, and
calculating whether an applicant qualifies for a loan. The underlying trends can be
assessed with algorithms and continuously analyzed to detect trends that might influ‐
ence lending and underwriting risk in the future.
The goal of this case study is to build a machine learning model to predict the proba‐
bility that a loan will default.
In most real-life cases, including loan default modeling, we are unable to work with
clean, complete data. Some of the potential problems we are bound to encounter are
missing values, incomplete categorical data, and irrelevant features. Although data
cleaning may not be mentioned often, it is critical for the success of machine learning
applications. The algorithms that we use can be powerful, but without the relevant or
appropriate data, the system may fail to yield ideal results. So one of the focus areas of
this case study will be data preparation and cleaning. Various techniques and con‐
cepts of data processing, feature selection, and exploratory analysis are used for data
cleaning and organizing the feature space.
In this case study, we will focus on:
• Data preparation, data cleaning, and handling a large number of features.
• Data discretization and handling categorical data.
• Feature selection and data transformation.
Blueprint for Creating a Machine Learning Model for
Predicting Loan Default Probability
1. Problem definition
In the classification framework for this case study, the predicted variable is charge-off,
a debt that a creditor has given up trying to collect on after a borrowerhas missed
payments for several months. The predicted variable takes a value of 1 in case of
charge-off and a value of 0 otherwise.
We will analyze data for loans from 2007 to 2017Q3 from Lending Club, available on Kaggle (https://oreil.ly/DG9j5). Lending Club is a US peer-to-peer lending company. It operates an online
lending platform that enables borrowers to obtain a loan and investors to purchase
notes backed by payments made on these loans. The dataset contains more than
887,000 observations with 150 variables containing complete loan data for all loans
issued over the specified time period. The features include income, age, credit scores,
home ownership, borrower location, collections, and many others. We will investi‐
gate these 150 predictor variables for feature selection.
By the end of this case study, readers will be familiar with a general approach to loan
default modeling, from gathering and cleaning data to building and tuning a
classifier.
2. Getting started—loading the data and Python packages
2.1. Loading the Python packages. The standard Python packages are loaded in this
step. The details have been presented in the previous case studies. Please refer to the
Jupyter notebook for this case study for more details.
2.2. Loading the data. The loan data for the time period from 2007 to 2017Q3 is
loaded:
# load dataset
dataset = pd.read_csv('LoansData.csv.gz', compression='gzip', \
low_memory=True)
3. Data preparation and feature selection
In the first step, let us look at the size of the data:
dataset.shape
Output
(1646801, 150)
Given that there are 150 features for each loan, we will first focus on limiting the fea‐
ture space, followed by the exploratory analysis.
3.1. Preparing the predicted variable. Here, we look at the details of the predicted variable and prepare it. The predicted variable, which is also used later for correlation-based feature reduction, will be derived from the loan_status column. Let's check the value distributions:
dataset['loan_status'].value_counts(dropna=False)
Output
Current 788950
Fully Paid 646902
Charged Off 168084
Late (31-120 days) 23763
In Grace Period 10474
Late (16-30 days) 5786
Does not meet the credit policy. Status:Fully Paid 1988
Does not meet the credit policy. Status:Charged Off 761
Default 70
NaN 23
Name: loan_status, dtype: int64
From the data definition documentation:
Fully Paid
Loans that have been fully repaid.
Default
Loans that have not been current for 121 days or more.
Charged Off
Loans for which there is no longer a reasonable expectation of further payments.
A large proportion of observations show a status of Current, and we do not know
whether those will be Charged Off, Fully Paid, or Default in the future. The obser‐
vations for Default are tiny in number compared to Fully Paid or Charged Off and
are not considered. The remaining categories of loan status are not of prime
importance for this analysis. So, in order to convert this to a binary classification
problem and to analyze in detail the effect of important variables on the loan status,
we will consider only two major categories—Charged Off and Fully Paid:
dataset = dataset.loc[dataset['loan_status'].isin(['Fully Paid', 'Charged Off'])]
dataset['loan_status'].value_counts(normalize=True, dropna=False)
Output
Fully Paid 0.793758
Charged Off 0.206242
Name: loan_status, dtype: float64
About 79% of the remaining loans have been fully paid and 21% have been charged
off, so we have a somewhat unbalanced classification problem, but it is not as unbal‐
anced as the dataset of fraud detection we saw in the previous case study.
In the next step, we create a new binary column in the dataset, where we categorize
Fully Paid as 0 and Charged Off as 1. This column represents the predicted variable
for this classification problem. A value of 1 in this column indicates the borrower has
defaulted:
dataset['charged_off'] = (dataset['loan_status'] == 'Charged Off').apply(np.uint8)
dataset.drop('loan_status', axis=1, inplace=True)
3.2. Feature selection—limit the feature space. The full dataset has 150 features for each
loan, but not all features contribute to the prediction variable. Removing features of
low importance can improve accuracy and reduce both model complexity and over‐
fitting. Training time can also be reduced for very large datasets. We’ll eliminate fea‐
tures in the following steps using three different approaches:
• Eliminating features that have more than 30% missing values.
• Eliminating features that are unintuitive based on subjective judgment.
• Eliminating features with low correlation with the predicted variable.
3.2.1. Feature elimination based on significant missing values. First, we calculate the per‐
centage of missing data for each feature:
missing_fractions = dataset.isnull().mean().sort_values(ascending=False)
#Drop the missing fraction
drop_list = sorted(list(missing_fractions[missing_fractions > 0.3].index))
dataset.drop(labels=drop_list, axis=1, inplace=True)
dataset.shape
Output
(814986, 92)
This dataset has 92 columns remaining once some of the columns with a significant
number of missing values are dropped.
3.2.2. Feature elimination based on intuitiveness. To filter the features further we check
the description in the data dictionary and keep the features that intuitively contribute
to the prediction of default. We keep features that contain the relevant credit detail of
the borrower, including annual income, FICO score, and debt-to-income ratio. We
also keep those features that are available to investors when considering an invest‐
ment in the loan. These include features in the loan application and any features
added by Lending Club when the loan listing was accepted, such as loan grade and
interest rate.
The list of the features retained are shown in the following code snippet:
keep_list = ['charged_off','funded_amnt','addr_state', 'annual_inc', \
'application_type','dti', 'earliest_cr_line', 'emp_length',\
'emp_title', 'fico_range_high',\
'fico_range_low', 'grade', 'home_ownership', 'id', 'initial_list_status', \
'installment', 'int_rate', 'loan_amnt', 'loan_status',\
'mort_acc', 'open_acc', 'pub_rec', 'pub_rec_bankruptcies', \
'purpose', 'revol_bal', 'revol_util', \
'sub_grade', 'term', 'title', 'total_acc',\
'verification_status', 'zip_code','last_pymnt_amnt',\
'num_actv_rev_tl', 'mo_sin_rcnt_rev_tl_op',\
'mo_sin_old_rev_tl_op',"bc_util","bc_open_to_buy",\
"avg_cur_bal","acc_open_past_24mths" ]
drop_list = [col for col in dataset.columns if col not in keep_list]
dataset.drop(labels=drop_list, axis=1, inplace=True)
dataset.shape
Output
(814986, 39)
After removing the features in this step, 39 columns remain.
3.2.3. Feature elimination based on the correlation. The next step is to check the correla‐
tion with the predicted variable. Correlation gives us the interdependence between
the predicted variable and the feature. We select features with a moderate-to-strong
relationship with the target variable and drop those that have a correlation of less
than 3% with the predicted variable:
correlation = dataset.corr()
correlation_chargeOff = abs(correlation['charged_off'])
drop_list_corr = sorted(list(correlation_chargeOff\
 [correlation_chargeOff < 0.03].index))
print(drop_list_corr)
Output
['pub_rec', 'pub_rec_bankruptcies', 'revol_bal', 'total_acc']
The columns with low correlation are dropped from the dataset, and we are leftwith
only 35 columns:
dataset.drop(labels=drop_list_corr, axis=1, inplace=True)
4. Feature selection and exploratory analysis
In this step, we perform the exploratory data analysis of the feature selection. Given
that many features had to be eliminated, it is preferable that we perform the explora‐
tory data analysis after feature selection to better visualize the relevant features. We
will also continue the feature elimination by visually screening and dropping those
features deemed irrelevant.
4.1. Feature analysis and exploration. In the following sections, we take a deeper dive
into the dataset features.
4.1.1. Analyzing the categorical features. Let us look at the some of the categorical fea‐
tures in the dataset.
First, let’s look at the id, emp_title, title, and zip_code features:
dataset[['id','emp_title','title','zip_code']].describe()
Output
id emp_title title zip_code
count 814986 766415 807068 814986
unique 814986 280473 60298 925
top 14680062 Teacher Debt consolidation 945xx
freq 1 11351 371874 9517
IDs are all unique and irrelevant for modeling. There are too many unique values for
employment titles and titles. Occupation and job title may provide some information
for default modeling; however, we assume much of this information is embedded in
the verified income of the customer. Moreover, additional cleaning steps on these
features, such as standardizing or grouping the titles, would be necessary to extract
any marginal information. This work is outside the scope of this case study but could
be explored in subsequent iterations of the model.
Geography could play a role in credit determination, and zip codes provide a granu‐
lar view of this dimension. Again, additional work would be necessary to prepare this
feature for modeling and was deemed outside the scope of this case study.
dataset.drop(['id','emp_title','title','zip_code'], axis=1, inplace=True)
Let’s look at the term feature.
Term refers to the number of payments on the loan. Values are in months and can be
either 36 or 60. The 60-month loans are more likely to charge off.
Let’s convert term to integers and group by the term for further analysis:
dataset['term'] = dataset['term'].apply(lambda s: np.int8(s.split()[0]))
dataset.groupby('term')['charged_off'].value_counts(normalize=True).loc[:,1]
Output
term
36 0.165710
60 0.333793
Name: charged_off, dtype: float64
Loans with five-year terms are more than twice as likely to charge off as loans with three-year terms. This feature seems to be important for the prediction.
Let’s look at the emp_length feature:
dataset['emp_length'].replace(to_replace='10+ years', value='10 years',\
 inplace=True)
dataset['emp_length'].replace('< 1 year', '0 years', inplace=True)
def emp_length_to_int(s):
 if pd.isnull(s):
 return s
 else:
 return np.int8(s.split()[0])
dataset['emp_length'] = dataset['emp_length'].apply(emp_length_to_int)
charge_off_rates = dataset.groupby('emp_length')['charged_off'].value_counts\
 (normalize=True).loc[:,1]
sns.barplot(x=charge_off_rates.index, y=charge_off_rates.values)
Output
Loan status does not appear to vary much with employment length (on average);
hence this feature is dropped:
dataset.drop(['emp_length'], axis=1, inplace=True)
Let’s look at the sub_grade feature:
charge_off_rates = dataset.groupby('sub_grade')['charged_off'].value_counts\
(normalize=True).loc[:,1]
sns.barplot(x=charge_off_rates.index, y=charge_off_rates.values)
Output
As shown in the chart, there’s a clear trend of higher probability of charge-off as the
sub-grade worsens, and so it is considered to be a key feature.
4.1.2. Analyzing the continuous features. Let’s look at the annual_inc feature:
dataset[['annual_inc']].describe()
Output
annual_inc
count 8.149860e+05
mean 7.523039e+04
std 6.524373e+04
min 0.000000e+00
25% 4.500000e+04
50% 6.500000e+04
75% 9.000000e+04
max 9.550000e+06
Annual income ranges from $0 to $9,550,000, with a median of $65,000. Because of
the large range of incomes, we use a log transform of the annual income variable:
dataset['log_annual_inc'] = dataset['annual_inc'].apply(lambda x: np.log10(x+1))
dataset.drop('annual_inc', axis=1, inplace=True)
Let’s look at the FICO score (fico_range_low, fico_range_high) feature:
dataset[['fico_range_low','fico_range_high']].corr()
Output
fico_range_low fico_range_high
fico_range_low 1.0 1.0
fico_range_high 1.0 1.0
Given that the correlation between FICO low and high is 1, it is preferred that we
keep only one feature, which we take as the average of FICO scores:
dataset['fico_score'] = 0.5*dataset['fico_range_low'] +\
 0.5*dataset['fico_range_high']
dataset.drop(['fico_range_high', 'fico_range_low'], axis=1, inplace=True)
4.2. Encoding categorical data. In order to use a feature in the classification models, we
need to convert the categorical data (i.e., text features) to its numeric representation.
This process is called encoding. There are different ways of encoding. For this case study we will use a label encoder, which encodes labels with integer values between 0 and n−1, where n is the number of distinct labels. The LabelEncoder function from sklearn is used in the following steps, and all the categorical columns are encoded at once. First, we identify the categorical columns:
from sklearn.preprocessing import LabelEncoder
# Categorical boolean mask
categorical_feature_mask = dataset.dtypes==object
# filter categorical columns using mask and turn it into a list
categorical_cols = dataset.columns[categorical_feature_mask].tolist()
Let us look at the categorical columns:
categorical_cols
Output
['grade',
 'sub_grade',
 'home_ownership',
 'verification_status',
 'purpose',
 'addr_state',
 'initial_list_status',
 'application_type']
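The encoding itself is then applied in one pass; a minimal sketch using the categorical_cols list from above:
le = LabelEncoder()
# encode each categorical column in place with integer labels
dataset[categorical_cols] = dataset[categorical_cols].apply(lambda col: le.fit_transform(col))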
4.3. Sampling data. Given that the loan data is skewed, it is sampled to have an equal number of charge-off and no charge-off observations. Sampling leads to a more balanced dataset and avoids overfitting (sampling is covered in detail in “Case Study 1: Fraud Detection” on page 153):
loanstatus_0 = dataset[dataset["charged_off"]==0]
loanstatus_1 = dataset[dataset["charged_off"]==1]
subset_of_loanstatus_0 = loanstatus_0.sample(n=5500)
subset_of_loanstatus_1 = loanstatus_1.sample(n=5500)
dataset = pd.concat([subset_of_loanstatus_1, subset_of_loanstatus_0])
dataset = dataset.sample(frac=1).reset_index(drop=True)
print("Current shape of dataset :",dataset.shape)
Although sampling has its advantages, there are disadvantages as well. Sampling may exclude data that is not homogeneous with the data retained, which affects the accuracy of the results. Choosing a proper sample size is also difficult. Hence, sampling should be performed with caution and should generally be avoided in the case of a relatively balanced dataset.
5. Evaluate algorithms and models
5.1. Train-test split. Splitting out the validation dataset for the model evaluation is the
next step:
Y= dataset["charged_off"]
X = dataset.loc[:, dataset.columns != 'charged_off']
validation_size = 0.2
seed = 7
X_train, X_validation, Y_train, Y_validation = \
train_test_split(X, Y, test_size=validation_size, random_state=seed)
5.2. Test options and evaluation metrics. In this step, the test options and evaluation
metrics are selected. The roc_auc evaluation metric is selected for this classification.
The details of this metric were provided in Chapter 4. This metric represents a
model’s ability to discriminate between positive and negative classes. An roc_auc of
1.0 represents a model that made all predictions perfectly, and a value of 0.5 repre‐
sents a model that is as good as random.
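For reference, roc_auc is computed from predicted scores or probabilities rather than hard class labels; a minimal sketch for any fitted classifier exposing predict_proba (fitted_model is a placeholder name):
from sklearn.metrics import roc_auc_score
# probability of the positive (charge-off) class from a placeholder fitted model
probs = fitted_model.predict_proba(X_validation)[:, 1]
print(roc_auc_score(Y_validation, probs))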
num_folds = 10
scoring = 'roc_auc'
The model cannot afford a high number of false negatives, as these hurt the investors and the credibility of the company. So recall, as used in the fraud detection case study, is another metric worth monitoring.
5.3. Compare models and algorithms. Let us spot-check the classification algorithms.
We include ANN and ensemble models in the list of models to be checked:
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
# Neural Network
models.append(('NN', MLPClassifier()))
# Ensemble Models
# Boosting methods
models.append(('AB', AdaBoostClassifier()))
models.append(('GBM', GradientBoostingClassifier()))
# Bagging methods
models.append(('RF', RandomForestClassifier()))
models.append(('ET', ExtraTreesClassifier()))
After performing the k-fold cross validation on the models shown above, the overall
performance is as follows:
The gradient boosting method (GBM) model performs best, and we select it for grid
search in the next step. The details of GBM along with the model parameters are
described in Chapter 4.
6. Model tuning and grid search
We tune the number of estimators and the maximum depth hyperparameters, which were discussed in Chapter 4:
# Grid Search: GradientBoosting Tuning
n_estimators = [20,180]
max_depth= [3,5]
param_grid = dict(n_estimators=n_estimators, max_depth=max_depth)
model = GradientBoostingClassifier()
kfold = KFold(n_splits=num_folds, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, \
 cv=kfold)
grid_result = grid.fit(X_train, Y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
Output
Best: 0.952950 using {'max_depth': 5, 'n_estimators': 180}
A GBM model with max_depth of 5 and 180 estimators results in the best model.
7. Finalize the model
Now, we perform the final steps for selecting a model.
7.1. Results on the test dataset. Let us prepare the GBM model with the parameters
found during the grid search step and check the results on the test dataset:
model = GradientBoostingClassifier(max_depth= 5, n_estimators= 180)
model.fit(X_train, Y_train)
# estimate accuracy on validation set
predictions = model.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
Output
0.889090909090909
The accuracy of the model is a reasonable 89% on the test set. Let us examine the
confusion matrix:
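A quick way to print it (the heatmap version from the fraud case study works equally well):
print(confusion_matrix(Y_validation, predictions))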
Looking at the confusion matrix and the overall result on the test set, both the false positive rate and the false negative rate are low; the overall model performance looks good and is in line with the training set results.
7.2. Variable intuition/feature importance. In this step, we compute and display the
variable importance of our trained model:
print(model.feature_importances_) #use inbuilt class feature_importances
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
#plot graph of feature importances for better visualization
feat_importances.nlargest(10).plot(kind='barh')
pyplot.show()
Output
The results of the model importance are intuitive. The last payment amount seems to
be the most important feature, followed by loan term and sub-grade.
Conclusion
In this case study, we introduced the classification-based tree algorithm applied to
loan default prediction. We showed that data preparation is one of the most impor‐
tant steps. We addressed this by performing feature elimination using different tech‐
niques, such as feature intuition, correlation analysis, visualization, and data quality
checks of the features. We illustrated that there can be different ways of handling and
analyzing the categorical data and converting categorical data into a usable format for
the models.
We emphasized that performing data processing and establishing an understanding
of variable importance is key in the model development process. A focus on these
steps led to the implementation of a simple classification-based model that produced
robust results for default prediction.
Case Study 3: Bitcoin Trading Strategy
First released as open source in 2009 by the pseudonymous Satoshi Nakamoto, bit‐
coin is the longest-running and most well-known cryptocurrency.
A major drawback of cryptocurrency trading is the volatility of the market. Since cryptocurrency markets trade 24/7, tracking positions against quickly changing market dynamics can rapidly become unmanageable. This is where automated trading algorithms and trading bots can assist.
Various machine learning algorithms can be used to generate trading signals in an
attempt to predict the market’s movement. One could use machine learning algo‐
rithms to classify the next day’s movement into three categories: market will rise
(take a long position), market will fall (take a short position), or market will move
sideways (take no position). Since we know the market direction, we can decide the
optimum entry and exit points.
A key aspect of machine learning is feature engineering: creating new, intuitive features and feeding them to a machine learning algorithm in order to achieve better results. We can introduce different technical indicators as features
to help predict future prices of an asset. These technical indicators are derived from
market variables such as price or volume and have additional information or signals
embedded in them. There are many different categories of technical indicators,
including trend, volume, volatility, and momentum indicators.
In this case study, we will use various classification-based models to predict whether
the current position signal is buy or sell. We will create additional trend and momen‐
tum indicators from market prices to leverage as additional features in the prediction.
In this case study, we will focus on:
• Building a trading strategy using classification (classification of long/short sig‐
nals).
• Feature engineering and constructing technical indicators of trend, momentum,
and mean reversion.
• Building a framework for backtesting results of a trading strategy.
• Choosing the right evaluation metric to assess a trading strategy.
Blueprint for Using Classification-Based Models to Predict
Whether to Buy or Sell in the Bitcoin Market
1. Problem definition
The problem of predicting a buy or sell signal for a trading strategy is defined in the
classification framework, where the predicted variable has a value of 1 for buy and 0
for sell. This signal is decided through the comparison of the short-term and long-
term price trends.
The data used is from one of the largest bitcoin exchanges in terms of average daily
volume, Bitstamp. The data covers prices from January 2012 to May 2017. Different
trend and momentum indicators are created from the data and are added as features
to enhance the performance of the prediction model.
By the end of this case study, readers will be familiar with a general approach to
building a trading strategy, from cleaning data and feature engineering to model tun‐
ing and developing a backtesting framework.
2. Getting started—loading the data and Python packages
Let’s load the packages and the data.
2.1. Loading the Python packages. The standard Python packages are loaded in this
step. The details have been presented in the previous case studies. Refer to the Jupyter
notebook for this case study for more details.
2.2. Loading the data. The bitcoin data fetched from the Bitstamp website (https://www.bitstamp.com) is loaded in this step:
# load dataset
dataset = pd.read_csv('BitstampData.csv')
3. Exploratory data analysis
In this step, we will take a detailed look at this data.
3.1. Descriptive statistics. First, let us look at the shape of the data:
dataset.shape
Output
(2841377, 8)
# peek at data
set_option('display.width', 100)
dataset.tail(2)
Output
Timestamp Open High Low Close Volume_(BTC) Volume_(Currency) Weighted_Price
2841372 1496188560 2190.49 2190.49 2181.37 2181.37 1.700166 3723.784755 2190.247337
2841373 1496188620 2190.50 2197.52 2186.17 2195.63 6.561029 14402.811961 2195.206304
The dataset has minute-by-minute data of OHLC (Open, High, Low, Close) and
traded volume of bitcoin. The dataset is large, with approximately 2.8 million total
observations.
4. Data preparation
In this part, we will clean the data to prepare for modeling.
4.1. Data cleaning. We clean the data by filling the NaNs with the last available values:
dataset[dataset.columns.values] = dataset[dataset.columns.values].ffill()
The Timestamp column is not useful for modeling and is dropped from the dataset:
dataset=dataset.drop(columns=['Timestamp'])
4.2. Preparing the data for classification. As a first step, we will create the target variable for our model. This is the column that will indicate whether the trading signal is buy or sell. We define the short-term price as the 10-period rolling average and the long-term price as the 60-period rolling average (the data is at minute frequency, so each period corresponds to one minute). We attach a label of 1 (0) if the short-term price is higher (lower) than the long-term price:
# Create short simple moving average over the short window
dataset['short_mavg'] = dataset['Close'].rolling(window=10, min_periods=1,\
center=False).mean()
# Create long simple moving average over the long window
dataset['long_mavg'] = dataset['Close'].rolling(window=60, min_periods=1,\
center=False).mean()
# Create signals
dataset['signal'] = np.where(dataset['short_mavg'] >
dataset['long_mavg'], 1.0, 0.0)
4.3. Feature engineering. We begin feature engineering by analyzing the features we
expect may influence the performance of the prediction model. Based on a concep‐
tual understanding of key factors that drive investment strategies, the task at hand is
to identify and construct new features that may capture the risks or characteristics
embodied by these return drivers. For this case study, we will explore the efficacy of
specific momentum technical indicators.
The current data of the bitcoin consists of date, open, high, low, close, and volume.
Using this data, we calculate the following momentum indicators:
Moving average
A moving average provides an indication of a price trend by cutting down the
amount of noise in the series.
Stochastic oscillator %K
A stochastic oscillator is a momentum indicator that compares the closing price
of a security to a range of its previous prices over a certain period of time. %K
and %D are slow and fast indicators. The fast indicator is more sensitive than the
slow indicator to changes in the price of the underlying security and will likely
result in many transaction signals.
Relative strength index (RSI)
This is a momentum indicator that measures the magnitude of recent price
changes to evaluate overbought or oversold conditions in the price of a stock or
other asset. The RSI ranges from 0 to 100. An asset is deemed to be overbought
once the RSI approaches 70, meaning that the asset may be getting overvalued
and is a good candidate for a pullback. Likewise, if the RSI approaches 30, it is an
indication that the asset may be getting oversold and is therefore likely to become
undervalued.
Rate of change (ROC)
This is a momentum oscillator that measures the percentage change between the
current price and the n period past prices. Assets with higher ROC values are
considered more likely to be overbought; those with lower ROC, more likely to
be oversold.
Momentum (MOM)
This is the rate of acceleration of a security’s price or volume—that is, the speed
at which the price is changing.
The following steps show how to generate some useful features for prediction. The
features for trend and momentum can be leveraged for other trading strategy models:
#calculation of exponential moving average
def EMA(df, n):
 EMA = pd.Series(df['Close'].ewm(span=n, min_periods=n).mean(), name='EMA_'\
 + str(n))
 return EMA
dataset['EMA10'] = EMA(dataset, 10)
dataset['EMA30'] = EMA(dataset, 30)
dataset['EMA200'] = EMA(dataset, 200)
dataset.head()
#calculation of rate of change
def ROC(df, n):
 M = df.diff(n - 1)
 N = df.shift(n - 1)
 ROC = pd.Series(((M / N) * 100), name = 'ROC_' + str(n))
 return ROC
dataset['ROC10'] = ROC(dataset['Close'], 10)
dataset['ROC30'] = ROC(dataset['Close'], 30)
#calculation of price momentum
def MOM(df, n):
 MOM = pd.Series(df.diff(n), name='Momentum_' + str(n))
 return MOM
dataset['MOM10'] = MOM(dataset['Close'], 10)
dataset['MOM30'] = MOM(dataset['Close'], 30)
#calculation of relative strength index
def RSI(series, period):
 delta = series.diff().dropna()
 u = delta * 0
 d = u.copy()
 u[delta > 0] = delta[delta > 0]
 d[delta < 0] = -delta[delta < 0]
 u[u.index[period-1]] = np.mean( u[:period] ) # seed value: average of first-period gains
 u = u.drop(u.index[:(period-1)])
 d[d.index[period-1]] = np.mean( d[:period] ) # seed value: average of first-period losses
 d = d.drop(d.index[:(period-1)])
 rs = u.ewm(com=period-1, adjust=False).mean() / \
 d.ewm(com=period-1, adjust=False).mean()
 return 100 - 100 / (1 + rs)
dataset['RSI10'] = RSI(dataset['Close'], 10)
dataset['RSI30'] = RSI(dataset['Close'], 30)
dataset['RSI200'] = RSI(dataset['Close'], 200)
#calculation of stochastic oscillator
def STOK(close, low, high, n):
 STOK = ((close - low.rolling(n).min()) / (high.rolling(n).max() - \
 low.rolling(n).min())) * 100
 return STOK
def STOD(close, low, high, n):
 STOK = ((close - low.rolling(n).min()) / (high.rolling(n).max() - \
 low.rolling(n).min())) * 100
 STOD = STOK.rolling(3).mean()
 return STOD
dataset['%K10'] = STOK(dataset['Close'], dataset['Low'], dataset['High'], 10)
dataset['%D10'] = STOD(dataset['Close'], dataset['Low'], dataset['High'], 10)
dataset['%K30'] = STOK(dataset['Close'], dataset['Low'], dataset['High'], 30)
dataset['%D30'] = STOD(dataset['Close'], dataset['Low'], dataset['High'], 30)
dataset['%K200'] = STOK(dataset['Close'], dataset['Low'], dataset['High'], 200)
dataset['%D200'] = STOD(dataset['Close'], dataset['Low'], dataset['High'], 200)
#calculation of moving average
def MA(df, n):
 MA = pd.Series(df['Close'].rolling(n, min_periods=n).mean(), name='MA_'\
 + str(n))
 return MA
dataset['MA10'] = MA(dataset, 10)
dataset['MA30'] = MA(dataset, 30)
dataset['MA200'] = MA(dataset, 200)
With our features completed, we’ll prepare them for use.
4.4. Data visualization. In this step, we visualize different properties of the features
and the predicted variable:
dataset[['Weighted_Price']].plot(grid=True)
plt.show()
Output
The chart illustrates a sharp rise in the price of bitcoin, increasing from close to $0 to
around $2,500 in 2017. Also, high price volatility is readily visible.
Let us look at the distribution of the predicted variable:
fig = plt.figure()
plot = dataset.groupby(['signal']).size().plot(kind='barh', color='red')
plt.show()
Output
The predicted variable is 1 more than 52% of the time, meaning there are more buy
signals than sell signals. The predicted variable is relatively balanced, especially as
compared to the fraud dataset we saw in the first case study.
5. Evaluate algorithms and models
In this step, we will evaluate different algorithms.
5.1. Train-test split. We first split the dataset into training (80%) and test (20%) sets.
For this case study we use 100,000 observations for a faster calculation. The next steps
would be the same if we were to use the entire dataset:
# split out validation dataset for the end
subset_dataset= dataset.iloc[-100000:]
Y= subset_dataset["signal"]
X = subset_dataset.loc[:, dataset.columns != 'signal']
validation_size = 0.2
seed = 1
X_train, X_validation, Y_train, Y_validation =\
train_test_split(X, Y, test_size=validation_size, random_state=seed)
5.2. Test options and evaluation metrics. Accuracy can be used as the evaluation metric
since there is not a significant class imbalance in the data:
# test options for classification
num_folds = 10
scoring = 'accuracy'
5.3. Compare models and algorithms. In order to know which algorithm is best for our
strategy, we evaluate the linear, nonlinear, and ensemble models.
5.3.1. Models. Checking the classification algorithms:
models = []
models.append(('LR', LogisticRegression(n_jobs=-1)))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
#Neural Network
models.append(('NN', MLPClassifier()))
# Ensemble Models
# Boosting methods
models.append(('AB', AdaBoostClassifier()))
models.append(('GBM', GradientBoostingClassifier()))
# Bagging methods
models.append(('RF', RandomForestClassifier(n_jobs=-1)))
After performing the k-fold cross validation, the comparison of the models is as
follows:
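The cross-validation loop itself is not reproduced here; the following is a minimal
sketch of how such a comparison can be produced with scikit-learn, using the models,
num_folds, and scoring variables defined above (the resulting scores are typically
summarized in a boxplot):
from sklearn.model_selection import KFold, cross_val_score

results = []
names = []
for name, model in models:
    # evaluate each model with k-fold cross validation on the training set
    kfold = KFold(n_splits=num_folds)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))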
Although some of the models show promising results, we prefer an ensemble model
given the huge size of the dataset, the large number of features, and an expected non‐
linear relationship between the predicted variable and the features. Random forest
has the best performance among the ensemble models.
6. Model tuning and grid search
A grid search is performed for the random forest model by varying the number of
estimators and maximum depth. The details of the random forest model and the
parameters to tune are discussed in Chapter 4:
n_estimators = [20,80]
max_depth= [5,10]
criterion = ["gini","entropy"]
param_grid = dict(n_estimators=n_estimators, max_depth=max_depth, \
 criterion = criterion )
model = RandomForestClassifier(n_jobs=-1)
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, \
 scoring=scoring, cv=kfold)
grid_result = grid.fit(X_train, Y_train)
print("Best: %f using %s" % (grid_result.best_score_,\
 grid_result.best_params_))
Output
Best: 0.903438 using {'criterion': 'gini', 'max_depth': 10, 'n_estimators': 80}
7. Finalize the model
Let us finalize the model with the best parameters found during the tuning step and
perform the variable intuition.
7.1. Results on the test dataset. In this step, we evaluate the selected model on the test
set:
# prepare model
model = RandomForestClassifier(criterion='gini', n_estimators=80, max_depth=10)
#model = LogisticRegression()
model.fit(X_train, Y_train)
# estimate accuracy on validation set
predictions = model.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
Output
0.9075
The selected model performs quite well, with an accuracy of 90.75%. Let us look at
the confusion matrix:
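The confusion matrix figure is not reproduced here; a minimal sketch of how it can
be computed from the predictions above using scikit-learn:
from sklearn.metrics import confusion_matrix

# rows correspond to the actual class, columns to the predicted class
print(confusion_matrix(Y_validation, predictions))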
The overall model performance is reasonable and is in line with the training set
results.
7.2. Variable intuition/feature importance. Let us look into the feature importance of the
model:
Importance = pd.DataFrame({'Importance':model.feature_importances_*100},\
 index=X.columns)
Importance.sort_values('Importance', axis=0, ascending=True).plot(kind='barh', \
color='r')
plt.xlabel('Variable Importance')
Output
The result of the variable importance looks intuitive, and the momentum indicators
of RSI and MOM over the last 30 days seem to be the two most important features.
The feature importance chart corroborates the fact that introducing new features
leads to an improvement in the model performance.
7.3. Backtesting results. In this additional step, we perform a backtest on the model
we’ve developed. We create a column for strategy returns by multiplying the daily
returns by the position held at the close of business the previous day, and we
compare these strategy returns against the actual returns.
Backtesting Trading Strategies
A backtesting approach similar to the one presented in this case
study can be used to quickly backtest any trading strategy.
backtestdata = pd.DataFrame(index=X_validation.index)
backtestdata['signal_pred'] = predictions
backtestdata['signal_actual'] = Y_validation
backtestdata['Market Returns'] = X_validation['Close'].pct_change()
backtestdata['Actual Returns'] = backtestdata['Market Returns'] *\
backtestdata['signal_actual'].shift(1)
backtestdata['Strategy Returns'] = backtestdata['Market Returns'] * \
backtestdata['signal_pred'].shift(1)
backtestdata=backtestdata.reset_index()
backtestdata.head()
backtestdata[['Strategy Returns','Actual Returns']].cumsum().hist()
backtestdata[['Strategy Returns','Actual Returns']].cumsum().plot()
Output
Looking at the backtesting results, the strategy returns do not deviate significantly
from the actual market returns. The momentum trading strategy we developed was
largely successful at predicting the price direction for buy and sell decisions.
However, as our accuracy is not 100% (although above 90%), the strategy incurred
some losses relative to the actual returns.
Conclusion
This case study demonstrated that framing the problem is a key step when tackling a
finance problem with machine learning. In doing so, it was determined that trans‐
forming the labels according to an investment objective and performing feature engi‐
neering were required for this trading strategy. We demonstrated the efficiency of
using intuitive features related to the trend and momentum of the price movement.
This helped increase the predictive power of the model. Finally, we introduced a
backtesting framework, which allowed us to simulate a trading strategy using histori‐
cal data. This enabled us to generate results and analyze risk and profitability before
risking any actual capital.
Chapter Summary
In “Case Study 1: Fraud Detection” on page 153, we explored the issue of an unbal‐
anced dataset and the importance of having the right evaluation metric. In “Case
Study 2: Loan Default Probability” on page 166, various techniques and concepts of
data processing, feature selection, and exploratory analysis were covered. In “Case
Study 3: Bitcoin Trading Strategy” on page 179, we looked at ways to create technical
indicators as features in order to use them for model enhancement. We also prepared
a backtesting framework for a trading strategy.
Overall, the concepts in Python, machine learning, and finance presented in this
chapter can be used as a blueprint for any other classification-based problem in finance.
Exercises
• Predict whether a stock price will go up or down using the features related to the
stock or macroeconomic variables (use the ideas from the bitcoin-based case
study presented in this chapter).
• Create a model to detect money laundering using the features of a transaction. A
sample dataset for this exercise can be obtained from Kaggle (https://oreil.ly/GcinN).
• Perform a credit rating analysis of corporations using the features related to
creditworthiness.
PART III
Unsupervised Learning
CHAPTER 7
Unsupervised Learning:
Dimensionality Reduction
In previous chapters, we used supervised learning techniques to build machine learn‐
ing models using data where the answer was already known (i.e., the class labels were
available in our input data). Now we will explore unsupervised learning, where we
draw inferences from datasets consisting of input data when the answer is unknown.
Unsupervised learning algorithms attempt to infer patterns from the data without
any knowledge of the output the data is meant to yield. Without requiring labeled
data, which can be time-consuming and impractical to create or acquire, this family
of models allows for easy use of larger datasets for analysis and model development.
Dimensionality reduction is a key technique within unsupervised learning. It com‐
presses the data by finding a smaller, different set of variables that capture what mat‐
ters most in the original features, while minimizing the loss of information.
Dimensionality reduction helps mitigate problems associated with high dimensional‐
ity and permits the visualization of salient aspects of higher-dimensional data that is
otherwise difficult to explore.
In the context of finance, where datasets are often large and contain many dimen‐
sions, dimensionality reduction techniques prove to be quite practical and useful.
Dimensionality reduction enables us to reduce noise and redundancy in the dataset
and find an approximate version of the dataset using fewer features. With fewer vari‐
ables to consider, exploration and visualization of a dataset becomes more straight‐
forward. Dimensionality reduction techniques also enhance supervised learning–
based models by reducing the number of features or by finding new ones. Practition‐
ers use these dimensionality reduction techniques to allocate funds across asset
classes and individual investments, identify trading strategies and signals, implement
portfolio hedging and risk management, and develop instrument pricing models.
In this chapter, we will discuss fundamental dimensionality reduction techniques and
walk through three case studies in the areas of portfolio management, interest rate
modeling, and trading strategy development. The case studies are designed to not
only cover diverse topics from a finance standpoint but also highlight multiple
machine learning and data science concepts. The standardized template containing
the detailed implementation of modeling steps in Python and machine learning and
finance concepts can be used as a blueprint for any other dimensionality reduction–
based problem in finance.
In “Case Study 1: Portfolio Management: Finding an Eigen Portfolio” on page 202, we
use a dimensionality reduction algorithm to allocate capital into different asset classes
to maximize risk-adjusted returns. We also introduce a backtesting framework to
assess the performance of the portfolio we constructed.
In “Case Study 2: Yield Curve Construction and Interest Rate Modeling” on page 217,
we use dimensionality reduction techniques to generate the typical movements of a
yield curve. This will illustrate how dimensionality reduction techniques can be used
for reducing the dimension of market variables across a number of asset classes to
promote faster and effective portfolio management, trading, hedging, and risk man‐
agement.
In “Case Study 3: Bitcoin Trading: Enhancing Speed and Accuracy” on page 227, we
use dimensionality reduction techniques for algorithmic trading. This case study
demonstrates data exploration in low dimension.
In addition to the points mentioned above, readers will understand the following
by the end of this chapter:
• Basic concepts of models and techniques used for dimensionality reduction and
how to implement them in Python.
• Concepts of eigenvalues and eigenvectors of Principal Component Analysis
(PCA), selecting the right number of principal components, and extracting the
factor weights of principal components.
• Usage of dimensionality reduction techniques such as singular value decomposi‐
tion (SVD) and t-SNE to summarize high-dimensional data for effective data
exploration and visualization.
• How to reconstruct the original data using the reduced principal components.
• How to enhance the speed and accuracy of supervised learning algorithms using
dimensionality reduction.
• A backtesting framework for computing and analyzing portfolio performance
metrics, such as the Sharpe ratio and the annualized return of the portfolio.
This Chapter’s Code Repository
A Python-based master template for dimensionality reduction,
along with the Jupyter notebook for all the case studies in this
chapter, is included in the folder Chapter 7 - Unsup. Learning -
Dimensionality Reduction in the code repository for this book. To
work through any machine learning problem in Python involving
the dimensionality reduc‐
tion models (such as PCA, SVD, Kernel PCA, or t-SNE) presented
in this chapter, readers need to modify the template slightly to
align with their problem statement. All the case studies presented
in this chapter use the standard Python master template with the
standardized model development steps presented in Chapter 3. For
the dimensionality reduction case studies, steps 6 (i.e., model tun‐
ing) and 7 (i.e., finalizing the model) are relatively lighter com‐
pared to the supervised learning models, so these steps have been
merged with step 5. For situations in which steps are irrelevant,
they have been skipped or combined with others to make the flow
of the case study more intuitive.
Dimensionality Reduction Techniques
Dimensionality reduction represents the information in a given dataset more effi‐
ciently by using fewer features. These techniques project data onto a lower dimen‐
sional space by either discarding variation in the data that is not informative or
identifying a lower dimensional subspace on or near where the data resides.
There are many types of dimensionality reduction techniques. In this chapter, we will
cover these most frequently used techniques for dimensionality reduction:
• Principal component analysis (PCA)
• Kernel principal component analysis (KPCA)
• t-distributed stochastic neighbor embedding (t-SNE)
After application of these dimensionality reduction techniques, the low-dimension
feature subspace can be a linear or nonlinear function of the corresponding high-
dimensional feature subspace. Hence, on a broad level these dimensionality reduc‐
tion algorithms can be classified as linear and nonlinear. Linear algorithms, such as
PCA, force the new variables to be linear combinations of the original features.
Nonlinear algorithms such as KPCA and t-SNE can capture more complex structures in
the data. However, given the infinite number of options, the algorithms still need to
make assumptions to arrive at a solution.
Principal Component Analysis
The idea of principal component analysis (PCA) is to reduce the dimensionality of a
dataset with a large number of variables, while retaining as much variance in the data
as possible. PCA allows us to understand whether there is a different representation
of the data that can explain a majority of the original data points.
PCA finds a set of new variables that, through a linear combination, yield the original
variables. The new variables are called principal components (PCs). These principal
components are orthogonal (or independent) and can represent the original data.
The number of components is a hyperparameter of the PCA algorithm that sets the
target dimensionality.
The PCA algorithm works by projecting the original data onto the principal compo‐
nent space. It then identifies a sequence of principal components, each of which
aligns with the direction of maximum variance in the data (after accounting for var‐
iation captured by previously computed components). The sequential optimization
also ensures that new components are not correlated with existing components. Thus
the resulting set constitutes an orthogonal basis for a vector space.
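This orthogonality can be verified directly on a fitted model; a quick sketch, assuming
a scikit-learn PCA object pca that has already been fit (as in the implementation
snippet later in this section):
import numpy as np

# rows of pca.components_ are the principal directions; for an orthonormal
# set, the Gram matrix should be the identity matrix
gram = pca.components_ @ pca.components_.T
print(np.allclose(gram, np.eye(pca.components_.shape[0])))  # True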
The decline in the amount of variance of the original data explained by each principal
component reflects the extent of correlation among the original features. The number
of components that capture, for example, 95% of the original variation relative to the
total number of featuresprovides an insight into the linearly independent informa‐
tion of the original data. In order to understand how PCA works, let’s consider the
distribution of data shown in Figure 7-1.
Figure 7-1. PCA-1
PCA finds a new coordinate system (y’ and x’ axes) that is obtained from the original
through translation and rotation. It will move the center of the coordinate system
from the original point (0, 0) to the center of the distribution of data points. It will
then move the x-axis into the principal axis of variation, which is the one with the
most variation relative to data points (i.e., the direction of maximum spread). Then it
moves the other axis orthogonally to the principal one, into a less important direction
of variation.
Figure 7-2 shows an example of PCA in which two dimensions explain nearly all the
variance of the underlying data.
Figure 7-2. PCA-2
These new directions that contain the maximum variance are called principal compo‐
nents and are orthogonal to each other by design.
There are two approaches to finding the principal components: Eigen decomposition
and singular value decomposition (SVD).
Eigen decomposition
The steps of Eigen decomposition are as follows:
1. First, a covariance matrix is created for the features.
2. Once the covariance matrix is computed, the eigenvectors of the covariance
matrix are calculated (eigenvectors and eigenvalues are concepts from linear
algebra). These are the directions of maximum variance.
3. The eigenvalues are then computed. They define the magnitude of the principal
components.
So, for n dimensions, there will be an n × n variance-covariance matrix, and as a
result, we will have n eigenvectors (each with n values) and n eigenvalues.
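These steps can be reproduced directly with NumPy; a minimal sketch, assuming a
standardized data array returns with observations in rows and features in columns:
import numpy as np

cov_matrix = np.cov(returns, rowvar=False)             # step 1: covariance matrix
eig_values, eig_vectors = np.linalg.eigh(cov_matrix)   # steps 2 and 3

# sort from the largest to the smallest eigenvalue
order = np.argsort(eig_values)[::-1]
eig_values, eig_vectors = eig_values[order], eig_vectors[:, order]

# project the data onto the first two principal components
projected = returns @ eig_vectors[:, :2]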
Python’s sklearn library offers a powerful implementation of PCA. The
sklearn.decomposition.PCA function computes the desired number of principal
components and projects the data into the component space. The following code
snippet illustrates how to create two principal components from a dataset.
Implementation
# Import PCA Algorithm
from sklearn.decomposition import PCA
# Initialize the algorithm and set the number of PC's
pca = PCA(n_components=2)
# Fit the model to data
pca.fit(data)
# Get list of PC's
pca.components_
# Transform the model to data
pca.transform(data)
# Get the eigenvalues
pca.explained_variance_ratio_
There are additional items, such as factor loading, that can be obtained using the
functions in the sklearn library. Their use will be demonstrated in the case studies.
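As noted earlier, a common way to choose the number of components is to target a
fraction of the original variance. sklearn supports this directly: passing a float
between 0 and 1 as n_components retains the smallest number of components whose
cumulative explained variance reaches that fraction. A quick sketch, assuming a
dataset data:
# keep the smallest number of components explaining 95% of the variance
pca = PCA(n_components=0.95)
pca.fit(data)
print(pca.n_components_)  # number of components retained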
Singular value decomposition
Singular value decomposition (SVD) is factorization of a matrix into three matrices
and is applicable to a more general case of m × n rectangular matrices.
If A is an m × n matrix, then SVD can express the matrix as:
A = UΣVᵀ
where A is an m × n matrix, U is an (m × m) orthogonal matrix, Σ is an (m × n)
nonnegative rectangular diagonal matrix, and V is an (n × n) orthogonal matrix. SVD
of a given matrix tells us exactly how we can decompose the matrix. Σ is a rectangular
diagonal matrix with min(m, n) diagonal values, called singular values. Their magnitude indicates how
significant they are to preserving the information of the original data. V contains the
principal components as column vectors.
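A quick sketch of the factorization with NumPy, using a small hypothetical data
matrix A (the economy-size form, full_matrices=False, is typically used in practice):
import numpy as np

A = np.random.randn(100, 5)   # hypothetical data: 100 observations, 5 features
A -= A.mean(axis=0)           # center the data, as PCA does

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(s) @ Vt))  # True: the factorization reconstructs A

# rows of Vt (i.e., columns of V) are the principal directions; for centered
# data, s**2 / (n - 1) gives the variance explained by each component
print(s**2 / (A.shape[0] - 1))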
As shown above, both Eigen decomposition and SVD tell us that using PCA is effec‐
tively looking at the initial data from a different angle. Both will always give the same
answer; however, SVD can be much more efficient than Eigen decomposition, as it is
able to handle sparse matrices (those which contain very few nonzero elements). In
addition, SVD yields better numerical stability, especially when some of the features
are strongly correlated.
Truncated SVD is a variant of SVD that computes only the largest singular values,
where the number of singular values to compute is a user-specified parameter. This method is differ‐
ent from regular SVD in that it produces a factorization where the number of col‐
umns is equal to the specified truncation. For example, given an n × n matrix, SVD
will produce matrices with n columns, whereas truncated SVD will produce matrices
with a specified number of columns that may be less than n.
Implementation
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=20).fit(X)
In terms of the weaknesses of the PCA technique, although it is very effective in
reducing the number of dimensions, the resulting principal components may be less
interpretable than the original features. Additionally, the results may be sensitive to
the selected number of principal components. For example, too few principal compo‐
nents may miss some information compared to the original list of features. Also, PCA
may not work well if the data is strongly nonlinear.
Kernel Principal Component Analysis
A main limitation of PCA is that it only applies linear transformations. Kernel princi‐
pal component analysis (KPCA) extends PCA to handle nonlinearity. It first maps
the original data to some nonlinear feature space (usually one of higher dimension).
Then it applies PCA to extract the principal components in that space.
A simple example of when KPCA is applicable is shown in Figure 7-3. Linear trans‐
formations are suitable for the blue and red data points on the left-hand plot.
However, if all dots are arranged as per the graph on the right, the result is not line‐
arly separable. We would then need to apply KPCA to separate the components.
Figure 7-3. Kernel PCA
Implementation
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components=4, kernel='rbf').fit_transform(X)
In the Python code, we specify kernel='rbf', which is the radial basis function ker‐
nel. This is commonly used as a kernel in machine learning techniques, such as in
SVMs (see Chapter 4).
Using KPCA, component separation becomes easier in a higher dimensional space,
as mapping into a higher dimensional space often provides greater classification
power.
t-distributed Stochastic Neighbor Embedding
t-distributed stochastic neighbor embedding (t-SNE) is a dimensionality reduction
algorithm that reduces the dimensions by modeling the probability distribution of
neighbors around each point. Here, the term neighbors refers to the set of points clos‐
est to a given point. The algorithm emphasizes keeping similar points together in low
dimensions as opposed to maintaining the distance between points that are apart in
high dimensions.
The algorithm starts by calculating the probability of similarity of data points in cor‐
responding high and low dimensional space. The similarity of points is calculated as
the conditional probability that a point A would choose point B as its neighbor if
neighbors were picked in proportion to their probability density under a normal dis‐
tribution centered at A. The algorithm then tries to minimize the difference between
these conditional probabilities (or similarities) in the high and low dimensional
spaces for a perfect representation of data points in the low dimensional space.
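For reference, the conditional probability described above is commonly written as:
p(j|i) = exp(–||x_j – x_i||² / 2σ_i²) / Σ_(k≠i) exp(–||x_k – x_i||² / 2σ_i²)
where σ_i is the bandwidth of the normal distribution centered at point x_i, which the
algorithm sets via its perplexity parameter.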
Implementation
from sklearn.manifold import TSNE
X_tsne = TSNE().fit_transform(X)
An implementation of t-SNE is shown in the third case study presented in this
chapter.
Case Study 1: Portfolio Management: Finding an Eigen
Portfolio
A primary objective of portfolio management is to allocate capital into different asset
classes to maximize risk-adjusted returns. Mean-variance portfolio optimization is
the most commonly used technique for asset allocation. This method requires an
estimated covariance matrix and expected returns of the assets considered. However,
the erratic nature of financial returns leads to estimation errors in these inputs, espe‐
cially when the sample size of returns is insufficient compared to the number of
assetsbeing allocated. These errors greatly jeopardize the optimality of the resulting
portfolios, leading to poor and unstable outcomes.
Dimensionality reduction is a technique we can use to address this issue. Using PCA,
we can take an n × n covariance matrix of our assets and create a set of n linearly
uncorrelated principal portfolios (sometimes referred to in the literature as eigen
portfolios) made up of our assets and their corresponding variances. The principal compo‐
nents of the covariance matrix capture most of the covariation among the assets and
are mutually uncorrelated. Moreover, we can use standardized principal components
as the portfolio weights, with the statistical guarantee that the returns from these
principal portfolios are linearly uncorrelated.
By the end of this case study, readers will be familiar with a general approach to find‐
ing an eigen portfolio for asset allocation, from understanding concepts of PCA to
backtesting different principal components.
This case study will focus on:
• Understanding eigenvalues and eigenvectors of PCA and deriving portfolio
weights using the principal components.
• Developing a backtesting framework to evaluate portfolio performance.
• Understanding how to work through a dimensionality reduction modeling prob‐
lem from end to end.
Blueprint for Using Dimensionality Reduction for Asset
Allocation
1. Problem definition
Our goal in this case study is to maximize the risk-adjusted returns of an equity port‐
folio using PCA on a dataset of stock returns.
The dataset used for this case study is the Dow Jones Industrial Average (DJIA) index
and its respective 30 stocks. The return data used will be from the year 2000 onwards
and can be downloaded from Yahoo Finance.
We will also compare the performance of our hypothetical portfolios against a bench‐
mark and backtest the model to evaluate the effectiveness of the approach.
2. Getting started—loading the data and Python packages
2.1. Loading the Python packages. The list of the libraries used for data loading, data
analysis, data preparation, model evaluation, and model tuning are shown below. The
details of most of these packages and functions can be found in Chapters 2 and 4.
Packages for dimensionality reduction
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from numpy.linalg import inv, eig, svd
from sklearn.manifold import TSNE
from sklearn.decomposition import KernelPCA
Packages for data processing and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import read_csv, set_option
from pandas.plotting import scatter_matrix
import seaborn as sns
from sklearn.preprocessing import StandardScaler
2.2. Loading the data. We import the dataframe containing the adjusted closing prices
for all the companies in the DJIA index:
# load dataset
dataset = read_csv('Dow_adjcloses.csv', index_col=0)
3. Exploratory data analysis
Next, we inspect the dataset.
3.1. Descriptive statistics. Let’s look at the shape of the data:
dataset.shape
Output
(4804, 30)
The data comprises 30 columns and 4,804 rows containing the daily closing
prices of the 30 stocks in the index since 2000.
3.2. Data visualization. The first thing we must do is gather a basic sense of our data.
Let us take a look at the return correlations:
correlation = dataset.corr()
plt.figure(figsize=(15, 15))
plt.title('Correlation Matrix')
sns.heatmap(correlation, vmax=1, square=True, annot=True, cmap='cubehelix')
There is a significant positive correlation between the daily returns. The plot (full-size
version available on GitHub at https://oreil.ly/yFwu-) also indicates that the information embedded in the
data may be represented by fewer variables (i.e., something smaller than the 30
dimensions we have now). We will perform another detailed look at the data after
implementing dimensionality reduction.
Output
4. Data preparation
We prepare the data for modeling in the following sections.
4.1. Data cleaning. First, we check for NAs and either drop the affected columns or
fill the missing values:
# Checking for any null values
print('Null Values =', dataset.isnull().values.any())
Output
Null Values = True
Some stocks were added to the index after our start date. To ensure proper analysis,
we will drop those with more than 30% missing values. Two stocks meet this criterion—
Dow Chemical and Visa:
missing_fractions = dataset.isnull().mean().sort_values(ascending=False)
missing_fractions.head(10)
drop_list = sorted(list(missing_fractions[missing_fractions > 0.3].index))
dataset.drop(labels=drop_list, axis=1, inplace=True)
dataset.shape
Output
(4804, 28)
We end up with price data for 28 companies. Now we fill the remaining NAs by
carrying the last available value forward:
# Fill the missing values with the last value available in the dataset.
dataset=dataset.fillna(method='ffill')
4.2. Data transformation. In addition to handling the missing values, we also want to
standardize the dataset features onto a unit scale (mean = 0 and variance = 1). All the
variables should be on the same scale before applying PCA; otherwise, a feature with
large values will dominate the result. We use StandardScaler in sklearn to standard‐
ize the dataset, as shown below:
from sklearn.preprocessing import StandardScaler
# datareturns: daily returns computed from the price data (see the accompanying notebook)
scaler = StandardScaler().fit(datareturns)
rescaledDataset = pd.DataFrame(scaler.transform(datareturns), columns=\
 datareturns.columns, index=datareturns.index)
# drop any rows with remaining missing values
datareturns.dropna(how='any', inplace=True)
rescaledDataset.dropna(how='any', inplace=True)
Overall, cleaning and standardizing the data is important in order to create a mean‐
ingful and reliable dataset to be used in dimensionality reduction without error.
Let us look at the returns of one of the stocks from the cleaned and standardized
dataset:
# Visualizing the returns for AAPL
plt.figure(figsize=(16, 5))
plt.title("AAPL Return")
rescaledDataset.AAPL.plot()
plt.grid(True);
plt.legend()
plt.show()
Output
5. Evaluate algorithms and models
5.1. Train-test split. The portfolio is divided into training and test sets to perform the
analysis regarding the best portfolio and to perform backtesting:
# Dividing the dataset into training and testing sets
percentage = int(len(rescaledDataset) * 0.8)
X_train = rescaledDataset[:percentage]
X_test = rescaledDataset[percentage:]
stock_tickers = rescaledDataset.columns.values
n_tickers = len(stock_tickers)
5.2. Model evaluation: applying principal component analysis. As the next step, we perform
PCA using the sklearn library. This generates the principal components from the data
that will be used for further analysis:
pca = PCA()
PrincipalComponent=pca.fit(X_train)
5.2.1. Explained variance using PCA. In this step, we look at the variance explained using
PCA. The decline in the amount of variance of the original data explained by each
principal component reflects the extent of correlation among the original features.
The first principal component captures the most variance in the original data, the
second component is a representation of the second highest variance, and so on. The
eigenvectors with the lowest eigenvalues describe the least amount of variation within
the dataset. Therefore, these values can be dropped.
The following charts show the number of principal components and the variance
explained by each.
NumEigenvalues=20
fig, axes = plt.subplots(ncols=2, figsize=(14,4))
Series1 = pd.Series(pca.explained_variance_ratio_[:NumEigenvalues]).sort_values()
Series2 = pd.Series(pca.explained_variance_ratio_[:NumEigenvalues]).cumsum()
Series1.plot.barh(title='Explained Variance Ratio by Top Factors', ax=axes[0]);
Series2.plot(ylim=(0,1), ax=axes[1], title='Cumulative Explained Variance');
Output
We find that the most important factor explains around 40% of the daily return var‐
iation. This dominant principal component is usually interpreted as the “market” fac‐
tor. We will discuss the interpretation of this and the other factors when looking at
the portfolio weights.
The plot on the right shows the cumulative explained variance and indicates that
around ten factors explain 73% of the variance in returns of the 28 stocks analyzed.
5.2.2. Looking at portfolio weights. In this step, we look more closely at the individual
principal components. These may be less interpretable than the original features.
However, we can look at the weights of the factors on each principal component to
assess any intuitive themes relative to the 28 stocks. We construct five portfolios,
defining the weights of each stock as each of the first five principal components. We
then create a scatterplot that visualizes an organized descending plot with the respec‐
tive weight of every company at the current chosen principal component:
def PCWeights():
 #Principal Components (PC) weights for each 28 PCs
 weights = pd.DataFrame()
 for i in range(len(pca.components_)):
 weights["weights_{}".format(i)] = \
 pca.components_[i] / sum(pca.components_[i])
 weights = weights.values.T
 return weights
weights=PCWeights()
sum(pca.components_[0])
Output
-5.247808242068631
NumComponents=5
topPortfolios = pd.DataFrame(pca.components_[:NumComponents],\
 columns=dataset.columns)
eigen_portfolios = topPortfolios.div(topPortfolios.sum(1), axis=0)
eigen_portfolios.index = [f'Portfolio {i}' for i in range( NumComponents)]
np.sqrt(pca.explained_variance_)
eigen_portfolios.T.plot.bar(subplots=True, layout=(int(NumComponents),1), \
figsize=(14,10), legend=False, sharey=True, ylim= (-1,1))
Output
Given that the scale of the plots is the same, we can also look at the heatmap as
follows:
# plotting heatmap
sns.heatmap(topPortfolios)
Output
The heatmap and barplots show the contribution of different stocks in each
eigenvector.
Traditionally, the intuition behind each principal portfolio is that it represents some
sort of independent risk factor. The manifestation of those risk factors depends on
the assets in the portfolio. In our case study, the assets are all U.S. domestic equities.
The principal portfolio with the largest variance is typically a systematic risk factor
(i.e., “market” factor). Looking at the first principal component (Portfolio 0), we see
that the weights are distributed homogeneously across the stocks. This nearly equal
weighted portfolio explains 40% of the variance in the index and is a fair representa‐
tion of a systematic risk factor.
The rest of the eigen portfolios typically correspond to sector or industry factors. For
example, Portfolio 1 assigns a high weight to JNJ and MRK, which are stocks from the
health care sector. Similarly, Portfolio 3 has high weights on technology and electron‐
ics companies, such as AAPL, MSFT, and IBM.
When the asset universe for our portfolio is expanded to include broad, global invest‐
ments, we may identify factors for international equity risk, interest rate risk, com‐
modity exposure, geographic risk, and many others.
In the next step, we find the best eigen portfolio.
5.2.3. Finding the best eigen portfolio. To determine the best eigen portfolio, we use the
Sharpe ratio. This is a measure of risk-adjusted performance that relates a portfolio's
annualized return to its annualized volatility. A high Sharpe ratio indicates higher
returns and/or lower volatility for the specified portfolio. The annualized Sharpe
ratio is computed by dividing the annualized return by the annualized volatility. The
annualized return is the geometric average of all the returns with respect to the
periods per year (days of operation of the exchange in a year). Annualized volatility
is computed by taking the standard deviation of the returns and multiplying it by the
square root of the periods per year.
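In equation form, for a series of periodic returns r_t observed over n_years years with
k periods per year (k = 252 for daily data):
annualized return = (∏(1 + r_t))^(1/n_years) – 1
annualized volatility = std(r_t) × √k
Sharpe ratio = annualized return / annualized volatility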
The following code computes the Sharpe ratio of a portfolio:
# Sharpe Ratio Calculation
# Calculation based on conventional number of trading days per year (i.e., 252).
def sharpe_ratio(ts_returns, periods_per_year=252):
 n_years = ts_returns.shape[0]/ periods_per_year
 annualized_return = np.power(np.prod(1+ts_returns), (1/n_years))-1
 annualized_vol = ts_returns.std() * np.sqrt(periods_per_year)
 annualized_sharpe = annualized_return / annualized_vol
 return annualized_return, annualized_vol, annualized_sharpe
We construct a loop that computes the principal component weights for each eigen
portfolio and then uses the Sharpe ratio function to find the portfolio with the
highest Sharpe ratio. Once we know which portfolio has the highest Sharpe ratio, we
can visualize its performance against the index for comparison:
def optimizedPortfolio():
 n_portfolios = len(pca.components_)
 annualized_ret = np.array([0.] * n_portfolios)
 sharpe_metric = np.array([0.] * n_portfolios)
 annualized_vol = np.array([0.] * n_portfolios)
 highest_sharpe = 0
 stock_tickers = rescaledDataset.columns.values
 n_tickers = len(stock_tickers)
 pcs = pca.components_
 for i in range(n_portfolios):
 pc_w = pcs[i] / sum(pcs[i])
 eigen_prtfi = pd.DataFrame(data ={'weights': pc_w.squeeze()*100}, \
 index = stock_tickers)
 eigen_prtfi.sort_values(by=['weights'], ascending=False, inplace=True)
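 # X_train_raw: the unscaled training-set returns (defined in the accompanying notebook)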
 eigen_prti_returns = np.dot(X_train_raw.loc[:, eigen_prtfi.index], pc_w)
 eigen_prti_returns = pd.Series(eigen_prti_returns.squeeze(),\
 index=X_train_raw.index)
 er, vol, sharpe = sharpe_ratio(eigen_prti_returns)
 annualized_ret[i] = er
 annualized_vol[i] = vol
 sharpe_metric[i] = sharpe
 sharpe_metric= np.nan_to_num(sharpe_metric)
 # find portfolio with the highest Sharpe ratio
 highest_sharpe = np.argmax(sharpe_metric)
 print('Eigen portfolio #%d with the highest Sharpe. Return %.2f%%,\
 vol = %.2f%%, Sharpe = %.2f' %
 (highest_sharpe,
 annualized_ret[highest_sharpe]*100,
 annualized_vol[highest_sharpe]*100,
 sharpe_metric[highest_sharpe]))
 fig, ax = plt.subplots()
 fig.set_size_inches(12, 4)
 ax.plot(sharpe_metric, linewidth=3)
 ax.set_title('Sharpe ratio of eigen-portfolios')
 ax.set_ylabel('Sharpe ratio')
 ax.set_xlabel('Portfolios')
 results = pd.DataFrame(data={'Return': annualized_ret,\
 'Vol': annualized_vol,
 'Sharpe': sharpe_metric})
 results.dropna(inplace=True)
 results.sort_values(by=['Sharpe'], ascending=False, inplace=True)
 print(results.head(5))
 plt.show()
optimizedPortfolio()
Output
Eigen portfolio #0 with the highest Sharpe. Return 11.47%, vol = 13.31%, \
Sharpe = 0.86
 Return Vol Sharpe
0 0.115 0.133 0.862
7 0.096 0.693 0.138
5 0.100 0.845 0.118
1 0.057 0.670 0.084
As shown by the results above, Portfolio 0 is the best performing one, with the highest
return and the lowest volatility. Let us look at the composition of this portfolio:
weights = PCWeights()
portfolio = pd.DataFrame()
def plotEigen(weights, plot=False, portfolio=portfolio):
 portfolio = pd.DataFrame(data ={'weights': weights.squeeze() * 100}, \
 index= stock_tickers)
 portfolio.sort_values(by=['weights'], ascending=False, inplace=True)
 if plot:
 portfolio.plot(title='Current Eigen-Portfolio Weights',
 figsize=(12, 6),
 xticks=range(0, len(stock_tickers), 1),
 rot=45,
 linewidth=3
 )
 plt.show()
 return portfolio
# Weights are stored in arrays, where 0 is the first PC's weights.
plotEigen(weights=weights[0], plot=True)
Output
Recall that this is the portfolio that explains 40% of the variance and represents the
systematic risk factor. Looking at the portfolio weights (in percentages in the y-axis),
they do not vary much and are in the range of 2.7% to 4.5% across all stocks. How‐
ever, the weights seem to be higher in the financial sector, and stocks such as AXP,
JPM, and GS have higher-than-average weights.
5.2.4. Backtesting the eigen portfolios. We will now try to backtest this algorithm on
the test set. We will look at a few of the top performers and the worst performer. For
the top performers we look at the 3rd- and 4th-ranked eigen portfolios (Portfolios 5
and 1), while the worst performer reviewed was ranked 19th (Portfolio 14):
def Backtest(eigen):
 '''
 Plots principal components returns against real returns.
 '''
 eigen_prtfi = pd.DataFrame(data ={'weights': eigen.squeeze()}, \
 index=stock_tickers)
 eigen_prtfi.sort_values(by=['weights'], ascending=False, inplace=True)
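 # X_test_raw: the unscaled test-set returns (defined in the accompanying notebook)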
 eigen_prti_returns = np.dot(X_test_raw.loc[:, eigen_prtfi.index], eigen)
 eigen_portfolio_returns = pd.Series(eigen_prti_returns.squeeze(),\
 index=X_test_raw.index)
 returns, vol, sharpe = sharpe_ratio(eigen_portfolio_returns)
 print('Current Eigen-Portfolio:\nReturn = %.2f%%\nVolatility = %.2f%%\n\
 Sharpe = %.2f' % (returns * 100, vol * 100, sharpe))
 equal_weight_return=(X_test_raw * (1/len(pca.components_))).sum(axis=1)
 df_plot = pd.DataFrame({'EigenPorfolio Return': eigen_portfolio_returns, \
 'Equal Weight Index': equal_weight_return}, index=X_test.index)
 np.cumprod(df_plot + 1).plot(title='Returns of the equal weighted\
 index vs. First eigen-portfolio',
 figsize=(12, 6), linewidth=3)
 plt.show()
Backtest(eigen=weights[5])
Backtest(eigen=weights[1])
Backtest(eigen=weights[14])
Output
Current Eigen-Portfolio:
Return = 32.76%
Volatility = 68.64%
Sharpe = 0.48
Current Eigen-Portfolio:
Return = 99.80%
Volatility = 58.34%
Sharpe = 1.71
Current Eigen-Portfolio:
Return = -79.42%
Volatility = 185.30%
Sharpe = -0.43
As shown in the preceding charts, the eigen portfolio return of the top portfolios out‐
performs the equally weighted index. The eigen portfolio ranked 19th underper‐
formed the market significantly in the test set. The outperformance and
underperformance are attributed to the weights of the stocks or sectors in the eigen
portfolio. We can drill down further to understand the individual drivers of each
portfolio. For example, Portfolio 1 assigns high weight to several stocks in the health
care sector, as discussed previously. This sector saw a significant increase in 2017
onwards, which is reflected in the chart for Eigen Portfolio 1.
Given that these eigen portfolios are independent, they also provide diversification
opportunities. As such, we can invest across these uncorrelated eigen portfolios, pro‐
viding other potential portfolio management benefits.
Conclusion
In this case study, we applied dimensionality reduction techniques in the context of
portfolio management, using eigenvalues and eigenvectors from PCA to perform
asset allocation.
We demonstrated that, while some interpretability is lost, the intuition behind the
resulting portfolios can be matched to risk factors. In this example, the first eigen
portfolio represented a systematic risk factor, while others exhibited sector or indus‐
try concentration.
Through backtesting, we found that the portfolio with the best result on the training
set also achieved the strongest performance on the test set. Several of the portfolios
outperformed the index based on the Sharpe ratio, the risk-adjusted performance
metric used in this exercise.
Overall, we found that using PCA and analyzing eigen portfolios can yield a robust
methodology for asset allocation and portfolio management.
Case Study 2: Yield Curve Construction and Interest Rate
Modeling
A number of problems in portfolio management, trading, and risk management
require a deep understanding and modeling of yield curves.
A yield curve represents interest rates, or yields, across a range of maturities, usually
depicted in a line graph, as discussed in “Case Study 4: Yield Curve Prediction” on
page 141 in Chapter 5. The yield curve illustrates the “price of funds” at a given point in
time and, due to the time value of money, often shows interest rates rising as a func‐
tion of maturity.
Researchers in finance have studied the yield curve and found that shifts or changes
in the shape of the yield curve are attributable to a few unobservable factors. Specifi‐
cally, empirical studies reveal that more than 99% of the movement of various U.S.
Treasury bond yields are captured by three factors, which are often referred to as
level, slope, and curvature. The names describe how each influences the yield curve
shape in response to a shock. A level shock changes the interest rates of all maturities
by almost identical amounts, inducing a parallel shift that changes the level of the
entire curve up or down. A shock to the slope factor changes the difference in short-
term and long-term rates. For instance, when long-term rates increase by a larger
amount than do short-term rates, it results in a curve that becomes steeper (i.e., visu‐
ally, the curve becomes more upward sloping). Changes in the short- and long-term
rates can also produce a flatter yield curve. The main effect of a shock to the curva‐
ture factor is on medium-term interest rates, leading to hump, twist, or U-
shaped characteristics.
Dimensionality reduction breaks down the movement of the yield curve into these
three factors. Reducing the yield curve into fewer components means we can focus on
a few intuitive dimensions in the yield curve. Traders and risk managers use this
technique to condense the curve into risk factors for hedging interest rate risk. Sim‐
ilarly, portfolio managers then have fewer dimensions to analyze when allocating
funds. Interest rate structurers use this technique to model the yield curve and ana‐
lyze its shape. Overall, it promotes faster and more effective portfolio management,
trading, hedging, and risk management.
In this case study, we use PCA to generate typical movements of a yield curve and
show that the first three principal components correspond to a yield curve’s level,
slope, and curvature, respectively.
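As a preview of the approach, the following is a minimal sketch, assuming a DataFrame
yields with one column per tenor and one row per day (a hypothetical variable name;
the full implementation follows the standard template below):
from sklearn.decomposition import PCA

# work with daily yield changes rather than yield levels
yield_changes = yields.diff().dropna()

pca = PCA(n_components=3)
pca.fit(yield_changes)

# the three components typically map to level, slope, and curvature; their
# loadings across tenors reveal the shape of each factor
print(pca.explained_variance_ratio_)
print(pca.components_)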
This case study will focus on:
• Understanding the intuition behind eigenvectors.
• Using dimensions resulting from dimensionality reduction to reproduce the
original data.
Blueprint for Using Dimensionality Reduction to Generate a
Yield Curve
1. Problem definition
Our goal in this case study is to use dimensionality reduction techniques to generate
the typical movements of a yield curve.
The data used for this case study is obtained from Quandl, a premier source for
financial, economic, and alternative datasets. We use data for 11 tenors (or maturi‐
ties) of the Treasury curve, from 1 month to 30 years. The data is of daily frequency
and is available from 1960 onwards.
2. Getting started—loading the data and Python packages
2.1. Loading the Python packages. The loading of Python packages is similar to the

Mais conteúdos dessa disciplina