Hariom Tatsat, Sahil Puri, and Brad Lookabaugh

Machine Learning and Data Science Blueprints for Finance
From Building Trading Strategies to Robo-Advisors Using Python

Beijing • Boston • Farnham • Sebastopol • Tokyo

ISBN: 978-1-492-07305-5 [LSCH]

Machine Learning and Data Science Blueprints for Finance
by Hariom Tatsat, Sahil Puri, and Brad Lookabaugh

Copyright © 2021 Hariom Tatsat, Sahil Puri, and Brad Lookabaugh. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Michelle Smith
Development Editor: Jeff Bleiel
Production Editor: Christopher Faucher
Copyeditor: Piper Editorial, LLC
Proofreader: Arthur Johnson
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

October 2020: First Edition

Revision History for the First Edition
2020-09-29: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492073055 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Machine Learning and Data Science Blueprints for Finance, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher's views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

Table of Contents

Preface ... ix

Part I. The Framework

1. Machine Learning in Finance: The Landscape ... 1
   Current and Future Machine Learning Applications in Finance 2
   Algorithmic Trading 2
   Portfolio Management and Robo-Advisors 2
   Fraud Detection 3
   Loans/Credit Card/Insurance Underwriting 3
   Automation and Chatbots 3
   Risk Management 4
   Asset Price Prediction 4
   Derivative Pricing 4
   Sentiment Analysis 5
   Trade Settlement 5
   Money Laundering 5
   Machine Learning, Deep Learning, Artificial Intelligence, and Data Science 5
   Machine Learning Types 7
   Supervised 7
   Unsupervised 8
   Reinforcement Learning 9
   Natural Language Processing 10
   Chapter Summary 11

2. Developing a Machine Learning Model in Python ... 13
   Why Python? 13
   Python Packages for Machine Learning 14
   Python and Package Installation 15
   Steps for Model Development in Python Ecosystem 15
   Model Development Blueprint 16
   Chapter Summary 29

3. Artificial Neural Networks ... 31
   ANNs: Architecture, Training, and Hyperparameters 32
   Architecture 32
   Training 34
   Hyperparameters 36
   Creating an Artificial Neural Network Model in Python 40
   Installing Keras and Machine Learning Packages 40
   Running an ANN Model Faster: GPU and Cloud Services 43
   Chapter Summary 45

Part II. Supervised Learning

4. Supervised Learning: Models and Concepts ... 49
   Supervised Learning Models: An Overview 51
   Linear Regression (Ordinary Least Squares) 52
   Regularized Regression 55
   Logistic Regression 57
   Support Vector Machine 58
   K-Nearest Neighbors 60
   Linear Discriminant Analysis 62
   Classification and Regression Trees 63
   Ensemble Models 65
   ANN-Based Models 71
   Model Performance 73
   Overfitting and Underfitting 73
   Cross Validation 74
   Evaluation Metrics 75
   Model Selection 79
   Factors for Model Selection 79
   Model Trade-off 81
   Chapter Summary 82

5. Supervised Learning: Regression (Including Time Series Models) ... 83
   Time Series Models 86
   Time Series Breakdown 87
   Autocorrelation and Stationarity 88
   Traditional Time Series Models (Including the ARIMA Model) 90
   Deep Learning Approach to Time Series Modeling 92
   Modifying Time Series Data for Supervised Learning Models 95
   Case Study 1: Stock Price Prediction 95
   Blueprint for Using Supervised Learning Models to Predict a Stock Price 97
   Case Study 2: Derivative Pricing 114
   Blueprint for Developing a Machine Learning Model for Derivative Pricing 115
   Case Study 3: Investor Risk Tolerance and Robo-Advisors 125
   Blueprint for Modeling Investor Risk Tolerance and Enabling a Machine Learning–Based Robo-Advisor 127
   Case Study 4: Yield Curve Prediction 141
   Blueprint for Using Supervised Learning Models to Predict the Yield Curve 142
   Chapter Summary 149
   Exercises 150

6. Supervised Learning: Classification ... 151
   Case Study 1: Fraud Detection 153
   Blueprint for Using Classification Models to Determine Whether a Transaction Is Fraudulent 153
   Case Study 2: Loan Default Probability 166
   Blueprint for Creating a Machine Learning Model for Predicting Loan Default Probability 167
   Case Study 3: Bitcoin Trading Strategy 179
   Blueprint for Using Classification-Based Models to Predict Whether to Buy or Sell in the Bitcoin Market 180
   Chapter Summary 190
   Exercises 191

Part III. Unsupervised Learning

7. Unsupervised Learning: Dimensionality Reduction ... 195
   Dimensionality Reduction Techniques 197
   Principal Component Analysis 198
   Kernel Principal Component Analysis 201
   t-distributed Stochastic Neighbor Embedding 202
   Case Study 1: Portfolio Management: Finding an Eigen Portfolio 202
   Blueprint for Using Dimensionality Reduction for Asset Allocation 203
   Case Study 2: Yield Curve Construction and Interest Rate Modeling 217
   Blueprint for Using Dimensionality Reduction to Generate a Yield Curve 218
   Case Study 3: Bitcoin Trading: Enhancing Speed and Accuracy 227
   Blueprint for Using Dimensionality Reduction to Enhance a Trading Strategy 228
   Chapter Summary 236
   Exercises 236

8. Unsupervised Learning: Clustering ... 237
   Clustering Techniques 239
   k-means Clustering 239
   Hierarchical Clustering 240
   Affinity Propagation Clustering 242
   Case Study 1: Clustering for Pairs Trading 243
   Blueprint for Using Clustering to Select Pairs 244
   Case Study 2: Portfolio Management: Clustering Investors 259
   Blueprint for Using Clustering for Grouping Investors 260
   Case Study 3: Hierarchical Risk Parity 267
   Blueprint for Using Clustering to Implement Hierarchical Risk Parity 268
   Chapter Summary 277
   Exercises 277

Part IV. Reinforcement Learning and Natural Language Processing

9. Reinforcement Learning ... 281
   Reinforcement Learning—Theory and Concepts 283
   RL Components 284
   RL Modeling Framework 288
   Reinforcement Learning Models 293
   Key Challenges in Reinforcement Learning 298
   Case Study 1: Reinforcement Learning–Based Trading Strategy 298
   Blueprint for Creating a Reinforcement Learning–Based Trading Strategy 300
   Case Study 2: Derivatives Hedging 316
   Blueprint for Implementing a Reinforcement Learning–Based Hedging Strategy 317
   Case Study 3: Portfolio Allocation 334
   Blueprint for Creating a Reinforcement Learning–Based Algorithm for Portfolio Allocation 335
   Chapter Summary 344
   Exercises 345

10. Natural Language Processing ... 347
   Natural Language Processing: Python Packages 349
   NLTK 349
   TextBlob 349
   spaCy 350
   Natural Language Processing: Theory and Concepts 350
   1. Preprocessing 351
   2. Feature Representation 356
   3. Inference 360
   Case Study 1: NLP and Sentiment Analysis–Based Trading Strategies 362
   Blueprint for Building a Trading Strategy Based on Sentiment Analysis 363
   Case Study 2: Chatbot Digital Assistant 383
   Blueprint for Creating a Custom Chatbot Using NLP 385
   Case Study 3: Document Summarization 393
   Blueprint for Using NLP for Document Summarization 394
   Chapter Summary 400
   Exercises 400

Index ... 401

Preface

The value of machine learning (ML) in finance is becoming more apparent each day. Machine learning is expected to become crucial to the functioning of financial markets. Analysts, portfolio managers, traders, and chief investment officers should all be familiar with ML techniques. For banks and other financial institutions striving to improve financial analysis, streamline processes, and increase security, ML is becoming the technology of choice. The use of ML in institutions is an increasing trend, and its potential for improving various systems can be observed in trading strategies, pricing, and risk management.

Although machine learning is making significant inroads across all verticals of the financial services industry, there is a gap between the ideas and the implementation of machine learning algorithms. A plethora of material is available on the web in these areas, yet very little is organized. Additionally, most of the literature is limited to trading algorithms only. Machine Learning and Data Science Blueprints for Finance fills this void and provides a machine learning toolbox customized for the financial market that allows the readers to be part of the machine learning revolution.
This book is not limited to investing or trading strategies; it focuses on leveraging the art and craft of building ML-driven algorithms that are crucial in the finance industry.

Implementing machine learning models in finance is easier than commonly believed. There is also a misconception that big data is needed for building machine learning models. The case studies in this book span almost all areas of machine learning and aim to handle such misconceptions.

This book not only will cover the theory and case studies related to using ML in trading strategies but also will delve deep into other critical "need-to-know" concepts such as portfolio management, derivative pricing, fraud detection, corporate credit ratings, robo-advisor development, and chatbot development. It will address real-life problems faced by practitioners and provide scientifically sound solutions supported by code and examples.

The Python codebase for this book on GitHub (https://github.com/tatsath/fin-ml) will be useful and serve as a starting point for industry practitioners working on their projects. The examples and case studies shown in the book demonstrate techniques that can easily be applied to a wide range of datasets. The futuristic case studies, such as reinforcement learning for trading, building a robo-advisor, and using machine learning for instrument pricing, inspire readers to think outside the box and motivate them to make the best of the models and data available.

Who This Book Is For

The format of the book and the list of topics covered make it suitable for professionals working in hedge funds, investment and retail banks, and fintech firms. They may have titles such as data scientist, data engineer, quantitative researcher, machine learning architect, or software engineer. Additionally, the book will be useful for those professionals working in support functions, such as compliance and risk.

Whether a quantitative trader in a hedge fund is looking for ideas in using reinforcement learning for trading cryptocurrency or an investment bank quant is looking for machine learning–based techniques to improve the calibration speed of pricing models, this book will add value. The theory, concepts, and codebase mentioned in the book will be extremely useful at every step of the model development lifecycle, from idea generation to model implementation. Readers can use the shared codebase and test the proposed solutions themselves, allowing for a hands-on reader experience. The readers should have a basic knowledge of statistics, machine learning, and Python.

How This Book Is Organized

This book provides a comprehensive introduction to how machine learning and data science can be used to design models across different areas in finance. It is organized into four parts.

Part I: The Framework

The first part provides an overview of machine learning in finance and the building blocks of machine learning implementation. These chapters serve as the foundation for the case studies covering different machine learning types presented in the rest of the book.

The chapters under the first part are as follows:

Chapter 1, Machine Learning in Finance: The Landscape
This chapter provides an overview of applications of machine learning in finance and a brief overview of several types of machine learning.

Chapter 2, Developing a Machine Learning Model in Python
This chapter looks at the Python-based ecosystem for machine learning.
It also covers the steps for machine learning model development in the Python framework.

Chapter 3, Artificial Neural Networks
Given that an artificial neural network (ANN) is a primary algorithm used across all types of machine learning, this chapter looks at the details of ANNs, followed by a detailed implementation of an ANN model using Python libraries.

Part II: Supervised Learning

The second part covers fundamental supervised learning algorithms and illustrates specific applications and case studies.

The chapters under the second part are as follows:

Chapter 4, Supervised Learning: Models and Concepts
This chapter provides an introduction to supervised learning techniques (both classification and regression). Given that many models are common to both classification and regression, the details of those models are presented together, along with other concepts such as model selection and evaluation metrics for classification and regression.

Chapter 5, Supervised Learning: Regression (Including Time Series Models)
Supervised learning–based regression models are the most commonly used machine learning models in finance. This chapter covers the models from basic linear regression to advanced deep learning. The case studies covered in this section include models for stock price prediction, derivatives pricing, and portfolio management.

Chapter 6, Supervised Learning: Classification
Classification is a subcategory of supervised learning in which the goal is to predict the categorical class labels of new instances, based on past observations. This section discusses several case studies based on classification-based techniques, such as logistic regression, support vector machines, and random forests.

Part III: Unsupervised Learning

The third part covers the fundamental unsupervised learning algorithms and offers applications and case studies.

The chapters under the third part are as follows:

Chapter 7, Unsupervised Learning: Dimensionality Reduction
This chapter describes the essential techniques to reduce the number of features in a dataset while retaining most of their useful and discriminatory information. It also discusses the standard approach to dimensionality reduction via principal component analysis and covers case studies in portfolio management, trading strategy, and yield curve construction.

Chapter 8, Unsupervised Learning: Clustering
This chapter covers the algorithms and techniques related to clustering and identifying groups of objects that share a degree of similarity. The case studies utilizing clustering in trading strategies and portfolio management are covered in this chapter.

Part IV: Reinforcement Learning and Natural Language Processing

The fourth part covers reinforcement learning and natural language processing (NLP) techniques.

The chapters under the fourth part are as follows:

Chapter 9, Reinforcement Learning
This chapter covers concepts and case studies on reinforcement learning, which has great potential for application in the finance industry. Reinforcement learning's main idea of "maximizing the rewards" aligns perfectly with the core motivation of several areas within finance. Case studies related to trading strategies, portfolio optimization, and derivatives hedging are covered in this chapter.

Chapter 10, Natural Language Processing
This chapter describes the techniques in natural language processing and discusses the essential steps to transform textual data into meaningful representations across several areas in finance.
Case studies related to sentiment analysis, chatbots, and document interpretation are covered.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

This element indicates a blueprint.

Using Code Presented in the Book

All code in this book (case studies and master template) is available at the GitHub directory: https://github.com/tatsath/fin-ml. The code is hosted on a cloud platform, so every case study can be run without installing a package on a local machine by clicking https://mybinder.org/v2/gh/tatsath/fin-ml/master.

This book is here to help you get your job done. In general, if example code is offered, you may use it in your programs and documentation. You do not need to contact us for permission unless you're reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O'Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product's documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: Machine Learning and Data Science Blueprints for Finance by Hariom Tatsat, Sahil Puri, and Brad Lookabaugh (O'Reilly, 2021), 978-1-492-07305-5.

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

Python Libraries

The book uses Python 3.7. Installing the Conda package manager is recommended in order to create a Conda environment to install the requisite libraries. Installation instructions are available in the GitHub repo's README file.

O'Reilly Online Learning

For more than 40 years, O'Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed. Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O'Reilly's online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O'Reilly and 200+ other publishers. For more information, visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information.
You can access this page at https://oreil.ly/ML-and-data-science-blueprints.

Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For news and information about our books and courses, visit http://oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://youtube.com/oreillymedia

Acknowledgments

We want to thank all those who helped to make this book a reality. Special thanks to Jeff Bleiel for honest, insightful feedback, and for guiding us through the entire process. We are incredibly grateful to Juan Manuel Contreras, Chakri Cherukuri, and Gregory Bronner, who took time out of their busy lives to review our book in so much detail. The book benefited from their valuable feedback and suggestions. Many thanks as well to O'Reilly's fantastic staff, in particular Michelle Smith for believing in this project and helping us define its scope.

Special Thanks from Hariom
I would like to thank my wife, Prachi, and my parents for their love and support. Special thanks to my father for encouraging me in all of my pursuits and being a continuous source of inspiration.

Special Thanks from Sahil
Thanks to my family, who always encouraged and supported me in all endeavors.

Special Thanks from Brad
Thank you to my wife, Megan, for her endless love and support.

PART I. The Framework

CHAPTER 1. Machine Learning in Finance: The Landscape

Machine learning promises to shake up large swathes of finance.
—The Economist (2017)

There is a new wave of machine learning and data science in finance, and the related applications will transform the industry over the next few decades.

Currently, most financial firms, including hedge funds, investment and retail banks, and fintech firms, are adopting and investing heavily in machine learning. Going forward, financial institutions will need a growing number of machine learning and data science experts.

Machine learning in finance has become more prominent recently due to the availability of vast amounts of data and more affordable computing power. The use of data science and machine learning is exploding exponentially across all areas of finance.

The success of machine learning in finance depends upon building efficient infrastructure, using the correct toolkit, and applying the right algorithms. The concepts related to these building blocks of machine learning in finance are demonstrated and utilized throughout this book.

In this chapter, we provide an introduction to the current and future applications of machine learning in finance, including a brief overview of different types of machine learning. This chapter and the two that follow serve as the foundation for the case studies presented in the rest of the book.

Current and Future Machine Learning Applications in Finance

Let's take a look at some promising machine learning applications in finance. The case studies presented in this book cover all the applications mentioned here.

Algorithmic Trading

Algorithmic trading (or simply algo trading) is the use of algorithms to conduct trades autonomously.
With origins going back to the 1970s, algorithmic trading (sometimes called automated trading systems, which is arguably a more accurate description) involves the use of automated, preprogrammed trading instructions to make extremely fast, objective trading decisions.

Machine learning stands to push algorithmic trading to new levels. Not only can more advanced strategies be employed and adapted in real time, but machine learning–based techniques can offer even more avenues for gaining special insight into market movements. Most hedge funds and financial institutions do not openly disclose their machine learning–based approaches to trading (for good reason), but machine learning is playing an increasingly important role in calibrating trading decisions in real time.

Portfolio Management and Robo-Advisors

Asset and wealth management firms are exploring potential artificial intelligence (AI) solutions for improving their investment decisions and making use of their troves of historical data.

One example of this is the use of robo-advisors, algorithms built to calibrate a financial portfolio to the goals and risk tolerance of the user. Additionally, they provide automated financial guidance and service to end investors and clients. A user enters their financial goals (e.g., to retire at age 65 with $250,000 in savings), age, income, and current financial assets. The advisor (the allocator) then spreads investments across asset classes and financial instruments in order to reach the user's goals.

The system then calibrates to changes in the user's goals and to real-time changes in the market, aiming always to find the best fit for the user's original goals. Robo-advisors have gained significant traction among consumers who do not need a human advisor to feel comfortable investing.
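As a toy sketch of the allocation idea (entirely illustrative and not from the book; real robo-advisors are far more sophisticated), a simple rule might map the user's risk tolerance to an equity/bond split:

    def allocate(risk_tolerance):
        """Split a portfolio between equities and bonds.

        risk_tolerance: a float in [0, 1], e.g., derived from the user's
        goals and age. Returns hypothetical weights for two asset classes.
        """
        equity_weight = 0.3 + 0.6 * risk_tolerance   # 30%-90% in equities
        return {"equities": equity_weight, "bonds": 1.0 - equity_weight}

    print(allocate(0.8))   # an aggressive investor: 78% equities, 22% bonds

The machine learning–based version of this idea, in which the risk tolerance itself is modeled from investor data, is the subject of a case study in Chapter 5.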
Fraud Detection

Fraud is a massive problem for financial institutions and one of the foremost reasons to leverage machine learning in finance.

There is currently a significant data security risk due to high computing power, frequent internet use, and an increasing amount of company data being stored online. While previous financial fraud detection systems depended heavily on complex and robust sets of rules, modern fraud detection goes beyond following a checklist of risk factors; it actively learns and calibrates to new potential (or real) security threats.

Machine learning is ideally suited to combating fraudulent financial transactions. This is because machine learning systems can scan through vast datasets, detect unusual activities, and flag them instantly. Given the incalculably high number of ways that security can be breached, genuine machine learning systems will be an absolute necessity in the days to come.

Loans/Credit Card/Insurance Underwriting

Underwriting could be described as a perfect job for machine learning in finance, and indeed there is a great deal of worry in the industry that machines will replace a large swath of underwriting positions that exist today.

Especially at large companies (big banks and publicly traded insurance firms), machine learning algorithms can be trained on millions of examples of consumer data and financial lending or insurance outcomes, such as whether a person defaulted on their loan or mortgage.

Underlying financial trends can be assessed with algorithms and continuously analyzed to detect trends that might influence lending and underwriting risk in the future. Algorithms can perform automated tasks such as matching data records, identifying exceptions, and calculating whether an applicant qualifies for a credit or insurance product.

Automation and Chatbots

Automation is patently well suited to finance. It reduces the strain that repetitive, low-value tasks put on human employees. It tackles the routine, everyday processes, freeing up teams to finish their high-value work. In doing so, it drives enormous time and cost savings.

Adding machine learning and AI into the automation mix adds another level of support for employees. With access to relevant data, machine learning and AI can provide an in-depth data analysis to support finance teams with difficult decisions. In some cases, it may even be able to recommend the best course of action for employees to approve and enact.

AI and automation in the financial sector can also learn to recognize errors, reducing the time wasted between discovery and resolution. This means that human team members are less likely to be delayed in providing their reports and are able to complete their work with fewer errors.

AI chatbots can be implemented to support finance and banking customers. With the rise in popularity of live chat software in banking and finance businesses, chatbots are the natural evolution.

Risk Management

Machine learning techniques are transforming how we approach risk management. All aspects of understanding and controlling risk are being revolutionized through the growth of solutions driven by machine learning. Examples range from deciding how much a bank should lend a customer to improving compliance and reducing model risk.

Asset Price Prediction

Asset price prediction is considered the most frequently discussed and most sophisticated area in finance. Predicting asset prices allows one to understand the factors that drive the market and to speculate on asset performance. Traditionally, asset price prediction was performed by analyzing past financial reports and market performance to determine what position to take for a specific security or asset class. However, with a tremendous increase in the amount of financial data, the traditional approaches for analysis and stock-selection strategies are being supplemented with ML-based techniques.

Derivative Pricing

Recent machine learning successes, as well as the fast pace of innovation, indicate that ML applications for derivatives pricing should become widely used in the coming years. The world of Black-Scholes models, volatility smiles, and Excel spreadsheet models should wane as more advanced methods become readily available.

The classic derivative pricing models are built on several impractical assumptions to reproduce the empirical relationship between the underlying input data (strike price, time to maturity, option type) and the price of the derivatives observed in the market. Machine learning methods do not rely on these assumptions; they just try to estimate a function between the input data and price, minimizing the difference between the results of the model and the target. The faster deployment times achieved with state-of-the-art ML tools are just one of the advantages that will accelerate the use of machine learning in derivatives pricing.
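As a toy illustration of this idea (not the book's code), the sketch below generates synthetic call option prices from the Black-Scholes formula and fits a small neural network to learn the mapping from inputs to price; all parameter ranges are arbitrary assumptions:

    import numpy as np
    from scipy.stats import norm
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 10000
    S = rng.uniform(0.8, 1.2, n)      # spot price, normalized by strike (K = 1)
    T = rng.uniform(0.1, 2.0, n)      # time to maturity in years
    vol = rng.uniform(0.1, 0.5, n)    # volatility
    r = 0.02                          # assumed flat risk-free rate

    # Black-Scholes call price with strike K = 1
    d1 = (np.log(S) + (r + 0.5 * vol**2) * T) / (vol * np.sqrt(T))
    d2 = d1 - vol * np.sqrt(T)
    price = S * norm.cdf(d1) - np.exp(-r * T) * norm.cdf(d2)

    # Learn the pricing function directly from (input, price) pairs
    X = np.column_stack([S, T, vol])
    X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=0)
    model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=1000, random_state=0)
    model.fit(X_train, y_train)
    print("Out-of-sample R^2:", model.score(X_test, y_test))

A full derivative pricing case study along these lines appears in Chapter 5.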
Sentiment Analysis

Sentiment analysis involves the perusal of enormous volumes of unstructured data, such as videos, transcriptions, photos, audio files, social media posts, articles, and business documents, to determine market sentiment. Sentiment analysis is crucial for all businesses in today's workplace and is an excellent example of machine learning in finance.

The most common use of sentiment analysis in the financial sector is the analysis of financial news, in particular, predicting the behaviors and possible trends of markets. The stock market moves in response to myriad human-related factors, and the hope is that machine learning will be able to replicate and enhance human intuition about financial activity by discovering new trends and telling signals.

However, many of the future applications of machine learning will be in understanding social media, news trends, and other data sources related to predicting the sentiments of customers toward market developments. It will not be limited to predicting stock prices and trades.

Trade Settlement

Trade settlement is the process of transferring securities into the account of a buyer and cash into the seller's account following a transaction of a financial asset.

Despite the majority of trades being settled automatically, and with little or no interaction by human beings, about 30% of trades need to be settled manually. The use of machine learning not only can identify the reason for failed trades, but it also can analyze why the trades were rejected, provide a solution, and predict which trades may fail in the future. What usually would take a human being five to ten minutes to fix, machine learning can do in a fraction of a second.

Money Laundering

A United Nations report estimates that the amount of money laundered worldwide per year is 2%–5% of global GDP. Machine learning techniques can analyze internal, publicly existing, and transactional data from a client's broader network in an attempt to spot money laundering signs.

Machine Learning, Deep Learning, Artificial Intelligence, and Data Science

For the majority of people, the terms machine learning, deep learning, artificial intelligence, and data science are confusing. In fact, a lot of people use one term interchangeably with the others.

Figure 1-1 shows the relationships between AI, machine learning, deep learning, and data science. Machine learning is a subset of AI that consists of techniques that enable computers to identify patterns in data and to deliver AI applications. Deep learning, meanwhile, is a subset of machine learning that enables computers to solve more complex problems.

Data science isn't exactly a subset of machine learning, but it uses machine learning, deep learning, and AI to analyze data and reach actionable conclusions. It combines machine learning, deep learning, and AI with other disciplines such as big data analytics and cloud computing.

Figure 1-1. AI, machine learning, deep learning, and data science

The following is a summary of the details about artificial intelligence, machine learning, deep learning, and data science:

Artificial intelligence
Artificial intelligence is the field of study by which a computer (and its systems) develops the ability to successfully accomplish complex tasks that usually require human intelligence.
These tasks include, but are not limited to, visual perception, speech recognition, decision making, and translation between languages. AI is usually defined as the science of making computers do things that require intelligence when done by humans.

Machine learning
Machine learning is an application of artificial intelligence that provides the AI system with the ability to automatically learn from the environment and apply those lessons to make better decisions. There are a variety of algorithms that machine learning uses to iteratively learn, describe and improve data, spot patterns, and then perform actions on these patterns.

Deep learning
Deep learning is a subset of machine learning that involves the study of algorithms related to artificial neural networks that contain many blocks (or layers) stacked on each other. The design of deep learning models is inspired by the biological neural network of the human brain. It strives to analyze data with a logical structure similar to how a human draws conclusions.

Data science
Data science is an interdisciplinary field similar to data mining that uses scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured. Data science is different from ML and AI because its goal is to gain insight into and understanding of the data by using different scientific tools and techniques. However, there are several tools and techniques common to both ML and data science, some of which are demonstrated in this book.

Machine Learning Types

This section will outline all types of machine learning that are used in different case studies presented in this book for various financial applications. The three types of machine learning, as shown in Figure 1-2, are supervised learning, unsupervised learning, and reinforcement learning.

Figure 1-2. Machine learning types

Supervised

The main goal in supervised learning is to train a model from labeled data that allows us to make predictions about unseen or future data. Here, the term supervised refers to a set of samples where the desired output signals (labels) are already known. There are two types of supervised learning algorithms: classification and regression.

Classification
Classification is a subcategory of supervised learning in which the goal is to predict the categorical class labels of new instances based on past observations.

Regression
Regression is another subcategory of supervised learning used in the prediction of continuous outcomes. In regression, we are given a number of predictor (explanatory) variables and a continuous response variable (outcome or target), and we try to find a relationship between those variables that allows us to predict an outcome.

An example of regression versus classification is shown in Figure 1-3. The chart on the left shows an example of regression. The continuous response variable is return, and the observed values are plotted against the predicted outcomes. On the right, the outcome is a categorical class label, whether the market is bull or bear, and is an example of classification.

Figure 1-3. Regression versus classification
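As a minimal sketch of the difference (illustrative, not from the book), the same features can drive either task depending on how the target is defined; here the continuous target is a hypothetical return and the class label is simply its sign:

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 3))                      # three hypothetical features
    returns = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=500)
    labels = (returns > 0).astype(int)                 # 1 = bull, 0 = bear

    reg = LinearRegression().fit(X, returns)           # regression: continuous target
    clf = LogisticRegression().fit(X, labels)          # classification: class labels
    print(reg.predict(X[:2]), clf.predict(X[:2]))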
Unsupervised

Unsupervised learning is a type of machine learning used to draw inferences from datasets consisting of input data without labeled responses. There are two types of unsupervised learning: dimensionality reduction and clustering.

Dimensionality reduction
Dimensionality reduction is the process of reducing the number of features, or variables, in a dataset while preserving information and overall model performance. It is a common and powerful way to deal with datasets that have a large number of dimensions.

Figure 1-4 illustrates this concept, where the dimension of data is converted from two dimensions (X1 and X2) to one dimension (Z1). Z1 conveys similar information embedded in X1 and X2 and reduces the dimension of the data.

Figure 1-4. Dimensionality reduction

Clustering
Clustering is a subcategory of unsupervised learning techniques that allows us to discover hidden structures in data. The goal of clustering is to find a natural grouping in data so that items in the same cluster are more similar to each other than to those from different clusters.

An example of clustering is shown in Figure 1-5, where we can see the entire data clustered into two distinct groups by the clustering algorithm.

Figure 1-5. Clustering
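A brief, illustrative sketch of both techniques on random data (the dataset and parameter choices are arbitrary assumptions, not the book's code):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(2)
    X = rng.normal(size=(300, 10))             # 300 samples with 10 features

    Z = PCA(n_components=2).fit_transform(X)   # dimensionality reduction: 10 -> 2
    clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
    print(Z.shape, np.bincount(clusters))      # reduced shape and cluster sizes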
Reinforcement Learning

Learning from experiences, and the associated rewards or punishments, is the core concept behind reinforcement learning (RL). It is about taking suitable actions to maximize reward in a particular situation. The learning system, called an agent, can observe the environment, select and perform actions, and receive rewards (or penalties in the form of negative rewards) in return, as shown in Figure 1-6.

Reinforcement learning differs from supervised learning in this way: In supervised learning, the training data has the answer key, so the model is trained with the correct answers available. In reinforcement learning, there is no explicit answer. The learning system (agent) decides what to do to perform the given task and learns whether that was a correct action based on the reward. The algorithm determines the answer key through its experience.

Figure 1-6. Reinforcement learning

The steps of reinforcement learning are as follows:

1. First, the agent interacts with the environment by performing an action.
2. Then the agent receives a reward based on the action it performed.
3. Based on the reward, the agent receives an observation and understands whether the action was good or bad. If the action was good (that is, if the agent received a positive reward), the agent will prefer performing that action. If the reward was less favorable, the agent will try performing another action to receive a positive reward. It is basically a trial-and-error learning process.

Natural Language Processing

Natural language processing (NLP) is a branch of AI that deals with the problems of making a machine understand the structure and the meaning of natural language as used by humans. Several techniques of machine learning and deep learning are used within NLP.

NLP has many applications in the finance sector in areas such as sentiment analysis, chatbots, and document processing. A lot of information, such as sell-side reports, earnings calls, and newspaper headlines, is communicated as text, making NLP quite useful in the financial domain.

Given the extensive application of NLP algorithms based on machine learning in finance, there is a separate chapter of this book (Chapter 10) dedicated to NLP and related case studies.

Chapter Summary

Machine learning is making significant inroads across all the verticals of the financial services industry. This chapter covered different applications of machine learning in finance, from algorithmic trading to robo-advisors. These applications will be covered in the case studies later in this book.

Next Steps

In terms of the platforms used for machine learning, the Python ecosystem is growing and is one of the most dominant programming languages for machine learning. In the next chapter, we will learn about the model development steps, from data preparation to model deployment, in a Python-based framework.

CHAPTER 2. Developing a Machine Learning Model in Python

In terms of the platforms used for machine learning, there are many algorithms and programming languages. However, the Python ecosystem is one of the most dominant and fastest-growing programming languages for machine learning.

Given the popularity and high adoption rate of Python, we will use it as the main programming language throughout the book. This chapter provides an overview of a Python-based machine learning framework. First, we will review the details of Python-based packages used for machine learning, followed by the model development steps in the Python framework.

The steps of model development in Python presented in this chapter serve as the foundation for the case studies presented in the rest of the book. The Python framework can also be leveraged while developing any machine learning–based model in finance.

Why Python?

Some reasons for Python's popularity are as follows:

• High-level syntax (compared to lower-level languages such as C, Java, and C++). Applications can be developed by writing fewer lines of code, making Python attractive to beginners and advanced programmers alike.
• Efficient development lifecycle.
• Large collection of community-managed, open-source libraries.
• Strong portability.

The simplicity of Python has attracted many developers to create new libraries for machine learning, leading to strong adoption of Python.

Python Packages for Machine Learning

The main Python packages used for machine learning are highlighted in Figure 2-1.

Figure 2-1. Python packages

Here is a brief summary of each of these packages:

NumPy (https://numpy.org)
Provides support for large, multidimensional arrays as well as an extensive collection of mathematical functions.

Pandas (https://pandas.pydata.org)
A library for data manipulation and analysis. Among other features, it offers data structures to handle tables and the tools to manipulate them.

Matplotlib (https://matplotlib.org)
A plotting library that allows the creation of 2D charts and plots.

SciPy (https://www.scipy.org)
The combination of NumPy, Pandas, and Matplotlib is generally referred to as SciPy. SciPy is an ecosystem of Python libraries for mathematics, science, and engineering.

Scikit-learn (https://scikit-learn.org), or sklearn
A machine learning library offering a wide range of algorithms and utilities.

StatsModels (https://www.statsmodels.org)
A Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration.

TensorFlow (https://www.tensorflow.org) and Theano (http://deeplearning.net/software/theano)
Dataflow programming libraries that facilitate working with neural networks.

Keras (https://keras.io)
An artificial neural network library that can act as a simplified interface to TensorFlow/Theano packages.

Seaborn (https://seaborn.pydata.org)
A data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

pip (https://pypi.org/project/pip) and Conda (https://docs.conda.io/en/latest)
These are Python package managers. pip is a package manager that facilitates installation, upgrade, and uninstallation of Python packages. Conda is a package manager that handles Python packages as well as library dependencies outside of the Python packages.
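As an illustration (not from the book), either package manager can install the core libraries above from the command line:

    $ pip install numpy pandas matplotlib scipy scikit-learn statsmodels seaborn
    $ conda install numpy pandas matplotlib scipy scikit-learn statsmodels seaborn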
Python and Package Installation

There are different ways of installing Python. However, it is strongly recommended that you install Python through Anaconda (https://www.anaconda.com). Anaconda contains Python, SciPy, and Scikit-learn.

After installing Anaconda, a Jupyter server can be started locally by opening the machine's terminal and typing in the following code:

    $ jupyter notebook

All code samples in this book use Python 3 and are presented in Jupyter notebooks. Several Python packages, especially Scikit-learn and Keras, are extensively used in the case studies.

Steps for Model Development in Python Ecosystem

Working through machine learning problems from end to end is critically important. Applied machine learning will not come alive unless the steps from beginning to end are well defined.

Figure 2-2 provides an outline of the simple seven-step machine learning project template that can be used to jump-start any machine learning model in Python. The first few steps include exploratory data analysis and data preparation, which are typical data science–based steps aimed at extracting meaning and insights from data. These steps are followed by model evaluation, fine-tuning, and finalizing the model.

Figure 2-2. Model development steps

All the case studies in this book follow the standard seven-step model development process. However, there are a few case studies in which some of the steps are skipped, renamed, or reordered based on the appropriateness and intuitiveness of the steps.

Model Development Blueprint

The following section covers the details of each model development step with supporting Python code.

1. Problem definition

The first step in any project is defining the problem. Powerful algorithms can be used for solving the problem, but the results will be meaningless if the wrong problem is solved.

The following framework should be used for defining the problem:

1. Describe the problem informally and formally. List assumptions and similar problems.
2. List the motivation for solving the problem, the benefits a solution provides, and how the solution will be used.
3. Describe how the problem would be solved using the domain knowledge.

2. Loading the data and packages

The second step gives you everything needed to start working on the problem. This includes loading libraries, packages, and individual functions needed for the model development.

2.1. Load libraries. A sample code for loading libraries is as follows:

    # Load libraries
    import pandas as pd
    from matplotlib import pyplot

The details of the libraries and modules for specific functionalities are defined further in the individual case studies.

2.2. Load data.
The following items should be checked and removed before loading the data:

• Column headers
• Comments or special characters
• Delimiter

There are many ways of loading data. Some of the most common ways are as follows:

Load CSV files with Pandas:

    from pandas import read_csv
    filename = 'xyz.csv'
    names = ['col1', 'col2']   # column names for the file (illustrative)
    data = read_csv(filename, names=names)

Load a file from a URL:

    from pandas import read_csv
    url = 'https://goo.gl/vhm1eU'
    names = ['age', 'class']
    data = read_csv(url, names=names)

Load files using pandas_datareader:

    import pandas_datareader.data as web

    # the stock ticker list is missing in this excerpt; these are illustrative
    stk_tickers = ['MSFT', 'IBM', 'GOOGL']
    ccy_tickers = ['DEXJPUS', 'DEXUSUK']
    idx_tickers = ['SP500', 'DJIA', 'VIXCLS']

    stk_data = web.DataReader(stk_tickers, 'yahoo')
    ccy_data = web.DataReader(ccy_tickers, 'fred')
    idx_data = web.DataReader(idx_tickers, 'fred')

3. Exploratory data analysis

In this step, we look at the dataset.

3.1. Descriptive statistics. Understanding the dataset is one of the most important steps of model development. The steps to understanding data include:

1. Viewing the raw data.
2. Reviewing the dimensions of the dataset.
3. Reviewing the data types of attributes.
4. Summarizing the distribution, descriptive statistics, and relationship among the variables in the dataset.

These steps are demonstrated below using sample Python code, where dataset refers to a Pandas DataFrame loaded in the previous step (the sample outputs in this section are drawn from different example datasets):

Viewing the data:

    from pandas import set_option
    set_option('display.width', 100)
    dataset.head(1)

Output:

       Age   Sex  Job Housing SavingAccounts CheckingAccount CreditAmount Duration   Purpose  Risk
    0   67  male    2     own            NaN          little         1169        6  radio/TV  good

Reviewing the dimensions of the dataset:

    dataset.shape

Output:

    (284807, 31)

The results show the dimensions of the dataset and mean that the dataset has 284,807 rows and 31 columns.

Reviewing the data types of the attributes in the data:

    # types
    set_option('display.max_rows', 500)
    dataset.dtypes

Summarizing the data using descriptive statistics:

    # describe data
    set_option('precision', 3)
    dataset.describe()

Output:

                Age       Job  CreditAmount  Duration
    count  1000.000  1000.000      1000.000  1000.000
    mean     35.546     1.904      3271.258    20.903
    std      11.375     0.654      2822.737    12.059
    min      19.000     0.000       250.000     4.000
    25%      27.000     2.000      1365.500    12.000
    50%      33.000     2.000      2319.500    18.000
    75%      42.000     2.000      3972.250    24.000
    max      75.000     3.000     18424.000    72.000

3.2. Data visualization. The fastest way to learn more about the data is to visualize it. Visualization involves independently understanding each attribute of the dataset. Some of the plot types are as follows:

Univariate plots
Histograms and density plots

Multivariate plots
Correlation matrix plot and scatterplot

The Python code for univariate plot types is illustrated with examples below:

Univariate plot: histogram:

    from matplotlib import pyplot
    dataset.hist(sharex=False, sharey=False, xlabelsize=1, ylabelsize=1,
                 figsize=(10,4))
    pyplot.show()

Univariate plot: density plot:

    from matplotlib import pyplot
    dataset.plot(kind='density', subplots=True, layout=(3,3), sharex=False,
                 legend=True, fontsize=1, figsize=(10,4))
    pyplot.show()

Figure 2-3 illustrates the output.

Figure 2-3. Histogram (top) and density plot (bottom)
The Python code for multivariate plot types is illustrated with examples below:

Multivariate plot: correlation matrix plot:

    from matplotlib import pyplot
    import seaborn as sns
    correlation = dataset.corr()
    pyplot.figure(figsize=(5,5))
    pyplot.title('Correlation Matrix')
    sns.heatmap(correlation, vmax=1, square=True, annot=True, cmap='cubehelix')

Multivariate plot: scatterplot matrix:

    from pandas.plotting import scatter_matrix
    scatter_matrix(dataset)

Figure 2-4 illustrates the output.

Figure 2-4. Correlation (left) and scatterplot (right)

4. Data preparation

Data preparation is a preprocessing step in which data from one or more sources is cleaned and transformed to improve its quality prior to its use.

4.1. Data cleaning. In machine learning modeling, incorrect data can be costly. Data cleaning involves checking the following:

Validity
The data type, range, etc.

Accuracy
The degree to which the data is close to the true values.

Completeness
The degree to which all required data is known.

Uniformity
The degree to which the data is specified using the same unit of measure.

The different options for performing data cleaning include:

Dropping "NA" values within data:

    dataset.dropna(axis=0)

Filling "NA" with 0:

    dataset.fillna(0)

Filling NAs with the mean of the column:

    dataset['col'] = dataset['col'].fillna(dataset['col'].mean())

4.2. Feature selection. The data features used to train the machine learning models have a huge influence on the performance. Irrelevant or partially relevant features can negatively impact model performance. Feature selection (more relevant for supervised learning models, and described in detail in the individual case studies in Chapters 5 and 6) is a process in which the features in the data that contribute most to the prediction variable or output are automatically selected.

The benefits of performing feature selection before modeling the data are:

Reduces overfitting (overfitting is covered in detail in Chapter 4)
Less redundant data means fewer opportunities for the model to make decisions based on noise.

Improves performance
Less misleading data means improved modeling performance.

Reduces training time and memory footprint
Less data means faster training and a lower memory footprint.

The following sample code demonstrates selecting the two best features using the SelectKBest function under sklearn. The SelectKBest function scores the features using an underlying scoring function and then removes all but the k highest-scoring features:

    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2

    bestfeatures = SelectKBest(score_func=chi2, k=5)
    fit = bestfeatures.fit(X, Y)
    dfscores = pd.DataFrame(fit.scores_)
    dfcolumns = pd.DataFrame(X.columns)
    featureScores = pd.concat([dfcolumns, dfscores], axis=1)
    featureScores.columns = ['Specs', 'Score']   # name the columns used below
    print(featureScores.nlargest(2, 'Score'))    # print 2 best features

Output:

           Specs      Score
    2  Variable1  58262.490
    3  Variable2    321.031

When features are irrelevant, they should be dropped. Dropping the irrelevant features is illustrated in the following sample code:

    # dropping the old features
    dataset.drop(['Feature1', 'Feature2', 'Feature3'], axis=1, inplace=True)

4.3. Data transformation. Many machine learning algorithms make assumptions about the data.
It is a good practice to perform the data preparation in such a way that exposes the data in the best possible manner to the machine learning algorithms. This can be accomplished through data transformation. The different data transformation approaches are as follows:

Rescaling
When data comprises attributes with varying scales, many machine learning algorithms can benefit from rescaling all the attributes to the same scale. Attributes are often rescaled to the range between zero and one. This is useful for the optimization algorithms used in the core of machine learning algorithms, and it also helps to speed up the calculations in an algorithm:

    from sklearn.preprocessing import MinMaxScaler
    scaler = MinMaxScaler(feature_range=(0, 1))
    rescaledX = pd.DataFrame(scaler.fit_transform(X))

Standardization
Standardization is a useful technique to transform attributes to a standard normal distribution with a mean of zero and a standard deviation of one. It is most suitable for techniques that assume the input variables represent a normal distribution:

    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler().fit(X)
    StandardisedX = pd.DataFrame(scaler.fit_transform(X))

Normalization
Normalization refers to rescaling each observation (row) to have a length of one (called a unit norm, or a vector with a length of one). This preprocessing method can be useful for sparse datasets with attributes of varying scales when using algorithms that weight input values:

    from sklearn.preprocessing import Normalizer
    scaler = Normalizer().fit(X)
    NormalizedX = pd.DataFrame(scaler.fit_transform(X))

5. Evaluate models

Once we estimate the performance of our algorithm, we can retrain the final algorithm on the entire training dataset and get it ready for operational use. The best way to do this is to evaluate the performance of the algorithm on a new dataset. Different machine learning techniques require different evaluation metrics. Other than model performance, several other factors, such as simplicity, interpretability, and training time, are considered when selecting a model. The details regarding these factors are covered in Chapter 4.

5.1. Training and test split. The simplest method we can use to evaluate the performance of a machine learning algorithm is to use different training and testing datasets. We can take our original dataset and split it into two parts: train the algorithm on the first part, make predictions on the second part, and evaluate the predictions against the expected results. The size of the split can depend on the size and specifics of the dataset, although it is common to use 80% of the data for training and the remaining 20% for testing. The differences in the training and test datasets can result in meaningful differences in the estimate of accuracy. The data can easily be split into the training and test sets using the train_test_split function available in sklearn:

    # split out validation dataset for the end
    validation_size = 0.2
    seed = 7
    X_train, X_validation, Y_train, Y_validation = \
        train_test_split(X, Y, test_size=validation_size, random_state=seed)

5.2. Identify evaluation metrics. Choosing which metric to use to evaluate machine learning algorithms is very important. An important aspect of evaluation metrics is the capability to discriminate among model results. Different types of evaluation metrics used for different kinds of ML models are covered in detail across several chapters of this book.
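As a brief illustration (not from the book), sklearn exposes most of the common metrics directly; a regression model might be scored with RMSE and a classification model with accuracy or ROC AUC:

    import numpy as np
    from sklearn.metrics import mean_squared_error, accuracy_score, roc_auc_score

    y_true = np.array([1.2, 0.8, 1.5, 0.4])      # hypothetical continuous targets
    y_pred = np.array([1.0, 0.9, 1.4, 0.6])
    print(np.sqrt(mean_squared_error(y_true, y_pred)))   # RMSE

    labels = np.array([1, 0, 1, 1])              # hypothetical class labels
    scores = np.array([0.9, 0.3, 0.6, 0.8])      # predicted probabilities
    print(accuracy_score(labels, scores > 0.5))  # accuracy at a 0.5 threshold
    print(roc_auc_score(labels, scores))         # ranking quality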
Different types of evaluation metrics used for different kinds of ML models are covered in detail across several chapters of this book.

5.3. Compare models and algorithms. Selecting a machine learning model or algorithm is both an art and a science. There is no single solution or approach that fits all. There are several factors over and above model performance that can impact the decision to choose a machine learning algorithm.

Let's understand the process of model comparison with a simple example. We define two variables, X and Y, and try to build a model to predict Y using X. As a first step, the data is divided into training and test sets as mentioned in the preceding section:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

validation_size = 0.2
seed = 7
np.random.seed(seed)
X = 2 - 3 * np.random.normal(0, 1, 20)
Y = X - 2 * (X ** 2) + 0.5 * (X ** 3) + np.exp(-X) + np.random.normal(-3, 3, 20)
# transforming the data to include another axis
X = X[:, np.newaxis]
Y = Y[:, np.newaxis]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,\
    test_size=validation_size, random_state=seed)

We have no idea which algorithms will do well on this problem. Let's design our test now. We will use two models, one linear regression and the second polynomial regression, to fit Y against X. We will evaluate the algorithms using the Root Mean Squared Error (RMSE) metric, which is one of the measures of model performance. RMSE gives a gross idea of how wrong all predictions are (zero is perfect):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures

model = LinearRegression()
model.fit(X_train, Y_train)
Y_pred = model.predict(X_train)
rmse_lin = np.sqrt(mean_squared_error(Y_train,Y_pred))
r2_lin = r2_score(Y_train,Y_pred)
print("RMSE for Linear Regression:", rmse_lin)

polynomial_features = PolynomialFeatures(degree=2)
x_poly = polynomial_features.fit_transform(X_train)
model = LinearRegression()
model.fit(x_poly, Y_train)
Y_poly_pred = model.predict(x_poly)
rmse = np.sqrt(mean_squared_error(Y_train,Y_poly_pred))
r2 = r2_score(Y_train,Y_poly_pred)
print("RMSE for Polynomial Regression:", rmse)

Output

RMSE for Linear Regression: 6.772942423315028
RMSE for Polynomial Regression: 6.420495127266883

We can see that the RMSE of the polynomial regression is slightly better than that of the linear regression. (Note that the difference in RMSE is small in this case and may not replicate with a different split of the train/test data.) Since the polynomial regression has the better fit, it is the preferred model at this step.

6. Model tuning

Finding the best combination of hyperparameters of a model can be treated as a search problem. (Hyperparameters are the external characteristics of the model; they can be considered the model's settings and, unlike model parameters, are not estimated from the data.) This searching exercise is often known as model tuning and is one of the most important steps of model development. It is achieved by searching for the best parameters of the model by using techniques such as a grid search. In a grid search, you create a grid of all possible hyperparameter combinations and train the model using each one of them.
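A grid search over the polynomial degree can also be expressed directly in sklearn by combining PolynomialFeatures and LinearRegression in a Pipeline. The following is a minimal sketch of this idea (an illustration only; the text below instead loops over the degrees manually, and the scorer name assumes a recent version of scikit-learn):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Expose the polynomial degree as a searchable hyperparameter
pipe = Pipeline([('poly', PolynomialFeatures()), ('reg', LinearRegression())])
param_grid = {'poly__degree': [1, 2, 3, 6, 10]}
grid = GridSearchCV(pipe, param_grid=param_grid,
                    scoring='neg_root_mean_squared_error', cv=3)
grid.fit(X_train, Y_train)
print(grid.best_params_)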
Besides a grid search, there are several other techniques for model tuning, including randomized search, Bayesian optimization, and Hyperband. In the case studies presented in this book, we focus primarily on grid search for model tuning.

Continuing the preceding example, with the polynomial regression as the best model, we next run a grid search by refitting the polynomial regression with different degrees and comparing the RMSE results for all the models:

Deg = [1,2,3,6,10]
results = []
names = []
for deg in Deg:
    polynomial_features = PolynomialFeatures(degree=deg)
    x_poly = polynomial_features.fit_transform(X_train)
    model = LinearRegression()
    model.fit(x_poly, Y_train)
    Y_poly_pred = model.predict(x_poly)
    rmse = np.sqrt(mean_squared_error(Y_train,Y_poly_pred))
    r2 = r2_score(Y_train,Y_poly_pred)
    results.append(rmse)
    names.append(deg)
plt.plot(names, results,'o')
plt.suptitle('Algorithm Comparison')

Output

(The output is a plot of training RMSE against polynomial degree, titled "Algorithm Comparison.")

The RMSE decreases when the degree increases, and the lowest RMSE is for the model with degree 10. However, models with degrees lower than 10 performed very well, and the test set will be used to finalize the best model.

While the generic set of input parameters for each algorithm provides a starting point for analysis, it may not be the optimal configuration for the particular dataset and business problem.

7. Finalize the model

Here, we perform the final steps for selecting the model. First, we run predictions on the test dataset with the trained model. Then we try to understand the model intuition and save it for further usage.

7.1. Performance on the test set. The model selected during the training steps is further evaluated on the test set. The test set allows us to compare different models in an unbiased way, by basing the comparisons on data that was not used in any part of the training. The test results for the model developed in the previous step are shown in the following example:

Deg = [1,2,3,6,8,10]
results_test = []
names_test = []
for deg in Deg:
    polynomial_features = PolynomialFeatures(degree=deg)
    x_poly = polynomial_features.fit_transform(X_train)
    model = LinearRegression()
    model.fit(x_poly, Y_train)
    x_poly_test = polynomial_features.fit_transform(X_test)
    Y_poly_pred_test = model.predict(x_poly_test)
    rmse = np.sqrt(mean_squared_error(Y_test,Y_poly_pred_test))
    r2 = r2_score(Y_test,Y_poly_pred_test)
    results_test.append(rmse)
    names_test.append(deg)
plt.plot(names_test, results_test,'o')
plt.suptitle('Algorithm Comparison')

Output

(The output is a plot of test RMSE against polynomial degree, titled "Algorithm Comparison.")

In the training set we saw that the RMSE decreases with an increase in the degree of the polynomial model, and the polynomial of degree 10 had the lowest RMSE. However, as shown in the preceding output for the polynomial of degree 10, although the training set had the best results, the results in the test set are poor. For the polynomial of degree 8, the RMSE in the test set is relatively high as well. The polynomial of degree 6 shows the best result in the test set (although the difference is small compared to other lower-degree polynomials in the test set) as well as good results in the training set. For these reasons, this is the preferred model.

In addition to model performance, there are several other factors to consider when selecting a model, such as simplicity, interpretability, and training time. These factors will be covered in the upcoming chapters.

7.2. Model/variable intuition.
This step involves considering a holistic view of the approach taken to solve the problem, including the model's limitations as they relate to the desired outcome, the variables used, and the selected model parameters. Details on model and variable intuition for different types of machine learning models are presented in the subsequent chapters and case studies.

7.3. Save/deploy. After finding an accurate machine learning model, it must be saved so that it can be loaded and used later.

Pickle is one of the packages for saving and loading a trained model in Python. Using pickle operations, trained machine learning models can be saved in a serialized format to a file. Later, this serialized file can be loaded to deserialize the model for its usage. The following sample code demonstrates how to save the model to a file and load it to make predictions on new data:

# Save Model Using Pickle
from pickle import dump
from pickle import load

# save the model to disk
filename = 'finalized_model.sav'
dump(model, open(filename, 'wb'))

# load the model from disk
loaded_model = load(open(filename, 'rb'))

In recent years, frameworks such as AutoML have been built to automate as many steps of the machine learning model development process as possible. Such frameworks allow model developers to build ML models at scale, with high efficiency and productivity. Readers are encouraged to explore such frameworks.

Chapter Summary

Given its popularity, rate of adoption, and flexibility, Python is often the preferred language for machine learning development. There are many available Python packages to perform numerous tasks, including data cleaning, visualization, and model development. Some of these key packages are Scikit-learn and Keras. The seven steps of model development mentioned in this chapter can be leveraged while developing any machine learning-based model in finance.

Next Steps

In the next chapter, we will cover a key algorithm family for machine learning: the artificial neural network. Artificial neural networks are another building block of machine learning in finance and are used across all types of machine learning and deep learning algorithms.

CHAPTER 3

Artificial Neural Networks

There are many different types of models used in machine learning. However, one class of machine learning models that stands out is artificial neural networks (ANNs). Given that artificial neural networks are used across all machine learning types, this chapter will cover the basics of ANNs.

ANNs are computing systems based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it.

Deep learning involves the study of complex ANN-related algorithms. The complexity is attributed to the elaborate patterns of how information flows throughout the model. Deep learning has the ability to represent the world as a nested hierarchy of concepts, with each concept defined in relation to a simpler concept. Deep learning techniques are extensively used in the reinforcement learning and natural language processing applications that we will look at in Chapters 9 and 10.
We will review the detailed terminology and processes used in the field of ANNs and cover the following topics (readers are encouraged to refer to the book Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville [MIT Press] for more details on ANNs and deep learning):

• Architecture of ANNs: Neurons and layers
• Training an ANN: Forward propagation, backpropagation, and gradient descent
• Hyperparameters of ANNs: Number of layers and nodes, activation function, loss function, learning rate, etc.
• Defining and training a deep neural network-based model in Python
• Improving the training speed of ANNs and deep learning models

ANNs: Architecture, Training, and Hyperparameters

ANNs contain multiple neurons arranged in layers. An ANN goes through a training phase by comparing the modeled output to the desired output, where it learns to recognize patterns in data. Let us go through the components of ANNs.

Architecture

An ANN architecture comprises neurons, layers, and weights.

Neurons

The building blocks for ANNs are neurons (also known as artificial neurons, nodes, or perceptrons). Neurons have one or more inputs and one output. It is possible to build a network of neurons to compute complex logical propositions. Activation functions in these neurons create complicated, nonlinear functional mappings between the inputs and the output. (Activation functions are described in detail later in this chapter.) As shown in Figure 3-1, a neuron takes an input (x1, x2, ..., xn), applies the learning parameters to generate a weighted sum (z), and then passes that sum to an activation function (f) that computes the output f(z).

Figure 3-1. An artificial neuron

Layers

The output f(z) from a single neuron (as shown in Figure 3-1) will not be able to model complex tasks. So, in order to handle more complex structures, we have multiple layers of such neurons. As we keep stacking neurons horizontally and vertically, the class of functions we can represent becomes increasingly complex. Figure 3-2 shows an architecture of an ANN with an input layer, an output layer, and a hidden layer.

Figure 3-2. Neural network architecture

Input layer. The input layer takes input from the dataset and is the exposed part of the network. A neural network is often drawn with an input layer of one neuron per input value (or column) in the dataset. The neurons in the input layer simply pass the input value through to the next layer.

Hidden layers. Layers after the input layer are called hidden layers because they are not directly exposed to the input. The simplest network structure is to have a single neuron in the hidden layer that directly outputs the value. A multilayer ANN is capable of solving more complex machine learning-related tasks due to its hidden layer(s). Given increases in computing power and efficient libraries, neural networks with many layers can be constructed. ANNs with many hidden layers (more than three) are known as deep neural networks. Multiple hidden layers allow deep neural networks to learn features of the data in a so-called feature hierarchy, because simple features recombine from one layer to the next to form more complex features. ANNs with many layers pass input data (features) through more mathematical operations than do ANNs with few layers and are therefore more computationally intensive to train.
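To make the stacking of neurons concrete, here is a minimal NumPy sketch of a forward pass through one hidden layer (the input values, random weights, and the choice of a sigmoid activation are all illustrative assumptions):

import numpy as np

def sigmoid(z):
    # Squashes the weighted sum into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # one input observation with 3 features
W1 = np.random.randn(4, 3)       # weights: 4 hidden neurons, 3 inputs
b1 = np.zeros(4)                 # biases for the hidden layer
W2 = np.random.randn(1, 4)       # weights: 1 output neuron, 4 hidden inputs
b2 = np.zeros(1)

hidden = sigmoid(W1 @ x + b1)    # each hidden neuron: activation(weighted sum)
output = sigmoid(W2 @ hidden + b2)  # the output layer repeats the operation
print(output)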
Output layer. The final layer is called the output layer; it is responsible for outputting a value, or a vector of values, in the format required to solve the problem.

Neuron weights

A neuron weight represents the strength of the connection between units and measures the influence the input will have on the output. If the weight from neuron one to neuron two has greater magnitude, it means that neuron one has a greater influence over neuron two. Weights near zero mean changing this input will not change the output. Negative weights mean increasing this input will decrease the output.

Training

Training a neural network basically means calibrating all of the weights in the ANN. This optimization is performed using an iterative approach involving forward propagation and backpropagation steps.

Forward propagation

Forward propagation is the process of feeding input values to the neural network and getting an output, which we call the predicted value. When we feed the input values to the neural network's first layer, they pass through without any operations. The second layer takes values from the first layer and applies multiplication, addition, and activation operations before passing the resulting value to the next layer. The same process repeats for any subsequent layers until an output value from the last layer is received.

Backpropagation

After forward propagation, we get a predicted value from the ANN. Suppose the desired output of a network is Y and the predicted value of the network from forward propagation is Y′. The difference between the predicted output and the desired output (Y − Y′) is converted into the loss (or cost) function J(w), where w represents the weights in the ANN. (There are many available loss functions, discussed in the next section; the nature of our problem dictates our choice of loss function.) The goal is to optimize the loss function (i.e., make the loss as small as possible) over the training set.

The optimization method used is gradient descent. The goal of the gradient descent method is to find the gradient of J(w) with respect to w at the current point and take a small step in the direction of the negative gradient until the minimum value is reached, as shown in Figure 3-3.

Figure 3-3. Gradient descent

In an ANN, the function J(w) is essentially a composition of multiple layers, as explained in the preceding text. So, if layer one is represented as function p(), layer two as q(), and layer three as r(), then the overall function is J(w) = r(q(p())), where w consists of all the weights in all three layers. We want to find the gradient of J(w) with respect to each component of w.

Skipping the mathematical details, the above essentially implies, by the chain rule, that the gradient of a component w in the first layer will depend on the gradients in the second and third layers. Similarly, the gradients in the second layer will depend on the gradients in the third layer. Therefore, we start computing the derivatives in the reverse direction, starting with the last layer, and use backpropagation to compute the gradients of the previous layer.

Overall, in the process of backpropagation, the model error (the difference between predicted and desired output) is propagated back through the network, one layer at a time, and the weights are updated according to the amount they contributed to the error. Almost all ANNs use gradient descent and backpropagation; backpropagation is one of the cleanest and most efficient ways to find the gradient.
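As a minimal illustration of the gradient descent update itself, consider a toy one-parameter loss J(w) = (w − 3)² rather than a real network loss:

# Gradient descent on J(w) = (w - 3)**2, whose minimum is at w = 3
w = 0.0                        # initial weight
learning_rate = 0.1
for step in range(50):
    grad = 2 * (w - 3)         # dJ/dw at the current point
    w -= learning_rate * grad  # small step along the negative gradient
print(w)                       # approximately 3.0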
Hyperparameters

Hyperparameters are the variables that are set before the training process, and they cannot be learned during training. ANNs have a large number of hyperparameters, which makes them quite flexible. However, this flexibility makes the model tuning process difficult. Understanding the hyperparameters and the intuition behind them helps give an idea of what values are reasonable for each hyperparameter so we can restrict the search space. Let's start with the number of hidden layers and nodes.

Number of hidden layers and nodes

More hidden layers or nodes per layer means more parameters in the ANN, allowing the model to fit more complex functions. To have a trained network that generalizes well, we need to pick an optimal number of hidden layers, as well as of the nodes in each hidden layer. Too few nodes and layers will lead to high errors for the system, as the predictive factors might be too complex for a small number of nodes to capture. Too many nodes and layers will overfit to the training data and not generalize well.

There is no hard-and-fast rule to decide the number of layers and nodes. The number of hidden layers primarily depends on the complexity of the task. Very complex tasks, such as large image classification or speech recognition, typically require networks with dozens of layers and a huge amount of training data. For the majority of problems, we can start with just one or two hidden layers and then gradually ramp up the number of hidden layers until we start overfitting the training set.

The number of hidden nodes should have a relationship to the number of input and output nodes, the amount of training data available, and the complexity of the function being modeled. As a rule of thumb, the number of hidden nodes in each layer should be somewhere between the size of the input layer and the size of the output layer, ideally the mean. The number of hidden nodes shouldn't exceed twice the number of input nodes in order to avoid overfitting.

Learning rate

When we train ANNs, we use many iterations of forward propagation and backpropagation to optimize the weights. At each iteration we calculate the derivative of the loss function with respect to each weight and subtract a step, scaled by the learning rate, from that weight. The learning rate determines how quickly or slowly we want to update our weight (parameter) values. It should be high enough that training converges in a reasonable amount of time, yet low enough that it finds the minimum value of the loss function.
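Continuing the toy loss from the previous sketch, a quick illustration of how the learning rate changes convergence (the three values are illustrative):

# Same toy loss J(w) = (w - 3)**2 with three different learning rates
for lr in [0.01, 0.1, 1.1]:
    w = 0.0
    for step in range(50):
        w -= lr * 2 * (w - 3)
    # lr=0.01 moves slowly toward 3, lr=0.1 converges, lr=1.1 diverges
    print(f"learning rate {lr}: w = {w:.4f}")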
Activation functions

Activation functions (as shown in Figure 3-1) refer to the functions applied to the weighted sum of inputs in ANNs to get the desired output. Activation functions allow the network to combine the inputs in more complex ways, and they provide a richer capability in the relationships they can model and the output they can produce. They decide which neurons will be activated—that is, what information is passed to further layers. Without activation functions, ANNs lose a bulk of their representation learning power. There are several activation functions. The most widely used are as follows:

Linear (identity) function
Represented by the equation of a straight line (i.e., f(x) = mx + c), where the activation is proportional to the input. If we have many layers, and all the layers are linear in nature, then the network as a whole collapses to a single linear function of its input; stacking linear layers adds no expressive power. The range of a linear function is −inf to +inf.

Sigmoid function
Refers to a function that is projected as an S-shaped graph (as shown in Figure 3-4). It is represented by the mathematical equation f(x) = 1 / (1 + e^(−x)) and ranges from 0 to 1. A large positive input results in an output close to one; a large negative input results in an output close to zero. It is also referred to as the logistic activation function.

Tanh function
Similar to the sigmoid activation function, with the mathematical equation Tanh(x) = 2 Sigmoid(2x) − 1, where Sigmoid represents the sigmoid function discussed above. The output of this function ranges from −1 to 1, with an equal mass on both sides of the zero axis, as shown in Figure 3-4.

ReLU function
ReLU stands for Rectified Linear Unit and is represented as f(x) = max(x, 0). So, if the input is a positive number, the function returns the number itself, and if the input is a negative number, the function returns zero. It is the most commonly used function because of its simplicity.

Figure 3-4 shows a summary of the activation functions discussed in this section.

Figure 3-4. Activation functions

There is no hard-and-fast rule for activation function selection. The decision completely relies on the properties of the problem and the relationships being modeled. We can try different activation functions and select the one that helps provide faster convergence and a more efficient training process. The choice of activation function in the output layer is strongly constrained by the type of problem that is modeled. (Deriving a regression or classification output by changing the activation function of the output layer is described further in Chapter 4.)

Cost functions

Cost functions (also known as loss functions) are a measure of ANN performance, measuring how well the ANN fits empirical data. The two most common cost functions are:

Mean squared error (MSE)
This is the cost function used primarily for regression problems, where the output is a continuous value. MSE is measured as the average of the squared difference between predictions and actual observations. MSE is described further in Chapter 4.

Cross-entropy (or log loss)
This cost function is used primarily for classification problems, where the output is a probability value between zero and one. Cross-entropy loss increases as the predicted probability diverges from the actual label. A perfect model would have a cross-entropy of zero.
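As a minimal sketch of both cost functions using sklearn (the arrays are illustrative):

import numpy as np
from sklearn.metrics import mean_squared_error, log_loss

# MSE for a regression output: average squared difference
y_true = np.array([1.5, 2.0, 3.5])
y_pred = np.array([1.4, 2.3, 3.0])
print('MSE:', mean_squared_error(y_true, y_pred))

# Cross-entropy (log loss) for a classification output:
# grows as predicted probabilities diverge from the actual labels
labels = np.array([1, 0, 1])
probs = np.array([0.9, 0.2, 0.7])
print('Cross-entropy:', log_loss(labels, probs))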
Optimizers

Optimizers update the weight parameters to minimize the loss function. (Refer to https://oreil.ly/FSt-8 for more details on optimization.) The cost function acts as a guide to the terrain, telling the optimizer whether it is moving in the right direction to reach the global minimum. Some of the common optimizers are as follows:

Momentum
The momentum optimizer looks at previous gradients in addition to the current step. It will take larger steps if the previous updates and the current update move the weights in the same direction (gaining momentum). It will take smaller steps if the direction of the gradient is opposite. A clever way to visualize this is to think of a ball rolling down a valley; it will gain momentum as it approaches the valley bottom.

AdaGrad (Adaptive Gradient Algorithm)
AdaGrad adapts the learning rate to the parameters, performing smaller updates for parameters associated with frequently occurring features and larger updates for parameters associated with infrequent features.

RMSProp
RMSProp stands for Root Mean Square Propagation. In RMSProp, the learning rate gets adjusted automatically, and a different learning rate is chosen for each parameter.

Adam (Adaptive Moment Estimation)
Adam combines the best properties of the AdaGrad and RMSProp algorithms and is one of the most popular gradient descent optimization algorithms.

Epoch

One round of updating the network for the entire training dataset is called an epoch. A network may be trained for tens, hundreds, or many thousands of epochs depending on the data size and computational constraints.

Batch size

The batch size is the number of training examples in one forward/backward pass. A batch size of 32 means that 32 samples from the training dataset will be used to estimate the error gradient before the model weights are updated. The higher the batch size, the more memory space is needed.

Creating an Artificial Neural Network Model in Python

In Chapter 2 we discussed the steps for end-to-end model development in Python. In this section, we dig deeper into the steps involved in building an ANN-based model in Python. Our first step will be to look at Keras, the Python package specifically built for ANNs and deep learning.

Installing Keras and Machine Learning Packages

There are several Python libraries that allow building ANN and deep learning models easily and quickly without getting into the details of the underlying algorithms. Keras is one of the most user-friendly packages; it enables efficient numerical computation related to ANNs. Using Keras, complex deep learning models can be defined and implemented in a few lines of code. We will primarily be using Keras packages for implementing deep learning models in several of the book's case studies. Keras is simply a wrapper around more complex numerical computation engines such as TensorFlow and Theano. In order to install Keras, TensorFlow or Theano needs to be installed first.

This section describes the steps to define and compile an ANN-based model in Keras, with a focus on the following steps. (The steps and Python code for implementing deep learning models using Keras, as demonstrated in this section, are used in several case studies in the subsequent chapters.)

Importing the packages

Before you can start to build an ANN model, you need to import two modules from the Keras package: Sequential and Dense:

from keras.models import Sequential
from keras.layers import Dense
import numpy as np

Loading data

This example makes use of the random module of NumPy to quickly generate some data and labels to be used by the ANN that we build in the next step. Specifically, an array with size (1000, 10) is first constructed. Next, we create a labels array that consists of zeros and ones, with size (1000, 1):

data = np.random.random((1000,10))
Y = np.random.randint(2, size=(1000,1))
Model construction—defining the neural network architecture

A quick way to get started is to use the Keras Sequential model, which is a linear stack of layers. We create a Sequential model and add layers one at a time until the network topology is finalized. The first thing to get right is to ensure the input layer has the right number of inputs. We can specify this when creating the first layer by using a dense (fully connected) layer with the input_dim argument. We add a layer to the model with the add() function, specifying the number of nodes in each layer. Finally, another dense layer is added as an output layer.

The architecture for the model shown in Figure 3-5 is as follows:

• The model expects rows of data with 10 variables (the input_dim=10 argument).
• The first hidden layer has 32 nodes and uses the relu activation function.
• The second hidden layer has 32 nodes and uses the relu activation function.
• The output layer has one node and uses the sigmoid activation function.

Figure 3-5. An ANN architecture

The Python code for the network in Figure 3-5 is shown below:

model = Sequential()
model.add(Dense(32, input_dim=10, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

Compiling the model

With the model constructed, it can be compiled with the help of the compile() function. Compiling the model leverages the efficient numerical libraries in the Theano or TensorFlow packages. When compiling, it is important to specify the additional properties required when training the network. Training a network means finding the best set of weights to make predictions for the problem at hand. So we must specify the loss function used to evaluate a set of weights, the optimizer used to search through different weights for the network, and any optional metrics we would like to collect and report during training.

In the following example, we use the cross-entropy loss function, which is defined in Keras as binary_crossentropy. We will also use the adam optimizer, which is the default option. Finally, because it is a classification problem, we will collect and report the classification accuracy as the metric. (A detailed discussion of the evaluation metrics for classification models is presented in Chapter 4.) The Python code follows:

model.compile(loss='binary_crossentropy', optimizer='adam', \
    metrics=['accuracy'])

Fitting the model

With our model defined and compiled, it is time to execute it on data. We can train or fit our model on our loaded data by calling the fit() function on the model. The training process will run for a fixed number of iterations (epochs) through the dataset, specified using the epochs argument. We can also set the number of instances that are evaluated before a weight update in the network is performed; this is set using the batch_size argument. For this problem we will run a small number of epochs (10) and use a relatively small batch size of 32. Again, these can be chosen experimentally through trial and error. The Python code follows:

model.fit(data, Y, epochs=10, batch_size=32)

Evaluating the model

We have trained our neural network on the entire dataset and can evaluate the performance of the network on the same dataset. This will give us an idea of how well we have modeled the dataset (e.g., training accuracy) but will not provide insight on how well the algorithm will perform on new data. For this, we separate the data into training and test datasets.
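A minimal sketch of that split for the synthetic data above (sklearn's train_test_split is one option; the 80/20 ratio and seed are assumptions):

from sklearn.model_selection import train_test_split

# Hold out 20% of the synthetic data for unbiased evaluation
X_train, X_test, Y_train, Y_test = train_test_split(
    data, Y, test_size=0.2, random_state=7)

# Refit the compiled model on the training portion only
model.fit(X_train, Y_train, epochs=10, batch_size=32)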
The model is then evaluated on the test dataset using the evaluate() function. This will generate a prediction for each input and output pair and collect scores, including the average loss and any metrics configured, such as accuracy. The Python code follows:

scores = model.evaluate(X_test, Y_test)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

Running an ANN Model Faster: GPU and Cloud Services

For training ANNs (especially deep neural networks with many layers), a large amount of computation power is required. CPUs, or Central Processing Units, are responsible for processing and executing instructions on a local machine. Since CPUs are limited in the number of cores and process jobs sequentially, they cannot do the rapid matrix computations on the large number of matrices required for training deep learning models. Hence, the training of deep learning models can be extremely slow on CPUs.

The following alternatives are useful for running ANNs that generally require a significant amount of time to run on a CPU:

• Running notebooks locally on a GPU.
• Running notebooks on Kaggle Kernels or Google Colaboratory.
• Using Amazon Web Services.

GPU

A GPU is composed of hundreds of cores that can handle thousands of threads simultaneously. Running ANNs and deep learning models can be accelerated by the use of GPUs. GPUs are particularly adept at processing complex matrix operations. The GPU cores are highly specialized, and they massively accelerate processes such as deep learning training by offloading the processing from CPUs to the cores in the GPU subsystem. All the Python packages related to machine learning, including TensorFlow, Theano, and Keras, can be configured for the use of GPUs.

Cloud services such as Kaggle and Google Colab

If you have a GPU-enabled computer, you can run ANNs locally. If you do not, we recommend you use a service such as Kaggle Kernels, Google Colab, or AWS:

Kaggle
A popular data science website owned by Google that hosts a Jupyter service, also referred to as Kaggle Kernels. Kaggle Kernels are free to use and come with the most frequently used packages preinstalled. You can connect a kernel to any dataset hosted on Kaggle, or, alternatively, you can just upload a new dataset on the fly.

Google Colaboratory
A free Jupyter Notebook environment provided by Google where you can use free GPUs. The features of Google Colaboratory are the same as Kaggle's.

Amazon Web Services (AWS)
AWS Deep Learning provides an infrastructure to accelerate deep learning in the cloud, at any scale. You can quickly launch AWS server instances preinstalled with popular deep learning frameworks and interfaces to train sophisticated, custom AI models, experiment with new algorithms, or learn new skills and techniques. These web servers can run longer than Kaggle Kernels, so for big projects it might be worth using AWS instead of a kernel.

Chapter Summary

ANNs comprise a family of algorithms used across all types of machine learning. These models are inspired by the biological neural networks, containing neurons and layers of neurons, that constitute animal brains. ANNs with many layers are referred to as deep neural networks. Several steps, including forward propagation and backpropagation, are required for training these networks.
Python packages such as Keras make the training of ANNs possible in a few lines of code. The training of deep neural networks requires more computational power, and CPUs alone might not be enough. Alternatives include using a GPU or a cloud service such as Kaggle Kernels, Google Colaboratory, or Amazon Web Services for training deep neural networks.

Next Steps

As a next step, we will go into the details of the machine learning concepts for supervised learning, followed by case studies using the concepts covered in this chapter.

PART II

Supervised Learning

CHAPTER 4

Supervised Learning: Models and Concepts

Supervised learning is an area of machine learning where the chosen algorithm tries to fit a target using the given input. A set of training data that contains labels is supplied to the algorithm. Based on a massive set of data, the algorithm will learn a rule that it uses to predict the labels for new observations. In other words, supervised learning algorithms are provided with historical data and asked to find the relationship that has the best predictive power.

There are two varieties of supervised learning algorithms: regression and classification algorithms. Regression-based supervised learning methods try to predict outputs based on input variables. Classification-based supervised learning methods identify which category a set of data items belongs to. Classification algorithms are probability-based: the predicted outcome is the category to which the algorithm assigns the highest probability of containing the data. Regression algorithms, in contrast, estimate the outcome of problems that have an infinite number of solutions (a continuous set of possible outcomes).

In the context of finance, supervised learning models represent one of the most-used classes of machine learning models. Many algorithms that are widely applied in algorithmic trading rely on supervised learning models because they can be efficiently trained, they are relatively robust to noisy financial data, and they have strong links to the theory of finance.

Regression-based algorithms have been leveraged by academic and industry researchers to develop numerous asset pricing models. These models are used to predict returns over various time periods and to identify significant factors that drive asset returns. There are many other use cases of regression-based supervised learning in portfolio management and derivatives pricing.

Classification-based algorithms, on the other hand, have been leveraged across many areas within finance that require predicting a categorical response. These include fraud detection, default prediction, credit scoring, directional forecasting of asset price movement, and Buy/Sell recommendations. There are many other use cases of classification-based supervised learning in portfolio management and algorithmic trading.

Many use cases of regression-based and classification-based supervised machine learning are presented in Chapters 5 and 6.

Python and its libraries provide methods and ways to implement these supervised learning models in a few lines of code. Some of these libraries were covered in Chapter 2. With easy-to-use machine learning libraries like Scikit-learn and Keras, it is straightforward to fit different machine learning models on a given predictive modeling dataset.

In this chapter, we present a high-level overview of supervised learning models.
For a thorough coverage of the topics, the reader is referred to Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, by Aurélien Géron (O'Reilly). The following topics are covered in this chapter:

• Basic concepts of supervised learning models (both regression and classification).
• How to implement different supervised learning models in Python.
• How to tune the models and identify the optimal parameters of the models using grid search.
• Overfitting versus underfitting and bias versus variance.
• Strengths and weaknesses of several supervised learning models.
• How to use ensemble models, ANN, and deep learning models for both regression and classification.
• How to select a model on the basis of several factors, including model performance.
• Evaluation metrics for classification and regression models.
• How to perform cross validation.

Supervised Learning Models: An Overview

Classification predictive modeling problems are different from regression predictive modeling problems: classification is the task of predicting a discrete class label, and regression is the task of predicting a continuous quantity. However, both share the same concept of utilizing known variables to make predictions, and there is a significant overlap between the two types of models. Hence, the models for classification and regression are presented together in this chapter. Figure 4-1 summarizes the list of models commonly used for classification and regression.

Some models can be used for both classification and regression with small modifications. These are K-nearest neighbors, decision trees, support vector machines, ensemble bagging/boosting methods, and ANNs (including deep neural networks), as shown in Figure 4-1. However, some models, such as linear regression and logistic regression, cannot (or cannot easily) be used for both problem types.

Figure 4-1. Models for regression and classification

This section contains the following details about the models:

• Theory of the models.
• Implementation in Scikit-learn or Keras.
• Grid search for different models.
• Pros and cons of the models.

In finance, a key focus is on models that extract signals from previously observed data in order to predict future values for the same time series. This family of time series models predicts continuous output and is more aligned with the supervised regression models. Time series models are covered separately in the supervised regression chapter (Chapter 5).

Linear Regression (Ordinary Least Squares)

Linear regression (ordinary least squares regression, or OLS regression) is perhaps one of the most well-known and best-understood algorithms in statistics and machine learning. Linear regression is a linear model, i.e., a model that assumes a linear relationship between the input variables (x) and the single output variable (y). The goal of linear regression is to train a linear model to predict a new y given a previously unseen x with as little error as possible. Our model will be a function that predicts y given x1, x2, ..., xi:

y = β0 + β1x1 + ... + βixi

where β0 is called the intercept and β1, ..., βi are the coefficients of the regression.

Implementation in Python

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, Y)
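To make this concrete, here is a minimal sketch fitting OLS on synthetic data (the true intercept of 2 and slope of 3 are assumptions of the toy example):

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2 + 3x plus Gaussian noise
rng = np.random.RandomState(7)
X = rng.rand(100, 1)
Y = 2 + 3 * X[:, 0] + rng.normal(0, 0.1, 100)

model = LinearRegression().fit(X, Y)
# The learned parameters should be close to the true values
print('intercept:', model.intercept_)  # ~2
print('coefficient:', model.coef_)     # ~[3]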
In the following section, we cover the training of a linear regression model and a grid search of the model. However, the overall concepts and related approaches are applicable to all other supervised learning models.

Training a model

As we mentioned in Chapter 3, training a model basically means retrieving the model parameters by minimizing the cost (loss) function. The two steps for training a linear regression model are:

Define a cost function (or loss function)
The cost function measures how inaccurate the model's predictions are. The residual sum of squares (RSS), as defined in Equation 4-1, measures the squared sum of the differences between the actual and predicted values and is the cost function for linear regression.

Equation 4-1. Sum of squared residuals

RSS = Σᵢ₌₁ⁿ (yᵢ − β₀ − Σⱼ₌₁ᵖ βⱼxᵢⱼ)²

In this equation, β₀ is the intercept, β₁, ..., βₚ are the coefficients of the regression, and xᵢⱼ represents the ith observation of the jth variable, with n observations and p variables.

Find the parameters that minimize loss
That is, make our model as accurate as possible. Graphically, in two dimensions, this results in a line of best fit, as shown in Figure 4-2. In higher dimensions, we would have higher-dimensional hyperplanes. Mathematically, we look at the difference between each real data point (y) and our model's prediction (ŷ). Square these differences to avoid negative numbers and penalize larger differences, and then add them up and take the average. This is a measure of how well our data fits the line.

Figure 4-2. Linear regression

Grid search

The overall idea of grid search is to create a grid of all possible hyperparameter combinations and train the model using each one of them. Hyperparameters are the external characteristics of the model; they can be considered the model's settings and, unlike model parameters, are not estimated from the data. These hyperparameters are tuned during grid search to achieve better model performance.

Due to its exhaustive search, a grid search is guaranteed to find the optimal parameter combination within the grid. The drawback is that the size of the grid grows exponentially with the addition of more parameters or more considered values.

The GridSearchCV class in the model_selection module of the sklearn package facilitates the systematic evaluation of all combinations of the hyperparameter values that we would like to test.

The first step is to create a model object. We then define a dictionary where the keywords name the hyperparameters and the values list the parameter settings to be tested. For linear regression, the hyperparameter is fit_intercept, which is a boolean variable that determines whether or not to calculate the intercept for this model. If set to False, no intercept will be used in calculations:

model = LinearRegression()
param_grid = {'fit_intercept': [True, False]}

The second step is to instantiate the GridSearchCV object and provide the estimator object and parameter grid, as well as a scoring method and cross validation choice, to the initialization method. Cross validation is a resampling procedure used to evaluate machine learning models, and the scoring parameter is the evaluation metric of the model. (Cross validation will be covered in detail later in this chapter.) With all settings in place, we can fit GridSearchCV:

grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='r2', \
    cv=kfold)
grid_result = grid.fit(X, Y)
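Note that kfold in the snippet above must be defined before use. A minimal sketch (the five-fold choice is an assumption), along with how the fitted results can be inspected:

from sklearn.model_selection import KFold

kfold = KFold(n_splits=5)  # one reasonable choice for the cv argument above

# After fitting, the best setting and its score can be inspected
print(grid_result.best_params_)  # e.g., {'fit_intercept': True}
print(grid_result.best_score_)   # mean cross validated r2 of the best setting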
Advantages and disadvantages

In terms of advantages, linear regression is easy to understand and interpret. However, it may not work well when there is a nonlinear relationship between the predicted and predictor variables. Linear regression is prone to overfitting (which we will discuss in the next section), and when a large number of features are present, it may not handle irrelevant features well. Linear regression also requires the data to follow certain assumptions, such as the absence of multicollinearity. If the assumptions fail, then we cannot trust the results obtained.

Regularized Regression

When a linear regression model contains many independent variables, their coefficients will be poorly determined, and the model will have a tendency to fit extremely well to the training data (the data used to build the model) but fit poorly to testing data (the data used to test how good the model is). This is known as overfitting or high variance.

One popular technique to control overfitting is regularization, which involves the addition of a penalty term to the error or loss function to discourage the coefficients from reaching large values. Regularization, in simple terms, is a penalty mechanism that applies shrinkage to model parameters (driving them closer to zero) in order to build a model with higher prediction accuracy and better interpretation. Regularized regression has two advantages over linear regression:

Prediction accuracy
The model working better on the testing data suggests that the model is trying to generalize from the training data. A model with too many parameters might try to fit noise specific to the training data. By shrinking or setting some coefficients to zero, we trade off the ability to fit complex models (higher bias) for a more generalizable model (lower variance).

Interpretation
A large number of predictors may complicate the interpretation or communication of the big picture of the results. It may be preferable to sacrifice some detail to limit the model to a smaller subset of parameters with the strongest effects.

The common ways to regularize a linear regression model are as follows:

L1 regularization or Lasso regression
Lasso regression performs L1 regularization by adding a penalty based on the sum of the absolute values of the coefficients to the cost function (RSS) for linear regression, as given in Equation 4-1. The equation for lasso regularization can be represented as follows:

CostFunction = RSS + λ × Σⱼ₌₁ᵖ |βⱼ|

L1 regularization can lead to zero coefficients (i.e., some of the features are completely neglected for the evaluation of output). The larger the value of λ, the more features are shrunk to zero. This can eliminate some features entirely and give us a subset of predictors, reducing model complexity. So lasso regression not only helps in reducing overfitting, but also can help in feature selection. Predictors not shrunk toward zero signify that they are important, and thus L1 regularization allows for feature selection (sparse selection). The regularization parameter (λ) can be controlled, and a λ value of zero produces the basic linear regression equation.

A lasso regression model can be constructed using the Lasso class of the sklearn package of Python, as shown in the code snippet that follows:

from sklearn.linear_model import Lasso
model = Lasso()
model.fit(X, Y)
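In sklearn, the λ parameter of Lasso is exposed as alpha. A minimal sketch of tuning it with a grid search (the candidate values are illustrative):

from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Search over the regularization strength (sklearn's alpha plays the role of λ)
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(Lasso(), param_grid=param_grid, scoring='r2', cv=5)
grid.fit(X, Y)
print(grid.best_params_)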
L2 regularization or Ridge regression
Ridge regression performs L2 regularization by adding a penalty based on the sum of the squares of the coefficients to the cost function (RSS) for linear regression, as given in Equation 4-1. The equation for ridge regularization can be represented as follows:

CostFunction = RSS + λ × Σⱼ₌₁ᵖ βⱼ²

Ridge regression puts a constraint on the coefficients. The penalty term (λ) regularizes the coefficients such that if the coefficients take large values, the optimization function is penalized. So ridge regression shrinks the coefficients and helps to reduce the model complexity. Shrinking the coefficients leads to a lower variance and a lower error value. Therefore, ridge regression decreases the complexity of a model but does not reduce the number of variables; it just shrinks their effect. When λ is closer to zero, the cost function becomes similar to the linear regression cost function. So the lower the constraint (low λ) on the features, the more the model will resemble the linear regression model.

A ridge regression model can be constructed using the Ridge class of the sklearn package of Python, as shown in the code snippet that follows:

from sklearn.linear_model import Ridge
model = Ridge()
model.fit(X, Y)

Elastic net
Elastic nets add a regularization term to the model that is a combination of both L1 and L2 regularization, as shown in the following equation:

CostFunction = RSS + λ × ((1 − α)/2 × Σⱼ₌₁ᵖ βⱼ² + α × Σⱼ₌₁ᵖ |βⱼ|)

In addition to setting and choosing a λ value, an elastic net also allows us to tune the alpha parameter, where α = 0 corresponds to ridge and α = 1 to lasso. Therefore, we can choose an α value between 0 and 1 to optimize the elastic net. Effectively, this will shrink some coefficients and set some to 0 for sparse selection.

An elastic net regression model can be constructed using the ElasticNet class of the sklearn package of Python, as shown in the following code snippet:

from sklearn.linear_model import ElasticNet
model = ElasticNet()
model.fit(X, Y)

For all the regularized regressions, λ is the key parameter to tune during grid search in Python. In an elastic net, α can be an additional parameter to tune.

Logistic Regression

Logistic regression is one of the most widely used algorithms for classification. The logistic regression model arises from the desire to model the probabilities of the output classes given a function that is linear in x, at the same time ensuring that the output probabilities sum up to one and remain between zero and one, as we would expect from probabilities.

If we train a linear regression model on several examples where Y = 0 or 1, we might end up predicting some probabilities that are less than zero or greater than one, which doesn't make sense.
Instead, we use a logistic regression model (or logit model), which is a modification of linear regression that makes sure to output a probability between zero and one by applying the sigmoid function. (See the activation function section of Chapter 3 for details on the sigmoid function.)

Equation 4-2 shows the equation for a logistic regression model. Similar to linear regression, input values (x) are combined linearly using weights or coefficient values to predict an output value (y). The output coming from Equation 4-2 is a probability that is transformed into a binary value (0 or 1) to get the model prediction.

Equation 4-2. Logistic regression equation

y = exp(β₀ + β₁x₁ + ... + βᵢxᵢ) / (1 + exp(β₀ + β₁x₁ + ... + βᵢxᵢ))

where y is the predicted output, β₀ is the bias or intercept term, and β₁ is the coefficient for the single input value (x₁). Each column in the input data has an associated β coefficient (a constant real value) that must be learned from the training data.

In logistic regression, the cost function is basically a measure of how often we predicted one when the true answer was zero, or vice versa. Training the logistic regression coefficients is done using techniques such as maximum likelihood estimation (MLE) to predict values close to 1 for the default class and close to 0 for the other class. (MLE is a method of estimating the parameters of a probability distribution so that, under the assumed statistical model, the observed data is most probable.)

A logistic regression model can be constructed using the LogisticRegression class of the sklearn package of Python, as shown in the following code snippet:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, Y)

Hyperparameters

Regularization (penalty in sklearn)
Similar to linear regression, logistic regression can have regularization, which can be L1, L2, or elastic net. The values in the sklearn library are [l1, l2, elasticnet].

Regularization strength (C in sklearn)
This parameter controls the regularization strength. Good values of the penalty parameter can be [100, 10, 1.0, 0.1, 0.01].

Advantages and disadvantages

In terms of advantages, the logistic regression model is easy to implement, has good interpretability, and performs very well on linearly separable classes. The output of the model is a probability, which provides more insight and can be used for ranking. The model has a small number of hyperparameters. Although there may be a risk of overfitting, this may be addressed using L1/L2 regularization, similar to the way we addressed overfitting for the linear regression models.

In terms of disadvantages, the model may overfit when provided with large numbers of features. Logistic regression can only learn linear functions and is less suitable for complex relationships between features and the target variable. Also, it may not handle irrelevant features well, especially if the features are strongly correlated.
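Since the model outputs probabilities, these can be accessed directly. A minimal sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, Y = make_classification(n_samples=100, n_features=4, random_state=7)
model = LogisticRegression().fit(X, Y)

# Class probabilities for the first observation: one column per class
print(model.predict_proba(X[:1]))  # e.g., [[0.2, 0.8]]
print(model.predict(X[:1]))        # the higher-probability class label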
Support Vector Machine

The objective of the support vector machine (SVM) algorithm is to maximize the margin (shown as the shaded area in Figure 4-3), which is defined as the distance between the separating hyperplane (or decision boundary) and the training samples that are closest to this hyperplane, the so-called support vectors. The margin is calculated as the perpendicular distance from the line to only the closest points, as shown in Figure 4-3. Hence, SVM calculates a maximum-margin boundary that leads to a homogeneous partition of all data points.

Figure 4-3. Support vector machine

In practice, the data is messy and cannot be separated perfectly with a hyperplane. The constraint of maximizing the margin of the line that separates the classes must therefore be relaxed. This change allows some points in the training data to violate the separating line. An additional set of coefficients is introduced that gives the margin wiggle room in each dimension. A tuning parameter is introduced, simply called C, that defines the magnitude of the wiggle allowed across all dimensions. The larger the value of C, the more violations of the hyperplane are permitted.

In some cases, it is not possible to find a hyperplane or a linear decision boundary, and kernels are used. A kernel is just a transformation of the input data that allows the SVM algorithm to treat/process the data more easily. Using kernels, the original data is projected into a higher dimension to classify the data better.

SVM is used for both classification and regression. We achieve this by converting the original optimization problem into a dual problem. For regression, the trick is to reverse the objective. Instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM regression tries to fit as many instances as possible on the street (the shaded area in Figure 4-3) while limiting margin violations. The width of the street is controlled by a hyperparameter.

The SVM regression and classification models can be constructed using the sklearn package of Python, as shown in the following code snippets:

Regression

from sklearn.svm import SVR
model = SVR()
model.fit(X, Y)

Classification

from sklearn.svm import SVC
model = SVC()
model.fit(X, Y)

Hyperparameters

The following key parameters are present in the sklearn implementation of SVM and can be tweaked while performing the grid search:

Kernels (kernel in sklearn)
The choice of kernel controls the manner in which the input variables will be projected. There are many kernels to choose from, but linear and RBF are the most common.

Penalty (C in sklearn)
The penalty parameter tells the SVM optimization how much to avoid misclassifying each training example. For large values of the penalty parameter, the optimization will choose a smaller-margin hyperplane. Good values might be on a log scale from 10 to 1,000.

Advantages and disadvantages

In terms of advantages, SVM is fairly robust against overfitting, especially in higher-dimensional space. It handles nonlinear relationships quite well, with many kernels to choose from. Also, there is no distributional requirement for the data.

In terms of disadvantages, SVM can be inefficient to train and memory-intensive to run and tune. It doesn't perform well with large datasets. It requires feature scaling of the data. There are also many hyperparameters, and their meanings are often not intuitive.
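A minimal sketch of a grid search over these two hyperparameters (the synthetic data and candidate values follow the guidance above and are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, Y = make_classification(n_samples=200, n_features=4, random_state=7)

param_grid = {'kernel': ['linear', 'rbf'],
              'C': [10, 100, 1000]}
grid = GridSearchCV(SVC(), param_grid=param_grid, scoring='accuracy', cv=5)
grid.fit(X, Y)
print(grid.best_params_)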
K-Nearest Neighbors

K-nearest neighbors (KNN) is considered a "lazy learner," as there is no learning required in the model. For a new data point, predictions are made by searching through the entire training set for the K most similar instances (the neighbors) and summarizing the output variable for those K instances. To determine which of the K instances in the training dataset are most similar to a new input, a distance measure is used. The most popular distance measure is Euclidean distance, which is calculated as the square root of the sum of the squared differences between a point a and a point b across all input attributes i:

    d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}

Euclidean distance is a good distance measure to use if the input variables are similar in type. Another distance metric is Manhattan distance, in which the distance between point a and point b is:

    d(a, b) = \sum_{i=1}^{n} |a_i - b_i|

Manhattan distance is a good measure to use if the input variables are not similar in type.

The steps of KNN can be summarized as follows:

1. Choose the number of K and a distance metric.
2. Find the K-nearest neighbors of the sample that we want to classify.
3. Assign the class label by majority vote.

KNN regression and classification models can be constructed using the sklearn package of Python, as shown in the following code:

Classification

    from sklearn.neighbors import KNeighborsClassifier
    model = KNeighborsClassifier()
    model.fit(X, Y)

Regression

    from sklearn.neighbors import KNeighborsRegressor
    model = KNeighborsRegressor()
    model.fit(X, Y)

Hyperparameters

The following key parameters are present in the sklearn implementation of KNN and can be tweaked while performing the grid search:

Number of neighbors (n_neighbors in sklearn)
    The most important hyperparameter for KNN is the number of neighbors (n_neighbors). Good values are between 1 and 20.

Distance metric (metric in sklearn)
    It may also be interesting to test different distance metrics for choosing the composition of the neighborhood. Good values are euclidean and manhattan.

Advantages and disadvantages

In terms of advantages, no training is involved and hence there is no learning phase. Since the algorithm requires no training before making predictions, new data can be added seamlessly without impacting the accuracy of the algorithm. It is intuitive and easy to understand. The model naturally handles multiclass classification and can learn complex decision boundaries. KNN is effective if the training data is large. It is also robust to noisy data, and there is no need to filter the outliers.

In terms of disadvantages, the distance metric to choose is not obvious and difficult to justify in many cases. KNN performs poorly on high-dimensional datasets. It is expensive and slow to predict new instances because the distance to all neighbors must be recalculated. KNN is sensitive to noise in the dataset. We need to manually input missing values and remove outliers. Also, feature scaling (standardization and normalization) is required before applying the KNN algorithm to any dataset; otherwise, KNN may generate wrong predictions.
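As a minimal sketch of tuning the two KNN hyperparameters above, again on an assumed synthetic dataset and with the feature scaling the text calls for:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import StandardScaler

    X, Y = make_classification(n_samples=500, n_features=10, random_state=0)
    X = StandardScaler().fit_transform(X)  # KNN requires feature scaling

    param_grid = {'n_neighbors': list(range(1, 21)),
                  'metric': ['euclidean', 'manhattan']}
    grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
    grid.fit(X, Y)
    print(grid.best_params_)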
Linear Discriminant Analysis

The objective of the linear discriminant analysis (LDA) algorithm is to project the data onto a lower-dimensional space in a way that the class separability is maximized and the variance within a class is minimized. (The approach of projecting data is similar to the PCA algorithm discussed in Chapter 7.)

During the training of the LDA model, the statistical properties (i.e., mean and covariance matrix) of each class are computed. The statistical properties are estimated on the basis of the following assumptions about the data:

• Data is normally distributed, so that each variable is shaped like a bell curve when plotted.
• Each attribute has the same variance, and the values of each variable vary around the mean by the same amount on average.

To make a prediction, LDA estimates the probability that a new set of inputs belongs to every class. The output class is the one that has the highest probability.

Implementation in Python and hyperparameters

The LDA classification model can be constructed using the sklearn package of Python, as shown in the following code snippet:

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    model = LinearDiscriminantAnalysis()
    model.fit(X, Y)

The key hyperparameter for the LDA model is the number of components for dimensionality reduction, which is represented by n_components in sklearn.

Advantages and disadvantages

In terms of advantages, LDA is a relatively simple model that is fast and easy to implement. In terms of disadvantages, it requires feature scaling and involves complex matrix operations.

Classification and Regression Trees

In the most general terms, the purpose of an analysis via tree-building algorithms is to determine a set of if-then logical (split) conditions that permit accurate prediction or classification of cases. Classification and regression trees (CART, or decision tree classifiers) are attractive models if we care about interpretability. We can think of this model as breaking down our data and making a decision based on asking a series of questions. This algorithm is the foundation of ensemble methods such as random forest and the gradient boosting method.

Representation

The model can be represented by a binary tree (or decision tree), where each node is an input variable x with a split point and each leaf contains an output variable y for prediction. Figure 4-4 shows an example of a simple classification tree to predict whether a person is a male or a female based on two inputs of height (in centimeters) and weight (in kilograms).

Figure 4-4. Classification and regression tree example

Learning a CART model

Creating a binary tree is actually a process of dividing up the input space. A greedy approach called recursive binary splitting is used to divide the space. This is a numerical procedure in which all the values are lined up and different split points are tried and tested using a cost (loss) function. The split with the best cost (lowest cost, because we minimize cost) is selected. All input variables and all possible split points are evaluated and chosen in a greedy manner (i.e., the very best split point is chosen each time).

For regression predictive modeling problems, the cost function that is minimized to choose split points is the sum of squared errors across all training samples that fall within the rectangle:

    \sum_{i=1}^{n} (y_i - prediction_i)^2

where y_i is the output for the training sample and prediction_i is the predicted output for the rectangle. For classification, the Gini cost function is used; it provides an indication of how pure the leaf nodes are (i.e., how mixed the training data assigned to each node is) and is defined as:

    G = \sum_{k=1}^{K} p_k (1 - p_k)

where G is the Gini cost over all K classes and p_k is the proportion of training instances with class k in the rectangle of interest. A node that has all instances of the same class (perfect class purity) will have G = 0, while a node that has a 50-50 split of classes for a binary classification problem (worst purity) will have G = 0.5.
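The Gini calculation is simple enough to verify by hand. The following few lines are an illustrative sketch of our own (not from the original text) that reproduces the two boundary cases just mentioned:

    def gini(proportions):
        """Gini cost for a node, given the class proportions in that node."""
        return sum(p * (1 - p) for p in proportions)

    print(gini([1.0, 0.0]))  # 0.0 -> perfect class purity
    print(gini([0.5, 0.5]))  # 0.5 -> worst purity for a binary problem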
Stopping criterion

The recursive binary splitting procedure described in the preceding section needs to know when to stop splitting as it works its way down the tree with the training data. The most common stopping procedure is to use a minimum count on the number of training instances assigned to each leaf node. If the count is less than some minimum, then the split is not accepted and the node is taken as a final leaf node.

Pruning the tree

The stopping criterion is important, as it strongly influences the performance of the tree. Pruning can be used after learning the tree to further lift performance. The complexity of a decision tree is defined as the number of splits in the tree. Simpler trees are preferred, as they are faster to run and easy to understand, consume less memory during processing and storage, and are less likely to overfit the data. The fastest and simplest pruning method is to work through each leaf node in the tree and evaluate the effect of removing it using a test set. A leaf node is removed only if doing so results in a drop in the overall cost function on the entire test set. The removal of nodes can be stopped when no further improvements can be made.

Implementation in Python

CART regression and classification models can be constructed using the sklearn package of Python, as shown in the following code snippet:

Classification

    from sklearn.tree import DecisionTreeClassifier
    model = DecisionTreeClassifier()
    model.fit(X, Y)

Regression

    from sklearn.tree import DecisionTreeRegressor
    model = DecisionTreeRegressor()
    model.fit(X, Y)

Hyperparameters

CART has many hyperparameters. However, the key hyperparameter is the maximum depth of the tree model, which is represented by max_depth in the sklearn package. Good values can range from 2 to 30 depending on the number of features in the data.

Advantages and disadvantages

In terms of advantages, CART is easy to interpret and can adapt to learn complex relationships. It requires little data preparation, and data typically does not need to be scaled. Feature importance is built in due to the way decision nodes are built. It performs well on large datasets. It works for both regression and classification problems.

In terms of disadvantages, CART is prone to overfitting unless pruning is used. It can be very nonrobust, meaning that small changes in the training dataset can lead to quite major differences in the hypothesis function that gets learned. CART generally has worse performance than ensemble models, which are covered next.
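Before moving on, here is a minimal sketch of tuning the max_depth hyperparameter over the range suggested above, using an assumed synthetic dataset:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X, Y = make_classification(n_samples=500, n_features=10, random_state=0)

    param_grid = {'max_depth': list(range(2, 31))}
    grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
    grid.fit(X, Y)
    print(grid.best_params_)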
Ensemble Models

The goal of ensemble models is to combine different classifiers into a meta-classifier that has better generalization performance than each individual classifier alone. For example, assuming that we collected predictions from 10 experts, ensemble methods would allow us to strategically combine their predictions to come up with a prediction that is more accurate and robust than the experts' individual predictions.

The two most popular ensemble methods are bagging and boosting. Bagging (or bootstrap aggregation) is an ensemble technique of training several individual models in a parallel way, each on a random subset of the data. Boosting, on the other hand, is an ensemble technique of training several individual models in a sequential way. This is done by building a model from the training data and then creating a second model that attempts to correct the errors of the first model. Models are added until the training set is predicted perfectly or a maximum number of models is added. Each individual model learns from mistakes made by the previous model. Just like the decision trees themselves, bagging and boosting can be used for classification and regression problems.

By combining individual models, the ensemble model tends to be more flexible (less bias) and less data-sensitive (less variance). (Bias and variance are described in detail later in this chapter.) Ensemble methods combine multiple, simpler algorithms to obtain better performance. In this section we will cover random forest, AdaBoost, the gradient boosting method, and extra trees, along with their implementation using the sklearn package.

Random forest

Random forest is a tweaked version of bagged decision trees. In order to understand a random forest algorithm, let us first understand the bagging algorithm. Assuming we have a dataset of one thousand instances, the steps of bagging are:

1. Create many (e.g., one hundred) random subsamples of our dataset.
2. Train a CART model on each sample.
3. Given a new dataset, calculate the prediction from each model and aggregate the predictions, averaging them for regression or assigning the final label by majority vote for classification.

A problem with decision trees like CART is that they are greedy. They choose the variable to split on using a greedy algorithm that minimizes error. Even after bagging, the decision trees can have a lot of structural similarities, resulting in high correlation in their predictions. Combining predictions from multiple models in ensembles works better if the predictions from the submodels are uncorrelated, or at best weakly correlated. Random forest changes the learning algorithm in such a way that the resulting predictions from all of the subtrees have less correlation.

In CART, when selecting a split point, the learning algorithm is allowed to look through all variables and all variable values in order to select the most optimal split point. The random forest algorithm changes this procedure such that each subtree can access only a random sample of features when selecting the split points. The number of features that can be searched at each split point (m) must be specified as a parameter to the algorithm.

As the bagged decision trees are constructed, we can calculate how much the error function drops for a variable at each split point. In regression problems, this may be the drop in sum squared error, and in classification, this might be the Gini cost. The bagged method can provide feature importance by calculating and averaging the error function drop for individual variables.

Implementation in Python

Random forest regression and classification models can be constructed using the sklearn package of Python, as shown in the following code:

Classification

    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier()
    model.fit(X, Y)

Regression

    from sklearn.ensemble import RandomForestRegressor
    model = RandomForestRegressor()
    model.fit(X, Y)
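Building on the snippets above, sklearn exposes the feature importance scores mentioned earlier through the feature_importances_ attribute. A minimal sketch, with a synthetic dataset as our assumption:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # toy data with only a few truly informative features
    X, Y = make_classification(n_samples=500, n_features=8,
                               n_informative=3, random_state=0)

    model = RandomForestClassifier(random_state=0)
    model.fit(X, Y)
    for i, score in enumerate(model.feature_importances_):
        print(f"feature {i}: {score:.3f}")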
Hyperparameters

Some of the main hyperparameters that are present in the sklearn implementation of random forest and that can be tweaked while performing the grid search are:

Maximum number of features (max_features in sklearn)
    This is the most important parameter. It is the number of random features to sample at each split point. You could try a range of integer values, such as 1 to 20, or 1 to half the number of input features.

Number of estimators (n_estimators in sklearn)
    This parameter represents the number of trees. Ideally, this should be increased until no further improvement is seen in the model. Good values might be on a log scale from 10 to 1,000.

Advantages and disadvantages

The random forest algorithm (or model) has gained huge popularity in ML applications during the last decade due to its good performance, scalability, and ease of use. It is flexible and naturally assigns feature importance scores, so it can handle redundant feature columns. It scales to large datasets and is generally robust to overfitting. The algorithm doesn't need the data to be scaled and can model a nonlinear relationship.

In terms of disadvantages, random forest can feel like a black box approach, as we have very little control over what the model does, and the results may be difficult to interpret. Although random forest does a good job at classification, it may not be good for regression problems, as it does not give a precise continuous prediction. In the case of regression, it doesn't predict beyond the range in the training data and may overfit datasets that are particularly noisy.

Extra trees

Extra trees, otherwise known as extremely randomized trees, is a variant of random forest; it builds multiple trees and splits nodes using random subsets of features, similar to random forest. However, unlike random forest, where observations are drawn with replacement, the observations are drawn without replacement in extra trees, so there is no repetition of observations. Additionally, random forest selects the best split to convert the parent into the two most homogeneous child nodes (a split is the process of converting a nonhomogeneous parent node into two child nodes that are as homogeneous as possible), whereas extra trees selects a random split to divide the parent node into two random child nodes. In extra trees, randomness doesn't come from bootstrapping the data; it comes from the random splits of all observations.

In real-world cases, performance is comparable to an ordinary random forest, sometimes a bit better. The advantages and disadvantages of extra trees are similar to those of random forest.

Implementation in Python

Extra trees regression and classification models can be constructed using the sklearn package of Python, as shown in the following code snippet. The hyperparameters of extra trees are similar to those of random forest, as shown in the previous section:

Classification

    from sklearn.ensemble import ExtraTreesClassifier
    model = ExtraTreesClassifier()
    model.fit(X, Y)

Regression

    from sklearn.ensemble import ExtraTreesRegressor
    model = ExtraTreesRegressor()
    model.fit(X, Y)
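A grid search over the two random forest hyperparameters described above might look as follows; this is an illustrative sketch on an assumed synthetic dataset, and since the hyperparameters are similar, the same grid applies to extra trees:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, Y = make_classification(n_samples=500, n_features=10, random_state=0)

    param_grid = {'max_features': [1, 2, 5, 10],
                  'n_estimators': [10, 100, 1000]}
    grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
    grid.fit(X, Y)
    print(grid.best_params_)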
Adaptive Boosting (AdaBoost)

Adaptive boosting, or AdaBoost, is a boosting technique in which the basic idea is to try predictors sequentially, with each subsequent model attempting to fix the errors of its predecessor. At each iteration, the AdaBoost algorithm changes the sample distribution by modifying the weights attached to each of the instances. It increases the weights of the wrongly predicted instances and decreases the weights of the correctly predicted instances.

The steps of the AdaBoost algorithm are:

1. Initially, all observations are given equal weights.
2. A model is built on a subset of data, and using this model, predictions are made on the whole dataset. Errors are calculated by comparing the predictions and actual values.
3. While creating the next model, higher weights are given to the data points that were predicted incorrectly. Weights can be determined using the error value. For instance, the higher the error, the more weight is assigned to the observation.
4. This process is repeated until the error function does not change, or until the maximum limit of the number of estimators is reached.

Implementation in Python

AdaBoost regression and classification models can be constructed using the sklearn package of Python, as shown in the following code snippet:

Classification

    from sklearn.ensemble import AdaBoostClassifier
    model = AdaBoostClassifier()
    model.fit(X, Y)

Regression

    from sklearn.ensemble import AdaBoostRegressor
    model = AdaBoostRegressor()
    model.fit(X, Y)

Hyperparameters

Some of the main hyperparameters that are present in the sklearn implementation of AdaBoost and that can be tweaked while performing the grid search are as follows:

Learning rate (learning_rate in sklearn)
    The learning rate shrinks the contribution of each classifier/regressor. It can be considered on a log scale. Sample values for a grid search can be 0.001, 0.01, and 0.1.

Number of estimators (n_estimators in sklearn)
    This parameter represents the number of trees. Ideally, this should be increased until no further improvement is seen in the model. Good values might be on a log scale from 10 to 1,000.

Advantages and disadvantages

In terms of advantages, AdaBoost has a high degree of precision, and it can achieve results similar to those of other models with much less tweaking of parameters or settings. The algorithm doesn't need the data to be scaled and can model a nonlinear relationship.

In terms of disadvantages, the training of AdaBoost is time consuming. AdaBoost can be sensitive to noisy data and outliers, and data imbalance leads to a decrease in classification accuracy.
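The two AdaBoost hyperparameters above can be searched jointly. A minimal sketch, with the grid values taken from the text and the dataset assumed:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import GridSearchCV

    X, Y = make_classification(n_samples=500, n_features=10, random_state=0)

    param_grid = {'learning_rate': [0.001, 0.01, 0.1],
                  'n_estimators': [10, 100, 1000]}
    grid = GridSearchCV(AdaBoostClassifier(random_state=0), param_grid, cv=5)
    grid.fit(X, Y)
    print(grid.best_params_)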
Gradient boosting method

The gradient boosting method (GBM) is another boosting technique similar to AdaBoost, where the general idea is to try predictors sequentially. Gradient boosting works by sequentially adding predictors to the ensemble, with each new predictor correcting the errors of the underfitted predictions made before it. The following are the steps of the gradient boosting algorithm:

1. A model (which can be referred to as the first weak learner) is built on a subset of data. Using this model, predictions are made on the whole dataset.
2. Errors are calculated by comparing the predictions and actual values, and the loss is calculated using the loss function.
3. A new model is created using the errors of the previous step as the target variable. The objective is to find the best split in the data to minimize the error. The predictions made by this new model are combined with the predictions of the previous models. New errors are calculated using this predicted value and the actual value.
4. This process is repeated until the error function does not change or until the maximum limit of the number of estimators is reached.

Contrary to AdaBoost, which tweaks the instance weights at every iteration, this method tries to fit the new predictor to the residual errors made by the previous predictor. (A minimal sketch of this residual-fitting idea appears at the end of this section.)

Implementation in Python and hyperparameters

Gradient boosting method regression and classification models can be constructed using the sklearn package of Python, as shown in the following code snippet. The hyperparameters of the gradient boosting method are similar to those of AdaBoost, as shown in the previous section:

Classification

    from sklearn.ensemble import GradientBoostingClassifier
    model = GradientBoostingClassifier()
    model.fit(X, Y)

Regression

    from sklearn.ensemble import GradientBoostingRegressor
    model = GradientBoostingRegressor()
    model.fit(X, Y)

Advantages and disadvantages

In terms of advantages, the gradient boosting method is robust to missing data, highly correlated features, and irrelevant features in the same way as random forest. It naturally assigns feature importance scores, with slightly better performance than random forest. The algorithm doesn't need the data to be scaled and can model a nonlinear relationship.

In terms of disadvantages, it may be more prone to overfitting than random forest, as the main purpose of the boosting approach is to reduce bias and not variance. It has many hyperparameters to tune, so model development may not be as fast. Also, feature importance may not be robust to variation in the training dataset.
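The residual-fitting idea can be made concrete in a few lines. The following sketch is our own illustration (not from the original text): it boosts three shallow regression trees by hand, fitting each tree to the residual errors of the ensemble built so far, with the final prediction being the sum of the stages:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, (200, 1))
    y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

    # stage 1: fit a weak learner to the target
    tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
    residual1 = y - tree1.predict(X)

    # stage 2: fit the next tree to the residual errors of stage 1
    tree2 = DecisionTreeRegressor(max_depth=2).fit(X, residual1)
    residual2 = residual1 - tree2.predict(X)

    # stage 3: fit a third tree to the remaining residuals
    tree3 = DecisionTreeRegressor(max_depth=2).fit(X, residual2)

    # the ensemble prediction is the sum of the stages
    y_pred = tree1.predict(X) + tree2.predict(X) + tree3.predict(X)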
ANN-Based Models

In Chapter 3 we covered the basics of ANNs, along with the architecture of ANNs and their training and implementation in Python. The details provided in that chapter are applicable across all areas of machine learning, including supervised learning. However, there are a few additional details from the supervised learning perspective, which we will cover in this section.

Neural networks are reducible to a classification or regression model through the activation function of the node in the output layer. In the case of a regression problem, the output node has a linear activation function (or no activation function). A linear function produces a continuous output ranging from -inf to +inf. Hence, the output layer will be the linear function of the nodes in the layer before the output layer, and the network will be a regression-based model.

In the case of a classification problem, the output node has a sigmoid or softmax activation function. A sigmoid or softmax function produces an output ranging from zero to one to represent the probability of the target value. The softmax function can also be used for multiple groups for classification.

ANN using sklearn

ANN regression and classification models can be constructed using the sklearn package of Python, as shown in the following code snippet:

Classification

    from sklearn.neural_network import MLPClassifier
    model = MLPClassifier()
    model.fit(X, Y)

Regression

    from sklearn.neural_network import MLPRegressor
    model = MLPRegressor()
    model.fit(X, Y)

Hyperparameters

As we saw in Chapter 3, ANN has many hyperparameters. Some of the hyperparameters that are present in the sklearn implementation of ANN and can be tweaked while performing the grid search are listed here (a grid search sketch over these values appears at the end of this section):

Hidden layers (hidden_layer_sizes in sklearn)
    This represents the number of layers and nodes in the ANN architecture. In the sklearn implementation of ANN, the ith element represents the number of neurons in the ith hidden layer. Sample values for a grid search in the sklearn implementation can be [(20,), (50,), (20, 20), (20, 30, 20)].

Activation function (activation in sklearn)
    This represents the activation function of the hidden layers. Some of the activation functions defined in Chapter 3, such as sigmoid, relu, or tanh, can be used.

Deep neural network

ANNs with more than a single hidden layer are often called deep networks. We prefer using the library Keras to implement such networks, given the flexibility of the library. The detailed implementation of a deep neural network in Keras was shown in Chapter 3. Similar to MLPClassifier and MLPRegressor in sklearn for classification and regression, Keras has modules called KerasClassifier and KerasRegressor that can be used for creating classification and regression models with a deep network.

A popular problem in finance is time series prediction, which is predicting the next value of a time series based on a historical overview. Some deep neural networks, such as the recurrent neural network (RNN), can be directly used for time series prediction. The details of this approach are provided in Chapter 5.

Advantages and disadvantages

The main advantage of an ANN is that it captures the nonlinear relationship between the variables quite well. ANN can more easily learn rich representations and is good with a large number of input features and a large dataset. ANN is flexible in how it can be used. This is evident from its use across a wide variety of areas in machine learning and AI, including reinforcement learning and NLP, as discussed in Chapter 3.

The main disadvantage of ANN is the interpretability of the model, which is a drawback that often cannot be ignored and is sometimes the determining factor when choosing a model. ANN is not good with small datasets and requires a lot of tweaking and guesswork. Choosing the right topology/algorithms to solve a problem is difficult. Also, ANN is computationally expensive and can take a lot of time to train.

Using ANNs for supervised learning in finance

If a simple model such as linear or logistic regression perfectly fits your problem, don't bother with ANN. However, if you are modeling a complex dataset and feel a need for better prediction power, give ANN a try. ANN is one of the most flexible models in adapting itself to the shape of the data, and using it for supervised learning problems can be an interesting and valuable exercise.
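Returning to the sklearn hyperparameters above, a grid search might look like the following minimal sketch; the synthetic dataset and the max_iter setting are our assumptions (note that the sigmoid activation is called 'logistic' in sklearn):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import StandardScaler

    X, Y = make_classification(n_samples=500, n_features=10, random_state=0)
    X = StandardScaler().fit_transform(X)

    param_grid = {'hidden_layer_sizes': [(20,), (50,), (20, 20), (20, 30, 20)],
                  'activation': ['logistic', 'relu', 'tanh']}
    grid = GridSearchCV(MLPClassifier(max_iter=2000, random_state=0),
                        param_grid, cv=5)
    grid.fit(X, Y)
    print(grid.best_params_)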
Model Performance

In the previous section, we discussed grid search as a way to find the right hyperparameters to achieve better performance. In this section, we will expand on that process by discussing the key components of evaluating model performance, which are overfitting, cross validation, and evaluation metrics.

Overfitting and Underfitting

A common problem in machine learning is overfitting, which is defined by learning a function that perfectly explains the training data that the model learned from but doesn't generalize well to unseen test data. Overfitting happens when a model overlearns from the training data to the point that it starts picking up idiosyncrasies that aren't representative of patterns in the real world. This becomes especially problematic as we make our models increasingly more complex. Underfitting is a related issue in which the model is not complex enough to capture the underlying trend in the data.

Figure 4-5 illustrates overfitting and underfitting. The left-hand panel of Figure 4-5 shows a linear regression model; a straight line clearly underfits the true function. The middle panel shows that a high-degree polynomial approximates the true relationship reasonably well. On the other hand, a polynomial of a very high degree fits the small sample almost perfectly and performs best on the training data, but this doesn't generalize, and it would do a horrible job at explaining a new data point.

The concepts of overfitting and underfitting are closely linked to the bias-variance trade-off. Bias refers to the error due to overly simplistic or faulty assumptions in the learning algorithm. Bias results in underfitting of the data, as shown in the left-hand panel of Figure 4-5. A high bias means our learning algorithm is missing important trends among the features. Variance refers to the error due to an overly complex model that tries to fit the training data as closely as possible. In high-variance cases, the model's predicted values are extremely close to the actual values from the training set. High variance gives rise to overfitting, as shown in the right-hand panel of Figure 4-5. Ultimately, in order to have a good model, we need low bias and low variance.

Figure 4-5. Overfitting and underfitting

There are two ways to combat overfitting:

Using more training data
    The more training data we have, the harder it is to overfit the data by learning too much from any single training example.

Using regularization
    Adding a penalty in the loss function for building a model that assigns too much explanatory power to any one feature, or allows too many features to be taken into account.

The concept of overfitting and the ways to combat it are applicable across all the supervised learning models. For example, regularized regressions address overfitting in linear regression, as discussed earlier in this chapter.

Cross Validation

One of the challenges of machine learning is training models that are able to generalize well to unseen data (overfitting versus underfitting, or the bias-variance trade-off). The main idea behind cross validation is to split the data one time or several times so that each split is used once as a validation set and the remainder is used as a training set: part of the data (the training sample) is used to train the algorithm, and the remaining part (the validation sample) is used for estimating the risk of the algorithm. Cross validation allows us to obtain reliable estimates of the model's generalization error. It is easiest to understand with an example. When doing k-fold cross validation, we randomly split the training data into k folds. Then we train the model using k-1 folds and evaluate the performance on the kth fold. We repeat this process k times and average the resulting scores. Figure 4-6 shows an example of cross validation, where the data is split into five sets and in each round one of the sets is used for validation.

Figure 4-6. Cross validation

A potential drawback of cross validation is the computational cost, especially when paired with a grid search for hyperparameter tuning. Cross validation can be performed in a couple of lines using the sklearn package, as sketched below; we will perform cross validation in the supervised learning case studies.
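As a minimal sketch of those couple of lines (the dataset and the choice of model are assumptions for illustration):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score

    X, Y = make_classification(n_samples=500, n_features=10, random_state=0)

    # 5-fold cross validation: each fold is used once as the validation set
    kfold = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(LogisticRegression(), X, Y, cv=kfold)
    print(scores.mean(), scores.std())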
In the next section, we cover the evaluation metrics for the supervised learning models that are used to measure and compare the models' performance.

Evaluation Metrics

The metrics used to evaluate machine learning algorithms are very important. The choice of metrics influences how the performance of machine learning algorithms is measured and compared. The metrics influence both how you weight the importance of different characteristics in the results and your ultimate choice of algorithm. The main evaluation metrics for regression and classification are illustrated in Figure 4-7.

Figure 4-7. Evaluation metrics for regression and classification

Let us first look at the evaluation metrics for supervised regression.

Mean absolute error

The mean absolute error (MAE) is the average of the absolute differences between predictions and actual values. The MAE is a linear score, which means that all the individual differences are weighted equally in the average. It gives an idea of how wrong the predictions were. The measure gives an idea of the magnitude of the error, but no idea of the direction (e.g., over- or underpredicting).

Mean squared error

The mean squared error (MSE) is the average of the squared differences between predicted values and observed values (these differences are called residuals). This is much like the mean absolute error in that it provides a gross idea of the magnitude of the error. Taking the square root of the mean squared error converts the units back to the original units of the output variable and can be meaningful for description and presentation. This is called the root mean squared error (RMSE).

R² metric

The R² metric provides an indication of the "goodness of fit" of the predictions to actual values. In statistical literature this measure is called the coefficient of determination. This is a value between zero and one, for no fit and perfect fit, respectively.

Adjusted R² metric

Just like R², adjusted R² shows how well terms fit a curve or line, but it adjusts for the number of terms in the model. It is given by the following formula:

    R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}

where n is the total number of observations and k is the number of predictors. Adjusted R² will always be less than or equal to R².

Selecting an evaluation metric for supervised regression

In terms of a preference among these evaluation metrics, if the main goal is predictive accuracy, then RMSE is best. It is computationally simple and easily differentiable. The loss is symmetric, but larger errors weigh more in the calculation. The MAE is symmetric but does not weigh larger errors more. R² and adjusted R² are often used for explanatory purposes by indicating how well the selected independent variable(s) explains the variability in the dependent variable(s). The following sketch computes these metrics with sklearn.
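The regression metrics above are available in sklearn.metrics (adjusted R² is not built in, but it follows directly from the formula). A minimal sketch on made-up numbers:

    import numpy as np
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

    y_true = np.array([3.0, 2.5, 4.0, 5.1, 3.3])
    y_pred = np.array([2.8, 2.9, 4.2, 4.8, 3.0])

    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)  # back in the units of the output variable
    r2 = r2_score(y_true, y_pred)

    n, k = len(y_true), 1  # observations and (assumed) number of predictors
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    print(mae, rmse, r2, adj_r2)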
Let us now look at the evaluation metrics for supervised classification.

Classification

For simplicity, we will mostly discuss things in terms of a binary classification problem (i.e., only two outcomes, such as true or false); some common terms are:

True positives (TP)
    Predicted positive and are actually positive.

False positives (FP)
    Predicted positive and are actually negative.

True negatives (TN)
    Predicted negative and are actually negative.

False negatives (FN)
    Predicted negative and are actually positive.

The difference between three commonly used evaluation metrics for classification (accuracy, precision, and recall) is illustrated in Figure 4-8.

Figure 4-8. Accuracy, precision, and recall

Accuracy

As shown in Figure 4-8, accuracy is the number of correct predictions made as a ratio of all predictions made. This is the most common evaluation metric for classification problems and also the most misused. It is most suitable when there are an equal number of observations in each class (which is rarely the case) and when all predictions and the related prediction errors are equally important, which is often not the case.

Precision

Precision is the percentage of positive instances out of the total predicted positive instances. Here, the denominator is the number of instances the model predicted as positive from the whole given dataset. Precision is a good measure when the cost of false positives is high (e.g., email spam detection).

Recall

Recall (or sensitivity, or true positive rate) is the percentage of positive instances out of the total actual positive instances. Therefore, the denominator (true positives + false negatives) is the actual number of positive instances present in the dataset. Recall is a good measure when there is a high cost associated with false negatives (e.g., fraud detection).

In addition to accuracy, precision, and recall, some of the other commonly used evaluation metrics for classification are discussed in the following sections.

Area under ROC curve

Area under the ROC curve (AUC) is an evaluation metric for binary classification problems. ROC is a probability curve, and AUC represents the degree or measure of separability. It tells how much the model is capable of distinguishing between classes. The higher the AUC, the better the model is at predicting zeros as zeros and ones as ones. An AUC of 0.5 means that the model has no class separation capacity whatsoever. The probabilistic interpretation of the AUC score is that if you randomly choose a positive case and a negative case, the probability that the positive case outranks the negative case according to the classifier is given by the AUC.

Confusion matrix

A confusion matrix lays out the performance of a learning algorithm. The confusion matrix is simply a square matrix that reports the counts of the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions of a classifier, as shown in Figure 4-9.

Figure 4-9. Confusion matrix

The confusion matrix is a handy presentation of the accuracy of a model with two or more classes. The table presents predictions on the x-axis and accuracy outcomes on the y-axis. The cells of the table are the number of predictions made by the model. For example, a model can predict zero or one, and each prediction may actually have been a zero or a one. Predictions for zero that were actually zero appear in the cell for prediction = 0 and actual = 0, whereas predictions for zero that were actually one appear in the cell for prediction = 0 and actual = 1.

Selecting an evaluation metric for supervised classification

The evaluation metric for classification depends heavily on the task at hand. For example, recall is a good measure when there is a high cost associated with false negatives, such as in fraud detection. We will further examine these evaluation metrics in the case studies; a short sketch of how they are computed with sklearn follows.
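A minimal sketch with made-up labels and scores for illustration; note that AUC is computed from predicted probabilities rather than from hard class labels:

    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 precision_score, recall_score, roc_auc_score)

    y_true = [0, 1, 1, 0, 1, 0, 1, 1]
    y_pred = [0, 1, 0, 0, 1, 1, 1, 1]                  # hard class predictions
    y_prob = [0.2, 0.8, 0.4, 0.3, 0.9, 0.6, 0.7, 0.8]  # predicted probabilities

    print(accuracy_score(y_true, y_pred))
    print(precision_score(y_true, y_pred))
    print(recall_score(y_true, y_pred))
    print(roc_auc_score(y_true, y_prob))
    print(confusion_matrix(y_true, y_pred))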
Model Selection

Selecting the perfect machine learning model is both an art and a science. Looking at machine learning models, there is no one solution or approach that fits all. There are several factors that can affect your choice of a machine learning model. The main criterion in most cases is the model performance that we discussed in the previous section. However, there are many other factors to consider while performing model selection. In the following section, we will go over all such factors, followed by a discussion of model trade-offs.

Factors for Model Selection

The factors considered for the model selection process are as follows:

Simplicity
    The degree of simplicity of the model. Simplicity usually results in quicker, more scalable, and easier-to-understand models and results.

Training time
    Speed, performance, memory usage, and overall time taken for model training.

Handle nonlinearity in the data
    The ability of the model to handle the nonlinear relationship between the variables.

Robustness to overfitting
    The ability of the model to handle overfitting.

Size of the dataset
    The ability of the model to handle a large number of training examples in the dataset.

Number of features
    The ability of the model to handle high dimensionality of the feature space.

Model interpretation
    How explainable is the model? Model interpretability is important because it allows us to take concrete actions to solve the underlying problem.

Feature scaling
    Does the model require variables to be scaled or normally distributed?

Figure 4-10 compares the supervised learning models on the factors mentioned previously and outlines a general rule of thumb to narrow down the search for the best machine learning algorithm for a given problem. (The table does not include AdaBoost and extra trees, as their overall behavior across all the parameters is similar to that of gradient boosting and random forest, respectively.) The table is based on the advantages and disadvantages of the different models discussed in the individual model sections in this chapter.

Figure 4-10. Model selection

We can see from the table that relatively simple models include linear and logistic regression, and as we move toward the ensemble models and ANN, the complexity increases. In terms of training time, the linear models and CART are relatively fast to train compared to the ensemble methods and ANN. Linear and logistic regression can't handle nonlinear relationships, while all other models can. SVM can handle the nonlinear relationship between dependent and independent variables with nonlinear kernels.

SVM and random forest tend to overfit less as compared to linear regression, logistic regression, gradient boosting, and ANN. The degree of overfitting also depends on other parameters, such as the size of the data and model tuning, and can be checked by looking at the results of the test set for each model. Also, boosting methods such as gradient boosting have higher overfitting risk compared to bagging methods such as random forest. Recall that the focus of gradient boosting is to minimize bias and not variance.

Linear and logistic regression are not able to handle large datasets and a large number of features well. However, CART, ensemble methods, and ANN are capable of handling large datasets and many features quite well. Linear and logistic regression generally perform better than other models when the size of the dataset is small. Application of variable reduction techniques (shown in Chapter 7) enables the linear models to handle large datasets.
The performance of ANN increases with an increase in the size of the dataset. Given that linear regression, logistic regression, and CART are relatively simpler models, they have better model interpretation as compared to the ensemble models and ANN.

Model Trade-off

Often, it's a trade-off between different factors when selecting a model. ANN, SVM, and some ensemble methods can be used to create very accurate predictive models, but they may lack simplicity and interpretability and may take a significant amount of resources to train.

In terms of selecting the final model, models with lower interpretability may be preferred when predictive performance is the most important goal and it's not necessary to explain how the model works and makes predictions. In some cases, however, model interpretability is mandatory. Interpretability-driven examples are often seen in the financial industry. In many cases, choosing a machine learning algorithm has less to do with the optimization or the technical aspects of the algorithm and more to do with business decisions. Suppose a machine learning algorithm is used to accept or reject an individual's credit card application. If the applicant is rejected and decides to file a complaint or take legal action, the financial institution will need to explain how that decision was made. While that can be nearly impossible for ANN, it's relatively straightforward for decision tree-based models.

Different classes of models are good at modeling different types of underlying patterns in data. So a good first step is to quickly test out a few different classes of models to know which ones capture the underlying structure of the dataset most efficiently. We will follow this approach while performing model selection in all our supervised learning-based case studies.

Chapter Summary

In this chapter, we discussed the importance of supervised learning models in finance, followed by a brief introduction to several supervised learning models, including linear and logistic regression, SVM, decision trees, ensemble models, KNN, LDA, and ANN. We demonstrated training and tuning of these models in a few lines of code using the sklearn and Keras libraries.

We discussed the most common error metrics for regression and classification models, explained the bias-variance trade-off, and illustrated the various tools for managing the model selection process using cross validation.

We introduced the strengths and weaknesses of each model and discussed the factors to consider when selecting the best model. We also discussed the trade-off between model performance and interpretability.

In the following chapter, we will dive into the case studies for regression and classification. All case studies in the next two chapters leverage the concepts presented in this chapter and in the previous two chapters.

CHAPTER 5
Supervised Learning: Regression (Including Time Series Models)

Supervised regression-based machine learning is a predictive form of modeling in which the goal is to model the relationship between a target and the predictor variable(s) in order to estimate a continuous set of possible outcomes. These are the most used machine learning models in finance.

One of the focus areas of analysts in financial institutions (and finance in general) is to predict investment opportunities, typically predictions of asset prices and asset returns.
Supervised regression-based machine learning models are inherently suitable in this context. These models help investment and financial managers understand the properties of the predicted variable and its relationship with other variables, and help them identify significant factors that drive asset returns. This helps investors estimate return profiles, trading costs, technical and financial investment required in infrastructure, and thus ultimately the risk profile and profitability of a strategy or portfolio.

With the availability of large volumes of data and processing techniques, supervised regression-based machine learning isn't just limited to asset price prediction. These models are applied to a wide range of areas within finance, including portfolio management, insurance pricing, instrument pricing, hedging, and risk management.

In this chapter we cover supervised regression-based case studies that span diverse areas, including asset price prediction, instrument pricing, and portfolio management. All of the case studies follow the standardized seven-step model development process presented in Chapter 2; those steps include defining the problem, loading the data, exploratory data analysis, data preparation, model evaluation, and model tuning. (The steps or substeps may be reordered or renamed based on the appropriateness and intuitiveness of the steps/substeps.) The case studies are designed not only to cover a diverse set of topics from the finance standpoint but also to cover multiple machine learning and modeling concepts, including models from basic linear regression to advanced deep learning that were presented in Chapter 4.

A substantial amount of asset modeling and prediction problems in the financial industry involve a time component and estimation of a continuous output. As such, it is also important to address time series models. In its broadest form, time series analysis is about inferring what has happened to a series of data points in the past and attempting to predict what will happen to it in the future. There have been a lot of comparisons and debates in academia and the industry regarding the differences between supervised regression and time series models. Most time series models are parametric (i.e., a known function is assumed to represent the data), while the majority of supervised regression models are nonparametric. Time series models primarily use historical data of the predicted variables for prediction, while supervised learning algorithms use exogenous variables as predictor variables. (An exogenous variable is one whose value is determined outside the model and imposed on the model.) However, supervised regression can embed the historical data of the predicted variable through a time-delay approach (covered later in this chapter), and a time series model (such as ARIMAX, also covered later in this chapter) can use exogenous variables for prediction. Hence, time series and supervised regression models are similar in the sense that both can use exogenous variables as well as historical data of the predicted variable for forecasting. In terms of the final output, both supervised regression and time series models estimate a continuous set of possible outcomes of a variable.

In Chapter 4, we covered the concepts of models that are common between supervised regression and supervised classification.
Given that time series models are more closely aligned with supervised regression than supervised classification, we will go through the concepts of time series models separately in this chapter. We will also demonstrate how we can use time series models on financial data to predict future values. A comparison of time series models against the supervised regression models will be presented in the case studies. Additionally, some machine learning and deep learning models (such as LSTM) can be directly used for time series forecasting, and those will be discussed as well.

In "Case Study 1: Stock Price Prediction" on page 95, we demonstrate one of the most popular prediction problems in finance, that of predicting stock returns. In addition to predicting future stock prices accurately, the purpose of this case study is to discuss the machine learning-based framework for general asset class price prediction in finance. In this case study we will discuss several machine learning and time series concepts, along with focusing on visualization and model tuning.

In "Case Study 2: Derivative Pricing" on page 114, we will delve into derivative pricing using supervised regression and show how to deploy machine learning techniques in the context of traditional quant problems. As compared to traditional derivative pricing models, machine learning techniques can lead to faster derivative pricing without relying on several impractical assumptions. Efficient numerical computation using machine learning can be increasingly beneficial in areas such as financial risk management, where a trade-off between efficiency and accuracy is often inevitable.

In "Case Study 3: Investor Risk Tolerance and Robo-Advisors" on page 125, we illustrate a supervised regression-based framework to estimate the risk tolerance of investors. In this case study, we build a robo-advisor dashboard in Python and implement this risk tolerance prediction model in the dashboard. We demonstrate how such models can lead to the automation of portfolio management processes, including the use of robo-advisors for investment management. The purpose is also to illustrate how machine learning can efficiently be used to overcome the problem of traditional risk tolerance profiling or risk tolerance questionnaires that suffer from several behavioral biases.

In "Case Study 4: Yield Curve Prediction" on page 141, we use a supervised regression-based framework to forecast different yield curve tenors simultaneously. We demonstrate how we can produce multiple tenors at the same time to model the yield curve using machine learning models.

In this chapter, we will learn about the following concepts related to supervised regression and time series techniques:

• Application and comparison of different time series and machine learning models.
• Interpretation of the models and results. Understanding the potential overfitting and underfitting and the intuition behind linear versus nonlinear models.
• Performing data preparation and transformations to be used in machine learning models.
• Performing feature selection and engineering to improve model performance.
• Using data visualization and data exploration to understand outcomes.
• Algorithm tuning to improve model performance. Understanding, implementing, and tuning time series models such as ARIMA for prediction.
• Framing a problem statement related to portfolio management and behavioral finance in a regression-based machine learning framework.
• Understanding how deep learning-based models such as LSTM can be used for time series prediction.

The models used for supervised regression were presented in Chapters 3 and 4. Prior to the case studies, we will discuss time series models. We highly recommend readers turn to Time Series Analysis and Its Applications, 4th Edition, by Robert H. Shumway and David S. Stoffer (Springer) for a more in-depth understanding of time series concepts, and to Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, by Aurélien Géron (O'Reilly) for more on concepts in supervised regression models.

This Chapter's Code Repository

A Python-based master template for supervised regression, a time series model template, and the Jupyter notebook for all case studies presented in this chapter are included in the folder Chapter 5 - Sup. Learning - Regression and Time Series models of the code repository for this book. For any new supervised regression-based case study, use the common template from the code repository, modify the elements specific to the case study, and borrow the concepts and insights from the case studies presented in this chapter. The template also includes the implementation and tuning of the ARIMA and LSTM models, which are discussed later in this chapter. The templates are designed to run on the cloud (i.e., Kaggle, Google Colab, and AWS). All the case studies have been designed on a uniform regression template.

Time Series Models

A time series is a sequence of numbers that are ordered by a time index. In this section we will cover the following aspects of time series models, which we further leverage in the case studies:

• The components of a time series
• Autocorrelation and stationarity of time series
• Traditional time series models (e.g., ARIMA)
• Use of deep learning models for time series modeling
• Conversion of time series data for use in a supervised learning framework

Time Series Breakdown

A time series can be broken down into the following components:

Trend component
    A trend is a consistent directional movement in a time series. These trends will be either deterministic or stochastic. The former allows us to provide an underlying rationale for the trend, while the latter is a random feature of a series that we will be unlikely to explain. Trends often appear in financial series, and many trading models use sophisticated trend identification algorithms.

Seasonal component
    Many time series contain seasonal variation. This is particularly true in series representing business sales or climate levels. In quantitative finance we often see seasonal variation, particularly in series related to holiday seasons or annual temperature variation (such as natural gas).

We can write the components of a time series y_t as

    y_t = S_t + T_t + R_t

where S_t is the seasonal component, T_t is the trend component, and R_t represents the remainder component of the time series not captured by the seasonal or trend components.
The Python code for breaking down a time series (Y) into its components is as follows:

    import statsmodels.api as sm
    sm.tsa.seasonal_decompose(Y, freq=52).plot()

Figure 5-1 shows the time series broken down into trend, seasonality, and remainder components. Breaking down a time series into these components may help us better understand the time series and identify its behavior for better prediction. The three time series components are shown separately in the bottom three panels. These components can be added together to reconstruct the actual time series shown in the top panel (shown as "observed"). Notice that the time series shows a trending component after 2017. Hence, the prediction model for this time series should incorporate the information regarding the trending behavior after 2017. In terms of seasonality, there is some increase in magnitude at the beginning of the calendar year. The residual component shown in the bottom panel is what is left over when the seasonal and trend components have been subtracted from the data. The residual component is mostly flat, with some spikes and noise around 2018 and 2019. Also, each of the plots is on a different scale, and the trend component has the maximum range, as shown by the scale on the plot.

Figure 5-1. Time series components

Autocorrelation and Stationarity

When we are given one or more time series, it is relatively straightforward to decompose the time series into trend, seasonality, and residual components. However, there are other aspects that come into play when dealing with time series data, particularly in finance.

Autocorrelation

There are many situations in which consecutive elements of a time series exhibit correlation. That is, the behavior of sequential points in the series affects each other in a dependent manner. Autocorrelation is the similarity between observations as a function of the time lag between them. Such relationships can be modeled using an autoregression model. The term autoregression indicates that it is a regression of the variable against itself. In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable.

Thus, an autoregressive model of order p can be written as

    y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \epsilon_t

where \epsilon_t is white noise (a white noise process is a random process of random variables that are uncorrelated and have a mean of zero and a finite variance). An autoregressive model is like a multiple regression but with lagged values of y_t as predictors. We refer to this as an AR(p) model, an autoregressive model of order p. Autoregressive models are remarkably flexible at handling a wide range of different time series patterns.
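As an illustrative sketch of our own (not part of the original text), we can simulate an AR(1) process and recover its coefficient with the AutoReg class, assuming a reasonably recent version of statsmodels:

    import numpy as np
    from statsmodels.tsa.ar_model import AutoReg

    rng = np.random.RandomState(0)
    # simulate an AR(1) process: y_t = 0.8 * y_{t-1} + white noise
    y = np.zeros(200)
    for t in range(1, 200):
        y[t] = 0.8 * y[t - 1] + rng.normal()

    model = AutoReg(y, lags=1).fit()
    print(model.params)  # intercept and phi_1; phi_1 should be close to 0.8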
Stationarity
A time series is said to be stationary if its statistical properties do not change over time. Thus a time series with trend or seasonality is not stationary, as the trend and seasonality affect the value of the series at different times. On the other hand, a white noise series is stationary: it does not matter when you observe it; it should look similar at any point in time.

Figure 5-2 shows some examples of nonstationary series.

Figure 5-2. Nonstationary plots

In the first plot, we can clearly see that the mean varies (increases) with time, resulting in an upward trend. Thus this is a nonstationary series; for a series to be classified as stationary, it should not exhibit a trend. Moving on to the second plot, we certainly do not see a trend in the series, but its variance is a function of time. A stationary series must have a constant variance, so this series is nonstationary as well. In the third plot, the spread becomes narrower as time increases, which implies that the covariance is a function of time. All three examples shown in Figure 5-2 therefore represent nonstationary time series. Now look at a fourth plot, shown in Figure 5-3.

Figure 5-3. Stationary plot

In this case, the mean, variance, and covariance are constant with time. This is what a stationary time series looks like, and predicting future values of such a series is easier. Most statistical models require the series to be stationary to make effective and precise predictions.

The two major sources of nonstationarity in a time series are trend and seasonality, as shown in Figure 5-2. In order to use time series forecasting models, we generally convert any nonstationary series to a stationary series, making it easier to model since its statistical properties do not change over time.

Differencing
Differencing is one of the methods used to make a time series stationary. In this method, we compute the difference between consecutive terms in the series. Differencing is typically performed to remove the varying mean. Mathematically, differencing can be written as

y′_t = y_t − y_{t−1}

where y_t is the value at time t. When the differenced series is white noise, the original series is referred to as a nonstationary series of degree one. A minimal sketch of differencing a series and testing it for stationarity follows.
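The following sketch (illustrative, not from the book's repository) differences a pandas Series and applies the augmented Dickey–Fuller (ADF) test from statsmodels. The ADF test's null hypothesis is that the series is nonstationary, so a small p-value suggests stationarity; differencing a trending series should move the p-value sharply downward:

from statsmodels.tsa.stattools import adfuller

Y_diff = Y.diff().dropna()  # first difference: y'_t = y_t - y_{t-1}

for name, series in [("original", Y), ("differenced", Y_diff)]:
    adf_stat, p_value = adfuller(series)[:2]
    print(f"{name}: ADF statistic = {adf_stat:.2f}, p-value = {p_value:.3f}")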
Traditional Time Series Models (Including the ARIMA Model)
There are many ways to model a time series in order to make predictions. Most time series models aim to incorporate the trend, seasonality, and remainder components while addressing the autocorrelation and stationarity embedded in the series. For example, the autoregressive (AR) model discussed in the previous section addresses the autocorrelation in the time series. One of the most widely used models in time series forecasting is the ARIMA model.

ARIMA
If we combine stationarity with autoregression and a moving average model (discussed later in this section), we obtain an ARIMA model. ARIMA is an acronym for AutoRegressive Integrated Moving Average, and it has the following components:

AR(p)
The autoregression component, i.e., regression of the time series onto itself, as discussed in the previous section, with the assumption that current series values depend on previous values with some lag (or several lags). The maximum lag in the model is referred to as p.

I(d)
The order of integration. It is simply the number of differences needed to make the series stationary.

MA(q)
The moving average component. Without going into detail, it models the error of the time series; again, the assumption is that the current error depends on previous errors with some lag, which is referred to as q. The moving average equation is written as

y_t = c + ϵ_t + θ_1·ϵ_{t−1} + θ_2·ϵ_{t−2} + … + θ_q·ϵ_{t−q}

where ϵ_t is white noise. We refer to this as an MA(q) model of order q.

Combining all the components, the full ARIMA model can be written as

y′_t = c + ϕ_1·y′_{t−1} + ⋯ + ϕ_p·y′_{t−p} + θ_1·ϵ_{t−1} + ⋯ + θ_q·ϵ_{t−q} + ϵ_t

where y′_t is the differenced series (which may have been differenced more than once). The predictors on the right-hand side include both lagged values of y′_t and lagged errors. We call this an ARIMA(p,d,q) model, where p is the order of the autoregressive part, d is the degree of first differencing involved, and q is the order of the moving average part. The same stationarity and invertibility conditions that are used for autoregressive and moving average models also apply to an ARIMA model. The Python code to fit an ARIMA model of order (1,0,0) is shown in the following:

from statsmodels.tsa.arima_model import ARIMA

model = ARIMA(endog=Y_train, order=[1,0,0])

The family of ARIMA models has several variants, including the following:

ARIMAX
ARIMA models with exogenous variables included. We will be using this model in case study 1.

SARIMA
The "S" in this model stands for seasonal; this model targets the seasonality component embedded in the time series, along with the other components.

VARMA
The extension of the model to the multivariate case, when many variables are to be predicted simultaneously. We predict many variables simultaneously in "Case Study 4: Yield Curve Prediction" on page 141.

Deep Learning Approach to Time Series Modeling
Traditional time series models such as ARIMA are well understood and effective on many problems. However, they also suffer from several limitations: they are linear functions, or simple transformations of linear functions; they require manually diagnosed parameters, such as time dependence; and they do not perform well with corrupt or missing data.

Looking at advancements in the field of deep learning for time series prediction, we see that the recurrent neural network (RNN) has gained increasing attention in recent years. These methods can identify structure and patterns such as nonlinearity, can seamlessly model problems with multiple input variables, and are relatively robust to missing data. RNN models can retain state from one iteration to the next by using their own output as input for the next step. Like traditional time series models such as ARIMA, these deep learning models make future predictions using data points from the past, so they can also be referred to as time series models; there is therefore a wide range of applications in finance where they can be leveraged. Let us look at the deep learning models used for time series forecasting.

RNNs
Recurrent neural networks (RNNs) are called "recurrent" because they perform the same task for every element of a sequence, with the output being dependent on the previous computations. RNN models have a memory, which captures information about what has been calculated so far. As shown in Figure 5-4, a recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor.

Figure 5-4. Recurrent neural network

In Figure 5-4:
• X_t is the input at time step t.
• O_t is the output at time step t.
• S_t is the hidden state at time step t. It is the memory of the network, calculated based on the previous hidden state and the input at the current step.

The main feature of an RNN is this hidden state, which captures some information about a sequence and uses it accordingly whenever needed. A minimal sketch of such a network follows.
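The following is a minimal Keras sketch of a plain RNN regressor (our own illustration; the book's case studies use the LSTM variant described next). The input is assumed to be an array shaped (samples, timesteps, features); the values of timesteps and n_features below are placeholders:

from keras.models import Sequential
from keras.layers import SimpleRNN, Dense

timesteps, n_features = 2, 12  # placeholder dimensions for illustration

# One recurrent layer whose hidden state S_t is carried across time steps
model = Sequential()
model.add(SimpleRNN(10, input_shape=(timesteps, n_features)))
model.add(Dense(1))  # single-value regression output
model.compile(loss='mse', optimizer='adam')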
Long short-term memory
Long short-term memory (LSTM) is a special kind of RNN explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically default behavior for an LSTM model. (For a detailed explanation of LSTM models, see Christopher Olah's blog post at https://oreil.ly/4PDhr.) These models are composed of a set of cells with features to memorize the sequence of data. The cells capture and store the data streams, and they connect one module of the past to another module of the present, conveying information from several past time instants to the present one. Due to the use of gates in each cell, data in each cell can be disposed of, filtered, or added for the next cells. The gates, based on artificial neural network layers, enable the cells to optionally let data pass through or be disposed of. Each layer yields numbers in the range of zero to one, depicting the amount of each segment of data that ought to be let through in each cell. More precisely, a value of zero implies "let nothing pass through," while a value of one indicates "let everything pass through." Three types of gates are involved in each LSTM, with the goal of controlling the state of each cell:

Forget gate
Outputs a number between zero and one, where one means "completely keep this" and zero means "completely ignore this." This gate conditionally decides whether the past should be forgotten or preserved.

Input gate
Chooses which new data needs to be stored in the cell.

Output gate
Decides what each cell will output. The output value is based on the cell state along with the filtered and newly added data.

Keras wraps the efficient numerical computation libraries and functions and allows us to define and train LSTM neural network models in a few short lines of code. In the following code, the LSTM module from keras.layers is used to implement an LSTM network. The network is trained with the variable X_train_LSTM. It has a hidden layer with 50 LSTM blocks, or neurons, and an output layer that makes a single-value prediction. Also refer to Chapter 3 for a more detailed description of all the terms involved (i.e., sequential, learning rate, momentum, epoch, and batch size). A sample Python code for implementing an LSTM model in Keras is shown below:

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.layers import LSTM

def create_LSTMmodel(learn_rate=0.01, momentum=0):
    # create model
    model = Sequential()
    model.add(LSTM(50, input_shape=(X_train_LSTM.shape[1],
                                    X_train_LSTM.shape[2])))
    # More cells can be added if needed
    model.add(Dense(1))
    optimizer = SGD(lr=learn_rate, momentum=momentum)
    # use the SGD optimizer defined above so that learn_rate and momentum take effect
    model.compile(loss='mse', optimizer=optimizer)
    return model

LSTMModel = create_LSTMmodel(learn_rate=0.01, momentum=0)
LSTMModel_fit = LSTMModel.fit(X_train_LSTM, Y_train_LSTM,
                              validation_data=(X_test_LSTM, Y_test_LSTM),
                              epochs=330, batch_size=72, verbose=0, shuffle=False)

In terms of both learning and implementation, LSTM provides considerably more options for fine-tuning than ARIMA models do.
Although deep learning models have several advantages over traditional time series models, they are also more complicated and difficult to train. (Both an ARIMA model and a Keras-based LSTM model are demonstrated in the case studies that follow.)

Modifying Time Series Data for Supervised Learning Models
A time series is a sequence of numbers that are ordered by a time index. Supervised learning is where we have input variables (X) and an output variable (Y). Given a sequence of numbers for a time series dataset, we can restructure the data into a set of predictor and predicted variables, just like in a supervised learning problem. We can do this by using previous time steps as input variables and using the next time step as the output variable.

Let's make this concrete with an example. We can restructure the time series shown in the left table in Figure 5-5 as a supervised learning problem by using the value at the previous time step to predict the value at the next time step. Once we've reorganized the time series dataset this way, the data looks like the table on the right.

Figure 5-5. Modifying time series for supervised learning models

We can see that the previous time step is the input (X) and the next time step is the output (Y) in our supervised learning problem. The order between the observations is preserved and must continue to be preserved when using this dataset to train a supervised model. We delete the first and last rows when training our supervised model, as we do not have values for either X or Y in those rows.

In Python, the main function used to transform time series data into a supervised learning problem is the shift() function from the pandas library; a minimal sketch is shown below. We will demonstrate this approach in the case studies. The use of prior time steps to predict the next time step is called the sliding window, time delay, or lag method.
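As a minimal illustration of this restructuring (with made-up values, mirroring Figure 5-5):

import pandas as pd

y = pd.Series([100, 110, 108, 115, 120])
supervised = pd.DataFrame({'X': y.shift(1), 'Y': y})
supervised = supervised.dropna()  # drops the first row, which has no X value
print(supervised)

Each remaining row pairs the previous observation (X) with the one to be predicted (Y), and the time ordering of the rows is preserved. Here only the first row is dropped; with additional lags or a forward-shifted target, more edge rows would have missing values and would be dropped, as in Figure 5-5.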
Having discussed all the concepts of supervised learning and time series models, let us move on to the case studies.

Case Study 1: Stock Price Prediction
One of the biggest challenges in finance is predicting stock prices. However, with the onset of recent advancements in machine learning applications, the field has been evolving to utilize nondeterministic solutions that learn what is going on in order to make more accurate predictions. Machine learning techniques naturally lend themselves to stock price prediction based on historical data. Predictions can be made for a single time point ahead or for a set of future time points.

As a high-level overview, other than the historical price of the stock itself, the features that are generally useful for stock price prediction are as follows:

Correlated assets
An organization depends on and interacts with many external factors, including its competitors, clients, the global economy, the geopolitical situation, fiscal and monetary policies, access to capital, and so on. Hence, its stock price may be correlated not only with the stock prices of other companies but also with other assets such as commodities, FX, broad-based indices, or even fixed income securities.

Technical indicators
Many investors follow technical indicators. Moving average, exponential moving average, and momentum are the most popular indicators.

Fundamental analysis
Two primary data sources from which to glean features for fundamental analysis are:

Performance reports
Annual and quarterly reports of companies can be used to extract or determine key metrics, such as ROE (return on equity) and P/E (price-to-earnings).

News
News can indicate upcoming events that can potentially move the stock price in a certain direction.

In this case study, we will use various supervised learning–based models to predict the stock price of Microsoft using correlated assets and its own historical data. By the end of this case study, readers will be familiar with a general machine learning approach to stock prediction modeling, from gathering and cleaning data to building and tuning different models.

In this case study, we will focus on:
• Looking at various machine learning and time series models, ranging in complexity, that can be used to predict stock returns.
• Visualizing the data using different kinds of charts (i.e., density, correlation, scatterplot, etc.).
• Using deep learning (LSTM) models for time series forecasting.
• Implementing a grid search for time series models (i.e., the ARIMA model).
• Interpreting the results and examining potential overfitting and underfitting of the data across the models.

Blueprint for Using Supervised Learning Models to Predict a Stock Price

1. Problem definition
In the supervised regression framework used for this case study, the weekly return of Microsoft stock is the predicted variable. We need to understand what affects Microsoft's stock price and incorporate as much of that information into the model as possible. Out of correlated assets, technical indicators, and fundamental analysis (discussed above), we will focus on correlated assets as features in this case study. (Refer to "Case Study 3: Bitcoin Trading Strategy" on page 179 in Chapter 6 and "Case Study 1: NLP and Sentiment Analysis–Based Trading Strategies" on page 362 in Chapter 10 for the use of technical indicators and news-based fundamental analysis as features in price prediction.)

For this case study, other than the historical data of Microsoft, the independent variables used are the following potentially correlated assets:

Stocks
IBM (IBM) and Alphabet (GOOGL)

Currency
USD/JPY and GBP/USD (equity markets have trading holidays while currency markets do not; however, the alignment of the dates across all the time series is ensured before any modeling or analysis)

Indices
S&P 500, Dow Jones, and VIX

The dataset used for this case study is extracted from Yahoo Finance and the FRED website (https://fred.stlouisfed.org). In addition to predicting the stock price accurately, this case study will also demonstrate the infrastructure and framework for each step of time series and supervised regression–based modeling for stock price prediction. We will use the daily closing price of the last 10 years, from 2010 onward.

2. Getting started—loading the data and Python packages
2.1. Loading the Python packages. The libraries used for data loading, data analysis, data preparation, model evaluation, and model tuning are listed below. The packages used for different purposes are segregated in the Python code that follows. The details of most of these packages and functions were provided in Chapters 2 and 4.
The use of these packages will be demonstrated in different steps of the model development process.

Function and modules for the supervised regression models

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.neural_network import MLPRegressor

Function and modules for data analysis and model evaluation

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_regression

Function and modules for deep learning models

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.layers import LSTM
from keras.wrappers.scikit_learn import KerasRegressor

Function and modules for time series models

from statsmodels.tsa.arima_model import ARIMA
import statsmodels.api as sm

Function and modules for data preparation and visualization

# pandas, pandas_datareader, numpy, and matplotlib
import numpy as np
import pandas as pd
import pandas_datareader.data as web
from matplotlib import pyplot
from pandas.plotting import scatter_matrix
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from statsmodels.graphics.tsaplots import plot_acf

2.2. Loading the data. One of the most important steps in machine learning and predictive modeling is gathering good data. The following steps demonstrate the loading of data from the Yahoo Finance and FRED websites using the pandas DataReader function. (In other case studies across the book we will demonstrate loading data from different sources, e.g., CSV files and external websites such as Quandl.)

stk_tickers = ['MSFT', 'IBM', 'GOOGL']
ccy_tickers = ['DEXJPUS', 'DEXUSUK']
idx_tickers = ['SP500', 'DJIA', 'VIXCLS']

stk_data = web.DataReader(stk_tickers, 'yahoo')
ccy_data = web.DataReader(ccy_tickers, 'fred')
idx_data = web.DataReader(idx_tickers, 'fred')

Next, we define our dependent (Y) and independent (X) variables. The predicted variable is the weekly return of Microsoft (MSFT). The number of trading days in a week is assumed to be five, and we compute the return using five trading days. For the independent variables we use the correlated assets and the historical return of MSFT at different frequencies.

The variables used as independent variables are the lagged five-day returns of stocks (IBM and GOOGL), currencies (USD/JPY and GBP/USD), and indices (S&P 500, Dow Jones, and VIX), along with the lagged 5-day, 15-day, 30-day, and 60-day returns of MSFT.

The lagged five-day variables embed the time series component by using a time-delay approach, where the lagged variable is included as one of the independent variables. This step reframes the time series data into a supervised regression–based model framework.
return_period = 5

Y = np.log(stk_data.loc[:, ('Adj Close', 'MSFT')]).diff(return_period).\
    shift(-return_period)
Y.name = Y.name[-1]+'_pred'

X1 = np.log(stk_data.loc[:, ('Adj Close', ('GOOGL', 'IBM'))]).diff(return_period)
X1.columns = X1.columns.droplevel()
X2 = np.log(ccy_data).diff(return_period)
X3 = np.log(idx_data).diff(return_period)

X4 = pd.concat([np.log(stk_data.loc[:, ('Adj Close', 'MSFT')]).diff(i)
                for i in [return_period, return_period*3,
                          return_period*6, return_period*12]], axis=1).dropna()
X4.columns = ['MSFT_DT', 'MSFT_3DT', 'MSFT_6DT', 'MSFT_12DT']

X = pd.concat([X1, X2, X3, X4], axis=1)

dataset = pd.concat([Y, X], axis=1).dropna().iloc[::return_period, :]
Y = dataset.loc[:, Y.name]
X = dataset.loc[:, X.columns]

3. Exploratory data analysis
We will look at descriptive statistics, data visualization, and time series analysis in this section.

3.1. Descriptive statistics. Let's have a look at the dataset we have:

dataset.head()

Output
The variable MSFT_pred is the return of Microsoft stock and is the predicted variable. The dataset contains the lagged series of the other correlated stocks, currencies, and indices, along with the lagged historical returns of MSFT.

3.2. Data visualization. The fastest way to learn more about the data is to visualize it. The visualization involves independently understanding each attribute of the dataset. We will look at the scatterplot and the correlation matrix. These plots give us a sense of the interdependence of the data. Correlation can be calculated and displayed for each pair of variables by creating a correlation matrix. Hence, besides the relationship between independent and dependent variables, it also shows the correlation among the independent variables. This is useful to know because some machine learning algorithms, such as linear and logistic regression, can have poor performance if there are highly correlated input variables in the data:

correlation = dataset.corr()
pyplot.figure(figsize=(15,15))
pyplot.title('Correlation Matrix')
sns.heatmap(correlation, vmax=1, square=True, annot=True, cmap='cubehelix')

Output
Looking at the correlation plot (a full-size version is available on GitHub at https://oreil.ly/g3wVU), we see some correlation of the predicted variable with the lagged 5-day, 15-day, 30-day, and 60-day returns of MSFT. We also see a higher negative correlation of many asset returns versus VIX, which is intuitive.

Next, we can visualize the relationship between all the variables in the regression using the scatterplot matrix shown below:

pyplot.figure(figsize=(15,15))
scatter_matrix(dataset, figsize=(12,12))
pyplot.show()

Output
Looking at the scatterplot (a full-size version is available on GitHub), we see some linear relationship of the predicted variable with the lagged 15-day, 30-day, and 60-day returns of MSFT. Otherwise, we do not see any special relationship between our predicted variable and the features.

3.3. Time series analysis.
Next, we delve into the time series analysis and look at the decomposition of the time series of the predicted variable into trend and seasonality components:

res = sm.tsa.seasonal_decompose(Y, freq=52)
fig = res.plot()
fig.set_figheight(8)
fig.set_figwidth(15)
pyplot.show()

Output
We can see that for MSFT there has been a general upward trend in the return series. This may be due to the large run-up of MSFT in recent years, causing more positive than negative weekly return data points. (Note that the time series is the stock return, not the stock price, so the trend is mild compared to that of the price series.) The trend may show up in the constant/bias terms of our models. The residual (or white noise) term is relatively small over the entire time series.

4. Data preparation
This step typically involves data processing, data cleaning, looking at feature importance, and performing feature reduction. The data obtained for this case study is relatively clean and does not require further processing. Feature reduction might be useful here, but given the relatively small number of variables considered, we will keep all of them as is. We will demonstrate data preparation in detail in some of the subsequent case studies.

5. Evaluate models
5.1. Train-test split and evaluation metrics. As described in Chapter 2, it is a good idea to partition the original dataset into a training set and a test set. The test set is a sample of the data that we hold back from our analysis and modeling. We use it at the very end of our project to confirm the performance of our final model. It is the final test that gives us confidence in our estimates of accuracy on unseen data. We will use 80% of the dataset for modeling and 20% for testing. With time series data, the sequence of values is important, so we do not distribute the dataset into training and test sets in random fashion; instead, we select an arbitrary split point in the ordered list of observations and create two new datasets:

validation_size = 0.2
train_size = int(len(X) * (1-validation_size))
X_train, X_test = X[0:train_size], X[train_size:len(X)]
Y_train, Y_test = Y[0:train_size], Y[train_size:len(X)]

5.2. Test options and evaluation metrics. To optimize the various hyperparameters of the models, we use ten-fold cross validation (CV) and recalculate the results ten times to account for the inherent randomness in some of the models and in the CV process. We will evaluate algorithms using the mean squared error metric, which gives an idea of the performance of the supervised regression models. All these concepts, including cross validation and evaluation metrics, were described in Chapter 4:

num_folds = 10
seed = 7  # seed for reproducibility (used by KFold below)
scoring = 'neg_mean_squared_error'

5.3. Compare models and algorithms. Now that we have completed the data loading and designed the test harness, we need to choose a model.

5.3.1. Machine learning models from scikit-learn.
In this step, the supervised regression models are implemented using the sklearn package:

Regression and tree regression algorithms

models = []
models.append(('LR', LinearRegression()))
models.append(('LASSO', Lasso()))
models.append(('EN', ElasticNet()))
models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor()))
models.append(('SVR', SVR()))

Neural network algorithms

models.append(('MLP', MLPRegressor()))

Ensemble models

# Boosting methods
models.append(('ABR', AdaBoostRegressor()))
models.append(('GBR', GradientBoostingRegressor()))
# Bagging methods
models.append(('RFR', RandomForestRegressor()))
models.append(('ETR', ExtraTreesRegressor()))

Once we have selected all the models, we loop over each of them. First, we run the k-fold analysis. Next, we run the model on the entire training and testing dataset. All the algorithms use default tuning parameters. We will calculate the mean and standard deviation of the evaluation metric for each algorithm and collect the results for model comparison later:

names = []
kfold_results = []
test_results = []
train_results = []
for name, model in models:
    names.append(name)

    ## k-fold analysis:
    kfold = KFold(n_splits=num_folds, random_state=seed)
    # converted mean squared error to positive; the lower the better
    cv_results = -1 * cross_val_score(model, X_train, Y_train, cv=kfold,
                                      scoring=scoring)
    kfold_results.append(cv_results)

    # Full training period
    res = model.fit(X_train, Y_train)
    train_result = mean_squared_error(res.predict(X_train), Y_train)
    train_results.append(train_result)

    # Test results
    test_result = mean_squared_error(res.predict(X_test), Y_test)
    test_results.append(test_result)

Let's compare the algorithms by looking at the cross validation results:

Cross validation results

fig = pyplot.figure()
fig.suptitle('Algorithm Comparison: Kfold results')
ax = fig.add_subplot(111)
pyplot.boxplot(kfold_results)
ax.set_xticklabels(names)
fig.set_size_inches(15,8)
pyplot.show()

Output
Although the results of a couple of the models look good, we see that linear regression and the regularized regressions, i.e., lasso regression (LASSO) and elastic net (EN), seem to perform best. This indicates a strong linear relationship between the dependent and independent variables. Going back to the exploratory analysis, we saw a good correlation and linear relationship of the target variable with the different lagged MSFT variables.

Let us look at the errors on the test set as well:

Training and test error

# compare algorithms
fig = pyplot.figure()

ind = np.arange(len(names))  # the x locations for the groups
width = 0.35  # the width of the bars

fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.bar(ind - width/2, train_results, width=width, label='Train Error')
pyplot.bar(ind + width/2, test_results, width=width, label='Test Error')
fig.set_size_inches(15,8)
pyplot.legend()
ax.set_xticks(ind)
ax.set_xticklabels(names)
pyplot.show()

Output
Examining the training and test error, we still see a stronger performance from the linear models. Some of the algorithms, such as the decision tree regressor (CART), overfit the training data and produce very high error on the test set. Ensemble models such as gradient boosting regression (GBR) and random forest regression (RFR) have low bias but high variance.
We also see that the artificial neural network algorithm (shown as MLP in the chart) shows higher errors in both the training and test sets. This is perhaps due to the linear relationship of the variables not being captured accurately by the ANN, improper hyperparameters, or insufficient training of the model. Our original intuition from the cross validation results and the scatterplots also seems to point to a better performance of linear models.

We now look at some of the time series and deep learning models that can be used. Once we are done creating these, we will compare their performance against that of the supervised regression–based models. Due to the nature of time series models, we are not able to run a k-fold analysis; we can still compare our results to the other models based on the full training and testing results.

5.3.2. Time series–based models: ARIMA and LSTM. The models used so far already embed the time series component by using a time-delay approach, where the lagged variable is included as one of the independent variables. However, the time series–based models do not need the lagged variables of MSFT as independent variables. Hence, as a first step we remove MSFT's previous returns for these models and use all the other variables as exogenous variables.

Let us first prepare the dataset for the ARIMA models by keeping only the correlated variables as exogenous variables:

X_train_ARIMA = X_train.loc[:, ['GOOGL', 'IBM', 'DEXJPUS', 'SP500', 'DJIA',
                                'VIXCLS']]
X_test_ARIMA = X_test.loc[:, ['GOOGL', 'IBM', 'DEXJPUS', 'SP500', 'DJIA',
                              'VIXCLS']]
tr_len = len(X_train_ARIMA)
te_len = len(X_test_ARIMA)
to_len = len(X)

We now configure the ARIMA model with the order (1,0,0) and use the independent variables as the exogenous variables in the model. The version of the ARIMA model in which exogenous variables are also used is known as the ARIMAX model, where the "X" represents exogenous variables:

modelARIMA = ARIMA(endog=Y_train, exog=X_train_ARIMA, order=[1,0,0])
model_fit = modelARIMA.fit()

Having fit the ARIMA model, we now compute its errors on the training and test sets:

error_Training_ARIMA = mean_squared_error(Y_train, model_fit.fittedvalues)
predicted = model_fit.predict(start=tr_len-1, end=to_len-1,
                              exog=X_test_ARIMA)[1:]
error_Test_ARIMA = mean_squared_error(Y_test, predicted)
error_Test_ARIMA

Output
0.0005931919240399084

The error of this ARIMA model is reasonable.

Now let's prepare the dataset for the LSTM model. We need the data in the form of arrays of all the input variables and the output variable. The logic behind the LSTM is that data is taken from the previous day (the data of all the other features for that day—the correlated assets and the lagged variables of MSFT) and we try to predict the next day. Then we move the one-day window by one day and again predict the next day. We iterate like this over the whole dataset (in batches, of course).
The code below creates a dataset in which X is the set of independent variables at a given time (t) and Y is the target variable at the next time (t + 1):

seq_len = 2  # length of the sequence for the LSTM

Y_train_LSTM, Y_test_LSTM = np.array(Y_train)[seq_len-1:], np.array(Y_test)
X_train_LSTM = np.zeros((X_train.shape[0]+1-seq_len, seq_len, X_train.shape[1]))
X_test_LSTM = np.zeros((X_test.shape[0], seq_len, X.shape[1]))
for i in range(seq_len):
    X_train_LSTM[:, i, :] = np.array(X_train)[i:X_train.shape[0]+i+1-seq_len, :]
    X_test_LSTM[:, i, :] = np.array(X)\
        [X_train.shape[0]+i-1:X.shape[0]+i+1-seq_len, :]

In the next step, we create the LSTM architecture. As we can see, the input of the LSTM is in X_train_LSTM, which goes into 50 hidden units in the LSTM layer and is then transformed into a single output—the stock return value. The hyperparameters (i.e., learning rate, optimizer, activation function, etc.) were discussed in Chapter 3:

# LSTM Network
def create_LSTMmodel(learn_rate=0.01, momentum=0):
    # create model
    model = Sequential()
    model.add(LSTM(50, input_shape=(X_train_LSTM.shape[1],
                                    X_train_LSTM.shape[2])))
    # More cells can be added if needed
    model.add(Dense(1))
    optimizer = SGD(lr=learn_rate, momentum=momentum)
    # use the SGD optimizer defined above so that learn_rate and momentum take effect
    model.compile(loss='mse', optimizer=optimizer)
    return model

LSTMModel = create_LSTMmodel(learn_rate=0.01, momentum=0)
LSTMModel_fit = LSTMModel.fit(X_train_LSTM, Y_train_LSTM,
                              validation_data=(X_test_LSTM, Y_test_LSTM),
                              epochs=330, batch_size=72, verbose=0, shuffle=False)

Now we fit the LSTM model with the data and look at the change in the model performance metric over time, simultaneously in the training set and the test set:

pyplot.plot(LSTMModel_fit.history['loss'], label='train')
pyplot.plot(LSTMModel_fit.history['val_loss'], '--', label='test')
pyplot.legend()
pyplot.show()

Output

error_Training_LSTM = mean_squared_error(Y_train_LSTM,
                                         LSTMModel.predict(X_train_LSTM))
predicted = LSTMModel.predict(X_test_LSTM)
error_Test_LSTM = mean_squared_error(Y_test, predicted)

Now, in order to compare the time series and the deep learning models, we append the results of these models to the results of the supervised regression–based models:

test_results.append(error_Test_ARIMA)
test_results.append(error_Test_LSTM)

train_results.append(error_Training_ARIMA)
train_results.append(error_Training_LSTM)

names.append("ARIMA")
names.append("LSTM")

Output
Looking at the chart, we find the time series–based ARIMA model comparable to the linear supervised regression models: linear regression (LR), lasso regression (LASSO), and elastic net (EN). This can primarily be attributed to the strong linear relationship discussed before. The LSTM model performs decently; however, the ARIMA model outperforms it on the test set. Hence, we select the ARIMA model for model tuning.

6. Model tuning and grid search
Let us perform the model tuning of the ARIMA model.

Model Tuning for the Supervised Learning or Time Series Models
The detailed implementation of grid search for all the supervised learning–based models, along with the ARIMA and LSTM models, is provided in the Regression-Master template in the GitHub repository for this book (https://oreil.ly/9S8h_). For the grid search of the ARIMA and LSTM models, refer to the "ARIMA and LSTM Grid Search" section of the Regression-Master template.
The ARIMA model is generally represented as an ARIMA(p,d,q) model, where p is the order of the autoregressive part, d is the degree of first differencing involved, and q is the order of the moving average part. The order of our ARIMA model was set to (1,0,0). So we perform a grid search over different p, d, and q combinations in the ARIMA model's order and select the combination that minimizes the fitting error:

import warnings

def evaluate_arima_model(arima_order):
    modelARIMA = ARIMA(endog=Y_train, exog=X_train_ARIMA, order=arima_order)
    model_fit = modelARIMA.fit()
    error = mean_squared_error(Y_train, model_fit.fittedvalues)
    return error

# evaluate combinations of p, d, and q values for an ARIMA model
def evaluate_models(p_values, d_values, q_values):
    best_score, best_cfg = float("inf"), None
    for p in p_values:
        for d in d_values:
            for q in q_values:
                order = (p,d,q)
                try:
                    mse = evaluate_arima_model(order)
                    if mse < best_score:
                        best_score, best_cfg = mse, order
                    print('ARIMA%s MSE=%.7f' % (order, mse))
                except:
                    continue
    print('Best ARIMA%s MSE=%.7f' % (best_cfg, best_score))

# evaluate parameters
p_values = [0, 1, 2]
d_values = range(0, 2)
q_values = range(0, 2)
warnings.filterwarnings("ignore")
evaluate_models(p_values, d_values, q_values)

Output
ARIMA(0, 0, 0) MSE=0.0009879
ARIMA(0, 0, 1) MSE=0.0009721
ARIMA(1, 0, 0) MSE=0.0009696
ARIMA(1, 0, 1) MSE=0.0009685
ARIMA(2, 0, 0) MSE=0.0009684
ARIMA(2, 0, 1) MSE=0.0009683
Best ARIMA(2, 0, 1) MSE=0.0009683

We see that the ARIMA model with the order (2,0,1) is the best performer of all the combinations tested in the grid search, although the difference in mean squared error (MSE) from the other combinations is not significant. This means that the model with an autoregressive lag of two and a moving average of one yields the best result. We should not forget that the exogenous variables in the model also influence which ARIMA order performs best.

7. Finalize the model
In the last step we will check the finalized model on the test set.

7.1. Results on the test dataset.

# prepare model
modelARIMA_tuned = ARIMA(endog=Y_train, exog=X_train_ARIMA, order=[2,0,1])
model_fit_tuned = modelARIMA_tuned.fit()

# estimate accuracy on validation set (using the tuned model fit above)
predicted_tuned = model_fit_tuned.predict(start=tr_len-1,
                                          end=to_len-1, exog=X_test_ARIMA)[1:]
print(mean_squared_error(Y_test, predicted_tuned))

Output
0.0005970582461404503

The MSE of the model on the test set looks good and is actually less than that of the training set.

In the last step, we visualize the output of the selected model and compare the modeled data against the actual data. To visualize the chart, we convert the return time series to a price time series and, for the sake of simplicity, assume the price at the beginning of the test set to be one. Let us look at the plot of actual versus predicted data:

# plotting the actual data versus predicted data
predicted_tuned.index = Y_test.index
pyplot.plot(np.exp(Y_test).cumprod(), 'r', label='actual')
pyplot.plot(np.exp(predicted_tuned).cumprod(), 'b--', label='predicted')
pyplot.legend()
pyplot.rcParams["figure.figsize"] = (8,5)
pyplot.show()

Looking at the chart, we see clearly that the trend has been captured well by the model.
The predicted series is less volatile than the actual time series, and it aligns with the actual data for the first few months of the test set. A point to note is that the purpose of the model is to compute the next day's return given the data observed up to the present day, not to predict the stock price several days into the future given the current data. Hence, a deviation from the actual data is expected as we move away from the beginning of the test set. The model seems to perform well for the first few months, with the deviation from the actual data increasing six to seven months after the beginning of the test set.

Conclusion
We can conclude that simple models—linear regression and regularized regression (i.e., lasso and elastic net)—along with time series models such as ARIMA, are promising modeling approaches for stock price prediction problems. Such models help us deal with overfitting and underfitting, which are some of the key challenges in prediction problems in finance.

We should also note that we could use a wider set of indicators, such as the P/E ratio, trading volume, technical indicators, or news data, which might lead to better results. We will demonstrate this in some of the later case studies in the book.

Overall, we created a supervised regression and time series modeling framework that allows us to perform stock price prediction using historical data. This framework generates results that can be used to analyze risk and profitability before risking any capital.

Case Study 2: Derivative Pricing
In computational finance and risk management, several numerical methods (e.g., finite differences, Fourier methods, and Monte Carlo simulation) are commonly used for the valuation of financial derivatives. The Black-Scholes formula is probably one of the most widely cited and used models in derivative pricing. Numerous variations and extensions of this formula are used to price many kinds of financial derivatives. However, the model is based on several assumptions. It assumes a specific form of movement for the underlying asset price, namely a geometric Brownian motion (GBM). It also assumes a conditional payment at maturity of the option and economic constraints, such as no-arbitrage. Several other derivative pricing models have similarly impractical model assumptions. Finance practitioners are well aware that these assumptions are violated in practice, and prices from these models are further adjusted using practitioner judgment.

Another aspect of many traditional derivative pricing models is model calibration, which is typically done not with historical asset prices but by means of derivative prices (i.e., by matching the market prices of heavily traded options to the derivative prices from the mathematical model). In the process of model calibration, thousands of derivative prices need to be determined in order to fit the parameters of the model, and the overall process is time consuming. Efficient numerical computation is increasingly important in financial risk management, especially when dealing with real-time risk management (e.g., high-frequency trading). However, because of the demand for highly efficient computation, certain high-quality asset models and methodologies are discarded during the model calibration of traditional derivative pricing models.

Machine learning can potentially be used to tackle these drawbacks related to impractical model assumptions and inefficient model calibration.
Machine learning algorithms have the ability to tackle more nuances with very few theoretical assumptions and can be effectively used for derivative pricing, even in a world with frictions. With the advancements in hardware, we can train machine learning models on high-performance CPUs, GPUs, and other specialized hardware, achieving a speed increase of several orders of magnitude over traditional derivative pricing models. Additionally, market data is plentiful, so it is possible to train a machine learning algorithm to learn the function that is collectively generating derivative prices in the market. Machine learning models can capture subtle nonlinearities in the data that are not obtainable through other statistical approaches.

In this case study, we look at derivative pricing from a machine learning standpoint and use a supervised regression–based model to price an option from simulated data. The main idea is to come up with a machine learning framework for derivative pricing. Achieving a machine learning model with high accuracy would mean that we can leverage the efficient numerical calculation of machine learning for derivative pricing with fewer underlying model assumptions.

In this case study, we will focus on:
• Developing a machine learning–based framework for derivative pricing.
• Comparing linear and nonlinear supervised regression models in the context of derivative pricing.

Blueprint for Developing a Machine Learning Model for Derivative Pricing

1. Problem definition
In the supervised regression framework used for this case study, the predicted variable is the price of the option, and the predictor variables are the market data used as inputs to the Black-Scholes option pricing model.

The variables selected to estimate the market price of the option are stock price, strike price, time to expiration, volatility, interest rate, and dividend yield. The predicted variable for this case study was generated by feeding random inputs into the well-known Black-Scholes model. (Ideally, the predicted variable—the option price—would be obtained directly from the market; since this case study is mainly for demonstration purposes, we use the model-generated option price for convenience.)

The price of a call option per the Black-Scholes option pricing model is defined in Equation 5-1.

Equation 5-1. Black-Scholes equation for a call option

S·e^{−qτ}·Φ(d_1) − e^{−rτ}·K·Φ(d_2)

with

d_1 = (ln(S/K) + (r − q + σ²/2)τ) / (σ√τ)

and

d_2 = (ln(S/K) + (r − q − σ²/2)τ) / (σ√τ) = d_1 − σ√τ

where we have stock price S; strike price K; risk-free rate r; annual dividend yield q; time to maturity τ = T − t (represented as a unitless fraction of one year); and volatility σ.

To make the logic simpler, we define moneyness as M = K/S and look at prices per unit of current stock price. We also set q to 0. This simplifies the formula to the following:

Φ((−ln(M) + (r + σ²/2)τ) / (σ√τ)) − e^{−rτ}·M·Φ((−ln(M) + (r − σ²/2)τ) / (σ√τ))

Looking at the equation above, the parameters that feed into the Black-Scholes option pricing model are moneyness, risk-free rate, volatility, and time to maturity. A quick numerical check of this simplified formula is sketched below.
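As a sanity check of the simplified formula (a worked example of ours, not from the book's repository), the following sketch prices a single at-the-money call with illustrative inputs M = 1, τ = 1, r = 0.05, and σ = 0.2:

import numpy as np
from scipy.stats import norm

M, tau, r, sigma = 1.0, 1.0, 0.05, 0.2  # moneyness, maturity, rate, volatility

d1 = (-np.log(M) + (r + sigma**2/2)*tau) / (sigma*np.sqrt(tau))
d2 = (-np.log(M) + (r - sigma**2/2)*tau) / (sigma*np.sqrt(tau))
price = norm.cdf(d1) - M * np.exp(-r*tau) * norm.cdf(d2)
print(round(price, 4))  # roughly 0.1045, i.e., about 10% of the spot price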
The parameter that plays the central role in the derivatives market is volatility, as it is directly related to the movement of stock prices. As volatility increases, the range of share price movements becomes much wider than that of a low-volatility stock.

In the options market, there isn't a single volatility used to price all options. This volatility depends on the option's moneyness and time to maturity. In general, the volatility increases with higher time to maturity and with moneyness. This behavior is referred to as the volatility smile/skew. The volatility is often derived from the prices of options existing in the market, in which case it is referred to as "implied" volatility. In this exercise, we assume the structure of the volatility surface and use the function in Equation 5-2, in which volatility depends on the option's moneyness and time to maturity, to generate the option volatility surface.

Equation 5-2. Equation for volatility

σ(M, τ) = σ_0 + ατ + β(M − 1)²

2. Getting started—loading the data and Python packages
2.1. Loading the Python packages. The loading of the Python packages is similar to case study 1 in this chapter. Please refer to the Jupyter notebook of this case study for more details.

2.2. Defining functions and parameters. To generate the dataset, we need to simulate the input parameters and then create the predicted variable.

As a first step, we define the constant parameters. The constant parameters required for the volatility surface are defined below. These parameters are not expected to have a significant impact on the option price; therefore, they are set to some meaningful values:

true_alpha = 0.1
true_beta = 0.1
true_sigma0 = 0.2

The risk-free rate, which is an input to the Black-Scholes option pricing model, is defined as follows:

risk_free_rate = 0.05

Volatility and option pricing functions. In this step we define the functions to compute the volatility and the price of a call option per Equations 5-1 and 5-2:

def option_vol_from_surface(moneyness, time_to_maturity):
    return true_sigma0 + true_alpha * time_to_maturity + \
           true_beta * np.square(moneyness - 1)

def call_option_price(moneyness, time_to_maturity, option_vol):
    # NB: Equation 5-1 uses sigma^2/2 in d1 and d2; the code as given uses
    # sigma^2, which amounts to a rescaling of the volatility input in this
    # simulated exercise. It is kept as-is for consistency with the outputs
    # shown below.
    d1 = (np.log(1/moneyness) + (risk_free_rate + np.square(option_vol)) *
          time_to_maturity) / (option_vol*np.sqrt(time_to_maturity))
    d2 = (np.log(1/moneyness) + (risk_free_rate - np.square(option_vol)) *
          time_to_maturity) / (option_vol*np.sqrt(time_to_maturity))
    N_d1 = norm.cdf(d1)
    N_d2 = norm.cdf(d2)
    return N_d1 - moneyness * np.exp(-risk_free_rate*time_to_maturity) * N_d2

2.3. Data generation. We generate the input and output variables in the following steps:
• Time to maturity (Ts) is generated using the np.random.random function, which generates a uniform random variable between zero and one.
• Moneyness (Ks) is generated using the np.random.randn function, which generates a normally distributed random variable. The random number multiplied by 0.25 generates the deviation of the strike from the spot price (when the spot price equals the strike price, the option is at-the-money), and the overall equation ensures that the moneyness is greater than zero.
• Volatility (Sigmas) is generated as a function of time to maturity and moneyness using Equation 5-2.
• The option price (Ps) is generated using Equation 5-1 for the Black-Scholes option price.
In total we generate 10,000 data points (N):

N = 10000

Ks = 1+0.25*np.random.randn(N)
Ts = np.random.random(N)
Sigmas = np.array([option_vol_from_surface(k,t) for k,t in zip(Ks,Ts)])
Ps = np.array([call_option_price(k,t,sig) for k,t,sig in zip(Ks,Ts,Sigmas)])

Now we create the predicted and predictor variables:

Y = Ps
X = np.concatenate([Ks.reshape(-1,1), Ts.reshape(-1,1), Sigmas.reshape(-1,1)],
                   axis=1)

dataset = pd.DataFrame(np.concatenate([Y.reshape(-1,1), X], axis=1),
                       columns=['Price', 'Moneyness', 'Time', 'Vol'])

3. Exploratory data analysis
Let's have a look at the dataset we have.

3.1. Descriptive statistics.

dataset.head()

Output

   Price      Moneyness  Time   Vol
0  1.390e-01  0.898      0.221  0.223
1  3.814e-06  1.223      0.052  0.210
2  1.409e-01  0.969      0.391  0.239
3  1.984e-01  0.950      0.628  0.263
4  2.495e-01  0.914      0.810  0.282

The dataset contains Price—the price of the option, which is the predicted variable—along with moneyness (the ratio of strike and spot price), time to maturity, and volatility, which are the features in the model.

3.2. Data visualization. In this step we look at the scatterplot to understand the interaction between the different variables (refer to the Jupyter notebook of this case study for other charts, such as the histogram plot and the correlation plot):

pyplot.figure(figsize=(15,15))
scatter_matrix(dataset, figsize=(12,12))
pyplot.show()

Output
The scatterplot reveals very interesting dependencies and relationships between the variables. Let us look at the first row of the chart to see the relationship of price to the different variables. We observe that as moneyness decreases (i.e., the strike price decreases relative to the stock price), there is an increase in the price, which is in line with the rationale described in the previous section. Looking at price versus time to maturity, we see an increase in the option price. The price-versus-volatility chart also shows an increase in the price with volatility. However, the option price seems to exhibit a nonlinear relationship with most of the variables. This means that we expect our nonlinear models to do a better job than our linear models.

Another interesting relationship is between volatility and strike: the more we deviate from a moneyness of one, the higher the volatility we observe. This behavior is due to the volatility function we defined before and illustrates the volatility smile/skew.

4. Data preparation and analysis
We performed most of the data preparation steps (i.e., getting the dependent and independent variables) in the preceding sections. In this step we look at the feature importance.

4.1. Univariate feature selection. We start by looking at each feature individually and, using the single-variable regression fit as the criterion, identify the most important variables:

bestfeatures = SelectKBest(k='all', score_func=f_regression)
fit = bestfeatures.fit(X,Y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(['Moneyness', 'Time', 'Vol'])
# concat two dataframes for better visualization
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  # naming the dataframe columns
featureScores.nlargest(10,'Score').set_index('Specs')

Output
Moneyness : 30282.309
Vol       :  2407.757
Time      :  1597.452

We observe that moneyness is the most important variable for the option price, followed by volatility and time to maturity. Given that there are only three predictor variables, we retain all of them for modeling.
5. Evaluate models
5.1. Train-test split and evaluation metrics. First, we separate the training set and the test set:

validation_size = 0.2
train_size = int(len(X) * (1-validation_size))
X_train, X_test = X[0:train_size], X[train_size:len(X)]
Y_train, Y_test = Y[0:train_size], Y[train_size:len(X)]

We use the prebuilt sklearn models to run a k-fold analysis on our training data. We then train the model on the full training data and use it for prediction on the test data. We will evaluate the algorithms using the mean squared error metric. The parameters for the k-fold analysis and the evaluation metric are defined as follows:

num_folds = 10
seed = 7
scoring = 'neg_mean_squared_error'

5.2. Compare models and algorithms. Now that we have completed the data loading and designed the test harness, we need to choose a model from the suite of supervised regression models.

Linear models and regression trees

models = []
models.append(('LR', LinearRegression()))
models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor()))
models.append(('SVR', SVR()))

Artificial neural network

models.append(('MLP', MLPRegressor()))

Boosting and bagging methods

# Boosting methods
models.append(('ABR', AdaBoostRegressor()))
models.append(('GBR', GradientBoostingRegressor()))
# Bagging methods
models.append(('RFR', RandomForestRegressor()))
models.append(('ETR', ExtraTreesRegressor()))

Once we have selected all the models, we loop over each of them. First, we run the k-fold analysis. Next, we run the model on the entire training and testing dataset. The algorithms use default tuning parameters. We calculate the mean and standard deviation of the error metric and save the results for later use. The Python code for the k-fold analysis step is similar to that used in case study 1; readers can also refer to the Jupyter notebook of this case study in the code repository for more details.

Output
Let us look at the performance of the models on the training set. We see clearly that the nonlinear models, including the classification and regression tree (CART), the ensemble models, and the artificial neural network (represented by MLP in the chart above), perform a lot better than the linear algorithms. This is intuitive given the nonlinear relationships we observed in the scatterplot.

Artificial neural networks (ANNs) have the natural ability to model any function, with fast experimentation and deployment times (definition, training, testing, inference), and they can be used effectively in complex derivative pricing situations. Hence, out of all the models with good performance, we choose the ANN for further analysis.

6. Model tuning and finalizing the model
Determining the proper number of nodes for the middle layers of an ANN is more of an art than a science, as discussed in Chapter 3. Too many nodes in the middle layer, and thus too many connections, produce a neural network that memorizes the input data and lacks the ability to generalize. Therefore, increasing the number of nodes in the middle layer will improve performance on the training set, while decreasing the number of nodes in the middle layer will improve performance on a new dataset.

As discussed in Chapter 3, the ANN model has several other hyperparameters, such as learning rate, momentum, activation function, number of epochs, and batch size. All these hyperparameters can be tuned during the grid search process.
However, in this step, we stick to performing a grid search on the number of hidden layers for the purpose of simplicity. The approach to performing a grid search on the other hyperparameters is the same as described in the following code snippet:

'''
hidden_layer_sizes : tuple, length = n_layers - 2, default (100,)
    The ith element represents the number of neurons in the ith hidden layer.
'''
param_grid={'hidden_layer_sizes': [(20,), (50,), (20,20), (20, 30, 20)]}
model = MLPRegressor()
kfold = KFold(n_splits=num_folds, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, \
    cv=kfold)
grid_result = grid.fit(X_train, Y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Output

Best: -0.000024 using {'hidden_layer_sizes': (20, 30, 20)}
-0.000580 (0.000601) with: {'hidden_layer_sizes': (20,)}
-0.000078 (0.000041) with: {'hidden_layer_sizes': (50,)}
-0.000090 (0.000140) with: {'hidden_layer_sizes': (20, 20)}
-0.000024 (0.000011) with: {'hidden_layer_sizes': (20, 30, 20)}

The best model has three hidden layers, with 20, 30, and 20 nodes, respectively. Hence, we prepare a model with this configuration and check its performance on the test set. This is a crucial step, because a greater number of layers may lead to overfitting and poor performance on the test set.

# prepare model
model_tuned = MLPRegressor(hidden_layer_sizes=(20, 30, 20))
model_tuned.fit(X_train, Y_train)

# estimate accuracy on validation set
# transform the validation dataset
predictions = model_tuned.predict(X_test)
print(mean_squared_error(Y_test, predictions))

Output

3.08127276609567e-05

We see that the mean squared error is 3.08e–5, corresponding to a root mean squared error (RMSE) of roughly 0.0055—less than one cent. Hence, the ANN model does an excellent job of fitting the Black-Scholes option pricing model. A greater number of layers and tuning of other hyperparameters may enable the ANN model to capture the complex relationships and nonlinearity in the data even better. Overall, the results suggest that an ANN may be used to train an option pricing model that matches market prices.

7. Additional analysis: removing the volatility data
As an additional analysis, we make the process harder by trying to predict the price without the volatility data. If the model performance is good, we will eliminate the need to have a volatility function as described before. In this step, we further compare the performance of the linear and nonlinear models. In the following code snippet, we remove the volatility variable from the dataset of the predictor variables and define the training set and test set again:

X = X[:, :2]
validation_size = 0.2
train_size = int(len(X) * (1-validation_size))
X_train, X_test = X[0:train_size], X[train_size:len(X)]
Y_train, Y_test = Y[0:train_size], Y[train_size:len(X)]

Next, we run the suite of models (except the regularized regression models) with the new dataset, with the same parameters and similar Python code as before.
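For reference, a minimal sketch of that spot-check loop is shown below. It mirrors the pattern from case study 1 rather than reproducing the notebook verbatim, and it assumes the models list and the k-fold settings (num_folds, seed, scoring) defined in step 5:

results = []
names = []
for name, model in models:
    # k-fold cross validation on the training set
    kfold = KFold(n_splits=num_folds, random_state=seed)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold,
        scoring=scoring)
    results.append(cv_results)
    names.append(name)
    # fit on the full training set and compute the test set error
    res = model.fit(X_train, Y_train)
    test_result = mean_squared_error(Y_test, res.predict(X_test))
    print('%s: %f (%f), test MSE: %f' % (name, cv_results.mean(),
        cv_results.std(), test_result))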
The performance of all the models after removing the volatility data is as follows:

Looking at the result, we reach a similar conclusion as before: the linear regression performs poorly, and the ensemble and ANN models perform well. The linear regression now does an even worse job than before. However, the performance of the ANN and the other ensemble models does not deviate much from their previous performance. This implies that the information in the volatility is likely captured in other variables, such as moneyness and time to maturity. Overall, it is good news, as it means that fewer variables might be needed to achieve the same performance.

Conclusion
We know that derivative pricing is a nonlinear problem. As expected, our linear regression model did not do as well as our nonlinear models, and the nonlinear models have a very good overall performance. We also observed that removing the volatility increases the difficulty of the prediction problem for the linear regression. However, the nonlinear models, such as the ensemble models and the ANN, are still able to do well at the prediction process. This indicates that one might be able to sidestep the development of an option volatility surface and achieve a good prediction with a smaller number of variables.

We saw that an artificial neural network (ANN) can reproduce the Black-Scholes option pricing formula for a call option to a high degree of accuracy, meaning we can leverage the efficient numerical calculation of machine learning in derivative pricing without relying on the impractical assumptions made in the traditional derivative pricing models. The ANN and the related machine learning architecture can easily be extended to pricing derivatives in the real world, with no knowledge of the theory of derivative pricing. The use of machine learning techniques can lead to much faster derivative pricing compared to traditional derivative pricing models. The price we might have to pay for this extra speed is some loss of accuracy. However, this reduced accuracy is often well within reasonable limits and acceptable from a practical point of view. New technology has commoditized the use of ANNs, so it might be worthwhile for banks, hedge funds, and financial institutions to explore these models for derivative pricing.

Case Study 3: Investor Risk Tolerance and Robo-Advisors
The risk tolerance of an investor is one of the most important inputs to the portfolio allocation and rebalancing steps of the portfolio management process. There is a wide variety of risk profiling tools that take varied approaches to understanding the risk tolerance of an investor. Most of these approaches include qualitative judgment and involve significant manual effort. In most cases, the risk tolerance of an investor is decided based on a risk tolerance questionnaire.

Several studies have shown that these risk tolerance questionnaires are prone to error, as investors suffer from behavioral biases and are poor judges of their own risk perception, especially during stressed markets. Also, given that these questionnaires must be manually completed by investors, they eliminate the possibility of automating the entire investment management process.

So can machine learning provide a better understanding of an investor's risk profile than a risk tolerance questionnaire can?
Can machine learning contribute to automating the entire portfolio management process by cutting the client out of the loop? Could an algorithm be written to develop a personality profile for the client that would be a better representation of how they would deal with different market scenarios?

The goal of this case study is to answer these questions. We first build a supervised regression–based model to predict the risk tolerance of an investor. We then build a robo-advisor dashboard in Python and implement the risk tolerance prediction model in the dashboard. The overall purpose is to demonstrate the automation of the manual steps in the portfolio management process with the help of machine learning. This can prove to be immensely useful, specifically for robo-advisors.

A dashboard is one of the key features of a robo-advisor, as it provides access to important information and allows users to interact with their accounts free of any human dependency, making the portfolio management process highly efficient.

Figure 5-6 provides a quick glance at the robo-advisor dashboard built for this case study. The dashboard performs end-to-end asset allocation for an investor, embedding the machine learning–based risk tolerance model constructed in this case study.

Figure 5-6. Robo-advisor dashboard

This dashboard has been built in Python and is described in detail in an additional step in this case study. Although it has been built in the context of robo-advisors, it can be extended to other areas in finance and can embed the machine learning models discussed in other case studies, providing finance decision makers with a graphical interface for analyzing and interpreting model results.

In this case study, we will focus on:

• Feature elimination and feature importance/intuition.
• Using machine learning to automate the manual processes involved in the portfolio management process.
• Using machine learning to quantify and model the behavioral bias of investors/individuals.
• Embedding machine learning models into user interfaces or dashboards using Python.

Blueprint for Modeling Investor Risk Tolerance and Enabling a Machine Learning–Based Robo-Advisor

1. Problem definition
In the supervised regression framework used for this case study, the predicted variable is the "true" risk tolerance of an individual, and the predictor variables are demographic, financial, and behavioral attributes of the individual. (Given that the primary purpose of the model is to be used in the portfolio management context, the individual is also referred to as the investor in this case study.)

The data used for this case study is from the Survey of Consumer Finances (SCF), which is conducted by the Federal Reserve Board. The survey includes responses about household demographics, net worth, and financial and nonfinancial assets for the same set of individuals in 2007 (precrisis) and 2009 (postcrisis). This enables us to see how each household's allocation changed after the 2008 global financial crisis. Refer to the data dictionary for more information on this survey.

2. Getting started—loading the data and Python packages
2.1. Loading the Python packages. The details on loading the standard Python packages were presented in the previous case studies. Refer to the Jupyter notebook for this case study for more details.

2.2. Loading the data.
In this step we load the data from the Survey of Consumer Finances and look at the data shape:

# load dataset
dataset = pd.read_excel('SCFP2009panel.xlsx')

Let us look at the size of the data:

dataset.shape

Output

(19285, 515)

As we can see, the dataset has a total of 19,285 observations with 515 columns. The number of columns represents the number of features.

3. Data preparation and feature selection
In this step we prepare the predicted and predictor variables to be used for modeling.

3.1. Preparing the predicted variable. In the first step, we prepare the predicted variable, which is the true risk tolerance. The steps to compute the true risk tolerance are as follows:

1. Compute the risky assets and the risk-free assets for all the individuals in the survey data. Risky and risk-free assets are defined as follows:

Risky assets
    Investments in mutual funds, stocks, and bonds.

Risk-free assets
    Checking and savings balances, certificates of deposit, and other cash balances and equivalents.

2. Take the ratio of risky assets to total assets (where total assets is the sum of risky and risk-free assets) of an individual and consider that as a measure of the individual's risk tolerance. (There can potentially be several ways of computing risk tolerance; in this case study, we use this intuitive measure.) From the SCF, we have the data on risky and risk-free assets for the individuals in 2007 and 2009. We use this data and normalize the risky assets with the price of a stock index (S&P 500) in 2007 versus 2009 to get the risk tolerance.

3. Identify the "intelligent" investors. Some literature describes an intelligent investor as one who does not change their risk tolerance during changes in the market. So we consider the investors who changed their risk tolerance by less than 10% between 2007 and 2009 to be the intelligent investors. Of course, this is a qualitative judgment, and there can be several other ways of defining an intelligent investor. However, as mentioned before, beyond coming up with a precise definition of true risk tolerance, the purpose of this case study is to demonstrate the usage of machine learning and provide a machine learning–based framework in portfolio management that can be further leveraged for more detailed analysis.

Let us compute the predicted variable.
First, we get the risky and risk-free assets and compute the risk tolerance for 2007 and 2009 in the following code snippet:

# Compute the risky assets and risk-free assets for 2007
dataset['RiskFree07']= dataset['LIQ07'] + dataset['CDS07'] + dataset['SAVBND07']\
    + dataset['CASHLI07']
dataset['Risky07'] = dataset['NMMF07'] + dataset['STOCKS07'] + dataset['BOND07']

# Compute the risky assets and risk-free assets for 2009
dataset['RiskFree09']= dataset['LIQ09'] + dataset['CDS09'] + dataset['SAVBND09']\
    + dataset['CASHLI09']
dataset['Risky09'] = dataset['NMMF09'] + dataset['STOCKS09'] + dataset['BOND09']

# Compute the risk tolerance for 2007
dataset['RT07'] = dataset['Risky07']/(dataset['Risky07']+dataset['RiskFree07'])

# Average stock index for normalizing the risky assets in 2009
Average_SP500_2007=1478
Average_SP500_2009=948

# Compute the risk tolerance for 2009
dataset['RT09'] = dataset['Risky09']/(dataset['Risky09']+dataset['RiskFree09'])*\
    (Average_SP500_2009/Average_SP500_2007)

Let us look at the details of the data:

dataset.head()

Output

The data above displays some of the columns out of the 521 columns of the dataset.

Let us compute the percentage change in risk tolerance between 2007 and 2009:

dataset['PercentageChange'] = np.abs(dataset['RT09']/dataset['RT07']-1)

Next, we drop the rows containing "NA" or "NaN":

# Drop the rows containing NA
dataset=dataset.dropna(axis=0)
dataset=dataset[~dataset.isin([np.nan, np.inf, -np.inf]).any(1)]

Let us investigate the risk tolerance behavior of individuals in 2007 versus 2009. First we look at the risk tolerance in 2007:

sns.distplot(dataset['RT07'], hist=True, kde=False, bins=int(180/5),
    color='blue', hist_kws={'edgecolor':'black'})

Output

Looking at the risk tolerance in 2007, we see that a significant number of individuals had a risk tolerance close to one, meaning investments were skewed more toward the risky assets. Now let us look at the risk tolerance in 2009:

sns.distplot(dataset['RT09'], hist=True, kde=False, bins=int(180/5),
    color='blue', hist_kws={'edgecolor':'black'})

Output

Clearly, the behavior of the individuals reversed after the crisis. Overall risk tolerance decreased, which is shown by the outsized proportion of households having a risk tolerance close to zero in 2009. Most of the investments of these individuals were in risk-free assets.

In the next step, we pick the intelligent investors whose change in risk tolerance between 2007 and 2009 was less than 10%, as described in "3.1. Preparing the predicted variable" on page 128:

dataset3 = dataset[dataset['PercentageChange']<=.1]

We assign the true risk tolerance as the average risk tolerance of these intelligent investors between 2007 and 2009:

dataset3['TrueRiskTolerance'] = (dataset3['RT07'] + dataset3['RT09'])/2

This is the predicted variable for this case study. Let us drop the other labels that might not be needed for the prediction:

dataset3.drop(labels=['RT07', 'RT09'], axis=1, inplace=True)
dataset3.drop(labels=['PercentageChange'], axis=1, inplace=True)

3.2. Feature selection—limit the feature space. In this section, we will explore ways to condense the feature space.

3.2.1. Feature elimination. To filter the features further, we check the descriptions in the data dictionary and keep only the features that are relevant.
Looking at the entire data, we have more than 500 features in the dataset. However, academic literature and industry practice indicate that risk tolerance is heavily influenced by investor demographic, financial, and behavioral attributes, such as age, current income, net worth, and willingness to take risk. All these attributes were available in the dataset and are summarized in the following section. These attributes are used as features to predict investors' risk tolerance.

In the dataset, each of the columns contains a numeric value corresponding to the value of the attribute. The details are as follows:

AGE
    There are six age categories, where 1 represents age less than 35 and 6 represents age more than 75.

EDUC
    There are four education categories, where 1 represents no high school and 4 represents a college degree.

MARRIED
    There are two categories to represent marital status, where 1 represents married and 2 represents unmarried.

OCCU
    This represents the occupation category. A value of 1 represents managerial status and 4 represents unemployed.

KIDS
    Number of children.

WSAVED
    This represents the individual's spending versus income, split into three categories. For example, 1 represents spending exceeded income.

NWCAT
    This represents the net worth category. There are five categories, where 1 represents net worth less than the 25th percentile and 5 represents net worth more than the 90th percentile.

INCCL
    This represents the income category. There are five categories, where 1 represents income less than $10,000 and 5 represents income more than $100,000.

RISK
    This represents the willingness to take risk on a scale of 1 to 4, where 1 represents the highest level of willingness to take risk.

We keep only the intuitive features as of 2007 and remove all the intermediate features and features related to 2009, as the variables of 2007 are the only ones required for predicting the risk tolerance:

keep_list2 = ['AGE07','EDCL07','MARRIED07','KIDS07','OCCAT107','INCOME07',\
    'RISK07','NETWORTH07','TrueRiskTolerance']

drop_list2 = [col for col in dataset3.columns if col not in keep_list2]

dataset3.drop(labels=drop_list2, axis=1, inplace=True)

Now let us look at the correlation among the features:

# correlation
correlation = dataset3.corr()
plt.figure(figsize=(15,15))
plt.title('Correlation Matrix')
sns.heatmap(correlation, vmax=1, square=True, annot=True, cmap='cubehelix')

Output

Looking at the correlation chart (a full-size version is available on GitHub), net worth and income are positively correlated with risk tolerance. With a greater number of kids and marriage, risk tolerance decreases. As the willingness to take risk decreases, the risk tolerance decreases. With age, there is a positive relationship with risk tolerance. As per Hui Wang and Sherman Hanna's paper "Does Risk Tolerance Decrease with Age?," risk tolerance increases as people age (i.e., the proportion of net wealth invested in risky assets increases as people age) when other variables are held constant.

So in summary, the relationship of these variables with risk tolerance seems intuitive.
4. Evaluate models
4.1. Train-test split. Let us split the data into training and test sets:

Y = dataset3["TrueRiskTolerance"]
X = dataset3.loc[:, dataset3.columns != 'TrueRiskTolerance']

validation_size = 0.2
seed = 3
X_train, X_validation, Y_train, Y_validation = \
    train_test_split(X, Y, test_size=validation_size, random_state=seed)

4.2. Test options and evaluation metrics. We use R2 as the evaluation metric and select 10 as the number of folds for cross validation. (We could have chosen RMSE as the evaluation metric; however, R2 was chosen given that we already used RMSE as the evaluation metric in the previous case studies.)

num_folds = 10
scoring = 'r2'

4.3. Compare models and algorithms. Next, we select the suite of regression models and perform the k-fold cross validation.

Regression Models

# spot-check the algorithms
models = []
models.append(('LR', LinearRegression()))
models.append(('LASSO', Lasso()))
models.append(('EN', ElasticNet()))
models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor()))
models.append(('SVR', SVR()))

#Ensemble Models
# Boosting methods
models.append(('ABR', AdaBoostRegressor()))
models.append(('GBR', GradientBoostingRegressor()))
# Bagging methods
models.append(('RFR', RandomForestRegressor()))
models.append(('ETR', ExtraTreesRegressor()))

The Python code for the k-fold analysis step is similar to that of the previous case studies. Readers can also refer to the Jupyter notebook of this case study in the code repository for more details. Let us look at the performance of the models in the training set.

The nonlinear models perform better than the linear models, which means that there is a nonlinear relationship between the risk tolerance and the variables used to predict it. Given that random forest regression is one of the best methods, we use it for the further grid search.

5. Model tuning and grid search
As discussed in Chapter 4, random forest has many hyperparameters that can be tweaked while performing the grid search. However, we will confine our grid search to the number of estimators (n_estimators), as it is one of the most important hyperparameters. It represents the number of trees in the random forest model. Ideally, this should be increased until no further improvement is seen in the model:

# 8. Grid search : RandomForestRegressor
'''
n_estimators : integer, optional (default=10)
    The number of trees in the forest.
'''
param_grid = {'n_estimators': [50,100,150,200,250,300,350,400]}
model = RandomForestRegressor()
kfold = KFold(n_splits=num_folds, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, \
    cv=kfold)
grid_result = grid.fit(X_train, Y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']

Output

Best: 0.738632 using {'n_estimators': 250}

Random forest with the number of estimators as 250 is the best model after the grid search.
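For completeness, the collected cross-validation statistics can be printed with the same loop used in the earlier case studies (a small sketch, assuming the means, stds, and params collected above):

for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))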
6. Finalize the model
Let us look at the results on the test dataset and check the feature importance.

6.1. Results on the test dataset. We prepare the random forest model with the number of estimators as 250:

model = RandomForestRegressor(n_estimators = 250)
model.fit(X_train, Y_train)

Let us look at the performance in the training set:

from sklearn.metrics import r2_score
predictions_train = model.predict(X_train)
print(r2_score(Y_train, predictions_train))

Output

0.9640632406817223

The R2 of the training set is 96%, which is a good result. Now let us look at the performance in the test set:

predictions = model.predict(X_validation)
print(mean_squared_error(Y_validation, predictions))
print(r2_score(Y_validation, predictions))

Output

0.007781840953471237
0.7614494526639909

From the mean squared error and the R2 of 76% shown above for the test set, the random forest model does an excellent job of fitting the risk tolerance.

6.2. Feature importance and feature intuition. Let us look into the feature importance of the variables within the random forest model:

import pandas as pd
import numpy as np

model = RandomForestRegressor(n_estimators= 200,n_jobs=-1)
model.fit(X_train,Y_train)

#use inbuilt class feature_importances of tree based classifiers
#plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

Output

In the chart, the x-axis represents the magnitude of the importance of a feature. Hence, income and net worth, followed by age and willingness to take risk, are the key variables in determining risk tolerance.

6.3. Save model for later use. In this step we save the model for later use. The saved model can be used directly for prediction given the set of input variables. The model is saved as finalized_model.sav using the dump function of the pickle package, and it can be loaded back using the load function. Let's save the model as the first step:

# Save Model Using Pickle
from pickle import dump
from pickle import load

# save the model to disk
filename = 'finalized_model.sav'
dump(model, open(filename, 'wb'))

Now let's load the saved model and use it for prediction:

# load the model from disk
loaded_model = load(open(filename, 'rb'))
# estimate accuracy on validation set
predictions = loaded_model.predict(X_validation)
result = mean_squared_error(Y_validation, predictions)
print(r2_score(Y_validation, predictions))
print(result)

Output

0.7683894847939692
0.007555447734714956

7. Additional step: robo-advisor dashboard
We mentioned the robo-advisor dashboard at the beginning of this case study. The robo-advisor dashboard performs an automation of the portfolio management process and aims to overcome the problem of traditional risk tolerance profiling.

Python Code for Robo-Advisor Dashboard
This robo-advisor dashboard is built in Python using the plotly dash package. Dash is a productive Python framework for building web applications with good user interfaces. The code for the robo-advisor dashboard is added to the code repository for this book. The code is in a Jupyter notebook called "Sample Robo-advisor." A detailed description of the code is outside the scope of this case study. However, the codebase can be leveraged for the creation of any new machine learning–enabled dashboard.

The dashboard has two panels:

• Inputs for investor characteristics
• Asset allocation and portfolio performance

7.1. Input for investor characteristics.
Figure 5-7 shows the input panel for the investor characteristics. This panel takes all the input regarding the investor's demographic, financial, and behavioral attributes. These inputs feed the predictor variables used in the risk tolerance model created in the preceding steps. The interface is designed to accept the categorical and continuous variables in the correct format.

Once the inputs are submitted, we leverage the model saved in "6.3. Save model for later use" on page 137. This model takes all the inputs and produces the risk tolerance of an investor (refer to the predict_riskTolerance function of the "Sample Robo-advisor" Jupyter notebook in the code repository for this book for more details). The risk tolerance prediction model is embedded in this dashboard and is triggered once the "Calculate Risk Tolerance" button is pressed after submitting the inputs.

Figure 5-7. Robo-advisor input panel

7.2. Asset allocation and portfolio performance. Figure 5-8 shows the "Asset Allocation and Portfolio Performance" panel, which performs the following functionalities:

• Once the risk tolerance is computed using the model, it is displayed at the top of this panel.
• In the next step, we pick the assets for our portfolio from the dropdown.
• Once the list of assets is submitted, the traditional mean-variance portfolio allocation model is used to allocate the portfolio among the selected assets. Risk tolerance is one of the key inputs for this process. (Refer to the get_asset_allocation function of the "Sample Robo-advisor" Jupyter notebook in the code repository for this book for more details.)
• The dashboard also shows the historical performance of the allocated portfolio for an initial investment of $100.

Figure 5-8. Robo-advisor asset allocation and portfolio performance panel

Although the dashboard is a basic version of the robo-advisor dashboard, it performs end-to-end asset allocation for an investor and provides the portfolio view and historical performance of the portfolio over a selected period. There are several potential enhancements to this prototype in terms of the interface and the underlying models used. The dashboard can be enhanced to include additional instruments and incorporate additional features, such as real-time portfolio monitoring, portfolio rebalancing, and investment advisory. In terms of the underlying models used for asset allocation, we have used the traditional mean-variance optimization method, but it can be further enhanced to use allocation algorithms based on machine learning techniques, such as eigen-portfolio, hierarchical risk parity, or reinforcement learning–based models, described in Chapters 7, 8, and 9, respectively. The risk tolerance model can be further enhanced by using additional features or by using the actual data of investors rather than data from the Survey of Consumer Finances.
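To give a flavor of how the saved model can be wired into such a dashboard, here is a small, hypothetical sketch of a Dash app. The component ids, layout, and default values are illustrative only (the book's actual code lives in the "Sample Robo-advisor" notebook), and the input order must match the order of the training columns:

import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output, State
import numpy as np
from pickle import load

app = dash.Dash(__name__)

# model saved in step 6.3
loaded_model = load(open('finalized_model.sav', 'rb'))

# investor attributes, assumed to be in the same order as the training columns
features = ['AGE07', 'EDCL07', 'MARRIED07', 'KIDS07',
            'OCCAT107', 'INCOME07', 'RISK07', 'NETWORTH07']

app.layout = html.Div(
    [dcc.Input(id=f, type='number', value=1) for f in features] +
    [html.Button('Calculate Risk Tolerance', id='calc-risk-button'),
     dcc.Input(id='risk-tolerance-text', type='text', value='')])

@app.callback(Output('risk-tolerance-text', 'value'),
              [Input('calc-risk-button', 'n_clicks')],
              [State(f, 'value') for f in features])
def predict_risk_tolerance(n_clicks, *feature_values):
    # fire only after the button is pressed
    if not n_clicks:
        return ''
    X_input = np.array(feature_values).reshape(1, -1)
    return str(round(float(loaded_model.predict(X_input)[0]), 2))

if __name__ == '__main__':
    app.run_server()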
Conclusion
In this case study, we introduced the regression-based algorithm applied to compute an investor's risk tolerance, followed by a demonstration of the model in a robo-advisor setup. We showed that machine learning models might be able to objectively analyze the behavior of different investors in a changing market and attribute these changes to the variables involved in determining risk appetite. With an increase in the volume of investors' data and the availability of rich machine learning infrastructure, such models might prove to be more useful than the existing manual processes.

We saw that there is a nonlinear relationship between the variables and the risk tolerance. We analyzed the feature importance and found that the results of the case study are quite intuitive. Income and net worth, followed by age and willingness to take risk, are the key variables in deciding risk tolerance. These variables have been considered key variables to model risk tolerance across the academic and industry literature.

Through the robo-advisor dashboard powered by machine learning, we demonstrated an effective combination of data science and machine learning implementation in wealth management. Robo-advisors and investment managers could leverage such models and platforms to enhance the portfolio management process with the help of machine learning.

Case Study 4: Yield Curve Prediction
A yield curve is a line that plots the yields (interest rates) of bonds having equal credit quality but differing maturity dates. This yield curve is used as a benchmark for other debt in the market, such as mortgage rates or bank lending rates. The most frequently reported yield curve compares 3-month, 2-year, 5-year, 10-year, and 30-year U.S. Treasury debt.

The yield curve is the centerpiece of a fixed income market. Fixed income markets are important sources of finance for governments, national and supranational institutions, banks, and private and public corporations. In addition, yield curves are very important to investors in pension funds and insurance companies.

The yield curve is a key representation of the state of the bond market. Investors watch the bond market closely, as it is a strong predictor of future economic activity and levels of inflation, which affect prices of goods, financial assets, and real estate. The slope of the yield curve is an important indicator of short-term interest rates and is followed closely by investors.

Hence, an accurate yield curve forecast is of critical importance in financial applications. Several statistical techniques and tools commonly used in econometrics and finance have been applied to model the yield curve.

In this case study we will use supervised learning–based models to predict the yield curve. This case study is inspired by the paper "Artificial Neural Networks in Fixed Income Markets for Yield Curve Forecasting" by Manuel Nunes et al. (2018).

In this case study, we will focus on:

• Simultaneous modeling (producing multiple outputs at the same time) of the interest rates.
• Comparison of neural network versus linear regression models.
• Modeling a time series in a supervised regression–based framework.
• Understanding the variable intuition and feature selection.

Overall, the case study is similar to the stock price prediction case study presented earlier in this chapter, with the following differences:

• We predict multiple outputs simultaneously, rather than a single output.
• The predicted variable in this case study is not the return variable.
• Given that we already covered time series models in case study 1, we focus on artificial neural networks for prediction in this case study.
Blueprint for Using Supervised Learning Models to Predict the Yield Curve

1. Problem definition
In the supervised regression framework used for this case study, three tenors (1M, 5Y, and 30Y) of the yield curve are the predicted variables. These tenors represent the short-term, medium-term, and long-term tenors of the yield curve.

We need to understand what affects the movement of the yield curve and hence incorporate as much information into our model as we can. As a high-level overview, other than the historical prices of the yield curve itself, we look at other correlated variables that can influence the yield curve. The independent, or predictor, variables we consider are:

• Previous values of the Treasury curve for different tenors. The tenors used are the 1-month, 3-month, 1-year, 2-year, 5-year, 7-year, 10-year, and 30-year yields.
• Percentage of the federal debt held by the public, foreign governments, and the Federal Reserve.
• Corporate spread on Baa-rated debt relative to the 10-year Treasury rate.

The federal debt and corporate spread are correlated variables and can potentially be useful in modeling the yield curve. The dataset used for this case study is extracted from Yahoo Finance and FRED. We will use the daily data of the last 10 years, from 2010 onward.

By the end of this case study, readers will be familiar with a general machine learning approach to yield curve modeling, from gathering and cleaning data to building and tuning different models.

2. Getting started—loading the data and Python packages
2.1. Loading the Python packages. The loading of the Python packages is similar to the other case studies in this chapter. Refer to the Jupyter notebook of this case study for more details.

2.2. Loading the data. The following steps demonstrate the loading of the data using Pandas's DataReader function:

# Get the data by webscraping using pandas datareader
tsy_tickers = ['DGS1MO', 'DGS3MO', 'DGS1', 'DGS2', 'DGS5', 'DGS7', 'DGS10',
               'DGS30',
               'TREAST',    # Treasury securities held by the Federal Reserve ($MM)
               'FYGFDPUN',  # Federal Debt Held by the Public ($MM)
               'FDHBFIN',   # Federal Debt Held by International Investors ($BN)
               'GFDEBTN',   # Federal Debt: Total Public Debt ($BN)
               'BAA10Y',    # Baa Corporate Bond Yield Relative to Yield on 10-Year
              ]
tsy_data = web.DataReader(tsy_tickers, 'fred').dropna(how='all').ffill()
tsy_data['FDHBFIN'] = tsy_data['FDHBFIN'] * 1000
tsy_data['GOV_PCT'] = tsy_data['TREAST'] / tsy_data['GFDEBTN']
tsy_data['HOM_PCT'] = tsy_data['FYGFDPUN'] / tsy_data['GFDEBTN']
tsy_data['FOR_PCT'] = tsy_data['FDHBFIN'] / tsy_data['GFDEBTN']

Next, we define our dependent (Y) and independent (X) variables. The predicted variables are the rates for the three tenors of the yield curve (i.e., 1M, 5Y, and 30Y), as mentioned before. The number of trading days in a week is assumed to be five, and we compute the lagged versions of the variables mentioned in the problem definition section as independent variables, using a five-trading-day lag.

The lagged five-day variables embed the time series component by using a time-delay approach, where the lagged variable is included as one of the independent variables. This step reframes the time series data into a supervised regression–based model framework.
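A minimal sketch of this construction is shown below. It assumes the tsy_data DataFrame built above; the exact column selection in the book's notebook may differ:

lag = 5  # five trading days

# predicted variables: the current 1M, 5Y, and 30Y rates
Y = tsy_data[['DGS1MO', 'DGS5', 'DGS30']]

# predictor variables: five-day lagged values of all the tenors and the
# debt and spread variables
X = tsy_data[['DGS1MO', 'DGS3MO', 'DGS1', 'DGS2', 'DGS5', 'DGS7', 'DGS10',
              'DGS30', 'GOV_PCT', 'HOM_PCT', 'FOR_PCT', 'BAA10Y']].shift(lag)
X.columns = [col + '_lag' for col in X.columns]

dataset = pd.concat([Y, X], axis=1).dropna()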
3. Exploratory data analysis
We will look at descriptive statistics and data visualization in this section.

3.1. Descriptive statistics. Let us look at the shape and the columns in the dataset:

dataset.shape

Output

(505, 15)

The data contains around 500 observations with 15 columns.

3.2. Data visualization. Let us first plot the predicted variables and see their behavior:

Y.plot(style=['-','--',':'])

Output

In the plot, we see that the deviation among the short-term, medium-term, and long-term rates was higher in 2010 and has been decreasing since then. There was a drop in the long-term and medium-term rates during 2011, and they have also been declining since then. The order of the rates has been in line with the tenors. However, for a few months in recent years, the 5Y rate has been lower than the 1M rate. In the time series of all the tenors, we can see that the mean varies with time, resulting in an upward trend. Thus these series are nonstationary time series.

In some cases, the linear regression for such nonstationary dependent variables might not be valid. However, we are using the lagged variables, which are also nonstationary, as independent variables. So we are effectively modeling a nonstationary time series against another nonstationary time series, which might still be valid.

Next, we look at the scatterplots (a correlation plot is skipped for this case study, as it has a similar interpretation to that of a scatterplot). We can visualize the relationship between all the variables in the regression using the scatter matrix shown below:

# Scatterplot Matrix
pyplot.figure(figsize=(15,15))
scatter_matrix(dataset,figsize=(15,16))
pyplot.show()

Output

Looking at the scatterplot (a full-size version is available on GitHub), we see a significant linear relationship of the predicted variables with their lags and the other tenors of the yield curve. There is also a linear relationship, with negative slope, between the 1M and 5Y rates versus the corporate spread and the changes in foreign government purchases. The 30Y rate shows a linear relationship with these variables, although the slope is negative. Overall, we see a lot of linear relationships, and we expect the linear models to perform well.

4. Data preparation and analysis
We performed most of the data preparation steps (i.e., getting the dependent and independent variables) in the preceding steps, so we'll skip this step.

5. Evaluate models
In this step we evaluate the models. The Python code for this step is similar to that in case study 1, and some of the repetitive code is skipped. Readers can also refer to the Jupyter notebook of this case study in the code repository for this book for more details.

5.1. Train-test split and evaluation metrics. We will use 80% of the dataset for modeling and 20% for testing. We will evaluate algorithms using the mean squared error metric. All the algorithms use default tuning parameters.

5.2. Compare models and algorithms. In this case study, the primary purpose is to compare the linear models with the artificial neural network in yield curve modeling. So we stick to the linear regression (LR), regularized regression (LASSO and EN), and artificial neural network (shown as MLP) models. We also include a few other models, such as KNN and CART, as these models are simpler with good interpretation, and if there is a nonlinear relationship between the variables, the CART and KNN models will be able to capture it and provide a good comparison benchmark for the ANN.
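For reference, a minimal sketch of the split and the model suite, mirroring case study 1 (it assumes the X and Y variables built in step 2.2):

validation_size = 0.2
train_size = int(len(X) * (1 - validation_size))
X_train, X_test = X[0:train_size], X[train_size:]
Y_train, Y_test = Y[0:train_size], Y[train_size:]

models = []
models.append(('LR', LinearRegression()))
models.append(('LASSO', Lasso()))
models.append(('EN', ElasticNet()))
models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor()))
models.append(('MLP', MLPRegressor()))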
Looking at the training and test errors, we see a good performance from the linear regression model. We see that lasso and elastic net perform poorly. These are regularized regression models, and they reduce the number of variables if those variables are not important. A decrease in the number of variables might have caused a loss of information, leading to the poor model performance. KNN and CART are good, but looking closely, we see that their test errors are higher than their training errors. We also see that the performance of the artificial neural network (MLP) algorithm is comparable to the linear regression model. Despite its simplicity, linear regression is a tough benchmark to beat for one-step-ahead forecasting when there is a significant linear relationship between the variables.

Output

6. Model tuning and grid search
Similar to case study 2 of this chapter, we perform a grid search of the ANN model with different combinations of hidden layers. Several other hyperparameters, such as learning rate, momentum, activation function, number of epochs, and batch size, can be tuned during the grid search process, similar to the steps mentioned below.

'''
hidden_layer_sizes : tuple, length = n_layers - 2, default (100,)
    The ith element represents the number of neurons in the ith hidden layer.
'''
param_grid={'hidden_layer_sizes': [(20,), (50,), (20,20), (20, 30, 20)]}
model = MLPRegressor()
kfold = KFold(n_splits=num_folds, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, \
    cv=kfold)
grid_result = grid.fit(X_train, Y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Output

Best: -0.018006 using {'hidden_layer_sizes': (20, 30, 20)}
-0.036433 (0.019326) with: {'hidden_layer_sizes': (20,)}
-0.020793 (0.007075) with: {'hidden_layer_sizes': (50,)}
-0.026638 (0.010154) with: {'hidden_layer_sizes': (20, 20)}
-0.018006 (0.005637) with: {'hidden_layer_sizes': (20, 30, 20)}

The best model is the one with three hidden layers, with 20, 30, and 20 nodes, respectively. Hence, we prepare a model with this configuration and check its performance on the test set. This is a crucial step, as a greater number of layers may lead to overfitting and poor performance on the test set.

Prediction comparison. In the last step, we look at the prediction plot of the actual data versus the predictions from both the linear regression and ANN models. Refer to the Jupyter notebook of this case study for the Python code of this section.
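A minimal sketch of that comparison is shown below, assuming the tuned MLPRegressor configuration found above and a fitted linear regression (the notebook's actual plotting code may differ):

# fit the two models on the training set
model_mlp = MLPRegressor(hidden_layer_sizes=(20, 30, 20))
model_mlp.fit(X_train, Y_train)
model_lr = LinearRegression()
model_lr.fit(X_train, Y_train)

pred_mlp = model_mlp.predict(X_test)
pred_lr = model_lr.predict(X_test)

# plot actual versus predicted rates for each tenor
for i, tenor in enumerate(['1M', '5Y', '30Y']):
    pyplot.figure(figsize=(10, 4))
    pyplot.plot(np.array(Y_test)[:, i], label='Actual ' + tenor)
    pyplot.plot(pred_lr[:, i], '--', label='Predicted ' + tenor + ' (LR)')
    pyplot.plot(pred_mlp[:, i], ':', label='Predicted ' + tenor + ' (ANN)')
    pyplot.legend()
    pyplot.show()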
Looking at the charts above, we see that the predictions of the linear regression and the ANN are comparable. For the 1M tenor, the fit of the ANN is slightly poorer than that of the regression. However, for the 5Y and 30Y tenors, the ANN performs as well as the regression model.

Conclusion
In this case study, we applied supervised regression to the prediction of several tenors of the yield curve. The linear regression model, despite its simplicity, is a tough benchmark to beat for such one-step-ahead forecasting, given the dominant characteristic of the last available value of the variable to predict. The ANN results in this case study are comparable to the linear regression models. An additional benefit of the ANN is that it is more flexible to changing market conditions. Also, the ANN models can be enhanced by performing a grid search on several other hyperparameters and by incorporating recurrent neural networks, such as an LSTM.

Overall, we built a machine learning–based model using an ANN with an encouraging outcome in the context of fixed income instruments. This allows us to perform predictions using historical data to generate results and analyze risk and profitability before risking any actual capital in the fixed income market.

Chapter Summary
In "Case Study 1: Stock Price Prediction" on page 95, we covered a machine learning and time series–based framework for stock price prediction. We demonstrated the significance of visualization and compared the time series models against the machine learning models. In "Case Study 2: Derivative Pricing" on page 114, we explored the use of machine learning for a traditional derivative pricing problem and demonstrated a high model performance. In "Case Study 3: Investor Risk Tolerance and Robo-Advisors" on page 125, we demonstrated how supervised learning models can be used to model the risk tolerance of investors, which can lead to the automation of the portfolio management process. "Case Study 4: Yield Curve Prediction" on page 141 was similar to the stock price prediction case study, providing another example of the comparison of linear and nonlinear models in the context of fixed income markets.

We saw that time series and linear supervised learning models worked well for the asset price prediction problems (i.e., case studies 1 and 4), where the predicted variable had a significant linear relationship with its lagged component. However, in derivative pricing and risk tolerance prediction, where there are nonlinear relationships, the ensemble and ANN models performed better. Readers who are interested in implementing a case study using supervised regression or time series models are encouraged to understand the nuances of the variable relationships and the model intuition before proceeding to model selection.

Overall, the concepts in Python, machine learning, time series, and finance presented in this chapter through the case studies can be used as a blueprint for any other supervised regression–based problem in finance.

Exercises
• Using the concepts and framework of machine learning and time series models specified in case study 1, develop a predictive model for another asset class—a currency pair (EUR/USD, for example) or bitcoin.
• In case study 1, add some technical indicators, such as trend or momentum, and check the enhancement in the model performance. Some ideas for the technical indicators can be borrowed from "Case Study 3: Bitcoin Trading Strategy" on page 179 in Chapter 6.
• Using the concepts in "Case Study 2: Derivative Pricing" on page 114, develop a machine learning–based model to price American options.
• Incorporate multivariate time series modeling using a variant of the ARIMA model, such as VARMAX, for rates prediction in the yield curve prediction case study, and compare the performance against the machine learning–based models.
• Enhance the robo-advisor dashboard presented in "Case Study 3: Investor Risk Tolerance and Robo-Advisors" on page 125 to incorporate instruments other than equities.

CHAPTER 6
Supervised Learning: Classification

Here are some of the key questions that financial analysts attempt to solve:

• Is a borrower going to repay their loan or default on it?
• Will the instrument price go up or down?
• Is this credit card transaction a fraud or not?

All of these problem statements, in which the goal is to predict categorical class labels, are inherently suitable for classification-based machine learning.

Classification-based algorithms have been used across many areas within finance that require predicting a qualitative response. These include fraud detection, default prediction, credit scoring, directional forecasting of asset price movement, and buy/sell recommendations. There are many other use cases of classification-based supervised learning in portfolio management and algorithmic trading.

In this chapter we cover three such classification-based case studies that span a diverse set of areas, including fraud detection, loan default probability, and formulating a trading strategy.

In "Case Study 1: Fraud Detection" on page 153, we use a classification-based algorithm to predict whether a transaction is fraudulent. The focus of this case study is also on dealing with an unbalanced dataset, given that the fraud dataset is highly unbalanced, with a small number of fraudulent observations.

In "Case Study 2: Loan Default Probability" on page 166, we use a classification-based algorithm to predict whether a loan will default. The case study focuses on various techniques and concepts of data processing, feature selection, and exploratory analysis.

In "Case Study 3: Bitcoin Trading Strategy" on page 179, we use classification-based algorithms to predict whether the current trading signal of bitcoin is buy or sell, depending on the relationship between the short-term and long-term price. We predict the trend of bitcoin's price using technical indicators. The prediction model can easily be transformed into a trading bot that can perform buy, sell, or hold actions without human intervention.

In addition to focusing on different problem statements in finance, these case studies will help you understand:

• How to develop new features, such as technical indicators for an investment strategy, using feature engineering, and how to improve model performance.
• How to use data preparation and data transformation, and how to perform feature reduction and use feature importance.
• How to use data visualization and exploratory data analysis for feature reduction and to improve model performance.
• How to use algorithm tuning and grid search across various classification-based models to improve model performance.
• How to handle unbalanced data.
• How to use the appropriate evaluation metrics for classification.

This Chapter's Code Repository
A Python-based master template for the supervised classification model, along with the Jupyter notebooks for the case studies presented in this chapter, is included in the folder Chapter 6 - Sup. Learning - Classification models in the code repository for this book.
All of the case studies presented in this chapter use the standardized seven-step model development process presented in Chapter 2. (There may be some reordering or renaming of the steps or substeps based on the appropriateness and intuitiveness of the steps/substeps.) For any new classification-based problem, the master template from the code repository can be modified with the elements specific to the problem. The templates are designed to run on cloud infrastructure (e.g., Kaggle, Google Colab, or AWS). In order to run the template on a local machine, all the packages used within the template must be installed successfully.

Case Study 1: Fraud Detection
Fraud is one of the most significant issues the finance sector faces. It is incredibly costly. According to one study, it is estimated that the typical organization loses 5% of its annual revenue to fraud each year. When applied to the 2017 estimated Gross World Product of $79.6 trillion, this translates to potential global losses of up to $4 trillion.

Fraud detection is a task inherently suitable for machine learning, as machine learning–based models can scan through huge transactional datasets, detect unusual activity, and identify all cases that might be prone to fraud. Also, the computations of these models are faster compared to traditional rule-based approaches. By collecting data from various sources and then mapping it to trigger points, machine learning solutions are able to discover the rate of defaulting or fraud propensity for each potential customer and transaction, providing key alerts and insights for financial institutions.

In this case study, we will use various classification-based models to detect whether a transaction is a normal payment or a fraud.

The focuses of this case study are:

• Handling unbalanced data by downsampling/upsampling the data.
• Selecting the right evaluation metric, given that one of the main goals is to reduce false negatives (cases in which fraudulent transactions incorrectly go unnoticed).

Blueprint for Using Classification Models to Determine Whether a Transaction Is Fraudulent

1. Problem definition
In the classification framework defined for this case study, the response (or target) variable has the column name "Class." This column has a value of 1 in the case of fraud and a value of 0 otherwise.

The dataset used is obtained from Kaggle. This dataset holds transactions by European cardholders that occurred over two days in September 2013, with 492 cases of fraud out of 284,807 transactions.
The details of most of these packages and functions have been provided in Chapter 2 and Chapter 4: Packages for data loading, data analysis, and data preparation import numpy as np import pandas as pd import seaborn as sns from matplotlib import pyplot from pandas import read_csv, set_option from pandas.plotting import scatter_matrix from sklearn.preprocessing import StandardScaler Packages for model evaluation and classification models from sklearn.model_selection import train_test_split, KFold,\ cross_val_score, GridSearchCV from sklearn.linear_model import LogisticRegression from sklearn.tree import DecisionTreeClassifier from sklearn.neighbors import KNeighborsClassifier from sklearn.discriminant_analysis import LinearDiscriminantAnalysis from sklearn.naive_bayes import GaussianNB from sklearn.svm import SVC from sklearn.neural_network import MLPClassifier from sklearn.pipeline import Pipeline from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier from sklearn.metrics import classification_report, confusion_matrix,\ accuracy_score Packages for deep learning models from keras.models import Sequential from keras.layers import Dense from keras.wrappers.scikit_learn import KerasClassifier 154 | Chapter 6: Supervised Learning: Classification Packages for saving the model from pickle import dump from pickle import load 3. Exploratory data analysis The following sections walk through some high-level data inspection. 3.1. Descriptive statistics. The first thing we must do is gather a basic sense of our data. Remember, except for the transaction and amount, we do not know the names of other columns. The only thing we know is that the values of those columns have been scaled. Let’s look at the shape and columns of the data: # shape dataset.shape Output (284807, 31) #peek at data set_option('display.width', 100) dataset.head(5) Output 5 rows × 31 columns As shown, the variable names are nondescript (V1, V2, etc.). Also, the data type for the entire dataset is float, except Class, which is of type integer. How many are fraud and how many are not fraud? Let us check: class_names = {0:'Not Fraud', 1:'Fraud'} print(dataset.Class.value_counts().rename(index = class_names)) Output Not Fraud 284315 Fraud 492 Name: Class, dtype: int64 Notice the stark imbalance of the data labels. Most of the transactions are nonfraud. If we use this dataset as the base for our modeling, most models will not place enough emphasis on the fraud signals; the nonfraud data points will drown out any weight Case Study 1: Fraud Detection | 155 the fraud signals provide. As is, we may encounter difficulties modeling the predic‐ tion of fraud, with this imbalance leading the models to simply assume all transac‐ tions are nonfraud. This would be an unacceptable result. We will explore some ways of dealing with this issue in the subsequent sections. 3.2. Data visualization. Since the feature descriptions are not provided, visualizing the data will not lead to much insight. This step will be skipped in this case study. 4. Data preparation This data is from Kaggle and is already in a cleaned format without any empty rows or columns. Data cleaning or categorization is unnecessary. 5. Evaluate models Now we are ready to split the data and evaluate the models. 5.1. Train-test split and evaluation metrics. As described in Chapter 2, it is a good idea to partition the original dataset into training and test sets. 
The test set is a sample of the data that we hold back from our analysis and modeling. We use it at the end of our project to confirm the accuracy of our final model. It is the final test that gives us confidence in our estimates of accuracy on unseen data. We will use 80% of the dataset for model training and 20% for testing:

    Y = dataset["Class"]
    X = dataset.loc[:, dataset.columns != 'Class']
    validation_size = 0.2
    seed = 7
    X_train, X_validation, Y_train, Y_validation =\
        train_test_split(X, Y, test_size=validation_size, random_state=seed)

5.2. Checking models. In this step, we will evaluate different machine learning models. To optimize the various hyperparameters of the models, we use ten-fold cross validation and recalculate the results ten times to account for the inherent randomness in some of the models and in the CV process. All of these models, including cross validation, are described in Chapter 4.

Let us design our test harness. We will evaluate algorithms using the accuracy metric. This is a gross metric that will give us a quick idea of how correct a given model is. It is useful on binary classification problems.

    # test options for classification
    num_folds = 10
    scoring = 'accuracy'

Let's create a baseline of performance for this problem and spot-check a number of different algorithms. The selected algorithms include:

Linear algorithms
    Logistic regression (LR) and linear discriminant analysis (LDA).

Nonlinear algorithms
    Classification and regression trees (CART) and K-nearest neighbors (KNN).

There are good reasons for selecting these models. They are simpler, faster models with good interpretability for problems with large datasets. CART and KNN will be able to discern any nonlinear relationships between the variables. The key problem here is the unbalanced sample. Unless we resolve that, more complex models, such as ensembles and ANNs, will have poor predictive performance. We will focus on addressing this later in the case study and will then evaluate the performance of these types of models.

    # spot-check basic classification algorithms
    models = []
    models.append(('LR', LogisticRegression()))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))

All the algorithms use default tuning parameters. We will display the mean and standard deviation of accuracy for each algorithm as we calculate and collect the results for use later.

    results = []
    names = []
    for name, model in models:
        # shuffle=True is required for KFold to accept a random_state
        kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
        cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, \
            scoring=scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)

Output

    LR: 0.998942 (0.000229)
    LDA: 0.999364 (0.000136)
    KNN: 0.998310 (0.000290)
    CART: 0.999175 (0.000193)

    # compare algorithms
    fig = pyplot.figure()
    fig.suptitle('Algorithm Comparison')
    ax = fig.add_subplot(111)
    pyplot.boxplot(results)
    ax.set_xticklabels(names)
    fig.set_size_inches(8, 4)
    pyplot.show()

[Figure: box plot comparing the cross-validation accuracy of LR, LDA, KNN, and CART]

The accuracy of the overall result is quite high. But let us check how well it predicts the fraud cases.
Choosing one of the models, CART, from the results above, let us look at its results on the test set:

    # prepare model
    model = DecisionTreeClassifier()
    model.fit(X_train, Y_train)

    # estimate accuracy on validation set
    predictions = model.predict(X_validation)
    print(accuracy_score(Y_validation, predictions))
    print(classification_report(Y_validation, predictions))

Output

    0.9992275552122467

                  precision    recall  f1-score   support

               0       1.00      1.00      1.00     56862
               1       0.77      0.79      0.78       100

        accuracy                           1.00     56962
       macro avg       0.89      0.89      0.89     56962
    weighted avg       1.00      1.00      1.00     56962

And producing the confusion matrix yields:

    df_cm = pd.DataFrame(confusion_matrix(Y_validation, predictions), \
        columns=np.unique(Y_validation), index=np.unique(Y_validation))
    df_cm.index.name = 'Actual'
    df_cm.columns.name = 'Predicted'
    sns.heatmap(df_cm, cmap="Blues", annot=True, annot_kws={"size": 16})

[Figure: confusion matrix heatmap for the CART model on the test set]

Overall accuracy is strong, but the confusion matrix tells a different story. Despite the high accuracy level, 21 out of 100 instances of fraud are missed and incorrectly predicted as nonfraud. The false negative rate is substantial.

The intention of a fraud detection model is to minimize these false negatives. To do so, the first step is to choose the right evaluation metric.

In Chapter 4, we covered the evaluation metrics, such as accuracy, precision, and recall, for a classification-based problem. Accuracy is the number of correct predictions made as a ratio of all predictions made. Precision is the number of items correctly identified as positive out of the total items identified as positive by the model. Recall is the number of items correctly identified as positive out of the total actual positives.

For this type of problem, we should focus on recall, the ratio of true positives to the sum of true positives and false negatives. If false negatives are high, then the value of recall will be low.

In the next step, we perform model tuning, select the model using the recall metric, and perform under-sampling.

6. Model tuning

The purpose of the model tuning step is to perform a grid search on the model selected in the previous step. However, since we encountered poor model performance in the previous section due to the unbalanced dataset, we will focus our attention on that. We will analyze the impact of choosing the correct evaluation metric and see the impact of using an adjusted, balanced sample.
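To make the recall metric concrete, it can be computed directly from the CART test-set predictions above (a minimal sketch reusing the variables already defined; recall_score is part of sklearn.metrics):

    from sklearn.metrics import recall_score

    # recall = TP / (TP + FN); from the confusion matrix above this is
    # 79 / (79 + 21) = 0.79 for the fraud class
    print(recall_score(Y_validation, predictions))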
6.1. Model tuning by choosing the correct evaluation metric. As mentioned in the preceding step, if false negatives are high, then the value of recall will be low. Models are ranked according to this metric:

    scoring = 'recall'

Let us spot-check some basic classification algorithms for recall:

    models = []
    models.append(('LR', LogisticRegression()))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))

Running cross validation:

    results = []
    names = []
    for name, model in models:
        kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
        cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, \
            scoring=scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)

Output

    LR: 0.595470 (0.089743)
    LDA: 0.758283 (0.045450)
    KNN: 0.023882 (0.019671)
    CART: 0.735192 (0.073650)

We see that the LDA model has the best recall of the four models. We continue by evaluating the test set using the trained LDA:

    # prepare model
    model = LinearDiscriminantAnalysis()
    model.fit(X_train, Y_train)

    # estimate accuracy on validation set
    predictions = model.predict(X_validation)
    print(accuracy_score(Y_validation, predictions))

Output

    0.9995435553526912

[Figure: confusion matrix heatmap for the LDA model on the test set]

LDA performs better, missing only 18 out of 100 cases of fraud. Additionally, we see fewer false positives as well. However, there is still improvement to be made.

6.2. Model tuning—balancing the sample by random under-sampling. The current data exhibits a significant class imbalance, with very few data points labeled "fraud." Such class imbalance can result in a serious bias toward the majority class, reducing the classification performance and increasing the number of false negatives.

One of the remedies for such situations is to under-sample the data. A simple technique is to under-sample the majority class randomly and uniformly. This might lead to a loss of information, but it may yield strong results by modeling the minority class well.

Next, we will implement random under-sampling, which consists of removing data to arrive at a more balanced dataset. This helps prevent our models from simply defaulting to predictions of the majority class. The steps to implement random under-sampling are:

1. First, we determine the severity of the class imbalance by using value_counts() on the class column. We determine how many instances are considered fraud transactions (fraud = 1).

2. We bring the nonfraud transaction observation count down to the same amount as fraud transactions. Assuming we want a 50/50 ratio, this will be equivalent to 492 cases of fraud and 492 cases of nonfraud transactions.

3. We now have a subsample of our dataframe with a 50/50 ratio with regard to our classes. We train the models on this subsample. Then we perform this iteration again to shuffle the nonfraud observations in the training sample. We keep track of the model performance to see whether our models can maintain a certain accuracy every time we repeat this process:

    df = pd.concat([X_train, Y_train], axis=1)
    # amount of fraud classes: 492 rows
    fraud_df = df.loc[df['Class'] == 1]
    non_fraud_df = df.loc[df['Class'] == 0][:492]
    normal_distributed_df = pd.concat([fraud_df, non_fraud_df])

    # shuffle dataframe rows
    df_new = normal_distributed_df.sample(frac=1, random_state=42)

    # split out the new, balanced training set
    Y_train_new = df_new["Class"]
    X_train_new = df_new.loc[:, df_new.columns != 'Class']

Let us look at the distribution of the classes in the dataset:

    print('Distribution of the Classes in the subsample dataset')
    print(df_new['Class'].value_counts()/len(df_new))
    sns.countplot('Class', data=df_new)
    pyplot.title('Equally Distributed Classes', fontsize=14)
    pyplot.show()

Output

    Distribution of the Classes in the subsample dataset
    1    0.5
    0    0.5
    Name: Class, dtype: float64

[Figure: bar chart showing the equally distributed classes in the subsample]

The data is now balanced, with close to 1,000 observations. We will train all the models again, including an ANN. Now that the data is balanced, we will focus on accuracy as our main evaluation metric, since it considers both false positives and false negatives.
Recall can always be produced if needed:

    # setting the evaluation metric
    scoring = 'accuracy'

    # spot-check the algorithms
    models = []
    models.append(('LR', LogisticRegression()))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC()))
    # Neural Network
    models.append(('NN', MLPClassifier()))
    # Ensemble Models
    # Boosting methods
    models.append(('AB', AdaBoostClassifier()))
    models.append(('GBM', GradientBoostingClassifier()))
    # Bagging methods
    models.append(('RF', RandomForestClassifier()))
    models.append(('ET', ExtraTreesClassifier()))

The steps to define and compile an ANN-based deep learning model in Keras, along with all the terms (neurons, activation, momentum, etc.) mentioned in the following code, have been described in Chapter 3. This code can be leveraged for any deep learning–based classification model.

Keras-based deep learning model:

    # Function to create model, required for KerasClassifier
    def create_model(neurons=12, activation='relu', learn_rate=0.01, momentum=0):
        # create model
        model = Sequential()
        model.add(Dense(X_train.shape[1], input_dim=X_train.shape[1], \
            activation=activation))
        model.add(Dense(32, activation=activation))
        model.add(Dense(1, activation='sigmoid'))
        # compile model with the SGD optimizer defined above
        optimizer = SGD(lr=learn_rate, momentum=momentum)
        model.compile(loss='binary_crossentropy', optimizer=optimizer, \
            metrics=['accuracy'])
        return model

    models.append(('DNN', KerasClassifier(build_fn=create_model,\
        epochs=50, batch_size=10, verbose=0)))

Running the cross validation on the new set of models results in the following:

[Figure: box plot comparing the cross-validation accuracy of all models on the balanced sample]

Although a couple of models, including random forest (RF) and logistic regression (LR), perform well, GBM slightly edges out the other models. We select this for further analysis. Note that the result of the deep learning model using Keras (i.e., "DNN") is poor.

A grid search is performed for the GBM model by varying the number of estimators and the maximum depth. The details of the GBM model and the parameters to tune for this model are described in Chapter 4.

    # Grid Search: GradientBoosting Tuning
    n_estimators = [20, 180, 1000]
    max_depth = [2, 3, 5]
    param_grid = dict(n_estimators=n_estimators, max_depth=max_depth)
    model = GradientBoostingClassifier()
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, \
        cv=kfold)
    grid_result = grid.fit(X_train_new, Y_train_new)
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Output

    Best: 0.936992 using {'max_depth': 5, 'n_estimators': 1000}

In the next step, the final model is prepared, and the result on the test set is checked:

    # prepare model
    model = GradientBoostingClassifier(max_depth=5, n_estimators=1000)
    model.fit(X_train_new, Y_train_new)

    # estimate accuracy on the original validation set
    predictions = model.predict(X_validation)
    print(accuracy_score(Y_validation, predictions))

Output

    0.9668199852533268

The accuracy of the model is high. Let's look at the confusion matrix:

[Figure: confusion matrix heatmap for the tuned GBM model on the test set]

The results on the test set are impressive, with a high accuracy and, importantly, no false negatives. However, we see that an outcome of using our under-sampled data is a propensity for false positives—cases in which nonfraud transactions are misclassified as fraudulent.
This is a trade-off the financial institution would have to consider. There is an inherent balance between the operational overhead (and possible customer experience impact) of processing false positives and the financial loss resulting from missing fraud cases through false negatives.

Conclusion

In this case study, we performed fraud detection on credit card transactions. We illustrated how different classification machine learning models stack up against each other and demonstrated that choosing the right metric can make an important difference in model evaluation. Under-sampling was shown to lead to a significant improvement, as all fraud cases in the test set were correctly identified after applying under-sampling. This came with a trade-off, though: the reduction in false negatives came with an increase in false positives.

Overall, by using different machine learning models, choosing the right evaluation metrics, and handling unbalanced data, we demonstrated how the implementation of a simple classification-based model can produce robust results for fraud detection.

Case Study 2: Loan Default Probability

Lending is one of the most important activities of the finance industry. Lenders provide loans to borrowers in exchange for the promise of repayment with interest. That means the lender makes a profit only if the borrower pays off the loan. Hence, the two most critical questions in the lending industry are:

1. How risky is the borrower?
2. Given the borrower's risk, should we lend to them?

Default prediction could be described as a perfect job for machine learning, as the algorithms can be trained on millions of examples of consumer data. Algorithms can perform automated tasks such as matching data records, identifying exceptions, and calculating whether an applicant qualifies for a loan. The underlying trends can be assessed with algorithms and continuously analyzed to detect trends that might influence lending and underwriting risk in the future.

The goal of this case study is to build a machine learning model to predict the probability that a loan will default.

In most real-life cases, including loan default modeling, we are unable to work with clean, complete data. Some of the potential problems we are bound to encounter are missing values, incomplete categorical data, and irrelevant features. Although data cleaning may not be mentioned often, it is critical for the success of machine learning applications. The algorithms that we use can be powerful, but without relevant or appropriate data, the system may fail to yield ideal results. So one of the focus areas of this case study will be data preparation and cleaning. Various techniques and concepts of data processing, feature selection, and exploratory analysis are used for cleaning the data and organizing the feature space.

In this case study, we will focus on:

• Data preparation, data cleaning, and handling a large number of features.
• Data discretization and handling categorical data.
• Feature selection and data transformation.

Blueprint for Creating a Machine Learning Model for Predicting Loan Default Probability

1. Problem definition

In the classification framework for this case study, the predicted variable is charge-off, a debt that a creditor has given up trying to collect on after the borrower has missed payments for several months.
The predicted variable takes a value of 1 in case of charge-off and a value of 0 otherwise. The predicted variable is also used later for correlation-based feature reduction.

We will analyze data for loans from 2007 to 2017Q3 from Lending Club, available on Kaggle. Lending Club is a US peer-to-peer lending company. It operates an online lending platform that enables borrowers to obtain a loan and investors to purchase notes backed by payments made on these loans. The dataset contains more than 1.6 million observations with 150 variables containing complete loan data for all loans issued over the specified time period. The features include income, age, credit scores, home ownership, borrower location, collections, and many others. We will investigate these 150 predictor variables for feature selection.

By the end of this case study, readers will be familiar with a general approach to loan default modeling, from gathering and cleaning data to building and tuning a classifier.

2. Getting started—loading the data and Python packages

2.1. Loading the Python packages. The standard Python packages are loaded in this step. The details have been presented in the previous case studies. Please refer to the Jupyter notebook for this case study for more details.

2.2. Loading the data. The loan data for the time period from 2007 to 2017Q3 is loaded:

    # load dataset
    dataset = pd.read_csv('LoansData.csv.gz', compression='gzip', \
        low_memory=True)

3. Data preparation and feature selection

In the first step, let us look at the size of the data:

    dataset.shape

Output

    (1646801, 150)

Given that there are 150 features for each loan, we will first focus on limiting the feature space, followed by the exploratory analysis.

3.1. Preparing the predicted variable. Here, we look at the details of the predicted variable and prepare it. The predicted variable will be derived from the loan_status column. Let's check the value distributions:

    dataset['loan_status'].value_counts(dropna=False)

Output

    Current                                                788950
    Fully Paid                                             646902
    Charged Off                                            168084
    Late (31-120 days)                                      23763
    In Grace Period                                         10474
    Late (16-30 days)                                        5786
    Does not meet the credit policy. Status:Fully Paid       1988
    Does not meet the credit policy. Status:Charged Off       761
    Default                                                    70
    NaN                                                        23
    Name: loan_status, dtype: int64

From the data definition documentation:

Fully Paid
    Loans that have been fully repaid.

Default
    Loans that have not been current for 121 days or more.

Charged Off
    Loans for which there is no longer a reasonable expectation of further payments.

A large proportion of observations show a status of Current, and we do not know whether those will be Charged Off, Fully Paid, or Default in the future. The observations for Default are tiny in number compared to Fully Paid or Charged Off and are not considered. The remaining categories of loan status are not of prime importance for this analysis.
So, in order to convert this to a binary classification problem and to analyze in detail the effect of important variables on the loan status, we will consider only two major categories—Charged Off and Fully Paid:

    dataset = dataset.loc[dataset['loan_status'].isin(['Fully Paid', 'Charged Off'])]
    dataset['loan_status'].value_counts(normalize=True, dropna=False)

Output

    Fully Paid     0.793758
    Charged Off    0.206242
    Name: loan_status, dtype: float64

About 79% of the remaining loans have been fully paid and 21% have been charged off, so we have a somewhat unbalanced classification problem, but it is not as unbalanced as the fraud detection dataset we saw in the previous case study.

In the next step, we create a new binary column in the dataset, in which we categorize Fully Paid as 0 and Charged Off as 1. This column represents the predicted variable for this classification problem. A value of 1 in this column indicates that the borrower has defaulted:

    dataset['charged_off'] = (dataset['loan_status'] == 'Charged Off').apply(np.uint8)
    dataset.drop('loan_status', axis=1, inplace=True)

3.2. Feature selection—limit the feature space. The full dataset has 150 features for each loan, but not all features contribute to the prediction variable. Removing features of low importance can improve accuracy and reduce both model complexity and overfitting. Training time can also be reduced for very large datasets. We'll eliminate features in the following steps using three different approaches:

• Eliminating features that have more than 30% missing values.
• Eliminating features that are unintuitive based on subjective judgment.
• Eliminating features with low correlation with the predicted variable.

3.2.1. Feature elimination based on significant missing values. First, we calculate the percentage of missing data for each feature, then drop the features for which more than 30% of the values are missing:

    missing_fractions = dataset.isnull().mean().sort_values(ascending=False)
    # drop the features with a significant missing fraction
    drop_list = sorted(list(missing_fractions[missing_fractions > 0.3].index))
    dataset.drop(labels=drop_list, axis=1, inplace=True)
    dataset.shape

Output

    (814986, 92)

This dataset has 92 columns remaining once the columns with a significant number of missing values are dropped.

3.2.2. Feature elimination based on intuitiveness. To filter the features further, we check the descriptions in the data dictionary and keep the features that intuitively contribute to the prediction of default. We keep features that contain the relevant credit detail of the borrower, including annual income, FICO score, and debt-to-income ratio. We also keep those features that are available to investors when considering an investment in the loan. These include features in the loan application and any features added by Lending Club when the loan listing was accepted, such as loan grade and interest rate.
The list of the features retained is shown in the following code snippet:

    keep_list = ['charged_off', 'funded_amnt', 'addr_state', 'annual_inc', \
        'application_type', 'dti', 'earliest_cr_line', 'emp_length', \
        'emp_title', 'fico_range_high', \
        'fico_range_low', 'grade', 'home_ownership', 'id', 'initial_list_status', \
        'installment', 'int_rate', 'loan_amnt', 'loan_status', \
        'mort_acc', 'open_acc', 'pub_rec', 'pub_rec_bankruptcies', \
        'purpose', 'revol_bal', 'revol_util', \
        'sub_grade', 'term', 'title', 'total_acc', \
        'verification_status', 'zip_code', 'last_pymnt_amnt', \
        'num_actv_rev_tl', 'mo_sin_rcnt_rev_tl_op', \
        'mo_sin_old_rev_tl_op', "bc_util", "bc_open_to_buy", \
        "avg_cur_bal", "acc_open_past_24mths"]

    drop_list = [col for col in dataset.columns if col not in keep_list]
    dataset.drop(labels=drop_list, axis=1, inplace=True)
    dataset.shape

Output

    (814986, 39)

After removing the features in this step, 39 columns remain.

3.2.3. Feature elimination based on the correlation. The next step is to check the correlation with the predicted variable. Correlation gives us the interdependence between the predicted variable and the feature. We select features with a moderate-to-strong relationship with the target variable and drop those that have an absolute correlation of less than 3% with the predicted variable:

    correlation = dataset.corr()
    correlation_chargeOff = abs(correlation['charged_off'])
    drop_list_corr = sorted(list(correlation_chargeOff\
        [correlation_chargeOff < 0.03].index))
    print(drop_list_corr)

Output

    ['pub_rec', 'pub_rec_bankruptcies', 'revol_bal', 'total_acc']

The columns with low correlation are dropped from the dataset, and we are left with only 35 columns:

    dataset.drop(labels=drop_list_corr, axis=1, inplace=True)

4. Feature selection and exploratory analysis

In this step, we perform the exploratory data analysis along with further feature selection. Given that many features had to be eliminated, it is preferable to perform the exploratory data analysis after feature selection to better visualize the relevant features. We will also continue the feature elimination by visually screening the features and dropping those deemed irrelevant.

4.1. Feature analysis and exploration. In the following sections, we take a deeper dive into the dataset features.

4.1.1. Analyzing the categorical features. Let us look at some of the categorical features in the dataset. First, let's look at the id, emp_title, title, and zip_code features:

    dataset[['id', 'emp_title', 'title', 'zip_code']].describe()

Output

                  id  emp_title               title  zip_code
    count     814986     766415              807068    814986
    unique    814986     280473               60298       925
    top     14680062    Teacher  Debt consolidation     945xx
    freq           1      11351              371874      9517

IDs are all unique and irrelevant for modeling. There are too many unique values for employment titles and loan titles. Occupation and job title may provide some information for default modeling; however, we assume much of this information is embedded in the verified income of the customer. Moreover, additional cleaning steps on these features, such as standardizing or grouping the titles, would be necessary to extract any marginal information. This work is outside the scope of this case study but could be explored in subsequent iterations of the model.

Geography could play a role in credit determination, and zip codes provide a granular view of this dimension. Again, additional work would be necessary to prepare this feature for modeling and was deemed outside the scope of this case study. These four features are therefore dropped:
    dataset.drop(['id', 'emp_title', 'title', 'zip_code'], axis=1, inplace=True)

Let's look at the term feature. Term refers to the number of payments on the loan. Values are in months and can be either 36 or 60. Let's convert term to integers and group by the term for further analysis:

    dataset['term'] = dataset['term'].apply(lambda s: np.int8(s.split()[0]))
    dataset.groupby('term')['charged_off'].value_counts(normalize=True).loc[:, 1]

Output

    term
    36    0.165710
    60    0.333793
    Name: charged_off, dtype: float64

Loans with five-year terms are more than twice as likely to charge off as loans with three-year terms. This feature seems to be important for the prediction.

Let's look at the emp_length feature:

    dataset['emp_length'].replace(to_replace='10+ years', value='10 years', \
        inplace=True)
    dataset['emp_length'].replace('< 1 year', '0 years', inplace=True)

    def emp_length_to_int(s):
        if pd.isnull(s):
            return s
        else:
            return np.int8(s.split()[0])

    dataset['emp_length'] = dataset['emp_length'].apply(emp_length_to_int)
    charge_off_rates = dataset.groupby('emp_length')['charged_off'].value_counts\
        (normalize=True).loc[:, 1]
    sns.barplot(x=charge_off_rates.index, y=charge_off_rates.values)

[Figure: bar plot of charge-off rate by employment length]

Loan status does not appear to vary much with employment length (on average); hence this feature is dropped:

    dataset.drop(['emp_length'], axis=1, inplace=True)

Let's look at the sub_grade feature:

    charge_off_rates = dataset.groupby('sub_grade')['charged_off'].value_counts\
        (normalize=True).loc[:, 1]
    sns.barplot(x=charge_off_rates.index, y=charge_off_rates.values)

[Figure: bar plot of charge-off rate by loan sub-grade]

As shown in the chart, there is a clear trend of higher probability of charge-off as the sub-grade worsens, so it is considered to be a key feature.

4.1.2. Analyzing the continuous features. Let's look at the annual_inc feature:

    dataset[['annual_inc']].describe()

Output

             annual_inc
    count  8.149860e+05
    mean   7.523039e+04
    std    6.524373e+04
    min    0.000000e+00
    25%    4.500000e+04
    50%    6.500000e+04
    75%    9.000000e+04
    max    9.550000e+06

Annual income ranges from $0 to $9,550,000, with a median of $65,000. Because of the large range of incomes, we use a log transform of the annual income variable:

    dataset['log_annual_inc'] = dataset['annual_inc'].apply(lambda x: np.log10(x+1))
    dataset.drop('annual_inc', axis=1, inplace=True)

Let's look at the FICO score (fico_range_low, fico_range_high) features:

    dataset[['fico_range_low', 'fico_range_high']].corr()

Output

                     fico_range_low  fico_range_high
    fico_range_low              1.0              1.0
    fico_range_high             1.0              1.0

Given that the correlation between the FICO low and high ranges is 1, we keep only one feature, taken as the average of the two:

    dataset['fico_score'] = 0.5*dataset['fico_range_low'] +\
        0.5*dataset['fico_range_high']
    dataset.drop(['fico_range_high', 'fico_range_low'], axis=1, inplace=True)

4.2. Encoding categorical data. In order to use a feature in the classification models, we need to convert the categorical data (i.e., text features) to a numeric representation. This process is called encoding, and there are different ways to perform it. For this case study we will use a label encoder, which encodes labels with a value between 0 and n−1, where n is the number of distinct labels.
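As a quick illustration of how the encoder maps labels to integers (a hypothetical mini-example, separate from the case study data):

    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    # classes are sorted alphabetically: A -> 0, B -> 1, C -> 2
    print(le.fit_transform(['B', 'A', 'C', 'A']))  # [1 0 2 0]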
The LabelEncoder function from sklearn is used in the following step, and all the categorical columns are encoded at once:

    from sklearn.preprocessing import LabelEncoder

    # categorical boolean mask
    categorical_feature_mask = dataset.dtypes == object
    # filter categorical columns using mask and turn it into a list
    categorical_cols = dataset.columns[categorical_feature_mask].tolist()

Let us look at the categorical columns:

    categorical_cols

Output

    ['grade',
     'sub_grade',
     'home_ownership',
     'verification_status',
     'purpose',
     'addr_state',
     'initial_list_status',
     'application_type']

Each of these columns is then replaced by its integer encoding (the apply-based one-liner below is one common way to do this; it refits the encoder per column):

    le = LabelEncoder()
    dataset[categorical_cols] = dataset[categorical_cols]\
        .apply(lambda col: le.fit_transform(col))

4.3. Sampling data. Given that the loan data is skewed, it is sampled to have an equal number of charge-off and no-charge-off observations. Sampling leads to a more balanced dataset and helps prevent a bias toward the majority class. (Sampling is covered in detail in "Case Study 1: Fraud Detection" on page 153.)

    loanstatus_0 = dataset[dataset["charged_off"] == 0]
    loanstatus_1 = dataset[dataset["charged_off"] == 1]
    subset_of_loanstatus_0 = loanstatus_0.sample(n=5500)
    subset_of_loanstatus_1 = loanstatus_1.sample(n=5500)
    dataset = pd.concat([subset_of_loanstatus_1, subset_of_loanstatus_0])
    dataset = dataset.sample(frac=1).reset_index(drop=True)
    print("Current shape of dataset :", dataset.shape)

Although sampling has its advantages, there are disadvantages as well. Sampling may exclude data that is not homogeneous with the data that is kept, which affects the accuracy of the results. Selecting the proper sample size is also difficult. Hence, sampling should be performed with caution and should generally be avoided in the case of a relatively balanced dataset.

5. Evaluate algorithms and models

5.1. Train-test split. Splitting out the validation dataset for the model evaluation is the next step:

    Y = dataset["charged_off"]
    X = dataset.loc[:, dataset.columns != 'charged_off']
    validation_size = 0.2
    seed = 7
    X_train, X_validation, Y_train, Y_validation = \
        train_test_split(X, Y, test_size=validation_size, random_state=seed)

5.2. Test options and evaluation metrics. In this step, the test options and evaluation metrics are selected. The roc_auc evaluation metric is selected for this classification. The details of this metric were provided in Chapter 4. This metric represents a model's ability to discriminate between the positive and negative classes. An roc_auc of 1.0 represents a model that made all predictions perfectly, and a value of 0.5 represents a model that is as good as random.

    num_folds = 10
    scoring = 'roc_auc'

Also, the model cannot afford a high number of false negatives, as these have a negative impact on the investors and the credibility of the company, so recall, as used in the fraud detection case study, remains a useful metric to monitor.

5.3. Compare models and algorithms. Let us spot-check the classification algorithms.
We include the ANN and ensemble models in the list of models to be checked:

    models = []
    models.append(('LR', LogisticRegression()))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('NB', GaussianNB()))
    # Neural Network
    models.append(('NN', MLPClassifier()))
    # Ensemble Models
    # Boosting methods
    models.append(('AB', AdaBoostClassifier()))
    models.append(('GBM', GradientBoostingClassifier()))
    # Bagging methods
    models.append(('RF', RandomForestClassifier()))
    models.append(('ET', ExtraTreesClassifier()))

After performing the k-fold cross validation on the models shown above, the overall performance is as follows:

[Figure: box plot comparing the cross-validation roc_auc of all models]

The gradient boosting method (GBM) model performs best, and we select it for grid search in the next step. The details of GBM along with the model parameters are described in Chapter 4.

6. Model tuning and grid search

We tune the number-of-estimators and maximum-depth hyperparameters, which were discussed in Chapter 4:

    # Grid Search: GradientBoosting Tuning
    n_estimators = [20, 180]
    max_depth = [3, 5]
    param_grid = dict(n_estimators=n_estimators, max_depth=max_depth)
    model = GradientBoostingClassifier()
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, \
        cv=kfold)
    grid_result = grid.fit(X_train, Y_train)
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Output

    Best: 0.952950 using {'max_depth': 5, 'n_estimators': 180}

A GBM model with a max_depth of 5 and 180 estimators results in the best model.

7. Finalize the model

Now, we perform the final steps for selecting a model.

7.1. Results on the test dataset. Let us prepare the GBM model with the parameters found during the grid search step and check the results on the test dataset:

    model = GradientBoostingClassifier(max_depth=5, n_estimators=180)
    model.fit(X_train, Y_train)

    # estimate accuracy on validation set
    predictions = model.predict(X_validation)
    print(accuracy_score(Y_validation, predictions))

Output

    0.889090909090909

The accuracy of the model is a reasonable 89% on the test set. Let us examine the confusion matrix:

[Figure: confusion matrix heatmap for the tuned GBM model on the test set]

Looking at the confusion matrix and the overall result on the test set, both the false positive rate and the false negative rate are low; the overall model performance looks good and is in line with the training set results.

7.2. Variable intuition/feature importance. In this step, we compute and display the variable importance of our trained model:

    print(model.feature_importances_)  # use inbuilt class feature_importances
    feat_importances = pd.Series(model.feature_importances_, index=X.columns)
    # plot graph of feature importances for better visualization
    feat_importances.nlargest(10).plot(kind='barh')
    pyplot.show()

[Figure: bar chart of the ten most important features]

The results of the model importance are intuitive. The last payment amount seems to be the most important feature, followed by loan term and sub-grade.

Conclusion

In this case study, we introduced the classification-based tree algorithms applied to loan default prediction. We showed that data preparation is one of the most important steps.
We addressed this by performing feature elimination using different techniques, such as feature intuition, correlation analysis, visualization, and data quality checks of the features. We illustrated that there can be different ways of handling and analyzing categorical data and of converting categorical data into a format usable by the models. We emphasized that performing data processing and establishing an understanding of variable importance are key in the model development process. A focus on these steps led to the implementation of a simple classification-based model that produced robust results for default prediction.

Case Study 3: Bitcoin Trading Strategy

First released as open source in 2009 by the pseudonymous Satoshi Nakamoto, bitcoin is the longest-running and most well-known cryptocurrency.

A major drawback of cryptocurrency trading is the volatility of the market. Since cryptocurrency markets trade 24/7, tracking cryptocurrency positions against quickly changing market dynamics can rapidly become an impossible task to manage. This is where automated trading algorithms and trading bots can assist.

Various machine learning algorithms can be used to generate trading signals in an attempt to predict the market's movement. One could use machine learning algorithms to classify the next day's movement into three categories: market will rise (take a long position), market will fall (take a short position), or market will move sideways (take no position). Since we know the market direction, we can decide the optimum entry and exit points.

Machine learning has one key aspect called feature engineering. It means that we can create new, intuitive features and feed them to a machine learning algorithm in order to achieve better results. We can introduce different technical indicators as features to help predict future prices of an asset. These technical indicators are derived from market variables such as price or volume and have additional information or signals embedded in them. There are many different categories of technical indicators, including trend, volume, volatility, and momentum indicators.

In this case study, we will use various classification-based models to predict whether the current position signal is buy or sell. We will create additional trend and momentum indicators from market prices to leverage as additional features in the prediction.

In this case study, we will focus on:

• Building a trading strategy using classification (classification of long/short signals).
• Feature engineering and constructing technical indicators of trend, momentum, and mean reversion.
• Building a framework for backtesting the results of a trading strategy.
• Choosing the right evaluation metric to assess a trading strategy.

Blueprint for Using Classification-Based Models to Predict Whether to Buy or Sell in the Bitcoin Market

1. Problem definition

The problem of predicting a buy or sell signal for a trading strategy is defined in the classification framework, where the predicted variable has a value of 1 for buy and 0 for sell. This signal is decided through the comparison of the short-term and long-term price trends.

The data used is from Bitstamp, one of the largest bitcoin exchanges in terms of average daily volume. The data covers prices from January 2012 to May 2017. Different trend and momentum indicators are created from the data and are added as features to enhance the performance of the prediction model.
By the end of this case study, readers will be familiar with a general approach to building a trading strategy, from cleaning data and feature engineering to model tuning and developing a backtesting framework.

2. Getting started—loading the data and Python packages

Let's load the packages and the data.

2.1. Loading the Python packages. The standard Python packages are loaded in this step. The details have been presented in the previous case studies. Refer to the Jupyter notebook for this case study for more details.

2.2. Loading the data. The bitcoin data fetched from the Bitstamp website (https://www.bitstamp.com) is loaded in this step:

    # load dataset
    dataset = pd.read_csv('BitstampData.csv')

3. Exploratory data analysis

In this step, we will take a detailed look at this data.

3.1. Descriptive statistics. First, let us look at the shape of the data:

    dataset.shape

Output

    (2841377, 8)

    # peek at data
    set_option('display.width', 100)
    dataset.tail(2)

Output

              Timestamp     Open     High      Low    Close  Volume_(BTC)  Volume_(Currency)  Weighted_Price
    2841372  1496188560  2190.49  2190.49  2181.37  2181.37      1.700166        3723.784755     2190.247337
    2841373  1496188620  2190.50  2197.52  2186.17  2195.63      6.561029       14402.811961     2195.206304

The dataset has minute-by-minute data of OHLC (Open, High, Low, Close) prices and traded volume of bitcoin. The dataset is large, with approximately 2.8 million total observations.

4. Data preparation

In this part, we will clean the data to prepare it for modeling.

4.1. Data cleaning. We clean the data by filling the NaNs with the last available values:

    dataset[dataset.columns.values] = dataset[dataset.columns.values].ffill()

The Timestamp column is not useful for modeling and is dropped from the dataset:

    dataset = dataset.drop(columns=['Timestamp'])

4.2. Preparing the data for classification. As a first step, we will create the target variable for our model. This is the column that will indicate whether the trading signal is buy or sell. We define the short-term price as the 10-period rolling average and the long-term price as the 60-period rolling average. We attach a label of 1 (0) if the short-term price is higher (lower) than the long-term price:

    # create short simple moving average over the short window
    dataset['short_mavg'] = dataset['Close'].rolling(window=10, min_periods=1, \
        center=False).mean()

    # create long simple moving average over the long window
    dataset['long_mavg'] = dataset['Close'].rolling(window=60, min_periods=1, \
        center=False).mean()

    # create signals
    dataset['signal'] = np.where(dataset['short_mavg'] > dataset['long_mavg'], \
        1.0, 0.0)

4.3. Feature engineering. We begin feature engineering by analyzing the features we expect may influence the performance of the prediction model. Based on a conceptual understanding of the key factors that drive investment strategies, the task at hand is to identify and construct new features that may capture the risks or characteristics embodied by these return drivers. For this case study, we will explore the efficacy of specific momentum technical indicators.

The current bitcoin data consists of date, open, high, low, close, and volume. Using this data, we calculate the following momentum indicators:

Moving average
    A moving average provides an indication of a price trend by cutting down the amount of noise in the series.
Stochastic oscillator %K
    A stochastic oscillator is a momentum indicator that compares the closing price of a security to the range of its previous prices over a certain period of time. %K is the fast indicator, and %D, a moving average of %K, is the slow indicator. The fast indicator is more sensitive than the slow indicator to changes in the price of the underlying security and will likely result in many transaction signals.

Relative strength index (RSI)
    This is a momentum indicator that measures the magnitude of recent price changes to evaluate overbought or oversold conditions in the price of a stock or other asset. The RSI ranges from 0 to 100. An asset is deemed to be overbought once the RSI approaches 70, meaning that the asset may be getting overvalued and is a good candidate for a pullback. Likewise, if the RSI approaches 30, it is an indication that the asset may be getting oversold and is therefore likely to become undervalued.

Rate of change (ROC)
    This is a momentum oscillator that measures the percentage change between the current price and the price n periods in the past. Assets with higher ROC values are considered more likely to be overbought; those with lower ROC, more likely to be oversold.

Momentum (MOM)
    This is the rate of acceleration of a security's price or volume—that is, the speed at which the price is changing.

The following steps show how to generate some useful features for prediction. The features for trend and momentum can be leveraged for other trading strategy models:

    # calculation of exponential moving average
    def EMA(df, n):
        EMA = pd.Series(df['Close'].ewm(span=n, min_periods=n).mean(), name='EMA_' \
            + str(n))
        return EMA

    dataset['EMA10'] = EMA(dataset, 10)
    dataset['EMA30'] = EMA(dataset, 30)
    dataset['EMA200'] = EMA(dataset, 200)
    dataset.head()

    # calculation of rate of change
    def ROC(df, n):
        M = df.diff(n - 1)
        N = df.shift(n - 1)
        ROC = pd.Series(((M / N) * 100), name='ROC_' + str(n))
        return ROC

    dataset['ROC10'] = ROC(dataset['Close'], 10)
    dataset['ROC30'] = ROC(dataset['Close'], 30)

    # calculation of price momentum
    def MOM(df, n):
        MOM = pd.Series(df.diff(n), name='Momentum_' + str(n))
        return MOM

    dataset['MOM10'] = MOM(dataset['Close'], 10)
    dataset['MOM30'] = MOM(dataset['Close'], 30)

    # calculation of relative strength index
    def RSI(series, period):
        delta = series.diff().dropna()
        u = delta * 0
        d = u.copy()
        u[delta > 0] = delta[delta > 0]
        d[delta < 0] = -delta[delta < 0]
        u[u.index[period-1]] = np.mean(u[:period])  # first value is average of gains
        u = u.drop(u.index[:(period-1)])
        d[d.index[period-1]] = np.mean(d[:period])  # first value is average of losses
        d = d.drop(d.index[:(period-1)])
        rs = u.ewm(com=period-1, adjust=False).mean() / \
            d.ewm(com=period-1, adjust=False).mean()
        return 100 - 100 / (1 + rs)

    dataset['RSI10'] = RSI(dataset['Close'], 10)
    dataset['RSI30'] = RSI(dataset['Close'], 30)
    dataset['RSI200'] = RSI(dataset['Close'], 200)

    # calculation of stochastic oscillator
    def STOK(close, low, high, n):
        STOK = ((close - low.rolling(n).min()) / (high.rolling(n).max() - \
            low.rolling(n).min())) * 100
        return STOK

    def STOD(close, low, high, n):
        STOK = ((close - low.rolling(n).min()) / (high.rolling(n).max() - \
            low.rolling(n).min())) * 100
        STOD = STOK.rolling(3).mean()
        return STOD

    dataset['%K10'] = STOK(dataset['Close'], dataset['Low'], dataset['High'], 10)
    dataset['%D10'] = STOD(dataset['Close'], dataset['Low'], dataset['High'], 10)
    dataset['%K30'] = STOK(dataset['Close'], dataset['Low'], dataset['High'], 30)
    dataset['%D30'] = STOD(dataset['Close'], dataset['Low'], dataset['High'], 30)
    dataset['%K200'] = STOK(dataset['Close'], dataset['Low'], dataset['High'], 200)
    dataset['%D200'] = STOD(dataset['Close'], dataset['Low'], dataset['High'], 200)

    # calculation of simple moving average
    def MA(df, n):
        MA = pd.Series(df['Close'].rolling(n, min_periods=n).mean(), name='MA_' \
            + str(n))
        return MA

    # column names are chosen to match the rolling windows used
    dataset['MA10'] = MA(dataset, 10)
    dataset['MA30'] = MA(dataset, 30)
    dataset['MA200'] = MA(dataset, 200)

With our features completed, we'll prepare them for use.

4.4. Data visualization. In this step, we visualize different properties of the features and the predicted variable:

    dataset[['Weighted_Price']].plot(grid=True)
    plt.show()

[Figure: weighted bitcoin price over time]

The chart illustrates a sharp rise in the price of bitcoin, increasing from close to $0 to around $2,500 in 2017. High price volatility is also readily visible.

Let us look at the distribution of the predicted variable:

    fig = plt.figure()
    plot = dataset.groupby(['signal']).size().plot(kind='barh', color='red')
    plt.show()

[Figure: bar chart of the buy/sell signal distribution]

The predicted variable is 1 more than 52% of the time, meaning there are more buy signals than sell signals. The predicted variable is relatively balanced, especially as compared to the fraud dataset we saw in the first case study.

5. Evaluate algorithms and models

In this step, we will evaluate different algorithms.

5.1. Train-test split. We first split the dataset into training (80%) and test (20%) sets. For this case study we use 100,000 observations for a faster calculation. The next steps would be the same if we wanted to use the entire dataset:

    # split out validation dataset for the end
    subset_dataset = dataset.iloc[-100000:]
    Y = subset_dataset["signal"]
    X = subset_dataset.loc[:, dataset.columns != 'signal']
    validation_size = 0.2
    seed = 1
    X_train, X_validation, Y_train, Y_validation =\
        train_test_split(X, Y, test_size=validation_size, random_state=1)

5.2. Test options and evaluation metrics. Accuracy can be used as the evaluation metric, since there is not a significant class imbalance in the data:

    # test options for classification
    num_folds = 10
    scoring = 'accuracy'

5.3. Compare models and algorithms. In order to know which algorithm is best for our strategy, we evaluate the linear, nonlinear, and ensemble models.

5.3.1. Models.
Checking the classification algorithms:

    models = []
    models.append(('LR', LogisticRegression(n_jobs=-1)))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('NB', GaussianNB()))
    # Neural Network
    models.append(('NN', MLPClassifier()))
    # Ensemble Models
    # Boosting methods
    models.append(('AB', AdaBoostClassifier()))
    models.append(('GBM', GradientBoostingClassifier()))
    # Bagging methods
    models.append(('RF', RandomForestClassifier(n_jobs=-1)))

After performing the k-fold cross validation, the comparison of the models is as follows:

[Figure: box plot comparing the cross-validation accuracy of all models]

Although some of the models show promising results, we prefer an ensemble model given the huge size of the dataset, the large number of features, and an expected nonlinear relationship between the predicted variable and the features. Random forest has the best performance among the ensemble models.

6. Model tuning and grid search

A grid search is performed for the random forest model by varying the number of estimators and the maximum depth. The details of the random forest model and the parameters to tune are discussed in Chapter 4:

    n_estimators = [20, 80]
    max_depth = [5, 10]
    criterion = ["gini", "entropy"]
    param_grid = dict(n_estimators=n_estimators, max_depth=max_depth, \
        criterion=criterion)
    model = RandomForestClassifier(n_jobs=-1)
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    grid = GridSearchCV(estimator=model, param_grid=param_grid, \
        scoring=scoring, cv=kfold)
    grid_result = grid.fit(X_train, Y_train)
    print("Best: %f using %s" % (grid_result.best_score_,\
        grid_result.best_params_))

Output

    Best: 0.903438 using {'criterion': 'gini', 'max_depth': 10, 'n_estimators': 80}

7. Finalize the model

Let us finalize the model with the best parameters found during the tuning step and examine the variable intuition.

7.1. Results on the test dataset. In this step, we evaluate the selected model on the test set:

    # prepare model
    model = RandomForestClassifier(criterion='gini', n_estimators=80, max_depth=10)
    model.fit(X_train, Y_train)

    # estimate accuracy on validation set
    predictions = model.predict(X_validation)
    print(accuracy_score(Y_validation, predictions))

Output

    0.9075

The selected model performs quite well, with an accuracy of 90.75%. Let us look at the confusion matrix:

[Figure: confusion matrix heatmap for the tuned random forest model on the test set]

The overall model performance is reasonable and is in line with the training set results.

7.2. Variable intuition/feature importance. Let us look into the feature importance of the model:

    Importance = pd.DataFrame({'Importance': model.feature_importances_*100}, \
        index=X.columns)
    Importance.sort_values('Importance', axis=0, ascending=True).plot(kind='barh', \
        color='r')
    plt.xlabel('Variable Importance')

[Figure: bar chart of variable importance]

The result of the variable importance looks intuitive, and the momentum indicators RSI and MOM computed over the last 30 periods seem to be the two most important features. The feature importance chart corroborates the fact that introducing new features leads to an improvement in model performance.

7.3. Backtesting results. In this additional step, we perform a backtest on the model we've developed. We create a column of strategy returns by multiplying the per-period market returns by the position that was held at the close of the previous period, and we compare it against the actual returns.
Backtesting Trading Strategies

A backtesting approach similar to the one presented in this case study can be used to quickly backtest any trading strategy.

    backtestdata = pd.DataFrame(index=X_validation.index)
    backtestdata['signal_pred'] = predictions
    backtestdata['signal_actual'] = Y_validation
    backtestdata['Market Returns'] = X_validation['Close'].pct_change()
    backtestdata['Actual Returns'] = backtestdata['Market Returns'] *\
        backtestdata['signal_actual'].shift(1)
    backtestdata['Strategy Returns'] = backtestdata['Market Returns'] * \
        backtestdata['signal_pred'].shift(1)
    backtestdata = backtestdata.reset_index()
    backtestdata.head()
    backtestdata[['Strategy Returns', 'Actual Returns']].cumsum().hist()
    backtestdata[['Strategy Returns', 'Actual Returns']].cumsum().plot()

[Figure: histogram and time series plot of cumulative strategy returns versus actual returns]

Looking at the backtesting results, the strategy returns do not deviate significantly from the actual returns. The momentum-based trading strategy made us better at predicting the price direction to buy or sell in order to make profits. However, as our accuracy is not 100% (though above 90%), the strategy incurred relatively small losses compared to the actual returns.

Conclusion

This case study demonstrated that framing the problem is a key step when tackling a finance problem with machine learning. In doing so, it was determined that transforming the labels according to an investment objective and performing feature engineering were required for this trading strategy. We demonstrated the efficiency of using intuitive features related to the trend and momentum of the price movement. This helped increase the predictive power of the model. Finally, we introduced a backtesting framework, which allowed us to simulate a trading strategy using historical data. This enabled us to generate results and analyze risk and profitability before risking any actual capital.

Chapter Summary

In "Case Study 1: Fraud Detection" on page 153, we explored the issue of an unbalanced dataset and the importance of having the right evaluation metric. In "Case Study 2: Loan Default Probability" on page 166, various techniques and concepts of data processing, feature selection, and exploratory analysis were covered. In "Case Study 3: Bitcoin Trading Strategy" on page 179, we looked at ways to create technical indicators as features in order to use them for model enhancement. We also prepared a backtesting framework for a trading strategy.

Overall, the concepts in Python, machine learning, and finance presented in this chapter can be used as a blueprint for any other classification-based problem in finance.

Exercises

• Predict whether a stock price will go up or down using features related to the stock or macroeconomic variables (use the ideas from the bitcoin-based case study presented in this chapter).
• Create a model to detect money laundering using the features of a transaction. A sample dataset for this exercise can be obtained from Kaggle (https://oreil.ly/GcinN).
• Perform a credit rating analysis of corporations using features related to creditworthiness.

PART III
Unsupervised Learning

CHAPTER 7
Unsupervised Learning: Dimensionality Reduction

In previous chapters, we used supervised learning techniques to build machine learning models using data where the answer was already known (i.e., the class labels were available in our input data).
Now we will explore unsupervised learning, where we draw inferences from datasets consisting of input data for which the answer is unknown. Unsupervised learning algorithms attempt to infer patterns from the data without any knowledge of the output the data is meant to yield. Without requiring labeled data, which can be time-consuming and impractical to create or acquire, this family of models allows for easy use of larger datasets for analysis and model development.

Dimensionality reduction is a key technique within unsupervised learning. It compresses the data by finding a smaller, different set of variables that capture what matters most in the original features, while minimizing the loss of information. Dimensionality reduction helps mitigate problems associated with high dimensionality and permits the visualization of salient aspects of higher-dimensional data that is otherwise difficult to explore.

In the context of finance, where datasets are often large and contain many dimensions, dimensionality reduction techniques prove to be quite practical and useful. Dimensionality reduction enables us to reduce noise and redundancy in the dataset and find an approximate version of the dataset using fewer features. With fewer variables to consider, exploration and visualization of a dataset become more straightforward. Dimensionality reduction techniques also enhance supervised learning–based models by reducing the number of features or by finding new ones. Practitioners use these dimensionality reduction techniques to allocate funds across asset classes and individual investments, identify trading strategies and signals, implement portfolio hedging and risk management, and develop instrument pricing models.

In this chapter, we will discuss fundamental dimensionality reduction techniques and walk through three case studies in the areas of portfolio management, interest rate modeling, and trading strategy development. The case studies are designed to not only cover diverse topics from a finance standpoint but also highlight multiple machine learning and data science concepts. The standardized template containing the detailed implementation of the modeling steps in Python, along with the machine learning and finance concepts, can be used as a blueprint for any other dimensionality reduction–based problem in finance.

In "Case Study 1: Portfolio Management: Finding an Eigen Portfolio" on page 202, we use a dimensionality reduction algorithm to allocate capital into different asset classes to maximize risk-adjusted returns. We also introduce a backtesting framework to assess the performance of the portfolio we constructed.

In "Case Study 2: Yield Curve Construction and Interest Rate Modeling" on page 217, we use dimensionality reduction techniques to generate the typical movements of a yield curve. This illustrates how dimensionality reduction can reduce the dimension of market variables across a number of asset classes to promote faster and more effective portfolio management, trading, hedging, and risk management.

In "Case Study 3: Bitcoin Trading: Enhancing Speed and Accuracy" on page 227, we use dimensionality reduction techniques for algorithmic trading. This case study demonstrates data exploration in low dimension.

In addition to the points mentioned above, readers will understand the following points by the end of this chapter:

• Basic concepts of models and techniques used for dimensionality reduction and how to implement them in Python.
• Concepts of eigenvalues and eigenvectors of principal component analysis (PCA), selecting the right number of principal components, and extracting the factor weights of principal components.
• Usage of dimensionality reduction techniques such as singular value decomposition (SVD) and t-SNE to summarize high-dimensional data for effective data exploration and visualization.
• How to reconstruct the original data using the reduced principal components.
• How to enhance the speed and accuracy of supervised learning algorithms using dimensionality reduction.
• A backtesting framework for computing and analyzing portfolio performance metrics, such as the Sharpe ratio and the annualized return of the portfolio.

This Chapter's Code Repository

A Python-based master template for dimensionality reduction, along with the Jupyter notebooks for all the case studies in this chapter, is included in the folder Chapter 7 - Unsup. Learning - Dimensionality Reduction in the code repository for this book (https://oreil.ly/tI-KJ). To work through any machine learning problem in Python involving the dimensionality reduction models presented in this chapter (such as PCA, SVD, Kernel PCA, or t-SNE), readers need to modify the template slightly to align with their problem statement. All the case studies presented in this chapter use the standard Python master template with the standardized model development steps presented in Chapter 3. For the dimensionality reduction case studies, steps 6 (i.e., model tuning) and 7 (i.e., finalizing the model) are relatively lighter compared to the supervised learning models, so these steps have been merged with step 5. For situations in which steps are irrelevant, they have been skipped or combined with others to make the flow of the case study more intuitive.

Dimensionality Reduction Techniques

Dimensionality reduction represents the information in a given dataset more efficiently by using fewer features. These techniques project data onto a lower-dimensional space by either discarding variation in the data that is not informative or identifying a lower-dimensional subspace on or near which the data resides. There are many types of dimensionality reduction techniques. In this chapter, we will cover the most frequently used ones:

• Principal component analysis (PCA)
• Kernel principal component analysis (KPCA)
• t-distributed stochastic neighbor embedding (t-SNE)

After application of these dimensionality reduction techniques, the low-dimensional feature subspace can be a linear or nonlinear function of the corresponding high-dimensional feature space. Hence, on a broad level, these dimensionality reduction algorithms can be classified as linear or nonlinear. Linear algorithms, such as PCA, force the new variables to be linear combinations of the original features. Nonlinear algorithms, such as KPCA and t-SNE, can capture more complex structures in the data. However, given the infinite number of options, the algorithms still need to make assumptions to arrive at a solution.

Principal Component Analysis

The idea of principal component analysis (PCA) is to reduce the dimensionality of a dataset with a large number of variables, while retaining as much variance in the data as possible.
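Formally, if Σ denotes the covariance matrix of the (standardized) features, the first principal component is the unit-norm weight vector that maximizes the variance of the projected data, and each subsequent component solves the same problem subject to being orthogonal to the components already found. A compact way to state this standard objective:

$$ w_1 = \arg\max_{\lVert w \rVert = 1} w^{\top} \Sigma w, \qquad w_k = \arg\max_{\lVert w \rVert = 1,\; w \perp w_1, \ldots, w_{k-1}} w^{\top} \Sigma w $$

The maximizers are the eigenvectors of Σ, ordered by decreasing eigenvalue, which is where the decompositions discussed below come from.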
PCA allows us to understand whether there is a different representation of the data that can explain a majority of the original data points.

PCA finds a set of new variables that, through a linear combination, yield the original variables. The new variables are called principal components (PCs). These principal components are orthogonal (or independent) and can represent the original data. The number of components is a hyperparameter of the PCA algorithm that sets the target dimensionality.

The PCA algorithm works by projecting the original data onto the principal component space. It then identifies a sequence of principal components, each of which aligns with the direction of maximum variance in the data (after accounting for the variation captured by previously computed components). The sequential optimization also ensures that new components are not correlated with existing components. Thus the resulting set constitutes an orthogonal basis for a vector space.

The decline in the amount of variance of the original data explained by each principal component reflects the extent of correlation among the original features. The number of components that capture, for example, 95% of the original variation, relative to the total number of features, provides an insight into the linearly independent information of the original data.

In order to understand how PCA works, let's consider the distribution of data shown in Figure 7-1.

Figure 7-1. PCA-1

PCA finds a new coordinate system (the y' and x' axes) that is obtained from the original through translation and rotation. It moves the center of the coordinate system from the original point (0, 0) to the center of the distribution of data points. It then moves the x-axis into the principal axis of variation, which is the one with the most variation relative to the data points (i.e., the direction of maximum spread). Then it moves the other axis orthogonally to the principal one, into a less important direction of variation. Figure 7-2 shows an example of PCA in which two dimensions explain nearly all the variance of the underlying data.

Figure 7-2. PCA-2

These new directions that contain the maximum variance are called principal components and are orthogonal to each other by design.

There are two approaches to finding the principal components: Eigen decomposition and singular value decomposition (SVD).

Eigen decomposition

The steps of Eigen decomposition are as follows:

1. First, a covariance matrix is created for the features.
2. Once the covariance matrix is computed, the eigenvectors of the covariance matrix are calculated (eigenvectors and eigenvalues are concepts of linear algebra). These are the directions of maximum variance.
3. The eigenvalues are then computed. They define the magnitude of the principal components.

So, for n dimensions, there will be an n × n variance-covariance matrix, and as a result, we will have n eigenvectors of n values each and n eigenvalues.

Python's sklearn library offers a powerful implementation of PCA. The sklearn.decomposition.PCA function (https://oreil.ly/fDaLg) computes the desired number of principal components and projects the data into the component space.
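The three numbered steps above can be verified directly with numpy. The following minimal sketch uses randomly generated, correlated data (purely illustrative) and confirms that the leading eigenvector agrees with the first component reported by sklearn:

import numpy as np
from sklearn.decomposition import PCA

# Illustrative data: 500 observations of 4 correlated features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))

# Step 1: covariance matrix of the features
cov = np.cov(X, rowvar=False)

# Steps 2 and 3: eigenvectors (directions) and eigenvalues (magnitudes)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # sort by decreasing variance
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# The leading eigenvector matches sklearn's first principal component (up to sign)
pca = PCA(n_components=1).fit(X)
print(np.allclose(np.abs(eigenvectors[:, 0]), np.abs(pca.components_[0])))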
The following code snippet illustrates how to create two principal components from a dataset.

Implementation

# Import PCA algorithm
from sklearn.decomposition import PCA
# Initialize the algorithm and set the number of PCs
pca = PCA(n_components=2)
# Fit the model to data
pca.fit(data)
# Get the list of PCs
pca.components_
# Transform the data into the component space
pca.transform(data)
# Get the fraction of variance explained by each component
pca.explained_variance_ratio_

There are additional items, such as factor loadings, that can be obtained using the functions in the sklearn library. Their use will be demonstrated in the case studies.

Singular value decomposition

Singular value decomposition (SVD) is the factorization of a matrix into three matrices and is applicable to the more general case of m × n rectangular matrices. If A is an m × n matrix, then SVD can express the matrix as:

A = UΣVᵀ

where U is an (m × m) orthogonal matrix, Σ is an (m × n) nonnegative rectangular diagonal matrix, and V is an (n × n) orthogonal matrix. The SVD of a given matrix tells us exactly how we can decompose the matrix. Σ is a diagonal matrix whose diagonal values are called singular values. Their magnitude indicates how significant they are to preserving the information of the original data. V contains the principal components as column vectors.

As shown above, both Eigen decomposition and SVD tell us that using PCA is effectively looking at the initial data from a different angle. Both will always give the same answer; however, SVD can be much more efficient than Eigen decomposition, as it is able to handle sparse matrices (those which contain very few nonzero elements). In addition, SVD yields better numerical stability, especially when some of the features are strongly correlated.

Truncated SVD is a variant of SVD that computes only the largest singular values, where the number computed is a user-specified parameter. This method is different from regular SVD in that it produces a factorization where the number of columns is equal to the specified truncation. For example, given an n × n matrix, SVD will produce matrices with n columns, whereas truncated SVD will produce matrices with a specified number of columns that may be less than n.

Implementation

from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=20).fit(X)

In terms of the weaknesses of the PCA technique, although it is very effective in reducing the number of dimensions, the resulting principal components may be less interpretable than the original features. Additionally, the results may be sensitive to the selected number of principal components. For example, too few principal components may miss some information compared to the original list of features. Also, PCA may not work well if the data is strongly nonlinear.

Kernel Principal Component Analysis

A main limitation of PCA is that it only applies linear transformations. Kernel principal component analysis (KPCA) extends PCA to handle nonlinearity. It first maps the original data to some nonlinear feature space (usually one of higher dimension). Then it applies PCA to extract the principal components in that space.

A simple example of when KPCA is applicable is shown in Figure 7-3. Linear transformations are suitable for the blue and red data points on the left-hand plot. However, if all dots are arranged as per the graph on the right, the result is not linearly separable. We would then need to apply KPCA to separate the components.

Figure 7-3. Kernel PCA
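To make this concrete, the following sketch reproduces a setting like the right-hand plot of Figure 7-3 using scikit-learn's make_circles (an assumed stand-in, since the figure's data is not provided) and shows that plain PCA leaves the two classes overlapping:

import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA

# Two concentric rings of points: not linearly separable
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Plain PCA only rotates and rescales the data, so the rings stay entangled
X_pca = PCA(n_components=2).fit_transform(X)
print('Class 0 PC1 range:', X_pca[y == 0, 0].min().round(2), 'to', X_pca[y == 0, 0].max().round(2))
print('Class 1 PC1 range:', X_pca[y == 1, 0].min().round(2), 'to', X_pca[y == 1, 0].max().round(2))

The overlapping ranges confirm that no linear projection separates the two rings; the kernelized implementation shown next handles this case.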
Implementation

from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components=4, kernel='rbf').fit_transform(X)

In the Python code, we specify kernel='rbf', which is the radial basis function kernel (https://oreil.ly/zCo-X). This is commonly used as a kernel in machine learning techniques, such as in SVMs (see Chapter 4). Using KPCA, component separation becomes easier in a higher-dimensional space, as mapping into a higher-dimensional space often provides greater classification power.

t-distributed Stochastic Neighbor Embedding

t-distributed stochastic neighbor embedding (t-SNE) is a dimensionality reduction algorithm that reduces the dimensions by modeling the probability distribution of neighbors around each point. Here, the term neighbors refers to the set of points closest to a given point. The algorithm emphasizes keeping similar points together in low dimensions, as opposed to maintaining the distance between points that are far apart in high dimensions.

The algorithm starts by calculating the probability of similarity of data points in the corresponding high- and low-dimensional spaces. The similarity of points is calculated as the conditional probability that a point A would choose point B as its neighbor if neighbors were picked in proportion to their probability density under a normal distribution centered at A. The algorithm then tries to minimize the difference between these conditional probabilities (or similarities) in the high- and low-dimensional spaces for a perfect representation of the data points in the low-dimensional space.

Implementation

from sklearn.manifold import TSNE
X_tsne = TSNE().fit_transform(X)

An implementation of t-SNE is shown in the third case study presented in this chapter.

Case Study 1: Portfolio Management: Finding an Eigen Portfolio

A primary objective of portfolio management is to allocate capital into different asset classes to maximize risk-adjusted returns. Mean-variance portfolio optimization is the most commonly used technique for asset allocation. This method requires an estimated covariance matrix and the expected returns of the assets considered. However, the erratic nature of financial returns leads to estimation errors in these inputs, especially when the sample size of returns is insufficient compared to the number of assets being allocated. These errors greatly jeopardize the optimality of the resulting portfolios, leading to poor and unstable outcomes.

Dimensionality reduction is a technique we can use to address this issue. Using PCA, we can take an n × n covariance matrix of our assets and create a set of n linearly uncorrelated principal portfolios (sometimes referred to in the literature as eigen portfolios) made up of our assets and their corresponding variances. The principal components of the covariance matrix capture most of the covariation among the assets and are mutually uncorrelated. Moreover, we can use standardized principal components as the portfolio weights, with the statistical guarantee that the returns from these principal portfolios are linearly uncorrelated.

By the end of this case study, readers will be familiar with a general approach to finding an eigen portfolio for asset allocation, from understanding concepts of PCA to backtesting different principal components.
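Before walking through the case study, the following self-contained sketch illustrates the statistical property just mentioned: applying PCA to simulated returns (the data and dimensions are illustrative assumptions) yields principal portfolios whose return series are mutually uncorrelated.

import numpy as np
from sklearn.decomposition import PCA

# Simulated daily returns for 10 hypothetical assets with a shared driver
rng = np.random.default_rng(42)
common = rng.normal(0, 0.01, size=(1000, 1))   # shared "market" component
idio = rng.normal(0, 0.005, size=(1000, 10))   # idiosyncratic noise
returns = common + idio

# Each eigenvector of the covariance matrix defines one principal portfolio
pca = PCA().fit(returns)
prtf_returns = returns @ pca.components_.T

# Off-diagonal correlations are numerically zero: the portfolios are uncorrelated
corr = np.corrcoef(prtf_returns, rowvar=False)
off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
print('Max |off-diagonal correlation|:', np.abs(off_diag).max())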
This case study will focus on:

• Understanding eigenvalues and eigenvectors of PCA and deriving portfolio weights using the principal components.
• Developing a backtesting framework to evaluate portfolio performance.
• Understanding how to work through a dimensionality reduction modeling problem from end to end.

Blueprint for Using Dimensionality Reduction for Asset Allocation

1. Problem definition

Our goal in this case study is to maximize the risk-adjusted returns of an equity portfolio using PCA on a dataset of stock returns. The dataset used for this case study is the Dow Jones Industrial Average (DJIA) index and its respective 30 stocks. The return data used will be from the year 2000 onwards and can be downloaded from Yahoo Finance.

We will also compare the performance of our hypothetical portfolios against a benchmark and backtest the model to evaluate the effectiveness of the approach.

2. Getting started—loading the data and Python packages

2.1. Loading the Python packages. The list of libraries used for data loading, data analysis, data preparation, model evaluation, and model tuning is shown below. The details of most of these packages and functions can be found in Chapters 2 and 4.

Packages for dimensionality reduction

from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from numpy.linalg import inv, eig, svd
from sklearn.manifold import TSNE
from sklearn.decomposition import KernelPCA

Packages for data processing and visualization

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import read_csv, set_option
from pandas.plotting import scatter_matrix
import seaborn as sns
from sklearn.preprocessing import StandardScaler

2.2. Loading the data. We import the dataframe containing the adjusted closing prices for all the companies in the DJIA index:

# Load dataset
dataset = read_csv('Dow_adjcloses.csv', index_col=0)

3. Exploratory data analysis

Next, we inspect the dataset.

3.1. Descriptive statistics. Let's look at the shape of the data:

dataset.shape

Output

(4804, 30)

The data comprises 30 columns and 4,804 rows, containing the daily adjusted closing prices of the 30 stocks in the index since 2000.

3.2. Data visualization. The first thing we must do is gather a basic sense of our data. Let us take a look at the correlations among the price series:

correlation = dataset.corr()
plt.figure(figsize=(15, 15))
plt.title('Correlation Matrix')
sns.heatmap(correlation, vmax=1, square=True, annot=True, cmap='cubehelix')

Output

There is significant positive correlation among the daily series. The plot (full-size version available on GitHub at https://oreil.ly/yFwu-) also indicates that the information embedded in the data may be represented by fewer variables (i.e., something smaller than the 30 dimensions we have now). We will perform another detailed look at the data after implementing dimensionality reduction.

4. Data preparation

We prepare the data for modeling in the following sections.

4.1. Data cleaning. First, we check for NAs in the rows and either drop the affected columns or fill the gaps:

# Check for any null values
print('Null Values =', dataset.isnull().values.any())

Output

Null Values = True

Some stocks were added to the index after our start date.
To ensure a proper analysis, we will drop stocks with more than 30% missing values. Two stocks fit this criterion: Dow Chemicals and Visa.

missing_fractions = dataset.isnull().mean().sort_values(ascending=False)
missing_fractions.head(10)
drop_list = sorted(list(missing_fractions[missing_fractions > 0.3].index))
dataset.drop(labels=drop_list, axis=1, inplace=True)
dataset.shape

Output

(4804, 28)

We are left with price data for 28 companies. Now we fill the remaining NAs by carrying forward the last available value:

# Fill the missing values with the last value available in the dataset
dataset = dataset.fillna(method='ffill')

4.2. Data transformation. In addition to handling the missing values, we also want to standardize the dataset features onto a unit scale (mean = 0 and variance = 1). All the variables should be on the same scale before applying PCA; otherwise, a feature with large values will dominate the result. We first compute daily returns from the adjusted closes and then use StandardScaler in sklearn to standardize the dataset, as shown below:

from sklearn.preprocessing import StandardScaler

# Daily returns implied by the text: percentage change of the adjusted closes
datareturns = dataset.pct_change(1)
datareturns.dropna(how='any', inplace=True)

# Standardize the returns (mean = 0, variance = 1)
scaler = StandardScaler()
rescaledDataset = pd.DataFrame(scaler.fit_transform(datareturns),
                               columns=datareturns.columns,
                               index=datareturns.index)

Overall, cleaning and standardizing the data is important in order to create a meaningful and reliable dataset to be used in dimensionality reduction without error. Let us look at the returns of one of the stocks from the cleaned and standardized dataset:

# Visualize the standardized returns for one stock (AAPL)
plt.figure(figsize=(16, 5))
plt.title("AAPL Return")
rescaledDataset.AAPL.plot()
plt.grid(True)
plt.legend()
plt.show()

Output

5. Evaluate algorithms and models

5.1. Train-test split. The portfolio data is divided into training and test sets to perform the analysis regarding the best portfolio and to perform backtesting:

# Divide the dataset into training and testing sets
percentage = int(len(rescaledDataset) * 0.8)
X_train = rescaledDataset[:percentage]
X_test = rescaledDataset[percentage:]

stock_tickers = rescaledDataset.columns.values
n_tickers = len(stock_tickers)

5.2. Model evaluation: applying principal component analysis. As the next step, we apply PCA using the sklearn library, generating the principal components from the training data that will be used for further analysis:

pca = PCA()
PrincipalComponent = pca.fit(X_train)

5.2.1. Explained variance using PCA. In this step, we look at the variance explained using PCA. The decline in the amount of variance of the original data explained by each principal component reflects the extent of correlation among the original features. The first principal component captures the most variance in the original data, the second component is a representation of the second-highest variance, and so on. The eigenvectors with the lowest eigenvalues describe the least amount of variation within the dataset. Therefore, these values can be dropped.
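As an aside, sklearn can also select the number of components automatically: passing a fraction between 0 and 1 as n_components keeps just enough components to explain that fraction of the variance. A brief sketch, with 95% as an illustrative threshold:

# Keep the smallest number of components explaining at least 95% of the variance
pca95 = PCA(n_components=0.95, svd_solver='full')
pca95.fit(X_train)
print('Components needed for 95% of variance:', pca95.n_components_)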
The following charts show the number of principal components and the variance explained by each:

NumEigenvalues = 20
fig, axes = plt.subplots(ncols=2, figsize=(14, 4))
Series1 = pd.Series(pca.explained_variance_ratio_[:NumEigenvalues]).sort_values()
Series2 = pd.Series(pca.explained_variance_ratio_[:NumEigenvalues]).cumsum()
Series1.plot.barh(title='Explained Variance Ratio by Top Factors', ax=axes[0])
Series2.plot(ylim=(0, 1), ax=axes[1], title='Cumulative Explained Variance')

Output

We find that the most important factor explains around 40% of the daily return variation. This dominant principal component is usually interpreted as the "market" factor. We will discuss the interpretation of this and the other factors when looking at the portfolio weights. The plot on the right shows the cumulative explained variance and indicates that around ten factors explain 73% of the variance in the returns of the 28 stocks analyzed.

5.2.2. Looking at portfolio weights. In this step, we look more closely at the individual principal components. These may be less interpretable than the original features. However, we can look at the weights of the factors on each principal component to assess any intuitive themes relative to the 28 stocks. We construct five portfolios, defining the weights of each stock as each of the first five principal components. We then create a plot that visualizes, in descending order, the respective weight of every company for the currently chosen principal component:

def PCWeights():
    # Principal component (PC) weights for each of the 28 PCs
    weights = pd.DataFrame()
    for i in range(len(pca.components_)):
        weights["weights_{}".format(i)] = \
            pca.components_[i] / sum(pca.components_[i])
    weights = weights.values.T
    return weights

weights = PCWeights()
sum(pca.components_[0])

Output

-5.247808242068631

NumComponents = 5
topPortfolios = pd.DataFrame(pca.components_[:NumComponents], columns=dataset.columns)
eigen_portfolios = topPortfolios.div(topPortfolios.sum(1), axis=0)
eigen_portfolios.index = [f'Portfolio {i}' for i in range(NumComponents)]
np.sqrt(pca.explained_variance_)  # component volatilities (not used further here)
eigen_portfolios.T.plot.bar(subplots=True, layout=(int(NumComponents), 1),
                            figsize=(14, 10), legend=False, sharey=True, ylim=(-1, 1))

Output

Given that the scale of the plots is the same, we can also look at the heatmap, as follows:

# Plotting heatmap
sns.heatmap(topPortfolios)

Output

The heatmap and bar plots show the contribution of different stocks in each eigenvector.

Traditionally, the intuition behind each principal portfolio is that it represents some sort of independent risk factor. The manifestation of those risk factors depends on the assets in the portfolio. In our case study, the assets are all U.S. domestic equities. The principal portfolio with the largest variance is typically a systematic risk factor (i.e., the "market" factor). Looking at the first principal component (Portfolio 0), we see that the weights are distributed homogeneously across the stocks. This nearly equal-weighted portfolio explains 40% of the variance in the index and is a fair representation of a systematic risk factor.

The rest of the eigen portfolios typically correspond to sector or industry factors. For example, Portfolio 1 assigns a high weight to JNJ and MRK, which are stocks from the health care sector. Similarly, Portfolio 3 has high weights on technology and electronics companies, such as AAPL, MSFT, and IBM.
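One quick way to surface such themes is to list each eigen portfolio's largest weights directly. A small sketch, assuming the eigen_portfolios dataframe built above:

# For each eigen portfolio, list the five stocks with the largest absolute weights
for name, row in eigen_portfolios.iterrows():
    top5 = row.abs().sort_values(ascending=False).head(5)
    print(name, '->', ', '.join(top5.index))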
When the asset universe for our portfolio is expanded to include broad, global investments, we may identify factors for international equity risk, interest rate risk, commodity exposure, geographic risk, and many others. In the next step, we find the best eigen portfolio.

5.2.3. Finding the best eigen portfolio. To determine the best eigen portfolio, we use the Sharpe ratio. This is an assessment of risk-adjusted performance that relates the annualized return of a portfolio to its annualized volatility. A high Sharpe ratio indicates higher returns and/or lower volatility for the specified portfolio. The annualized Sharpe ratio is computed by dividing the annualized return by the annualized volatility. For the annualized return we apply the geometric average of all the returns with respect to the periods per year (days of operation of the exchange in a year). Annualized volatility is computed by taking the standard deviation of the returns and multiplying it by the square root of the periods per year.

The following code computes the Sharpe ratio of a portfolio:

# Sharpe ratio calculation
# Calculation based on the conventional number of trading days per year (i.e., 252)
def sharpe_ratio(ts_returns, periods_per_year=252):
    n_years = ts_returns.shape[0] / periods_per_year
    annualized_return = np.power(np.prod(1 + ts_returns), (1 / n_years)) - 1
    annualized_vol = ts_returns.std() * np.sqrt(periods_per_year)
    annualized_sharpe = annualized_return / annualized_vol
    return annualized_return, annualized_vol, annualized_sharpe
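As a quick sanity check of this helper, we can feed it simulated daily returns (the numbers below are purely illustrative):

import numpy as np
import pandas as pd

# Two years of simulated daily returns: ~0.05% mean, ~1% daily volatility
rng = np.random.default_rng(1)
simulated = pd.Series(rng.normal(0.0005, 0.01, size=504))

er, vol, sr = sharpe_ratio(simulated)
print(f'Return = {er:.2%}, Vol = {vol:.2%}, Sharpe = {sr:.2f}')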
We construct a loop to compute the principal component weights for each eigen portfolio. Then we use the Sharpe ratio function to look for the portfolio with the highest Sharpe ratio. Once we know which portfolio has the highest Sharpe ratio, we can visualize its performance against the index for comparison:

# Note: X_train_raw and X_test_raw are assumed to hold the unscaled return
# series, split at the same 80% point as the standardized data above.
def optimizedPortfolio():
    n_portfolios = len(pca.components_)
    annualized_ret = np.array([0.] * n_portfolios)
    sharpe_metric = np.array([0.] * n_portfolios)
    annualized_vol = np.array([0.] * n_portfolios)
    highest_sharpe = 0
    stock_tickers = rescaledDataset.columns.values
    n_tickers = len(stock_tickers)
    pcs = pca.components_

    for i in range(n_portfolios):
        pc_w = pcs[i] / sum(pcs[i])
        eigen_prtfi = pd.DataFrame(data={'weights': pc_w.squeeze() * 100},
                                   index=stock_tickers)
        eigen_prtfi.sort_values(by=['weights'], ascending=False, inplace=True)
        # Compute the portfolio returns, keeping columns and weights in the
        # same (unsorted) order so they stay aligned
        eigen_prti_returns = np.dot(X_train_raw.loc[:, stock_tickers], pc_w)
        eigen_prti_returns = pd.Series(eigen_prti_returns.squeeze(),
                                       index=X_train_raw.index)
        er, vol, sharpe = sharpe_ratio(eigen_prti_returns)
        annualized_ret[i] = er
        annualized_vol[i] = vol
        sharpe_metric[i] = sharpe

    sharpe_metric = np.nan_to_num(sharpe_metric)

    # Find the portfolio with the highest Sharpe ratio
    highest_sharpe = np.argmax(sharpe_metric)
    print('Eigen portfolio #%d with the highest Sharpe. Return %.2f%%, vol = %.2f%%, Sharpe = %.2f' %
          (highest_sharpe,
           annualized_ret[highest_sharpe] * 100,
           annualized_vol[highest_sharpe] * 100,
           sharpe_metric[highest_sharpe]))

    fig, ax = plt.subplots()
    fig.set_size_inches(12, 4)
    ax.plot(sharpe_metric, linewidth=3)
    ax.set_title('Sharpe ratio of eigen-portfolios')
    ax.set_ylabel('Sharpe ratio')
    ax.set_xlabel('Portfolios')

    results = pd.DataFrame(data={'Return': annualized_ret,
                                 'Vol': annualized_vol,
                                 'Sharpe': sharpe_metric})
    results.dropna(inplace=True)
    results.sort_values(by=['Sharpe'], ascending=False, inplace=True)
    print(results.head(5))

    plt.show()

optimizedPortfolio()

Output

Eigen portfolio #0 with the highest Sharpe. Return 11.47%, vol = 13.31%, Sharpe = 0.86

   Return    Vol  Sharpe
0   0.115  0.133   0.862
7   0.096  0.693   0.138
5   0.100  0.845   0.118
1   0.057  0.670   0.084

As shown by the results above, Portfolio 0 is the best-performing one, with the highest return and the lowest volatility. Let us look at the composition of this portfolio:

weights = PCWeights()
portfolio = pd.DataFrame()

def plotEigen(weights, plot=False, portfolio=portfolio):
    portfolio = pd.DataFrame(data={'weights': weights.squeeze() * 100},
                             index=stock_tickers)
    portfolio.sort_values(by=['weights'], ascending=False, inplace=True)
    if plot:
        portfolio.plot(title='Current Eigen-Portfolio Weights',
                       figsize=(12, 6),
                       xticks=range(0, len(stock_tickers), 1),
                       rot=45, linewidth=3)
        plt.show()
    return portfolio

# Weights are stored in arrays, where 0 is the first PC's weights
plotEigen(weights=weights[0], plot=True)

Output

Recall that this is the portfolio that explains 40% of the variance and represents the systematic risk factor. Looking at the portfolio weights (in percentages on the y-axis), they do not vary much and are in the range of 2.7% to 4.5% across all stocks. However, the weights seem to be higher in the financial sector, and stocks such as AXP, JPM, and GS have higher-than-average weights.

5.2.4. Backtesting the eigen portfolios. We will now try to backtest this algorithm on the test set. We will look at a few of the top performers and the worst performer. For the top performers, we look at the 3rd- and 4th-ranked eigen portfolios (Portfolios 5 and 1), while the worst performer reviewed was ranked 19th (Portfolio 14):

def Backtest(eigen):
    '''
    Plots principal component returns against real returns.
    '''
    eigen_prtfi = pd.DataFrame(data={'weights': eigen.squeeze()},
                               index=stock_tickers)
    eigen_prtfi.sort_values(by=['weights'], ascending=False, inplace=True)

    # As before, keep columns and weights aligned in the original order
    eigen_prti_returns = np.dot(X_test_raw.loc[:, stock_tickers], eigen)
    eigen_portfolio_returns = pd.Series(eigen_prti_returns.squeeze(),
                                        index=X_test_raw.index)
    returns, vol, sharpe = sharpe_ratio(eigen_portfolio_returns)
    print('Current Eigen-Portfolio:\nReturn = %.2f%%\nVolatility = %.2f%%\nSharpe = %.2f'
          % (returns * 100, vol * 100, sharpe))

    equal_weight_return = (X_test_raw * (1 / len(pca.components_))).sum(axis=1)
    df_plot = pd.DataFrame({'EigenPorfolio Return': eigen_portfolio_returns,
                            'Equal Weight Index': equal_weight_return},
                           index=X_test.index)
    np.cumprod(df_plot + 1).plot(title='Returns of the equal weighted index vs. first eigen-portfolio',
                                 figsize=(12, 6), linewidth=3)
    plt.show()

Backtest(eigen=weights[5])
Backtest(eigen=weights[1])
Backtest(eigen=weights[14])

Output

Current Eigen-Portfolio:
Return = 32.76%
Volatility = 68.64%
Sharpe = 0.48

Current Eigen-Portfolio:
Return = 99.80%
Volatility = 58.34%
Sharpe = 1.71

Current Eigen-Portfolio:
Return = -79.42%
Volatility = 185.30%
Sharpe = -0.43

As shown in the preceding charts, the eigen portfolio returns of the top portfolios outperform the equally weighted index. The eigen portfolio ranked 19th underperformed the market significantly in the test set. The outperformance and underperformance are attributed to the weights of the stocks or sectors in the eigen portfolio. We can drill down further to understand the individual drivers of each portfolio. For example, Portfolio 1 assigns high weights to several stocks in the health care sector, as discussed previously. This sector saw a significant increase from 2017 onwards, which is reflected in the chart for Eigen Portfolio 1.

Given that these eigen portfolios are independent, they also provide diversification opportunities. As such, we can invest across these uncorrelated eigen portfolios, providing other potential portfolio management benefits.

Conclusion

In this case study, we applied dimensionality reduction techniques in the context of portfolio management, using eigenvalues and eigenvectors from PCA to perform asset allocation.

We demonstrated that, while some interpretability is lost, the intuition behind the resulting portfolios can be matched to risk factors. In this example, the first eigen portfolio represented a systematic risk factor, while others exhibited sector or industry concentration. Through backtesting, we found that the portfolio with the best result on the training set also achieved the strongest performance on the test set. Several of the portfolios outperformed the index based on the Sharpe ratio, the risk-adjusted performance metric used in this exercise.

Overall, we found that using PCA and analyzing eigen portfolios can yield a robust methodology for asset allocation and portfolio management.

Case Study 2: Yield Curve Construction and Interest Rate Modeling

A number of problems in portfolio management, trading, and risk management require a deep understanding and modeling of yield curves.

A yield curve represents interest rates, or yields, across a range of maturities, usually depicted as a line graph, as discussed in "Case Study 4: Yield Curve Prediction" on page 141 in Chapter 5. The yield curve illustrates the "price of funds" at a given point in time and, due to the time value of money, often shows interest rates rising as a function of maturity.

Researchers in finance have studied the yield curve and found that shifts or changes in its shape are attributable to a few unobservable factors. Specifically, empirical studies reveal that more than 99% of the movement of various U.S. Treasury bond yields is captured by three factors, which are often referred to as level, slope, and curvature. The names describe how each influences the yield curve's shape in response to a shock. A level shock changes the interest rates of all maturities by almost identical amounts, inducing a parallel shift that changes the level of the entire curve up or down.
A shock to the slope factor changes the difference between short-term and long-term rates. For instance, when long-term rates increase by a larger amount than short-term rates, the curve becomes steeper (i.e., visually, the curve becomes more upward sloping). Changes in the short- and long-term rates can also produce a flatter yield curve. The main effects of a shock to the curvature factor focus on medium-term interest rates, leading to hump, twist, or U-shaped characteristics.

Dimensionality reduction breaks down the movement of the yield curve into these three factors. Reducing the yield curve to fewer components means we can focus on a few intuitive dimensions of the yield curve. Traders and risk managers use this technique to condense the curve into risk factors for hedging interest rate risk. Similarly, portfolio managers then have fewer dimensions to analyze when allocating funds. Interest rate structurers use this technique to model the yield curve and analyze its shape. Overall, it promotes faster and more effective portfolio management, trading, hedging, and risk management.

In this case study, we use PCA to generate typical movements of a yield curve and show that the first three principal components correspond to a yield curve's level, slope, and curvature, respectively.

This case study will focus on:

• Understanding the intuition behind eigenvectors.
• Using dimensions resulting from dimensionality reduction to reproduce the original data.

Blueprint for Using Dimensionality Reduction to Generate a Yield Curve

1. Problem definition

Our goal in this case study is to use dimensionality reduction techniques to generate the typical movements of a yield curve.

The data used for this case study is obtained from Quandl, a premier source for financial, economic, and alternative datasets. We use the data for 11 tenors (or maturities), from 1 month to 30 years, of the Treasury curve. These are of daily frequency and are available from 1960 onwards.

2. Getting started—loading the data and Python packages

2.1. Loading the Python packages. The loading of Python packages is similar to the