DataScience_Project_Template_ML_+

{
 "cells": [
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "<div class=\"alert\" style=\"background-color:#696969; color:white; padding:0px 10px; border-radius:5px;\"><h3 style='margin:15px 15px; font-size:12px'> All Rights Reserved. This notebook is proprietary content of machinelearningplus.com. This can be shared solely for educational purposes, with due credits to machinelearningplus.com</h3>\n",
 "</div>\n",
 "\n"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "<div class=\"alert\" style=\"background-color:#fff; color:white; padding:0px 10px; border-radius:5px;\"><h1 style='margin:15px 15px; color:#006a79; font-size:40px'> Machine Learning Project Template</h1>\n",
 "</div>\n",
 "\n",
 "Learning the concepts and theory alone is not enough to become proficient in machine learning. You need to implement the concepts and solve practical problems.\n",
 "\n",
 "In this notebook, you will learn how to implement a machine learning project end to end. You can use this template to kickstart your own project."
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### Content:\n",
 "1. Load the Data\n",
 "   - Import libraries\n",
 "   - Load the datasets\n",
 "\n",
 "2. Overview of the Data\n",
 "   - Descriptive Statistics\n",
 "   - Missing Values\n",
 "\n",
 "3. Exploratory Data Analysis\n",
 "   - Create list of columns by data type\n",
 "   - Check the distribution of target class\n",
 "   - Check the distribution of every feature\n",
 "   - Check how different numerical features are related to target class\n",
 "\n",
 "4. Data Preparation\n",
 "   - Data Cleaning\n",
 "   - Feature Encoding\n",
 "   - Split X & y\n",
 "   - Feature Scaling\n",
 "   - Train Test split\n",
 "\n",
 "5. Model Building\n",
 "   - Train Model\n",
 "   - Model Prediction\n",
 "   - Model Evaluation\n",
 "\n",
 "6. Improve Model\n",
 "   - Handle Class Imbalance\n",
 "   - Save the Final Model"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "<div class=\"alert alert-info\" style=\"background-color:#006a79; color:white; padding:0px 10px; border-radius:5px;\"><h2 style='margin:10px 5px'>1. Load the Data</h2>\n",
 "</div>\n",
 "\n",
 "In this section you will:\n",
 "\n",
 "- Import the libraries\n",
 "- Load the dataset"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### 1.1. Import Libraries\n",
 "\n",
 "Import all the libraries in the first cell so that every dependency is visible in one place."
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 28,
 "metadata": {},
 "outputs": [],
 "source": [
 "# Import libraries \n",
 "\n",
 "# Data Manipulation\n",
 "import numpy as np \n",
 "import pandas as pd\n",
 "from pandas import DataFrame\n",
 "\n",
 "# Data Visualization\n",
 "import seaborn as sns\n",
 "import matplotlib.pyplot as plt\n",
 "\n",
 "\n",
 "# Machine Learning\n",
 "from sklearn.preprocessing import LabelEncoder, StandardScaler\n",
 "from sklearn.model_selection import train_test_split\n",
 "from sklearn.linear_model import LogisticRegression\n",
 "from sklearn.metrics import accuracy_score, confusion_matrix, classification_report\n",
 "from imblearn.over_sampling import RandomOverSampler\n",
 "import pickle\n",
 "\n",
 "# Maths\n",
 "import math\n",
 "\n",
 "# Set the options\n",
 "pd.set_option('display.max_rows', 800)\n",
 "pd.set_option('display.max_columns', 500)\n",
 "%matplotlib inline"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### 1.2. Load the datasets\n",
 "\n",
 "Load the dataset using pd.read_csv()"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 2,
 "metadata": {},
 "outputs": [
 {
 "data": {
 "text/html": [
 "<div>\n",
 "<style scoped>\n",
 " .dataframe tbody tr th:only-of-type {\n",
 " vertical-align: middle;\n",
 " }\n",
 "\n",
 " .dataframe tbody tr th {\n",
 " vertical-align: top;\n",
 " }\n",
 "\n",
 " .dataframe thead th {\n",
 " text-align: right;\n",
 " }\n",
 "</style>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 " <thead>\n",
 " <tr style=\"text-align: right;\">\n",
 " <th></th>\n",
 " <th>CustomerId</th>\n",
 " <th>Surname</th>\n",
 " <th>CreditScore</th>\n",
 " <th>Geography</th>\n",
 " <th>Gender</th>\n",
 " <th>Age</th>\n",
 " <th>Tenure</th>\n",
 " <th>Balance</th>\n",
 " <th>NumOfProducts</th>\n",
 " <th>HasCrCard</th>\n",
 " <th>IsActiveMember</th>\n",
 " <th>EstimatedSalary</th>\n",
 " <th>Exited</th>\n",
 " </tr>\n",
 " </thead>\n",
 " <tbody>\n",
 " <tr>\n",
 " <th>0</th>\n",
 " <td>15634602</td>\n",
 " <td>Hargrave</td>\n",
 " <td>619</td>\n",
 " <td>France</td>\n",
 " <td>Female</td>\n",
 " <td>42.0</td>\n",
 " <td>2</td>\n",
 " <td>0.00</td>\n",
 " <td>1</td>\n",
 " <td>1</td>\n",
 " <td>1</td>\n",
 " <td>101348.88</td>\n",
 " <td>1</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>1</th>\n",
 " <td>15647311</td>\n",
 " <td>Hill</td>\n",
 " <td>608</td>\n",
 " <td>Spain</td>\n",
 " <td>Female</td>\n",
 " <td>41.0</td>\n",
 " <td>1</td>\n",
 " <td>83807.86</td>\n",
 " <td>1</td>\n",
 " <td>0</td>\n",
 " <td>1</td>\n",
 " <td>112542.58</td>\n",
 " <td>0</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>2</th>\n",
 " <td>15619304</td>\n",
 " <td>Onio</td>\n",
 " <td>502</td>\n",
 " <td>France</td>\n",
 " <td>Female</td>\n",
 " <td>42.0</td>\n",
 " <td>8</td>\n",
 " <td>159660.80</td>\n",
 " <td>3</td>\n",
 " <td>1</td>\n",
 " <td>0</td>\n",
 " <td>113931.57</td>\n",
 " <td>1</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>3</th>\n",
 " <td>15701354</td>\n",
 " <td>Boni</td>\n",
 " <td>699</td>\n",
 " <td>France</td>\n",
 " <td>Female</td>\n",
 " <td>39.0</td>\n",
 " <td>1</td>\n",
 " <td>0.00</td>\n",
 " <td>2</td>\n",
 " <td>0</td>\n",
 " <td>0</td>\n",
 " <td>93826.63</td>\n",
 " <td>0</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>4</th>\n",
 " <td>15737888</td>\n",
 " <td>Mitchell</td>\n",
 " <td>850</td>\n",
 " <td>Spain</td>\n",
 " <td>Female</td>\n",
 " <td>43.0</td>\n",
 " <td>2</td>\n",
 " <td>125510.82</td>\n",
 " <td>1</td>\n",
 " <td>1</td>\n",
 " <td>1</td>\n",
 " <td>79084.10</td>\n",
 " <td>0</td>\n",
 " </tr>\n",
 " </tbody>\n",
 "</table>\n",
 "</div>"
 ],
 "text/plain": [
 " CustomerId Surname CreditScore Geography Gender Age Tenure \\\n",
 "0 15634602 Hargrave 619 France Female 42.0 2 \n",
 "1 15647311 Hill 608 Spain Female 41.0 1 \n",
 "2 15619304 Onio 502 France Female 42.0 8 \n",
 "3 15701354 Boni 699 France Female 39.0 1 \n",
 "4 15737888 Mitchell 850 Spain Female 43.0 2 \n",
 "\n",
 " Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary \\\n",
 "0 0.00 1 1 1 101348.88 \n",
 "1 83807.86 1 0 1 112542.58 \n",
 "2 159660.80 3 1 0 113931.57 \n",
 "3 0.00 2 0 0 93826.63 \n",
 "4 125510.82 1 1 1 79084.10 \n",
 "\n",
 " Exited \n",
 "0 1 \n",
 "1 0 \n",
 "2 1 \n",
 "3 0 \n",
 "4 0 "
 ]
 },
 "execution_count": 2,
 "metadata": {},
 "output_type": "execute_result"
 }
 ],
 "source": [
 "# Define the file name\n",
 "file_name = \"Churn_Modelling.csv\"\n",
 "\n",
 "# Read data in form of a csv file\n",
 "df = pd.read_csv(file_name)\n",
 "\n",
 "# First 5 rows of the dataset\n",
 "df.head()"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "<div class=\"alert alert-info\" style=\"background-color:#006a79; color:white; padding:0px 10px; border-radius:5px;\"><h2 style='margin:10px 5px'>2. Overview of the Data</h2>\n",
 "</div>\n",
 "\n",
 "Before attempting to solve the problem, it's very important to have a good understanding of the data.\n",
 "\n",
 "In this section you will:\n",
 "- Get the descriptive statistics of the data\n",
 "- Get the information about missing values in the data"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### 2.1. Descriptive Statistics\n",
 "\n",
 "As the name suggests, descriptive statistics summarize the data. They give you information such as:\n",
 "- Mean, median, mode\n",
 "- Min, max\n",
 "- Count, etc.\n",
 "\n",
 "Let's look at the data we have."
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 3,
 "metadata": {},
 "outputs": [
 {
 "data": {
 "text/plain": [
 "(10000, 13)"
 ]
 },
 "execution_count": 3,
 "metadata": {},
 "output_type": "execute_result"
 }
 ],
 "source": [
 "# Dimension of the data\n",
 "df.shape"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 4,
 "metadata": {},
 "outputs": [
 {
 "data": {
 "text/html": [
 "<div>\n",
 "<style scoped>\n",
 " .dataframe tbody tr th:only-of-type {\n",
 " vertical-align: middle;\n",
 " }\n",
 "\n",
 " .dataframe tbody tr th {\n",
 " vertical-align: top;\n",
 " }\n",
 "\n",
 " .dataframe thead th {\n",
 " text-align: right;\n",
 " }\n",
 "</style>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 " <thead>\n",
 " <tr style=\"text-align: right;\">\n",
 " <th></th>\n",
 " <th>CustomerId</th>\n",
 " <th>CreditScore</th>\n",
 " <th>Age</th>\n",
 " <th>Tenure</th>\n",
 " <th>Balance</th>\n",
 " <th>NumOfProducts</th>\n",
 " <th>HasCrCard</th>\n",
 " <th>IsActiveMember</th>\n",
 " <th>EstimatedSalary</th>\n",
 " <th>Exited</th>\n",
 " </tr>\n",
 " </thead>\n",
 " <tbody>\n",
 " <tr>\n",
 " <th>count</th>\n",
 " <td>1.000000e+04</td>\n",
 " <td>10000.000000</td>\n",
 " <td>9985.000000</td>\n",
 " <td>10000.000000</td>\n",
 " <td>10000.000000</td>\n",
 " <td>10000.000000</td>\n",
 " <td>10000.00000</td>\n",
 " <td>10000.000000</td>\n",
 " <td>10000.000000</td>\n",
 " <td>10000.000000</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>mean</th>\n",
 " <td>1.569094e+07</td>\n",
 " <td>650.528800</td>\n",
 " <td>38.922684</td>\n",
 " <td>5.012800</td>\n",
 " <td>76485.889288</td>\n",
 " <td>1.530200</td>\n",
 " <td>0.70550</td>\n",
 " <td>0.515100</td>\n",
 " <td>100090.239881</td>\n",
 " <td>0.203700</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>std</th>\n",
 " <td>7.193619e+04</td>\n",
 " <td>96.653299</td>\n",
 " <td>10.488949</td>\n",
 " <td>2.892174</td>\n",
 " <td>62397.405202</td>\n",
 " <td>0.581654</td>\n",
 " <td>0.45584</td>\n",
 " <td>0.499797</td>\n",
 " <td>57510.492818</td>\n",
 " <td>0.402769</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>min</th>\n",
 " <td>1.556570e+07</td>\n",
 " <td>350.000000</td>\n",
 " <td>18.000000</td>\n",
 " <td>0.000000</td>\n",
 " <td>0.000000</td>\n",
 " <td>1.000000</td>\n",
 " <td>0.00000</td>\n",
 " <td>0.000000</td>\n",
 " <td>11.580000</td>\n",
 " <td>0.000000</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>25%</th>\n",
 " <td>1.562853e+07</td>\n",
 " <td>584.000000</td>\n",
 " <td>32.000000</td>\n",
 " <td>3.000000</td>\n",
 " <td>0.000000</td>\n",
 " <td>1.000000</td>\n",
 " <td>0.00000</td>\n",
 " <td>0.000000</td>\n",
 " <td>51002.110000</td>\n",
 " <td>0.000000</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>50%</th>\n",
 " <td>1.569074e+07</td>\n",
 " <td>652.000000</td>\n",
 " <td>37.000000</td>\n",
 " <td>5.000000</td>\n",
 " <td>97198.540000</td>\n",
 " <td>1.000000</td>\n",
 " <td>1.00000</td>\n",
 " <td>1.000000</td>\n",
 " <td>100193.915000</td>\n",
 " <td>0.000000</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>75%</th>\n",
 " <td>1.575323e+07</td>\n",
 " <td>718.000000</td>\n",
 " <td>44.000000</td>\n",
 " <td>7.000000</td>\n",
 " <td>127644.240000</td>\n",
 " <td>2.000000</td>\n",
 " <td>1.00000</td>\n",
 " <td>1.000000</td>\n",
 " <td>149388.247500</td>\n",
 " <td>0.000000</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>max</th>\n",
 " <td>1.581569e+07</td>\n",
 " <td>850.000000</td>\n",
 " <td>92.000000</td>\n",
 " <td>10.000000</td>\n",
 " <td>250898.090000</td>\n",
 " <td>4.000000</td>\n",
 " <td>1.00000</td>\n",
 " <td>1.000000</td>\n",
 " <td>199992.480000</td>\n",
 " <td>1.000000</td>\n",
 " </tr>\n",
 " </tbody>\n",
 "</table>\n",
 "</div>"
 ],
 "text/plain": [
 " CustomerId CreditScore Age Tenure Balance \\\n",
 "count 1.000000e+04 10000.000000 9985.000000 10000.000000 10000.000000 \n",
 "mean 1.569094e+07 650.528800 38.922684 5.012800 76485.889288 \n",
 "std 7.193619e+04 96.653299 10.488949 2.892174 62397.405202 \n",
 "min 1.556570e+07 350.000000 18.000000 0.000000 0.000000 \n",
 "25% 1.562853e+07 584.000000 32.000000 3.000000 0.000000 \n",
 "50% 1.569074e+07 652.000000 37.000000 5.000000 97198.540000 \n",
 "75% 1.575323e+07 718.000000 44.000000 7.000000 127644.240000 \n",
 "max 1.581569e+07 850.000000 92.000000 10.000000 250898.090000 \n",
 "\n",
 " NumOfProducts HasCrCard IsActiveMember EstimatedSalary \\\n",
 "count 10000.000000 10000.00000 10000.000000 10000.000000 \n",
 "mean 1.530200 0.70550 0.515100 100090.239881 \n",
 "std 0.581654 0.45584 0.499797 57510.492818 \n",
 "min 1.000000 0.00000 0.000000 11.580000 \n",
 "25% 1.000000 0.00000 0.000000 51002.110000 \n",
 "50% 1.000000 1.00000 1.000000 100193.915000 \n",
 "75% 2.000000 1.00000 1.000000 149388.247500 \n",
 "max 4.000000 1.00000 1.000000 199992.480000 \n",
 "\n",
 " Exited \n",
 "count 10000.000000 \n",
 "mean 0.203700 \n",
 "std 0.402769 \n",
 "min 0.000000 \n",
 "25% 0.000000 \n",
 "50% 0.000000 \n",
 "75% 0.000000 \n",
 "max 1.000000 "
 ]
 },
 "execution_count": 4,
 "metadata": {},
 "output_type": "execute_result"
 }
 ],
 "source": [
 "# Summary of the dataset\n",
 "df.describe()"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### 2.2 Missing Values\n",
 "\n",
 "Get the info about missing values in the dataframe"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 5,
 "metadata": {},
 "outputs": [
 {
 "data": {
 "text/plain": [
 "CustomerId 0\n",
 "Surname 0\n",
 "CreditScore 0\n",
 "Geography 0\n",
 "Gender 0\n",
 "Age 15\n",
 "Tenure 0\n",
 "Balance 0\n",
 "NumOfProducts 0\n",
 "HasCrCard 0\n",
 "IsActiveMember 0\n",
 "EstimatedSalary 0\n",
 "Exited 0\n",
 "dtype: int64"
 ]
 },
 "execution_count": 5,
 "metadata": {},
 "output_type": "execute_result"
 }
 ],
 "source": [
 "# Missing values for every column\n",
 "df.isna().sum()"
 ]
 },
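 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "Raw counts are easier to judge as a share of the rows. The output above shows 15 missing values in `Age` out of 10,000 rows, which is 0.15%. The cell below is a minimal sketch of that calculation; the name `missing_pct` is an assumption introduced here for illustration and is not used elsewhere in the template."
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
 "# Percentage of missing values per column (assumes df from the cells above)\n",
 "# isna() gives a boolean frame; the mean of booleans is the fraction of True values\n",
 "missing_pct = df.isna().mean().mul(100).round(2)\n",
 "\n",
 "# Show only the columns that actually have missing values\n",
 "print(missing_pct[missing_pct > 0])"
 ]
 },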
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "<div class=\"alert alert-info\" style=\"background-color:#006a79; color:white; padding:0px 10px; border-radius:5px;\"><h2 style='margin:10px 5px'>3. Exploratory Data Analysis</h2>\n",
 "</div>\n",
 "\n",
 "Exploratory data analysis (EDA) is an approach to investigating a dataset to find patterns and to see whether any of the variables can be useful in predicting the y variable. Visual methods are often used to summarize the data. EDA is primarily about seeing what the data can tell us beyond formal modeling or hypothesis testing.\n",
 "\n",
 "In this section you will:\n",
 "- Create list of columns by data type\n",
 "- Check the distribution of target class\n",
 "- Check the distribution of every feature\n",
 "- Check how different numerical features are related to target class"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### 3.1. Extract data types of columns\n",
 "\n",
 "It is best to build the lists of columns by data type at the start. You then won't have to type column names by hand when performing operations on a group of columns."
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 6,
 "metadata": {},
 "outputs": [],
 "source": [
 "# Remove extra columns\n",
 "col_remove = ['CustomerId']\n",
 "df = df.drop(col_remove, axis = 1)"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 7,
 "metadata": {},
 "outputs": [
 {
 "name": "stdout",
 "output_type": "stream",
 "text": [
 "Binary Columns : ['Gender', 'HasCrCard', 'IsActiveMember', 'Exited']\n",
 "Categorical Columns : ['Surname', 'Geography', 'Gender']\n",
 "Numerical Columns : ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']\n"
 ]
 }
 ],
 "source": [
 "# List of numeric and categorical columns\n",
 "binary_columns = [col for col in df.columns if df[col].nunique() == 2]\n",
 "print(\"Binary Columns : \", binary_columns)\n",
 "\n",
 "categorical_columns = [col for col in df.columns if df[col].dtype == \"object\"]\n",
 "print(\"Categorical Columns : \", categorical_columns)\n",
 "\n",
 "categorical_columns = binary_columns + categorical_columns\n",
 "categorical_columns = list(set(categorical_columns))\n",
 "\n",
 "numerical_columns = [col for col in df.columns if col not in categorical_columns]\n",
 "print(\"Numerical Columns : \", numerical_columns)"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "**Note:** The inferred data type of a column may not match its real meaning, for example an integer column that actually encodes categories. In such cases you will have to correct the lists manually."
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### 3.2 Check distribution of target class\n",
 "\n",
 "Check the distribution of the target class: how many categories there are and whether they are balanced."
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 8,
 "metadata": {},
 "outputs": [
 {
 "data": {
 "image/png": "[base64-encoded PNG omitted: horizontal count plot of the target classes in Exited]",
"text/plain": [
 "<Figure size 432x288 with 1 Axes>"
 ]
 },
 "metadata": {
 "needs_background": "light"
 },
 "output_type": "display_data"
 }
 ],
 "source": [
 "# Target class\n",
 "target_class = \"Exited\"\n",
 "\n",
 "# Check distribution of target class\n",
 "sns.countplot(y=target_class, data=df)\n",
 "plt.xlabel(\"Count of each Target class\")\n",
 "plt.ylabel(\"Target classes\")\n",
 "plt.show()"
 ]
 },
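 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "A count plot shows the imbalance visually; the class proportions put a number on it. The descriptive statistics earlier report a mean of 0.2037 for `Exited`, so roughly one customer in five belongs to the positive class. The cell below is a small sketch of how to read the proportions directly; the name `class_share` is an assumption introduced here for illustration."
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
 "# Share of each target class (assumes df and target_class from the cells above)\n",
 "# normalize=True returns fractions that sum to 1 instead of raw counts\n",
 "class_share = df[target_class].value_counts(normalize=True)\n",
 "print(class_share)"
 ]
 },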
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### 3.3. Check the distribution of every feature"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 9,
 "metadata": {},
 "outputs": [
 {
 "data": {
 "image/png": "\n",
"text/plain": [
 "<Figure size 1080x864 with 9 Axes>"
 ]
 },
 "metadata": {
 "needs_background": "light"
 },
 "output_type": "display_data"
 }
 ],
 "source": [
 "# Check the distribution of all the features\n",
 "df.hist(figsize=(15,12),bins = 15)\n",
 "plt.suptitle(\"Features Distribution\")\n",
 "plt.show()"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### 3.4 Check how different numerical features are related to target class"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 10,
 "metadata": {},
 "outputs": [],
 "source": [
 "# Number of rows and columns in the plot\n",
 "n_cols = 3\n",
 "n_rows = math.ceil(len(numerical_columns)/n_cols)"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 11,
 "metadata": {},
 "outputs": [
 {
 "data": {
 "image/png": "\n",
"text/plain": [
 "<Figure size 2160x2160 with 6 Axes>"
 ]
 },
 "metadata": {
 "needs_background": "light"
 },
 "output_type": "display_data"
 }
 ],
 "source": [
 "# Box plot of every numerical feature, grouped by the target class\n",
 "fig, ax = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(30, 30))\n",
 "row = 0\n",
 "col = 0\n",
 "for i in numerical_columns:\n",
 "    # Move to the next row of the grid once the current row is full\n",
 "    if col > n_cols - 1:\n",
 "        row += 1\n",
 "        col = 0\n",
 "    axes = ax[row, col]\n",
 "    sns.boxplot(x=df[target_class], y=df[i], ax=axes)\n",
 "    col += 1\n",
 "# suptitle titles the whole figure; plt.title would only title the last subplot\n",
 "fig.suptitle(\"Individual Features by Class\")\n",
 "plt.tight_layout()\n",
 "plt.show()"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "<div class=\"alert alert-info\" style=\"background-color:#006a79; color:white; padding:0px 10px; border-radius:5px;\"><h2 style='margin:10px 5px'>4. Data Preparation</h2>\n",
 "</div>\n",
 "\n",
 "The data is not yet ready for model building. You first need to process it.\n",
 "\n",
 "In this section you will:\n",
 "- Clean the data\n",
 "- Encode the categorical features\n",
 "- Split the dataset into features (X) and target (y)\n",
 "- Scale the features\n",
 "- Split the data into train and test sets"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### 4.1. Data Cleaning\n",
 "\n",
 "Machine learning works on the principle of garbage in, garbage out. If you feed in dirty data, the results won't be good. Hence it's very important to clean the data before training the model.\n",
 "\n",
 "Here you will impute the missing values with the mean of each numeric column."
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 12,
 "metadata": {},
 "outputs": [],
 "source": [
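 "# Impute missing values with the column mean\n",
 "# numeric_only=True skips text columns, which would otherwise raise a TypeError in recent pandas versions\n",
 "df = df.fillna(df.mean(numeric_only=True))"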
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 13,
 "metadata": {},
 "outputs": [
 {
 "data": {
 "text/html": [
 "<div>\n",
 "<style scoped>\n",
 " .dataframe tbody tr th:only-of-type {\n",
 " vertical-align: middle;\n",
 " }\n",
 "\n",
 " .dataframe tbody tr th {\n",
 " vertical-align: top;\n",
 " }\n",
 "\n",
 " .dataframe thead th {\n",
 " text-align: right;\n",
 " }\n",
 "</style>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 " <thead>\n",
 " <tr style=\"text-align: right;\">\n",
 " <th></th>\n",
 " <th>Surname</th>\n",
 " <th>CreditScore</th>\n",
 " <th>Geography</th>\n",
 " <th>Gender</th>\n",
 " <th>Age</th>\n",
 " <th>Tenure</th>\n",
 " <th>Balance</th>\n",
 " <th>NumOfProducts</th>\n",
 " <th>HasCrCard</th>\n",
 " <th>IsActiveMember</th>\n",
 " <th>EstimatedSalary</th>\n",
 " <th>Exited</th>\n",
 " </tr>\n",
 " </thead>\n",
 " <tbody>\n",
 " <tr>\n",
 " <th>0</th>\n",
 " <td>Hargrave</td>\n",
 " <td>619</td>\n",
 " <td>France</td>\n",
 " <td>Female</td>\n",
 " <td>42.0</td>\n",
 " <td>2</td>\n",
 " <td>0.00</td>\n",
 " <td>1</td>\n",
 " <td>1</td>\n",
 " <td>1</td>\n",
 " <td>101348.88</td>\n",
 " <td>1</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>1</th>\n",
 " <td>Hill</td>\n",
 " <td>608</td>\n",
 " <td>Spain</td>\n",
 " <td>Female</td>\n",
 " <td>41.0</td>\n",
 " <td>1</td>\n",
 " <td>83807.86</td>\n",
 " <td>1</td>\n",
 " <td>0</td>\n",
 " <td>1</td>\n",
 " <td>112542.58</td>\n",
 " <td>0</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>2</th>\n",
 " <td>Onio</td>\n",
 " <td>502</td>\n",
 " <td>France</td>\n",
 " <td>Female</td>\n",
 " <td>42.0</td>\n",
 " <td>8</td>\n",
 " <td>159660.80</td>\n",
 " <td>3</td>\n",
 " <td>1</td>\n",
 " <td>0</td>\n",
 " <td>113931.57</td>\n",
 " <td>1</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>3</th>\n",
 " <td>Boni</td>\n",
 " <td>699</td>\n",
 " <td>France</td>\n",
 " <td>Female</td>\n",
 " <td>39.0</td>\n",
 " <td>1</td>\n",
 " <td>0.00</td>\n",
 " <td>2</td>\n",
 " <td>0</td>\n",
 " <td>0</td>\n",
 " <td>93826.63</td>\n",
 " <td>0</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>4</th>\n",
 " <td>Mitchell</td>\n",
 " <td>850</td>\n",
 " <td>Spain</td>\n",
 " <td>Female</td>\n",
 " <td>43.0</td>\n",
 " <td>2</td>\n",
 " <td>125510.82</td>\n",
 " <td>1</td>\n",
 " <td>1</td>\n",
 " <td>1</td>\n",
 " <td>79084.10</td>\n",
 " <td>0</td>\n",
 " </tr>\n",
 " </tbody>\n",
 "</table>\n",
 "</div>"
 ],
 "text/plain": [
 " Surname CreditScore Geography Gender Age Tenure Balance \\\n",
 "0 Hargrave 619 France Female 42.0 2 0.00 \n",
 "1 Hill 608 Spain Female 41.0 1 83807.86 \n",
 "2 Onio 502 France Female 42.0 8 159660.80 \n",
 "3 Boni 699 France Female 39.0 1 0.00 \n",
 "4 Mitchell 850 Spain Female 43.0 2 125510.82 \n",
 "\n",
 " NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited \n",
 "0 1 1 1 101348.88 1 \n",
 "1 1 0 1 112542.58 0 \n",
 "2 3 1 0 113931.57 1 \n",
 "3 2 0 0 93826.63 0 \n",
 "4 1 1 1 79084.10 0 "
 ]
 },
 "execution_count": 13,
 "metadata": {},
 "output_type": "execute_result"
 }
 ],
 "source": [
 "df.head()"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### 4.2. Feature Encoding\n",
 "\n",
 "Encoding is the process of converting data from one form to another. Most machine learning algorithms cannot handle categorical values unless they are converted to numerical values. The performance of many algorithms varies depending on how the categorical columns are encoded."
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 14,
 "metadata": {},
 "outputs": [
 {
 "data": {
 "text/html": [
 "<div>\n",
 "<style scoped>\n",
 " .dataframe tbody tr th:only-of-type {\n",
 " vertical-align: middle;\n",
 " }\n",
 "\n",
 " .dataframe tbody tr th {\n",
 " vertical-align: top;\n",
 " }\n",
 "\n",
 " .dataframe thead th {\n",
 " text-align: right;\n",
 " }\n",
 "</style>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 " <thead>\n",
 " <tr style=\"text-align: right;\">\n",
 " <th></th>\n",
 " <th>Surname</th>\n",
 " <th>CreditScore</th>\n",
 " <th>Geography</th>\n",
 " <th>Gender</th>\n",
 " <th>Age</th>\n",
 " <th>Tenure</th>\n",
 " <th>Balance</th>\n",
 " <th>NumOfProducts</th>\n",
 " <th>HasCrCard</th>\n",
" <th>IsActiveMember</th>\n",
 " <th>EstimatedSalary</th>\n",
 " <th>Exited</th>\n",
 " </tr>\n",
 " </thead>\n",
 " <tbody>\n",
 " <tr>\n",
 " <th>0</th>\n",
 " <td>1115</td>\n",
 " <td>619</td>\n",
 " <td>0</td>\n",
 " <td>0</td>\n",
 " <td>42.0</td>\n",
 " <td>2</td>\n",
 " <td>0.00</td>\n",
 " <td>1</td>\n",
 " <td>1</td>\n",
 " <td>1</td>\n",
 " <td>101348.88</td>\n",
 " <td>1</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>1</th>\n",
 " <td>1177</td>\n",
 " <td>608</td>\n",
 " <td>2</td>\n",
 " <td>0</td>\n",
 " <td>41.0</td>\n",
 " <td>1</td>\n",
 " <td>83807.86</td>\n",
 " <td>1</td>\n",
 " <td>0</td>\n",
 " <td>1</td>\n",
 " <td>112542.58</td>\n",
 " <td>0</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>2</th>\n",
 " <td>2040</td>\n",
 " <td>502</td>\n",
 " <td>0</td>\n",
 " <td>0</td>\n",
 " <td>42.0</td>\n",
 " <td>8</td>\n",
 " <td>159660.80</td>\n",
 " <td>3</td>\n",
 " <td>1</td>\n",
 " <td>0</td>\n",
 " <td>113931.57</td>\n",
 " <td>1</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>3</th>\n",
 " <td>289</td>\n",
 " <td>699</td>\n",
 " <td>0</td>\n",
 " <td>0</td>\n",
 " <td>39.0</td>\n",
 " <td>1</td>\n",
 " <td>0.00</td>\n",
 " <td>2</td>\n",
 " <td>0</td>\n",
 " <td>0</td>\n",
 " <td>93826.63</td>\n",
 " <td>0</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>4</th>\n",
 " <td>1822</td>\n",
 " <td>850</td>\n",
 " <td>2</td>\n",
 " <td>0</td>\n",
 " <td>43.0</td>\n",
 " <td>2</td>\n",
 " <td>125510.82</td>\n",
 " <td>1</td>\n",
 " <td>1</td>\n",
 " <td>1</td>\n",
 " <td>79084.10</td>\n",
 " <td>0</td>\n",
 " </tr>\n",
 " </tbody>\n",
 "</table>\n",
 "</div>"
 ],
 "text/plain": [
 " Surname CreditScore Geography Gender Age Tenure Balance \\\n",
 "0 1115 619 0 0 42.0 2 0.00 \n",
 "1 1177 608 2 0 41.0 1 83807.86 \n",
 "2 2040 502 0 0 42.0 8 159660.80 \n",
 "3 289 699 0 0 39.0 1 0.00 \n",
 "4 1822 850 2 0 43.0 2 125510.82 \n",
 "\n",
 " NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited \n",
 "0 1 1 1 101348.88 1 \n",
 "1 1 0 1 112542.58 0 \n",
 "2 3 1 0 113931.57 1 \n",
 "3 2 0 0 93826.63 0 \n",
 "4 1 1 1 79084.10 0 "
 ]
 },
 "execution_count": 14,
 "metadata": {},
 "output_type": "execute_result"
 }
 ],
 "source": [
 "# Label encode the variables\n",
 "for col in categorical_columns:\n",
 " lbl = LabelEncoder()\n",
 " lbl.fit(list(df[col].values))\n",
 " df[col] = lbl.transform(list(df[col].values))\n",
 " \n",
 "df.head()"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### 4.3. Split X and y\n",
 "\n",
 "Split the X and y dataset"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 15,
 "metadata": {},
 "outputs": [],
 "source": [
 "# Split the y variable series and x variables dataset\n",
 "X = df.drop([target_class],axis=1)\n",
 "y = df[target_class]"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### 4.4. Feature Scaling\n",
 "\n",
 "It is a technique to standardize the x variables (features) present in the data in a fixed range. It needs to be done before training the model.\n",
 "\n",
 "But if you are using tree based models, you should not go for feature scaling"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 18,
 "metadata": {},
 "outputs": [],
 "source": [
 "# Define the function to scale the data using StandardScaler()\n",
 "def scale_data(data):\n",
 " \n",
 " scaler = StandardScaler() \n",
 "\n",
 " # transform data\n",
 " scaled_data = scaler.fit_transform(data)\n",
 " scaled_data = DataFrame(scaled_data)\n",
 "\n",
 " scaled_data.columns = data.columns\n",
 " \n",
 " return scaled_data"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 19,
 "metadata": {},
 "outputs": [
 {
 "data": {
 "text/html": [
 "<div>\n",
 "<style scoped>\n",
 " .dataframe tbody tr th:only-of-type {\n",
 " vertical-align: middle;\n",
 " }\n",
 "\n",
 " .dataframe tbody tr th {\n",
 " vertical-align: top;\n",
 " }\n",
 "\n",
 " .dataframe thead th {\n",
 " text-align: right;\n",
 " }\n",
 "</style>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 " <thead>\n",
 " <tr style=\"text-align: right;\">\n",
 " <th></th>\n",
 " <th>Surname</th>\n",
 " <th>CreditScore</th>\n",
 " <th>Geography</th>\n",
 " <th>Gender</th>\n",
 " <th>Age</th>\n",
 " <th>Tenure</th>\n",
 " <th>Balance</th>\n",
 " <th>NumOfProducts</th>\n",
 " <th>HasCrCard</th>\n",
 " <th>IsActiveMember</th>\n",
 " <th>EstimatedSalary</th>\n",
 " </tr>\n",
 " </thead>\n",
 " <tbody>\n",
 " <tr>\n",
 " <th>0</th>\n",
 " <td>-0.464183</td>\n",
 " <td>-0.326221</td>\n",
 " <td>-0.901886</td>\n",
 " <td>-1.095988</td>\n",
 " <td>0.293621</td>\n",
 " <td>-1.041760</td>\n",
 " <td>-1.225848</td>\n",
 " <td>-0.911583</td>\n",
 " <td>0.646092</td>\n",
 " <td>0.970243</td>\n",
 " <td>0.021886</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>1</th>\n",
 " <td>-0.390911</td>\n",
 " <td>-0.440036</td>\n",
 " <td>1.515067</td>\n",
 " <td>-1.095988</td>\n",
 " <td>0.198207</td>\n",
 " <td>-1.387538</td>\n",
 " <td>0.117350</td>\n",
 " <td>-0.911583</td>\n",
 " <td>-1.547768</td>\n",
 " <td>0.970243</td>\n",
 " <td>0.216534</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>2</th>\n",
 " <td>0.628988</td>\n",
 " <td>-1.536794</td>\n",
 " <td>-0.901886</td>\n",
 " <td>-1.095988</td>\n",
 " <td>0.293621</td>\n",
 " <td>1.032908</td>\n",
 " <td>1.333053</td>\n",
 " <td>2.527057</td>\n",
 " <td>0.646092</td>\n",
 " <td>-1.030670</td>\n",
 " <td>0.240687</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>3</th>\n",
 " <td>-1.440356</td>\n",
 " <td>0.501521</td>\n",
 " <td>-0.901886</td>\n",
 " <td>-1.095988</td>\n",
 " <td>0.007377</td>\n",
 " <td>-1.387538</td>\n",
 " <td>-1.225848</td>\n",
" <td>0.807737</td>\n",
 " <td>-1.547768</td>\n",
 " <td>-1.030670</td>\n",
 " <td>-0.108918</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>4</th>\n",
 " <td>0.371354</td>\n",
 " <td>2.063884</td>\n",
 " <td>1.515067</td>\n",
 " <td>-1.095988</td>\n",
 " <td>0.389036</td>\n",
 " <td>-1.041760</td>\n",
 " <td>0.785728</td>\n",
 " <td>-0.911583</td>\n",
 " <td>0.646092</td>\n",
 " <td>0.970243</td>\n",
 " <td>-0.365276</td>\n",
 " </tr>\n",
 " </tbody>\n",
 "</table>\n",
 "</div>"
 ],
 "text/plain": [
 " Surname CreditScore Geography Gender Age Tenure Balance \\\n",
 "0 -0.464183 -0.326221 -0.901886 -1.095988 0.293621 -1.041760 -1.225848 \n",
 "1 -0.390911 -0.440036 1.515067 -1.095988 0.198207 -1.387538 0.117350 \n",
 "2 0.628988 -1.536794 -0.901886 -1.095988 0.293621 1.032908 1.333053 \n",
 "3 -1.440356 0.501521 -0.901886 -1.095988 0.007377 -1.387538 -1.225848 \n",
 "4 0.371354 2.063884 1.515067 -1.095988 0.389036 -1.041760 0.785728 \n",
 "\n",
 " NumOfProducts HasCrCard IsActiveMember EstimatedSalary \n",
 "0 -0.911583 0.646092 0.970243 0.021886 \n",
 "1 -0.911583 -1.547768 0.970243 0.216534 \n",
 "2 2.527057 0.646092 -1.030670 0.240687 \n",
 "3 0.807737 -1.547768 -1.030670 -0.108918 \n",
 "4 -0.911583 0.646092 0.970243 -0.365276 "
 ]
 },
 "execution_count": 19,
 "metadata": {},
 "output_type": "execute_result"
 }
 ],
 "source": [
 "# Scale X dataset\n",
 "scaled_X = scale_data(X)\n",
 "scaled_X.head()"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### 4.5 Train - Test Split\n",
 "\n",
 "Split the dataset in training and test set"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 20,
 "metadata": {},
 "outputs": [],
 "source": [
 "# Split the dataset into the training set and test set\n",
 "X_train, X_test, y_train, y_test = train_test_split(scaled_X, y, test_size = 0.3, random_state = 0)"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "<div class=\"alert alert-info\" style=\"background-color:#006a79; color:white; padding:0px 10px; border-radius:5px;\"><h2 style='margin:10px 5px'>5. Model Building</h2>\n",
 "</div>\n",
 "\n",
 "In this section you will:\n",
 "- Train the model on training data\n",
 "- Get the predictions on testing data\n",
 "- Evaluate the performance of model on testing data"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### 5.1 Train Model\n",
 "\n",
 "Train the logistic regression model on training data"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 21,
 "metadata": {},
 "outputs": [
 {
 "data": {
 "text/plain": [
 "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
 " intercept_scaling=1, l1_ratio=None, max_iter=100,\n",
 " multi_class='auto', n_jobs=None, penalty='l2',\n",
 " random_state=0, solver='lbfgs', tol=0.0001, verbose=0,\n",
 " warm_start=False)"
 ]
 },
 "execution_count": 21,
 "metadata": {},
 "output_type": "execute_result"
 }
 ],
 "source": [
 "# Defining the model \n",
 "model = LogisticRegression(random_state=0)\n",
 "# model = DecisionTreeClassifier(random_state=0)\n",
 "\n",
 "# Training the model:\n",
 "model.fit(X_train, y_train)\n",
 "\n",
 "model"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### 5.2 Model Predictions\n",
 "\n",
 "Get the predictions from the model on testing data"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 22,
 "metadata": {},
 "outputs": [
 {
 "name": "stdout",
 "output_type": "stream",
 "text": [
 "Y predicted : [0 0 0 ... 0 0 0]\n",
 "Y probability predicted : [0.15028993372247473, 0.40348523298833217, 0.18960139980549454, 0.1402526159936366, 0.11542139569731145]\n"
 ]
 }
 ],
 "source": [
 "# Predict class for test dataset\n",
 "y_pred = model.predict(X_test)\n",
 "\n",
 "# Predict probability for test dataset\n",
 "y_pred_prod = model.predict_proba(X_test)\n",
 "y_pred_prod = [x[1] for x in y_pred_prod]\n",
 "print(\"Y predicted : \",y_pred)\n",
 "print(\"Y probability predicted : \",y_pred_prod[:5])"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### 5.3. Model Evaluation\n",
 "\n",
 "Get the evaluation metrics to evaluate the performance of model on testing data"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 23,
 "metadata": {},
 "outputs": [],
 "source": [
 "# Define a function to compute various evaluation metrics \n",
 "def compute_evaluation_metric(y_actual, y_predicted):\n",
 " print(\"\\n Accuracy Score : \\n \",accuracy_score(y_actual,y_predicted))\n",
 " print(\"\\n Confusion Matrix : \\n \",confusion_matrix(y_actual, y_predicted))\n",
 " print(\"\\n Classification Report : \\n\",classification_report(y_actual, y_predicted))"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 24,
 "metadata": {},
 "outputs": [
 {
 "name": "stdout",
 "output_type": "stream",
 "text": [
 "\n",
 " Accuracy Score : \n",
 " 0.8033333333333333\n",
 "\n",
 " Confusion Matrix : \n",
 " [[2302 77]\n",
 " [ 513 108]]\n",
 "\n",
 " Classification Report : \n",
 " precision recall f1-score support\n",
 "\n",
 " 0 0.82 0.97 0.89 2379\n",
 " 1 0.58 0.17 0.27 621\n",
 "\n",
 " accuracy 0.80 3000\n",
 " macro avg 0.70 0.57 0.58 3000\n",
 "weighted avg 0.77 0.80 0.76 3000\n",
 "\n"
 ]
 }
 ],
 "source": [
 "# Compute Evaluation Metric\n",
 "compute_evaluation_metric(y_test, y_pred)"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "<div class=\"alert alert-info\" style=\"background-color:#006a79; color:white; padding:0px 10px; border-radius:5px;\"><h2 style='margin:10px 5px'>6. Improve Model</h2>\n",
 "</div>\n",
 "\n",
 "The first model you make may not be a good one. You need to improve the model. \n",
 "\n",
 "In majority of the classification problems, the target class is imbalanced. So you need to balance it in order to get best modelling results. \n",
 "\n",
 "In this section you will:\n",
 "- Handle class imbalance\n",
 "- Save the final model"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### 6.1 Handle Class Imbalance\n",
 "\n",
 "Imbalanced classes are a common problem in machine learning classification where there are a disproportionate ratio of observations in each class.\n",
 "\n",
 "Most machine learning algorithms work best when the number of samples in each class are about equal. This is because most algorithms are designed to maximize accuracy and reduce error.\n",
 "\n",
 "Here, you will upsample the minority class"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 25,
 "metadata": {},
 "outputs": [
 {
 "data": {
 "text/plain": [
 "1 7963\n",
 "0 7963\n",
 "Name: Exited, dtype: int64"
 ]
 },
 "execution_count": 25,
 "metadata": {},
 "output_type": "execute_result"
 }
 ],
 "source": [
"# Over sample the minority class \n",
 "ros = RandomOverSampler()\n",
 "X_ros, y_ros = ros.fit_sample(X, y)\n",
 "\n",
 "y_ros.value_counts()"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 29,
 "metadata": {},
 "outputs": [],
 "source": [
 "# Define the function to build model on balanced dataset\n",
 "def classification_model(X, y):\n",
 " \n",
 " scaled_X = scale_data(X)\n",
 " \n",
 " # Split the dataset into the training set and test set\n",
 " X_train, X_test, y_train, y_test = train_test_split(scaled_X, y, test_size = 0.3, random_state = 0)\n",
 " \n",
 " # Defining the model \n",
 " model = LogisticRegression(random_state=0)\n",
 "\n",
 " # Training the model:\n",
 " model.fit(X_train, y_train)\n",
 "\n",
 " # Predict class for test dataset\n",
 " y_pred = model.predict(X_test)\n",
 " \n",
 " # Compute Evaluation Metric\n",
 " compute_evaluation_metric(y_test, y_pred)\n",
 " \n",
 " return model"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 31,
 "metadata": {},
 "outputs": [
 {
 "name": "stdout",
 "output_type": "stream",
 "text": [
 "\n",
 " Accuracy Score : \n",
 " 0.6975722059439096\n",
 "\n",
 " Confusion Matrix : \n",
 " [[1678 706]\n",
 " [ 739 1655]]\n",
 "\n",
 " Classification Report : \n",
 " precision recall f1-score support\n",
 "\n",
 " 0 0.69 0.70 0.70 2384\n",
 " 1 0.70 0.69 0.70 2394\n",
 "\n",
 " accuracy 0.70 4778\n",
 " macro avg 0.70 0.70 0.70 4778\n",
 "weighted avg 0.70 0.70 0.70 4778\n",
 "\n"
 ]
 }
 ],
 "source": [
 "# Build model on balanced data and get evaluation metrics\n",
 "model = classification_model(X_ros, y_ros)"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### 6.2 Save the final model\n",
 "\n",
 "You can save the model in local disk and use it whenever you want"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 34,
 "metadata": {},
 "outputs": [],
 "source": [
 "# save the model to disk\n",
 "filename = 'final_model.sav'\n",
 "pickle.dump(model, open(filename, 'wb'))"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 35,
 "metadata": {},
 "outputs": [
 {
 "data": {
 "text/plain": [
 "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
 " intercept_scaling=1, l1_ratio=None, max_iter=100,\n",
 " multi_class='auto', n_jobs=None, penalty='l2',\n",
 " random_state=0, solver='lbfgs', tol=0.0001, verbose=0,\n",
 " warm_start=False)"
 ]
 },
 "execution_count": 35,
 "metadata": {},
 "output_type": "execute_result"
 }
 ],
 "source": [
 "# load the model from disk\n",
 "loaded_model = pickle.load(open(filename, 'rb'))\n",
 "loaded_model"
 ]
 }
 ],
 "metadata": {
 "kernelspec": {
 "display_name": "Python 3",
 "language": "python",
 "name": "python3"
 },
 "language_info": {
 "codemirror_mode": {
 "name": "ipython",
 "version": 3
 },
 "file_extension": ".py",
 "mimetype": "text/x-python",
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
 "version": "3.7.3"
 }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
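
A note on the order of steps in sections 4.4, 4.5 and 6.1: the notebook scales and oversamples the full dataset and only then splits it. That lets test-set statistics influence the scaler, and copies of the same minority row can land on both sides of the split. One common alternative is to split first, then fit the scaler and the oversampler on the training rows only. The sketch below illustrates that ordering with the standard library alone, on synthetic toy data. The helper names (`train_test_split_rows`, `fit_standardizer`, `standardize`, `oversample_minority`) are hypothetical stand-ins, not part of scikit-learn or imbalanced-learn; in the notebook itself you would keep `train_test_split`, `StandardScaler` and `RandomOverSampler` and only change the order in which they are applied.

```python
import random
import statistics


def train_test_split_rows(rows, labels, test_size=0.3, seed=0):
    """Shuffle the row indices once, then cut them into train and test parts."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return ([rows[i] for i in train_idx], [rows[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])


def fit_standardizer(rows):
    """Learn per-column mean and population std from the given rows only."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant columns
    return means, stds


def standardize(rows, means, stds):
    """Apply previously learned means and stds to any set of rows."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def oversample_minority(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    by_class = {}
    for row, lab in zip(rows, labels):
        by_class.setdefault(lab, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for lab, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([lab] * target)
    return out_rows, out_labels


# Synthetic toy data (an assumption for illustration): 100 rows, 20% positives.
rng = random.Random(42)
X = [[rng.gauss(50, 10), rng.gauss(1000, 300)] for _ in range(100)]
y = [1 if i % 5 == 0 else 0 for i in range(100)]

# 1. Split first, so the test rows stay unseen by every fitted step.
X_train, X_test, y_train, y_test = train_test_split_rows(X, y)

# 2. Fit the scaler on the training rows, then apply it to both parts.
means, stds = fit_standardizer(X_train)
X_train_s = standardize(X_train, means, stds)
X_test_s = standardize(X_test, means, stds)

# 3. Balance only the training rows; the test set keeps its natural class ratio.
X_bal, y_bal = oversample_minority(X_train_s, y_train)
```

With this ordering the model is evaluated on rows that kept their original class ratio, so the reported scores are closer to what to expect on new data.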
