{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Test3: (40 marks)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Your ID: 6204640087"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this excercise, we aimed at the case of customers default payments in Taiwan and compares the predictive accuracy of probability of default among data mining methods. From the perspective of risk management, the result of predictive accuracy of the estimated probability of default will be more valuable than the binary result of classification - credible or not credible clients. With the real probability of default as the response variable (Y), and the predictive probability of default as the independent variable (X), this research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables: \\\\\n",
    "\n",
    "X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit. \\\\\n",
    "\n",
    "X2: Gender (1 = male; 2 = female). \\\\\n",
    "\n",
    "X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). \\\\\n",
    "\n",
    "X4: Marital status (1 = married; 2 = single; 3 = others). \\\\\n",
    "\n",
    "X5: Age (year). \\\\\n",
    "\n",
    "X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. \\\\\n",
    "\n",
    "X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005. \\\\\n",
    "\n",
    "X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005. \\\\\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import datetime\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "#import pandas_datareader.data as web\n",
    "\n",
    "from sklearn import (\n",
    "    linear_model, metrics, pipeline, preprocessing, model_selection\n",
    ")\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "# activate plot theme\n",
    "#import qeds\n",
    "#qeds.themes.mpl_style();"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unnamed: 0</th>\n",
       "      <th>X1</th>\n",
       "      <th>X2</th>\n",
       "      <th>X3</th>\n",
       "      <th>X4</th>\n",
       "      <th>X5</th>\n",
       "      <th>X6</th>\n",
       "      <th>X7</th>\n",
       "      <th>X8</th>\n",
       "      <th>X9</th>\n",
       "      <th>...</th>\n",
       "      <th>X15</th>\n",
       "      <th>X16</th>\n",
       "      <th>X17</th>\n",
       "      <th>X18</th>\n",
       "      <th>X19</th>\n",
       "      <th>X20</th>\n",
       "      <th>X21</th>\n",
       "      <th>X22</th>\n",
       "      <th>X23</th>\n",
       "      <th>Y</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>20000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>24</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>689</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>120000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>26</td>\n",
       "      <td>-1</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>3272</td>\n",
       "      <td>3455</td>\n",
       "      <td>3261</td>\n",
       "      <td>0</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>0</td>\n",
       "      <td>2000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>90000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>34</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>14331</td>\n",
       "      <td>14948</td>\n",
       "      <td>15549</td>\n",
       "      <td>1518</td>\n",
       "      <td>1500</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>5000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>50000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>37</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>28314</td>\n",
       "      <td>28959</td>\n",
       "      <td>29547</td>\n",
       "      <td>2000</td>\n",
       "      <td>2019</td>\n",
       "      <td>1200</td>\n",
       "      <td>1100</td>\n",
       "      <td>1069</td>\n",
       "      <td>1000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>50000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>57</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>20940</td>\n",
       "      <td>19146</td>\n",
       "      <td>19131</td>\n",
       "      <td>2000</td>\n",
       "      <td>36681</td>\n",
       "      <td>10000</td>\n",
       "      <td>9000</td>\n",
       "      <td>689</td>\n",
       "      <td>679</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 25 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   Unnamed: 0      X1  X2  X3  X4  X5  X6  X7  X8  X9  ...    X15    X16  \\\n",
       "0           1   20000   2   2   1  24   2   2  -1  -1  ...      0      0   \n",
       "1           2  120000   2   2   2  26  -1   2   0   0  ...   3272   3455   \n",
       "2           3   90000   2   2   2  34   0   0   0   0  ...  14331  14948   \n",
       "3           4   50000   2   2   1  37   0   0   0   0  ...  28314  28959   \n",
       "4           5   50000   1   2   1  57  -1   0  -1   0  ...  20940  19146   \n",
       "\n",
       "     X17   X18    X19    X20   X21   X22   X23  Y  \n",
       "0      0     0    689      0     0     0     0  1  \n",
       "1   3261     0   1000   1000  1000     0  2000  1  \n",
       "2  15549  1518   1500   1000  1000  1000  5000  0  \n",
       "3  29547  2000   2019   1200  1100  1069  1000  0  \n",
       "4  19131  2000  36681  10000  9000   689   679  0  \n",
       "\n",
       "[5 rows x 25 columns]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_excel('default of credit card clients.xls')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['Unnamed: 0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9',\n",
       "       'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19',\n",
       "       'X20', 'X21', 'X22', 'X23', 'Y'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>X1</th>\n",
       "      <th>X2</th>\n",
       "      <th>X3</th>\n",
       "      <th>X4</th>\n",
       "      <th>X5</th>\n",
       "      <th>X6</th>\n",
       "      <th>X7</th>\n",
       "      <th>X8</th>\n",
       "      <th>X9</th>\n",
       "      <th>X10</th>\n",
       "      <th>...</th>\n",
       "      <th>X15</th>\n",
       "      <th>X16</th>\n",
       "      <th>X17</th>\n",
       "      <th>X18</th>\n",
       "      <th>X19</th>\n",
       "      <th>X20</th>\n",
       "      <th>X21</th>\n",
       "      <th>X22</th>\n",
       "      <th>X23</th>\n",
       "      <th>Y</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>20000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>24</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-2</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>689</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>120000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>26</td>\n",
       "      <td>-1</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>3272</td>\n",
       "      <td>3455</td>\n",
       "      <td>3261</td>\n",
       "      <td>0</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>0</td>\n",
       "      <td>2000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>90000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>34</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>14331</td>\n",
       "      <td>14948</td>\n",
       "      <td>15549</td>\n",
       "      <td>1518</td>\n",
       "      <td>1500</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>5000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>50000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>37</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>28314</td>\n",
       "      <td>28959</td>\n",
       "      <td>29547</td>\n",
       "      <td>2000</td>\n",
       "      <td>2019</td>\n",
       "      <td>1200</td>\n",
       "      <td>1100</td>\n",
       "      <td>1069</td>\n",
       "      <td>1000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>50000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>57</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>20940</td>\n",
       "      <td>19146</td>\n",
       "      <td>19131</td>\n",
       "      <td>2000</td>\n",
       "      <td>36681</td>\n",
       "      <td>10000</td>\n",
       "      <td>9000</td>\n",
       "      <td>689</td>\n",
       "      <td>679</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29995</th>\n",
       "      <td>220000</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>39</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>88004</td>\n",
       "      <td>31237</td>\n",
       "      <td>15980</td>\n",
       "      <td>8500</td>\n",
       "      <td>20000</td>\n",
       "      <td>5003</td>\n",
       "      <td>3047</td>\n",
       "      <td>5000</td>\n",
       "      <td>1000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29996</th>\n",
       "      <td>150000</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>43</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>8979</td>\n",
       "      <td>5190</td>\n",
       "      <td>0</td>\n",
       "      <td>1837</td>\n",
       "      <td>3526</td>\n",
       "      <td>8998</td>\n",
       "      <td>129</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29997</th>\n",
       "      <td>30000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>37</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>20878</td>\n",
       "      <td>20582</td>\n",
       "      <td>19357</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>22000</td>\n",
       "      <td>4200</td>\n",
       "      <td>2000</td>\n",
       "      <td>3100</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29998</th>\n",
       "      <td>80000</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>41</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>52774</td>\n",
       "      <td>11855</td>\n",
       "      <td>48944</td>\n",
       "      <td>85900</td>\n",
       "      <td>3409</td>\n",
       "      <td>1178</td>\n",
       "      <td>1926</td>\n",
       "      <td>52964</td>\n",
       "      <td>1804</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29999</th>\n",
       "      <td>50000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>46</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>36535</td>\n",
       "      <td>32428</td>\n",
       "      <td>15313</td>\n",
       "      <td>2078</td>\n",
       "      <td>1800</td>\n",
       "      <td>1430</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>30000 rows × 24 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "           X1  X2  X3  X4  X5  X6  X7  X8  X9  X10  ...    X15    X16    X17  \\\n",
       "0       20000   2   2   1  24   2   2  -1  -1   -2  ...      0      0      0   \n",
       "1      120000   2   2   2  26  -1   2   0   0    0  ...   3272   3455   3261   \n",
       "2       90000   2   2   2  34   0   0   0   0    0  ...  14331  14948  15549   \n",
       "3       50000   2   2   1  37   0   0   0   0    0  ...  28314  28959  29547   \n",
       "4       50000   1   2   1  57  -1   0  -1   0    0  ...  20940  19146  19131   \n",
       "...       ...  ..  ..  ..  ..  ..  ..  ..  ..  ...  ...    ...    ...    ...   \n",
       "29995  220000   1   3   1  39   0   0   0   0    0  ...  88004  31237  15980   \n",
       "29996  150000   1   3   2  43  -1  -1  -1  -1    0  ...   8979   5190      0   \n",
       "29997   30000   1   2   2  37   4   3   2  -1    0  ...  20878  20582  19357   \n",
       "29998   80000   1   3   1  41   1  -1   0   0    0  ...  52774  11855  48944   \n",
       "29999   50000   1   2   1  46   0   0   0   0    0  ...  36535  32428  15313   \n",
       "\n",
       "         X18    X19    X20   X21    X22   X23  Y  \n",
       "0          0    689      0     0      0     0  1  \n",
       "1          0   1000   1000  1000      0  2000  1  \n",
       "2       1518   1500   1000  1000   1000  5000  0  \n",
       "3       2000   2019   1200  1100   1069  1000  0  \n",
       "4       2000  36681  10000  9000    689   679  0  \n",
       "...      ...    ...    ...   ...    ...   ... ..  \n",
       "29995   8500  20000   5003  3047   5000  1000  0  \n",
       "29996   1837   3526   8998   129      0     0  0  \n",
       "29997      0      0  22000  4200   2000  3100  1  \n",
       "29998  85900   3409   1178  1926  52964  1804  1  \n",
       "29999   2078   1800   1430  1000   1000  1000  1  \n",
       "\n",
       "[30000 rows x 24 columns]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.drop(columns = 'Unnamed: 0', axis = 1, inplace = True)\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "target = 'Y'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Y</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29995</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29996</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29997</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29998</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29999</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>30000 rows × 1 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       Y\n",
       "0      1\n",
       "1      1\n",
       "2      0\n",
       "3      0\n",
       "4      0\n",
       "...   ..\n",
       "29995  0\n",
       "29996  0\n",
       "29997  1\n",
       "29998  1\n",
       "29999  1\n",
       "\n",
       "[30000 rows x 1 columns]"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y = df[[target]]\n",
    "y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['X1',\n",
       " 'X2',\n",
       " 'X3',\n",
       " 'X4',\n",
       " 'X5',\n",
       " 'X6',\n",
       " 'X7',\n",
       " 'X8',\n",
       " 'X9',\n",
       " 'X10',\n",
       " 'X11',\n",
       " 'X12',\n",
       " 'X13',\n",
       " 'X14',\n",
       " 'X15',\n",
       " 'X16',\n",
       " 'X17',\n",
       " 'X18',\n",
       " 'X19',\n",
       " 'X20',\n",
       " 'X21',\n",
       " 'X22',\n",
       " 'X23']"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predictors = [c for c in df.columns if c != target]\n",
    "predictors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>X1</th>\n",
       "      <th>X2</th>\n",
       "      <th>X3</th>\n",
       "      <th>X4</th>\n",
       "      <th>X5</th>\n",
       "      <th>X6</th>\n",
       "      <th>X7</th>\n",
       "      <th>X8</th>\n",
       "      <th>X9</th>\n",
       "      <th>X10</th>\n",
       "      <th>...</th>\n",
       "      <th>X14</th>\n",
       "      <th>X15</th>\n",
       "      <th>X16</th>\n",
       "      <th>X17</th>\n",
       "      <th>X18</th>\n",
       "      <th>X19</th>\n",
       "      <th>X20</th>\n",
       "      <th>X21</th>\n",
       "      <th>X22</th>\n",
       "      <th>X23</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>20000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>24</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-2</td>\n",
       "      <td>...</td>\n",
       "      <td>689</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>689</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>120000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>26</td>\n",
       "      <td>-1</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>2682</td>\n",
       "      <td>3272</td>\n",
       "      <td>3455</td>\n",
       "      <td>3261</td>\n",
       "      <td>0</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>0</td>\n",
       "      <td>2000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>90000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>34</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>13559</td>\n",
       "      <td>14331</td>\n",
       "      <td>14948</td>\n",
       "      <td>15549</td>\n",
       "      <td>1518</td>\n",
       "      <td>1500</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>5000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>50000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>37</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>49291</td>\n",
       "      <td>28314</td>\n",
       "      <td>28959</td>\n",
       "      <td>29547</td>\n",
       "      <td>2000</td>\n",
       "      <td>2019</td>\n",
       "      <td>1200</td>\n",
       "      <td>1100</td>\n",
       "      <td>1069</td>\n",
       "      <td>1000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>50000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>57</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>35835</td>\n",
       "      <td>20940</td>\n",
       "      <td>19146</td>\n",
       "      <td>19131</td>\n",
       "      <td>2000</td>\n",
       "      <td>36681</td>\n",
       "      <td>10000</td>\n",
       "      <td>9000</td>\n",
       "      <td>689</td>\n",
       "      <td>679</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29995</th>\n",
       "      <td>220000</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>39</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>208365</td>\n",
       "      <td>88004</td>\n",
       "      <td>31237</td>\n",
       "      <td>15980</td>\n",
       "      <td>8500</td>\n",
       "      <td>20000</td>\n",
       "      <td>5003</td>\n",
       "      <td>3047</td>\n",
       "      <td>5000</td>\n",
       "      <td>1000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29996</th>\n",
       "      <td>150000</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>43</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>3502</td>\n",
       "      <td>8979</td>\n",
       "      <td>5190</td>\n",
       "      <td>0</td>\n",
       "      <td>1837</td>\n",
       "      <td>3526</td>\n",
       "      <td>8998</td>\n",
       "      <td>129</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29997</th>\n",
       "      <td>30000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>37</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>2758</td>\n",
       "      <td>20878</td>\n",
       "      <td>20582</td>\n",
       "      <td>19357</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>22000</td>\n",
       "      <td>4200</td>\n",
       "      <td>2000</td>\n",
       "      <td>3100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29998</th>\n",
       "      <td>80000</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>41</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>76304</td>\n",
       "      <td>52774</td>\n",
       "      <td>11855</td>\n",
       "      <td>48944</td>\n",
       "      <td>85900</td>\n",
       "      <td>3409</td>\n",
       "      <td>1178</td>\n",
       "      <td>1926</td>\n",
       "      <td>52964</td>\n",
       "      <td>1804</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29999</th>\n",
       "      <td>50000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>46</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>49764</td>\n",
       "      <td>36535</td>\n",
       "      <td>32428</td>\n",
       "      <td>15313</td>\n",
       "      <td>2078</td>\n",
       "      <td>1800</td>\n",
       "      <td>1430</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>30000 rows × 23 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "           X1  X2  X3  X4  X5  X6  X7  X8  X9  X10  ...     X14    X15    X16  \\\n",
       "0       20000   2   2   1  24   2   2  -1  -1   -2  ...     689      0      0   \n",
       "1      120000   2   2   2  26  -1   2   0   0    0  ...    2682   3272   3455   \n",
       "2       90000   2   2   2  34   0   0   0   0    0  ...   13559  14331  14948   \n",
       "3       50000   2   2   1  37   0   0   0   0    0  ...   49291  28314  28959   \n",
       "4       50000   1   2   1  57  -1   0  -1   0    0  ...   35835  20940  19146   \n",
       "...       ...  ..  ..  ..  ..  ..  ..  ..  ..  ...  ...     ...    ...    ...   \n",
       "29995  220000   1   3   1  39   0   0   0   0    0  ...  208365  88004  31237   \n",
       "29996  150000   1   3   2  43  -1  -1  -1  -1    0  ...    3502   8979   5190   \n",
       "29997   30000   1   2   2  37   4   3   2  -1    0  ...    2758  20878  20582   \n",
       "29998   80000   1   3   1  41   1  -1   0   0    0  ...   76304  52774  11855   \n",
       "29999   50000   1   2   1  46   0   0   0   0    0  ...   49764  36535  32428   \n",
       "\n",
       "         X17    X18    X19    X20   X21    X22   X23  \n",
       "0          0      0    689      0     0      0     0  \n",
       "1       3261      0   1000   1000  1000      0  2000  \n",
       "2      15549   1518   1500   1000  1000   1000  5000  \n",
       "3      29547   2000   2019   1200  1100   1069  1000  \n",
       "4      19131   2000  36681  10000  9000    689   679  \n",
       "...      ...    ...    ...    ...   ...    ...   ...  \n",
       "29995  15980   8500  20000   5003  3047   5000  1000  \n",
       "29996      0   1837   3526   8998   129      0     0  \n",
       "29997  19357      0      0  22000  4200   2000  3100  \n",
       "29998  48944  85900   3409   1178  1926  52964  1804  \n",
       "29999  15313   2078   1800   1430  1000   1000  1000  \n",
       "\n",
       "[30000 rows x 23 columns]"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X = df[predictors]\n",
    "X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Learning the Logistic Regression, KNN, in Python and Scikit-Learn  \n",
    "\n",
    "There are 4 sub-question:\n",
    "\n",
    "1.1 Create the table to report the proportion of case of customers default payments in Taiwan separated by gender, education, and marital status. What is/are the interesting result/s you can draw from this table?[10 Points].\\\\"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "9.576666666666666"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(np.sum(df[df['X2'] == 1]['Y'])/len(df))*100 #By gender == Male"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.12543333333333334"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(np.sum(df[df['X2'] == 2]['Y'])/len(df))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Proportion of Default</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Proportion of Default\n",
       "2                    NaN\n",
       "1                    NaN"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sum_gen = pd.DataFrame(data = np.nan, columns = ['Proportion of Default'], index = df['X2'].unique())\n",
    "sum_gen"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Proportion of Default</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.125433</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.095767</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Proportion of Default\n",
       "2               0.125433\n",
       "1               0.095767"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sum_gen.iloc[1] = (np.sum(df[df['X2'] == 1]['Y'])/len(df)) #By gender == Male\n",
    "sum_gen.iloc[0] = (np.sum(df[df['X2'] == 2]['Y'])/len(df)) #By gender == Female\n",
    "sum_gen"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Proportion of Default</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Female</th>\n",
       "      <td>0.125433</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Male</th>\n",
       "      <td>0.095767</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        Proportion of Default\n",
       "Female               0.125433\n",
       "Male                 0.095767"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sum_gen.rename(index = {1:'Male', 2:'Female'}, inplace = True)\n",
    "sum_gen"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Proportion of Default</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>graduate school</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>university</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>high school</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>others</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 Proportion of Default\n",
       "graduate school                    NaN\n",
       "university                         NaN\n",
       "high school                        NaN\n",
       "others                             NaN"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sum_educ = pd.DataFrame(data = np.nan, columns = ['Proportion of Default'], index = range(4))\n",
    "sum_educ.rename(index = {0 : 'graduate school', 1 : 'university', 2 : 'high school', 3 : 'others'}, inplace= True)\n",
    "sum_educ"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Proportion of Default</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>graduate school</th>\n",
       "      <td>0.067867</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>university</th>\n",
       "      <td>0.111000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>high school</th>\n",
       "      <td>0.041233</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>others</th>\n",
       "      <td>0.000233</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 Proportion of Default\n",
       "graduate school               0.067867\n",
       "university                    0.111000\n",
       "high school                   0.041233\n",
       "others                        0.000233"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sum_educ.iloc[0] = (np.sum(df[df['X3'] == 1]['Y'])/len(df)) \n",
    "sum_educ.iloc[1] = (np.sum(df[df['X3'] == 2]['Y'])/len(df))\n",
    "sum_educ.iloc[2] = (np.sum(df[df['X3'] == 3]['Y'])/len(df)) \n",
    "sum_educ.iloc[3] = (np.sum(df[df['X3'] == 4]['Y'])/len(df))\n",
    "sum_educ"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Proportion of Default</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>married</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>single</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>others</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         Proportion of Default\n",
       "married                    NaN\n",
       "single                     NaN\n",
       "others                     NaN"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sum_marr = pd.DataFrame(data = np.nan, columns = ['Proportion of Default'], index = range(3))\n",
    "sum_marr.rename(index = {0 : 'married', 1 : 'single', 2 : 'others'}, inplace= True)\n",
    "sum_marr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Proportion of Default</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>married</th>\n",
       "      <td>0.106867</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>single</th>\n",
       "      <td>0.111367</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>others</th>\n",
       "      <td>0.002800</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         Proportion of Default\n",
       "married               0.106867\n",
       "single                0.111367\n",
       "others                0.002800"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sum_marr.iloc[0] = (np.sum(df[df['X4'] == 1]['Y'])/len(df)) \n",
    "sum_marr.iloc[1] = (np.sum(df[df['X4'] == 2]['Y'])/len(df))\n",
    "sum_marr.iloc[2] = (np.sum(df[df['X4'] == 3]['Y'])/len(df)) \n",
    "sum_marr"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "According to the tables, \n",
    "- women are more likely to cause payment default compared to men.\n",
    "- People who has the university education level are more likely to cause payment default.\n",
    "- Single and married are almost indifferenet in term of payment default."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, partition the data into training and test sets using 80 $\\%$ to be the train data. The model will be fit to the training data and evaluated on the test set. \n",
    "\n",
    "1.2 Use the Logistic regression to train the model with all features. Then, use the test data to conduct the confusion matrix. Interpret the results carefully [10 Points]."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train1, X_test1, y_train1, y_test1 = model_selection.train_test_split(X, y, test_size=0.2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\thana\\anaconda3\\lib\\site-packages\\sklearn\\utils\\validation.py:63: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
      "  return f(*args, **kwargs)\n",
      "C:\\Users\\thana\\anaconda3\\lib\\site-packages\\sklearn\\linear_model\\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "LogisticRegression()"
      ]
     },
     "execution_count": 69,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "logistic_model = linear_model.LogisticRegression()\n",
    "logistic_model.fit(X_train1, y_train1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "   No deault       0.77      1.00      0.87      4640\n",
      "     Default       0.00      0.00      0.00      1360\n",
      "\n",
      "    accuracy                           0.77      6000\n",
      "   macro avg       0.39      0.50      0.44      6000\n",
      "weighted avg       0.60      0.77      0.67      6000\n",
      "\n"
     ]
    }
   ],
   "source": [
    "report = metrics.classification_report(\n",
    "    y_test1, logistic_model.predict(X_test1),\n",
    "    target_names = ['No deault', 'Default']\n",
    ")\n",
    "print(report)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Precision and recall rates are all zero. The model cannot predict the default customer correctly."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1.3 Use the KNN to train the model with all features (you should convert the features to be the standarized variables first). Then, use the test data to conduct the confusion matrix. Interpret the results carefully [10 Points]."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.model_selection  import KFold"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\thana\\anaconda3\\lib\\site-packages\\sklearn\\neighbors\\_classification.py:179: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
      "  return self._fit(X, y)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0.6974377777777778"
      ]
     },
     "execution_count": 100,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "classifier = Pipeline([('norm', StandardScaler()),('knn', KNeighborsClassifier())])\n",
    "classifier.fit(X_train1.to_numpy(), y_train1.to_numpy())\n",
    "prediction = classifier.predict(X_test1.to_numpy())\n",
    "curmean = np.mean(prediction == y_test1.to_numpy())\n",
    "curmean"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    " 1.4 Compare the predictive performance of two models. Which one you select? And Why? [10 Points]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#put your code here"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
