{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Test3: (40 marks)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Your ID:   6504930246"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this excercise, we aimed at the case of customers default payments in Taiwan and compares the predictive accuracy of probability of default among data mining methods. From the perspective of risk management, the result of predictive accuracy of the estimated probability of default will be more valuable than the binary result of classification - credible or not credible clients. With the real probability of default as the response variable (Y), and the predictive probability of default as the independent variable (X), this research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables: \\\\\n",
    "\n",
    "X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit. \\\\\n",
    "\n",
    "X2: Gender (1 = male; 2 = female). \\\\\n",
    "\n",
    "X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). \\\\\n",
    "\n",
    "X4: Marital status (1 = married; 2 = single; 3 = others). \\\\\n",
    "\n",
    "X5: Age (year). \\\\\n",
    "\n",
    "X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. \\\\\n",
    "\n",
    "X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005. \\\\\n",
    "\n",
    "X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005. \\\\\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "from pathlib import Path\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.decomposition import PCA\n",
    "from sklearn import preprocessing\n",
    "import matplotlib.pylab as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "import pandas as pd\n",
    "import statsmodels.api as sm\n",
    "from statsmodels.iolib.summary2 import summary_col"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unnamed: 0</th>\n",
       "      <th>X1</th>\n",
       "      <th>X2</th>\n",
       "      <th>X3</th>\n",
       "      <th>X4</th>\n",
       "      <th>X5</th>\n",
       "      <th>X6</th>\n",
       "      <th>X7</th>\n",
       "      <th>X8</th>\n",
       "      <th>X9</th>\n",
       "      <th>...</th>\n",
       "      <th>X15</th>\n",
       "      <th>X16</th>\n",
       "      <th>X17</th>\n",
       "      <th>X18</th>\n",
       "      <th>X19</th>\n",
       "      <th>X20</th>\n",
       "      <th>X21</th>\n",
       "      <th>X22</th>\n",
       "      <th>X23</th>\n",
       "      <th>Y</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>20000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>24</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>689</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>120000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>26</td>\n",
       "      <td>-1</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>3272</td>\n",
       "      <td>3455</td>\n",
       "      <td>3261</td>\n",
       "      <td>0</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>0</td>\n",
       "      <td>2000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>90000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>34</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>14331</td>\n",
       "      <td>14948</td>\n",
       "      <td>15549</td>\n",
       "      <td>1518</td>\n",
       "      <td>1500</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>5000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>50000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>37</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>28314</td>\n",
       "      <td>28959</td>\n",
       "      <td>29547</td>\n",
       "      <td>2000</td>\n",
       "      <td>2019</td>\n",
       "      <td>1200</td>\n",
       "      <td>1100</td>\n",
       "      <td>1069</td>\n",
       "      <td>1000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>50000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>57</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>20940</td>\n",
       "      <td>19146</td>\n",
       "      <td>19131</td>\n",
       "      <td>2000</td>\n",
       "      <td>36681</td>\n",
       "      <td>10000</td>\n",
       "      <td>9000</td>\n",
       "      <td>689</td>\n",
       "      <td>679</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29995</th>\n",
       "      <td>29996</td>\n",
       "      <td>220000</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>39</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>88004</td>\n",
       "      <td>31237</td>\n",
       "      <td>15980</td>\n",
       "      <td>8500</td>\n",
       "      <td>20000</td>\n",
       "      <td>5003</td>\n",
       "      <td>3047</td>\n",
       "      <td>5000</td>\n",
       "      <td>1000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29996</th>\n",
       "      <td>29997</td>\n",
       "      <td>150000</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>43</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>...</td>\n",
       "      <td>8979</td>\n",
       "      <td>5190</td>\n",
       "      <td>0</td>\n",
       "      <td>1837</td>\n",
       "      <td>3526</td>\n",
       "      <td>8998</td>\n",
       "      <td>129</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29997</th>\n",
       "      <td>29998</td>\n",
       "      <td>30000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>37</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>-1</td>\n",
       "      <td>...</td>\n",
       "      <td>20878</td>\n",
       "      <td>20582</td>\n",
       "      <td>19357</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>22000</td>\n",
       "      <td>4200</td>\n",
       "      <td>2000</td>\n",
       "      <td>3100</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29998</th>\n",
       "      <td>29999</td>\n",
       "      <td>80000</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>41</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>52774</td>\n",
       "      <td>11855</td>\n",
       "      <td>48944</td>\n",
       "      <td>85900</td>\n",
       "      <td>3409</td>\n",
       "      <td>1178</td>\n",
       "      <td>1926</td>\n",
       "      <td>52964</td>\n",
       "      <td>1804</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29999</th>\n",
       "      <td>30000</td>\n",
       "      <td>50000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>46</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>36535</td>\n",
       "      <td>32428</td>\n",
       "      <td>15313</td>\n",
       "      <td>2078</td>\n",
       "      <td>1800</td>\n",
       "      <td>1430</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>30000 rows × 25 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       Unnamed: 0      X1  X2  X3  X4  X5  X6  X7  X8  X9  ...    X15    X16  \\\n",
       "0               1   20000   2   2   1  24   2   2  -1  -1  ...      0      0   \n",
       "1               2  120000   2   2   2  26  -1   2   0   0  ...   3272   3455   \n",
       "2               3   90000   2   2   2  34   0   0   0   0  ...  14331  14948   \n",
       "3               4   50000   2   2   1  37   0   0   0   0  ...  28314  28959   \n",
       "4               5   50000   1   2   1  57  -1   0  -1   0  ...  20940  19146   \n",
       "...           ...     ...  ..  ..  ..  ..  ..  ..  ..  ..  ...    ...    ...   \n",
       "29995       29996  220000   1   3   1  39   0   0   0   0  ...  88004  31237   \n",
       "29996       29997  150000   1   3   2  43  -1  -1  -1  -1  ...   8979   5190   \n",
       "29997       29998   30000   1   2   2  37   4   3   2  -1  ...  20878  20582   \n",
       "29998       29999   80000   1   3   1  41   1  -1   0   0  ...  52774  11855   \n",
       "29999       30000   50000   1   2   1  46   0   0   0   0  ...  36535  32428   \n",
       "\n",
       "         X17    X18    X19    X20   X21    X22   X23  Y  \n",
       "0          0      0    689      0     0      0     0  1  \n",
       "1       3261      0   1000   1000  1000      0  2000  1  \n",
       "2      15549   1518   1500   1000  1000   1000  5000  0  \n",
       "3      29547   2000   2019   1200  1100   1069  1000  0  \n",
       "4      19131   2000  36681  10000  9000    689   679  0  \n",
       "...      ...    ...    ...    ...   ...    ...   ... ..  \n",
       "29995  15980   8500  20000   5003  3047   5000  1000  0  \n",
       "29996      0   1837   3526   8998   129      0     0  0  \n",
       "29997  19357      0      0  22000  4200   2000  3100  1  \n",
       "29998  48944  85900   3409   1178  1926  52964  1804  1  \n",
       "29999  15313   2078   1800   1430  1000   1000  1000  1  \n",
       "\n",
       "[30000 rows x 25 columns]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X1 = pd.read_excel(\"default of credit card clients.xls\")\n",
    "X1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>X1</th>\n",
       "      <th>X2</th>\n",
       "      <th>X3</th>\n",
       "      <th>X4</th>\n",
       "      <th>X5</th>\n",
       "      <th>X6</th>\n",
       "      <th>X7</th>\n",
       "      <th>X8</th>\n",
       "      <th>X9</th>\n",
       "      <th>X10</th>\n",
       "      <th>...</th>\n",
       "      <th>X15</th>\n",
       "      <th>X16</th>\n",
       "      <th>X17</th>\n",
       "      <th>X18</th>\n",
       "      <th>X19</th>\n",
       "      <th>X20</th>\n",
       "      <th>X21</th>\n",
       "      <th>X22</th>\n",
       "      <th>X23</th>\n",
       "      <th>Y</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>30000.000000</td>\n",
       "      <td>30000.000000</td>\n",
       "      <td>30000.000000</td>\n",
       "      <td>30000.000000</td>\n",
       "      <td>30000.000000</td>\n",
       "      <td>30000.000000</td>\n",
       "      <td>30000.000000</td>\n",
       "      <td>30000.000000</td>\n",
       "      <td>30000.000000</td>\n",
       "      <td>30000.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>30000.000000</td>\n",
       "      <td>30000.000000</td>\n",
       "      <td>30000.000000</td>\n",
       "      <td>30000.000000</td>\n",
       "      <td>3.000000e+04</td>\n",
       "      <td>30000.00000</td>\n",
       "      <td>30000.000000</td>\n",
       "      <td>30000.000000</td>\n",
       "      <td>30000.000000</td>\n",
       "      <td>30000.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>167484.322667</td>\n",
       "      <td>1.603733</td>\n",
       "      <td>1.853133</td>\n",
       "      <td>1.551867</td>\n",
       "      <td>35.485500</td>\n",
       "      <td>-0.016700</td>\n",
       "      <td>-0.133767</td>\n",
       "      <td>-0.166200</td>\n",
       "      <td>-0.220667</td>\n",
       "      <td>-0.266200</td>\n",
       "      <td>...</td>\n",
       "      <td>43262.948967</td>\n",
       "      <td>40311.400967</td>\n",
       "      <td>38871.760400</td>\n",
       "      <td>5663.580500</td>\n",
       "      <td>5.921163e+03</td>\n",
       "      <td>5225.68150</td>\n",
       "      <td>4826.076867</td>\n",
       "      <td>4799.387633</td>\n",
       "      <td>5215.502567</td>\n",
       "      <td>0.221200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>129747.661567</td>\n",
       "      <td>0.489129</td>\n",
       "      <td>0.790349</td>\n",
       "      <td>0.521970</td>\n",
       "      <td>9.217904</td>\n",
       "      <td>1.123802</td>\n",
       "      <td>1.197186</td>\n",
       "      <td>1.196868</td>\n",
       "      <td>1.169139</td>\n",
       "      <td>1.133187</td>\n",
       "      <td>...</td>\n",
       "      <td>64332.856134</td>\n",
       "      <td>60797.155770</td>\n",
       "      <td>59554.107537</td>\n",
       "      <td>16563.280354</td>\n",
       "      <td>2.304087e+04</td>\n",
       "      <td>17606.96147</td>\n",
       "      <td>15666.159744</td>\n",
       "      <td>15278.305679</td>\n",
       "      <td>17777.465775</td>\n",
       "      <td>0.415062</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>10000.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>21.000000</td>\n",
       "      <td>-2.000000</td>\n",
       "      <td>-2.000000</td>\n",
       "      <td>-2.000000</td>\n",
       "      <td>-2.000000</td>\n",
       "      <td>-2.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>-170000.000000</td>\n",
       "      <td>-81334.000000</td>\n",
       "      <td>-339603.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000e+00</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>50000.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>28.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>2326.750000</td>\n",
       "      <td>1763.000000</td>\n",
       "      <td>1256.000000</td>\n",
       "      <td>1000.000000</td>\n",
       "      <td>8.330000e+02</td>\n",
       "      <td>390.00000</td>\n",
       "      <td>296.000000</td>\n",
       "      <td>252.500000</td>\n",
       "      <td>117.750000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>140000.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>34.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>19052.000000</td>\n",
       "      <td>18104.500000</td>\n",
       "      <td>17071.000000</td>\n",
       "      <td>2100.000000</td>\n",
       "      <td>2.009000e+03</td>\n",
       "      <td>1800.00000</td>\n",
       "      <td>1500.000000</td>\n",
       "      <td>1500.000000</td>\n",
       "      <td>1500.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>240000.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>41.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>54506.000000</td>\n",
       "      <td>50190.500000</td>\n",
       "      <td>49198.250000</td>\n",
       "      <td>5006.000000</td>\n",
       "      <td>5.000000e+03</td>\n",
       "      <td>4505.00000</td>\n",
       "      <td>4013.250000</td>\n",
       "      <td>4031.500000</td>\n",
       "      <td>4000.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>1000000.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>6.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>79.000000</td>\n",
       "      <td>8.000000</td>\n",
       "      <td>8.000000</td>\n",
       "      <td>8.000000</td>\n",
       "      <td>8.000000</td>\n",
       "      <td>8.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>891586.000000</td>\n",
       "      <td>927171.000000</td>\n",
       "      <td>961664.000000</td>\n",
       "      <td>873552.000000</td>\n",
       "      <td>1.684259e+06</td>\n",
       "      <td>896040.00000</td>\n",
       "      <td>621000.000000</td>\n",
       "      <td>426529.000000</td>\n",
       "      <td>528666.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>8 rows × 24 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                   X1            X2            X3            X4            X5  \\\n",
       "count    30000.000000  30000.000000  30000.000000  30000.000000  30000.000000   \n",
       "mean    167484.322667      1.603733      1.853133      1.551867     35.485500   \n",
       "std     129747.661567      0.489129      0.790349      0.521970      9.217904   \n",
       "min      10000.000000      1.000000      0.000000      0.000000     21.000000   \n",
       "25%      50000.000000      1.000000      1.000000      1.000000     28.000000   \n",
       "50%     140000.000000      2.000000      2.000000      2.000000     34.000000   \n",
       "75%     240000.000000      2.000000      2.000000      2.000000     41.000000   \n",
       "max    1000000.000000      2.000000      6.000000      3.000000     79.000000   \n",
       "\n",
       "                 X6            X7            X8            X9           X10  \\\n",
       "count  30000.000000  30000.000000  30000.000000  30000.000000  30000.000000   \n",
       "mean      -0.016700     -0.133767     -0.166200     -0.220667     -0.266200   \n",
       "std        1.123802      1.197186      1.196868      1.169139      1.133187   \n",
       "min       -2.000000     -2.000000     -2.000000     -2.000000     -2.000000   \n",
       "25%       -1.000000     -1.000000     -1.000000     -1.000000     -1.000000   \n",
       "50%        0.000000      0.000000      0.000000      0.000000      0.000000   \n",
       "75%        0.000000      0.000000      0.000000      0.000000      0.000000   \n",
       "max        8.000000      8.000000      8.000000      8.000000      8.000000   \n",
       "\n",
       "       ...            X15            X16            X17            X18  \\\n",
       "count  ...   30000.000000   30000.000000   30000.000000   30000.000000   \n",
       "mean   ...   43262.948967   40311.400967   38871.760400    5663.580500   \n",
       "std    ...   64332.856134   60797.155770   59554.107537   16563.280354   \n",
       "min    ... -170000.000000  -81334.000000 -339603.000000       0.000000   \n",
       "25%    ...    2326.750000    1763.000000    1256.000000    1000.000000   \n",
       "50%    ...   19052.000000   18104.500000   17071.000000    2100.000000   \n",
       "75%    ...   54506.000000   50190.500000   49198.250000    5006.000000   \n",
       "max    ...  891586.000000  927171.000000  961664.000000  873552.000000   \n",
       "\n",
       "                X19           X20            X21            X22  \\\n",
       "count  3.000000e+04   30000.00000   30000.000000   30000.000000   \n",
       "mean   5.921163e+03    5225.68150    4826.076867    4799.387633   \n",
       "std    2.304087e+04   17606.96147   15666.159744   15278.305679   \n",
       "min    0.000000e+00       0.00000       0.000000       0.000000   \n",
       "25%    8.330000e+02     390.00000     296.000000     252.500000   \n",
       "50%    2.009000e+03    1800.00000    1500.000000    1500.000000   \n",
       "75%    5.000000e+03    4505.00000    4013.250000    4031.500000   \n",
       "max    1.684259e+06  896040.00000  621000.000000  426529.000000   \n",
       "\n",
       "                 X23             Y  \n",
       "count   30000.000000  30000.000000  \n",
       "mean     5215.502567      0.221200  \n",
       "std     17777.465775      0.415062  \n",
       "min         0.000000      0.000000  \n",
       "25%       117.750000      0.000000  \n",
       "50%      1500.000000      0.000000  \n",
       "75%      4000.000000      0.000000  \n",
       "max    528666.000000      1.000000  \n",
       "\n",
       "[8 rows x 24 columns]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X1.columns\n",
    "#X1 = X1.drop(\"Unnamed: 0\", axis = 1)\n",
    "X1.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Learning the Logistic Regression, KNN, in Python and Scikit-Learn  \n",
    "\n",
    "There are 4 sub-question:\n",
    "\n",
    "1.1 Create the table to report the proportion of case of customers default payments in Taiwan separated by gender, education, and marital status. What is/are the interesting result/s you can draw from this table?[10 Points].\\\\"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Y</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.241672</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.207763</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.192348</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.237349</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.251576</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.056911</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.064286</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0.156863</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.092593</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.234717</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.209283</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.260062</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          Y\n",
       "1  0.241672\n",
       "2  0.207763\n",
       "0  0.000000\n",
       "1  0.192348\n",
       "2  0.237349\n",
       "3  0.251576\n",
       "4  0.056911\n",
       "5  0.064286\n",
       "6  0.156863\n",
       "0  0.092593\n",
       "1  0.234717\n",
       "2  0.209283\n",
       "3  0.260062"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#put your code here\n",
    "X2 = pd.DataFrame()\n",
    "\n",
    "tablevac = pd.pivot_table(X1, values = \"Y\", index = [\"X2\"], aggfunc = np.mean)\n",
    "\n",
    "\n",
    "tablesw = pd.pivot_table(X1, values = \"Y\", index = [\"X3\"], aggfunc = np.mean)\n",
    "tablesw\n",
    "\n",
    "tableslot = pd.pivot_table(X1, values = \"Y\", index = [\"X4\"], aggfunc = np.mean)\n",
    "tableslot\n",
    "\n",
    "tablefull = pd.concat([tablevac, tablesw, tableslot], axis = 0)\n",
    "#tablefull= tablefull.transpose()\n",
    "tablefull"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The interesting results that I can gather from the data being split out but by the demographic information is:\n",
    "\n",
    " - There are similar default rates among men and women (24% and 21%)\n",
    " - Individuals that were less educated (only with a high school education for example) have the highest probability of defaulting on their debt (25%) than all other educational backgrounds.\n",
    " - There were similar level of default rates depending on the relationship status of individuals, although \"other\" had the highest proportion of defaulting on their loans."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, partition the data into training and test sets using 80 $\\%$ to be the train data. The model will be fit to the training data and evaluated on the test set. \n",
    "\n",
    "1.2 Use the Logistic regression to train the model with all features. Then, use the test data to conduct the confusion matrix. Interpret the results carefully [10 Points]."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0        1\n",
       "1        1\n",
       "2        0\n",
       "3        0\n",
       "4        0\n",
       "        ..\n",
       "29995    0\n",
       "29996    0\n",
       "29997    1\n",
       "29998    1\n",
       "29999    1\n",
       "Name: Y, Length: 30000, dtype: int64"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#y = X1[\"Y\"]\n",
    "#X1 = X1.drop(\"Y\", axis =1)\n",
    "y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import (\n",
    "    linear_model, metrics, pipeline, preprocessing, model_selection\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "#put your code here\n",
    "\n",
    "# Split dataset into training set and test set\n",
    "X_train, X_test, y_train, y_test = train_test_split(X1, y, test_size=0.2)\n",
    "# 80% train\n",
    "# Shuffle being false keeps the train and test split the same"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LogisticRegression()"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "linear_model.LogisticRegression()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/devonkelly/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "LogisticRegression()"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "logistic_position_model = linear_model.LogisticRegression(solver=\"lbfgs\")\n",
    "logistic_position_model.fit(X_train, y_train)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.7795833333333333, 0.7746666666666666)"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_acc = logistic_position_model.score(X_train, y_train)\n",
    "test_acc = logistic_position_model.score(X_test, y_test)\n",
    "\n",
    "train_acc, test_acc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Results of logistic regression with all features^^"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.7795833333333333, 0.7746666666666666)"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_acc = logistic_position_model.score(X_train, y_train)\n",
    "test_acc = logistic_position_model.score(X_test, y_test)\n",
    "\n",
    "train_acc, test_acc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "  no default       0.78      1.00      0.87      4651\n",
      "     default       0.00      0.00      0.00      1349\n",
      "\n",
      "    accuracy                           0.77      6000\n",
      "   macro avg       0.39      0.50      0.44      6000\n",
      "weighted avg       0.60      0.77      0.68      6000\n",
      "\n"
     ]
    }
   ],
   "source": [
    "\n",
    "\n",
    "report = metrics.classification_report(\n",
    "    y_test ,logistic_model1.predict(X_test),\n",
    "        target_names=[\"no default\", \"default\"]\n",
    ")\n",
    "print(report)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1.3 Use the KNN to train the model with all features (you should convert the features to be the standarized variables first). Then, use the test data to conduct the confusion matrix. Interpret the results carefully [10 Points]."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "from matplotlib import pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.neighbors import KNeighborsClassifier # now importing KN neighbors to do our analysis\n",
    "classifier = KNeighborsClassifier(n_neighbors=9) # Selecting 9 different points to compare against (ODD number)\n",
    "\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection  import KFold\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "kf = KFold(n_splits=5, shuffle= True) # KFold to split and shuffle the data, now contains 20% of the data equally"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 20000,      2,      2, ...,      0,      0,      0],\n",
       "       [120000,      2,      2, ...,   1000,      0,   2000],\n",
       "       [ 90000,      2,      2, ...,   1000,   1000,   5000],\n",
       "       ...,\n",
       "       [ 30000,      1,      2, ...,   4200,   2000,   3100],\n",
       "       [ 80000,      1,      3, ...,   1926,  52964,   1804],\n",
       "       [ 50000,      1,      2, ...,   1000,   1000,   1000]])"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "features = X1\n",
    "labels = y\n",
    "features = features.to_numpy()\n",
    "features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean accuracy: 80.4%\n"
     ]
    }
   ],
   "source": [
    "means = [] # mean = 0, starting accuracy rate because we haven't applied anything yet\n",
    "# Get cumulative mean from the model, then have to average the results again because have multiple means\n",
    "\n",
    "for training,testing in kf.split(features): # splitting data into 2 parts, training and testing\n",
    "   # We learn a model for this fold with `fit` and then apply it to the\n",
    "   # testing data with `predict`:\n",
    "   classifier.fit(features[training], labels[training])\n",
    "   prediction = classifier.predict(features[testing])\n",
    "   # np.mean on an array of booleans returns fraction\n",
    " # of correct decisions for this fold:\n",
    "   curmean = np.mean(prediction == labels[testing])\n",
    "   means.append(curmean)\n",
    "print('Mean accuracy: {:.1%}'.format(np.mean(means)))\n",
    "\n",
    "# This is a very good model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean accuracy: 80.5%\n"
     ]
    }
   ],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "classifier = KNeighborsClassifier(n_neighbors = 9) # Setting as 9 since it is the same as above\n",
    "classifier = Pipeline([('norm', StandardScaler()), ('knn', classifier)])\n",
    "# Standard scalar gets the z score of selected data. To just use the mean in calculation set \"with_std = False\"\n",
    "\n",
    "means = []\n",
    "for training,testing in kf.split(features):\n",
    "    # We learn a model for this fold with `fit` and then apply it to the\n",
    "    # testing data with `predict`:\n",
    "    classifier.fit(features[training], labels[training])\n",
    "    prediction = classifier.predict(features[testing])\n",
    "\n",
    "    # Concept is almost the same as the kf fold above\n",
    "    # np.mean on an array of booleans returns fraction\n",
    "    # of correct decisions for this fold:\n",
    "    curmean = np.mean(prediction == labels[testing])\n",
    "    means.append(curmean)\n",
    "print('Mean accuracy: {:.1%}'.format(np.mean(means)))\n",
    "\n",
    "# Mean accuracy has actually improved. Using the z-score is better, as we have learned\n",
    "# Ajarn's did not improve, but that is not a big problem"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "           0       0.83      0.94      0.88      4662\n",
      "           1       0.61      0.32      0.42      1338\n",
      "\n",
      "    accuracy                           0.80      6000\n",
      "   macro avg       0.72      0.63      0.65      6000\n",
      "weighted avg       0.78      0.80      0.78      6000\n",
      "\n"
     ]
    }
   ],
   "source": [
    "report = metrics.classification_report(\n",
    "    labels[testing] ,classifier.predict(features[testing]))\n",
    "print(report)\n",
    "\n",
    "# Classification matrix that we have used before"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results show that the model has an accuracy rate of 78%, and that the model is much better at predicting an outcome of 0 - no default - (with both a better precision and recall rate) than an outcome of 1 - default\n",
    "\n",
    "The results also show that the standardized data, using StandardScalar, gives better results than when we used non-standardized data (80.5% vs. 80.4% mean accuracy)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    " 1.4 Compare the predictive performance of two models. Which one you select? And Why? [10 Points]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#put your code here"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}