{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Test3: (40 marks)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Your ID: 6304640391"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this excercise, we aimed at the case of customers default payments in Taiwan and compares the predictive accuracy of probability of default among data mining methods. From the perspective of risk management, the result of predictive accuracy of the estimated probability of default will be more valuable than the binary result of classification - credible or not credible clients. With the real probability of default as the response variable (Y), and the predictive probability of default as the independent variable (X), this research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables: \\\\\n",
    "\n",
    "X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit. \\\\\n",
    "\n",
    "X2: Gender (1 = male; 2 = female). \\\\\n",
    "\n",
    "X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). \\\\\n",
    "\n",
    "X4: Marital status (1 = married; 2 = single; 3 = others). \\\\\n",
    "\n",
    "X5: Age (year). \\\\\n",
    "\n",
    "X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. \\\\\n",
    "\n",
    "X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005. \\\\\n",
    "\n",
    "X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005. \\\\\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Learning the Logistic Regression, KNN, in Python and Scikit-Learn  \n",
    "\n",
    "There are 4 sub-question:\n",
    "\n",
    "1.1 Create the table to report the proportion of case of customers default payments in Taiwan separated by gender, education, and marital status. What is/are the interesting result/s you can draw from this table?[10 Points].\\\\"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unnamed: 0</th>\n",
       "      <th>X1</th>\n",
       "      <th>X2</th>\n",
       "      <th>X3</th>\n",
       "      <th>X4</th>\n",
       "      <th>X5</th>\n",
       "      <th>X6</th>\n",
       "      <th>X7</th>\n",
       "      <th>X8</th>\n",
       "      <th>X9</th>\n",
       "      <th>...</th>\n",
       "      <th>X15</th>\n",
       "      <th>X16</th>\n",
       "      <th>X17</th>\n",
       "      <th>X18</th>\n",
       "      <th>X19</th>\n",
       "      <th>X20</th>\n",
       "      <th>X21</th>\n",
       "      <th>X22</th>\n",
       "      <th>X23</th>\n",
       "      <th>Y</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>20000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>24</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>689</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>120000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>26</td>\n",
       "      <td>-1</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>3272</td>\n",
       "      <td>3455</td>\n",
       "      <td>3261</td>\n",
       "      <td>0</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>0</td>\n",
       "      <td>2000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>90000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>34</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>14331</td>\n",
       "      <td>14948</td>\n",
       "      <td>15549</td>\n",
       "      <td>1518</td>\n",
       "      <td>1500</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>5000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>50000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>37</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>28314</td>\n",
       "      <td>28959</td>\n",
       "      <td>29547</td>\n",
       "      <td>2000</td>\n",
       "      <td>2019</td>\n",
       "      <td>1200</td>\n",
       "      <td>1100</td>\n",
       "      <td>1069</td>\n",
       "      <td>1000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>50000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>57</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>20940</td>\n",
       "      <td>19146</td>\n",
       "      <td>19131</td>\n",
       "      <td>2000</td>\n",
       "      <td>36681</td>\n",
       "      <td>10000</td>\n",
       "      <td>9000</td>\n",
       "      <td>689</td>\n",
       "      <td>679</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29995</th>\n",
       "      <td>29996</td>\n",
       "      <td>220000</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>39</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>88004</td>\n",
       "      <td>31237</td>\n",
       "      <td>15980</td>\n",
       "      <td>8500</td>\n",
       "      <td>20000</td>\n",
       "      <td>5003</td>\n",
       "      <td>3047</td>\n",
       "      <td>5000</td>\n",
       "      <td>1000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29996</th>\n",
       "      <td>29997</td>\n",
       "      <td>150000</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>43</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>...</td>\n",
       "      <td>8979</td>\n",
       "      <td>5190</td>\n",
       "      <td>0</td>\n",
       "      <td>1837</td>\n",
       "      <td>3526</td>\n",
       "      <td>8998</td>\n",
       "      <td>129</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29997</th>\n",
       "      <td>29998</td>\n",
       "      <td>30000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>37</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>-1</td>\n",
       "      <td>...</td>\n",
       "      <td>20878</td>\n",
       "      <td>20582</td>\n",
       "      <td>19357</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>22000</td>\n",
       "      <td>4200</td>\n",
       "      <td>2000</td>\n",
       "      <td>3100</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29998</th>\n",
       "      <td>29999</td>\n",
       "      <td>80000</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>41</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>52774</td>\n",
       "      <td>11855</td>\n",
       "      <td>48944</td>\n",
       "      <td>85900</td>\n",
       "      <td>3409</td>\n",
       "      <td>1178</td>\n",
       "      <td>1926</td>\n",
       "      <td>52964</td>\n",
       "      <td>1804</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29999</th>\n",
       "      <td>30000</td>\n",
       "      <td>50000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>46</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>36535</td>\n",
       "      <td>32428</td>\n",
       "      <td>15313</td>\n",
       "      <td>2078</td>\n",
       "      <td>1800</td>\n",
       "      <td>1430</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>30000 rows × 25 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       Unnamed: 0      X1  X2  X3  X4  X5  X6  X7  X8  X9  ...    X15    X16  \\\n",
       "0               1   20000   2   2   1  24   2   2  -1  -1  ...      0      0   \n",
       "1               2  120000   2   2   2  26  -1   2   0   0  ...   3272   3455   \n",
       "2               3   90000   2   2   2  34   0   0   0   0  ...  14331  14948   \n",
       "3               4   50000   2   2   1  37   0   0   0   0  ...  28314  28959   \n",
       "4               5   50000   1   2   1  57  -1   0  -1   0  ...  20940  19146   \n",
       "...           ...     ...  ..  ..  ..  ..  ..  ..  ..  ..  ...    ...    ...   \n",
       "29995       29996  220000   1   3   1  39   0   0   0   0  ...  88004  31237   \n",
       "29996       29997  150000   1   3   2  43  -1  -1  -1  -1  ...   8979   5190   \n",
       "29997       29998   30000   1   2   2  37   4   3   2  -1  ...  20878  20582   \n",
       "29998       29999   80000   1   3   1  41   1  -1   0   0  ...  52774  11855   \n",
       "29999       30000   50000   1   2   1  46   0   0   0   0  ...  36535  32428   \n",
       "\n",
       "         X17    X18    X19    X20   X21    X22   X23  Y  \n",
       "0          0      0    689      0     0      0     0  1  \n",
       "1       3261      0   1000   1000  1000      0  2000  1  \n",
       "2      15549   1518   1500   1000  1000   1000  5000  0  \n",
       "3      29547   2000   2019   1200  1100   1069  1000  0  \n",
       "4      19131   2000  36681  10000  9000    689   679  0  \n",
       "...      ...    ...    ...    ...   ...    ...   ... ..  \n",
       "29995  15980   8500  20000   5003  3047   5000  1000  0  \n",
       "29996      0   1837   3526   8998   129      0     0  0  \n",
       "29997  19357      0      0  22000  4200   2000  3100  1  \n",
       "29998  48944  85900   3409   1178  1926  52964  1804  1  \n",
       "29999  15313   2078   1800   1430  1000   1000  1000  1  \n",
       "\n",
       "[30000 rows x 25 columns]"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import datetime\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "#import pandas_datareader.data as web\n",
    "\n",
    "from sklearn import (\n",
    "    linear_model, metrics, pipeline, preprocessing, model_selection\n",
    ")\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "df = pd.read_csv(\"default_card.csv\")\n",
    "df\n",
    "# activate plot theme\n",
    "#import qeds\n",
    "#qeds.themes.mpl_style();"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>X2</th>\n",
       "      <th>X3</th>\n",
       "      <th>X4</th>\n",
       "      <th>Y</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29995</th>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29996</th>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29997</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29998</th>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29999</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>30000 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       X2  X3  X4  Y\n",
       "0       2   2   1  1\n",
       "1       2   2   2  1\n",
       "2       2   2   2  0\n",
       "3       2   2   1  0\n",
       "4       1   2   1  0\n",
       "...    ..  ..  .. ..\n",
       "29995   1   3   1  0\n",
       "29996   1   3   2  0\n",
       "29997   1   2   2  1\n",
       "29998   1   3   1  1\n",
       "29999   1   2   1  1\n",
       "\n",
       "[30000 rows x 4 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df1 =df[['X2', 'X3', 'X4', 'Y']]\n",
    "df1 #put your code here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>X2</th>\n",
       "      <th>X3</th>\n",
       "      <th>X4</th>\n",
       "      <th>Y</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>X2</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.014232</td>\n",
       "      <td>-0.031389</td>\n",
       "      <td>-0.039961</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>X3</th>\n",
       "      <td>0.014232</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>-0.143464</td>\n",
       "      <td>0.028006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>X4</th>\n",
       "      <td>-0.031389</td>\n",
       "      <td>-0.143464</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>-0.024339</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Y</th>\n",
       "      <td>-0.039961</td>\n",
       "      <td>0.028006</td>\n",
       "      <td>-0.024339</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          X2        X3        X4         Y\n",
       "X2  1.000000  0.014232 -0.031389 -0.039961\n",
       "X3  0.014232  1.000000 -0.143464  0.028006\n",
       "X4 -0.031389 -0.143464  1.000000 -0.024339\n",
       "Y  -0.039961  0.028006 -0.024339  1.000000"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "correlation_data = df1.corr()  #-1<rho<1\n",
    "correlation_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### we can see that the 3 independent variables have very little correlation to the independent Y variable "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, partition the data into training and test sets using 80 $\\%$ to be the train data. The model will be fit to the training data and evaluated on the test set. \n",
    "\n",
    "1.2 Use the Logistic regression to train the model with all features. Then, use the test data to conduct the confusion matrix. Interpret the results carefully [10 Points]."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = df[[\"X2\", \"X3\", \"X4\"]]\n",
    "y = df[\"Y\"]\n",
    "\n",
    "X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2, random_state=100) \n",
    "#split into train and test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LogisticRegression()"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "logistic_model = linear_model.LogisticRegression()\n",
    "logistic_model.fit(X_train, y_train)\n",
    "#train data with logistic model\n",
    "#put your code here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "           0       0.77      1.00      0.87      4625\n",
      "           1       0.00      0.00      0.00      1375\n",
      "\n",
      "    accuracy                           0.77      6000\n",
      "   macro avg       0.39      0.50      0.44      6000\n",
      "weighted avg       0.59      0.77      0.67      6000\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\KC\\anaconda3\\lib\\site-packages\\sklearn\\metrics\\_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
      "  _warn_prf(average, modifier, msg_start, len(result))\n",
      "C:\\Users\\KC\\anaconda3\\lib\\site-packages\\sklearn\\metrics\\_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
      "  _warn_prf(average, modifier, msg_start, len(result))\n",
      "C:\\Users\\KC\\anaconda3\\lib\\site-packages\\sklearn\\metrics\\_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
      "  _warn_prf(average, modifier, msg_start, len(result))\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import classification_report,confusion_matrix\n",
    "print(classification_report(y_test, logistic_model.predict(X_test)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.7807916666666667, 0.7708333333333334)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_acc = logistic_model.score(X_train, y_train)\n",
    "test_acc = logistic_model.score(X_test, y_test)\n",
    "\n",
    "train_acc, test_acc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The model has a precision rate of 0.77 which is not bad. It means that the model can predict the credit card owner that did not defualt by 77%. The recall rate of 1 meaning that the model predicts relevant data from the sample very well. For the case of defaulting loan, as the data of gender, education, and status have a very low correlation to the defaulting cases. We can expect that the data will generally do badly when it comes to predicting defaulting credit cards since the X's doesn't have a clear effect on Y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1.3 Use the KNN to train the model with all features (you should convert the features to be the standarized variables first). Then, use the test data to conduct the confusion matrix. Interpret the results carefully [10 Points]."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[ 0.81016074  0.18582826 -1.05729503  1.87637834]\n",
      " [ 0.81016074  0.18582826  0.85855728  1.87637834]\n",
      " [ 0.81016074  0.18582826  0.85855728 -0.53294156]\n",
      " ...\n",
      " [-1.23432296  0.18582826  0.85855728  1.87637834]\n",
      " [-1.23432296  1.45111372 -1.05729503  1.87637834]\n",
      " [-1.23432296  0.18582826 -1.05729503  1.87637834]]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "scaler = StandardScaler()\n",
    "\n",
    "standardized_data = scaler.fit_transform(df1)\n",
    "\n",
    "print(standardized_data)\n",
    "\n",
    "\n",
    "#put your code here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "             X2        X3        X4         Y\n",
      "0      0.810161  0.185828 -1.057295  1.876378\n",
      "1      0.810161  0.185828  0.858557  1.876378\n",
      "2      0.810161  0.185828  0.858557 -0.532942\n",
      "3      0.810161  0.185828 -1.057295 -0.532942\n",
      "4     -1.234323  0.185828 -1.057295 -0.532942\n",
      "...         ...       ...       ...       ...\n",
      "29995 -1.234323  1.451114 -1.057295 -0.532942\n",
      "29996 -1.234323  1.451114  0.858557 -0.532942\n",
      "29997 -1.234323  0.185828  0.858557  1.876378\n",
      "29998 -1.234323  1.451114 -1.057295  1.876378\n",
      "29999 -1.234323  0.185828 -1.057295  1.876378\n",
      "\n",
      "[30000 rows x 4 columns]\n"
     ]
    }
   ],
   "source": [
    "standardized_df = pd.DataFrame(standardized_data, columns=df1.columns)\n",
    "\n",
    "print(standardized_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "classifier = KNeighborsClassifier(n_neighbors=9)\n",
    "from sklearn.model_selection  import KFold\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "kf=KFold(n_splits=5, shuffle= True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>X2</th>\n",
       "      <th>X3</th>\n",
       "      <th>X4</th>\n",
       "      <th>Y</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.810161</td>\n",
       "      <td>0.185828</td>\n",
       "      <td>-1.057295</td>\n",
       "      <td>1.876378</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.810161</td>\n",
       "      <td>0.185828</td>\n",
       "      <td>0.858557</td>\n",
       "      <td>1.876378</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.810161</td>\n",
       "      <td>0.185828</td>\n",
       "      <td>0.858557</td>\n",
       "      <td>-0.532942</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.810161</td>\n",
       "      <td>0.185828</td>\n",
       "      <td>-1.057295</td>\n",
       "      <td>-0.532942</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-1.234323</td>\n",
       "      <td>0.185828</td>\n",
       "      <td>-1.057295</td>\n",
       "      <td>-0.532942</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29995</th>\n",
       "      <td>-1.234323</td>\n",
       "      <td>1.451114</td>\n",
       "      <td>-1.057295</td>\n",
       "      <td>-0.532942</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29996</th>\n",
       "      <td>-1.234323</td>\n",
       "      <td>1.451114</td>\n",
       "      <td>0.858557</td>\n",
       "      <td>-0.532942</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29997</th>\n",
       "      <td>-1.234323</td>\n",
       "      <td>0.185828</td>\n",
       "      <td>0.858557</td>\n",
       "      <td>1.876378</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29998</th>\n",
       "      <td>-1.234323</td>\n",
       "      <td>1.451114</td>\n",
       "      <td>-1.057295</td>\n",
       "      <td>1.876378</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29999</th>\n",
       "      <td>-1.234323</td>\n",
       "      <td>0.185828</td>\n",
       "      <td>-1.057295</td>\n",
       "      <td>1.876378</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>30000 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "             X2        X3        X4         Y\n",
       "0      0.810161  0.185828 -1.057295  1.876378\n",
       "1      0.810161  0.185828  0.858557  1.876378\n",
       "2      0.810161  0.185828  0.858557 -0.532942\n",
       "3      0.810161  0.185828 -1.057295 -0.532942\n",
       "4     -1.234323  0.185828 -1.057295 -0.532942\n",
       "...         ...       ...       ...       ...\n",
       "29995 -1.234323  1.451114 -1.057295 -0.532942\n",
       "29996 -1.234323  1.451114  0.858557 -0.532942\n",
       "29997 -1.234323  0.185828  0.858557  1.876378\n",
       "29998 -1.234323  1.451114 -1.057295  1.876378\n",
       "29999 -1.234323  0.185828 -1.057295  1.876378\n",
       "\n",
       "[30000 rows x 4 columns]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "standardized_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "X2 = standardized_df[[\"X2\", \"X3\", \"X4\"]]\n",
    "y2 = df[\"Y\"]\n",
    "\n",
    "X2_train, X2_test, y2_train, y2_test = model_selection.train_test_split(X2, y2, test_size=0.2, random_state=100) \n",
    "#split into train and test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "KNeighborsClassifier(n_neighbors=7)"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "knn = KNeighborsClassifier(n_neighbors=7)\n",
    "  \n",
    "knn.fit(X2_train, y2_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "           0       0.77      1.00      0.87      4625\n",
      "           1       0.00      0.00      0.00      1375\n",
      "\n",
      "    accuracy                           0.77      6000\n",
      "   macro avg       0.39      0.50      0.44      6000\n",
      "weighted avg       0.59      0.77      0.67      6000\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\KC\\anaconda3\\lib\\site-packages\\sklearn\\metrics\\_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
      "  _warn_prf(average, modifier, msg_start, len(result))\n",
      "C:\\Users\\KC\\anaconda3\\lib\\site-packages\\sklearn\\metrics\\_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
      "  _warn_prf(average, modifier, msg_start, len(result))\n",
      "C:\\Users\\KC\\anaconda3\\lib\\site-packages\\sklearn\\metrics\\_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
      "  _warn_prf(average, modifier, msg_start, len(result))\n"
     ]
    }
   ],
   "source": [
    "print(classification_report(y2_test, logistic_model.predict(X2_test)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    " 1.4 Compare the predictive performance of two models. Which one you select? And Why? [10 Points]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### even with the standardized data the knn models and the logistic model seems to have the same results so I would say choosing one over the another would not make a difference in this case in particular when there is almost not correlation between X and Y\n",
    "\n",
    "\n",
    "#put your code here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}