{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Test3: (40 marks)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Your ID: 6204641218"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this excercise, we aimed at the case of customers default payments in Taiwan and compares the predictive accuracy of probability of default among data mining methods. From the perspective of risk management, the result of predictive accuracy of the estimated probability of default will be more valuable than the binary result of classification - credible or not credible clients. With the real probability of default as the response variable (Y), and the predictive probability of default as the independent variable (X), this research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables: \\\\\n",
    "\n",
    "X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit. \\\\\n",
    "\n",
    "X2: Gender (1 = male; 2 = female). \\\\\n",
    "\n",
    "X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). \\\\\n",
    "\n",
    "X4: Marital status (1 = married; 2 = single; 3 = others). \\\\\n",
    "\n",
    "X5: Age (year). \\\\\n",
    "\n",
    "X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. \\\\\n",
    "\n",
    "X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005. \\\\\n",
    "\n",
    "X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005. \\\\\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Learning the Logistic Regression, KNN, in Python and Scikit-Learn  \n",
    "\n",
    "There are 4 sub-question:\n",
    "\n",
    "1.1 Create the table to report the proportion of case of customers default payments in Taiwan separated by gender, education, and marital status. What is/are the interesting result/s you can draw from this table?[10 Points].\\\\"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unnamed: 0</th>\n",
       "      <th>X1</th>\n",
       "      <th>X2</th>\n",
       "      <th>X3</th>\n",
       "      <th>X4</th>\n",
       "      <th>X5</th>\n",
       "      <th>X6</th>\n",
       "      <th>X7</th>\n",
       "      <th>X8</th>\n",
       "      <th>X9</th>\n",
       "      <th>...</th>\n",
       "      <th>X15</th>\n",
       "      <th>X16</th>\n",
       "      <th>X17</th>\n",
       "      <th>X18</th>\n",
       "      <th>X19</th>\n",
       "      <th>X20</th>\n",
       "      <th>X21</th>\n",
       "      <th>X22</th>\n",
       "      <th>X23</th>\n",
       "      <th>Y</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>20000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>24</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>689</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>120000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>26</td>\n",
       "      <td>-1</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>3272</td>\n",
       "      <td>3455</td>\n",
       "      <td>3261</td>\n",
       "      <td>0</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>0</td>\n",
       "      <td>2000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>90000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>34</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>14331</td>\n",
       "      <td>14948</td>\n",
       "      <td>15549</td>\n",
       "      <td>1518</td>\n",
       "      <td>1500</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>5000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>50000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>37</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>28314</td>\n",
       "      <td>28959</td>\n",
       "      <td>29547</td>\n",
       "      <td>2000</td>\n",
       "      <td>2019</td>\n",
       "      <td>1200</td>\n",
       "      <td>1100</td>\n",
       "      <td>1069</td>\n",
       "      <td>1000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>50000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>57</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>20940</td>\n",
       "      <td>19146</td>\n",
       "      <td>19131</td>\n",
       "      <td>2000</td>\n",
       "      <td>36681</td>\n",
       "      <td>10000</td>\n",
       "      <td>9000</td>\n",
       "      <td>689</td>\n",
       "      <td>679</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29995</th>\n",
       "      <td>29996</td>\n",
       "      <td>220000</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>39</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>88004</td>\n",
       "      <td>31237</td>\n",
       "      <td>15980</td>\n",
       "      <td>8500</td>\n",
       "      <td>20000</td>\n",
       "      <td>5003</td>\n",
       "      <td>3047</td>\n",
       "      <td>5000</td>\n",
       "      <td>1000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29996</th>\n",
       "      <td>29997</td>\n",
       "      <td>150000</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>43</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>...</td>\n",
       "      <td>8979</td>\n",
       "      <td>5190</td>\n",
       "      <td>0</td>\n",
       "      <td>1837</td>\n",
       "      <td>3526</td>\n",
       "      <td>8998</td>\n",
       "      <td>129</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29997</th>\n",
       "      <td>29998</td>\n",
       "      <td>30000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>37</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>-1</td>\n",
       "      <td>...</td>\n",
       "      <td>20878</td>\n",
       "      <td>20582</td>\n",
       "      <td>19357</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>22000</td>\n",
       "      <td>4200</td>\n",
       "      <td>2000</td>\n",
       "      <td>3100</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29998</th>\n",
       "      <td>29999</td>\n",
       "      <td>80000</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>41</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>52774</td>\n",
       "      <td>11855</td>\n",
       "      <td>48944</td>\n",
       "      <td>85900</td>\n",
       "      <td>3409</td>\n",
       "      <td>1178</td>\n",
       "      <td>1926</td>\n",
       "      <td>52964</td>\n",
       "      <td>1804</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29999</th>\n",
       "      <td>30000</td>\n",
       "      <td>50000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>46</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>36535</td>\n",
       "      <td>32428</td>\n",
       "      <td>15313</td>\n",
       "      <td>2078</td>\n",
       "      <td>1800</td>\n",
       "      <td>1430</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>30000 rows × 25 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       Unnamed: 0      X1  X2  X3  X4  X5  X6  X7  X8  X9  ...    X15    X16  \\\n",
       "0               1   20000   2   2   1  24   2   2  -1  -1  ...      0      0   \n",
       "1               2  120000   2   2   2  26  -1   2   0   0  ...   3272   3455   \n",
       "2               3   90000   2   2   2  34   0   0   0   0  ...  14331  14948   \n",
       "3               4   50000   2   2   1  37   0   0   0   0  ...  28314  28959   \n",
       "4               5   50000   1   2   1  57  -1   0  -1   0  ...  20940  19146   \n",
       "...           ...     ...  ..  ..  ..  ..  ..  ..  ..  ..  ...    ...    ...   \n",
       "29995       29996  220000   1   3   1  39   0   0   0   0  ...  88004  31237   \n",
       "29996       29997  150000   1   3   2  43  -1  -1  -1  -1  ...   8979   5190   \n",
       "29997       29998   30000   1   2   2  37   4   3   2  -1  ...  20878  20582   \n",
       "29998       29999   80000   1   3   1  41   1  -1   0   0  ...  52774  11855   \n",
       "29999       30000   50000   1   2   1  46   0   0   0   0  ...  36535  32428   \n",
       "\n",
       "         X17    X18    X19    X20   X21    X22   X23  Y  \n",
       "0          0      0    689      0     0      0     0  1  \n",
       "1       3261      0   1000   1000  1000      0  2000  1  \n",
       "2      15549   1518   1500   1000  1000   1000  5000  0  \n",
       "3      29547   2000   2019   1200  1100   1069  1000  0  \n",
       "4      19131   2000  36681  10000  9000    689   679  0  \n",
       "...      ...    ...    ...    ...   ...    ...   ... ..  \n",
       "29995  15980   8500  20000   5003  3047   5000  1000  0  \n",
       "29996      0   1837   3526   8998   129      0     0  0  \n",
       "29997  19357      0      0  22000  4200   2000  3100  1  \n",
       "29998  48944  85900   3409   1178  1926  52964  1804  1  \n",
       "29999  15313   2078   1800   1430  1000   1000  1000  1  \n",
       "\n",
       "[30000 rows x 25 columns]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "df = pd.read_excel('defaults.xlsx')\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Y</th>\n",
       "      <th>X2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Unnamed: 0</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29996</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29997</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29998</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29999</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30000</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>30000 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "            Y  X2\n",
       "Unnamed: 0       \n",
       "1           1   2\n",
       "2           1   2\n",
       "3           0   2\n",
       "4           0   2\n",
       "5           0   1\n",
       "...        ..  ..\n",
       "29996       0   1\n",
       "29997       0   1\n",
       "29998       1   1\n",
       "29999       1   1\n",
       "30000       1   1\n",
       "\n",
       "[30000 rows x 2 columns]"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_group = df[['Y','X2']] #age\n",
    "df_group"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "X2\n",
      "2     18112\n",
      "1     11888\n",
      "dtype: int64\n",
      "X4\n",
      "2     15964\n",
      "1     13659\n",
      "3       323\n",
      "0        54\n",
      "dtype: int64\n",
      "X3\n",
      "2     14030\n",
      "1     10585\n",
      "3      4917\n",
      "5       280\n",
      "4       123\n",
      "6        51\n",
      "0        14\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "for i in {'X2','X3','X4'}:\n",
    "    print(df[{i}].value_counts())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Unnamed: 0\n",
       "1        1\n",
       "2        1\n",
       "3        2\n",
       "4        2\n",
       "5        1\n",
       "        ..\n",
       "29996    1\n",
       "29997    1\n",
       "29998    1\n",
       "29999    1\n",
       "30000    1\n",
       "Name: Y, Length: 30000, dtype: int64"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "df['Y']=np.where(df['Y'] == 0, df['X2'] , df['Y'])\n",
    "df['Y']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "ename": "NameError",
     "evalue": "name 'SHOW_FIGURE' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mNameError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-48-9de0cbb0cbfb>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[0;32m      3\u001b[0m \u001b[1;32mfrom\u001b[0m \u001b[0mmatplotlib\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfont_manager\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mFontProperties\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      4\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 5\u001b[1;33m \u001b[1;32mif\u001b[0m \u001b[0mSHOW_FIGURE\u001b[0m \u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m      6\u001b[0m     \u001b[1;31m# 1=graduate school, 2=university, 3=high school 4=others\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      7\u001b[0m     \u001b[0mdata\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'X2'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mvalue_counts\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mplot\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkind\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'bar'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mfigsize\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;36m10\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;36m6\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;31mNameError\u001b[0m: name 'SHOW_FIGURE' is not defined"
     ]
    }
   ],
   "source": [
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "from matplotlib.font_manager import FontProperties\n",
    "\n",
    "if SHOW_FIGURE :\n",
    "    # 1=graduate school, 2=university, 3=high school 4=others\n",
    "    data['X2'].value_counts().plot(kind='bar', figsize=(10,6))\n",
    "    # plt.title(\"Number of cars by make\")\n",
    "    # plt.xlabel('Education level', fontproperties=font)\n",
    "    plt.ylabel('Y', fontproperties=font)\n",
    "    # plt.show()\n",
    "    plt.savefig('gender_default.png')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gender = 194/(194+444)*100\n",
    "edu = 170/(170+468)*100\n",
    "marstat = 514/(514+124)*100\n",
    "\n",
    "print(f\"The percentage of Southwest Airlines that serves the route is {round(SW,2)} %.\")\n",
    "print(f\"The percentage of being a vacation route is {round(VACATION,2)} %.\")\n",
    "print(f\"The percentage of having gate constraints is {round(GATE,2)} %.\")\n",
    "print(f\"The percentage of having a slot-controlled at the endpoint airport is {round(SLOT,2)} %.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, partition the data into training and test sets using 80 $\\%$ to be the train data. The model will be fit to the training data and evaluated on the test set. \n",
    "\n",
    "1.2 Use the Logistic regression to train the model with all features. Then, use the test data to conduct the confusion matrix. Interpret the results carefully [10 Points]."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y =df['Y']\n",
    "x =df['X2', 'X3', 'X4']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1.3 Use the KNN to train the model with all features (you should convert the features to be the standarized variables first). Then, use the test data to conduct the confusion matrix. Interpret the results carefully [10 Points]."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "scaler = StandardScaler()\n",
    "scaled_data = scaler.fit_transform(X_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "classifier = KNeighborsClassifier(n_neighbors=9)\n",
    "classifier = Pipeline([('norm', StandardScaler()), ('knn', classifier)])   # once get the data then convert into z-score first ('norm', standardscaler), then KNN\n",
    "\n",
    "means = []\n",
    "for training,testing in kf.split(features):\n",
    "    # We learn a model for this fold with `fit` and then apply it to the\n",
    "    # testing data with `predict`:\n",
    "    classifier.fit(features[training], labels[training])\n",
    "    prediction = classifier.predict(features[testing])\n",
    "\n",
    "    # np.mean on an array of booleans returns fraction\n",
    "    # of correct decisions for this fold:\n",
    "    curmean = np.mean(prediction == labels[testing])\n",
    "    means.append(curmean)\n",
    "print('Mean accuracy: {:.1%}'.format(np.mean(means)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    " 1.4 Compare the predictive performance of two models. Which one you select? And Why? [10 Points]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#put your code here"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}