Kaggle - Titanic Survivor Analysis¶

https://www.kaggle.com/
Download the Titanic data from Kaggle
Competitions: can be sorted by Prize, Earliest deadline, Number of teams, and so on
Number of teams > Titanic: Machine Learning from Disaster > Tutorials
Under Notebooks you can Copy and Edit other people's notebooks

Titanic Top 4% with ensemble modeling¶

https://www.kaggle.com/yassineghouzam/titanic-top-4-with-ensemble-modeling

EDA To Prediction (DieTanic)¶

https://www.kaggle.com/ash316/eda-to-prediction-dietanic

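Introduction¶

Build a model that uses personal information about the people aboard the Titanic to predict whether each passenger survived

Process¶

Check the dataset
- Most Kaggle datasets are well cleaned, but null data occasionally appears
Exploratory data analysis (EDA)
- Analyze the features individually and check the correlations between them
- Use several visualization tools to find insights
Feature engineering
- Before building a model, engineer the features so that model performance improves
- One-hot encoding, splitting into classes, splitting into intervals (binning), text processing, and so on
Model building
- Build the model with sklearn
- sklearn exposes many algorithms through one consistent API
- For deep learning, use tensorflow, pytorch, and so on
Model training and prediction
- Train the model on the train set
- Predict on the test set
Model evaluation
- Judge whether predictive performance reaches the desired level

Data collection¶

Download the train and test datasets from Kaggle

# Missing-value visualization package
# pip install missingno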

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
plt.style.use('seaborn')  # use the seaborn style instead of matplotlib's default (ggplot is another option); matplotlib >= 3.6 names this style 'seaborn-v0_8'

import missingno as msno
sns.set(font_scale=2.5)  # seaborn's font_scale avoids setting each font size individually

import warnings  # ignore warnings
warnings.filterwarnings('ignore')

%matplotlib inline

train = pd.read_csv('titanic/train.csv')
test = pd.read_csv('titanic/test.csv')

Exploratory data analysis¶

The train data set has 891 rows and 12 columns
Build the machine learning model from the train data set
Predict whether each passenger in the test data set survived
Features include Pclass, Age, SibSp, Parch, and Fare
The target label to predict is Survived

# Inspect the train data set
train.head()

# Inspect the test data set
test.head()

# The test data set has no Survived column (it is the target label to predict)

Data dictionary¶

PassengerId : passenger number
Survived : survival (1 : survived, 0 : died)
Pclass : ticket class (1 : 1st, 2 : 2nd, 3 : 3rd)
Name : passenger name
Sex : passenger sex
Age : passenger age
SibSp : number of siblings and spouses aboard
Parch : number of parents and children aboard
Ticket : ticket number
Fare : ticket fare
Cabin : cabin number
Embarked : port of embarkation (C : Cherbourg, Q : Queenstown, S : Southampton)

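First-class fares are several times higher than third-class fares (the per-class medians computed in the Fare section are about 60 vs. 8)
NaN stands for Not a Number and means the value is missing
Before modeling, such missing fields need preprocessing: fill in the missing value, fill in the mean, or delete the field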
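
A minimal sketch of those preprocessing options (fill with a fixed value, fill with the mean, drop), run on a small made-up DataFrame rather than the Titanic files. The toy values and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for two Titanic columns
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 30.0, np.nan],
    'Cabin': ['C85', None, None, None],
})

filled_const = toy['Age'].fillna(0)                  # fill with a fixed value
filled_mean = toy['Age'].fillna(toy['Age'].mean())   # fill with the column mean (26.0 here)
dropped_rows = toy.dropna(subset=['Age'])            # drop the rows where Age is missing
dropped_col = toy.drop(columns=['Cabin'])            # drop a column that is mostly empty

print(filled_mean.tolist())
print(len(dropped_rows), list(dropped_col.columns))
```

Which option fits depends on how much of the column is missing and how informative it is; the notebook later fills Age and Fare with group medians.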

# Check the shape of the train data set (number of rows and columns)
train.shape

# (rows, columns)

(891, 12)

test.shape

(418, 11)

# Check the DataFrame info of the train data set
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Handling missing values (Null data check)¶

# isnull: flags missing values
train.isnull().sum()

# Age is missing in 177 rows, Cabin in 687, and Embarked in 2

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

# Function that prints the percentage of NaN values in each column
def nan_prop(data):
    for col in data.columns:
        is_null_sum = data[col].isnull().sum()
        col_total_row = data[col].shape[0]
        is_null_prop = 100 * ( is_null_sum / col_total_row)
        msg = 'column: {:>10}\t Percent of NaN value: {:.2f}%'.format(col, is_null_prop)
        
        print(msg)

# NaN percentage in the train data set
nan_prop(train)

column: PassengerId	 Percent of NaN value: 0.00%
column:   Survived	 Percent of NaN value: 0.00%
column:     Pclass	 Percent of NaN value: 0.00%
column:       Name	 Percent of NaN value: 0.00%
column:        Sex	 Percent of NaN value: 0.00%
column:        Age	 Percent of NaN value: 19.87%
column:      SibSp	 Percent of NaN value: 0.00%
column:      Parch	 Percent of NaN value: 0.00%
column:     Ticket	 Percent of NaN value: 0.00%
column:       Fare	 Percent of NaN value: 0.00%
column:      Cabin	 Percent of NaN value: 77.10%
column:   Embarked	 Percent of NaN value: 0.22%

# NaN percentage in the test data set
nan_prop(test)

column: PassengerId	 Percent of NaN value: 0.00%
column:     Pclass	 Percent of NaN value: 0.00%
column:       Name	 Percent of NaN value: 0.00%
column:        Sex	 Percent of NaN value: 0.00%
column:        Age	 Percent of NaN value: 20.57%
column:      SibSp	 Percent of NaN value: 0.00%
column:      Parch	 Percent of NaN value: 0.00%
column:     Ticket	 Percent of NaN value: 0.00%
column:       Fare	 Percent of NaN value: 0.24%
column:      Cabin	 Percent of NaN value: 78.23%
column:   Embarked	 Percent of NaN value: 0.00%
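
The same per-column percentage can be computed without an explicit loop, because the mean of a boolean mask is the fraction of True values. A short sketch on made-up data (the toy frame is an assumption, not the Titanic files):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Age has 2 missing values out of 4 rows
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 30.0, np.nan],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

# isnull() gives a True/False mask; mean() of the mask is the share of missing values
nan_percent = toy.isnull().mean() * 100
print(nan_percent.round(2))
```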

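Using the MISSINGNO (MSNO) library¶

Visualize missing values
Check null data graphically

# msno.matrix: visualizes missing values as a matrix
# Missing values in the train data set
msno.matrix(df=train.iloc[:, :], figsize=(8, 8), color=(0.5, 0.5, 0.5))

# The last column of the plot shows, for each row, how many variables are usable (non-null)
# Age, Cabin, and Embarked contain missing values; each row has at least 10 and at most 12 usable variables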

<AxesSubplot:>

# Missing values in the test data set
msno.matrix(df=test.iloc[:, :], figsize=(8, 8), color=(0.5, 0.5, 0.5))

<AxesSubplot:>

# Non-null counts in the train data set
# msno.bar: visualizes missing values as a bar chart
msno.bar(df=train.iloc[:, :], figsize=(8, 8), color=(0.8, 0.5, 0.2))

<AxesSubplot:>

# Non-null counts in the test data set
msno.bar(df=test.iloc[:, :], figsize=(8, 8), color=(0.8, 0.5, 0.2))

<AxesSubplot:>

Checking the target label - Survived¶

Overall check¶

Check what distribution the target label has
For a binary classification problem, check the distribution of 1s and 0s

f, ax = plt.subplots(1, 2, figsize=(18,8))  # f: the Figure; ax: array of 2 Axes drawn side by side

train['Survived'].value_counts().plot.pie(explode=[0, 0.1], autopct='%1.1f%%', ax=ax[0], shadow=True)
ax[0].set_title('Pie plot - Survived')
ax[0].set_ylabel('')
sns.countplot(x='Survived', data=train, ax=ax[1])  # pass the column as x=; newer seaborn rejects positional data arguments
ax[1].set_title('Count plot - Survived')

plt.show()

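Died 61.6% / survived 38.4%
The target label distribution is reasonably balanced
In an imbalanced case, for example 100 samples with 99 ones and a single zero, a model that predicts 1 for everything still reaches 99% accuracy
If the task is to find the 0s (the deaths), such a model cannot give the desired result despite its high accuracy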
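
The imbalance example above can be reproduced in a few lines. This sketch uses made-up labels (99 ones and a single zero) and a constant "always predict 1" prediction; the variable names are illustrative.

```python
import numpy as np

# 100 hypothetical labels: 99 ones and a single zero
y_true = np.array([1] * 99 + [0])
# A "model" that predicts 1 for every sample
y_pred = np.ones_like(y_true)

accuracy = (y_pred == y_true).mean()

# Recall for class 0: share of the true zeros that were predicted as zero
recall_0 = ((y_pred == 0) & (y_true == 0)).sum() / (y_true == 0).sum()

print(accuracy, recall_0)
```

Accuracy comes out at 0.99 while recall for class 0 is 0.0, which is why accuracy alone is a poor guide on imbalanced targets.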

Checking the categorical features against Survived¶

Pclass
Sex
SibSp
Parch
Embarked
Cabin

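# Function that plots survivors and deaths as two stacked bars for a feature
def bar_chart(feature):
    survived = train[train['Survived']==1][feature].value_counts()
    dead = train[train['Survived']==0][feature].value_counts()
    df = pd.DataFrame([survived, dead])
    df.index = ['Survived', 'Dead']
    df.plot(kind='bar', stacked=True, figsize=(10,5))

bar_chart('Sex')
# Women were more likely to survive than men

bar_chart('Pclass')
# First-class passengers were more likely to survive than the other classes

bar_chart('SibSp')
# Passengers with a sibling or spouse aboard were more likely to survive

bar_chart('Parch')
# Passengers with a parent or child aboard were more likely to survive than those travelling alone

bar_chart('Embarked')
# Most passengers boarded at S, so S dominates both bars; relative to their numbers, C passengers survived more often than S passengers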
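
Stacked count bars show group sizes and survival at once, which can make a large group look "more likely to survive". Computing the rate per group separates the two. A sketch on made-up rows (not the Titanic files); since Survived is 0/1, its group mean is the survival rate.

```python
import pandas as pd

# Hypothetical toy rows: 4 men (1 survived) and 2 women (both survived)
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'female', 'female', 'male'],
    'Survived': [0, 0, 1, 1, 1, 0],
})

# count = group size, sum = number of survivors
counts = toy.groupby('Sex')['Survived'].agg(['count', 'sum'])
# mean of a 0/1 column = survival rate of the group
rates = toy.groupby('Sex')['Survived'].mean()

print(counts)
print(rates)
```

On the real data the same call would be `train.groupby('Embarked')['Survived'].mean()`, assuming `train` is loaded as above.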

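# Sex and Pclass together - seaborn catplot (factorplot was renamed to catplot in seaborn 0.9, and size to height)
# hue: when categorical data is mixed in, pass the categorical variable name to hue to color by category
sns.catplot(x='Pclass', y='Survived', hue='Sex', data=train, kind='point', height=6, aspect=1.5)

# Women had a higher survival rate than men in every class
# Regardless of sex, the higher the class, the higher the survival rate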

<seaborn.axisgrid.FacetGrid at 0x2538f730d08>

# Use columns instead of hue
# One panel per Pclass value
sns.catplot(x='Sex', y='Survived', col='Pclass', data=train, kind='point', height=9, aspect=1)

<seaborn.axisgrid.FacetGrid at 0x2538fb96a08>

# Pclass, Sex, and Age together - seaborn violinplot
f, ax = plt.subplots(1, 2, figsize=(18,8))  # f: the Figure; ax: array of 2 Axes

sns.violinplot(x="Pclass", y="Age", hue="Survived",
               data=train, scale='count', split=True, ax=ax[0])

ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0,110,10))

sns.violinplot(x="Sex", y="Age", hue="Survived",
               data=train, scale='count', split=True, ax=ax[1])

ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0,110,10))

plt.show()

Feature engineering¶

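The process of using domain knowledge about the data to create the features (feature vectors) that machine learning algorithms work on
A feature vector is an n-dimensional numeric vector that represents some object
Many machine learning algorithms require a numerical representation of objects

How did the Titanic sink?¶

In the figure, the bow (the front of the ship, on the right) struck the iceberg and went under first
Many deaths occurred in the third-class section that flooded first
Third-class passengers at the opposite end appear more likely to have survived
First and second class were in positions more favorable for survival
Ticket class is therefore a good variable for predicting survival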

train.head(10)

Name¶

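The name itself does not tell whether a passenger survived, but
titles such as Mr., Miss., and Mrs. can be extracted, and they carry sex and marital-status information

# Collect the train and test data sets in a list so each step can be applied to both
train_test_data = [train, test]

# Extract the title from Name
# Regular expression: capture the run of letters that ends with a period
for dataset in train_test_data:
    dataset['Title'] = dataset['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)

# Count the titles extracted from Name
# train data set
train['Title'].value_counts()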

Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Major         2
Mlle          2
Col           2
Sir           1
Mme           1
Jonkheer      1
Countess      1
Capt          1
Ms            1
Lady          1
Don           1
Name: Title, dtype: int64

# test data set
test['Title'].value_counts()

Mr        240
Miss       78
Mrs        72
Master     21
Rev         2
Col         2
Dr          1
Ms          1
Dona        1
Name: Title, dtype: int64
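
A small check of the title extraction on three names taken from the train preview. It also shows why the character class should be `[A-Za-z]`: the range `A-z` spans the ASCII characters between `Z` and `a` as well, so it also matches `[`, `\`, `]`, `^`, `_` and the backtick. The variable names here are illustrative.

```python
import re
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Palsson, Master. Gosta Leonard',
])

# Capture the first run of letters that is directly followed by a period
titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)
print(titles.tolist())

# The range A-z is wider than the alphabet: it matches an underscore too
loose_matches_underscore = re.fullmatch(r'[A-z]', '_') is not None
strict_matches_underscore = re.fullmatch(r'[A-Za-z]', '_') is not None
print(loose_matches_underscore, strict_matches_underscore)
```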

Mapping each title to a number¶

Title mapping
Mr: 0
Miss: 1
Mrs: 2
Others: 3

# Title mapping
title_mapping = {"Mr":0, "Miss":1, "Mrs":2,
                "Master":3, "Dr":3, "Rev":3, "Col": 3, 'Ms': 3, 'Mlle': 3, "Major": 3, 'Lady': 3, 'Capt': 3,
                 'Sir': 3, 'Don': 3, 'Mme':3, 'Jonkheer': 3, 'Countess': 3 }

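# Apply the title mapping to both data sets in a loop
# Titles that are not keys of the mapping (such as Dona in the test set) become NaN
for dataset in train_test_data:
    dataset['Title'] = dataset['Title'].map(title_mapping)

# Inspect the train data set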
train.head()
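
`Series.map` with a dict returns NaN for any value that is not a key, which is what happens to the title Dona in the test set. A sketch of that behaviour with a shortened, hypothetical mapping; sending unmapped titles to the "Others" code 3 is one possible choice shown here as an assumption, whereas the notebook fills the NaN with 0 just before training.

```python
import pandas as pd

# Shortened mapping for illustration (the notebook's mapping has more keys)
title_mapping = {'Mr': 0, 'Miss': 1, 'Mrs': 2}

titles = pd.Series(['Mr', 'Mrs', 'Dona'])

# 'Dona' is not a key, so it maps to NaN
mapped = titles.map(title_mapping)
print(mapped.tolist())

# One option (an assumption): treat every unmapped title as "Others" (3)
mapped_others = mapped.fillna(3).astype(int)
print(mapped_others.tolist())
```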

# Plot survival by title
def bar_chart(feature):
    survived = train[train['Survived'] == 1][feature].value_counts()
    dead = train[train['Survived'] == 0][feature].value_counts()
    df = pd.DataFrame([survived, dead])
    df.index = ['Survived','Dead']
    df.plot(kind='bar',stacked=True, figsize=(10,5))

bar_chart('Title')

# Mr (0) passengers mostly died while Miss (1) and Mrs (2) passengers mostly survived, so Title is related to survival

# Drop features that are no longer needed
train.drop('Name', axis=1, inplace=True)
test.drop('Name', axis=1, inplace=True)

train.head()

Sex¶

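When the information is a clearly separated category such as sex, convert the text to numbers
male: 0 female: 1

# Map sex to numbers
sex_mapping = {'male': 0, 'female':1}

# Apply the mapping to both data sets in a loop
for dataset in train_test_data:
    dataset['Sex'] = dataset['Sex'].map(sex_mapping)

# Check with a plot
bar_chart('Sex')

# Women have the higher survival rate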

Age¶

Note) Age has missing values scattered throughout the column
Missing information should be filled in with a reasoned method
Method 1) Fill with the average age of all other passengers
Method 2) Fill with the median age of each Title group (Mr, Mrs, Miss, Others) extracted above, which can give more meaningful results than a single overall value

# Replace missing Age with the median age of each Title (Mr, Mrs, Miss, Others)
# transform() adds group-level statistics - https://rfriend.tistory.com/403

# train data set
train['Age'] = train['Age'].fillna(train.groupby('Title')['Age'].transform('median'))

# test data set
test['Age'] = test['Age'].fillna(test.groupby('Title')['Age'].transform('median'))
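
To see what the group-median fill does, here is a sketch on a made-up frame with two Title groups. `transform('median')` returns a Series aligned to the original rows, so `fillna` can use it directly. The toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two title groups, one missing Age in each
toy = pd.DataFrame({
    'Title': [0, 0, 0, 1, 1, 1],
    'Age': [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# Median age of each row's own Title group, one value per row
group_median = toy.groupby('Title')['Age'].transform('median')
print(group_median.tolist())

# Only the missing entries are replaced; known ages stay as they are
toy['Age'] = toy['Age'].fillna(group_median)
print(toy['Age'].tolist())
```

Group 0 has known ages 30 and 40, so its gap is filled with 35; group 1 has 20 and 24, so its gap is filled with 22.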

# Plot the age distribution by survival
facet = sns.FacetGrid(train, hue='Survived', aspect=4)
facet.map(sns.kdeplot, 'Age', shade=True)  # newer seaborn releases use fill=True in place of shade=True
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()
sns.axes_style("darkgrid")

plt.show()

# Zoom in on one age range at a time (ages 0-20)
facet = sns.FacetGrid(train, hue='Survived', aspect=4)
facet.map(sns.kdeplot, 'Age', shade=True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()

plt.xlim(0,20)

plt.style.use('ggplot')

# Zoom in on one age range at a time (ages 20-30)
facet = sns.FacetGrid(train, hue='Survived', aspect=4)
facet.map(sns.kdeplot, 'Age', shade=True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()

plt.xlim(20,30)

(20.0, 30.0)

# Zoom in on one age range at a time (ages 30-40)
facet = sns.FacetGrid(train, hue='Survived', aspect=4)
facet.map(sns.kdeplot, 'Age', shade=True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()

plt.xlim(30,40)

(30.0, 40.0)

# Zoom in on one age range at a time (ages 40-60)
facet = sns.FacetGrid(train, hue='Survived', aspect=4)
facet.map(sns.kdeplot, 'Age', shade=True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()

plt.xlim(40,60)

(40.0, 60.0)

# Zoom in on one age range at a time (age 60 and over)
facet = sns.FacetGrid(train, hue="Survived", aspect=4)
facet.map(sns.kdeplot, 'Age', shade=True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()

plt.xlim(60)

(60.0, 80.0)

feature engineering - Binning¶

Used to convert existing data into categories
Continuous data on its own gives limited information, but grouping it into categories can make the pattern clearer

# Categorize by age band
for dataset in train_test_data:
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 26), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 26) & (dataset['Age'] <= 36), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 36) & (dataset['Age'] <= 62), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 62), 'Age'] = 4

train.head()
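
The same age bands can be written as a single `pd.cut` call, which avoids overwriting the column step by step. This is an alternative sketch, not the notebook's method; the sample ages are made up and the bin edges copy the thresholds used above (right-closed intervals, so 16 falls in band 0 and 17 in band 1).

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on and around the band edges
ages = pd.Series([2, 16, 17, 26, 30, 36, 50, 62, 70])

# Intervals: (-inf, 16], (16, 26], (26, 36], (36, 62], (62, inf)
age_band = pd.cut(
    ages,
    bins=[-np.inf, 16, 26, 36, 62, np.inf],
    labels=[0, 1, 2, 3, 4],
).astype(int)

print(age_band.tolist())
```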

bar_chart('Age')

# Survival by age band
survived = train[train['Survived']==1]['Age'].value_counts()
dead = train[train['Survived']==0]['Age'].value_counts()

# Survivors per age band
survived

2.0    116
1.0     97
3.0     69
0.0     57
4.0      3
Name: Age, dtype: int64

# Deaths per age band
dead

2.0    220
1.0    158
3.0    111
0.0     48
4.0     12
Name: Age, dtype: int64

Embarked¶

Check how the share of wealthy and poor passengers differs by port of embarkation

# Count ports of embarkation per ticket class
Pclass1 = train[train['Pclass']==1]['Embarked'].value_counts()
Pclass2 = train[train['Pclass']==2]['Embarked'].value_counts()
Pclass3 = train[train['Pclass']==3]['Embarked'].value_counts()

Pclass1
# Shows the port counts for first class

S    127
C     85
Q      2
Name: Embarked, dtype: int64

# Build a DataFrame and label the index
df = pd.DataFrame([Pclass1, Pclass2, Pclass3])
df.index = ['1st class', '2nd class', '3rd class']

df

# Plot
df.plot(kind='bar', stacked=True, figsize=(10,5))

<AxesSubplot:>

Among passengers who boarded at Q, almost none were in first or second class
The majority of passengers boarded at S
So filling a missing Embarked value with S looks reasonable

# Fill missing Embarked values with S
# Apply to both data sets in a loop
for dataset in train_test_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')

train.head()

# Map text to numbers for the machine learning classifier
embarked_mapping = {'S':0, 'C':1, 'Q':2}

# Apply with map
for dataset in train_test_data:
    dataset['Embarked'] = dataset['Embarked'].map(embarked_mapping)

train.head()
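
The integer mapping above gives the ports an order (S < C < Q) that they do not really have. One-hot encoding, listed in the process section, is an alternative that creates one indicator column per port. A sketch on made-up values using `pd.get_dummies`; the notebook itself keeps the integer mapping.

```python
import pandas as pd

# Hypothetical toy column of ports
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# One indicator column per category; exactly one of them is set in each row
one_hot = pd.get_dummies(toy['Embarked'], prefix='Embarked')
print(one_hot)
```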

Fare¶

How should a missing fare be handled?
Fare is related to ticket class
Pclass has no missing values, so use the median fare of each class for the missing values

# Median fare per class in the train data set
train.groupby('Pclass')['Fare'].median()

Pclass
1    60.2875
2    14.2500
3     8.0500
Name: Fare, dtype: float64

# Mean fare per class in the train data set
train.groupby('Pclass')['Fare'].mean()

# Between the mean and the median, choose the one that better represents each class

Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64

# Median fare per class in the test data set
test.groupby('Pclass')['Fare'].median()

Pclass
1    60.0000
2    15.7500
3     7.8958
Name: Fare, dtype: float64

# Fill a missing fare with the median fare of the passenger's ticket class
# train data set
train["Fare"] = train["Fare"].fillna(train.groupby('Pclass')['Fare'].transform('median'))

# test data set
test["Fare"] = test["Fare"].fillna(test.groupby('Pclass')['Fare'].transform('median'))

# Plot
# FacetGrid(data, row, col, hue): a grid class for multi-plot figures that show several pairwise relationships; like dividing a canvas into axes
# One of seaborn's multi-plot grid classes for plotting several conditional relationships at once

facet = sns.FacetGrid(train, hue="Survived", aspect=4)  # split the data by Survived
facet.map(sns.kdeplot, 'Fare', shade=True)  # map tells the FacetGrid object which plot to draw and on which variable
facet.set(xlim=(0, train['Fare'].max()))
facet.add_legend()

plt.show()

Passengers with cheap tickets show a higher death rate, and passengers with expensive tickets show a higher survival rate

# Set the x-axis range to inspect survival in a chosen fare interval (0-40)
facet = sns.FacetGrid(train, hue="Survived", aspect=4)
facet.map(sns.kdeplot, 'Fare', shade=True)
facet.set(xlim=(0, train['Fare'].max()))
facet.add_legend()
plt.xlim(0,40)

(0.0, 40.0)

# Set the x-axis range to inspect survival in a chosen fare interval (0-100)
facet = sns.FacetGrid(train, hue="Survived", aspect=4)
facet.map(sns.kdeplot, 'Fare', shade=True)
facet.set(xlim=(0, train['Fare'].max()))
facet.add_legend()
plt.xlim(0,100)

(0.0, 100.0)

# Check the highest fare (cases where the ticket was bought at a premium)
facet = sns.FacetGrid(train, hue="Survived", aspect=4)
facet.map(sns.kdeplot, 'Fare', shade=True)
facet.set(xlim=(0, train['Fare'].max()))
facet.add_legend()

# With only the lower bound set, the x-axis runs up to the highest fare; no tickets exist beyond that price
plt.xlim(0)

(0.0, 512.3292)

# Use binning to categorize the fare into intervals
for dataset in train_test_data:
    dataset.loc[ dataset['Fare'] <= 17, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 17) & (dataset['Fare'] <= 30), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 30) & (dataset['Fare'] <= 100), 'Fare'] = 2
    dataset.loc[ dataset['Fare'] > 100, 'Fare'] = 3

Cabin - cabin number¶

A letter combined with a number, such as C85, G6, C123
Only the letter is used (the number is hard to process)

train['Cabin'].head()

0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object

# In a loop over both data sets, keep only the leading letter of each cabin number
for dataset in train_test_data:
    dataset['Cabin'] = dataset['Cabin'].str[:1]

train['Cabin'].head()

0    NaN
1      C
2    NaN
3      C
4    NaN
Name: Cabin, dtype: object

# Count cabin letters per ticket class
Pclass1 = train[train['Pclass']==1]['Cabin'].value_counts()
Pclass2 = train[train['Pclass']==2]['Cabin'].value_counts()
Pclass3 = train[train['Pclass']==3]['Cabin'].value_counts()

df = pd.DataFrame([Pclass1, Pclass2, Pclass3])
df.index = ['1st class', '2nd class', '3rd class']

df.plot(kind='bar', stacked=True, figsize=(10,5))

<AxesSubplot:>

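First class consists of cabins A, B, C, D, E, T; second class of D, E, F; third class of E, F, G

# Mapping for the classifier
# feature scaling: preprocessing raw data so that feature magnitudes and ranges are comparable; decimal steps are used here
# Note: when the ranges differ widely, a model can treat the feature with the larger distances as more important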

cabin_mapping = {'A':0, 'B':0.4, 'C':0.8, 'D':1.2, 'E':1.6, 'F':2, 'G':2.4, 'T': 2.8}

for dataset in train_test_data:
    dataset['Cabin'] = dataset['Cabin'].map(cabin_mapping)
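
The notebook scales by hand-picking evenly spaced decimals. A common automatic form of feature scaling is min-max scaling, which maps each feature to the range 0 to 1. The sketch below is an illustration of that idea on made-up numbers, not the method used above; the `min_max` helper is hypothetical.

```python
import numpy as np

def min_max(x):
    """Rescale an array linearly so its minimum becomes 0 and its maximum 1."""
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical raw features on very different scales
fare = np.array([0.0, 10.0, 50.0, 500.0])
pclass = np.array([1.0, 2.0, 3.0, 3.0])

fare_scaled = min_max(fare)
pclass_scaled = min_max(pclass)

print(fare_scaled)
print(pclass_scaled)
```

After scaling, both features lie in the same range, so neither dominates a distance computation just because of its units.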

# Missing Cabin values are closely tied to ticket class (1st, 2nd, 3rd)
# Fill missing values with the median Cabin value of each class

train['Cabin'] = train['Cabin'].fillna(train.groupby('Pclass')['Cabin'].transform('median'))

test['Cabin'] = test['Cabin'].fillna(test.groupby('Pclass')['Cabin'].transform('median'))

train.tail(10)

FamilySize¶

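Check the survival rate by whether the passenger travelled with family
SibSp : number of siblings and spouses aboard
Parch : number of parents and children aboard

train.head()

# A passenger travelling alone has SibSp and Parch both 0, so add 1 to count the passenger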
train['FamilySize'] = train['SibSp'] + train['Parch'] + 1
test['FamilySize'] = test['SibSp'] + test['Parch'] + 1

# Plot
facet = sns.FacetGrid(train, hue="Survived", aspect=4)
facet.map(sns.kdeplot, 'FamilySize', shade=True)
facet.set(xlim=(0, train['FamilySize'].max()))
facet.add_legend()

# 0: died / 1: survived
# Passengers travelling alone have a lower survival rate than those with at least one family member aboard

<seaborn.axisgrid.FacetGrid at 0x2538f7552c8>

# mapping
# feature scaling 
family_mapping = {1: 0, 2: 0.4, 3: 0.8, 4: 1.2, 5: 1.6, 6: 2, 7: 2.4, 8: 2.8, 9: 3.2, 10: 3.6, 11: 4}
for dataset in train_test_data:
    dataset['FamilySize'] = dataset['FamilySize'].map(family_mapping)

# After the loop, dataset still refers to the last element, test
test['FamilySize'].head()

0    0.0
1    0.4
2    0.0
3    0.0
4    0.8
Name: FamilySize, dtype: float64

train.head()

# Remove data that is no longer needed: drop
# Remove Ticket, SibSp, Parch, and PassengerId

features_drop = ['Ticket', 'SibSp', 'Parch', 'PassengerId']

train = train.drop(features_drop, axis=1)
test = test.drop(features_drop, axis=1)

train.head()

test.head()

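With everything expressed as numbers => predict with a machine learning classifier¶

The Titanic problem is supervised learning because the target class, Survived, is given
The target class is 0 or 1 => binary classification (with more than two class values it would be multiclass classification)

from sklearn.ensemble import RandomForestClassifier  # random forest classifier
from sklearn import metrics  # used for model evaluation
from sklearn.model_selection import train_test_split  # helper that splits a data set into training and validation parts

model = RandomForestClassifier()

# Separate the answers (target) from the questions to study (features)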
train_data = train.drop('Survived', axis=1)
target = train['Survived']

train_data.shape, target.shape

((891, 8), (891,))

train_data.head(10)

# Fill missing Title values in the training features
train_data['Title'] = train['Title'].fillna(0)

# Train: fit the model
model.fit(train_data, target)

RandomForestClassifier()

test['Title'] = test['Title'].fillna(0)
test

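# Predict
prediction = model.predict(test)

# Training accuracy: scored on the same rows the model was fit on, so it overstates performance on unseen passengers
accuracy = round(model.score(train_data, target) * 100, 2)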
print("Accuracy : ", accuracy, "%")

Accuracy :  90.01 %
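
The score above is measured on the training rows themselves. Holding out part of the data, or cross-validating, gives an estimate closer to what a leaderboard would show. The sketch below uses `train_test_split` (imported above but not otherwise used) and `cross_val_score` on synthetic stand-in data; the random features, labels, and parameter values are assumptions for illustration, not Titanic results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic stand-in data: 300 rows, 5 small integer features, noisy binary label
rng = np.random.RandomState(0)
X = rng.randint(0, 4, size=(300, 5))
y = ((X[:, 0] + rng.randint(0, 3, size=300)) > 3).astype(int)

# Hold out 25% of the rows for validation
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)   # accuracy on rows the model has seen
val_acc = model.score(X_val, y_val)   # accuracy on held-out rows

# 5-fold cross-validation on the full data as a second estimate
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5
)

print(train_acc, val_acc, cv_scores.mean())
```

With the notebook's data, `X` and `y` would be `train_data` and `target`.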

test2 = pd.read_csv('titanic/test.csv')

submission = pd.DataFrame(
    {
        "PassengerId":test2["PassengerId"], # PassengerId was dropped earlier, so it is reloaded from the file
        "Survived":prediction
    }
)
submission.to_csv('submission_rf_20200727.csv', index=False)

submission

# Upload the file to Kaggle to check the leaderboard rank

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C

	PassengerId	Survived
0	892	0
1	893	0
2	894	0
3	895	0
4	896	1
...	...	...
413	1305	0
414	1306	1
415	1307	0
416	1308	0
417	1309	1

	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	892	3	Kelly, Mr. James	male	34.5	0	0	330911	7.8292	NaN	Q
1	893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	0	363272	7.0000	NaN	S
2	894	2	Myles, Mr. Thomas Francis	male	62.0	0	0	240276	9.6875	NaN	Q
3	895	3	Wirz, Mr. Albert	male	27.0	0	0	315154	8.6625	NaN	S
4	896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.0	1	1	3101298	12.2875	NaN	S

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Title
881	882	0	3	0	2.0	0	0	349257	0.0	2.0	0	0
882	883	0	3	1	1.0	0	0	7552	0.0	2.0	0	1
883	884	0	2	0	2.0	0	0	C.A./SOTON 34068	0.0	1.8	0	0
884	885	0	3	0	1.0	0	0	SOTON/OQ 392076	0.0	2.0	0	0
885	886	0	3	1	3.0	0	5	382652	1.0	2.0	2	2
886	887	0	2	0	2.0	0	0	211536	0.0	1.8	0	3
887	888	1	1	1	1.0	0	0	112053	1.0	0.4	0	1
888	889	0	3	1	1.0	1	2	W./C. 6607	1.0	2.0	0	1
889	890	1	1	0	1.0	0	0	111369	1.0	0.8	1	0
890	891	0	3	0	2.0	0	0	370376	0.0	2.0	2	0

	Survived	Pclass	Sex	Age	Fare	Cabin	Embarked	Title	FamilySize
0	0	3	0	1.0	0.0	2.0	0	0	0.4
1	1	1	1	3.0	2.0	0.8	1	2	0.4
2	1	3	1	1.0	0.0	2.0	0	1	0.0
3	1	1	1	2.0	2.0	0.8	0	2	0.4
4	0	3	0	2.0	0.0	2.0	0	0	0.0

	Pclass	Sex	Age	Cabin	Embarked	Title	FamilySize
0	3	0	2.0	2.0	2	0.0	0.0
1	3	1	3.0	2.0	0	2.0	0.4
2	2	0	3.0	2.0	2	0.0	0.0
3	3	0	2.0	2.0	0	0.0	0.0
4	3	1	1.0	2.0	0	2.0	0.8

	Pclass	Sex	Age	Fare	Cabin	Embarked	Title	FamilySize
0	3	0	1.0	0.0	2.0	0	0	0.4
1	1	1	3.0	2.0	0.8	1	2	0.4
2	3	1	1.0	0.0	2.0	0	1	0.0
3	1	1	2.0	2.0	0.8	0	2	0.4
4	3	0	2.0	0.0	2.0	0	0	0.0
5	3	0	2.0	0.0	2.0	2	0	0.0
6	1	0	3.0	2.0	1.6	0	0	0.0
7	3	0	0.0	1.0	2.0	0	3	1.6
8	3	1	2.0	0.0	2.0	0	2	0.8
9	2	1	0.0	2.0	1.8	1	2	0.4

	Pclass	Sex	Age	Fare	Cabin	Embarked	Title	FamilySize
0	3	0	2.0	0.0	2.0	2	0.0	0.0
1	3	1	3.0	0.0	2.0	0	2.0	0.4
2	2	0	3.0	0.0	2.0	2	0.0	0.0
3	3	0	2.0	0.0	2.0	0	0.0	0.0
4	3	1	1.0	0.0	2.0	0	2.0	0.8
...	...	...	...	...	...	...	...	...
413	3	0	2.0	0.0	2.0	0	0.0	0.0
414	1	1	3.0	3.0	0.8	1	0.0	0.0
415	3	0	3.0	0.0	2.0	0	0.0	0.0
416	3	0	2.0	0.0	2.0	0	0.0	0.0
417	3	0	0.0	1.0	2.0	1	3.0	0.8

	S	C	Q
1st class	127	85	2
2nd class	164	17	3
3rd class	353	66	72
