Pandas

  • An essential library for data analysis and similar work in Python

Pandas Data Structures

  1. Series
  2. DataFrame

Installing the pandas package

  • pip install pandas
In [2]:
# suppress warnings
import warnings
warnings.filterwarnings(action='ignore')
In [3]:
# import the pandas module
import pandas as pd

1. Series

  • A structure that attaches an index label to each value of a 1-D array of values
  • By default the index starts at 0
In [4]:
# creating a Series 1
se = pd.Series([1, 2, 3, 4])
se
Out[4]:
0    1
1    2
2    3
3    4
dtype: int64
In [5]:
print(se[0])
print(se[3])
print(se[len(se)-1])
1
4
4
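
  • Note: with the default integer index, se[-1] is treated as a label lookup and raises a KeyError, which is why se[len(se)-1] is used above. Position-based access is what iloc is for (a minimal sketch):
In [ ]:
se.iloc[-1]   # positional access to the last element -> 4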

Extracting specific values: values, index

In [6]:
se.index   # the index
Out[6]:
RangeIndex(start=0, stop=4, step=1)
In [8]:
se.values # the values   # a Series stores its data as a NumPy array
Out[8]:
array([1, 2, 3, 4], dtype=int64)

Changing the index of a Series

In [7]:
se.index = ['a', 'b', 'c', 'd']
se
Out[7]:
a    1
b    2
c    3
d    4
dtype: int64
In [8]:
print(se['a'])
print(se['d'])
1
4
In [10]:
# creating a Series 2 - explicit index
se2 = pd.Series([17000, 18000, 1000, 5000],
       index=["피자", "치킨", "콜라", "맥주"])
se2
Out[10]:
피자    17000
치킨    18000
콜라     1000
맥주     5000
dtype: int64
In [14]:
se2.index
Out[14]:
Index(['피자', '치킨', '콜라', '맥주'], dtype='object')
In [15]:
se2.values
Out[15]:
array([17000, 18000,  1000,  5000], dtype=int64)
In [11]:
se2['피자']
Out[11]:
17000
In [12]:
se2['치킨']
Out[12]:
18000

Masking

In [13]:
se = pd.Series([1, 2, 3, 4])
se > 2
Out[13]:
0    False
1    False
2     True
3     True
dtype: bool
In [14]:
se[se > 2]
Out[14]:
2    3
3    4
dtype: int64
In [15]:
se[~(se > 2)]   # ~ => not (boolean negation)
Out[15]:
0    1
1    2
dtype: int64
In [16]:
se[-(se > 2)]   # unary - also flips the boolean mask here, but ~ is the idiomatic operator
Out[16]:
0    1
1    2
dtype: int64
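
  • Conditions combine with & (and) and | (or), with parentheses around each comparison (a minimal sketch):
In [ ]:
se[(se > 1) & (se < 4)]   # values greater than 1 and less than 4 -> 2, 3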

Series Slice

  • slicing by label includes the end label; slicing by integer position excludes the end position

In [18]:
### series[start:stop]
se = pd.Series([1,2,3,4], 
            index=['a','b','c','d'])
se
Out[18]:
a    1
b    2
c    3
d    4
dtype: int64
In [19]:
se['b':'c']
Out[19]:
b    2
c    3
dtype: int64
In [20]:
se = pd.Series([1,2,3,4])
se
Out[20]:
0    1
1    2
2    3
3    4
dtype: int64
In [21]:
se[0:3]
Out[21]:
0    1
1    2
2    3
dtype: int64

Sorting a Series

In [22]:
se = pd.Series([1,2,3,4], 
               index=['a','b','c','d'])
se
Out[22]:
a    1
b    2
c    3
d    4
dtype: int64
In [24]:
se.sort_values(ascending=False)
Out[24]:
d    4
c    3
b    2
a    1
dtype: int64

Broadcasting

In [25]:
se = pd.Series([1,2,3,4], 
               index=['a','b','c','d'])
se*2
Out[25]:
a    2
b    4
c    6
d    8
dtype: int64
In [26]:
se + 5
Out[26]:
a    6
b    7
c    8
d    9
dtype: int64

Creating a Series from a dictionary

In [47]:
# create a dictionary
dict_a = {
    '이름' : '홍길동',
    '직업' : '의사', 
    '취미' : '인공지능',
    '약점' : None
}
In [28]:
type(dict_a)
Out[28]:
dict
In [48]:
# map the dictionary to a Series
se3 = pd.Series(dict_a)
In [30]:
type(se3)
Out[30]:
pandas.core.series.Series
In [31]:
se3
Out[31]:
이름     홍길동
직업      의사
취미    인공지능
약점    None
dtype: object
In [32]:
se3.index
Out[32]:
Index(['이름', '직업', '취미', '약점'], dtype='object')
In [33]:
se3.values
Out[33]:
array(['홍길동', '의사', '인공지능', None], dtype=object)
In [37]:
se3.dtypes
Out[37]:
dtype('O')
In [38]:
se3['이름']
Out[38]:
'홍길동'
In [39]:
# changing a value in a Series
se3['이름'] = 'Hong' # keys are unique, so the value is overwritten
se3['이름']
Out[39]:
'Hong'
In [40]:
se3
Out[40]:
이름    Hong
직업      의사
취미    인공지능
약점    None
dtype: object

Handling missing values

  • isnull() --> fillna(), dropna()
In [41]:
se3.isnull()
Out[41]:
이름    False
직업    False
취미    False
약점     True
dtype: bool
In [49]:
se3[se3.isnull()]
Out[49]:
약점    None
dtype: object

Handling missing values - dropna(), fillna()

In [45]:
# 1) fillna : replace missing values
se3 = se3.fillna("")  # "" - an empty (null) string
In [50]:
se3   # still shows None: cells In [47]/[48] above re-created se3 after the fillna cell ran
Out[50]:
이름     홍길동
직업      의사
취미    인공지능
약점    None
dtype: object
In [51]:
# 2) dropna : 누락값 삭제하기
se3.dropna()
Out[51]:
이름     홍길동
직업      의사
취미    인공지능
dtype: object
In [52]:
se3    # unchanged - to persist the change => se3 = se3.dropna()
Out[52]:
이름     홍길동
직업      의사
취미    인공지능
약점    None
dtype: object

Worked example) handling missing values

In [53]:
import numpy as np

se4 = pd.Series([1, np.nan, 2, np.nan, 3, 
                 np.nan, 4, np.nan, 5])
se4
Out[53]:
0    1.0
1    NaN
2    2.0
3    NaN
4    3.0
5    NaN
6    4.0
7    NaN
8    5.0
dtype: float64
In [54]:
# excluding missing values 1: dropna()
se4.dropna()
Out[54]:
0    1.0
2    2.0
4    3.0
6    4.0
8    5.0
dtype: float64
In [57]:
# excluding missing values 2: notnull()
se4[se4.notnull()]
Out[57]:
0    1.0
2    2.0
4    3.0
6    4.0
8    5.0
dtype: float64
In [58]:
se4
Out[58]:
0    1.0
1    NaN
2    2.0
3    NaN
4    3.0
5    NaN
6    4.0
7    NaN
8    5.0
dtype: float64
In [59]:
# replacing missing values 1: fillna()
se4.fillna(0)
Out[59]:
0    1.0
1    0.0
2    2.0
3    0.0
4    3.0
5    0.0
6    4.0
7    0.0
8    5.0
dtype: float64
In [63]:
se4.sum(), se4.mean()   # NaN is skipped by default
Out[63]:
(15.0, 3.0)
In [64]:
# replacing missing values 2: fillna(mean())
se4.fillna(se4.mean())
Out[64]:
0    1.0
1    3.0
2    2.0
3    3.0
4    3.0
5    3.0
6    4.0
7    3.0
8    5.0
dtype: float64

2. DataFrame

  • A data structure with rows and columns
  • A DataFrame is two-dimensional
  • It has a row index (index) and column labels (columns)
  • Where a Series consists of an index and values,
  • a DataFrame
    • consists of an index, columns, and values

1) Creating a DataFrame

  • from a list
  • a Series
  • a dict
  • NumPy ndarrays
  • or another DataFrame
In [65]:
import pandas as pd
  • (1) Creating a DataFrame from a list
In [67]:
# create a DataFrame from a list
data = [
    ['1000', 'Steve', 90.72], 
    ['1001', 'James', 78.09], 
    ['1002', 'Doyeon', 98.43] 
]
df = pd.DataFrame(data)
df
Out[67]:
0 1 2
0 1000 Steve 90.72
1 1001 James 78.09
2 1002 Doyeon 98.43
In [74]:
type(data)
Out[74]:
list
  • Assigning column names (and their order) to the DataFrame
In [68]:
df = pd.DataFrame(data, columns=['학번', '이름', '점수'])
df
Out[68]:
학번 이름 점수
0 1000 Steve 90.72
1 1001 James 78.09
2 1002 Doyeon 98.43
In [69]:
df.columns  # print the columns
Out[69]:
Index(['학번', '이름', '점수'], dtype='object')
In [70]:
df.index    # print the index
Out[70]:
RangeIndex(start=0, stop=3, step=1)
In [71]:
df.values   # print the values
Out[71]:
array([['1000', 'Steve', 90.72],
       ['1001', 'James', 78.09],
       ['1002', 'Doyeon', 98.43]], dtype=object)
In [75]:
df.values.shape
Out[75]:
(3, 3)
  • (2) Creating a DataFrame from dictionaries
In [76]:
friend_dict_list = [{'name': 'Jone', 'age': 20, 'job': 'student'},
         {'name': 'Jenny', 'age': 30, 'job': 'developer'},
         {'name': 'Nate', 'age': 30, 'job': 'teacher'}]
df = pd.DataFrame(friend_dict_list)
df
Out[76]:
name age job
0 Jone 20 student
1 Jenny 30 developer
2 Nate 30 teacher
  • Creating the DataFrame with an explicit column order
In [77]:
friend_dict_list = [{'name': 'Jone', 'age': 20, 'job': 'student'},
         {'name': 'Jenny', 'age': 30, 'job': 'developer'},
         {'name': 'Nate', 'age': 30, 'job': 'teacher'}]
df = pd.DataFrame(friend_dict_list, columns = ['age', 'job','name'])
df
Out[77]:
age job name
0 20 student Jone
1 30 developer Jenny
2 30 teacher Nate

Creating a DataFrame from a dictionary

In [78]:
import pandas as pd
# create a DataFrame from a dictionary
data = {
    'id' : [1, 2, 3],
    'name' : ['Kim', 'Lee', 'Choi'],
    'age' : [10, 20, 30],
    'assets' : [150.4, 123.4, 56.6],
    'job' : ['student', 'CEO', 'Dad']               
}

df = pd.DataFrame(data)
df
Out[78]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
In [79]:
type(df) 
Out[79]:
pandas.core.frame.DataFrame
In [80]:
df.dtypes # check each column's dtype
Out[80]:
id          int64
name       object
age         int64
assets    float64
job        object
dtype: object

2) Inspecting a DataFrame

  • 1) df.head(n) - view the first n rows, df.tail(n) - view the last n rows
  • 2) select by index : df.loc[0] df.iloc[0] df[n:m]
  • 3) select by column name : df['column']
  • 4) select specific values in a column : df.loc[df.column == value, 'column']
  • (1) Viewing the first or last few rows
In [81]:
df
Out[81]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
In [82]:
df.head(2)
Out[82]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
In [84]:
df.head() # default: 5 rows
Out[84]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
In [83]:
df.tail(2)
Out[83]:
id name age assets job
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
In [85]:
df.tail() # default: 5 rows
Out[85]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
  • (2) Selecting by index
In [86]:
df
Out[86]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
In [7]:
df.loc[0]   # select the row with index label 0
Out[7]:
id              1
name          Kim
age            10
assets      150.4
job       student
Name: 0, dtype: object
In [87]:
df.iloc[0]    # select by integer position
Out[87]:
id              1
name          Kim
age            10
assets      150.4
job       student
Name: 0, dtype: object
In [88]:
df
Out[88]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
In [89]:
df.loc[0:2]    # label-based slice - the end label is included (note: df[0:2] returns only two rows)
Out[89]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
In [90]:
df.iloc[0:2]
Out[90]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
In [91]:
df
Out[91]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
In [94]:
df.loc[[0, 2]]   # selecting non-consecutive rows
Out[94]:
id name age assets job
0 1 Kim 10 150.4 student
2 3 Choi 30 56.6 Dad
In [97]:
df[0:2]
Out[97]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
  • (3) Selecting by column name
In [98]:
df
Out[98]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
In [69]:
df['id']    # select by column name (a column)
Out[69]:
0    1
1    2
2    3
Name: id, dtype: int64
In [101]:
df.columns
Out[101]:
Index(['id', 'name', 'age', 'assets', 'job'], dtype='object')
In [99]:
df.id
Out[99]:
0    1
1    2
2    3
Name: id, dtype: int64
In [100]:
df.name
Out[100]:
0     Kim
1     Lee
2    Choi
Name: name, dtype: object
In [103]:
df[['id', 'age']]
Out[103]:
id age
0 1 10
1 2 20
2 3 30
  • (4) Selecting by column value
In [104]:
df
Out[104]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
In [106]:
df.id==1
Out[106]:
0     True
1    False
2    False
Name: id, dtype: bool
In [108]:
df.loc[df.id==1]   # search by a column value
Out[108]:
id name age assets job
0 1 Kim 10 150.4 student
In [109]:
df.loc[df.name=="Kim"]
Out[109]:
id name age assets job
0 1 Kim 10 150.4 student
In [110]:
# what is the age of the person named "Kim"?
df.loc[df.name=="Kim", 'age']
Out[110]:
0    10
Name: age, dtype: int64
In [111]:
df.loc[df.name=="Kim", ['name','age']]
Out[111]:
name age
0 Kim 10
  • Filtering
In [113]:
df
Out[113]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
In [112]:
df.age > 25
Out[112]:
0    False
1    False
2     True
Name: age, dtype: bool
In [115]:
df_filtered = df[df.age > 25]
In [116]:
df_filtered
Out[116]:
id name age assets job
2 3 Choi 30 56.6 Dad
In [117]:
df_filtered = df.query('age > 25')
In [118]:
df_filtered
Out[118]:
id name age assets job
2 3 Choi 30 56.6 Dad
In [119]:
df_filtered = df[(df.age > 25) & (df.name == 'Choi')]
df_filtered
Out[119]:
id name age assets job
2 3 Choi 30 56.6 Dad
In [120]:
df
Out[120]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad

3) Filtering columns

  • 1) filter by index
  • 2) filter by column name
In [123]:
# loading data with pandas - without header=None, the first data row is (incorrectly) taken as the header
df = pd.read_csv('./data/friend_list_no_head.csv') 
df
Out[123]:
John 20 student
0 Jenny 30 developer
1 Nate 30 teacher
2 Julia 40 dentist
3 Brian 45 manager
4 Chris 25 intern
In [157]:
# loading data with pandas - when the file has no column names
df = pd.read_csv('./data/friend_list_no_head.csv', 
                 header = None, 
                 names=['name', 'age', 'job'])
df
Out[157]:
name age job
0 John 20 student
1 Jenny 30 developer
2 Nate 30 teacher
3 Julia 40 dentist
4 Brian 45 manager
5 Chris 25 intern
  • (1) Filtering by index
In [128]:
df.iloc[0:3] # filter rows only
Out[128]:
name age job
0 John 20 student
1 Jenny 30 developer
2 Nate 30 teacher
In [131]:
df.iloc[:, 0:2]   # all rows, columns 0 through 1
Out[131]:
name age
0 John 20
1 Jenny 30
2 Nate 30
3 Julia 40
4 Brian 45
5 Chris 25
In [134]:
df.iloc[:,[0,2]]  # all rows, only columns 0 and 2
Out[134]:
name job
0 John student
1 Jenny developer
2 Nate teacher
3 Julia dentist
4 Brian manager
5 Chris intern
In [142]:
df.loc[:,['name', 'job']] 
Out[142]:
name job
0 John student
1 Jenny developer
2 Nate teacher
3 Julia dentist
4 Brian manager
5 Chris intern
In [143]:
df
Out[143]:
name age job
0 John 20 student
1 Jenny 30 developer
2 Nate 30 teacher
3 Julia 40 dentist
4 Brian 45 manager
5 Chris 25 intern
  • (2) Filtering by column name
In [144]:
df_filtered = df[['name', 'age']]
df_filtered
Out[144]:
name age
0 John 20
1 Jenny 30
2 Nate 30
3 Julia 40
4 Brian 45
5 Chris 25
In [146]:
df_filtered = df.loc[:, ['name', 'age']]
df_filtered
Out[146]:
name age
0 John 20
1 Jenny 30
2 Nate 30
3 Julia 40
4 Brian 45
5 Chris 25
In [147]:
df.filter(items=['age', 'job'])
Out[147]:
age job
0 20 student
1 30 developer
2 30 teacher
3 40 dentist
4 45 manager
5 25 intern
In [148]:
df.columns
Out[148]:
Index(['name', 'age', 'job'], dtype='object')
In [29]:
df.filter(like='a',axis=1)  # columns whose name contains 'a'
Out[29]:
name age
0 John 20
1 Jenny 30
2 Nate 30
3 Julia 40
4 Brian 45
5 Chris 25
In [149]:
df.columns
Out[149]:
Index(['name', 'age', 'job'], dtype='object')
In [30]:
df.filter(regex='b$',axis=1) # regex filter - columns whose name ends with 'b'
Out[30]:
job
0 student
1 developer
2 teacher
3 dentist
4 manager
5 intern
In [150]:
df.filter(regex='^a',axis=1) # columns whose name starts with 'a'
Out[150]:
age
0 20
1 30
2 30
3 40
4 45
5 25
In [158]:
df.columns
Out[158]:
Index(['name', 'age', 'job'], dtype='object')
In [159]:
df2 = df.set_index('name')
In [153]:
df2
Out[153]:
age job
name
John 20 student
Jenny 30 developer
Nate 30 teacher
Julia 40 dentist
Brian 45 manager
Chris 25 intern
In [160]:
df2.filter(regex='^J',axis=0)
Out[160]:
age job
name
John 20 student
Jenny 30 developer
Julia 40 dentist

4) Adding to a DataFrame

  • 1) adding a column : df['column'] = value
  • 2) adding a new column derived from existing columns
  • 3) adding a row : df.loc[index] = [value1, value2, value3, ...]
In [161]:
import pandas as pd
# create a DataFrame from a dictionary
data = {
    'id' : [1, 2, 3],
    'name' : ['Kim', 'Lee', 'Choi'],
    'age' : [10, 20, 30],
    'assets' : [150.4, 123.4, 56.6],
    'job' : ['student', 'CEO', 'Dad']               
}

df = pd.DataFrame(data)
df
Out[161]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
  • (1) Adding a column
In [32]:
# add a column
df['hobby'] = '영화감상'
df
Out[32]:
id name age assets job hobby
0 1 Kim 10 150.4 student 영화감상
1 2 Lee 20 123.4 CEO 영화감상
2 3 Choi 30 56.6 Dad 영화감상
In [163]:
# add a column - a Series supplies per-row values, aligned by index label
newcol2 = pd.Series([ '서울' , None , '대구' ], 
                   index = [ 2 , 1 , 0 ])
df[ 'address' ] = newcol2
df
Out[163]:
id name age assets job address
0 1 Kim 10 150.4 student 대구
1 2 Lee 20 123.4 CEO None
2 3 Choi 30 56.6 Dad 서울
In [34]:
# add a column using NumPy
import numpy as np
df['salary'] = np.where(df['job'] != 'student' , 'yes', 'no')
df
Out[34]:
id name age assets job hobby salary
0 1 Kim 10 150.4 student 영화감상 no
1 2 Lee 20 123.4 CEO 영화감상 yes
2 3 Choi 30 56.6 Dad 영화감상 yes
In [162]:
# add a column - a Series with the default index aligns in order
newcol1 = pd.Series([ '서울' , None , '대구' ])
df['address'] = newcol1
df
Out[162]:
id name age assets job address
0 1 Kim 10 150.4 student 서울
1 2 Lee 20 123.4 CEO None
2 3 Choi 30 56.6 Dad 대구
  • (2) Adding a new column derived from existing columns
In [181]:
sungjuk = [{'name': 'John', 'midterm': 95, 'final': 85},
         {'name': 'Jenny', 'midterm': 85, 'final': 80},
         {'name': 'Nate', 'midterm': 10, 'final': 30}]
df = pd.DataFrame(sungjuk, 
                  columns = ['name', 'midterm', 'final'])
df
Out[181]:
name midterm final
0 John 95 85
1 Jenny 85 80
2 Nate 10 30
  • Creating a new column by summing two existing columns
In [166]:
 df['midterm']
Out[166]:
0    95
1    85
2    10
Name: midterm, dtype: int64
In [182]:
df['total'] = df['midterm'] + df['final']
df
Out[182]:
name midterm final total
0 John 95 85 180
1 Jenny 85 80 165
2 Nate 10 30 40
In [183]:
df['average'] = df['total'] / 2
df
Out[183]:
name midterm final total average
0 John 95 85 180 90.0
1 Jenny 85 80 165 82.5
2 Nate 10 30 40 20.0
  • Building a list of per-row conditional values and adding it as a new column (a vectorized sketch follows this example)
In [171]:
df['average']
Out[171]:
0    90.0
1    82.5
2    20.0
Name: average, dtype: float64
In [184]:
grade = []
for row in df['average']:
    if row >= 90:
        grade.append('A')
    elif row >= 80:
        grade.append('B')
    elif row >= 70:
        grade.append('C')
    else:
        grade.append('F')        
grade   
Out[184]:
['A', 'B', 'F']
In [185]:
df['grade'] = grade
In [186]:
df
Out[186]:
name midterm final total average grade
0 John 95 85 180 90.0 A
1 Jenny 85 80 165 82.5 B
2 Nate 10 30 40 20.0 F
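
  • The vectorized sketch promised above: np.select tests the conditions in order and picks the first matching label, replacing the Python loop (same cutoffs assumed):
In [ ]:
import numpy as np
conditions = [df['average'] >= 90, df['average'] >= 80, df['average'] >= 70]
df['grade'] = np.select(conditions, ['A', 'B', 'C'], default='F')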

Using the apply function

In [187]:
def pass_or_fail(row):
    print(row)
    if row != "F":
        return 'Pass'
    else:
        return 'Fail'
In [188]:
df['grade2'] = df.grade.apply(pass_or_fail)
A
B
F
In [189]:
df
Out[189]:
name midterm final total average grade grade2
0 John 95 85 180 90.0 A Pass
1 Jenny 85 80 165 82.5 B Pass
2 Nate 10 30 40 20.0 F Fail
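
  • The same pass/fail mapping fits in an inline lambda (a sketch; 'grade3' is a hypothetical column name):
In [ ]:
df['grade3'] = df.grade.apply(lambda g: 'Pass' if g != 'F' else 'Fail')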
  • (3) Adding a row
In [190]:
import pandas as pd
# create a DataFrame from a dictionary
data = {
    'id' : [1, 2, 3],
    'name' : ['Kim', 'Lee', 'Choi'],
    'age' : [10, 20, 30],
    'assets' : [150.4, 123.4, 56.6],
    'job' : ['student', 'CEO', 'Dad']               
}

df = pd.DataFrame(data)
df
Out[190]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
In [191]:
# add a row via a new index label
df.loc[3] = [4, 'Hong', 30, 786.9, 'CEO']
df
Out[191]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
3 4 Hong 30 786.9 CEO
In [192]:
# create another DataFrame and append it
df2 = pd.DataFrame([['5', 'Park', 30, 300.6, 'Mom']], 
            columns = ['id', 'name', 'age', 'assets', 'job'])
df2
Out[192]:
id name age assets job
0 5 Park 30 300.6 Mom
In [196]:
df = df.append(df2, ignore_index=True)
In [197]:
df
Out[197]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
3 4 Hong 30 786.9 CEO
4 5 Park 30 300.6 Mom
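
  • Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on a recent pandas the equivalent is pd.concat (a sketch):
In [ ]:
df = pd.concat([df, df2], ignore_index=True)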

Extra) apply usage example

In [199]:
date_list = [{'birthday': '2000-06-27'},
         {'birthday': '2002-09-24'},
         {'birthday': '2005-12-20'}]
df = pd.DataFrame(date_list, columns = ['birthday'])
df
Out[199]:
birthday
0 2000-06-27
1 2002-09-24
2 2005-12-20
In [200]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   birthday  3 non-null      object
dtypes: object(1)
memory usage: 152.0+ bytes
In [201]:
'2020-01-01'.split('-')
Out[201]:
['2020', '01', '01']
In [204]:
def extract_year(row):
    return row.split('-')[0]
In [202]:
df['birthday']
Out[202]:
0    2000-06-27
1    2002-09-24
2    2005-12-20
Name: birthday, dtype: object
In [205]:
df['year'] = df['birthday'].apply(extract_year)
In [206]:
df
Out[206]:
birthday year
0 2000-06-27 2000
1 2002-09-24 2002
2 2005-12-20 2005
In [207]:
df.dtypes
Out[207]:
birthday    object
year        object
dtype: object
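
  • The year can also be extracted without apply, using the vectorized .str accessor (a minimal sketch):
In [ ]:
df['year'] = df['birthday'].str.split('-').str[0]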

Passing a parameter to the applied function

Keyword arguments let you pass parameters through apply to the applied function.

In [213]:
def extract_year(year, current_year):   # note: despite the name, this computes an age from a year
    return current_year - int(year)
In [214]:
df['age'] = df['year'].apply(extract_year, current_year=2020)
df
Out[214]:
birthday year age
0 2000-06-27 2000 20
1 2002-09-24 2002 18
2 2005-12-20 2005 15

Passing more than one parameter to apply

Add as many keyword arguments as you need; they are all forwarded to the function.

In [216]:
def get_introduce(age, prefix, suffix):
    return prefix + str(age) + suffix
In [217]:
df['introduce'] = df['age'].apply(get_introduce, 
                    prefix="I am ", suffix=" years old")
df
Out[217]:
birthday year age introduce
0 2000-06-27 2000 20 I am 20 years old
1 2002-09-24 2002 18 I am 18 years old
2 2005-12-20 2005 15 I am 15 years old

Passing multiple columns to apply at once

Passing the keyword parameter axis=1 to apply hands each row to the function, so every column is available inside it.

In [71]:
def get_introduce2(row):
    return "I was born in "+str(row.year)+" my age is "+str(row.age)

df.introduce = df.apply(get_introduce2, axis=1)
In [72]:
df
Out[72]:
birthday year age introduce
0 2000-06-27 2000 19 I was born in 2000 my age is 19
1 2002-09-24 2002 17 I was born in 2002 my age is 17
2 2005-12-20 2005 14 I was born in 2005 my age is 14

5) Removing from a DataFrame : drop, del

1) removing a column 1 : df.drop('column', axis = 1)
2) removing a column 2 : del df['column'] => caution: the column really is deleted
3) removing a row : df.drop(index = row_index, axis = 0)
In [240]:
import pandas as pd
# create a DataFrame from a dictionary
data = {
    'id' : [1, 2, 3],
    'name' : ['Kim', 'Lee', 'Choi'],
    'age' : [10, 20, 30],
    'assets' : [150.4, 123.4, 56.6],
    'job' : ['student', 'CEO', 'Dad']               
}

df = pd.DataFrame(data)
df
Out[240]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
  • (1) Removing a column 1 : df.drop('column', axis = 1)
In [219]:
# remove a column : df.drop() 
df.drop('id', axis = 1)    # axis = 1 - the column axis
Out[219]:
name age assets job
0 Kim 10 150.4 student
1 Lee 20 123.4 CEO
2 Choi 30 56.6 Dad
In [241]:
# remove a column : df.drop() => apply immediately with inplace=True
df.drop('id', axis = 1, inplace=True) 
In [242]:
df
Out[242]:
name age assets job
0 Kim 10 150.4 student
1 Lee 20 123.4 CEO
2 Choi 30 56.6 Dad
In [229]:
# removing rows by a column value
df[df.age != 30]
Out[229]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
In [230]:
df
Out[230]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
  • (2) Removing a column 2 : del df['column'] => caution: the column really is deleted
In [231]:
df
Out[231]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
In [232]:
# remove a column : del df['column'] => caution) it is actually removed from the DataFrame
del df['id'] 
In [233]:
df
Out[233]:
name age assets job
0 Kim 10 150.4 student
1 Lee 20 123.4 CEO
2 Choi 30 56.6 Dad
  • (3) Removing a row : df.drop(index = row_index, axis = 0)
In [234]:
import pandas as pd
# create a DataFrame from a dictionary
data = {
    'id' : [1, 2, 3],
    'name' : ['Kim', 'Lee', 'Choi'],
    'age' : [10, 20, 30],
    'assets' : [150.4, 123.4, 56.6],
    'job' : ['student', 'CEO', 'Dad']               
}

df = pd.DataFrame(data)
df
Out[234]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
In [235]:
# remove a row by its index
df.drop(index=0, axis=0)  # same as df.drop(0, axis = 0); axis = 0 - the row axis
Out[235]:
id name age assets job
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
In [237]:
df
Out[237]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
In [238]:
df = df.drop(df.index[[0,2]]) # save the DataFrame with the rows removed
In [239]:
df
Out[239]:
id name age assets job
1 2 Lee 20 123.4 CEO

Saving the dropped result straight back to the DataFrame

  • With the inplace keyword there is no separate assignment; the dropped result is applied to the DataFrame itself. (The cell below assumes df has already been re-indexed by name, as done in the next section.)
In [13]:
df.drop(['Kim', 'Choi'], inplace = True)
In [14]:
df
Out[14]:
id age assets job
name
Lee 2 20 123.4 CEO

Changing the index

  • changing the index 1
In [277]:
import pandas as pd
# create a DataFrame from a dictionary
data = {
    'id' : [1, 2, 3],
    'name' : ['Kim', 'Lee', 'Choi'],
    'age' : [10, 20, 30],
    'assets' : [150.4, 123.4, 56.6],
    'job' : ['student', 'CEO', 'Dad']               
}

df = pd.DataFrame(data, index = ['Kim', 'Lee', 'Choi'])
df
Out[277]:
id name age assets job
Kim 1 Kim 10 150.4 student
Lee 2 Lee 20 123.4 CEO
Choi 3 Choi 30 56.6 Dad
  • changing the index 2
In [282]:
import pandas as pd
# create a DataFrame from a dictionary
data = {
    'id' : [1, 2, 3],
    'name' : ['Kim', 'Lee', 'Choi'],
    'age' : [10, 20, 30],
    'assets' : [150.4, 123.4, 56.6],
    'job' : ['student', 'CEO', 'Dad']               
}

df = pd.DataFrame(data, index = [1, 2, 3])
df
Out[282]:
id name age assets job
1 1 Kim 10 150.4 student
2 2 Lee 20 123.4 CEO
3 3 Choi 30 56.6 Dad
In [285]:
df.set_index('name', drop = True, inplace=True )
df
Out[285]:
id age assets job
name
Kim 1 10 150.4 student
Lee 2 20 123.4 CEO
Choi 3 30 56.6 Dad
In [286]:
# removing rows by index label
df.drop(['Kim', 'Choi'])
Out[286]:
id age assets job
name
Lee 2 20 123.4 CEO
In [287]:
df
Out[287]:
id age assets job
name
Kim 1 10 150.4 student
Lee 2 20 123.4 CEO
Choi 3 30 56.6 Dad

Handling duplicate data

  • In a DataFrame, each row holds the observed values of every attribute (variable) of one subject
  • When the same observation appears two or more times in one dataset -> find and remove the duplicates
  • Duplicate records of the same subject -> can distort analysis results

1) Detecting duplicates - duplicated()

  • The duplicated() method : returns a boolean Series marking whether each row is a duplicate
    • a row that repeats an earlier row -> returns True
    • a row appearing for the first time -> returns False

(1) Creating a DataFrame with duplicate data

In [291]:
import pandas as pd
# build a DataFrame containing duplicates
df = pd.DataFrame({'c1': ['a', 'a', 'b', 'a', 'b'],
                  'c2': [1, 1, 1, 2, 2],
                  'c3': [1, 1, 2, 3, 3]})
df
Out[291]:
c1 c2 c3
0 a 1 1
1 a 1 1
2 b 1 2
3 a 2 3
4 b 2 3

(2) Finding duplicates across entire rows

In [292]:
df.duplicated()
Out[292]:
0    False
1     True
2    False
3    False
4    False
dtype: bool
In [14]:
df_dup = df.duplicated()
print(df_dup)
0    False
1     True
2    False
3    False
4    False
dtype: bool

(3) Finding duplicates within a specific column

In [293]:
df
Out[293]:
c1 c2 c3
0 a 1 1
1 a 1 1
2 b 1 2
3 a 2 3
4 b 2 3
In [294]:
df['c2'].duplicated()
Out[294]:
0    False
1     True
2     True
3    False
4     True
Name: c2, dtype: bool
In [15]:
col_dup = df['c2'].duplicated()
print(col_dup)
0    False
1     True
2     True
3    False
4     True
Name: c2, dtype: bool

2) Removing duplicates - drop_duplicates()

  • The drop_duplicates() method : removes duplicate rows, leaving only rows with unique observations.
  • Add the inplace=True option to modify the original object.

(1) Creating a DataFrame that contains duplicates

In [295]:
import pandas as pd

# build a DataFrame containing duplicates
df = pd.DataFrame({'c1': ['a', 'a', 'b', 'a', 'b'],
                  'c2': [1, 1, 1, 2, 2],
                  'c3': [1, 1, 2, 3, 3]})

print(df)
  c1  c2  c3
0  a   1   1
1  a   1   1
2  b   1   2
3  a   2   3
4  b   2   3

(2) Removing duplicate rows from the DataFrame

In [296]:
df.drop_duplicates()
Out[296]:
c1 c2 c3
0 a 1 1
2 b 1 2
3 a 2 3
4 b 2 3
In [17]:
df2 = df.drop_duplicates()
print(df2)
  c1  c2  c3
0  a   1   1
2  b   1   2
3  a   2   3
4  b   2   3

(3) Passing a list of column names to the subset option of drop_duplicates()

  • Duplicates are then judged using only the columns named in subset.
  • Judged on columns c2 and c3 of df, row 1 duplicates row 0 and row 4 duplicates row 3 -> rows 1 and 4 are removed.
In [297]:
df.drop_duplicates(subset=['c2'])
Out[297]:
c1 c2 c3
0 a 1 1
3 a 2 3
In [298]:
df
Out[298]:
c1 c2 c3
0 a 1 1
1 a 1 1
2 b 1 2
3 a 2 3
4 b 2 3
In [299]:
df.drop_duplicates(subset=['c2', 'c3'])
Out[299]:
c1 c2 c3
0 a 1 1
2 b 1 2
3 a 2 3
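
  • By default the first occurrence survives; keep='last' keeps the last occurrence instead, and keep=False drops every duplicated row (a minimal sketch):
In [ ]:
df.drop_duplicates(subset=['c2', 'c3'], keep='last')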

Handling categorical data - binning

Binning

  • Continuous data is often analyzed more effectively when divided into a set of intervals (bins)
  • Continuous values such as price, cost, or efficiency are re-expressed as discrete levels that show differences between intervals
  • Splitting a continuous variable into intervals and converting each interval into a categorical value -> binning
  • Binning uses the pandas cut() function

Problem) The horsepower column holds engine power; instead of the raw numbers, bin it into levels such as 저출력 (low), 보통출력 (normal), and 고출력 (high)

(1) Loading the data and naming the columns

In [300]:
import pandas as pd
import numpy as np

# create df with read_csv()
df = pd.read_csv('./data/auto-mpg.csv', header=None)

# assign column names
df.columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 
              'weight', 'acceleration', 'model year', 
              'origin', 'name']

df.head()
Out[300]:
mpg cylinders displacement horsepower weight acceleration model year origin name
0 18.0 8 307.0 130.0 3504.0 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693.0 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150.0 3436.0 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150.0 3433.0 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140.0 3449.0 10.5 70 1 ford torino
In [303]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    float64
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   name          398 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB
In [307]:
df['horsepower'].unique()
Out[307]:
array(['130.0', '165.0', '150.0', '140.0', '198.0', '220.0', '215.0',
       '225.0', '190.0', '170.0', '160.0', '95.00', '97.00', '85.00',
       '88.00', '46.00', '87.00', '90.00', '113.0', '200.0', '210.0',
       '193.0', '?', '100.0', '105.0', '175.0', '153.0', '180.0', '110.0',
       '72.00', '86.00', '70.00', '76.00', '65.00', '69.00', '60.00',
       '80.00', '54.00', '208.0', '155.0', '112.0', '92.00', '145.0',
       '137.0', '158.0', '167.0', '94.00', '107.0', '230.0', '49.00',
       '75.00', '91.00', '122.0', '67.00', '83.00', '78.00', '52.00',
       '61.00', '93.00', '148.0', '129.0', '96.00', '71.00', '98.00',
       '115.0', '53.00', '81.00', '79.00', '120.0', '152.0', '102.0',
       '108.0', '68.00', '58.00', '149.0', '89.00', '63.00', '48.00',
       '66.00', '139.0', '103.0', '125.0', '133.0', '138.0', '135.0',
       '142.0', '77.00', '62.00', '132.0', '84.00', '64.00', '74.00',
       '116.0', '82.00'], dtype=object)

(2) Dropping missing horsepower values and converting the column to float

In [308]:
df['horsepower'].replace('?', np.nan, inplace=True)      # replace '?' with np.nan
In [309]:
df['horsepower'].unique()
Out[309]:
array(['130.0', '165.0', '150.0', '140.0', '198.0', '220.0', '215.0',
       '225.0', '190.0', '170.0', '160.0', '95.00', '97.00', '85.00',
       '88.00', '46.00', '87.00', '90.00', '113.0', '200.0', '210.0',
       '193.0', nan, '100.0', '105.0', '175.0', '153.0', '180.0', '110.0',
       '72.00', '86.00', '70.00', '76.00', '65.00', '69.00', '60.00',
       '80.00', '54.00', '208.0', '155.0', '112.0', '92.00', '145.0',
       '137.0', '158.0', '167.0', '94.00', '107.0', '230.0', '49.00',
       '75.00', '91.00', '122.0', '67.00', '83.00', '78.00', '52.00',
       '61.00', '93.00', '148.0', '129.0', '96.00', '71.00', '98.00',
       '115.0', '53.00', '81.00', '79.00', '120.0', '152.0', '102.0',
       '108.0', '68.00', '58.00', '149.0', '89.00', '63.00', '48.00',
       '66.00', '139.0', '103.0', '125.0', '133.0', '138.0', '135.0',
       '142.0', '77.00', '62.00', '132.0', '84.00', '64.00', '74.00',
       '116.0', '82.00'], dtype=object)
In [310]:
df.dropna(subset=['horsepower'], axis=0, inplace=True)   # drop rows with missing data
In [311]:
df['horsepower'] = df['horsepower'].astype('float')       # convert strings to float
In [312]:
df['horsepower'].dtypes
Out[312]:
dtype('float64')
In [314]:
df['horsepower'].describe() # descriptive statistics
Out[314]:
count    392.000000
mean     104.469388
std       38.491160
min       46.000000
25%       75.000000
50%       93.500000
75%      126.000000
max      230.000000
Name: horsepower, dtype: float64
In [315]:
df.describe()
Out[315]:
mpg cylinders displacement horsepower weight acceleration model year origin
count 392.000000 392.000000 392.000000 392.000000 392.000000 392.000000 392.000000 392.000000
mean 23.445918 5.471939 194.411990 104.469388 2977.584184 15.541327 75.979592 1.576531
std 7.805007 1.705783 104.644004 38.491160 849.402560 2.758864 3.683737 0.805518
min 9.000000 3.000000 68.000000 46.000000 1613.000000 8.000000 70.000000 1.000000
25% 17.000000 4.000000 105.000000 75.000000 2225.250000 13.775000 73.000000 1.000000
50% 22.750000 4.000000 151.000000 93.500000 2803.500000 15.500000 76.000000 1.000000
75% 29.000000 8.000000 275.750000 126.000000 3614.750000 17.025000 79.000000 2.000000
max 46.600000 8.000000 455.000000 230.000000 5140.000000 24.800000 82.000000 3.000000

(3) Getting the list of edges that split the data into 3 bins with np.histogram

  • np.histogram() : given the number of bins via the bins option, returns the count of values in each bin and the list of bin edges (bin_dividers)
In [319]:
df['horsepower'].hist(bins=3)
Out[319]:
<AxesSubplot:>
In [320]:
count, bin_dividers = np.histogram(df['horsepower'], bins=3)  # 4 edges are produced, defining 3 bins
print(bin_dividers)
[ 46.         107.33333333 168.66666667 230.        ]

(4) Naming the 3 bins

In [321]:
bin_names = ['저출력', '보통출력', '고출력']

(5) Assigning each value to one of the 3 bins with pd.cut()

In [322]:
df['hp_bin'] = pd.cut(x=df['horsepower'],   # data array
                     bins=bin_dividers,     # list of bin edges
                     labels=bin_names,       # bin labels
                     include_lowest=True)   # include the first edge
In [324]:
df[['horsepower', 'hp_bin']]
Out[324]:
horsepower hp_bin
0 130.0 보통출력
1 165.0 보통출력
2 150.0 보통출력
3 150.0 보통출력
4 140.0 보통출력
... ... ...
393 86.0 저출력
394 52.0 저출력
395 84.0 저출력
396 79.0 저출력
397 82.0 저출력

392 rows × 2 columns

(6) Printing the horsepower and hp_bin columns

In [25]:
print(df[['horsepower', 'hp_bin']].head(20))
    horsepower hp_bin
0        130.0   보통출력
1        165.0   보통출력
2        150.0   보통출력
3        150.0   보통출력
4        140.0   보통출력
5        198.0    고출력
6        220.0    고출력
7        215.0    고출력
8        225.0    고출력
9        190.0    고출력
10       170.0    고출력
11       160.0   보통출력
12       150.0   보통출력
13       225.0    고출력
14        95.0    저출력
15        95.0    저출력
16        97.0    저출력
17        85.0    저출력
18        88.0    저출력
19        46.0    저출력

Worked example) binning with user-defined intervals

In [325]:
df[['horsepower']].describe()
Out[325]:
horsepower
count 392.000000
mean 104.469388
std 38.491160
min 46.000000
25% 75.000000
50% 93.500000
75% 126.000000
max 230.000000
In [326]:
bin_dividers = np.array([0., 100., 126., 231.])
bin_dividers 
Out[326]:
array([  0., 100., 126., 231.])
In [327]:
bin_names = ['저출력', '보통출력', '고출력']
In [29]:
df['hp_bin'] = pd.cut(x=df['horsepower'],   # data array
                     bins=bin_dividers,     # list of bin edges
                     labels=bin_names,       # bin labels
                     include_lowest=True)   # include the first edge
In [30]:
print(df[['horsepower', 'hp_bin']].head(20))
    horsepower hp_bin
0        130.0    고출력
1        165.0    고출력
2        150.0    고출력
3        150.0    고출력
4        140.0    고출력
5        198.0    고출력
6        220.0    고출력
7        215.0    고출력
8        225.0    고출력
9        190.0    고출력
10       170.0    고출력
11       160.0    고출력
12       150.0    고출력
13       225.0    고출력
14        95.0    저출력
15        95.0    저출력
16        97.0    저출력
17        85.0    저출력
18        88.0    저출력
19        46.0    저출력
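
  • pd.cut makes equal-width or user-defined bins; for bins holding roughly equal numbers of rows, pd.qcut splits at quantiles instead (a sketch; 'hp_qbin' is a hypothetical column name):
In [ ]:
df['hp_qbin'] = pd.qcut(df['horsepower'], q=3, labels=bin_names)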

6) Group by

  • Grouping rows in order to extract information per group
In [359]:
student_list = [{'name': 'John', 'major': "Computer Science", 'sex': "male"},
                {'name': 'Nate', 'major': "Computer Science", 'sex': "male"},
                {'name': 'Abraham', 'major': "Physics", 'sex': "male"},
                {'name': 'Brian', 'major': "Psychology", 'sex': "male"},
                {'name': 'Janny', 'major': "Economics", 'sex': "female"},
                {'name': 'Yuna', 'major': "Economics", 'sex': "female"},
                {'name': 'Jeniffer', 'major': "Computer Science", 'sex': "female"},
                {'name': 'Edward', 'major': "Computer Science", 'sex': "male"},
                {'name': 'Zara', 'major': "Psychology", 'sex': "female"},
                {'name': 'Wendy', 'major': "Economics", 'sex': "female"},
                {'name': 'Sera', 'major': "Psychology", 'sex': "female"}
         ]
df = pd.DataFrame(student_list, columns = ['name', 'major', 'sex'])
df
Out[359]:
name major sex
0 John Computer Science male
1 Nate Computer Science male
2 Abraham Physics male
3 Brian Psychology male
4 Janny Economics female
5 Yuna Economics female
6 Jeniffer Computer Science female
7 Edward Computer Science male
8 Zara Psychology female
9 Wendy Economics female
10 Sera Psychology female

Grouping by major

In [361]:
df.groupby(by='major')['name'].size()
Out[361]:
major
Computer Science    4
Economics           3
Physics             1
Psychology          3
Name: name, dtype: int64
In [363]:
df.groupby(by='sex')['name'].size()
Out[363]:
sex
female    6
male      5
Name: name, dtype: int64
In [364]:
df_major = pd.DataFrame(df.groupby('major')['name'].size())
df_major
Out[364]:
name
major
Computer Science 4
Economics 3
Physics 1
Psychology 3
In [365]:
df_major.columns 
Out[365]:
Index(['name'], dtype='object')
In [366]:
df_major.columns = ['count']
In [367]:
df_major
Out[367]:
count
major
Computer Science 4
Economics 3
Physics 1
Psychology 3
In [368]:
df_major.plot(kind='bar')
Out[368]:
<AxesSubplot:xlabel='major'>
In [353]:
df_major.plot(kind='barh')
Out[353]:
<AxesSubplot:ylabel='major'>
In [369]:
df_major.plot(kind='box')
Out[369]:
<AxesSubplot:>
In [370]:
df_major.index
Out[370]:
Index(['Computer Science', 'Economics', 'Physics', 'Psychology'], dtype='object', name='major')
In [371]:
df_major.columns
Out[371]:
Index(['count'], dtype='object')
In [372]:
df_major.loc['Economics']
Out[372]:
count    3
Name: Economics, dtype: int64

Groups by major

In [373]:
groupby_major = df.groupby('major')
In [374]:
groupby_major.groups
Out[374]:
{'Computer Science': Int64Index([0, 1, 6, 7], dtype='int64'),
 'Economics': Int64Index([4, 5, 9], dtype='int64'),
 'Physics': Int64Index([2], dtype='int64'),
 'Psychology': Int64Index([3, 8, 10], dtype='int64')}
In [133]:
for name, group in groupby_major:
    print(name + ": " + str(len(group)))
    print(group)
    print()
Computer Science: 4
       name             major     sex
0      John  Computer Science    male
1      Nate  Computer Science    male
6  Jeniffer  Computer Science  female
7    Edward  Computer Science    male

Economics: 3
    name      major     sex
4  Janny  Economics  female
5   Yuna  Economics  female
9  Wendy  Economics  female

Physics: 1
      name    major   sex
2  Abraham  Physics  male

Psychology: 3
     name       major     sex
3   Brian  Psychology    male
8    Zara  Psychology  female
10   Sera  Psychology  female

  • Turning the group object back into a DataFrame
In [375]:
df_major_cnt = pd.DataFrame({'count' : groupby_major.size()}).reset_index()
df_major_cnt
Out[375]:
major count
0 Computer Science 4
1 Economics 3
2 Physics 1
3 Psychology 3
  • Grouping by sex
In [376]:
groupby_sex = df.groupby('sex')
In [377]:
for name, group in groupby_sex:
    print(name + ": " + str(len(group)))
    print(group)
    print()
female: 6
        name             major     sex
4      Janny         Economics  female
5       Yuna         Economics  female
6   Jeniffer  Computer Science  female
8       Zara        Psychology  female
9      Wendy         Economics  female
10      Sera        Psychology  female

male: 5
      name             major   sex
0     John  Computer Science  male
1     Nate  Computer Science  male
2  Abraham           Physics  male
3    Brian        Psychology  male
7   Edward  Computer Science  male

  • Turning the group object back into a DataFrame
In [378]:
df_sex_cnt = pd.DataFrame({'count' : groupby_sex.size()}).reset_index()
df_sex_cnt
Out[378]:
sex count
0 female 6
1 male 5

Extra) groupby

In [379]:
import seaborn as sns

df = sns.load_dataset("mpg")
df.shape
Out[379]:
(398, 9)
In [380]:
df.head(1)
Out[380]:
mpg cylinders displacement horsepower weight acceleration model_year origin name
0 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu
In [386]:
df['origin'].value_counts()
Out[386]:
usa       249
japan      79
europe     70
Name: origin, dtype: int64
In [387]:
# row count per origin (size counts rows, like value_counts)
df.groupby(by="origin")['cylinders'].size()
Out[387]:
origin
europe     70
japan      79
usa       249
Name: cylinders, dtype: int64
In [268]:
df2 = pd.DataFrame(df.groupby(by="origin")['cylinders'].size())
df2
Out[268]:
cylinders
origin
europe 70
japan 79
usa 249
In [262]:
df['origin'].value_counts()
Out[262]:
usa       249
japan      79
europe     70
Name: origin, dtype: int64
In [392]:
df3 = pd.DataFrame(df.groupby(['origin'])['cylinders'].mean())
df3
Out[392]:
cylinders
origin
europe 4.157143
japan 4.101266
usa 6.248996
In [393]:
df3 = pd.DataFrame(df.groupby(['model_year', 'origin'])['cylinders'].mean())
In [394]:
df3
Out[394]:
cylinders
model_year origin
70 europe 4.000000
japan 4.000000
usa 7.636364
71 europe 4.000000
japan 4.000000
usa 6.200000
72 europe 4.000000
japan 3.800000
usa 6.888889
73 europe 4.000000
japan 4.250000
usa 7.241379
74 europe 4.000000
japan 4.000000
usa 6.266667
75 europe 4.000000
japan 4.000000
usa 6.400000
76 europe 4.250000
japan 4.500000
usa 6.363636
77 europe 4.000000
japan 4.166667
usa 6.222222
78 europe 4.833333
japan 4.000000
usa 6.000000
79 europe 4.250000
japan 4.000000
usa 6.260870
80 europe 4.111111
japan 4.076923
usa 4.285714
81 europe 4.500000
japan 4.333333
usa 4.923077
82 europe 4.000000
japan 4.000000
usa 4.300000
In [395]:
df3.index
Out[395]:
MultiIndex([(70, 'europe'),
            (70,  'japan'),
            (70,    'usa'),
            (71, 'europe'),
            (71,  'japan'),
            (71,    'usa'),
            (72, 'europe'),
            (72,  'japan'),
            (72,    'usa'),
            (73, 'europe'),
            (73,  'japan'),
            (73,    'usa'),
            (74, 'europe'),
            (74,  'japan'),
            (74,    'usa'),
            (75, 'europe'),
            (75,  'japan'),
            (75,    'usa'),
            (76, 'europe'),
            (76,  'japan'),
            (76,    'usa'),
            (77, 'europe'),
            (77,  'japan'),
            (77,    'usa'),
            (78, 'europe'),
            (78,  'japan'),
            (78,    'usa'),
            (79, 'europe'),
            (79,  'japan'),
            (79,    'usa'),
            (80, 'europe'),
            (80,  'japan'),
            (80,    'usa'),
            (81, 'europe'),
            (81,  'japan'),
            (81,    'usa'),
            (82, 'europe'),
            (82,  'japan'),
            (82,    'usa')],
           names=['model_year', 'origin'])
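
  • Several aggregations can be computed per group in one call with agg (a minimal sketch on the same mpg DataFrame):
In [ ]:
df.groupby('origin')['mpg'].agg(['mean', 'min', 'max'])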

7) Handling missing values : dealing with None

In [396]:
school_id_list = [{'name': 'John', 'job': "teacher", 'age': 40},
                {'name': 'Nate', 'job': "teacher", 'age': 35},
                {'name': 'Yuna', 'job': "teacher", 'age': 37},
                {'name': 'Abraham', 'job': "student", 'age': 10},
                {'name': 'Brian', 'job': "student", 'age': 12},
                {'name': 'Janny', 'job': "student", 'age': 11},
                {'name': 'Nate', 'job': "teacher", 'age': None},
                {'name': 'John', 'job': "student", 'age': None}
         ]
df = pd.DataFrame(school_id_list, columns = ['name', 'job', 'age'])
df
Out[396]:
name job age
0 John teacher 40.0
1 Nate teacher 35.0
2 Yuna teacher 37.0
3 Abraham student 10.0
4 Brian student 12.0
5 Janny student 11.0
6 Nate teacher NaN
7 John student NaN

1) Checking for null or NaN

In [397]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    8 non-null      object 
 1   job     8 non-null      object 
 2   age     6 non-null      float64
dtypes: float64(1), object(2)
memory usage: 320.0+ bytes
In [400]:
df.isna()
Out[400]:
name job age
0 False False False
1 False False False
2 False False False
3 False False False
4 False False False
5 False False False
6 False False True
7 False False True
In [401]:
df.isnull()
Out[401]:
name job age
0 False False False
1 False False False
2 False False False
3 False False False
4 False False False
5 False False False
6 False False True
7 False False True

2) Replacing null or NaN values

  • replacing with 0
In [402]:
# replace with 0
tmp = df.copy()
tmp["age"] = tmp["age"].fillna(0)
tmp
Out[402]:
name job age
0 John teacher 40.0
1 Nate teacher 35.0
2 Yuna teacher 37.0
3 Abraham student 10.0
4 Brian student 12.0
5 Janny student 11.0
6 Nate teacher 0.0
7 John student 0.0
  • replacing with the overall mean age
In [403]:
school_id_list = [{'name': 'John', 'job': "teacher", 'age': 40},
                {'name': 'Nate', 'job': "teacher", 'age': 35},
                {'name': 'Yuna', 'job': "teacher", 'age': 37},
                {'name': 'Abraham', 'job': "student", 'age': 10},
                {'name': 'Brian', 'job': "student", 'age': 12},
                {'name': 'Janny', 'job': "student", 'age': 11},
                {'name': 'Nate', 'job': "teacher", 'age': None},
                {'name': 'John', 'job': "student", 'age': None}
         ]
df = pd.DataFrame(school_id_list, columns = ['name', 'job', 'age'])
df
Out[403]:
name job age
0 John teacher 40.0
1 Nate teacher 35.0
2 Yuna teacher 37.0
3 Abraham student 10.0
4 Brian student 12.0
5 Janny student 11.0
6 Nate teacher NaN
7 John student NaN
In [404]:
df.isna()
Out[404]:
name job age
0 False False False
1 False False False
2 False False False
3 False False False
4 False False False
5 False False False
6 False False True
7 False False True
In [405]:
tmp = df.copy()
tmp["age"].mean()
Out[405]:
24.166666666666668
In [406]:
tmp = df.copy()
tmp["age"] = tmp["age"].fillna(tmp["age"].mean())
tmp
Out[406]:
name job age
0 John teacher 40.000000
1 Nate teacher 35.000000
2 Yuna teacher 37.000000
3 Abraham student 10.000000
4 Brian student 12.000000
5 Janny student 11.000000
6 Nate teacher 24.166667
7 John student 24.166667
In [407]:
school_id_list = [{'name': 'John', 'job': "teacher", 'age': 40},
                {'name': 'Nate', 'job': "teacher", 'age': 35},
                {'name': 'Yuna', 'job': "teacher", 'age': 37},
                {'name': 'Abraham', 'job': "student", 'age': 10},
                {'name': 'Brian', 'job': "student", 'age': 12},
                {'name': 'Janny', 'job': "student", 'age': 11},
                {'name': 'Nate', 'job': "teacher", 'age': None},
                {'name': 'John', 'job': "student", 'age': None}
         ]
df = pd.DataFrame(school_id_list, columns = ['name', 'job', 'age'])
df
Out[407]:
name job age
0 John teacher 40.0
1 Nate teacher 35.0
2 Yuna teacher 37.0
3 Abraham student 10.0
4 Brian student 12.0
5 Janny student 11.0
6 Nate teacher NaN
7 John student NaN
  • Rather than filling with 0, replace each null with the value appropriate to its job group: the teachers' median age for teachers, the students' median age for students.
In [414]:
df.groupby("job")["age"].median()
Out[414]:
job
student    11.0
teacher    37.0
Name: age, dtype: float64
In [411]:
df.groupby("job")["age"].transform("median")
Out[411]:
0    37.0
1    37.0
2    37.0
3    11.0
4    11.0
5    11.0
6    37.0
7    11.0
Name: age, dtype: float64
In [412]:
df["age"].fillna(df.groupby("job")["age"].transform("median"), inplace=True)
In [413]:
df
Out[413]:
name job age
0 John teacher 40.0
1 Nate teacher 35.0
2 Yuna teacher 37.0
3 Abraham student 10.0
4 Brian student 12.0
5 Janny student 11.0
6 Nate teacher 37.0
7 John student 11.0
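
  • Note (an assumption about newer pandas releases): fillna(..., inplace=True) on a single selected column is a chained assignment and draws warnings under copy-on-write; plain reassignment is the safer equivalent (a sketch):
In [ ]:
df["age"] = df["age"].fillna(df.groupby("job")["age"].transform("median"))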

8) Transposing a DataFrame

In [125]:
import pandas as pd
# create a DataFrame from a dictionary
data = {
    'id' : [1, 2, 3],
    'name' : ['Kim', 'Lee', 'Choi'],
    'age' : [10, 20, 30],
    'assets' : [150.4, 123.4, 56.6],
    'job' : ['student', 'CEO', 'Dad']               
}

df = pd.DataFrame(data)
df
Out[125]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
In [126]:
df.T
Out[126]:
0 1 2
id 1 2 3
name Kim Lee Choi
age 10 20 30
assets 150.4 123.4 56.6
job student CEO Dad

9) Extracting column values from a DataFrame

In [127]:
df.values  # extract all values - as a NumPy array
Out[127]:
array([[1, 'Kim', 10, 150.4, 'student'],
       [2, 'Lee', 20, 123.4, 'CEO'],
       [3, 'Choi', 30, 56.6, 'Dad']], dtype=object)
In [128]:
# extract the values of one column
df['name'].values
Out[128]:
array(['Kim', 'Lee', 'Choi'], dtype=object)
In [129]:
# select a single value by row and column
df.iloc[1]['name']
Out[129]:
'Lee'

10) Sorting data

1) Sorting a Series

In [88]:
import numpy as np
import pandas as pd

se = pd.Series(range(10), 
               index=[10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
se.index.name = "index"
se
Out[88]:
index
10    0
9     1
8     2
7     3
6     4
5     5
4     6
3     7
2     8
1     9
dtype: int64
In [89]:
se = se.sort_index()  # ascending sort
se
Out[89]:
index
1     9
2     8
3     7
4     6
5     5
6     4
7     3
8     2
9     1
10    0
dtype: int64
In [90]:
se = se.sort_index(ascending = False)  # descending sort
se
Out[90]:
index
10    0
9     1
8     2
7     3
6     4
5     5
4     6
3     7
2     8
1     9
dtype: int64

2) Sorting a DataFrame

In [91]:
# create the DataFrame
data = {
    'id' : [1, 2, 3],
    'name' : ['Kim', 'Lee', 'Choi'],
    'age' : [10, 20, 30],
    'assets' : [150.4, 123.4, 56.6],
    'job' : ['student', 'CEO', 'Dad']               
}

df = pd.DataFrame(data)
df
Out[91]:
id name age assets job
0 1 Kim 10 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
In [92]:
# sort by age in descending order
df.sort_values(by='age', ascending=False)   # note: the old sort_index(by=...) API has been removed; use sort_values
Out[92]:
id name age assets job
2 3 Choi 30 56.6 Dad
1 2 Lee 20 123.4 CEO
0 1 Kim 10 150.4 student
In [93]:
# sort by assets in ascending order
df.sort_values(by='assets', ascending=True)
Out[93]:
id name age assets job
2 3 Choi 30 56.6 Dad
1 2 Lee 20 123.4 CEO
0 1 Kim 10 150.4 student
In [94]:
# change Kim's age to 30
df.loc[df.name=='Kim', 'age'] = 30
df
Out[94]:
id name age assets job
0 1 Kim 30 150.4 student
1 2 Lee 20 123.4 CEO
2 3 Choi 30 56.6 Dad
In [95]:
# sort by age descending, then by assets ascending
df.sort_values(by=['age', 'assets'], ascending=[False, True])
Out[95]:
id name age assets job
2 3 Choi 30 56.6 Dad
0 1 Kim 30 150.4 student
1 2 Lee 20 123.4 CEO

11) Combining multiple DataFrames

  • 1) pd.concat() : stack DataFrames as new rows with axis=0, or join them as new columns with axis=1
  • 2) append() : stack DataFrames as new rows
  • 3) merge() : JOIN DataFrames (inner or outer) on a column
In [161]:
l1 = [{'name': 'John', 'job': "teacher"},
      {'name': 'Nate', 'job': "student"},
      {'name': 'Fred', 'job': "developer"}]

l2 = [{'name': 'Ed', 'job': "dentist"},
      {'name': 'Jack', 'job': "farmer"},
      {'name': 'Ted', 'job': "designer"}]
         
df1 = pd.DataFrame(l1, columns = ['name', 'job'])
df2 = pd.DataFrame(l2, columns = ['name', 'job'])

(1) pd.concat

  • Appends the second DataFrame to the first as new rows.
In [163]:
frames = [df1, df2]
result = pd.concat(frames, ignore_index=True)
result
Out[163]:
name job
0 John teacher
1 Nate student
2 Fred developer
3 Ed dentist
4 Jack farmer
5 Ted designer
  • Joins the second DataFrame to the first as new columns. (The output below still shows the earlier df1/df2 contents: the redefining cell above was never executed.)
In [ ]:
l1 = [{'name': 'John', 'job': "teacher"},
      {'name': 'Nate', 'job': "student"},
      {'name': 'Jack', 'job': "developer"}]

l2 = [{'age': 25, 'country': "U.S"},
      {'age': 30, 'country': "U.K"},
      {'age': 45, 'country': "Korea"}]
         
df1 = pd.DataFrame(l1, columns = ['name', 'job'])
df2 = pd.DataFrame(l2, columns = ['age', 'country'])
In [168]:
result = pd.concat([df1, df2], axis=1, ignore_index=True)
result
Out[168]:
0 1 2 3
0 John teacher Ed dentist
1 Nate student Jack farmer
2 Fred developer Ted designer

(2) append

  • Appends the second DataFrame to the first as new rows.
In [166]:
l1 = [{'name': 'John', 'job': "teacher"},
      {'name': 'Nate', 'job': "student"},
      {'name': 'Fred', 'job': "developer"}]

l2 = [{'name': 'Ed', 'job': "dentist"},
      {'name': 'Jack', 'job': "farmer"},
      {'name': 'Ted', 'job': "designer"}]
         
df1 = pd.DataFrame(l1, columns = ['name', 'job'])
df2 = pd.DataFrame(l2, columns = ['name', 'job'])
In [167]:
result = df1.append(df2, ignore_index=True)
result
Out[167]:
name job
0 John teacher
1 Nate student
2 Fred developer
3 Ed dentist
4 Jack farmer
5 Ted designer

(3) The merge() function : used to JOIN DataFrames (inner or outer) on a column

In [169]:
import pandas as pd

member_df = pd.DataFrame({
    '고객번호': [1001, 1002, 1003, 1004, 1005, 1006, 1007],
    '이름': ['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF', 'GGG']
}, columns=['고객번호', '이름'])
print(member_df)


transaction_df = pd.DataFrame({
    '고객번호': [1001, 1001, 1005, 1006, 1008, 1001],
    '금액': [10000, 20000, 15000, 5000, 100000, 30000]
}, columns=['고객번호', '금액'])
print(transaction_df)

inner_result = pd.merge(member_df, transaction_df, how = 'inner')
print(inner_result)
   고객번호   이름
0  1001  AAA
1  1002  BBB
2  1003  CCC
3  1004  DDD
4  1005  EEE
5  1006  FFF
6  1007  GGG
   고객번호      금액
0  1001   10000
1  1001   20000
2  1005   15000
3  1006    5000
4  1008  100000
5  1001   30000
   고객번호   이름     금액
0  1001  AAA  10000
1  1001  AAA  20000
2  1001  AAA  30000
3  1005  EEE  15000
4  1006  FFF   5000
In [293]:
inner_result = pd.merge(member_df, transaction_df, how = 'left')
print(inner_result)
   고객번호   이름       금액
0  1001  AAA  10000.0
1  1001  AAA  20000.0
2  1001  AAA  30000.0
3  1002  BBB      NaN
4  1003  CCC      NaN
5  1004  DDD      NaN
6  1005  EEE  15000.0
7  1006  FFF   5000.0
8  1007  GGG      NaN
In [294]:
inner_result = pd.merge(member_df, transaction_df, how = 'right')
print(inner_result)
   고객번호   이름      금액
0  1001  AAA   10000
1  1001  AAA   20000
2  1001  AAA   30000
3  1005  EEE   15000
4  1006  FFF    5000
5  1008  NaN  100000
In [295]:
inner_result = pd.merge(member_df, transaction_df, how = 'outer')
print(inner_result)
   고객번호   이름        금액
0  1001  AAA   10000.0
1  1001  AAA   20000.0
2  1001  AAA   30000.0
3  1002  BBB       NaN
4  1003  CCC       NaN
5  1004  DDD       NaN
6  1005  EEE   15000.0
7  1006  FFF    5000.0
8  1007  GGG       NaN
9  1008  NaN  100000.0
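
  • With no key specified, merge joins on the column the two frames share ('고객번호' here); the on and how parameters make the join explicit (a minimal sketch):
In [ ]:
pd.merge(member_df, transaction_df, on='고객번호', how='inner')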

12) Basic operations on data

(1) Series arithmetic

In [96]:
#### Series arithmetic
fruit1 = pd.Series([ 5 , 9 , 10  ], 
                index = [ 'apple' , 'banana' , 'cherry'])
fruit2 = pd.Series([ 3 , 2 , 9 , 5 ], 
                index = [ 'apple' , 'orange' , 'banana', 'mango'])
print (fruit1, '\n')
print (fruit2, '\n')
fruit1 + fruit2
apple      5
banana     9
cherry    10
dtype: int64 

apple     3
orange    2
banana    9
mango     5
dtype: int64 

Out[96]:
apple      8.0
banana    18.0
cherry     NaN
mango      NaN
orange     NaN
dtype: float64
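
  • Labels present in only one of the two Series come out as NaN; the add method's fill_value substitutes a default for the missing side (a sketch; the same method exists on DataFrames):
In [ ]:
fruit1.add(fruit2, fill_value=0)   # cherry, orange, mango are now computed against 0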

(2) DataFrame arithmetic

In [97]:
#### DataFrame arithmetic
fruitData1 = { 'Ohio' : [ 4 , 8 , 3], 
              'Texas' : [ 0 , 1 , 2]}
fruitFrame1 = pd.DataFrame(fruitData1, 
                        columns = [ 'Ohio' , 'Texas' ], 
                        index = [ 'apple' , 'banana' , 'cherry'])
fruitData2 = { 'Ohio' : [ 3 , 0 , 2, 7], 
              
              'Colorado' : [ 5 , 4 , 3 , 6]}
fruitFrame2 = pd.DataFrame(fruitData2, 
                        columns = [ 'Ohio' , 'Colorado' ], 
                        index = [ 'apple' , 'orange' , 'banana' , 'mango' ])
print (fruitFrame1)
print (fruitFrame2)
fruitFrame1 + fruitFrame2
        Ohio  Texas
apple      4      0
banana     8      1
cherry     3      2
        Ohio  Colorado
apple      3         5
orange     0         4
banana     2         3
mango      7         6
Out[97]:
Colorado Ohio Texas
apple NaN 7.0 NaN
banana NaN 10.0 NaN
cherry NaN NaN NaN
mango NaN NaN NaN
orange NaN NaN NaN

13) Changing the data type of a DataFrame column

  • astype() : changes the dtype of a Series and returns a new Series
  • astype('float64'), astype('float32'), astype('float16')
  • pd.to_numeric(), pd.to_datetime(), pd.to_timedelta()
In [98]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 
                   'B': [1.0, 2.0, 3.0], 
                   'C': ['1.1.2019', '2.1.2019', '3.1.2019'], 
                   'D': ['1 days', '2 days', '3 days'],
                   'E': ['1', '2', '3']})

df
Out[98]:
A B C D E
0 1 1.0 1.1.2019 1 days 1
1 2 2.0 2.1.2019 2 days 2
2 3 3.0 3.1.2019 3 days 3
In [99]:
df.dtypes # check the column dtypes
Out[99]:
A      int64
B    float64
C     object
D     object
E     object
dtype: object

1) Changing the dtype

  • Series.astype('float')
  • Series.astype('int')
In [100]:
# convert column A to float, and column B to int
df['A'].astype('float')
Out[100]:
0    1.0
1    2.0
2    3.0
Name: A, dtype: float64
In [101]:
df['B'].astype('int')
Out[101]:
0    1
1    2
2    3
Name: B, dtype: int32

2) Converting to numeric : pd.to_numeric()

  • When the input cannot be converted to a number, pd.to_numeric raises an error by default
  • The errors parameter of pd.to_numeric changes that behavior
In [102]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 
                   'B': [1.0, 2.0, 3.0], 
                   'C': ['1.1.2019', '2.1.2019', '3.1.2019'], 
                   'D': ['1 days', '2 days', '3 days'],
                   'E': ['1', '2', '3']})

df
Out[102]:
A B C D E
0 1 1.0 1.1.2019 1 days 1
1 2 2.0 2.1.2019 2 days 2
2 3 3.0 3.1.2019 3 days 3
In [103]:
pd.to_numeric(df['E'])
Out[103]:
0    1
1    2
2    3
Name: E, dtype: int64
In [104]:
# Ignore the error, return the original input if it cannot be converted
pd.to_numeric(pd.Series(['1', '2', 'a']), errors='ignore')
Out[104]:
0    1
1    2
2    a
dtype: object
  • If needed, you can flag every row whose value cannot be converted to a number
  • use errors='coerce' together with isnull() for boolean indexing
In [105]:
df = pd.DataFrame({'A': [1, 'x', 'z'],
                           'B': [1.0, 2.0, 3.0],
                           'C': [True, False, True]})
df
Out[105]:
A B C
0 1 1.0 True
1 x 2.0 False
2 z 3.0 True
In [106]:
pd.to_numeric(df.A, errors='coerce').isnull()
Out[106]:
0    False
1     True
2     True
Name: A, dtype: bool
In [107]:
df[pd.to_numeric(df.A, errors='coerce').isnull()]
Out[107]:
A B C
1 x 2.0 False
2 z 3.0 True

3) Converting to datetime : to_datetime()

In [108]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0], 
                           'C': ['1.1.2019', '2.1.2019', '3.1.2019'], 
                           'D': ['1 days', '2 days', '3 days'],
                           'E': ['1', '2', '3']})

df
Out[108]:
A B C D E
0 1 1.0 1.1.2019 1 days 1
1 2 2.0 2.1.2019 2 days 2
2 3 3.0 3.1.2019 3 days 3
In [109]:
pd.to_datetime(df['C'])
Out[109]:
0   2019-01-01
1   2019-02-01
2   2019-03-01
Name: C, dtype: datetime64[ns]
  • '2.1.2019' is parsed as February 1, 2019 by default
  • to read it as January 2, 2019 instead, pass the dayfirst parameter
In [110]:
pd.to_datetime('2.1.2011', dayfirst=True)
Out[110]:
Timestamp('2011-01-02 00:00:00')
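
  • When the text layout is known, an explicit format string removes the ambiguity altogether (a minimal sketch, reading day.month.year):
In [ ]:
pd.to_datetime(df['C'], format='%d.%m.%Y')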
  • converting the type to timedelta
In [111]:
pd.to_timedelta(df['D'])
Out[111]:
0   1 days
1   2 days
2   3 days
Name: D, dtype: timedelta64[ns]

14) File I/O with pandas

1) Saving a DataFrame to a CSV file

  • creating the DataFrame from a list of lists
In [178]:
friend_list = [ ['John', 20, 'student'],['Jenny', 30, 'developer'],['Nate', 30, 'teacher'] ]
column_name = ['name', 'age', 'job']
df = pd.DataFrame.from_records(friend_list, columns=column_name)
df
Out[178]:
    name  age        job
0   John   20    student
1  Jenny   30  developer
2   Nate   30    teacher
  • Save to a file with the to_csv function
In [179]:
df.to_csv('friend_list.csv')
  • Saving a DataFrame that has no header to a CSV file
In [180]:
friend_list = [ ['John', 20, 'student'],
               ['Jenny', 30, 'developer'],
               ['Nate', 30, 'teacher'] ]
df = pd.DataFrame.from_records(friend_list)
df
Out[180]:
       0   1          2
0   John  20    student
1  Jenny  30  developer
2   Nate  30    teacher
In [181]:
df.to_csv('friend_list2.csv', header = False, index = False)
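  • Note that a file saved without a header, such as friend_list2.csv above, should be read back with header=None (optionally supplying names) so its first data row is not mistaken for column names; a minimal sketch:
In [ ]:
# read a headerless CSV, providing the column names ourselves
pd.read_csv('friend_list2.csv', header=None, names=['name', 'age', 'job'])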
  • To write header information, pass the column names via the header keyword
In [182]:
df.to_csv('friend_list3.csv', header = ['name', 'age', 'job'])
  • Saving data that contains missing (None) values
In [187]:
friend_list = { 
                'name': ['John', None, 'nate'],
                'age': [20,None,30],
                'job':['student', 'developer', 'teacher'] 
               }
df = pd.DataFrame(friend_list)
df
Out[187]:
   name   age        job
0  John  20.0    student
1  None   NaN  developer
2  nate  30.0    teacher
In [188]:
df.to_csv('friend_list4.csv')
  • Use na_rep to save missing values as a replacement string of your choice
In [189]:
df.to_csv('friend_list5.csv', na_rep = '-')
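  • When reading such a file back, the same marker can be declared as missing via na_values; for example:
In [ ]:
# treat '-' as NaN while loading
pd.read_csv('friend_list5.csv', na_values=['-'])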

2) Loading a CSV file

In [190]:
import pandas as pd
data_frame = pd.read_csv('./data/friend_list.csv')
data_frame.head()
Out[190]:
    name  age        job
0   John   20    student
1  Jenny   30  developer
2   Nate   30    teacher
3  Julia   40    dentist
4  Brian   45    manager

3) Reading an Excel file

In [191]:
# load an Excel file into a DataFrame
# pip install xlrd
df2 = pd.read_excel('./data/friend_list.xlsx')
df2
Out[191]:
    name  age        job
0   John   20    student
1  Jenny   30  developer
2   Nate   30    teacher
3  Julia   40    dentist
4  Brian   45    manager
5  Chris   25     intern

4) Saving to an Excel file

In [196]:
# save to Excel
df2.to_excel('./data/friend_list2.xlsx', 'sheet1')   
In [198]:
# save to Excel, this time without the row index
df2.to_excel('./data/friend_list2.xlsx', 'sheet1', index=False)   
In [199]:
# round trip: pandas -> Excel -> pandas
df3 = pd.read_excel('./data/friend_list2.xlsx', 'sheet1')
df3
Out[199]:
    name  age        job
0   John   20    student
1  Jenny   30  developer
2   Nate   30    teacher
3  Julia   40    dentist
4  Brian   45    manager
5  Chris   25     intern
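  • To put several DataFrames into one workbook, pandas offers pd.ExcelWriter; a minimal sketch (the file name and sheet names here are arbitrary):
In [ ]:
# write two sheets into a single Excel file
with pd.ExcelWriter('./data/friend_list3.xlsx') as writer:
    df2.to_excel(writer, sheet_name='friends', index=False)
    df3.to_excel(writer, sheet_name='friends_copy', index=False)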

5) Reading HTML: loading tabular data from the web

In [122]:
# read every <table> element in the HTML page
df_list = pd.read_html('https://www.fdic.gov/bank/individual/failed/banklist.html')
df_list
Out[122]:
[                                             Bank Name                City  \
 0                                 The Enloe State Bank              Cooper   
 1                  Washington Federal Bank for Savings             Chicago   
 2      The Farmers and Merchants State Bank of Argonia             Argonia   
 3                                  Fayette County Bank          Saint Elmo   
 4    Guaranty Bank, (d/b/a BestBank in Georgia & Mi...           Milwaukee   
 5                                       First NBC Bank         New Orleans   
 6                                        Proficio Bank  Cottonwood Heights   
 7                        Seaway Bank and Trust Company             Chicago   
 8                               Harvest Community Bank          Pennsville   
 9                                          Allied Bank            Mulberry   
 10                        The Woodbury Banking Company            Woodbury   
 11                              First CornerStone Bank     King of Prussia   
 12                                  Trust Company Bank             Memphis   
 13                          North Milwaukee State Bank           Milwaukee   
 14                              Hometown National Bank            Longview   
 15                                 The Bank of Georgia      Peachtree City   
 16                                        Premier Bank              Denver   
 17                                      Edgebrook Bank             Chicago   
 18                              Doral Bank  En Español            San Juan   
 19                   Capitol City Bank & Trust Company             Atlanta   
 20                             Highland Community Bank             Chicago   
 21                    First National Bank of Crestview           Crestview   
 22                                  Northern Star Bank             Mankato   
 23              Frontier Bank, FSB D/B/A El Paseo Bank         Palm Desert   
 24               The National Republic Bank of Chicago             Chicago   
 25                                      NBRS Financial          Rising Sun   
 26                               GreenChoice Bank, fsb             Chicago   
 27                            Eastside Commercial Bank             Conyers   
 28                              The Freedom State Bank             Freedom   
 29                                         Valley Bank     Fort Lauderdale   
 ..                                                 ...                 ...   
 526                                  ANB Financial, NA         Bentonville   
 527                                          Hume Bank                Hume   
 528                             Douglass National Bank         Kansas City   
 529                                  Miami Valley Bank            Lakeview   
 530                                            NetBank          Alpharetta   
 531                          Metropolitan Savings Bank          Pittsburgh   
 532                                    Bank of Ephraim             Ephraim   
 533                                      Reliance Bank        White Plains   
 534              Guaranty National Bank of Tallahassee         Tallahassee   
 535                                Dollar Savings Bank              Newark   
 536                               Pulaski Savings Bank        Philadelphia   
 537              First National Bank of Blanchardville      Blanchardville   
 538                              Southern Pacific Bank            Torrance   
 539                        Farmers Bank of Cheneyville         Cheneyville   
 540                                      Bank of Alamo               Alamo   
 541             AmTrade International Bank  En Español             Atlanta   
 542                     Universal Federal Savings Bank             Chicago   
 543                       Connecticut Bank of Commerce            Stamford   
 544                                   New Century Bank     Shelby Township   
 545                              Net 1st National Bank          Boca Raton   
 546                                       NextBank, NA             Phoenix   
 547                           Oakwood Deposit Bank Co.             Oakwood   
 548                              Bank of Sierra Blanca       Sierra Blanca   
 549                      Hamilton Bank, NA  En Español               Miami   
 550                             Sinclair National Bank            Gravette   
 551                                 Superior Bank, FSB            Hinsdale   
 552                                Malta National Bank               Malta   
 553                    First Alliance Bank & Trust Co.          Manchester   
 554                  National State Bank of Metropolis          Metropolis   
 555                                   Bank of Honolulu            Honolulu   
 
      ST   CERT                Acquiring Institution        Closing Date  \
 0    TX  10716                   Legend Bank, N. A.        May 31, 2019   
 1    IL  30570                   Royal Savings Bank   December 15, 2017   
 2    KS  17719                          Conway Bank    October 13, 2017   
 3    IL   1802            United Fidelity Bank, fsb        May 26, 2017   
 4    WI  30003  First-Citizens Bank & Trust Company         May 5, 2017   
 5    LA  58302                         Whitney Bank      April 28, 2017   
 6    UT  35495                    Cache Valley Bank       March 3, 2017   
 7    IL  19328                  State Bank of Texas    January 27, 2017   
 8    NJ  34951  First-Citizens Bank & Trust Company    January 13, 2017   
 9    AR     91                         Today's Bank  September 23, 2016   
 10   GA  11297                          United Bank     August 19, 2016   
 11   PA  35312  First-Citizens Bank & Trust Company         May 6, 2016   
 12   TN   9956           The Bank of Fayette County      April 29, 2016   
 13   WI  20364  First-Citizens Bank & Trust Company      March 11, 2016   
 14   WA  35156                       Twin City Bank     October 2, 2015   
 15   GA  35259                        Fidelity Bank     October 2, 2015   
 16   CO  34112            United Fidelity Bank, fsb       July 10, 2015   
 17   IL  57772             Republic Bank of Chicago         May 8, 2015   
 18   PR  32102         Banco Popular de Puerto Rico   February 27, 2015   
 19   GA  33938  First-Citizens Bank & Trust Company   February 13, 2015   
 20   IL  20290            United Fidelity Bank, fsb    January 23, 2015   
 21   FL  17557                       First NBC Bank    January 16, 2015   
 22   MN  34983                            BankVista   December 19, 2014   
 23   CA  34738    Bank of Southern California, N.A.    November 7, 2014   
 24   IL    916                  State Bank of Texas    October 24, 2014   
 25   MD   4862                          Howard Bank    October 17, 2014   
 26   IL  28462                 Providence Bank, LLC       July 25, 2014   
 27   GA  58125            Community & Southern Bank       July 18, 2014   
 28   OK  12483      Alva State Bank & Trust Company       June 27, 2014   
 29   FL  21793  Landmark Bank, National Association       June 20, 2014   
 ..   ..    ...                                  ...                 ...   
 526  AR  33901       Pulaski Bank and Trust Company         May 9, 2008   
 527  MO   1971                        Security Bank       March 7, 2008   
 528  MO  24660       Liberty Bank and Trust Company    January 25, 2008   
 529  OH  16848         The Citizens Banking Company     October 4, 2007   
 530  GA  32575                           ING DIRECT  September 28, 2007   
 531  PA  35353  Allegheny Valley Bank of Pittsburgh    February 2, 2007   
 532  UT   1249                        Far West Bank       June 25, 2004   
 533  NY  26778                     Union State Bank      March 19, 2004   
 534  FL  26838              Hancock Bank of Florida      March 12, 2004   
 535  NJ  31330                          No Acquirer   February 14, 2004   
 536  PA  27203                       Earthstar Bank   November 14, 2003   
 537  WI  11639                        The Park Bank         May 9, 2003   
 538  CA  27094                            Beal Bank    February 7, 2003   
 539  LA  16445            Sabine State Bank & Trust   December 17, 2002   
 540  TN   9961                          No Acquirer    November 8, 2002   
 541  GA  33784                          No Acquirer  September 30, 2002   
 542  IL  29355               Chicago Community Bank       June 27, 2002   
 543  CT  19183                   Hudson United Bank       June 26, 2002   
 544  MI  34979                          No Acquirer      March 28, 2002   
 545  FL  26652                       Bank Leumi USA       March 1, 2002   
 546  AZ  22314                          No Acquirer    February 7, 2002   
 547  OH   8966       The State Bank & Trust Company    February 1, 2002   
 548  TX  22002     The Security State Bank of Pecos    January 18, 2002   
 549  FL  24382     Israel Discount Bank of New York    January 11, 2002   
 550  AR  34248                   Delta Trust & Bank   September 7, 2001   
 551  IL  32646                Superior Federal, FSB       July 27, 2001   
 552  OH   6629                    North Valley Bank         May 3, 2001   
 553  NH  34264  Southern New Hampshire Bank & Trust    February 2, 2001   
 554  IL   3815              Banterra Bank of Marion   December 14, 2000   
 555  HI  21029                   Bank of the Orient    October 13, 2000   
 
            Updated Date  
 0       August 22, 2019  
 1         July 24, 2019  
 2       August 12, 2019  
 3      January 29, 2019  
 4        March 22, 2018  
 5      January 29, 2019  
 6      January 29, 2019  
 7      January 29, 2019  
 8    September 20, 2019  
 9          May 13, 2019  
 10    December 13, 2018  
 11    November 13, 2018  
 12   September 14, 2018  
 13     January 29, 2019  
 14    February 19, 2018  
 15         July 9, 2018  
 16    February 20, 2018  
 17     January 29, 2019  
 18     January 29, 2019  
 19     January 29, 2019  
 20    November 15, 2017  
 21    November 15, 2017  
 22      January 3, 2018  
 23    November 10, 2016  
 24      January 6, 2016  
 25     January 29, 2019  
 26    December 12, 2016  
 27      October 6, 2017  
 28    February 21, 2018  
 29     January 29, 2019  
 ..                  ...  
 526    February 1, 2019  
 527    January 31, 2019  
 528    October 26, 2012  
 529  September 12, 2016  
 530    January 31, 2019  
 531    October 27, 2010  
 532       April 9, 2008  
 533       April 9, 2008  
 534      April 17, 2018  
 535       April 9, 2008  
 536     October 6, 2017  
 537        June 5, 2012  
 538    October 20, 2008  
 539    October 20, 2004  
 540      March 18, 2005  
 541  September 11, 2006  
 542     October 6, 2017  
 543   February 14, 2012  
 544      March 18, 2005  
 545       April 9, 2008  
 546    February 5, 2015  
 547    October 25, 2012  
 548    November 6, 2003  
 549  September 21, 2015  
 550     October 6, 2017  
 551     August 19, 2014  
 552   November 18, 2002  
 553   February 18, 2003  
 554      March 17, 2005  
 555      March 17, 2005  
 
 [556 rows x 7 columns]]
In [123]:
# check how many tables were found
len(df_list) # 1 table on this page
Out[123]:
1
In [124]:
df_list[0] # inspect the single table
Out[124]:
Bank Name City ST CERT Acquiring Institution Closing Date Updated Date
0 The Enloe State Bank Cooper TX 10716 Legend Bank, N. A. May 31, 2019 August 22, 2019
1 Washington Federal Bank for Savings Chicago IL 30570 Royal Savings Bank December 15, 2017 July 24, 2019
2 The Farmers and Merchants State Bank of Argonia Argonia KS 17719 Conway Bank October 13, 2017 August 12, 2019
3 Fayette County Bank Saint Elmo IL 1802 United Fidelity Bank, fsb May 26, 2017 January 29, 2019
4 Guaranty Bank, (d/b/a BestBank in Georgia & Mi... Milwaukee WI 30003 First-Citizens Bank & Trust Company May 5, 2017 March 22, 2018
5 First NBC Bank New Orleans LA 58302 Whitney Bank April 28, 2017 January 29, 2019
6 Proficio Bank Cottonwood Heights UT 35495 Cache Valley Bank March 3, 2017 January 29, 2019
7 Seaway Bank and Trust Company Chicago IL 19328 State Bank of Texas January 27, 2017 January 29, 2019
8 Harvest Community Bank Pennsville NJ 34951 First-Citizens Bank & Trust Company January 13, 2017 September 20, 2019
9 Allied Bank Mulberry AR 91 Today's Bank September 23, 2016 May 13, 2019
10 The Woodbury Banking Company Woodbury GA 11297 United Bank August 19, 2016 December 13, 2018
11 First CornerStone Bank King of Prussia PA 35312 First-Citizens Bank & Trust Company May 6, 2016 November 13, 2018
12 Trust Company Bank Memphis TN 9956 The Bank of Fayette County April 29, 2016 September 14, 2018
13 North Milwaukee State Bank Milwaukee WI 20364 First-Citizens Bank & Trust Company March 11, 2016 January 29, 2019
14 Hometown National Bank Longview WA 35156 Twin City Bank October 2, 2015 February 19, 2018
15 The Bank of Georgia Peachtree City GA 35259 Fidelity Bank October 2, 2015 July 9, 2018
16 Premier Bank Denver CO 34112 United Fidelity Bank, fsb July 10, 2015 February 20, 2018
17 Edgebrook Bank Chicago IL 57772 Republic Bank of Chicago May 8, 2015 January 29, 2019
18 Doral Bank En Español San Juan PR 32102 Banco Popular de Puerto Rico February 27, 2015 January 29, 2019
19 Capitol City Bank & Trust Company Atlanta GA 33938 First-Citizens Bank & Trust Company February 13, 2015 January 29, 2019
20 Highland Community Bank Chicago IL 20290 United Fidelity Bank, fsb January 23, 2015 November 15, 2017
21 First National Bank of Crestview Crestview FL 17557 First NBC Bank January 16, 2015 November 15, 2017
22 Northern Star Bank Mankato MN 34983 BankVista December 19, 2014 January 3, 2018
23 Frontier Bank, FSB D/B/A El Paseo Bank Palm Desert CA 34738 Bank of Southern California, N.A. November 7, 2014 November 10, 2016
24 The National Republic Bank of Chicago Chicago IL 916 State Bank of Texas October 24, 2014 January 6, 2016
25 NBRS Financial Rising Sun MD 4862 Howard Bank October 17, 2014 January 29, 2019
26 GreenChoice Bank, fsb Chicago IL 28462 Providence Bank, LLC July 25, 2014 December 12, 2016
27 Eastside Commercial Bank Conyers GA 58125 Community & Southern Bank July 18, 2014 October 6, 2017
28 The Freedom State Bank Freedom OK 12483 Alva State Bank & Trust Company June 27, 2014 February 21, 2018
29 Valley Bank Fort Lauderdale FL 21793 Landmark Bank, National Association June 20, 2014 January 29, 2019
... ... ... ... ... ... ... ...
526 ANB Financial, NA Bentonville AR 33901 Pulaski Bank and Trust Company May 9, 2008 February 1, 2019
527 Hume Bank Hume MO 1971 Security Bank March 7, 2008 January 31, 2019
528 Douglass National Bank Kansas City MO 24660 Liberty Bank and Trust Company January 25, 2008 October 26, 2012
529 Miami Valley Bank Lakeview OH 16848 The Citizens Banking Company October 4, 2007 September 12, 2016
530 NetBank Alpharetta GA 32575 ING DIRECT September 28, 2007 January 31, 2019
531 Metropolitan Savings Bank Pittsburgh PA 35353 Allegheny Valley Bank of Pittsburgh February 2, 2007 October 27, 2010
532 Bank of Ephraim Ephraim UT 1249 Far West Bank June 25, 2004 April 9, 2008
533 Reliance Bank White Plains NY 26778 Union State Bank March 19, 2004 April 9, 2008
534 Guaranty National Bank of Tallahassee Tallahassee FL 26838 Hancock Bank of Florida March 12, 2004 April 17, 2018
535 Dollar Savings Bank Newark NJ 31330 No Acquirer February 14, 2004 April 9, 2008
536 Pulaski Savings Bank Philadelphia PA 27203 Earthstar Bank November 14, 2003 October 6, 2017
537 First National Bank of Blanchardville Blanchardville WI 11639 The Park Bank May 9, 2003 June 5, 2012
538 Southern Pacific Bank Torrance CA 27094 Beal Bank February 7, 2003 October 20, 2008
539 Farmers Bank of Cheneyville Cheneyville LA 16445 Sabine State Bank & Trust December 17, 2002 October 20, 2004
540 Bank of Alamo Alamo TN 9961 No Acquirer November 8, 2002 March 18, 2005
541 AmTrade International Bank En Español Atlanta GA 33784 No Acquirer September 30, 2002 September 11, 2006
542 Universal Federal Savings Bank Chicago IL 29355 Chicago Community Bank June 27, 2002 October 6, 2017
543 Connecticut Bank of Commerce Stamford CT 19183 Hudson United Bank June 26, 2002 February 14, 2012
544 New Century Bank Shelby Township MI 34979 No Acquirer March 28, 2002 March 18, 2005
545 Net 1st National Bank Boca Raton FL 26652 Bank Leumi USA March 1, 2002 April 9, 2008
546 NextBank, NA Phoenix AZ 22314 No Acquirer February 7, 2002 February 5, 2015
547 Oakwood Deposit Bank Co. Oakwood OH 8966 The State Bank & Trust Company February 1, 2002 October 25, 2012
548 Bank of Sierra Blanca Sierra Blanca TX 22002 The Security State Bank of Pecos January 18, 2002 November 6, 2003
549 Hamilton Bank, NA En Español Miami FL 24382 Israel Discount Bank of New York January 11, 2002 September 21, 2015
550 Sinclair National Bank Gravette AR 34248 Delta Trust & Bank September 7, 2001 October 6, 2017
551 Superior Bank, FSB Hinsdale IL 32646 Superior Federal, FSB July 27, 2001 August 19, 2014
552 Malta National Bank Malta OH 6629 North Valley Bank May 3, 2001 November 18, 2002
553 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New Hampshire Bank & Trust February 2, 2001 February 18, 2003
554 National State Bank of Metropolis Metropolis IL 3815 Banterra Bank of Marion December 14, 2000 March 17, 2005
555 Bank of Honolulu Honolulu HI 21029 Bank of the Orient October 13, 2000 March 17, 2005

556 rows × 7 columns
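  • When a page contains several tables, read_html's match parameter (a regular expression tested against each table's text) narrows the result; a sketch against the same URL:
In [ ]:
# keep only tables whose text matches the pattern
df_list = pd.read_html('https://www.fdic.gov/bank/individual/failed/banklist.html',
                       match='Bank Name')
len(df_list)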

14) Using Pandas' built-in plotting tools

  • Syntax: DataFrame.plot(kind="chart type"); common matplotlib options such as title, figsize, and grid can also be passed as keyword arguments, as sketched below
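  • For instance, a minimal sketch with a toy frame (column names and values are arbitrary):
In [ ]:
import pandas as pd

demo = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 1, 2]})
demo.plot(kind='line', title='demo', figsize=(6, 3), grid=True)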

1) Line plot

In [31]:
import pandas as pd

df = pd.read_excel('./data/남북한발전전력량.xlsx')  # load into a DataFrame

df_ns = df.iloc[[0, 5], 3:]            # extract only the South/North total generation rows
df_ns.index = ['South','North']        # rename the row index
df_ns.columns = df_ns.columns.map(int) # convert the column labels to int
print(df_ns.head())
       1991  1992  1993  1994  1995  1996  1997  1998  1999  2000  ...  2007  \
South  1186  1310  1444  1650  1847  2055  2244  2153  2393  2664  ...  4031   
North   263   247   221   231   230   213   193   170   186   194  ...   236   

       2008  2009  2010  2011  2012  2013  2014  2015  2016  
South  4224  4336  4747  4969  5096  5171  5220  5281  5404  
North   255   235   237   211   215   221   216   190   239  

[2 rows x 26 columns]
In [32]:
# draw a line plot
df_ns.plot()
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f5916dd088>
In [33]:
# transpose rows and columns, then plot again
tdf_ns = df_ns.T
print(tdf_ns.head())
     South North
1991  1186   263
1992  1310   247
1993  1444   221
1994  1650   231
1995  1847   230
In [34]:
tdf_ns.plot()
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f591f41a48>

2) Bar plot

In [36]:
import pandas as pd

df = pd.read_excel('./data/남북한발전전력량.xlsx')  # load into a DataFrame

df_ns = df.iloc[[0, 5], 3:]            # extract only the South/North total generation rows
df_ns.index = ['South','North']        # rename the row index
df_ns.columns = df_ns.columns.map(int) # convert the column labels to int

# transpose rows and columns, then draw a bar plot
tdf_ns = df_ns.T
print(tdf_ns.head())
     South North
1991  1186   263
1992  1310   247
1993  1444   221
1994  1650   231
1995  1847   230
In [37]:
tdf_ns.plot(kind='bar')
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f591fa7b08>

3) Histogram

In [39]:
import pandas as pd

df = pd.read_excel('./data/남북한발전전력량.xlsx')  # load into a DataFrame

df_ns = df.iloc[[0, 5], 3:]            # extract only the South/North total generation rows
df_ns.index = ['South','North']        # rename the row index
df_ns.columns = df_ns.columns.map(int) # convert the column labels to int

# transpose rows and columns, then draw a histogram
tdf_ns = df_ns.T
tdf_ns.plot(kind='hist')
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f58fedd1c8>

4) Scatter plot

In [42]:
import pandas as pd

# create the df with read_csv()
df = pd.read_csv('./data/auto-mpg.csv', header=None)

# assign column names
df.columns = ['mpg','cylinders','displacement','horsepower','weight',
              'acceleration','model year','origin','name']

# pick two columns and draw a scatter plot
df.plot(x='weight',y='mpg', kind='scatter')
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f593200bc8>

5) Box plot

In [44]:
import pandas as pd

# create the df with read_csv()
df = pd.read_csv('./data/auto-mpg.csv', header=None)

# assign column names
df.columns = ['mpg','cylinders','displacement','horsepower','weight',
              'acceleration','model year','origin','name']

# pick columns and draw a box plot
df[['mpg','cylinders']].plot(kind='box')
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f593279808>
In [45]:
df.plot(kind='box')
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f5932c7b48>
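  • Inside Jupyter these figures render inline; in a plain Python script, matplotlib must be told to display them explicitly:
In [ ]:
import matplotlib.pyplot as plt

df.plot(kind='box')
plt.show()   # needed outside Jupyter to open the figure window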