Brazilian_Ecommerce_EDA¶

1. 사전 준비¶

데이터¶

  • 브라질에서 가장 큰 백화점의 이커머스 쇼핑몰 (https://olist.com/solucoes/distribuidoras-e-lojas-de-bebidas/)

    • 2016년도부터 2018년도 100k 개의 구매 데이터 정보
    • 구매 상태, 가격, 지불수단, 물류 관련, 리뷰관련, 상품 정보, 구매자 지역 관련 정보
  • Kaggle Dataset(https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce/)

  • 데이터셋 (dataset)

    • olist_customers_dataset.csv
    • olist_geolocation_dataset.csv
    • olist_order_items_dataset.csv
    • olist_order_payments_dataset.csv
    • olist_order_reviews_dataset.csv
    • olist_orders_dataset.csv
    • olist_sellers_dataset.csv
    • olist_products_dataset.csv

일반적으로 현업에서는 하이브(HIVE), MySQL(SQL), mongodb(NoSQL) 등 데이터를 저장/관리하는 데이터베이스에서 SQL 과 유사한 형태의 명령을 통해 데이터셋을 만드는 것이 일반적임

탐색적 데이터 분석: 1. 데이터의 출처와 주제에 대해 이해

전체 판매 프로세스¶

  1. 해당 쇼핑몰에 중소업체가 계약을 맺고
  2. 중소업체가 해당 쇼핑몰에 직접 상품을 올리고
  3. 고객이 구매하면, 중소업체가 Olist가 제공하는 물류 파트너를 활용해서 배송을 하고,
  4. 고객이 상품을 받으면, 고객에게 이메일 survey 가 전송되고,
  5. 고객이 이메일 survey 에 별점과 커멘트를 남겨서 제출하게 됨

데이터 출처¶

  • 브라질에서 가장 큰 백화점의 이커머스 쇼핑몰 (https://olist.com/)
    • 2016년도부터 2018년도 100k 개의 구매 데이터 정보
    • 구매 상태, 가격, 지불수단, 물류 관련, 리뷰관련, 상품 정보, 구매자 지역 관련 정보

주요 질문(탐색하고자 하는 질문 리스트)¶

  • 얼마나 많은 고객이 있는가?
  • 고객은 어디에 주로 사는가?
  • 고객은 주로 어떤 지불방법을 사용하는가?
  • 평균 거래액은 얼마일까?
  • 일별, 주별, 월별 판매 트렌드는?
  • 어떤 카테고리가 가장 많은 상품이 팔렸을까?
  • 평균 배송 시간은?
탐색적 데이터 분석: 2. 데이터의 크기 확인
탐색적 데이터 분석: 3. 데이터 구성 요소(feature)의 속성(특징) 확인
  • 수치형 데이터일 경우에는 다음과 같이 EDA 5 수치 + 평균(mean) 확인

    • 최소값(minimum), 제1사분위수, 중간값(mediam)=제2사분위수, 제3사분위수, 최대값(maximum) + 평균(mean) 확인
    • 특잇값(outlier) 확인
    • 필요하면 boxplot 과 histogram 그려보기
  • 범주형 데이터일 경우에는 각 수준별 갯수 세기

    • 필요하면 절대 빈도(bar 그래프), 상대 빈도(원 그래프) 그려보기
  • 시계열 데이터일 경우에는 필요하면 line 또는 bar 그래프 그리기

  • feature 간 상관관계 분석이 필요할 경우에는 heatmap 또는 scatter 그래프 그리기

시각화를 위해 데이터 조작이 필요하므로, 가볍게 각 데이터만 확인

2. 데이터 불러오기¶

In [1]:
import pandas as pd
PATH = "C:/Users/Admin/Data Analysis/Ecommerce\Dataset/"
In [2]:
products = pd.read_csv(PATH + "olist_products_dataset.csv", encoding='utf-8-sig')
customers = pd.read_csv(PATH + "olist_customers_dataset.csv", encoding='utf-8-sig')
geolocation = pd.read_csv(PATH + "olist_geolocation_dataset.csv", encoding='utf-8-sig')
order_items = pd.read_csv(PATH + "olist_order_items_dataset.csv", encoding='utf-8-sig')
payments = pd.read_csv(PATH + "olist_order_payments_dataset.csv", encoding='utf-8-sig')
reviews = pd.read_csv(PATH + "olist_order_reviews_dataset.csv", encoding='utf-8-sig')
orders = pd.read_csv(PATH + "olist_orders_dataset.csv", encoding='utf-8-sig')
sellers = pd.read_csv(PATH + "olist_sellers_dataset.csv", encoding='utf-8-sig')
category_name = pd.read_csv(PATH + "product_category_name_translation.csv", encoding='utf-8-sig')

3. EDA¶

  • head()
  • info()
  • describe()
  • shape

3-1. 얼마나 많은 고객이 있는가?¶

In [22]:
customers.head()
Out[22]:
customer_id customer_unique_id customer_zip_code_prefix customer_city customer_state
0 06b8999e2fba1a1fbc88172c00ba8bc7 861eff4711a542e4b93843c6dd7febb0 14409 franca SP
1 18955e83d337fd6b2def6b18a428ac77 290c77bc529b7ac935b93aa66c333dc3 9790 sao bernardo do campo SP
2 4e7b3e00288586ebd08712fdd0374a03 060e732b5b29e8181a18229c7b0b2b5e 1151 sao paulo SP
3 b2b6027bc5c5109e529d4dc6358b12c3 259dac757896d24d7702b9acbbff3f3c 8775 mogi das cruzes SP
4 4f2d8ab171c80ec8364f7c12e35b23ad 345ecd01c38d18a9036ed96c73b8d066 13056 campinas SP
In [23]:
customers.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   customer_id               99441 non-null  object
 1   customer_unique_id        99441 non-null  object
 2   customer_zip_code_prefix  99441 non-null  int64 
 3   customer_city             99441 non-null  object
 4   customer_state            99441 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.8+ MB
In [24]:
customers['customer_id'].nunique()
Out[24]:
99441
In [25]:
customers['customer_unique_id'].nunique()
Out[25]:
96096
고객 분석1: 실제 고객 수는 99441 로 볼 수 있음

3-2. 고객은 어디에 주로 사는가?¶

In [26]:
customers.head()
Out[26]:
customer_id customer_unique_id customer_zip_code_prefix customer_city customer_state
0 06b8999e2fba1a1fbc88172c00ba8bc7 861eff4711a542e4b93843c6dd7febb0 14409 franca SP
1 18955e83d337fd6b2def6b18a428ac77 290c77bc529b7ac935b93aa66c333dc3 9790 sao bernardo do campo SP
2 4e7b3e00288586ebd08712fdd0374a03 060e732b5b29e8181a18229c7b0b2b5e 1151 sao paulo SP
3 b2b6027bc5c5109e529d4dc6358b12c3 259dac757896d24d7702b9acbbff3f3c 8775 mogi das cruzes SP
4 4f2d8ab171c80ec8364f7c12e35b23ad 345ecd01c38d18a9036ed96c73b8d066 13056 campinas SP
In [27]:
customers_location = customers.groupby('customer_city').count().sort_values(by='customer_id', ascending=False)
customers_location.head(10)
Out[27]:
customer_id customer_unique_id customer_zip_code_prefix customer_state
customer_city
sao paulo 15540 15540 15540 15540
rio de janeiro 6882 6882 6882 6882
belo horizonte 2773 2773 2773 2773
brasilia 2131 2131 2131 2131
curitiba 1521 1521 1521 1521
campinas 1444 1444 1444 1444
porto alegre 1379 1379 1379 1379
salvador 1245 1245 1245 1245
guarulhos 1189 1189 1189 1189
sao bernardo do campo 938 938 938 938
In [30]:
customers_location = customers.groupby('customer_city')['customer_id'].nunique().sort_values(ascending=False)
customers_location.head(10)
Out[30]:
customer_city
sao paulo                15540
rio de janeiro            6882
belo horizonte            2773
brasilia                  2131
curitiba                  1521
campinas                  1444
porto alegre              1379
salvador                  1245
guarulhos                 1189
sao bernardo do campo      938
Name: customer_id, dtype: int64
In [41]:
import chart_studio.plotly as py
import cufflinks as cf
cf.go_offline(connected=True)
In [44]:
customers_location_top20 = customers_location.head(20)
customers_location_top20.iplot(kind='bar', theme='white')
고객 분석2: 고객이 주로 사는 지역 TOP 20
In [45]:
top20_customer_locations = customers_location_top20.index
for index, location in enumerate(list(top20_customer_locations)):
    print ("TOP", index + 1, ":", location)
TOP 1 : sao paulo
TOP 2 : rio de janeiro
TOP 3 : belo horizonte
TOP 4 : brasilia
TOP 5 : curitiba
TOP 6 : campinas
TOP 7 : porto alegre
TOP 8 : salvador
TOP 9 : guarulhos
TOP 10 : sao bernardo do campo
TOP 11 : niteroi
TOP 12 : santo andre
TOP 13 : osasco
TOP 14 : santos
TOP 15 : goiania
TOP 16 : sao jose dos campos
TOP 17 : fortaleza
TOP 18 : sorocaba
TOP 19 : recife
TOP 20 : florianopolis

고객은 주로 어떤 지불방법을 사용할까?¶

In [46]:
payments.head()
Out[46]:
order_id payment_sequential payment_type payment_installments payment_value
0 b81ef226f3fe1789b1e8b2acac839d17 1 credit_card 8 99.33
1 a9810da82917af2d9aefd1278f1dcfa0 1 credit_card 1 24.39
2 25e8ea4e93396b6fa0d3dd708e76c1bd 1 credit_card 1 65.71
3 ba78997921bbcdc1373bb41e913ab953 1 credit_card 8 107.78
4 42fdf880ba16b47b59251dd489d4441a 1 credit_card 2 128.45
In [47]:
payments.isnull().sum()
Out[47]:
order_id                0
payment_sequential      0
payment_type            0
payment_installments    0
payment_value           0
dtype: int64
In [48]:
payments['payment_type'].unique()
Out[48]:
array(['credit_card', 'boleto', 'voucher', 'debit_card', 'not_defined'],
      dtype=object)

특정 값을 가진 행 삭제하기¶

In [49]:
payments = payments[payments['payment_type'] != 'not_defined']
In [50]:
payments['payment_type'].unique()
Out[50]:
array(['credit_card', 'boleto', 'voucher', 'debit_card'], dtype=object)
In [51]:
payment_type_count = payments.groupby('payment_type')['order_id'].nunique().sort_values(ascending=False)
payment_type_count
Out[51]:
payment_type
credit_card    76505
boleto         19784
voucher         3866
debit_card      1528
Name: order_id, dtype: int64
In [52]:
payment_type_count.iplot(kind='bar', theme='white')
고객 분석3: 고객은 주로 어떤 지불방법을 사용할까?
In [54]:
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(
    go.Pie(    
        labels=payment_type_count.index, values=payment_type_count.values,
        textinfo='label+percent', insidetextorientation='horizontal'
    )
)

fig.update_layout(
    {
        "title": {
            "text": "Payment Type Analysis",
            "x": 0.5,
            "y": 0.9,
            "font": {
                "size": 15
            }
        },
        "showlegend": True
    }
)

fig.show()
  • 참고: https://plotly.com/python/pie-charts/
  • 그래프 세부 조정 : https://plotly.com/python/reference/

3-4 평균 거래액은 얼마일까?¶

  • orders
  • order_items
  • payments
In [55]:
orders.head()
Out[55]:
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date
0 e481f51cbdc54678b7cc49136f2d6af7 9ef432eb6251297304e76186b10a928d delivered 2017-10-02 10:56:33 2017-10-02 11:07:15 2017-10-04 19:55:00 2017-10-10 21:25:13 2017-10-18 00:00:00
1 53cdb2fc8bc7dce0b6741e2150273451 b0830fb4747a6c6d20dea0b8c802d7ef delivered 2018-07-24 20:41:37 2018-07-26 03:24:27 2018-07-26 14:31:00 2018-08-07 15:27:45 2018-08-13 00:00:00
2 47770eb9100c2d0c44946d9cf07ec65d 41ce2a54c0b03bf3443c3d931a367089 delivered 2018-08-08 08:38:49 2018-08-08 08:55:23 2018-08-08 13:50:00 2018-08-17 18:06:29 2018-09-04 00:00:00
3 949d5b44dbf5de918fe9c16f97b45f8a f88197465ea7920adcdbec7375364d82 delivered 2017-11-18 19:28:06 2017-11-18 19:45:59 2017-11-22 13:39:59 2017-12-02 00:28:42 2017-12-15 00:00:00
4 ad21c59c0840e6cb83a9ceb5573f8159 8ab97904e6daea8866dbdbc4fb7aad2c delivered 2018-02-13 21:18:39 2018-02-13 22:20:29 2018-02-14 19:46:34 2018-02-16 18:17:02 2018-02-26 00:00:00
In [56]:
order_items.head()
Out[56]:
order_id order_item_id product_id seller_id shipping_limit_date price freight_value
0 00010242fe8c5a6d1ba2dd792cb16214 1 4244733e06e7ecb4970a6e2683c13e61 48436dade18ac8b2bce089ec2a041202 2017-09-19 09:45:35 58.90 13.29
1 00018f77f2f0320c557190d7a144bdd3 1 e5f2d52b802189ee658865ca93d83a8f dd7ddc04e1b6c2c614352b383efe2d36 2017-05-03 11:05:13 239.90 19.93
2 000229ec398224ef6ca0657da4fc703e 1 c777355d18b72b67abbeef9df44fd0fd 5b51032eddd242adc84c38acab88f23d 2018-01-18 14:48:30 199.00 17.87
3 00024acbcdf0a6daa1e931b038114c75 1 7634da152a4610f1595efa32f14722fc 9d7a1d34a5052409006425275ba1c2b4 2018-08-15 10:10:18 12.99 12.79
4 00042b26cf59d7ce69dfabb4e55b4fd9 1 ac6c3623068f30de03045865e4e10089 df560393f3a51e74553ab94004ba5c87 2017-02-13 13:57:51 199.90 18.14
In [57]:
payments.head()
Out[57]:
order_id payment_sequential payment_type payment_installments payment_value
0 b81ef226f3fe1789b1e8b2acac839d17 1 credit_card 8 99.33
1 a9810da82917af2d9aefd1278f1dcfa0 1 credit_card 1 24.39
2 25e8ea4e93396b6fa0d3dd708e76c1bd 1 credit_card 1 65.71
3 ba78997921bbcdc1373bb41e913ab953 1 credit_card 8 107.78
4 42fdf880ba16b47b59251dd489d4441a 1 credit_card 2 128.45

orders 의 구매 날짜와 payments 의 총 구매 금액을 가지고 월별 평균 거래액 분석을 하자

In [58]:
orders.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   order_id                       99441 non-null  object
 1   customer_id                    99441 non-null  object
 2   order_status                   99441 non-null  object
 3   order_purchase_timestamp       99441 non-null  object
 4   order_approved_at              99281 non-null  object
 5   order_delivered_carrier_date   97658 non-null  object
 6   order_delivered_customer_date  96476 non-null  object
 7   order_estimated_delivery_date  99441 non-null  object
dtypes: object(8)
memory usage: 6.1+ MB
In [59]:
orders.isnull().sum()
Out[59]:
order_id                            0
customer_id                         0
order_status                        0
order_purchase_timestamp            0
order_approved_at                 160
order_delivered_carrier_date     1783
order_delivered_customer_date    2965
order_estimated_delivery_date       0
dtype: int64
In [60]:
orders = orders.dropna()
orders.isnull().sum()
Out[60]:
order_id                         0
customer_id                      0
order_status                     0
order_purchase_timestamp         0
order_approved_at                0
order_delivered_carrier_date     0
order_delivered_customer_date    0
order_estimated_delivery_date    0
dtype: int64
In [61]:
orders.info()
<class 'pandas.core.frame.DataFrame'>
Index: 96461 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   order_id                       96461 non-null  object
 1   customer_id                    96461 non-null  object
 2   order_status                   96461 non-null  object
 3   order_purchase_timestamp       96461 non-null  object
 4   order_approved_at              96461 non-null  object
 5   order_delivered_carrier_date   96461 non-null  object
 6   order_delivered_customer_date  96461 non-null  object
 7   order_estimated_delivery_date  96461 non-null  object
dtypes: object(8)
memory usage: 6.6+ MB
In [62]:
payments.isnull().sum()
Out[62]:
order_id                0
payment_sequential      0
payment_type            0
payment_installments    0
payment_value           0
dtype: int64

orders 와 payments 사이즈 비교¶

In [63]:
payments.info()
<class 'pandas.core.frame.DataFrame'>
Index: 103883 entries, 0 to 103885
Data columns (total 5 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   order_id              103883 non-null  object 
 1   payment_sequential    103883 non-null  int64  
 2   payment_type          103883 non-null  object 
 3   payment_installments  103883 non-null  int64  
 4   payment_value         103883 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 4.8+ MB
In [64]:
orders.info()
<class 'pandas.core.frame.DataFrame'>
Index: 96461 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   order_id                       96461 non-null  object
 1   customer_id                    96461 non-null  object
 2   order_status                   96461 non-null  object
 3   order_purchase_timestamp       96461 non-null  object
 4   order_approved_at              96461 non-null  object
 5   order_delivered_carrier_date   96461 non-null  object
 6   order_delivered_customer_date  96461 non-null  object
 7   order_estimated_delivery_date  96461 non-null  object
dtypes: object(8)
memory usage: 6.6+ MB

order_id 중 중복된 데이터가 있는지 확인¶

  • value_counts(): 각 값이 전체에서 중복된 횟수를 리턴 (unique할 경우, 1을 리턴)
  • max(): 최대값 가져오기
  • value_counts().max(): 최대 중복된 데이터의 횟수 리턴
In [65]:
payments['order_id'].value_counts()
Out[65]:
order_id
fa65dad1b0e818e3ccc5cb0e39231352    29
ccf804e764ed5650cd8759557269dc13    26
285c2e15bebd4ac83635ccc563dc71f4    22
895ab968e7bb0d5659d16cd74cd1650c    21
fedcd9f7ccdc8cba3a18defedd1a5547    19
                                    ..
a7737f6d9208dd56ea498a322ed3c37f     1
646e62df54f3e236eb6d5ff3b31429b8     1
e115da7a49ec2acf622e1f31da65cfb9     1
f22592e01ddc0711e6794f900fa5e787     1
28bbae6599b09d39ca406b747b6632b1     1
Name: count, Length: 99437, dtype: int64
In [66]:
payments[payments['order_id'] == 'fa65dad1b0e818e3ccc5cb0e39231352']
Out[66]:
order_id payment_sequential payment_type payment_installments payment_value
4885 fa65dad1b0e818e3ccc5cb0e39231352 27 voucher 1 66.02
9985 fa65dad1b0e818e3ccc5cb0e39231352 4 voucher 1 29.16
14321 fa65dad1b0e818e3ccc5cb0e39231352 1 voucher 1 3.71
17274 fa65dad1b0e818e3ccc5cb0e39231352 9 voucher 1 1.08
19565 fa65dad1b0e818e3ccc5cb0e39231352 10 voucher 1 12.86
23074 fa65dad1b0e818e3ccc5cb0e39231352 2 voucher 1 8.51
24879 fa65dad1b0e818e3ccc5cb0e39231352 25 voucher 1 3.68
28330 fa65dad1b0e818e3ccc5cb0e39231352 5 voucher 1 0.66
29648 fa65dad1b0e818e3ccc5cb0e39231352 6 voucher 1 5.02
32519 fa65dad1b0e818e3ccc5cb0e39231352 11 voucher 1 4.03
36822 fa65dad1b0e818e3ccc5cb0e39231352 14 voucher 1 0.00
39108 fa65dad1b0e818e3ccc5cb0e39231352 29 voucher 1 19.26
39111 fa65dad1b0e818e3ccc5cb0e39231352 28 voucher 1 29.05
63369 fa65dad1b0e818e3ccc5cb0e39231352 15 voucher 1 14.04
65641 fa65dad1b0e818e3ccc5cb0e39231352 3 voucher 1 2.95
68853 fa65dad1b0e818e3ccc5cb0e39231352 8 voucher 1 26.02
73537 fa65dad1b0e818e3ccc5cb0e39231352 12 voucher 1 3.60
74239 fa65dad1b0e818e3ccc5cb0e39231352 21 voucher 1 4.11
74721 fa65dad1b0e818e3ccc5cb0e39231352 19 voucher 1 5.02
77885 fa65dad1b0e818e3ccc5cb0e39231352 13 voucher 1 0.00
78050 fa65dad1b0e818e3ccc5cb0e39231352 16 voucher 1 12.86
79587 fa65dad1b0e818e3ccc5cb0e39231352 26 voucher 1 28.27
79670 fa65dad1b0e818e3ccc5cb0e39231352 17 voucher 1 3.72
80321 fa65dad1b0e818e3ccc5cb0e39231352 18 voucher 1 0.73
82593 fa65dad1b0e818e3ccc5cb0e39231352 7 voucher 1 0.32
85166 fa65dad1b0e818e3ccc5cb0e39231352 23 voucher 1 18.86
93337 fa65dad1b0e818e3ccc5cb0e39231352 20 voucher 1 150.00
99213 fa65dad1b0e818e3ccc5cb0e39231352 24 voucher 1 0.42
100606 fa65dad1b0e818e3ccc5cb0e39231352 22 voucher 1 4.03

중복된 order_id 에 대한 지불 가격을 합치자¶

중복된 order_id 에 대해 orders 필드값이 덮어씌워져서 본래 orders 보다 많은 행이 생김

In [67]:
payments = payments.groupby('order_id').sum()
payments[payments.index == 'fa65dad1b0e818e3ccc5cb0e39231352']
Out[67]:
payment_sequential payment_type payment_installments payment_value
order_id
fa65dad1b0e818e3ccc5cb0e39231352 435 vouchervouchervouchervouchervouchervouchervouc... 29 457.99

orders 의 구매 날짜와 payments 의 총 지불 금액을 합침¶

In [68]:
merged_order = pd.merge(orders, payments, on='order_id')
In [69]:
merged_order.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96460 entries, 0 to 96459
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   order_id                       96460 non-null  object 
 1   customer_id                    96460 non-null  object 
 2   order_status                   96460 non-null  object 
 3   order_purchase_timestamp       96460 non-null  object 
 4   order_approved_at              96460 non-null  object 
 5   order_delivered_carrier_date   96460 non-null  object 
 6   order_delivered_customer_date  96460 non-null  object 
 7   order_estimated_delivery_date  96460 non-null  object 
 8   payment_sequential             96460 non-null  int64  
 9   payment_type                   96460 non-null  object 
 10  payment_installments           96460 non-null  int64  
 11  payment_value                  96460 non-null  float64
dtypes: float64(1), int64(2), object(9)
memory usage: 8.8+ MB
In [70]:
merged_order[merged_order['order_id'] == 'fa65dad1b0e818e3ccc5cb0e39231352']
Out[70]:
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date payment_sequential payment_type payment_installments payment_value

날짜 다루기¶

pandas.to_datetime() 사용법¶

  • 문자열 타입의 시간을 pandas 의 datetime (datetime64) 형으로 변경
  • 주요 사용법
    • Series 변수 = to_datetime(Series 변수)
      • return 된 Series 변수 데이터는 datetime64 형으로 변형되어 저장
    • Series 변수 = to_datetime(Series 변수, format='~~~')
      • Series 에 변환될 문자열이 특별한 포맷을 가져서, 자동변환이 어려운 경우 명시적으로 format 지정 (옵션)
    • Series 변수 = to_datetime(Series 변수, errors='raise')
      • 디폴트 raise
      • errors 가능한 값: ignore(무시), raise(에러 발생), coerce(NaT 로 값 변경해서 저장) (옵션)
표시 의미
%y 연도를 축약해서 표시, 예: 21
%Y 연도를 축약하지 않고 표시, 예: 2021
%m 월을 두자릿 수로 표시, 예: 01 ~ 12
%d 일을 두자릿 수로 표시, 예: 01 ~ 31
%H 시간 표시(24시간), 예: 00 ~ 23
%M 분 표시, 예: 00 ~ 59
%S 초 표시, 예: 00 ~ 59

참고: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

In [71]:
# order 한 시간 정보 데이터만 datetime64 로 변환하기
merged_order['order_purchase_timestamp'] = pd.to_datetime(merged_order['order_purchase_timestamp'], 
               format='%Y-%m-%d %H:%M:%S', errors='raise') 

pandas.DataFrame.copy¶

  • 데이터프레임 중 일부를 선택 후, 조작하면 해당 데이터프레임도 변경
  • copy() 를 통해, 복사본을 만들어서 조작하여, 원본 데이터프레임은 보존 가능
In [72]:
merged_order_payment_date = merged_order[['order_purchase_timestamp', 'payment_value']].copy()
In [73]:
merged_order_payment_date.head()
Out[73]:
order_purchase_timestamp payment_value
0 2017-10-02 10:56:33 38.71
1 2018-07-24 20:41:37 141.46
2 2018-08-08 08:38:49 179.12
3 2017-11-18 19:28:06 72.20
4 2018-02-13 21:18:39 28.62

pandas.Grouper¶

  • pandas groupby 명령에 보다 세부적인 grouping 이 가능토록 하는 명령
  • pandas groupby 함수와 함께 쓰여서, 시간별로 데이터를 분류할 수 있는 기능
  • 특정 시간별로 grouping 할 수 있음
데이터프레임.groupby(pd.Groper(key='그루핑기준이되는 컬럼', freq='세부 기준'))

freq 옵션: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases

시간대별 거래액 확인하기¶

In [74]:
# 월별 거래액 확인하기
merged_order_month_sum = merged_order_payment_date.groupby(pd.Grouper(key='order_purchase_timestamp', freq='M')).sum() # key 는 기본이 index 임
merged_order_month_sum.head()
Out[74]:
payment_value
order_purchase_timestamp
2016-10-31 47271.20
2016-11-30 0.00
2016-12-31 19.62
2017-01-31 127430.74
2017-02-28 269458.98
In [75]:
merged_order_month_sum.iplot(kind='bar', theme='white')
월별 평균 거래액
In [76]:
merged_order_month_sum['payment_value'].mean()
Out[76]:
670420.9934782608
최대 거래액을 기록한 월
In [78]:
merged_order_month_sum[merged_order_month_sum['payment_value'] == merged_order_month_sum['payment_value'].max()]
Out[78]:
payment_value
order_purchase_timestamp
2017-11-30 1153393.22
In [79]:
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(
    go.Bar(
        x=merged_order_month_sum.index, 
        y=merged_order_month_sum['payment_value'], 
        text=merged_order_month_sum['payment_value'], 
        textposition='auto', 
        texttemplate='R$ %{text:.0f}'
    )
)

fig.update_layout(
    {
        "title": {
            "text": "<b>Turnover per Month in Brazilian Olist E-Commerce company</b>",
            "x": 0.5,
            "y": 0.9,
            "font": {
                "size": 15
            }
        },
        "xaxis": {
            "title": "from Oct. 2016 to Sep. 2018",
            "showticklabels":True,
            "dtick": "M1",
            "tickfont": {
                "size": 8               
            }
        },
        "yaxis": {
            "title": "Turnover per Month"
        }
    }
)

fig.show()
In [81]:
# 불필요한 데이터 삭제
merged_order_month_sum_from2017 = merged_order_month_sum[merged_order_month_sum.index > '2017-01-01']
merged_order_month_sum_from2017
Out[81]:
payment_value
order_purchase_timestamp
2017-01-31 127430.74
2017-02-28 269458.98
2017-03-31 414369.39
2017-04-30 390952.18
2017-05-31 566872.73
2017-06-30 490225.60
2017-07-31 566403.93
2017-08-31 646000.61
2017-09-30 700976.01
2017-10-31 751140.27
2017-11-30 1153393.22
2017-12-31 843199.17
2018-01-31 1078606.86
2018-02-28 966554.97
2018-03-31 1120678.00
2018-04-30 1132933.95
2018-05-31 1128836.69
2018-06-30 1011561.35
2018-07-31 1027383.10
2018-08-31 985414.28
In [82]:
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(
    go.Bar(
        x=merged_order_month_sum_from2017.index, 
        y=merged_order_month_sum_from2017['payment_value'], 
        text=merged_order_month_sum_from2017['payment_value'], 
        textposition='auto', 
        texttemplate='R$ %{text:.0f}'
    )
)

fig.update_layout(
    {
        "title": {
            "text": "<b>Turnover per Month in Brazilian Olist E-Commerce company</b>",
            "x": 0.5,
            "y": 0.9,
            "font": {
                "size": 15
            }
        },
        "xaxis": {
            "title": "from Jan. 2017 to Sep. 2018",
            "showticklabels":True,
            "dtick": "M1",
            "tickfont": {
                "size": 7                
            }
        },
        "yaxis": {
            "title": "Turnover per Month"
        }
    }
)

fig.show()

3. 그래프 테마 변경해보기¶

In [83]:
import plotly.io as pio
pio.templates
Out[83]:
Templates configuration
-----------------------
    Default template: 'plotly'
    Available templates:
        ['ggplot2', 'seaborn', 'simple_white', 'plotly',
         'plotly_white', 'plotly_dark', 'presentation', 'xgridoff',
         'ygridoff', 'gridon', 'none']
In [84]:
import plotly.graph_objects as go
for template in pio.templates:
    fig = go.Figure()
    fig.add_trace(
        go.Bar(
            x=merged_order_month_sum_from2017.index, 
            y=merged_order_month_sum_from2017['payment_value'], 
            text=merged_order_month_sum_from2017['payment_value'], 
            textposition='auto', 
            texttemplate='R$ %{text:.0f}'
        )
    )
    fig.update_layout(
        {
            "title": {
                "text": "<b>Turnover per Month in Brazilian Olist E-Commerce company</b> by " + template,
                "x": 0.5,
                "y": 0.9,
                "font": {
                    "size": 15
                }
            },
            "xaxis": {
                "title": "from Feb. 2017 to Sep. 2018",
                "showticklabels":True,
                "tick0": "2017-01-31", # 처음 tick 을 설정을 해주지 않을 경우, x 축이 밀리는 경우가 있음
                "dtick": "M1", # 한 달 단위로 tick 설정
                "tickfont": {
                    "size": 7                
                }
            },
            "yaxis": {
                "title": "Turnover per Month"
            },
            "template":template
        }
    )
    fig.show()
In [85]:
# 테마 선택
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(
    go.Bar(
        x=merged_order_month_sum_from2017.index, 
        y=merged_order_month_sum_from2017['payment_value'], 
        text=merged_order_month_sum_from2017['payment_value'], 
        textposition='auto', 
        texttemplate='R$ %{text:,.0f}'
    )
)
fig.update_layout(
    {
        "title": {
            "text": "<b>Turnover per Month in Brazilian Olist E-Commerce company</b>",
            "x": 0.5,
            "y": 0.9,
            "font": {
                "size": 15
            }
        },
        "xaxis": {
            "title": "from Jan. 2017 to Sep. 2018",
            "showticklabels":True,
            "tick0": "2017-01-31", # 처음 tick 을 설정을 해주지 않을 경우, x 축이 밀리는 경우가 있음
            "dtick": "M1", # 한 달 단위로 tick 설정
            "tickfont": {
                "size": 7                
            }
        },
        "yaxis": {
            "title": "Turnover per Month"
        },
        "template":'plotly_white'
    }
)
fig.show()

bar 색상 바꾸기 (최대 거래액을 가진 달은 별도 색상으로 변경하기)¶

  • 참고 사이트: https://color.adobe.com/ko/trends
In [87]:
colors = ['#03588C',] * len(merged_order_month_sum_from2017.index)
colors[10] = '#F24472'
In [88]:
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(
    go.Bar(
        x=merged_order_month_sum_from2017.index, 
        y=merged_order_month_sum_from2017['payment_value'], 
        text=merged_order_month_sum_from2017['payment_value'], 
        textposition='auto', 
        texttemplate='R$ %{text:,.0f}',
        marker_color=colors
    )
)
fig.update_layout(
    {
        "title": {
            "text": "<b>Turnover per Month in Brazilian Olist E-Commerce company</b>",
            "x": 0.5,
            "y": 0.9,
            "font": {
                "size": 15
            }
        },
        "xaxis": {
            "title": "from Jan. 2017 to Aug. 2018",
            "showticklabels":True,
            "tick0": "2017-01-31",
            "dtick": "M1",
            "tickfont": {
                "size": 7                
            }
        },
        "yaxis": {
            "title": "Turnover per Month",
            "tickfont": {
                "size": 10                
            }
        },
        "template":'plotly_white'
    }
)

fig.show()

annotation 추가하기¶

  • 참고 사이트: https://plotly.com/python/text-and-annotations/
  • 상세 옵션: https://plotly.com/python/reference/#layout-annotations
In [89]:
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(
    go.Bar(
        x=merged_order_month_sum_from2017.index, 
        y=merged_order_month_sum_from2017['payment_value'], 
        text=merged_order_month_sum_from2017['payment_value'], 
        textposition='auto', 
        texttemplate='R$ %{y:,.0f}',
        marker_color=colors
    )
)
fig.update_layout(
    {
        "title": {
            "text": "<b>Turnover per Month in Brazilian Olist E-Commerce company</b>",
            "x": 0.5,
            "y": 0.9,
            "font": {
                "size": 15
            }
        },
        "xaxis": {
            "title": "from Jan. 2017 to Sep. 2018",
            "showticklabels":True,
            "tick0": "2017-01-31",
            "dtick": "M1",
            "tickfont": {
                "size": 7                
            }
        },
        "yaxis": {
            "title": "Turnover per Month",
            "tickfont": {
                "size": 10                
            }
        },
        "template":'plotly_white'
    }
)

fig.add_annotation(
            x="2017-11-30",
            y=1153393,
            text="<b>Peaked Monthly Turnover</b>",
            showarrow=True,
            font=dict(
                size=10,
                color="#ffffff"
                ),
            align="center",
            arrowhead=2,
            arrowsize=1,
            arrowwidth=2,
            arrowcolor="#77CFD9",
            ax=20,
            ay=-30,
            bordercolor="#77CFD9",
            borderwidth=2,
            borderpad=4,
            bgcolor="#F25D50",
            opacity=0.9
)

fig.show()

3-5 일별,주별,월별 판매 트렌드는?¶

In [92]:
merged_order.head()
Out[92]:
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date payment_sequential payment_type payment_installments payment_value
0 e481f51cbdc54678b7cc49136f2d6af7 9ef432eb6251297304e76186b10a928d delivered 2017-10-02 10:56:33 2017-10-02 11:07:15 2017-10-04 19:55:00 2017-10-10 21:25:13 2017-10-18 00:00:00 6 credit_cardvouchervoucher 3 38.71
1 53cdb2fc8bc7dce0b6741e2150273451 b0830fb4747a6c6d20dea0b8c802d7ef delivered 2018-07-24 20:41:37 2018-07-26 03:24:27 2018-07-26 14:31:00 2018-08-07 15:27:45 2018-08-13 00:00:00 1 boleto 1 141.46
2 47770eb9100c2d0c44946d9cf07ec65d 41ce2a54c0b03bf3443c3d931a367089 delivered 2018-08-08 08:38:49 2018-08-08 08:55:23 2018-08-08 13:50:00 2018-08-17 18:06:29 2018-09-04 00:00:00 1 credit_card 3 179.12
3 949d5b44dbf5de918fe9c16f97b45f8a f88197465ea7920adcdbec7375364d82 delivered 2017-11-18 19:28:06 2017-11-18 19:45:59 2017-11-22 13:39:59 2017-12-02 00:28:42 2017-12-15 00:00:00 1 credit_card 1 72.20
4 ad21c59c0840e6cb83a9ceb5573f8159 8ab97904e6daea8866dbdbc4fb7aad2c delivered 2018-02-13 21:18:39 2018-02-13 22:20:29 2018-02-14 19:46:34 2018-02-16 18:17:02 2018-02-26 00:00:00 1 credit_card 1 28.62
In [94]:
merged_order_payment_date.head()
Out[94]:
order_purchase_timestamp payment_value
0 2017-10-02 10:56:33 38.71
1 2018-07-24 20:41:37 141.46
2 2018-08-08 08:38:49 179.12
3 2017-11-18 19:28:06 72.20
4 2018-02-13 21:18:39 28.62

월별 거래건수 확인¶

In [95]:
# order_purchase_timestamp 의 날짜 데이터를 기반으로 월별 계산을 해야 하므로 datetime 타입으로 변환
merged_order_payment_date['order_purchase_timestamp'] = pd.to_datetime(merged_order_payment_date['order_purchase_timestamp'], format='%Y-%m-%d %H:%M:%S', errors='raise') 
merged_order_payment_date = merged_order_payment_date.set_index('order_purchase_timestamp')

merged_order_month_count = merged_order_payment_date.groupby(pd.Grouper(freq='M')).count() # key 는 기본이 index 임
merged_order_month_count.head()
Out[95]:
payment_value
order_purchase_timestamp
2016-10-31 270
2016-11-30 0
2016-12-31 1
2017-01-31 748
2017-02-28 1641
In [98]:
merged_order_month_count.iplot(kind='bar', theme='white')

일별 거래액 확인¶

In [100]:
merged_order_date_sum = merged_order_payment_date.groupby(pd.Grouper(freq='D')).sum() # key 는 기본이 index 임
merged_order_date_sum.iplot(kind='line', theme='white')

시간대별 분석¶

In [106]:
merged_order_payment_date['order_purchase_timestamp'] = pd.to_datetime(merged_order_payment_date.index, format='%Y-%m-%d %H:%M:%S', errors='raise') 
merged_order_payment_date.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 96460 entries, 2017-10-02 10:56:33 to 2018-03-08 20:57:30
Data columns (total 2 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   payment_value             96460 non-null  float64       
 1   order_purchase_timestamp  96460 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1)
memory usage: 2.2 MB
In [107]:
merged_order_payment_date['year'] = merged_order_payment_date['order_purchase_timestamp'].dt.year
merged_order_payment_date['monthday'] = merged_order_payment_date['order_purchase_timestamp'].dt.day
merged_order_payment_date['weekday'] = merged_order_payment_date['order_purchase_timestamp'].dt.weekday
merged_order_payment_date['month'] = merged_order_payment_date['order_purchase_timestamp'].dt.month
merged_order_payment_date['hour'] = merged_order_payment_date['order_purchase_timestamp'].dt.hour
merged_order_payment_date['quarter'] = merged_order_payment_date['order_purchase_timestamp'].dt.quarter
merged_order_payment_date['minute'] = merged_order_payment_date['order_purchase_timestamp'].dt.minute

연도별 분석¶

In [109]:
merged_order_payment_year = merged_order_payment_date[['year', 'payment_value']].copy()
merged_order_payment_year
Out[109]:
year payment_value
order_purchase_timestamp
2017-10-02 10:56:33 2017 38.71
2018-07-24 20:41:37 2018 141.46
2018-08-08 08:38:49 2018 179.12
2017-11-18 19:28:06 2017 72.20
2018-02-13 21:18:39 2018 28.62
... ... ...
2017-03-09 09:54:05 2017 85.08
2018-02-06 12:58:58 2018 195.00
2017-08-27 14:46:43 2017 271.01
2018-01-08 21:28:27 2018 441.16
2018-03-08 20:57:30 2018 86.86

96460 rows × 2 columns

In [111]:
merged_order_payment_year = merged_order_payment_year.groupby('year').sum()
merged_order_payment_year.head()
Out[111]:
payment_value
year
2016 47290.82
2017 6920422.83
2018 8451969.20
In [114]:
merged_order_payment_year.iplot(kind='bar', theme='white')
In [115]:
layout = {
    'xaxis': {
        'showticklabels':True,
        'tickvals':[2016, 2017, 2018]
    }
}
merged_order_payment_year.iplot(kind='bar', theme='white', layout=layout)

요일별 분석¶

  • weekday : (0:월, 1:화, 2:수, 3:목, 4:금, 5:토, 6:일)
In [116]:
merged_order_payment_weekday = merged_order_payment_date[['weekday', 'payment_value']].copy()
In [117]:
merged_order_payment_weekday = merged_order_payment_weekday.groupby('weekday').sum()
merged_order_payment_weekday.head()
Out[117]:
payment_value
weekday
0 2530671.56
1 2473930.77
2 2396215.34
3 2283849.51
4 2222421.36
In [118]:
merged_order_payment_weekday = merged_order_payment_weekday.reset_index()

apply() 함수 활용해서 컬럼값 변경하기¶

  • index 를 column 으로 변경하고 (reset_index()), column 값을 순서에 맞춰서 변경 후, index로 재설정(set_index())
In [122]:
def func(row):
    if row['weekday'] == 0:
        row['weekday'] = 'Mon'
    elif row['weekday'] == 1:
        row['weekday'] = 'Tue'
    elif row['weekday'] == 2:
        row['weekday'] = 'Wed'
    elif row['weekday'] == 3:
        row['weekday'] = 'Thu'
    elif row['weekday'] == 4:
        row['weekday'] = 'Fri'
    elif row['weekday'] == 5:
        row['weekday'] = 'Sat'
    elif row['weekday'] == 6:
        row['weekday'] = 'Sun'
        
    return row 
In [123]:
merged_order_payment_weekday = merged_order_payment_weekday.apply(func, axis=1)
merged_order_payment_weekday
Out[123]:
weekday payment_value
0 Mon 2530671.56
1 Tue 2473930.77
2 Wed 2396215.34
3 Thu 2283849.51
4 Fri 2222421.36
5 Sat 1704949.69
6 Sun 1807644.62
In [124]:
merged_order_payment_weekday = merged_order_payment_weekday.set_index('weekday')
In [125]:
merged_order_payment_weekday.iplot(kind='bar', theme='white')

계절별(quarter별) 분석¶

In [126]:
merged_order_payment_quarter = merged_order_payment_date[['quarter', 'payment_value']].copy()
In [127]:
merged_order_payment_quarter = merged_order_payment_quarter.groupby('quarter').sum()
merged_order_payment_quarter.head()
Out[127]:
payment_value
quarter
1 3977098.94
2 4721382.50
3 3926177.93
4 2795023.48
In [128]:
merged_order_payment_quarter = merged_order_payment_quarter.reset_index()
In [129]:
def func(row):
    if row['quarter'] == 1:
        row['quarter'] = '1Q'
    elif row['quarter'] == 2:
        row['quarter'] = '2Q'
    elif row['quarter'] == 3:
        row['quarter'] = '3Q'
    elif row['quarter'] == 4:
        row['quarter'] = '4Q'
        
    return row
In [131]:
merged_order_payment_quarter = merged_order_payment_quarter.apply(func, axis=1).set_index('quarter')
merged_order_payment_quarter.iplot(kind='bar', theme='white')

시간대별 분석¶

In [132]:
merged_order_payment_hour = merged_order_payment_date[['hour', 'payment_value']].copy()
In [133]:
merged_order_payment_hour = merged_order_payment_hour.groupby('hour').sum()
merged_order_payment_hour.iplot(kind='bar', theme='white')
In [134]:
layout = {
    'xaxis': {
        'showticklabels':True,
        'dtick': 1
    }
}
merged_order_payment_hour.iplot(kind='bar', theme='white', layout=layout)

요일/시간간 거래액 상관관계 알아보기¶

In [170]:
import pandas as pd
PATH = "C:/Users/Admin/Data Analysis/Ecommerce\Dataset/"
products = pd.read_csv(PATH + "olist_products_dataset.csv", encoding='utf-8-sig')
customers = pd.read_csv(PATH + "olist_customers_dataset.csv", encoding='utf-8-sig')
geolocation = pd.read_csv(PATH + "olist_geolocation_dataset.csv", encoding='utf-8-sig')
order_items = pd.read_csv(PATH + "olist_order_items_dataset.csv", encoding='utf-8-sig')
payments = pd.read_csv(PATH + "olist_order_payments_dataset.csv", encoding='utf-8-sig')
reviews = pd.read_csv(PATH + "olist_order_reviews_dataset.csv", encoding='utf-8-sig')
orders = pd.read_csv(PATH + "olist_orders_dataset.csv", encoding='utf-8-sig')
sellers = pd.read_csv(PATH + "olist_sellers_dataset.csv", encoding='utf-8-sig')
category_name = pd.read_csv(PATH + "product_category_name_translation.csv", encoding='utf-8-sig')
orders = orders.dropna()
merged_order = pd.merge(orders, payments, on='order_id')
merged_order_payment_date = merged_order[['order_purchase_timestamp', 'payment_value']].copy()
merged_order_payment_date['order_purchase_timestamp'] = pd.to_datetime(merged_order_payment_date['order_purchase_timestamp'], format='%Y-%m-%d %H:%M:%S', errors='raise') 
merged_order_payment_date['year'] = merged_order_payment_date['order_purchase_timestamp'].dt.year
merged_order_payment_date['monthday'] = merged_order_payment_date['order_purchase_timestamp'].dt.day
merged_order_payment_date['weekday'] = merged_order_payment_date['order_purchase_timestamp'].dt.weekday
merged_order_payment_date['month'] = merged_order_payment_date['order_purchase_timestamp'].dt.month
merged_order_payment_date['hour'] = merged_order_payment_date['order_purchase_timestamp'].dt.hour
merged_order_payment_date['quarter'] = merged_order_payment_date['order_purchase_timestamp'].dt.quarter
merged_order_payment_date['minute'] = merged_order_payment_date['order_purchase_timestamp'].dt.minute
In [171]:
merged_order_payment_date.head()
Out[171]:
order_purchase_timestamp payment_value year monthday weekday month hour quarter minute
0 2017-10-02 10:56:33 18.12 2017 2 0 10 10 4 56
1 2017-10-02 10:56:33 2.00 2017 2 0 10 10 4 56
2 2017-10-02 10:56:33 18.59 2017 2 0 10 10 4 56
3 2018-07-24 20:41:37 141.46 2018 24 1 7 20 3 41
4 2018-08-08 08:38:49 179.12 2018 8 2 8 8 3 38
In [172]:
merged_order_payment_hour_weekday = merged_order_payment_date[['weekday', 'hour', 'payment_value']].copy()
In [173]:
merged_order_payment_hour_weekday
Out[173]:
weekday hour payment_value
0 0 10 18.12
1 0 10 2.00
2 0 10 18.59
3 1 20 141.46
4 2 8 179.12
... ... ... ...
100734 3 9 85.08
100735 1 12 195.00
100736 6 14 271.01
100737 0 21 441.16
100738 3 20 86.86

100739 rows × 3 columns

In [175]:
merged_order_payment_hour_weekday = merged_order_payment_hour_weekday.groupby(['weekday','hour']).sum()
merged_order_payment_hour_weekday
Out[175]:
payment_value
weekday hour
0 0 46807.42
1 17734.77
2 7462.56
3 6442.62
4 2138.02
... ... ...
6 19 140127.39
20 150543.96
21 135431.31
22 127866.74
23 82991.95

168 rows × 1 columns

In [176]:
merged_order_payment_hour_weekday.index[1][0] # 리스트 안의 리스트일 때처럼 가리키면 됨
Out[176]:
0
In [178]:
merged_order_payment_hour_weekday = merged_order_payment_hour_weekday.reset_index()
merged_order_payment_hour_weekday
Out[178]:
index weekday hour payment_value
0 0 0 0 46807.42
1 1 0 1 17734.77
2 2 0 2 7462.56
3 3 0 3 6442.62
4 4 0 4 2138.02
... ... ... ... ...
163 163 6 19 140127.39
164 164 6 20 150543.96
165 165 6 21 135431.31
166 166 6 22 127866.74
167 167 6 23 82991.95

168 rows × 4 columns

In [179]:
def func(row):
    if row['weekday'] == 0:
        row['weekday'] = 'Mon'
    elif row['weekday'] == 1:
        row['weekday'] = 'Tue'
    elif row['weekday'] == 2:
        row['weekday'] = 'Wed'
    elif row['weekday'] == 3:
        row['weekday'] = 'Thu'
    elif row['weekday'] == 4:
        row['weekday'] = 'Fri'
    elif row['weekday'] == 5:
        row['weekday'] = 'Sat'
    elif row['weekday'] == 6:
        row['weekday'] = 'Sun'
        
    return row 
In [180]:
merged_order_payment_hour_weekday = merged_order_payment_hour_weekday.apply(func, axis=1)
merged_order_payment_hour_weekday
Out[180]:
index weekday hour payment_value
0 0.0 Mon 0.0 46807.42
1 1.0 Mon 1.0 17734.77
2 2.0 Mon 2.0 7462.56
3 3.0 Mon 3.0 6442.62
4 4.0 Mon 4.0 2138.02
... ... ... ... ...
163 163.0 Sun 19.0 140127.39
164 164.0 Sun 20.0 150543.96
165 165.0 Sun 21.0 135431.31
166 166.0 Sun 22.0 127866.74
167 167.0 Sun 23.0 82991.95

168 rows × 4 columns

In [181]:
merged_order_payment_hour_weekday['weekday'] = pd.Categorical(merged_order_payment_hour_weekday['weekday'],categories=['Sun','Mon','Tue','Wed','Thu','Fri','Sat'],ordered=True)
In [182]:
merged_order_payment_hour_weekday = merged_order_payment_hour_weekday.sort_values(by=['weekday', 'hour'], ascending=True)
merged_order_payment_hour_weekday.head()
Out[182]:
index weekday hour payment_value
144 144.0 Sun 0.0 38888.45
145 145.0 Sun 1.0 20137.20
146 146.0 Sun 2.0 11965.82
147 147.0 Sun 3.0 6326.13
148 148.0 Sun 4.0 3914.48
In [184]:
merged_order_payment_hour_weekday.iplot(kind='heatmap', y='weekday', x='hour', z='payment_value', colorscale='Blues')
In [185]:
layout = {
    'title':  {
            "text": "<b>Heatmap of turnover over the hour by day</b>",
            "x": 0.5,
            "y": 0.87,
            "font": {
                "size": 16
            }
    },  
    'xaxis': {
        'title': '<b>Hour</b>',
        'showticklabels':True,
        'dtick': 1
    },
    "yaxis": {
        "title": "<b>The day of the week</b>"
    },
    "template":'plotly_white'
}
merged_order_payment_hour_weekday.iplot(kind='heatmap', y='weekday', x='hour', z='payment_value', colorscale='Reds', layout=layout)

3-6 어떤 카테고리가 가장 많은 상품이 팔렸을까?¶

In [192]:
products_category = pd.merge(products, category_name, on='product_category_name')
products_category
Out[192]:
product_id product_category_name product_name_lenght product_description_lenght product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm product_category_name_english
0 1e9e8ef04dbcff4541ed26657ea517e5 perfumaria 40.0 287.0 1.0 225.0 16.0 10.0 14.0 perfumery
1 6a2fb4dd53d2cdb88e0432f1284a004c perfumaria 39.0 346.0 2.0 400.0 27.0 5.0 20.0 perfumery
2 0d009643171aee696f4733340bc2fdd0 perfumaria 52.0 150.0 1.0 422.0 21.0 16.0 18.0 perfumery
3 b1eae565a61935e0011ee7682fef9dc9 perfumaria 49.0 460.0 2.0 267.0 17.0 13.0 17.0 perfumery
4 8da90b37f0fb171b4877c124f965b1f6 perfumaria 56.0 733.0 3.0 377.0 18.0 13.0 15.0 perfumery
... ... ... ... ... ... ... ... ... ... ...
32323 13b25797ad1e6d6c7cd3cbeb35f8ee7a moveis_colchao_e_estofado 44.0 348.0 1.0 18250.0 25.0 25.0 25.0 furniture_mattress_and_upholstery
32324 dc36a7859b743d8610a2bbbaea26ece9 moveis_colchao_e_estofado 29.0 2073.0 2.0 2500.0 41.0 21.0 21.0 furniture_mattress_and_upholstery
32325 107fde0930956120d1e13dd1062fbb46 moveis_colchao_e_estofado 51.0 418.0 1.0 30000.0 60.0 73.0 60.0 furniture_mattress_and_upholstery
32326 726b4e18f00255e2e63491bcba3f60b8 moveis_colchao_e_estofado 41.0 866.0 1.0 1700.0 70.0 10.0 100.0 furniture_mattress_and_upholstery
32327 1dceebcc5f23c02ea23e16d5bedca000 cds_dvds_musicais 47.0 117.0 2.0 550.0 35.0 15.0 25.0 cds_dvds_musicals

32328 rows × 10 columns

In [193]:
products_category_only = products_category[['product_id', 'product_category_name_english']].copy()
products_category_only.head()
Out[193]:
product_id product_category_name_english
0 1e9e8ef04dbcff4541ed26657ea517e5 perfumery
1 6a2fb4dd53d2cdb88e0432f1284a004c perfumery
2 0d009643171aee696f4733340bc2fdd0 perfumery
3 b1eae565a61935e0011ee7682fef9dc9 perfumery
4 8da90b37f0fb171b4877c124f965b1f6 perfumery
In [194]:
products_category_order = pd.merge(products_category_only, order_items, on='product_id')
products_category_order.head()
Out[194]:
product_id product_category_name_english order_id order_item_id seller_id shipping_limit_date price freight_value
0 1e9e8ef04dbcff4541ed26657ea517e5 perfumery e17e4f88e31525f7deef66779844ddce 1 5670f4db5b62c43d542e1b2d56b0cf7c 2018-04-30 17:33:54 10.91 7.39
1 6a2fb4dd53d2cdb88e0432f1284a004c perfumery 048cc42e03ca8d43c729adf6962cb348 1 7040e82f899a04d1b434b795a43b4617 2017-11-23 21:31:31 16.90 7.78
2 6a2fb4dd53d2cdb88e0432f1284a004c perfumery 5fa78e91b5cb84b6a8ccc42733f95742 1 7040e82f899a04d1b434b795a43b4617 2017-12-14 16:33:14 16.90 7.78
3 0d009643171aee696f4733340bc2fdd0 perfumery 24b1c4d88fdb7a2dc87f8ecc7d8f47f1 1 fe2032dab1a61af8794248c8196565c9 2017-08-03 09:50:14 339.00 17.13
4 0d009643171aee696f4733340bc2fdd0 perfumery 7b13c77c64a9a956500cbf1a9a23798d 1 7178f9f4dd81dcef02f62acdf8151e01 2018-05-30 10:38:35 275.00 23.48

카테고리별 거래 건수 확인하기¶

  1. 카테고리 이름으로 grouping 한다: products_category_order.groupby('product_category_name_english')
  2. grouping 할 때 order는 거래액이 아니라 거래건수로 count 를 한다: .aggregate({'order_id':'count'})
    • .aggregate() 또는 .agg() 를 사용
  3. 거래건수로 count 한 값의 컬럼명을 order_count 로 바꾼다: .rename(columns={'order_id':'order_count'})
In [195]:
products_category_order = products_category_order.groupby('product_category_name_english').aggregate({'order_id':'count'}).rename(columns={'order_id':'order_count'})
In [196]:
products_category_order.head()
Out[196]:
order_count
product_category_name_english
agro_industry_and_commerce 212
air_conditioning 297
art 209
arts_and_craftmanship 24
audio 364
In [197]:
products_category_order = products_category_order.sort_values(by='order_count',ascending=False)
In [198]:
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(
    go.Bar(
        x=products_category_order.index, 
        y=products_category_order['order_count']
    )
)
fig.update_layout(
    {
        "title": {
            "text": "<b>The number of Order per Category in Brazilian Olist E-Commerce company</b>",
            "x": 0.5,
            "y": 0.9,
            "font": {
                "size": 15
            }
        },
        "xaxis": {
            "title": "from Oct. 2016 to Sep. 2018",
            "showticklabels":True,
            "tickfont": {
                "size": 7                
            }
        },
        "yaxis": {
            "title": "The number of Order"
        },
        "template":'plotly_white'
    }
)
fig.show()
In [199]:
# 주요 카테고리만 확인
products_category_order = products_category_order[products_category_order['order_count'] > 1000]
In [200]:
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(
    go.Bar(
        x=products_category_order.index, 
        y=products_category_order['order_count']
    )
)
fig.update_layout(
    {
        "title": {
            "text": "<b>The number of Order per Category in Brazilian Olist E-Commerce company</b>",
            "x": 0.5,
            "y": 0.9,
            "font": {
                "size": 15
            }
        },
        "xaxis": {
            "title": "from Oct. 2016 to Sep. 2018",
            "showticklabels":True,
            "tickfont": {
                "size": 7                
            }
        },
        "yaxis": {
            "title": "The number of Order"
        },
        "template":'plotly_white'
    }
)
fig.show()

3-6 평균 배송 시간 분석¶

In [201]:
orders.head()
Out[201]:
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date
0 e481f51cbdc54678b7cc49136f2d6af7 9ef432eb6251297304e76186b10a928d delivered 2017-10-02 10:56:33 2017-10-02 11:07:15 2017-10-04 19:55:00 2017-10-10 21:25:13 2017-10-18 00:00:00
1 53cdb2fc8bc7dce0b6741e2150273451 b0830fb4747a6c6d20dea0b8c802d7ef delivered 2018-07-24 20:41:37 2018-07-26 03:24:27 2018-07-26 14:31:00 2018-08-07 15:27:45 2018-08-13 00:00:00
2 47770eb9100c2d0c44946d9cf07ec65d 41ce2a54c0b03bf3443c3d931a367089 delivered 2018-08-08 08:38:49 2018-08-08 08:55:23 2018-08-08 13:50:00 2018-08-17 18:06:29 2018-09-04 00:00:00
3 949d5b44dbf5de918fe9c16f97b45f8a f88197465ea7920adcdbec7375364d82 delivered 2017-11-18 19:28:06 2017-11-18 19:45:59 2017-11-22 13:39:59 2017-12-02 00:28:42 2017-12-15 00:00:00
4 ad21c59c0840e6cb83a9ceb5573f8159 8ab97904e6daea8866dbdbc4fb7aad2c delivered 2018-02-13 21:18:39 2018-02-13 22:20:29 2018-02-14 19:46:34 2018-02-16 18:17:02 2018-02-26 00:00:00
In [202]:
orders.info()
<class 'pandas.core.frame.DataFrame'>
Index: 96461 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   order_id                       96461 non-null  object
 1   customer_id                    96461 non-null  object
 2   order_status                   96461 non-null  object
 3   order_purchase_timestamp       96461 non-null  object
 4   order_approved_at              96461 non-null  object
 5   order_delivered_carrier_date   96461 non-null  object
 6   order_delivered_customer_date  96461 non-null  object
 7   order_estimated_delivery_date  96461 non-null  object
dtypes: object(8)
memory usage: 6.6+ MB
In [203]:
orders.isnull().sum()
Out[203]:
order_id                         0
customer_id                      0
order_status                     0
order_purchase_timestamp         0
order_approved_at                0
order_delivered_carrier_date     0
order_delivered_customer_date    0
order_estimated_delivery_date    0
dtype: int64
In [204]:
# 결측값 삭제
orders = orders.dropna()
orders.isnull().sum()
Out[204]:
order_id                         0
customer_id                      0
order_status                     0
order_purchase_timestamp         0
order_approved_at                0
order_delivered_carrier_date     0
order_delivered_customer_date    0
order_estimated_delivery_date    0
dtype: int64

배송 시간을 계산해서 새로운 컬럼으로 추가¶

  • 시간 차는 timedelta64 타입으로 표시됨
In [205]:
orders['delivery_time'] = pd.to_datetime(orders['order_delivered_customer_date']) - pd.to_datetime(orders['order_purchase_timestamp'])
In [206]:
orders['delivery_time'].describe()
Out[206]:
count                         96461
mean     12 days 13:23:46.585853350
std       9 days 13:06:45.121880865
min                 0 days 12:48:07
25%                 6 days 18:22:50
50%                10 days 05:11:29
75%                15 days 17:17:10
max               209 days 15:05:12
Name: delivery_time, dtype: object
In [207]:
orders['delivery_time'].iplot(kind="box")
In [208]:
# max 값을 제거 하기 위해서 quantile 활용
delivery_time_q95 = orders['delivery_time'].quantile(.95)
In [209]:
delivery_time_q95
Out[209]:
Timedelta('29 days 06:36:33')
In [210]:
delivery_time_q90 = orders['delivery_time'].quantile(.90)
delivery_time_q90
Out[210]:
Timedelta('23 days 02:21:07')
In [211]:
orders = orders[orders['delivery_time'] < delivery_time_q95]
In [212]:
orders.describe()
Out[212]:
delivery_time
count 91637
mean 11 days 01:30:52.357934022
std 6 days 02:35:35.737263832
min 0 days 12:48:07
25% 6 days 11:17:04
50% 9 days 21:57:11
75% 14 days 13:03:22
max 29 days 06:35:41

월별 평균 배송 시간 분석¶

  • orders_date['delivery_time'].dt.total_seconds() : 배송 시간을 초로 변환
    • float 타입으로 변환해서, mean() 평균 계산
In [213]:
orders_date = orders[['order_purchase_timestamp', 'delivery_time']].copy()
# order_purchase_timestamp 의 날짜 데이터를 기반으로 월별 계산을 해야 하므로 datetime 타입으로 변환
orders_date['order_purchase_timestamp'] = pd.to_datetime(orders_date['order_purchase_timestamp'], format='%Y-%m-%d %H:%M:%S', errors='raise') 
# delivery_time이 timedelta64 타입인데, 이를 float 타입으로 변환
orders_date['delivery_time'] = orders_date['delivery_time'].dt.total_seconds()
orders_date = orders_date.set_index('order_purchase_timestamp')
In [214]:
orders_date = orders_date.groupby(pd.Grouper(freq='M')).mean() # key 는 기본이 index 임
orders_date.head()
Out[214]:
delivery_time
order_purchase_timestamp
2016-10-31 1.293386e+06
2016-11-30 NaN
2016-12-31 4.054770e+05
2017-01-31 9.549751e+05
2017-02-28 1.010813e+06
In [215]:
orders_date['delivery_time'] = orders_date['delivery_time'] / 86400
In [216]:
orders_date.head()
Out[216]:
delivery_time
order_purchase_timestamp
2016-10-31 14.969742
2016-11-30 NaN
2016-12-31 4.693021
2017-01-31 11.052953
2017-02-28 11.699220
In [217]:
orders_date.iplot(kind="bar")
In [218]:
orders_date = orders_date[orders_date.index > "2017-01-01"]
In [219]:
layout = {
    'xaxis': {
        'showticklabels':True,
        "tick0": "2017-01-31",
        'dtick': 'M1',
        "tickfont": {
            "size": 7                
        }
    }
}
orders_date.iplot(kind="bar", layout=layout)
In [220]:
colors = ['#1B80BF',] * len(orders_date.index)
for index in range(15, len(orders_date.index)):
    colors[index] = '#BF2C47'
In [221]:
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(
    go.Bar(
        x=orders_date.index, 
        y=orders_date['delivery_time'], 
        text=orders_date['delivery_time'], 
        textposition='auto', 
        texttemplate='%{text:.2f} days',
        marker_color=colors
    )
)
fig.update_layout(
    {
        "title": {
            "text": "<b>Average Delivery Time per Month in Brazilian Olist E-Commerce company</b>",
            "x": 0.5,
            "y": 0.9,
            "font": {
                "size": 15
            }
        },
        "xaxis": {
            "title": "from Jan. 2017 to Aug. 2018",
            "showticklabels":True,
            "tick0": "2017-01-31",
            "dtick": "M1",
            "tickfont": {
                "size": 7                
            }
        },
        "yaxis": {
            "title": "Delivery Time (days)"
        },
        "template":'plotly_white'
    }
)

fig.add_annotation(
            x="2018-04-30",
            y=10.4,
            text="<b>Improved logistics</b>",
            showarrow=True,
            font=dict(
                size=10,
                color="#ffffff"
                ),
            align="center",
            arrowhead=2,
            arrowsize=1,
            arrowwidth=2,
            arrowcolor="#77BDD9",
            ax=40,
            ay=-30,
            bordercolor="#77BDD9",
            borderwidth=2,
            borderpad=4,
            bgcolor="#F22E62",
            opacity=0.8
)


fig.show()
In [ ]: