브라질에서 가장 큰 백화점의 이커머스 쇼핑몰 (https://olist.com/solucoes/distribuidoras-e-lojas-de-bebidas/)
Kaggle Dataset(https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce/)
데이터셋 (dataset)
일반적으로 현업에서는 하이브(HIVE), MySQL(SQL), mongodb(NoSQL) 등 데이터를 저장/관리하는 데이터베이스에서 SQL 과 유사한 형태의 명령을 통해 데이터셋을 만드는 것이 일반적임
수치형 데이터일 경우에는 다음과 같이 EDA 5 수치 + 평균(mean) 확인
범주형 데이터일 경우에는 각 수준별 갯수 세기
시계열 데이터일 경우에는 필요하면 line 또는 bar 그래프 그리기
feature 간 상관관계 분석이 필요할 경우에는 heatmap 또는 scatter 그래프 그리기
시각화를 위해 데이터 조작이 필요하므로, 가볍게 각 데이터만 확인
import pandas as pd
PATH = "C:/Users/Admin/Data Analysis/Ecommerce\Dataset/"
products = pd.read_csv(PATH + "olist_products_dataset.csv", encoding='utf-8-sig')
customers = pd.read_csv(PATH + "olist_customers_dataset.csv", encoding='utf-8-sig')
geolocation = pd.read_csv(PATH + "olist_geolocation_dataset.csv", encoding='utf-8-sig')
order_items = pd.read_csv(PATH + "olist_order_items_dataset.csv", encoding='utf-8-sig')
payments = pd.read_csv(PATH + "olist_order_payments_dataset.csv", encoding='utf-8-sig')
reviews = pd.read_csv(PATH + "olist_order_reviews_dataset.csv", encoding='utf-8-sig')
orders = pd.read_csv(PATH + "olist_orders_dataset.csv", encoding='utf-8-sig')
sellers = pd.read_csv(PATH + "olist_sellers_dataset.csv", encoding='utf-8-sig')
category_name = pd.read_csv(PATH + "product_category_name_translation.csv", encoding='utf-8-sig')
customers.head()
customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | |
---|---|---|---|---|---|
0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 861eff4711a542e4b93843c6dd7febb0 | 14409 | franca | SP |
1 | 18955e83d337fd6b2def6b18a428ac77 | 290c77bc529b7ac935b93aa66c333dc3 | 9790 | sao bernardo do campo | SP |
2 | 4e7b3e00288586ebd08712fdd0374a03 | 060e732b5b29e8181a18229c7b0b2b5e | 1151 | sao paulo | SP |
3 | b2b6027bc5c5109e529d4dc6358b12c3 | 259dac757896d24d7702b9acbbff3f3c | 8775 | mogi das cruzes | SP |
4 | 4f2d8ab171c80ec8364f7c12e35b23ad | 345ecd01c38d18a9036ed96c73b8d066 | 13056 | campinas | SP |
customers.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 99441 entries, 0 to 99440 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 customer_id 99441 non-null object 1 customer_unique_id 99441 non-null object 2 customer_zip_code_prefix 99441 non-null int64 3 customer_city 99441 non-null object 4 customer_state 99441 non-null object dtypes: int64(1), object(4) memory usage: 3.8+ MB
customers['customer_id'].nunique()
99441
customers['customer_unique_id'].nunique()
96096
customers.head()
customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | |
---|---|---|---|---|---|
0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 861eff4711a542e4b93843c6dd7febb0 | 14409 | franca | SP |
1 | 18955e83d337fd6b2def6b18a428ac77 | 290c77bc529b7ac935b93aa66c333dc3 | 9790 | sao bernardo do campo | SP |
2 | 4e7b3e00288586ebd08712fdd0374a03 | 060e732b5b29e8181a18229c7b0b2b5e | 1151 | sao paulo | SP |
3 | b2b6027bc5c5109e529d4dc6358b12c3 | 259dac757896d24d7702b9acbbff3f3c | 8775 | mogi das cruzes | SP |
4 | 4f2d8ab171c80ec8364f7c12e35b23ad | 345ecd01c38d18a9036ed96c73b8d066 | 13056 | campinas | SP |
customers_location = customers.groupby('customer_city').count().sort_values(by='customer_id', ascending=False)
customers_location.head(10)
customer_id | customer_unique_id | customer_zip_code_prefix | customer_state | |
---|---|---|---|---|
customer_city | ||||
sao paulo | 15540 | 15540 | 15540 | 15540 |
rio de janeiro | 6882 | 6882 | 6882 | 6882 |
belo horizonte | 2773 | 2773 | 2773 | 2773 |
brasilia | 2131 | 2131 | 2131 | 2131 |
curitiba | 1521 | 1521 | 1521 | 1521 |
campinas | 1444 | 1444 | 1444 | 1444 |
porto alegre | 1379 | 1379 | 1379 | 1379 |
salvador | 1245 | 1245 | 1245 | 1245 |
guarulhos | 1189 | 1189 | 1189 | 1189 |
sao bernardo do campo | 938 | 938 | 938 | 938 |
customers_location = customers.groupby('customer_city')['customer_id'].nunique().sort_values(ascending=False)
customers_location.head(10)
customer_city sao paulo 15540 rio de janeiro 6882 belo horizonte 2773 brasilia 2131 curitiba 1521 campinas 1444 porto alegre 1379 salvador 1245 guarulhos 1189 sao bernardo do campo 938 Name: customer_id, dtype: int64
import chart_studio.plotly as py
import cufflinks as cf
cf.go_offline(connected=True)
customers_location_top20 = customers_location.head(20)
customers_location_top20.iplot(kind='bar', theme='white')
top20_customer_locations = customers_location_top20.index
for index, location in enumerate(list(top20_customer_locations)):
print ("TOP", index + 1, ":", location)
TOP 1 : sao paulo TOP 2 : rio de janeiro TOP 3 : belo horizonte TOP 4 : brasilia TOP 5 : curitiba TOP 6 : campinas TOP 7 : porto alegre TOP 8 : salvador TOP 9 : guarulhos TOP 10 : sao bernardo do campo TOP 11 : niteroi TOP 12 : santo andre TOP 13 : osasco TOP 14 : santos TOP 15 : goiania TOP 16 : sao jose dos campos TOP 17 : fortaleza TOP 18 : sorocaba TOP 19 : recife TOP 20 : florianopolis
payments.head()
order_id | payment_sequential | payment_type | payment_installments | payment_value | |
---|---|---|---|---|---|
0 | b81ef226f3fe1789b1e8b2acac839d17 | 1 | credit_card | 8 | 99.33 |
1 | a9810da82917af2d9aefd1278f1dcfa0 | 1 | credit_card | 1 | 24.39 |
2 | 25e8ea4e93396b6fa0d3dd708e76c1bd | 1 | credit_card | 1 | 65.71 |
3 | ba78997921bbcdc1373bb41e913ab953 | 1 | credit_card | 8 | 107.78 |
4 | 42fdf880ba16b47b59251dd489d4441a | 1 | credit_card | 2 | 128.45 |
payments.isnull().sum()
order_id 0 payment_sequential 0 payment_type 0 payment_installments 0 payment_value 0 dtype: int64
payments['payment_type'].unique()
array(['credit_card', 'boleto', 'voucher', 'debit_card', 'not_defined'], dtype=object)
payments = payments[payments['payment_type'] != 'not_defined']
payments['payment_type'].unique()
array(['credit_card', 'boleto', 'voucher', 'debit_card'], dtype=object)
payment_type_count = payments.groupby('payment_type')['order_id'].nunique().sort_values(ascending=False)
payment_type_count
payment_type credit_card 76505 boleto 19784 voucher 3866 debit_card 1528 Name: order_id, dtype: int64
payment_type_count.iplot(kind='bar', theme='white')
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(
go.Pie(
labels=payment_type_count.index, values=payment_type_count.values,
textinfo='label+percent', insidetextorientation='horizontal'
)
)
fig.update_layout(
{
"title": {
"text": "Payment Type Analysis",
"x": 0.5,
"y": 0.9,
"font": {
"size": 15
}
},
"showlegend": True
}
)
fig.show()
orders.head()
order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | |
---|---|---|---|---|---|---|---|---|
0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 00:00:00 |
1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-07-26 03:24:27 | 2018-07-26 14:31:00 | 2018-08-07 15:27:45 | 2018-08-13 00:00:00 |
2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-08 08:55:23 | 2018-08-08 13:50:00 | 2018-08-17 18:06:29 | 2018-09-04 00:00:00 |
3 | 949d5b44dbf5de918fe9c16f97b45f8a | f88197465ea7920adcdbec7375364d82 | delivered | 2017-11-18 19:28:06 | 2017-11-18 19:45:59 | 2017-11-22 13:39:59 | 2017-12-02 00:28:42 | 2017-12-15 00:00:00 |
4 | ad21c59c0840e6cb83a9ceb5573f8159 | 8ab97904e6daea8866dbdbc4fb7aad2c | delivered | 2018-02-13 21:18:39 | 2018-02-13 22:20:29 | 2018-02-14 19:46:34 | 2018-02-16 18:17:02 | 2018-02-26 00:00:00 |
order_items.head()
order_id | order_item_id | product_id | seller_id | shipping_limit_date | price | freight_value | |
---|---|---|---|---|---|---|---|
0 | 00010242fe8c5a6d1ba2dd792cb16214 | 1 | 4244733e06e7ecb4970a6e2683c13e61 | 48436dade18ac8b2bce089ec2a041202 | 2017-09-19 09:45:35 | 58.90 | 13.29 |
1 | 00018f77f2f0320c557190d7a144bdd3 | 1 | e5f2d52b802189ee658865ca93d83a8f | dd7ddc04e1b6c2c614352b383efe2d36 | 2017-05-03 11:05:13 | 239.90 | 19.93 |
2 | 000229ec398224ef6ca0657da4fc703e | 1 | c777355d18b72b67abbeef9df44fd0fd | 5b51032eddd242adc84c38acab88f23d | 2018-01-18 14:48:30 | 199.00 | 17.87 |
3 | 00024acbcdf0a6daa1e931b038114c75 | 1 | 7634da152a4610f1595efa32f14722fc | 9d7a1d34a5052409006425275ba1c2b4 | 2018-08-15 10:10:18 | 12.99 | 12.79 |
4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 1 | ac6c3623068f30de03045865e4e10089 | df560393f3a51e74553ab94004ba5c87 | 2017-02-13 13:57:51 | 199.90 | 18.14 |
payments.head()
order_id | payment_sequential | payment_type | payment_installments | payment_value | |
---|---|---|---|---|---|
0 | b81ef226f3fe1789b1e8b2acac839d17 | 1 | credit_card | 8 | 99.33 |
1 | a9810da82917af2d9aefd1278f1dcfa0 | 1 | credit_card | 1 | 24.39 |
2 | 25e8ea4e93396b6fa0d3dd708e76c1bd | 1 | credit_card | 1 | 65.71 |
3 | ba78997921bbcdc1373bb41e913ab953 | 1 | credit_card | 8 | 107.78 |
4 | 42fdf880ba16b47b59251dd489d4441a | 1 | credit_card | 2 | 128.45 |
orders 의 구매 날짜와 payments 의 총 구매 금액을 가지고 월별 평균 거래액 분석을 하자
orders.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 99441 entries, 0 to 99440 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_id 99441 non-null object 1 customer_id 99441 non-null object 2 order_status 99441 non-null object 3 order_purchase_timestamp 99441 non-null object 4 order_approved_at 99281 non-null object 5 order_delivered_carrier_date 97658 non-null object 6 order_delivered_customer_date 96476 non-null object 7 order_estimated_delivery_date 99441 non-null object dtypes: object(8) memory usage: 6.1+ MB
orders.isnull().sum()
order_id 0 customer_id 0 order_status 0 order_purchase_timestamp 0 order_approved_at 160 order_delivered_carrier_date 1783 order_delivered_customer_date 2965 order_estimated_delivery_date 0 dtype: int64
orders = orders.dropna()
orders.isnull().sum()
order_id 0 customer_id 0 order_status 0 order_purchase_timestamp 0 order_approved_at 0 order_delivered_carrier_date 0 order_delivered_customer_date 0 order_estimated_delivery_date 0 dtype: int64
orders.info()
<class 'pandas.core.frame.DataFrame'> Index: 96461 entries, 0 to 99440 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_id 96461 non-null object 1 customer_id 96461 non-null object 2 order_status 96461 non-null object 3 order_purchase_timestamp 96461 non-null object 4 order_approved_at 96461 non-null object 5 order_delivered_carrier_date 96461 non-null object 6 order_delivered_customer_date 96461 non-null object 7 order_estimated_delivery_date 96461 non-null object dtypes: object(8) memory usage: 6.6+ MB
payments.isnull().sum()
order_id 0 payment_sequential 0 payment_type 0 payment_installments 0 payment_value 0 dtype: int64
payments.info()
<class 'pandas.core.frame.DataFrame'> Index: 103883 entries, 0 to 103885 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_id 103883 non-null object 1 payment_sequential 103883 non-null int64 2 payment_type 103883 non-null object 3 payment_installments 103883 non-null int64 4 payment_value 103883 non-null float64 dtypes: float64(1), int64(2), object(2) memory usage: 4.8+ MB
orders.info()
<class 'pandas.core.frame.DataFrame'> Index: 96461 entries, 0 to 99440 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_id 96461 non-null object 1 customer_id 96461 non-null object 2 order_status 96461 non-null object 3 order_purchase_timestamp 96461 non-null object 4 order_approved_at 96461 non-null object 5 order_delivered_carrier_date 96461 non-null object 6 order_delivered_customer_date 96461 non-null object 7 order_estimated_delivery_date 96461 non-null object dtypes: object(8) memory usage: 6.6+ MB
payments['order_id'].value_counts()
order_id fa65dad1b0e818e3ccc5cb0e39231352 29 ccf804e764ed5650cd8759557269dc13 26 285c2e15bebd4ac83635ccc563dc71f4 22 895ab968e7bb0d5659d16cd74cd1650c 21 fedcd9f7ccdc8cba3a18defedd1a5547 19 .. a7737f6d9208dd56ea498a322ed3c37f 1 646e62df54f3e236eb6d5ff3b31429b8 1 e115da7a49ec2acf622e1f31da65cfb9 1 f22592e01ddc0711e6794f900fa5e787 1 28bbae6599b09d39ca406b747b6632b1 1 Name: count, Length: 99437, dtype: int64
payments[payments['order_id'] == 'fa65dad1b0e818e3ccc5cb0e39231352']
order_id | payment_sequential | payment_type | payment_installments | payment_value | |
---|---|---|---|---|---|
4885 | fa65dad1b0e818e3ccc5cb0e39231352 | 27 | voucher | 1 | 66.02 |
9985 | fa65dad1b0e818e3ccc5cb0e39231352 | 4 | voucher | 1 | 29.16 |
14321 | fa65dad1b0e818e3ccc5cb0e39231352 | 1 | voucher | 1 | 3.71 |
17274 | fa65dad1b0e818e3ccc5cb0e39231352 | 9 | voucher | 1 | 1.08 |
19565 | fa65dad1b0e818e3ccc5cb0e39231352 | 10 | voucher | 1 | 12.86 |
23074 | fa65dad1b0e818e3ccc5cb0e39231352 | 2 | voucher | 1 | 8.51 |
24879 | fa65dad1b0e818e3ccc5cb0e39231352 | 25 | voucher | 1 | 3.68 |
28330 | fa65dad1b0e818e3ccc5cb0e39231352 | 5 | voucher | 1 | 0.66 |
29648 | fa65dad1b0e818e3ccc5cb0e39231352 | 6 | voucher | 1 | 5.02 |
32519 | fa65dad1b0e818e3ccc5cb0e39231352 | 11 | voucher | 1 | 4.03 |
36822 | fa65dad1b0e818e3ccc5cb0e39231352 | 14 | voucher | 1 | 0.00 |
39108 | fa65dad1b0e818e3ccc5cb0e39231352 | 29 | voucher | 1 | 19.26 |
39111 | fa65dad1b0e818e3ccc5cb0e39231352 | 28 | voucher | 1 | 29.05 |
63369 | fa65dad1b0e818e3ccc5cb0e39231352 | 15 | voucher | 1 | 14.04 |
65641 | fa65dad1b0e818e3ccc5cb0e39231352 | 3 | voucher | 1 | 2.95 |
68853 | fa65dad1b0e818e3ccc5cb0e39231352 | 8 | voucher | 1 | 26.02 |
73537 | fa65dad1b0e818e3ccc5cb0e39231352 | 12 | voucher | 1 | 3.60 |
74239 | fa65dad1b0e818e3ccc5cb0e39231352 | 21 | voucher | 1 | 4.11 |
74721 | fa65dad1b0e818e3ccc5cb0e39231352 | 19 | voucher | 1 | 5.02 |
77885 | fa65dad1b0e818e3ccc5cb0e39231352 | 13 | voucher | 1 | 0.00 |
78050 | fa65dad1b0e818e3ccc5cb0e39231352 | 16 | voucher | 1 | 12.86 |
79587 | fa65dad1b0e818e3ccc5cb0e39231352 | 26 | voucher | 1 | 28.27 |
79670 | fa65dad1b0e818e3ccc5cb0e39231352 | 17 | voucher | 1 | 3.72 |
80321 | fa65dad1b0e818e3ccc5cb0e39231352 | 18 | voucher | 1 | 0.73 |
82593 | fa65dad1b0e818e3ccc5cb0e39231352 | 7 | voucher | 1 | 0.32 |
85166 | fa65dad1b0e818e3ccc5cb0e39231352 | 23 | voucher | 1 | 18.86 |
93337 | fa65dad1b0e818e3ccc5cb0e39231352 | 20 | voucher | 1 | 150.00 |
99213 | fa65dad1b0e818e3ccc5cb0e39231352 | 24 | voucher | 1 | 0.42 |
100606 | fa65dad1b0e818e3ccc5cb0e39231352 | 22 | voucher | 1 | 4.03 |
중복된 order_id 에 대해 orders 필드값이 덮어씌워져서 본래 orders 보다 많은 행이 생김
payments = payments.groupby('order_id').sum()
payments[payments.index == 'fa65dad1b0e818e3ccc5cb0e39231352']
payment_sequential | payment_type | payment_installments | payment_value | |
---|---|---|---|---|
order_id | ||||
fa65dad1b0e818e3ccc5cb0e39231352 | 435 | vouchervouchervouchervouchervouchervouchervouc... | 29 | 457.99 |
merged_order = pd.merge(orders, payments, on='order_id')
merged_order.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 96460 entries, 0 to 96459 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_id 96460 non-null object 1 customer_id 96460 non-null object 2 order_status 96460 non-null object 3 order_purchase_timestamp 96460 non-null object 4 order_approved_at 96460 non-null object 5 order_delivered_carrier_date 96460 non-null object 6 order_delivered_customer_date 96460 non-null object 7 order_estimated_delivery_date 96460 non-null object 8 payment_sequential 96460 non-null int64 9 payment_type 96460 non-null object 10 payment_installments 96460 non-null int64 11 payment_value 96460 non-null float64 dtypes: float64(1), int64(2), object(9) memory usage: 8.8+ MB
merged_order[merged_order['order_id'] == 'fa65dad1b0e818e3ccc5cb0e39231352']
order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | payment_sequential | payment_type | payment_installments | payment_value |
---|
표시 | 의미 |
---|---|
%y | 연도를 축약해서 표시, 예: 21 |
%Y | 연도를 축약하지 않고 표시, 예: 2021 |
%m | 월을 두자릿 수로 표시, 예: 01 ~ 12 |
%d | 일을 두자릿 수로 표시, 예: 01 ~ 31 |
%H | 시간 표시(24시간), 예: 00 ~ 23 |
%M | 분 표시, 예: 00 ~ 59 |
%S | 초 표시, 예: 00 ~ 59 |
참고: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
# order 한 시간 정보 데이터만 datetime64 로 변환하기
merged_order['order_purchase_timestamp'] = pd.to_datetime(merged_order['order_purchase_timestamp'],
format='%Y-%m-%d %H:%M:%S', errors='raise')
merged_order_payment_date = merged_order[['order_purchase_timestamp', 'payment_value']].copy()
merged_order_payment_date.head()
order_purchase_timestamp | payment_value | |
---|---|---|
0 | 2017-10-02 10:56:33 | 38.71 |
1 | 2018-07-24 20:41:37 | 141.46 |
2 | 2018-08-08 08:38:49 | 179.12 |
3 | 2017-11-18 19:28:06 | 72.20 |
4 | 2018-02-13 21:18:39 | 28.62 |
데이터프레임.groupby(pd.Groper(key='그루핑기준이되는 컬럼', freq='세부 기준'))
freq 옵션: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
# 월별 거래액 확인하기
merged_order_month_sum = merged_order_payment_date.groupby(pd.Grouper(key='order_purchase_timestamp', freq='M')).sum() # key 는 기본이 index 임
merged_order_month_sum.head()
payment_value | |
---|---|
order_purchase_timestamp | |
2016-10-31 | 47271.20 |
2016-11-30 | 0.00 |
2016-12-31 | 19.62 |
2017-01-31 | 127430.74 |
2017-02-28 | 269458.98 |
merged_order_month_sum.iplot(kind='bar', theme='white')
merged_order_month_sum['payment_value'].mean()
670420.9934782608
merged_order_month_sum[merged_order_month_sum['payment_value'] == merged_order_month_sum['payment_value'].max()]
payment_value | |
---|---|
order_purchase_timestamp | |
2017-11-30 | 1153393.22 |
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(
go.Bar(
x=merged_order_month_sum.index,
y=merged_order_month_sum['payment_value'],
text=merged_order_month_sum['payment_value'],
textposition='auto',
texttemplate='R$ %{text:.0f}'
)
)
fig.update_layout(
{
"title": {
"text": "<b>Turnover per Month in Brazilian Olist E-Commerce company</b>",
"x": 0.5,
"y": 0.9,
"font": {
"size": 15
}
},
"xaxis": {
"title": "from Oct. 2016 to Sep. 2018",
"showticklabels":True,
"dtick": "M1",
"tickfont": {
"size": 8
}
},
"yaxis": {
"title": "Turnover per Month"
}
}
)
fig.show()
# 불필요한 데이터 삭제
merged_order_month_sum_from2017 = merged_order_month_sum[merged_order_month_sum.index > '2017-01-01']
merged_order_month_sum_from2017
payment_value | |
---|---|
order_purchase_timestamp | |
2017-01-31 | 127430.74 |
2017-02-28 | 269458.98 |
2017-03-31 | 414369.39 |
2017-04-30 | 390952.18 |
2017-05-31 | 566872.73 |
2017-06-30 | 490225.60 |
2017-07-31 | 566403.93 |
2017-08-31 | 646000.61 |
2017-09-30 | 700976.01 |
2017-10-31 | 751140.27 |
2017-11-30 | 1153393.22 |
2017-12-31 | 843199.17 |
2018-01-31 | 1078606.86 |
2018-02-28 | 966554.97 |
2018-03-31 | 1120678.00 |
2018-04-30 | 1132933.95 |
2018-05-31 | 1128836.69 |
2018-06-30 | 1011561.35 |
2018-07-31 | 1027383.10 |
2018-08-31 | 985414.28 |
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(
go.Bar(
x=merged_order_month_sum_from2017.index,
y=merged_order_month_sum_from2017['payment_value'],
text=merged_order_month_sum_from2017['payment_value'],
textposition='auto',
texttemplate='R$ %{text:.0f}'
)
)
fig.update_layout(
{
"title": {
"text": "<b>Turnover per Month in Brazilian Olist E-Commerce company</b>",
"x": 0.5,
"y": 0.9,
"font": {
"size": 15
}
},
"xaxis": {
"title": "from Jan. 2017 to Sep. 2018",
"showticklabels":True,
"dtick": "M1",
"tickfont": {
"size": 7
}
},
"yaxis": {
"title": "Turnover per Month"
}
}
)
fig.show()
import plotly.io as pio
pio.templates
Templates configuration ----------------------- Default template: 'plotly' Available templates: ['ggplot2', 'seaborn', 'simple_white', 'plotly', 'plotly_white', 'plotly_dark', 'presentation', 'xgridoff', 'ygridoff', 'gridon', 'none']
import plotly.graph_objects as go
for template in pio.templates:
fig = go.Figure()
fig.add_trace(
go.Bar(
x=merged_order_month_sum_from2017.index,
y=merged_order_month_sum_from2017['payment_value'],
text=merged_order_month_sum_from2017['payment_value'],
textposition='auto',
texttemplate='R$ %{text:.0f}'
)
)
fig.update_layout(
{
"title": {
"text": "<b>Turnover per Month in Brazilian Olist E-Commerce company</b> by " + template,
"x": 0.5,
"y": 0.9,
"font": {
"size": 15
}
},
"xaxis": {
"title": "from Feb. 2017 to Sep. 2018",
"showticklabels":True,
"tick0": "2017-01-31", # 처음 tick 을 설정을 해주지 않을 경우, x 축이 밀리는 경우가 있음
"dtick": "M1", # 한 달 단위로 tick 설정
"tickfont": {
"size": 7
}
},
"yaxis": {
"title": "Turnover per Month"
},
"template":template
}
)
fig.show()
# 테마 선택
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(
go.Bar(
x=merged_order_month_sum_from2017.index,
y=merged_order_month_sum_from2017['payment_value'],
text=merged_order_month_sum_from2017['payment_value'],
textposition='auto',
texttemplate='R$ %{text:,.0f}'
)
)
fig.update_layout(
{
"title": {
"text": "<b>Turnover per Month in Brazilian Olist E-Commerce company</b>",
"x": 0.5,
"y": 0.9,
"font": {
"size": 15
}
},
"xaxis": {
"title": "from Jan. 2017 to Sep. 2018",
"showticklabels":True,
"tick0": "2017-01-31", # 처음 tick 을 설정을 해주지 않을 경우, x 축이 밀리는 경우가 있음
"dtick": "M1", # 한 달 단위로 tick 설정
"tickfont": {
"size": 7
}
},
"yaxis": {
"title": "Turnover per Month"
},
"template":'plotly_white'
}
)
fig.show()
colors = ['#03588C',] * len(merged_order_month_sum_from2017.index)
colors[10] = '#F24472'
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(
go.Bar(
x=merged_order_month_sum_from2017.index,
y=merged_order_month_sum_from2017['payment_value'],
text=merged_order_month_sum_from2017['payment_value'],
textposition='auto',
texttemplate='R$ %{text:,.0f}',
marker_color=colors
)
)
fig.update_layout(
{
"title": {
"text": "<b>Turnover per Month in Brazilian Olist E-Commerce company</b>",
"x": 0.5,
"y": 0.9,
"font": {
"size": 15
}
},
"xaxis": {
"title": "from Jan. 2017 to Aug. 2018",
"showticklabels":True,
"tick0": "2017-01-31",
"dtick": "M1",
"tickfont": {
"size": 7
}
},
"yaxis": {
"title": "Turnover per Month",
"tickfont": {
"size": 10
}
},
"template":'plotly_white'
}
)
fig.show()
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(
go.Bar(
x=merged_order_month_sum_from2017.index,
y=merged_order_month_sum_from2017['payment_value'],
text=merged_order_month_sum_from2017['payment_value'],
textposition='auto',
texttemplate='R$ %{y:,.0f}',
marker_color=colors
)
)
fig.update_layout(
{
"title": {
"text": "<b>Turnover per Month in Brazilian Olist E-Commerce company</b>",
"x": 0.5,
"y": 0.9,
"font": {
"size": 15
}
},
"xaxis": {
"title": "from Jan. 2017 to Sep. 2018",
"showticklabels":True,
"tick0": "2017-01-31",
"dtick": "M1",
"tickfont": {
"size": 7
}
},
"yaxis": {
"title": "Turnover per Month",
"tickfont": {
"size": 10
}
},
"template":'plotly_white'
}
)
fig.add_annotation(
x="2017-11-30",
y=1153393,
text="<b>Peaked Monthly Turnover</b>",
showarrow=True,
font=dict(
size=10,
color="#ffffff"
),
align="center",
arrowhead=2,
arrowsize=1,
arrowwidth=2,
arrowcolor="#77CFD9",
ax=20,
ay=-30,
bordercolor="#77CFD9",
borderwidth=2,
borderpad=4,
bgcolor="#F25D50",
opacity=0.9
)
fig.show()
merged_order.head()
order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | payment_sequential | payment_type | payment_installments | payment_value | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 00:00:00 | 6 | credit_cardvouchervoucher | 3 | 38.71 |
1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-07-26 03:24:27 | 2018-07-26 14:31:00 | 2018-08-07 15:27:45 | 2018-08-13 00:00:00 | 1 | boleto | 1 | 141.46 |
2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-08 08:55:23 | 2018-08-08 13:50:00 | 2018-08-17 18:06:29 | 2018-09-04 00:00:00 | 1 | credit_card | 3 | 179.12 |
3 | 949d5b44dbf5de918fe9c16f97b45f8a | f88197465ea7920adcdbec7375364d82 | delivered | 2017-11-18 19:28:06 | 2017-11-18 19:45:59 | 2017-11-22 13:39:59 | 2017-12-02 00:28:42 | 2017-12-15 00:00:00 | 1 | credit_card | 1 | 72.20 |
4 | ad21c59c0840e6cb83a9ceb5573f8159 | 8ab97904e6daea8866dbdbc4fb7aad2c | delivered | 2018-02-13 21:18:39 | 2018-02-13 22:20:29 | 2018-02-14 19:46:34 | 2018-02-16 18:17:02 | 2018-02-26 00:00:00 | 1 | credit_card | 1 | 28.62 |
merged_order_payment_date.head()
order_purchase_timestamp | payment_value | |
---|---|---|
0 | 2017-10-02 10:56:33 | 38.71 |
1 | 2018-07-24 20:41:37 | 141.46 |
2 | 2018-08-08 08:38:49 | 179.12 |
3 | 2017-11-18 19:28:06 | 72.20 |
4 | 2018-02-13 21:18:39 | 28.62 |
# order_purchase_timestamp 의 날짜 데이터를 기반으로 월별 계산을 해야 하므로 datetime 타입으로 변환
merged_order_payment_date['order_purchase_timestamp'] = pd.to_datetime(merged_order_payment_date['order_purchase_timestamp'], format='%Y-%m-%d %H:%M:%S', errors='raise')
merged_order_payment_date = merged_order_payment_date.set_index('order_purchase_timestamp')
merged_order_month_count = merged_order_payment_date.groupby(pd.Grouper(freq='M')).count() # key 는 기본이 index 임
merged_order_month_count.head()
payment_value | |
---|---|
order_purchase_timestamp | |
2016-10-31 | 270 |
2016-11-30 | 0 |
2016-12-31 | 1 |
2017-01-31 | 748 |
2017-02-28 | 1641 |
merged_order_month_count.iplot(kind='bar', theme='white')
merged_order_date_sum = merged_order_payment_date.groupby(pd.Grouper(freq='D')).sum() # key 는 기본이 index 임
merged_order_date_sum.iplot(kind='line', theme='white')
merged_order_payment_date['order_purchase_timestamp'] = pd.to_datetime(merged_order_payment_date.index, format='%Y-%m-%d %H:%M:%S', errors='raise')
merged_order_payment_date.info()
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 96460 entries, 2017-10-02 10:56:33 to 2018-03-08 20:57:30 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 payment_value 96460 non-null float64 1 order_purchase_timestamp 96460 non-null datetime64[ns] dtypes: datetime64[ns](1), float64(1) memory usage: 2.2 MB
merged_order_payment_date['year'] = merged_order_payment_date['order_purchase_timestamp'].dt.year
merged_order_payment_date['monthday'] = merged_order_payment_date['order_purchase_timestamp'].dt.day
merged_order_payment_date['weekday'] = merged_order_payment_date['order_purchase_timestamp'].dt.weekday
merged_order_payment_date['month'] = merged_order_payment_date['order_purchase_timestamp'].dt.month
merged_order_payment_date['hour'] = merged_order_payment_date['order_purchase_timestamp'].dt.hour
merged_order_payment_date['quarter'] = merged_order_payment_date['order_purchase_timestamp'].dt.quarter
merged_order_payment_date['minute'] = merged_order_payment_date['order_purchase_timestamp'].dt.minute
merged_order_payment_year = merged_order_payment_date[['year', 'payment_value']].copy()
merged_order_payment_year
year | payment_value | |
---|---|---|
order_purchase_timestamp | ||
2017-10-02 10:56:33 | 2017 | 38.71 |
2018-07-24 20:41:37 | 2018 | 141.46 |
2018-08-08 08:38:49 | 2018 | 179.12 |
2017-11-18 19:28:06 | 2017 | 72.20 |
2018-02-13 21:18:39 | 2018 | 28.62 |
... | ... | ... |
2017-03-09 09:54:05 | 2017 | 85.08 |
2018-02-06 12:58:58 | 2018 | 195.00 |
2017-08-27 14:46:43 | 2017 | 271.01 |
2018-01-08 21:28:27 | 2018 | 441.16 |
2018-03-08 20:57:30 | 2018 | 86.86 |
96460 rows × 2 columns
merged_order_payment_year = merged_order_payment_year.groupby('year').sum()
merged_order_payment_year.head()
payment_value | |
---|---|
year | |
2016 | 47290.82 |
2017 | 6920422.83 |
2018 | 8451969.20 |
merged_order_payment_year.iplot(kind='bar', theme='white')
layout = {
'xaxis': {
'showticklabels':True,
'tickvals':[2016, 2017, 2018]
}
}
merged_order_payment_year.iplot(kind='bar', theme='white', layout=layout)
merged_order_payment_weekday = merged_order_payment_date[['weekday', 'payment_value']].copy()
merged_order_payment_weekday = merged_order_payment_weekday.groupby('weekday').sum()
merged_order_payment_weekday.head()
payment_value | |
---|---|
weekday | |
0 | 2530671.56 |
1 | 2473930.77 |
2 | 2396215.34 |
3 | 2283849.51 |
4 | 2222421.36 |
merged_order_payment_weekday = merged_order_payment_weekday.reset_index()
def func(row):
if row['weekday'] == 0:
row['weekday'] = 'Mon'
elif row['weekday'] == 1:
row['weekday'] = 'Tue'
elif row['weekday'] == 2:
row['weekday'] = 'Wed'
elif row['weekday'] == 3:
row['weekday'] = 'Thu'
elif row['weekday'] == 4:
row['weekday'] = 'Fri'
elif row['weekday'] == 5:
row['weekday'] = 'Sat'
elif row['weekday'] == 6:
row['weekday'] = 'Sun'
return row
merged_order_payment_weekday = merged_order_payment_weekday.apply(func, axis=1)
merged_order_payment_weekday
weekday | payment_value | |
---|---|---|
0 | Mon | 2530671.56 |
1 | Tue | 2473930.77 |
2 | Wed | 2396215.34 |
3 | Thu | 2283849.51 |
4 | Fri | 2222421.36 |
5 | Sat | 1704949.69 |
6 | Sun | 1807644.62 |
merged_order_payment_weekday = merged_order_payment_weekday.set_index('weekday')
merged_order_payment_weekday.iplot(kind='bar', theme='white')
merged_order_payment_quarter = merged_order_payment_date[['quarter', 'payment_value']].copy()
merged_order_payment_quarter = merged_order_payment_quarter.groupby('quarter').sum()
merged_order_payment_quarter.head()
payment_value | |
---|---|
quarter | |
1 | 3977098.94 |
2 | 4721382.50 |
3 | 3926177.93 |
4 | 2795023.48 |
merged_order_payment_quarter = merged_order_payment_quarter.reset_index()
def func(row):
if row['quarter'] == 1:
row['quarter'] = '1Q'
elif row['quarter'] == 2:
row['quarter'] = '2Q'
elif row['quarter'] == 3:
row['quarter'] = '3Q'
elif row['quarter'] == 4:
row['quarter'] = '4Q'
return row
merged_order_payment_quarter = merged_order_payment_quarter.apply(func, axis=1).set_index('quarter')
merged_order_payment_quarter.iplot(kind='bar', theme='white')
merged_order_payment_hour = merged_order_payment_date[['hour', 'payment_value']].copy()
merged_order_payment_hour = merged_order_payment_hour.groupby('hour').sum()
merged_order_payment_hour.iplot(kind='bar', theme='white')
layout = {
'xaxis': {
'showticklabels':True,
'dtick': 1
}
}
merged_order_payment_hour.iplot(kind='bar', theme='white', layout=layout)
import pandas as pd
PATH = "C:/Users/Admin/Data Analysis/Ecommerce\Dataset/"
products = pd.read_csv(PATH + "olist_products_dataset.csv", encoding='utf-8-sig')
customers = pd.read_csv(PATH + "olist_customers_dataset.csv", encoding='utf-8-sig')
geolocation = pd.read_csv(PATH + "olist_geolocation_dataset.csv", encoding='utf-8-sig')
order_items = pd.read_csv(PATH + "olist_order_items_dataset.csv", encoding='utf-8-sig')
payments = pd.read_csv(PATH + "olist_order_payments_dataset.csv", encoding='utf-8-sig')
reviews = pd.read_csv(PATH + "olist_order_reviews_dataset.csv", encoding='utf-8-sig')
orders = pd.read_csv(PATH + "olist_orders_dataset.csv", encoding='utf-8-sig')
sellers = pd.read_csv(PATH + "olist_sellers_dataset.csv", encoding='utf-8-sig')
category_name = pd.read_csv(PATH + "product_category_name_translation.csv", encoding='utf-8-sig')
orders = orders.dropna()
merged_order = pd.merge(orders, payments, on='order_id')
merged_order_payment_date = merged_order[['order_purchase_timestamp', 'payment_value']].copy()
merged_order_payment_date['order_purchase_timestamp'] = pd.to_datetime(merged_order_payment_date['order_purchase_timestamp'], format='%Y-%m-%d %H:%M:%S', errors='raise')
merged_order_payment_date['year'] = merged_order_payment_date['order_purchase_timestamp'].dt.year
merged_order_payment_date['monthday'] = merged_order_payment_date['order_purchase_timestamp'].dt.day
merged_order_payment_date['weekday'] = merged_order_payment_date['order_purchase_timestamp'].dt.weekday
merged_order_payment_date['month'] = merged_order_payment_date['order_purchase_timestamp'].dt.month
merged_order_payment_date['hour'] = merged_order_payment_date['order_purchase_timestamp'].dt.hour
merged_order_payment_date['quarter'] = merged_order_payment_date['order_purchase_timestamp'].dt.quarter
merged_order_payment_date['minute'] = merged_order_payment_date['order_purchase_timestamp'].dt.minute
merged_order_payment_date.head()
order_purchase_timestamp | payment_value | year | monthday | weekday | month | hour | quarter | minute | |
---|---|---|---|---|---|---|---|---|---|
0 | 2017-10-02 10:56:33 | 18.12 | 2017 | 2 | 0 | 10 | 10 | 4 | 56 |
1 | 2017-10-02 10:56:33 | 2.00 | 2017 | 2 | 0 | 10 | 10 | 4 | 56 |
2 | 2017-10-02 10:56:33 | 18.59 | 2017 | 2 | 0 | 10 | 10 | 4 | 56 |
3 | 2018-07-24 20:41:37 | 141.46 | 2018 | 24 | 1 | 7 | 20 | 3 | 41 |
4 | 2018-08-08 08:38:49 | 179.12 | 2018 | 8 | 2 | 8 | 8 | 3 | 38 |
merged_order_payment_hour_weekday = merged_order_payment_date[['weekday', 'hour', 'payment_value']].copy()
merged_order_payment_hour_weekday
weekday | hour | payment_value | |
---|---|---|---|
0 | 0 | 10 | 18.12 |
1 | 0 | 10 | 2.00 |
2 | 0 | 10 | 18.59 |
3 | 1 | 20 | 141.46 |
4 | 2 | 8 | 179.12 |
... | ... | ... | ... |
100734 | 3 | 9 | 85.08 |
100735 | 1 | 12 | 195.00 |
100736 | 6 | 14 | 271.01 |
100737 | 0 | 21 | 441.16 |
100738 | 3 | 20 | 86.86 |
100739 rows × 3 columns
merged_order_payment_hour_weekday = merged_order_payment_hour_weekday.groupby(['weekday','hour']).sum()
merged_order_payment_hour_weekday
payment_value | ||
---|---|---|
weekday | hour | |
0 | 0 | 46807.42 |
1 | 17734.77 | |
2 | 7462.56 | |
3 | 6442.62 | |
4 | 2138.02 | |
... | ... | ... |
6 | 19 | 140127.39 |
20 | 150543.96 | |
21 | 135431.31 | |
22 | 127866.74 | |
23 | 82991.95 |
168 rows × 1 columns
merged_order_payment_hour_weekday.index[1][0] # 리스트 안의 리스트일 때처럼 가리키면 됨
0
merged_order_payment_hour_weekday = merged_order_payment_hour_weekday.reset_index()
merged_order_payment_hour_weekday
index | weekday | hour | payment_value | |
---|---|---|---|---|
0 | 0 | 0 | 0 | 46807.42 |
1 | 1 | 0 | 1 | 17734.77 |
2 | 2 | 0 | 2 | 7462.56 |
3 | 3 | 0 | 3 | 6442.62 |
4 | 4 | 0 | 4 | 2138.02 |
... | ... | ... | ... | ... |
163 | 163 | 6 | 19 | 140127.39 |
164 | 164 | 6 | 20 | 150543.96 |
165 | 165 | 6 | 21 | 135431.31 |
166 | 166 | 6 | 22 | 127866.74 |
167 | 167 | 6 | 23 | 82991.95 |
168 rows × 4 columns
def func(row):
if row['weekday'] == 0:
row['weekday'] = 'Mon'
elif row['weekday'] == 1:
row['weekday'] = 'Tue'
elif row['weekday'] == 2:
row['weekday'] = 'Wed'
elif row['weekday'] == 3:
row['weekday'] = 'Thu'
elif row['weekday'] == 4:
row['weekday'] = 'Fri'
elif row['weekday'] == 5:
row['weekday'] = 'Sat'
elif row['weekday'] == 6:
row['weekday'] = 'Sun'
return row
merged_order_payment_hour_weekday = merged_order_payment_hour_weekday.apply(func, axis=1)
merged_order_payment_hour_weekday
index | weekday | hour | payment_value | |
---|---|---|---|---|
0 | 0.0 | Mon | 0.0 | 46807.42 |
1 | 1.0 | Mon | 1.0 | 17734.77 |
2 | 2.0 | Mon | 2.0 | 7462.56 |
3 | 3.0 | Mon | 3.0 | 6442.62 |
4 | 4.0 | Mon | 4.0 | 2138.02 |
... | ... | ... | ... | ... |
163 | 163.0 | Sun | 19.0 | 140127.39 |
164 | 164.0 | Sun | 20.0 | 150543.96 |
165 | 165.0 | Sun | 21.0 | 135431.31 |
166 | 166.0 | Sun | 22.0 | 127866.74 |
167 | 167.0 | Sun | 23.0 | 82991.95 |
168 rows × 4 columns
merged_order_payment_hour_weekday['weekday'] = pd.Categorical(merged_order_payment_hour_weekday['weekday'],categories=['Sun','Mon','Tue','Wed','Thu','Fri','Sat'],ordered=True)
merged_order_payment_hour_weekday = merged_order_payment_hour_weekday.sort_values(by=['weekday', 'hour'], ascending=True)
merged_order_payment_hour_weekday.head()
index | weekday | hour | payment_value | |
---|---|---|---|---|
144 | 144.0 | Sun | 0.0 | 38888.45 |
145 | 145.0 | Sun | 1.0 | 20137.20 |
146 | 146.0 | Sun | 2.0 | 11965.82 |
147 | 147.0 | Sun | 3.0 | 6326.13 |
148 | 148.0 | Sun | 4.0 | 3914.48 |
merged_order_payment_hour_weekday.iplot(kind='heatmap', y='weekday', x='hour', z='payment_value', colorscale='Blues')
layout = {
'title': {
"text": "<b>Heatmap of turnover over the hour by day</b>",
"x": 0.5,
"y": 0.87,
"font": {
"size": 16
}
},
'xaxis': {
'title': '<b>Hour</b>',
'showticklabels':True,
'dtick': 1
},
"yaxis": {
"title": "<b>The day of the week</b>"
},
"template":'plotly_white'
}
merged_order_payment_hour_weekday.iplot(kind='heatmap', y='weekday', x='hour', z='payment_value', colorscale='Reds', layout=layout)
products_category = pd.merge(products, category_name, on='product_category_name')
products_category
product_id | product_category_name | product_name_lenght | product_description_lenght | product_photos_qty | product_weight_g | product_length_cm | product_height_cm | product_width_cm | product_category_name_english | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1e9e8ef04dbcff4541ed26657ea517e5 | perfumaria | 40.0 | 287.0 | 1.0 | 225.0 | 16.0 | 10.0 | 14.0 | perfumery |
1 | 6a2fb4dd53d2cdb88e0432f1284a004c | perfumaria | 39.0 | 346.0 | 2.0 | 400.0 | 27.0 | 5.0 | 20.0 | perfumery |
2 | 0d009643171aee696f4733340bc2fdd0 | perfumaria | 52.0 | 150.0 | 1.0 | 422.0 | 21.0 | 16.0 | 18.0 | perfumery |
3 | b1eae565a61935e0011ee7682fef9dc9 | perfumaria | 49.0 | 460.0 | 2.0 | 267.0 | 17.0 | 13.0 | 17.0 | perfumery |
4 | 8da90b37f0fb171b4877c124f965b1f6 | perfumaria | 56.0 | 733.0 | 3.0 | 377.0 | 18.0 | 13.0 | 15.0 | perfumery |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
32323 | 13b25797ad1e6d6c7cd3cbeb35f8ee7a | moveis_colchao_e_estofado | 44.0 | 348.0 | 1.0 | 18250.0 | 25.0 | 25.0 | 25.0 | furniture_mattress_and_upholstery |
32324 | dc36a7859b743d8610a2bbbaea26ece9 | moveis_colchao_e_estofado | 29.0 | 2073.0 | 2.0 | 2500.0 | 41.0 | 21.0 | 21.0 | furniture_mattress_and_upholstery |
32325 | 107fde0930956120d1e13dd1062fbb46 | moveis_colchao_e_estofado | 51.0 | 418.0 | 1.0 | 30000.0 | 60.0 | 73.0 | 60.0 | furniture_mattress_and_upholstery |
32326 | 726b4e18f00255e2e63491bcba3f60b8 | moveis_colchao_e_estofado | 41.0 | 866.0 | 1.0 | 1700.0 | 70.0 | 10.0 | 100.0 | furniture_mattress_and_upholstery |
32327 | 1dceebcc5f23c02ea23e16d5bedca000 | cds_dvds_musicais | 47.0 | 117.0 | 2.0 | 550.0 | 35.0 | 15.0 | 25.0 | cds_dvds_musicals |
32328 rows × 10 columns
products_category_only = products_category[['product_id', 'product_category_name_english']].copy()
products_category_only.head()
product_id | product_category_name_english | |
---|---|---|
0 | 1e9e8ef04dbcff4541ed26657ea517e5 | perfumery |
1 | 6a2fb4dd53d2cdb88e0432f1284a004c | perfumery |
2 | 0d009643171aee696f4733340bc2fdd0 | perfumery |
3 | b1eae565a61935e0011ee7682fef9dc9 | perfumery |
4 | 8da90b37f0fb171b4877c124f965b1f6 | perfumery |
products_category_order = pd.merge(products_category_only, order_items, on='product_id')
products_category_order.head()
product_id | product_category_name_english | order_id | order_item_id | seller_id | shipping_limit_date | price | freight_value | |
---|---|---|---|---|---|---|---|---|
0 | 1e9e8ef04dbcff4541ed26657ea517e5 | perfumery | e17e4f88e31525f7deef66779844ddce | 1 | 5670f4db5b62c43d542e1b2d56b0cf7c | 2018-04-30 17:33:54 | 10.91 | 7.39 |
1 | 6a2fb4dd53d2cdb88e0432f1284a004c | perfumery | 048cc42e03ca8d43c729adf6962cb348 | 1 | 7040e82f899a04d1b434b795a43b4617 | 2017-11-23 21:31:31 | 16.90 | 7.78 |
2 | 6a2fb4dd53d2cdb88e0432f1284a004c | perfumery | 5fa78e91b5cb84b6a8ccc42733f95742 | 1 | 7040e82f899a04d1b434b795a43b4617 | 2017-12-14 16:33:14 | 16.90 | 7.78 |
3 | 0d009643171aee696f4733340bc2fdd0 | perfumery | 24b1c4d88fdb7a2dc87f8ecc7d8f47f1 | 1 | fe2032dab1a61af8794248c8196565c9 | 2017-08-03 09:50:14 | 339.00 | 17.13 |
4 | 0d009643171aee696f4733340bc2fdd0 | perfumery | 7b13c77c64a9a956500cbf1a9a23798d | 1 | 7178f9f4dd81dcef02f62acdf8151e01 | 2018-05-30 10:38:35 | 275.00 | 23.48 |
products_category_order = products_category_order.groupby('product_category_name_english').aggregate({'order_id':'count'}).rename(columns={'order_id':'order_count'})
products_category_order.head()
order_count | |
---|---|
product_category_name_english | |
agro_industry_and_commerce | 212 |
air_conditioning | 297 |
art | 209 |
arts_and_craftmanship | 24 |
audio | 364 |
products_category_order = products_category_order.sort_values(by='order_count',ascending=False)
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(
go.Bar(
x=products_category_order.index,
y=products_category_order['order_count']
)
)
fig.update_layout(
{
"title": {
"text": "<b>The number of Order per Category in Brazilian Olist E-Commerce company</b>",
"x": 0.5,
"y": 0.9,
"font": {
"size": 15
}
},
"xaxis": {
"title": "from Oct. 2016 to Sep. 2018",
"showticklabels":True,
"tickfont": {
"size": 7
}
},
"yaxis": {
"title": "The number of Order"
},
"template":'plotly_white'
}
)
fig.show()
# 주요 카테고리만 확인
products_category_order = products_category_order[products_category_order['order_count'] > 1000]
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(
go.Bar(
x=products_category_order.index,
y=products_category_order['order_count']
)
)
fig.update_layout(
{
"title": {
"text": "<b>The number of Order per Category in Brazilian Olist E-Commerce company</b>",
"x": 0.5,
"y": 0.9,
"font": {
"size": 15
}
},
"xaxis": {
"title": "from Oct. 2016 to Sep. 2018",
"showticklabels":True,
"tickfont": {
"size": 7
}
},
"yaxis": {
"title": "The number of Order"
},
"template":'plotly_white'
}
)
fig.show()
orders.head()
order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | |
---|---|---|---|---|---|---|---|---|
0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 00:00:00 |
1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-07-26 03:24:27 | 2018-07-26 14:31:00 | 2018-08-07 15:27:45 | 2018-08-13 00:00:00 |
2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-08 08:55:23 | 2018-08-08 13:50:00 | 2018-08-17 18:06:29 | 2018-09-04 00:00:00 |
3 | 949d5b44dbf5de918fe9c16f97b45f8a | f88197465ea7920adcdbec7375364d82 | delivered | 2017-11-18 19:28:06 | 2017-11-18 19:45:59 | 2017-11-22 13:39:59 | 2017-12-02 00:28:42 | 2017-12-15 00:00:00 |
4 | ad21c59c0840e6cb83a9ceb5573f8159 | 8ab97904e6daea8866dbdbc4fb7aad2c | delivered | 2018-02-13 21:18:39 | 2018-02-13 22:20:29 | 2018-02-14 19:46:34 | 2018-02-16 18:17:02 | 2018-02-26 00:00:00 |
orders.info()
<class 'pandas.core.frame.DataFrame'> Index: 96461 entries, 0 to 99440 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_id 96461 non-null object 1 customer_id 96461 non-null object 2 order_status 96461 non-null object 3 order_purchase_timestamp 96461 non-null object 4 order_approved_at 96461 non-null object 5 order_delivered_carrier_date 96461 non-null object 6 order_delivered_customer_date 96461 non-null object 7 order_estimated_delivery_date 96461 non-null object dtypes: object(8) memory usage: 6.6+ MB
orders.isnull().sum()
order_id 0 customer_id 0 order_status 0 order_purchase_timestamp 0 order_approved_at 0 order_delivered_carrier_date 0 order_delivered_customer_date 0 order_estimated_delivery_date 0 dtype: int64
# 결측값 삭제
orders = orders.dropna()
orders.isnull().sum()
order_id 0 customer_id 0 order_status 0 order_purchase_timestamp 0 order_approved_at 0 order_delivered_carrier_date 0 order_delivered_customer_date 0 order_estimated_delivery_date 0 dtype: int64
orders['delivery_time'] = pd.to_datetime(orders['order_delivered_customer_date']) - pd.to_datetime(orders['order_purchase_timestamp'])
orders['delivery_time'].describe()
count 96461 mean 12 days 13:23:46.585853350 std 9 days 13:06:45.121880865 min 0 days 12:48:07 25% 6 days 18:22:50 50% 10 days 05:11:29 75% 15 days 17:17:10 max 209 days 15:05:12 Name: delivery_time, dtype: object
orders['delivery_time'].iplot(kind="box")
# max 값을 제거 하기 위해서 quantile 활용
delivery_time_q95 = orders['delivery_time'].quantile(.95)
delivery_time_q95
Timedelta('29 days 06:36:33')
delivery_time_q90 = orders['delivery_time'].quantile(.90)
delivery_time_q90
Timedelta('23 days 02:21:07')
orders = orders[orders['delivery_time'] < delivery_time_q95]
orders.describe()
delivery_time | |
---|---|
count | 91637 |
mean | 11 days 01:30:52.357934022 |
std | 6 days 02:35:35.737263832 |
min | 0 days 12:48:07 |
25% | 6 days 11:17:04 |
50% | 9 days 21:57:11 |
75% | 14 days 13:03:22 |
max | 29 days 06:35:41 |
orders_date = orders[['order_purchase_timestamp', 'delivery_time']].copy()
# order_purchase_timestamp 의 날짜 데이터를 기반으로 월별 계산을 해야 하므로 datetime 타입으로 변환
orders_date['order_purchase_timestamp'] = pd.to_datetime(orders_date['order_purchase_timestamp'], format='%Y-%m-%d %H:%M:%S', errors='raise')
# delivery_time이 timedelta64 타입인데, 이를 float 타입으로 변환
orders_date['delivery_time'] = orders_date['delivery_time'].dt.total_seconds()
orders_date = orders_date.set_index('order_purchase_timestamp')
orders_date = orders_date.groupby(pd.Grouper(freq='M')).mean() # key 는 기본이 index 임
orders_date.head()
delivery_time | |
---|---|
order_purchase_timestamp | |
2016-10-31 | 1.293386e+06 |
2016-11-30 | NaN |
2016-12-31 | 4.054770e+05 |
2017-01-31 | 9.549751e+05 |
2017-02-28 | 1.010813e+06 |
orders_date['delivery_time'] = orders_date['delivery_time'] / 86400
orders_date.head()
delivery_time | |
---|---|
order_purchase_timestamp | |
2016-10-31 | 14.969742 |
2016-11-30 | NaN |
2016-12-31 | 4.693021 |
2017-01-31 | 11.052953 |
2017-02-28 | 11.699220 |
orders_date.iplot(kind="bar")
orders_date = orders_date[orders_date.index > "2017-01-01"]
layout = {
'xaxis': {
'showticklabels':True,
"tick0": "2017-01-31",
'dtick': 'M1',
"tickfont": {
"size": 7
}
}
}
orders_date.iplot(kind="bar", layout=layout)
colors = ['#1B80BF',] * len(orders_date.index)
for index in range(15, len(orders_date.index)):
colors[index] = '#BF2C47'
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(
go.Bar(
x=orders_date.index,
y=orders_date['delivery_time'],
text=orders_date['delivery_time'],
textposition='auto',
texttemplate='%{text:.2f} days',
marker_color=colors
)
)
fig.update_layout(
{
"title": {
"text": "<b>Average Delivery Time per Month in Brazilian Olist E-Commerce company</b>",
"x": 0.5,
"y": 0.9,
"font": {
"size": 15
}
},
"xaxis": {
"title": "from Jan. 2017 to Aug. 2018",
"showticklabels":True,
"tick0": "2017-01-31",
"dtick": "M1",
"tickfont": {
"size": 7
}
},
"yaxis": {
"title": "Delivery Time (days)"
},
"template":'plotly_white'
}
)
fig.add_annotation(
x="2018-04-30",
y=10.4,
text="<b>Improved logistics</b>",
showarrow=True,
font=dict(
size=10,
color="#ffffff"
),
align="center",
arrowhead=2,
arrowsize=1,
arrowwidth=2,
arrowcolor="#77BDD9",
ax=40,
ay=-30,
bordercolor="#77BDD9",
borderwidth=2,
borderpad=4,
bgcolor="#F22E62",
opacity=0.8
)
fig.show()