## Jupyter notebook setup ##
from IPython.display import display, HTML
display(HTML("<style>.container {width:80% !important;}</style>"))
%matplotlib inline

import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')

mpl.rcParams['figure.figsize'] = (12,8)  # default figure size
mpl.rcParams['font.family'] = 'NanumGothic'  # default font
mpl.rcParams['font.size'] = 10    # default font size
plt.rcParams['axes.unicode_minus'] = False
%config InlineBackend.figure_format='retina' # sharper text in figures



df = pd.read_excel("Global Superstore.xlsx")
df.head(3)


df.shape

(50922, 26)


df = df.sample(10000, random_state = 1)


## Check the DataFrame size
df.shape

(10000, 26)
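
Passing a fixed `random_state` to `sample` makes the 10,000-row subset reproducible, and the sampled rows keep their original index labels (which is why the index later runs from 2551 to 20501 rather than from 0). A minimal sketch on a hypothetical `toy` frame, not the Superstore data:

```python
import pandas as pd

# Hypothetical stand-in for the full dataset.
toy = pd.DataFrame({"Sales": range(100)})

# The same random_state yields the same rows, with their original index labels.
sample_a = toy.sample(10, random_state=1)
sample_b = toy.sample(10, random_state=1)
same_rows = sample_a.equals(sample_b)
```

If a clean 0..n-1 index is preferred after sampling, `reset_index(drop=True)` can be chained onto the call.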


## Inspect column dtypes and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 2551 to 20501
Data columns (total 26 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Category            10000 non-null  object        
 1   City                10000 non-null  object        
 2   Country             10000 non-null  object        
 3   Customer ID         10000 non-null  object        
 4   Customer Name       10000 non-null  object        
 5   Market              10000 non-null  object        
 6   Market1             10000 non-null  object        
 7   Order Date          10000 non-null  datetime64[ns]
 8   Order ID            10000 non-null  object        
 9   Order Priority      10000 non-null  object        
 10  Product ID          10000 non-null  object        
 11  Product Name        10000 non-null  object        
 12  Region              10000 non-null  object        
 13  Row ID              10000 non-null  int64         
 14  Segment             10000 non-null  object        
 15  Ship Date           10000 non-null  datetime64[ns]
 16  Ship Mode           10000 non-null  object        
 17  State               10000 non-null  object        
 18  Sub-Category        10000 non-null  object        
 19  Discount            10000 non-null  float64       
 20  Order Date (Years)  10000 non-null  int64         
 21  Profit              10000 non-null  float64       
 22  Quantity            10000 non-null  int64         
 23  Sales               10000 non-null  float64       
 24  Shipping Cost       10000 non-null  float64       
 25  레코드 수               10000 non-null  int64         
dtypes: datetime64[ns](2), float64(4), int64(4), object(16)
memory usage: 2.1+ MB


## Check for missing values
df.isna().sum()

Category              0
City                  0
Country               0
Customer ID           0
Customer Name         0
Market                0
Market1               0
Order Date            0
Order ID              0
Order Priority        0
Product ID            0
Product Name          0
Region                0
Row ID                0
Segment               0
Ship Date             0
Ship Mode             0
State                 0
Sub-Category          0
Discount              0
Order Date (Years)    0
Profit                0
Quantity              0
Sales                 0
Shipping Cost         0
레코드 수                 0
dtype: int64


## Summary statistics
df.describe()


## Boxplots of Sales and Profit
plt.figure(figsize = (10, 6))
ax1 = plt.subplot(1,2,1)
ax1.boxplot(df['Sales'])
ax1.set_ylabel("Sales")
ax1.set_title("<Sales Boxplot>", fontsize = 15, fontweight = "semibold")

ax2 = plt.subplot(1,2,2)
ax2.boxplot(df['Profit'])
ax2.set_ylabel("Profit")
ax2.set_title("<Profit Boxplot>", fontsize = 15, fontweight = "semibold")
plt.show()


## Compute Q1, Q2, Q3, and IQR of Sales to identify outliers

q1 = df['Sales'].quantile(0.25)
q2 = df['Sales'].quantile(0.5)
q3 = df['Sales'].quantile(0.75)
iqr = q3 - q1

print("<Sales Q1, Q2, Q3, and IQR for outlier removal>")
print("Q1: {}\nQ2: {}\nQ3: {}\nIQR: {}".format(q1, q2, q3, iqr))
# Flag values more than 1.5 * IQR beyond the quartiles
condition = (df['Sales'] > (q3 + 1.5 * iqr)) | (df['Sales'] < (q1 - 1.5 * iqr))
print("=> Sales outliers to remove:", df[condition].shape[0])

<Sales Q1, Q2, Q3, and IQR for outlier removal>
Q1: 30.69
Q2: 86.934
Q3: 254.89499999999998
IQR: 224.20499999999998
=> Sales outliers to remove: 1101


## Remove the Sales outliers and store the result as df_store
sales_outlier = df[condition].index
df_store = df.drop(sales_outlier, axis = 0).copy()
print("Result of Sales outlier removal")
print("=> df rows: {}, df_store rows: {}".format(df.shape[0], df_store.shape[0]))

Result of Sales outlier removal
=> df rows: 10000, df_store rows: 8899


## Compute Q1, Q2, Q3, and IQR of Profit (on df_store, after the Sales outliers are removed)

q1 = df_store['Profit'].quantile(0.25)
q2 = df_store['Profit'].quantile(0.5)
q3 = df_store['Profit'].quantile(0.75)
iqr = q3 - q1

print("<Profit Q1, Q2, Q3, and IQR for outlier removal>")
print("Q1: {}\nQ2: {}\nQ3: {}\nIQR: {}".format(q1, q2, q3, iqr))
condition = (df_store['Profit'] > (q3 + 1.5 * iqr)) | (df_store['Profit'] < (q1 - 1.5 * iqr))
print("=> Profit outliers to remove:", df_store[condition].shape[0])

<Profit Q1, Q2, Q3, and IQR for outlier removal>
Q1: 0.0
Q2: 7.92
Q3: 27.869999999999997
IQR: 27.869999999999997
=> Profit outliers to remove: 1515


## Remove the Profit outliers from df_store
profit_outlier = df_store[condition].index
df_store = df_store.drop(profit_outlier, axis = 0).copy()
print("Result of Profit outlier removal")
print("=> df rows: {}, df_store rows after removal: {}".format(df.shape[0], df_store.shape[0]))

Result of Profit outlier removal
=> df rows: 10000, df_store rows after removal: 7384
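
The two removal steps repeat the same 1.5 × IQR rule, so the logic can be factored into one helper. The sketch below is an illustration on hypothetical toy data; `iqr_outlier_mask` is a name introduced here and is not part of the original notebook.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical toy data with one extreme value in each column.
toy = pd.DataFrame({"Sales": [10, 12, 11, 13, 12, 500],
                    "Profit": [1, 2, 1, 2, -90, 3]})

# Apply the rule sequentially: Sales first, then Profit on the surviving rows.
step1 = toy[~iqr_outlier_mask(toy["Sales"])]
step2 = step1[~iqr_outlier_mask(step1["Profit"])]
```

Because the Profit quartiles are computed on the rows that survive the Sales filter, the order of the two steps affects which rows are dropped.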


# Mean Sales after outlier removal
mean_sales = df_store['Sales'].mean()
print("Mean Sales:", mean_sales)

Mean Sales: 89.51504407638139


# Scatter plot of Sales vs. Profit
plt.figure(figsize= (14,6))
sns.scatterplot(x = df_store["Sales"], y = df_store["Profit"], color = "#E69F00", alpha = 0.8)
plt.axhline(y=0, color="#0072B2", lw = 1.5, ls = "--")
plt.axvline(x=mean_sales, color = "#0072B2", lw = 1.5, ls = "--")
plt.title("<Sales-Profit Scatterplot>", fontsize = 15, fontweight = "semibold")
plt.show()


## Check the distributions of Sales and Profit with histograms
plt.figure(figsize=(14,6))
ax1 = plt.subplot(1,2,1)
ax1 = sns.histplot(df_store['Sales'], color = "#009E73")
ax1.set_title("<Sales Histogram>", fontsize = 15, fontweight = "semibold")

ax2 = plt.subplot(1,2,2)
ax2 = sns.histplot(df_store['Profit'], color = "#009E73")
ax2.set_title("<Profit Histogram>", fontsize = 15, fontweight = "semibold")
plt.show()


## Unique values of Market and Market1
print(df_store['Market'].unique())
print(df_store['Market1'].unique())

['America' 'APAC' 'EMEA']
['US' 'APAC' 'EMEA' 'EU' 'Africa' 'LATAM' 'Canada']


## Relationship between Market and Market1 (mean Sales per pair)
pd.DataFrame(df_store.groupby(["Market", "Market1"])["Sales"].mean())


for market in df_store['Market1'].unique():
    print("Countries in {}: {}\n".format(market, df_store[df_store['Market1'] == market]["Country"].unique()))

Countries in US: ['United States']

Countries in APAC: ['Myanmar (Burma)' 'China' 'Australia' 'India' 'Indonesia' 'New Zealand'
 'Vietnam' 'Bangladesh' 'Singapore' 'South Korea' 'Philippines' 'Japan'
 'Thailand' 'Pakistan' 'Nepal' 'Malaysia' 'Cambodia' 'Afghanistan'
 'Papua New Guinea' 'Hong Kong' 'Sri Lanka' 'Taiwan']

Countries in EMEA: ['Romania' 'Turkey' 'Russia' 'Croatia' 'Iran' 'Czech Republic'
 'Saudi Arabia' 'Poland' 'Israel' 'Turkmenistan' 'Armenia' 'Iraq'
 'Kazakhstan' 'Belarus' 'Azerbaijan' 'Bulgaria' 'Bosnia and Herzegovina'
 'Hungary' 'Ukraine' 'Syria' 'Moldova' 'Macedonia' 'Uzbekistan'
 'Kyrgyzstan' 'Yemen' 'Jordan' 'Estonia' 'Albania' 'Slovakia' 'Slovenia'
 'Lithuania' 'Georgia' 'Lebanon' 'United Arab Emirates']

Countries in EU: ['Norway' 'France' 'Italy' 'Germany' 'United Kingdom' 'Belgium' 'Spain'
 'Ireland' 'Portugal' 'Denmark' 'Netherlands' 'Sweden' 'Switzerland'
 'Finland']

Countries in Africa: ['Nigeria' 'Morocco' 'Senegal' 'Algeria' 'Egypt' 'Kenya' 'Sudan'
 'Tanzania' "Cote d'Ivoire" 'Niger' 'Zambia' 'South Africa' 'Cameroon'
 'Libya' 'Angola' 'Mali' 'Democratic Republic of the Congo' 'Chad'
 'Mozambique' 'Madagascar' 'Benin' 'Central African Republic' 'Ghana'
 'Zimbabwe' 'Togo' 'Djibouti' 'Rwanda' 'Liberia' 'Tunisia' 'Somalia'
 'Uganda' 'Guinea' 'Gabon' 'Guinea-Bissau' 'Burundi' 'Swaziland'
 'Ethiopia' 'Mauritania' 'Equatorial Guinea' 'Namibia'
 'Republic of the Congo']

Countries in LATAM: ['Dominican Republic' 'Guatemala' 'Brazil' 'Cuba' 'Mexico' 'Honduras'
 'Argentina' 'Nicaragua' 'Peru' 'El Salvador' 'Panama' 'Colombia'
 'Trinidad and Tobago' 'Haiti' 'Chile' 'Bolivia' 'Venezuela' 'Barbados'
 'Ecuador' 'Jamaica' 'Uruguay' 'Paraguay' 'Martinique' 'Guadeloupe']

Countries in Canada: ['Canada']
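
Market1 appears to be a finer subdivision of Market. One way to confirm that kind of nesting is to count the distinct parent values per child value. This is a minimal sketch on a hypothetical `toy` frame whose pairs mirror the seven Market/Market1 combinations in this dataset; on the real data the same two lines would run against `df_store`.

```python
import pandas as pd

# Hypothetical frame: one row per Market/Market1 pair.
toy = pd.DataFrame({
    "Market":  ["America", "America", "America", "APAC", "EMEA", "EMEA", "EMEA"],
    "Market1": ["US", "LATAM", "Canada", "APAC", "EMEA", "EU", "Africa"],
})

# Distinct parent markets per Market1 value; 1 everywhere means clean nesting.
parents_per_submarket = toy.groupby("Market1")["Market"].nunique()
is_nested = bool((parents_per_submarket == 1).all())
```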


# Total Sales and Profit per Market1, ordered by Sales so both series share one x-axis order
df_market = df_store.groupby(["Market1"])[["Sales", "Profit"]].sum().sort_values("Sales", ascending = False)
fig, ax1 = plt.subplots(figsize=(10,5))
df_market["Sales"].plot.bar(ax = ax1, rot = 0, label = "Sales")
ax1.grid(False)
ax1.set_xlabel('Target Market', fontsize = 10, fontweight = "normal")
ax1.set_ylabel('Sales / dollars($)',fontsize = 10, fontweight = "normal")
ax1.legend(prop={'size': 12}, title = 'Indicator', loc = "upper right")

# Profit on a secondary y-axis, in the same market order as the bars
ax2 = ax1.twinx()
df_market["Profit"].plot.line(ax = ax2, rot = 0, color = "orange", label = "Profit")
ax2.grid(False)
ax2.set_ylabel('Profit / dollars($)',fontsize = 10, fontweight = "normal")
ax2.legend(prop={'size': 12}, title = 'Indicator', loc = "center right")

plt.title('<Sales & Profit Order by Market>', fontsize = 15, fontweight = "semibold")
plt.show()


## Sales and Profit by country
# Group by country and store the mean Sales and Profit as df_country
# df_country = df_store.groupby("Country")[["Sales", "Profit"]].agg(["mean"])
df_country = df_store.groupby(["Country"])[["Sales", "Profit"]].mean().copy()
df_country.sort_values(by = "Sales", ascending = False)


## Mean Sales and Profit per (Market1, Country), with the index reset into columns, as df_market_country
df_market_country = df_store.groupby(["Market1", "Country"])[["Sales", "Profit"]].mean().copy()
df_market_country = df_market_country.reset_index()

## Country-level mean Sales vs. mean Profit (colored by Market1, sized by Sales)
plt.figure(figsize = (14, 6))
sns.scatterplot(x = df_market_country.Sales, 
                y = df_market_country.Profit, 
                hue = df_market_country.Market1, 
                size = df_market_country.Sales,
                sizes = (50, 400), 
                alpha = 0.5)

plt.title('<Sales & Profit Order by Country (mean per record)>', fontsize = 15, fontweight = "semibold")
plt.xlabel('Sales / dollars($)', fontsize = 10, fontweight = "normal")
plt.ylabel('Profit / dollars($)',fontsize = 10, fontweight = "normal")
plt.legend(bbox_to_anchor=(1.02, 0.85), prop={'size': 12})
plt.grid(True)
plt.tight_layout()
plt.show()


# Top 10 countries by mean Sales
df_country.sort_values(["Sales", "Profit"], ascending = False).head(10)


# Top 10 countries by mean Profit
df_country.sort_values(["Profit", "Sales"], ascending = False).head(10)


# df.columns


## Keep only the columns used in this analysis and store them as df_discount
df_discount = df_store[["Sales", "Profit", "Discount", "Order Priority", "Quantity", "Shipping Cost"]].copy()
df_discount.head()


## Correlation coefficients between the numeric columns of df_discount
plt.figure(figsize=(10,5))
sns.heatmap(df_discount.corr(numeric_only = True),
            annot = True, 
            cmap = "RdBu_r", 
            annot_kws={"size": 13, 
                       "weight": "semibold"})

plt.title('<Correlation Coefficient between each Column>', fontsize = 15, fontweight = "semibold")
plt.show()
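
`Order Priority` is a text column, so it cannot take part in a Pearson correlation. In pandas 2.x, calling `corr()` on a frame that contains such a column raises an error unless non-numeric columns are excluded. A minimal sketch on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical frame mixing numeric columns with one text column.
toy = pd.DataFrame({
    "Sales": [10.0, 20.0, 30.0, 40.0],
    "Discount": [0.4, 0.3, 0.2, 0.1],
    "Order Priority": ["High", "Medium", "Medium", "High"],
})

# numeric_only=True drops the text column instead of trying to convert it.
corr = toy.corr(numeric_only=True)
```

Selecting the numeric columns explicitly before calling `corr()` is an equivalent alternative.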


plt.figure(figsize=(15,5))
ax1 = plt.subplot(1,3,1)
ax1 = sns.regplot(x = df_discount['Discount'], y = df_discount['Sales'])
ax1.set_title("<Discount-Sales Regplot>", fontsize = 15, fontweight = "semibold")

ax2 = plt.subplot(1,3,2)
ax2 = sns.regplot(x = df_discount['Discount'], y = df_discount['Profit'])
ax2.set_title("<Discount-Profit Regplot>", fontsize = 15, fontweight = "semibold")

ax3 = plt.subplot(1,3,3)
ax3 = sns.regplot(x = df_discount['Shipping Cost'], y = df_discount['Sales'])
ax3.set_title("<Shipping Cost-Sales Regplot>", fontsize = 15, fontweight = "semibold")
plt.show()


# Sales vs. Profit, colored by Discount
plt.figure(figsize =(18, 8))
sns.scatterplot(x = df_discount['Sales'], y = df_discount['Profit'], hue= df_discount['Discount'])

plt.title('<Sales & Profit Distribution by Discount>', fontsize = 20, fontweight = "semibold")
plt.xlabel('Sales / dollars($)', fontsize = 14, fontweight = "normal")
plt.ylabel('Profit / dollars($)',fontsize = 14, fontweight = "normal")
plt.grid(True)
plt.tight_layout()
plt.show()


# df_store.columns


df_store["Market1"].unique()

array(['US', 'APAC', 'EMEA', 'EU', 'Africa', 'LATAM', 'Canada'],
      dtype=object)


df_product = df_store[["Category", "Market1", "Country","Segment","Sub-Category", "Profit", "Sales", "Quantity"]].copy()
df_product.head()


## Cross-tabulate Market1 against Sub-Category
## Cell values are the total order Quantity, normalized so that each Market1 row sums to 1
df_ct = pd.crosstab(index = df_product["Market1"],
                    columns = df_product["Sub-Category"],
                    values = df_product["Quantity"],
                    aggfunc = "sum", 
                    normalize = "index")
df_ct = df_ct * 100
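
With `normalize = "index"`, each row of the crosstab is divided by its own total, so every market's percentages sum to 100 after scaling. A small sketch on a hypothetical `toy` frame makes the arithmetic easy to check by hand:

```python
import pandas as pd

# Hypothetical orders: US buys 3 units of Art and 1 of Paper; EU buys 4 of each.
toy = pd.DataFrame({
    "Market1": ["US", "US", "US", "EU", "EU"],
    "Sub-Category": ["Art", "Paper", "Art", "Art", "Paper"],
    "Quantity": [1, 1, 2, 4, 4],
})

ct = pd.crosstab(index=toy["Market1"],
                 columns=toy["Sub-Category"],
                 values=toy["Quantity"],
                 aggfunc="sum",
                 normalize="index") * 100

# Each row should add up to 100 percent.
row_totals = ct.sum(axis=1)
```

Here the US row comes out as 75% Art and 25% Paper, and the EU row as 50% each.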


plt.figure(figsize = (20, 8))
plt.title('<Market - Sub Category Heatmap(%)>', fontsize = 20, fontweight = "semibold")
sns.heatmap(df_ct,
            linecolor = "white",
            square = True, 
            cmap = "Reds", 
            annot = True, 
            fmt = ".1f", 
            annot_kws={"size": 13, 
                       "weight": "semibold"})
plt.show()


## Find the date-related columns
date_list = list(df_store.columns[df_store.columns.str.contains("Date")])
date_list

['Order Date', 'Ship Date', 'Order Date (Years)']


## Years covered by Order Date
year_list = df_store["Order Date (Years)"].unique()
print(year_list)
## Convert Order Date (Years) to string
df_store["Order Date (Years)"] = df_store["Order Date (Years)"].astype(str)

[2011 2014 2013 2012]


df_store.groupby("Order Date (Years)")[["Sales", "Profit"]].sum().plot.bar(figsize = (11,5))
plt.grid(False)
plt.xlabel('Year', fontsize = 10, fontweight = "normal")
# groupby sorts the years, so keep its own tick labels (year_list is in order of appearance, not sorted)
plt.xticks(rotation = 0)
plt.ylabel('dollars($)',fontsize = 10, fontweight = "normal")
plt.legend(prop={'size': 12}, title = 'Indicator', loc = "upper left")
plt.title('<Sales & Profit by Year>', fontsize = 15, fontweight = "semibold")
plt.tight_layout()
plt.show()


## For a finer-grained time analysis, collect the needed columns in df_date
new_col_list = ["Order Date","Order Date (Years)", "Ship Date", "Category", "Market1", "Country","Segment",
                "Sub-Category", "Profit", "Sales", "Quantity"]
df_date = df_store[new_col_list].copy()
df_date.head()


## Extract the zero-padded month from Order Date as Order Date (Month)
df_date["Order Date (Month)"] = df_date["Order Date"].dt.strftime("%m")

## Extract the zero-padded day from Order Date as Order Date (Day)
df_date["Order Date (Day)"] = df_date["Order Date"].dt.strftime("%d")


df_date.sample()


month_list = list(df_date["Order Date (Month)"].unique())
month_list.sort()
day_list = list(df_date["Order Date (Day)"].unique())
day_list.sort()


fig, axes = plt.subplots(1, 2, figsize=(16,5))
sns.countplot(ax=axes[0], x=df_date["Order Date (Month)"], 
              order = month_list, 
              #hue = df_date["Order Date (Years)"],
              color = "#f4a261")
axes[0].set_title("[Order Quantity by Month]", fontsize = 15, fontweight = "semibold")
sns.countplot(ax=axes[1], 
              x=df_date["Order Date (Day)"],
              order = day_list, 
              #hue = df_date["Order Date (Years)"],
              color = "#2a9d8f")
axes[1].set_title("[Order Quantity by Day]", fontsize = 15, fontweight = "semibold")

plt.show()


plt.figure(figsize=(10, 4))
df_pivot = df_date.pivot_table(index="Order Date (Month)", columns = "Order Date (Day)",
                               values = "Sales", aggfunc = "sum")
sns.heatmap(df_pivot, annot = False, cmap = "RdBu_r", square = False, fmt = ".0f")
plt.title('<Sales by Date>', fontsize = 15, fontweight = "semibold")
plt.tight_layout()
plt.show()


plt.figure(figsize=(10, 4))
df_pivot = df_date.pivot_table(index="Order Date (Month)", columns = "Order Date (Day)",
                               values = "Order Date", aggfunc = "count")
sns.heatmap(df_pivot, annot = True, cmap = "RdBu_r", square = False, fmt = ".0f")
plt.title('<Order Quantity by Date>', fontsize = 15, fontweight = "semibold")
plt.tight_layout()
plt.show()


## Sanity-check the heatmap counts using the December 1 records
condition_1201 = (df_date["Order Date (Month)"] == "12") & (df_date["Order Date (Day)"] == "01")
df_date[condition_1201].shape[0]

19
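
The same kind of spot check can be written so that the pivot cell and the boolean-mask count are compared directly. This is a sketch on a hypothetical `toy` frame; the short column names `Month` and `Day` are placeholders for the notebook's `Order Date (Month)` and `Order Date (Day)`.

```python
import pandas as pd

# Hypothetical orders: two on December 1 (different years), two on other dates.
toy = pd.DataFrame({
    "Order Date": pd.to_datetime(["2011-12-01", "2012-12-01", "2013-12-02", "2014-01-15"]),
})
toy["Month"] = toy["Order Date"].dt.strftime("%m")
toy["Day"] = toy["Order Date"].dt.strftime("%d")

# Count of orders per (month, day) cell, as in the heatmap.
pivot = toy.pivot_table(index="Month", columns="Day", values="Order Date", aggfunc="count")

# Independent count of the same cell via a boolean mask.
mask = (toy["Month"] == "12") & (toy["Day"] == "01")
cell_count = int(pivot.loc["12", "01"])
mask_count = int(mask.sum())
```

Both counts pool all years, since the pivot groups by month and day only.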


plt.figure(figsize=(14,4))
ax1 = plt.subplot(1,3,1)
ax1 = df_store.groupby("Ship Mode")["Shipping Cost"].mean().plot.barh(color = "#277da1")
ax1.set_title("<Ship Mode - Shipping Cost>", fontsize = 15, fontweight = "semibold")
ax1.set_xlabel("Shipping Cost($)")

ax2 = plt.subplot(1,3,2)
ax2 = df_store.groupby("Ship Mode")["Sales"].mean().plot.barh(color = "#656d4a")
ax2.set_title("<Ship Mode - Sales>", fontsize = 15, fontweight = "semibold")
ax2.set_xlabel("Sales($)")

ax3 = plt.subplot(1,3,3)
ax3 = df_store.groupby("Ship Mode")["Profit"].mean().plot.barh(color = "#faa307")
ax3.set_title("<Ship Mode - Profit>", fontsize = 15, fontweight = "semibold")
ax3.set_xlabel("Profit($)")

plt.tight_layout()
plt.show()


plt.figure(figsize=(8,4))
ax1 = plt.subplot()
ax1 = df_store.groupby("Ship Mode")["Sales"].mean().plot.bar(color = "#656d4a")
ax1.set_title("<Sales & Profit by Ship Mode>", fontsize = 15, fontweight = "semibold")
ax1.set_xticklabels(labels = ax1.get_xticklabels(), rotation=0)
ax1.grid(False)

ax2 = ax1.twinx()
ax2 = df_store.groupby("Ship Mode")["Profit"].mean().plot.line(color = "#faa307", linewidth = 3)
ax2.grid(False)
plt.tight_layout()
plt.show()


## Shipping period in whole days between Order Date and Ship Date
df_store["Shipping_period"] = (df_store["Ship Date"] - df_store["Order Date"]).dt.days


# df_store["Shipping_period (days)"]


plt.figure(figsize = (8,4))
sns.countplot(x = df_store["Shipping_period"], color = "#afa2ff")
plt.show()


## Mean shipping period by Ship Mode
df_store.groupby("Ship Mode")["Shipping_period"].mean().sort_values()

Ship Mode
Same Day          0.041975
First Class       2.177465
Second Class      3.248509
Standard Class    4.998638
Name: Shipping_period, dtype: float64
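
A mean alone hides the spread within each ship mode. Passing a list of statistics to `agg` gives the range alongside the mean in one table. This is a minimal sketch on hypothetical `toy` orders, not the Superstore data:

```python
import pandas as pd

# Hypothetical orders, all placed on the same day and shipped 0 to 5 days later.
toy = pd.DataFrame({
    "Ship Mode": ["Same Day", "Same Day", "First Class", "First Class", "Standard Class"],
    "Order Date": pd.to_datetime(["2014-01-01"] * 5),
    "Ship Date": pd.to_datetime(["2014-01-01", "2014-01-02", "2014-01-02",
                                 "2014-01-04", "2014-01-06"]),
})
# Subtracting two datetime columns gives timedeltas; .dt.days extracts whole days.
toy["Shipping_period"] = (toy["Ship Date"] - toy["Order Date"]).dt.days

# Mean, minimum, and maximum shipping period per ship mode.
summary = toy.groupby("Ship Mode")["Shipping_period"].agg(["mean", "min", "max"])
```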


pd.crosstab(columns = df_store["Shipping_period"], index = df_store["Ship Mode"])


plt.figure(figsize=(8,5))
sns.boxplot(x = df_store["Ship Mode"], y = df_store["Shipping_period"], palette= "pastel")
plt.show()


# df_store.columns


## Aggregate purchases per customer (mean Sales, Profit, Quantity) as df_top_customer
df_top_customer = df_store.groupby(["Customer Name"])[["Sales", "Profit", "Quantity"]].mean().copy()
df_top_customer = df_top_customer.sort_values(["Sales"], ascending = False).reset_index()

## Plot the customers on Sales and Profit axes
plt.figure(figsize =(14,6))
sns.scatterplot(x = df_top_customer["Sales"],
                y = df_top_customer["Profit"], 
                size = df_top_customer["Quantity"], 
                sizes=(10, 200),
                alpha = 0.5,
                hue = df_top_customer["Quantity"])
plt.title("<Customer Sales-Profit Scatterplot>", fontsize = 15, fontweight = "semibold")
plt.show()


## Compare the mean and median of Sales
print("Mean Sales:", df_top_customer["Sales"].mean())
print("Median Sales:", df_top_customer["Sales"].median())

## The mean and median are close, so the mean is used as the segmentation threshold
mean_sales = np.mean(df_top_customer["Sales"])

Mean Sales: 89.60496047224981
Median Sales: 86.0604


## Define functions that split customers into four segments, using mean Sales and zero Profit as thresholds
def sales_segmentation(sales, mean_sales):
    if (sales >= mean_sales):
        return "High Sales"
    elif (sales < mean_sales):
        return "Low Sales"
    
def profit_segmentation(profit):
    if (profit >= 0):
        return "High Profit"
    elif (profit < 0):
        return "Low Profit"
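
The same thresholds can also be applied to whole columns at once with `np.where`, which avoids a per-row `apply`. This is an alternative sketch on a hypothetical `toy` frame, not the notebook's own method. Unlike the functions above, which return `None` for a missing value, `np.where` would label a NaN as the "Low" segment.

```python
import numpy as np
import pandas as pd

# Hypothetical per-customer means.
toy = pd.DataFrame({"Sales": [50.0, 120.0, 90.0, 200.0],
                    "Profit": [5.0, -3.0, 0.0, 40.0]})
mean_sales = toy["Sales"].mean()  # 115.0 for this toy data

# Vectorized thresholds: mean Sales on one axis, zero Profit on the other.
toy["Sales_Segment"] = np.where(toy["Sales"] >= mean_sales, "High Sales", "Low Sales")
toy["Profit_Segment"] = np.where(toy["Profit"] >= 0, "High Profit", "Low Profit")
toy["Customer_Segment"] = toy["Sales_Segment"] + " & " + toy["Profit_Segment"]
```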


## Apply the functions above to create the Sales_Segment and Profit_Segment columns
df_top_customer["Sales_Segment"] = df_top_customer["Sales"].apply(lambda x: sales_segmentation(x,mean_sales))
df_top_customer["Profit_Segment"] = df_top_customer["Profit"].apply(lambda x: profit_segmentation(x))


## Combine Sales_Segment and Profit_Segment into a single Customer_Segment column
df_top_customer["Customer_Segment"] = df_top_customer["Sales_Segment"] + " & " + df_top_customer["Profit_Segment"]
df_top_customer.sample(3)


## Distribution of all customers across Customer_Segment
print("[Customers per Customer Segment]")
print(df_top_customer["Customer_Segment"].value_counts())

[Customers per Customer Segment]
Low Sales & High Profit     401
High Sales & High Profit    353
Low Sales & Low Profit       31
High Sales & Low Profit      10
Name: Customer_Segment, dtype: int64


## Plot the customers grouped by Customer_Segment
plt.figure(figsize=(14, 6))
sns.scatterplot(data = df_top_customer,
                x = "Sales",
                y = "Profit",
                hue = "Customer_Segment", 
                style = "Customer_Segment")
plt.axhline(y=0, color="#e63946", lw = 1.5, ls = "--")
plt.axvline(x=mean_sales, color = "#e63946", lw = 1.5, ls = "--")
plt.title("<Customer Segmentation>", fontsize = 15, fontweight = "semibold")
plt.show()


plt.figure(figsize=(20,6))
ax1 = plt.subplot(1,2,1)
ax1 = sns.scatterplot(x = df_top_customer["Sales"], y = df_top_customer["Profit"])
ax1.set_title("<Before Customer Segmentation>", fontsize = 15, fontweight = "semibold")

ax2 = plt.subplot(1,2,2)
ax2 = sns.scatterplot(data = df_top_customer,
                      x = "Sales",
                      y = "Profit",
                      hue = "Customer_Segment", 
                      style = "Customer_Segment")
ax2.set_title("<After Customer Segmentation>", fontsize = 15, fontweight = "semibold")
plt.show()

	Row ID	Discount	Order Date (Years)	Profit	Quantity	Sales	Shipping Cost	레코드 수
count	10000.000000	10000.000000	10000.000000	10000.000000	10000.000000	10000.000000	10000.000000	10000.0
mean	25466.963100	0.143306	2012.776700	28.956048	3.502500	249.311964	27.155403	1.0
std	14787.036586	0.211723	1.099708	182.658114	2.284976	495.516087	59.248525	0.0
min	1.000000	0.000000	2011.000000	-3839.990400	1.000000	0.898000	0.020000	1.0
25%	12588.000000	0.000000	2012.000000	0.000000	2.000000	30.690000	2.680000	1.0
50%	25289.500000	0.000000	2013.000000	9.418300	3.000000	86.934000	7.885000	1.0
75%	38217.250000	0.200000	2014.000000	37.867250	5.000000	254.895000	24.730000	1.0
max	51286.000000	0.800000	2014.000000	8399.976000	14.000000	17499.950000	865.740000	1.0

	Sales	Profit
Country
Sri Lanka	445.500000	4.320000
Martinique	192.570000	12.625000
Estonia	187.140000	28.510000
Gabon	184.215000	23.835000
Central African Republic	176.220000	45.810000
...	...	...
Turkmenistan	11.256000	-17.294000
Nigeria	11.206573	-14.250029
Kazakhstan	11.177308	-14.315769
Zimbabwe	11.049750	-16.230250
Uganda	3.438000	-5.052000

	Sales	Profit
Country
Sri Lanka	445.50000	4.320000
Martinique	192.57000	12.625000
Estonia	187.14000	28.510000
Gabon	184.21500	23.835000
Central African Republic	176.22000	45.810000
Libya	170.53000	22.146667
Republic of the Congo	166.14000	21.570000
Afghanistan	159.93000	23.044286
Slovenia	153.42000	50.580000
Singapore	145.55375	24.415000

	Sales	Profit
Country
Guadeloupe	130.2400	52.0800
Slovenia	153.4200	50.5800
Central African Republic	176.2200	45.8100
Hong Kong	100.9200	45.3600
Albania	105.7200	43.3200
Taiwan	121.6800	40.7400
Jamaica	110.0250	36.6850
Zambia	134.2275	34.5405
Nepal	75.8400	30.2100
Djibouti	136.0700	29.3500

	Sales	Profit	Discount	Order Priority	Quantity	Shipping Cost
2551	129.888	12.9888	0.20	High	6	24.59
27516	49.932	-0.7080	0.27	Medium	4	1.35
18899	60.540	20.5800	0.00	Medium	2	4.02
21523	38.430	13.4100	0.00	Medium	3	2.27
46837	86.376	30.2316	0.20	Medium	3	5.64

1. Library imports

2. Loading the data

3. Data summary

4. EDA

4.1 Outlier removal

4.2 Sales & Profit distribution

4.3 Sales & Profit by Market

4.4 Sales & Profit by Discount

4.5 Popular Categories by Market

4.6 Sales, Profit & Quantity by Order Date

4.7 Sales and Profit by Ship Mode

4.7.1 Shipping period by Ship Mode

5. Customer segmentation

5.1 Customer segmentation by Sales and Profit

	Category	City	Country	Customer ID	Customer Name	Market	Market1	Order Date	Order ID	Order Priority	...	Ship Mode	State	Sub-Category	Discount	Order Date (Years)	Profit	Quantity	Sales	Shipping Cost	레코드 수
0	Office Supplies	Budapest	Hungary	AT-7352	Annie Thurman	EMEA	EMEA	2011-01-01	HU-2011-1220	High	...	Second Class	Budapest	Storage	0.0	2011	29.640	4	66.120	8.17	1
1	Office Supplies	Stockholm	Sweden	EM-141402	Eugene Moren	EMEA	EU	2011-01-01	IT-2011-3647632	High	...	Second Class	Stockholm	Paper	0.5	2011	-26.055	3	44.865	4.82	1
2	Office Supplies	Constantine	Algeria	TB-112801	Toby Braunhardt	EMEA	Africa	2011-01-01	AG-2011-2040	Medium	...	Standard Class	Constantine	Storage	0.0	2011	106.140	2	408.300	35.46	1

		Sales
Market	Market1
APAC	APAC	105.033527
America	Canada	64.836957
	LATAM	89.282776
	US	74.022520
EMEA	Africa	71.745307
	EMEA	66.741068
	EU	111.927560

	Category	Market1	Country	Segment	Sub-Category	Profit	Sales	Quantity
2551	Furniture	US	United States	Consumer	Furnishings	12.9888	129.888	6
27516	Furniture	APAC	Myanmar (Burma)	Corporate	Furnishings	-0.7080	49.932	4
18899	Office Supplies	EMEA	Romania	Consumer	Art	20.5800	60.540	2
21523	Office Supplies	EU	Norway	Consumer	Labels	13.4100	38.430	3
46837	Office Supplies	US	United States	Home Office	Binders	30.2316	86.376	3

	Order Date	Order Date (Years)	Ship Date	Category	Market1	Country	Segment	Sub-Category	Profit	Sales	Quantity
2551	2011-07-28	2011	2011-07-28	Furniture	US	United States	Consumer	Furnishings	12.9888	129.888	6
27516	2014-10-21	2014	2014-10-26	Furniture	APAC	Myanmar (Burma)	Corporate	Furnishings	-0.7080	49.932	4
18899	2014-04-28	2014	2014-05-03	Office Supplies	EMEA	Romania	Consumer	Art	20.5800	60.540	2
21523	2014-10-30	2014	2014-11-06	Office Supplies	EU	Norway	Consumer	Labels	13.4100	38.430	3
46837	2014-07-09	2014	2014-07-13	Office Supplies	US	United States	Home Office	Binders	30.2316	86.376	3

Shipping_period	0	1	2	3	4	5	6	7
Ship Mode
First Class	0	220	436	409	0	0	0	0
Same Day	388	17	0	0	0	0	0	0
Second Class	0	0	596	285	285	343	0	0
Standard Class	0	0	0	0	1779	1287	905	434

	Customer Name	Sales	Profit	Quantity	Sales_Segment	Profit_Segment	Customer_Segment
399	Denny Joy	85.947600	7.391280	3.0	Low Sales	High Profit	Low Sales & High Profit
246	Gary Hwang	100.086143	16.451514	3.5	High Sales	High Profit	High Sales & High Profit
237	Brad Eason	101.535360	11.802230	2.9	High Sales	High Profit	High Sales & High Profit