{ "cells": [ { "cell_type": "markdown", "metadata": { "_cell_guid": "02adde30-2633-41f1-a672-af8ff83a1b02", "_uuid": "22754d0cc0847be93bf947c7998d7fb65a2817d7" }, "source": [ "**Competition objective:**\n", "\n", "The competition dataset contains text from works of fiction written by spooky authors in the public domain:\n", " 1. Edgar Allan Poe (EAP)\n", " 2. HP Lovecraft (HPL)\n", " 3. Mary Wollstonecraft Shelley (MWS)\n", " \n", "The goal is to correctly identify the author of each sentence in the test set.\n", "\n", "**Objective of the notebook:**\n", "\n", "In this notebook, let us create a variety of features that help identify the spooky authors.\n", "\n", "As a first step, we will do some basic data visualization and cleaning before diving into the feature engineering part." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Summary\n", "\n", "1. Create meta features -> xgb -> np.mean(metrics.log_loss(actual, predicted))\n", "2. TfidfVectorizer -> MultinomialNB -> confusion_matrix -> np.mean(metrics.log_loss(actual, predicted))\n", "3. TfidfVectorizer -> TruncatedSVD -> np.mean(metrics.log_loss(actual, predicted))\n", "4. CountVectorizer(stop_words='english', ngram_range=(1,3)) -> MultinomialNB -> confusion_matrix -> np.mean(metrics.log_loss(actual, predicted))\n", "5. CountVectorizer(ngram_range=(1,7), analyzer='char')\n", "6. TfidfVectorizer(ngram_range=(1,5), analyzer='char') -> TruncatedSVD(n_components=n_comp, algorithm='arpack') -> pd.concat([train_df, train_svd], axis=1) -> predictions from the earlier steps -> xgb -> xgb.plot_importance -> np.mean(metrics.log_loss(actual, predicted))\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "_cell_guid": "b31f62fb-bde8-410e-972c-3c092f22d497", "_uuid": "0fcdf81ce439d2215892af58f839edfc0ca80a91" }, "outputs": [], "source": [ "import numpy as np # linear algebra\n", "import pandas as pd # data processing, CSV file I/O (e.g. 
pd.read_csv)\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "import nltk\n", "from nltk.corpus import stopwords\n", "import string # string constants, e.g. string.punctuation\n", "\n", "import xgboost as xgb\n", "\n", "from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer\n", "from sklearn.decomposition import TruncatedSVD\n", "from sklearn import ensemble, metrics, model_selection, naive_bayes\n", "color = sns.color_palette()\n", "\n", "%matplotlib inline\n", "\n", "eng_stopwords = set(stopwords.words(\"english\"))\n", "pd.options.mode.chained_assignment = None" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "_cell_guid": "add1b71c-e802-408f-8f62-7ea71ed155cf", "_uuid": "ae86f9515b3be4956e223db4c2472c2c0c40d9fb" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of rows in train dataset : 19579\n", "Number of rows in test dataset : 8392\n" ] } ], "source": [ "## Read the train and test dataset and check the top few lines ##\n", "train_df = pd.read_csv(\"./input/train.csv\")\n", "test_df = pd.read_csv(\"./input/test.csv\")\n", "print(\"Number of rows in train dataset : \",train_df.shape[0])\n", "print(\"Number of rows in test dataset : \",test_df.shape[0])" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "_cell_guid": "35e22f81-77f9-438e-ba50-803aea15ad14", "_uuid": "83a9dfc6af30753f82f07ef6162d3b9ba155b06d" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id text author\n", "0 id26305 This process, however, afforded me no means of... EAP\n", "1 id17569 It never once occurred to me that the fumbling... HPL\n", "2 id11008 In his left hand was a gold snuff box, from wh... EAP\n", "3 id27763 How lovely is spring As we looked from Windsor... MWS\n", "4 id12958 Finding nothing else, not even gold, the Super... HPL" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df.head()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "5bee9e9a-5fbf-4ee6-9e92-0c77821fe77d", "_uuid": "d7682e0812990e3409db6076e7570ce984e466a1" }, "source": [ "각 저자의 출현 횟수를 확인하여 클래스가 균형을 이루고 있는지 확인할 수 있습니다." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "_cell_guid": "95529569-3380-4794-8248-caf5e8c67f6a", "_uuid": "045fe5712b26a72855dad61f05c947ab829a295b" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\HOME\\anaconda3\\lib\\site-packages\\seaborn\\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.\n", " warnings.warn(\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "cnt_srs = train_df['author'].value_counts()\n", "\n", "plt.figure(figsize=(8,4))\n", "sns.barplot(cnt_srs.index, cnt_srs.values, alpha=0.8)\n", "plt.ylabel('Number of Occurrences', fontsize=12)\n", "plt.xlabel('Author Name', fontsize=12)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "bff5de3a-b92f-4192-b142-60d0367e1bb2", "_uuid": "7b1ddc8bf782ae87ed5c24fdb6edbff986255d60" }, "source": [ "이것은 좋아 보인다. 계급 불균형이 별로 없다. 가능하면 각 저자의 글 스타일을 이해하고 이해하기 위해 각 저자의 몇 줄을 인쇄해 보겠습니다." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "_cell_guid": "6c85f84f-d0ee-444e-8b5f-2bf62452624a", "_uuid": "cc6b841404940a0f84481c76f5c2328b373416e1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Author name : EAP\n", "This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.\n", "In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.\n", "The astronomer, perhaps, at this point, took refuge in the suggestion of non luminosity; and here analogy was suddenly let fall.\n", "The surcingle hung in ribands from my body.\n", "I knew that you could not say to yourself 'stereotomy' without being brought to think of atomies, and thus of the theories of Epicurus; and since, when we discussed this subject not very long ago, I mentioned to you how singularly, yet with how little notice, the vague guesses of that noble Greek had met with confirmation in the late nebular cosmogony, I felt that you could not avoid casting your eyes upward to the great nebula in Orion, and I certainly expected that you would do so.\n", "\n", "\n", 
"Author name : HPL\n", "It never once occurred to me that the fumbling might be a mere mistake.\n", "Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.\n", "Herbert West needed fresh bodies because his life work was the reanimation of the dead.\n", "The farm like grounds extended back very deeply up the hill, almost to Wheaton Street.\n", "His facial aspect, too, was remarkable for its maturity; for though he shared his mother's and grandfather's chinlessness, his firm and precociously shaped nose united with the expression of his large, dark, almost Latin eyes to give him an air of quasi adulthood and well nigh preternatural intelligence.\n", "\n", "\n", "Author name : MWS\n", "How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.\n", "A youth passed in solitude, my best years spent under your gentle and feminine fosterage, has so refined the groundwork of my character that I cannot overcome an intense distaste to the usual brutality exercised on board ship: I have never believed it to be necessary, and when I heard of a mariner equally noted for his kindliness of heart and the respect and obedience paid to him by his crew, I felt myself peculiarly fortunate in being able to secure his services.\n", "I confess that neither the structure of languages, nor the code of governments, nor the politics of various states possessed attractions for me.\n", "He shall find that I can feel my injuries; he shall learn to dread my revenge\" A few days after he arrived.\n", "He had escaped me, and I must commence a destructive and almost endless journey across the mountainous ices of the ocean, amidst cold that few of the inhabitants could long endure and which I, the native of a genial and sunny climate, 
could not hope to survive.\n", "\n", "\n" ] } ], "source": [ "grouped_df = train_df.groupby('author')\n", "for name, group in grouped_df:\n", "    print(\"Author name : \", name)\n", "    cnt = 0\n", "    for ind, row in group.iterrows():\n", "        print(row[\"text\"])\n", "        cnt += 1\n", "        if cnt == 5:\n", "            break\n", "    print(\"\\n\")" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "84398b75-325f-4259-90ea-bb6293cc5235", "_uuid": "d5739c4a5967d2bcb9326d39529f84791c09dfb8" }, "source": [ "The only thing I can see is that there are quite a few special characters in the text data. So the count of these special characters might be a good feature. We can probably create it later.\n", "\n", "Other than that, I do not have many clues. If you find any interesting stylistic patterns (features we could create), please add them in the comments.\n", "\n", "**Feature Engineering:**\n", "\n", "Now let us try some feature engineering. This consists of two main parts:\n", "\n", " 1. Meta features - features extracted from the text, such as the number of words, number of stop words, number of punctuation marks, etc.\n", " 2. Text-based features - features based directly on the text / words, such as frequency, svd, word2vec, etc.\n", "\n", "**Meta features:**\n", "\n", "We will start by creating meta features and see how well they predict the spooky authors. The feature list is as follows:\n", "1. Number of words in the text\n", "2. Number of unique words in the text\n", "3. Number of characters in the text\n", "4. Number of stopwords\n", "5. Number of punctuation marks\n", "6. Number of upper case words\n", "7. Number of title case words\n", "8. Average length of the words" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "_cell_guid": "5758c1ff-ca4d-4c66-8c85-ac1725bd631b", "_uuid": "ec015064f318cfd7d0a405bd28cd002f72724bbe" }, "outputs": [], "source": [ "## Number of words in the text ##\n", "train_df[\"num_words\"] = train_df[\"text\"].apply(lambda x: len(str(x).split()))\n", "test_df[\"num_words\"] = test_df[\"text\"].apply(lambda x: len(str(x).split()))\n", "\n", "## Number of unique words in the text ##\n", "train_df[\"num_unique_words\"] = train_df[\"text\"].apply(lambda x: len(set(str(x).split())))\n", "test_df[\"num_unique_words\"] = test_df[\"text\"].apply(lambda x: len(set(str(x).split())))\n", "\n", "## Number of characters in the text ##\n", "train_df[\"num_chars\"] = train_df[\"text\"].apply(lambda x: len(str(x)))\n", "test_df[\"num_chars\"] = test_df[\"text\"].apply(lambda x: len(str(x)))\n", "\n", "## Number of stopwords in the text ##\n", "train_df[\"num_stopwords\"] = train_df[\"text\"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))\n", "test_df[\"num_stopwords\"] = test_df[\"text\"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))\n", "\n", "## Number of punctuations in the text ##\n", "train_df[\"num_punctuations\"] =train_df['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )\n", "test_df[\"num_punctuations\"] =test_df['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )\n", "\n", "## Number of upper case words in the text ##\n", "train_df[\"num_words_upper\"] = train_df[\"text\"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))\n", "test_df[\"num_words_upper\"] = test_df[\"text\"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))\n", "\n", "## Number of title case words in the text ##\n", "train_df[\"num_words_title\"] = train_df[\"text\"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))\n", "test_df[\"num_words_title\"] = 
test_df[\"text\"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))\n", "\n", "## Average length of the words in the text ##\n", "train_df[\"mean_word_len\"] = train_df[\"text\"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))\n", "test_df[\"mean_word_len\"] = test_df[\"text\"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "164b0d54-3d9e-4b58-ae7c-20c447efdc69", "_uuid": "d79f96166a7fe609fd8431b341b2d5ad189ca07a" }, "source": [ "Let us now plot some of our new features to see whether they will be helpful for prediction." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "_cell_guid": "fe9a91d9-3df7-4e1b-9e69-f701cc714a15", "_uuid": "51962b52791977deefab8b7991d59a51211c8a5e" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "train_df['num_words'].loc[train_df['num_words']>80] = 80 #truncation for better visuals\n", "plt.figure(figsize=(12,8))\n", "sns.violinplot(x='author', y='num_words', data=train_df)\n", "plt.xlabel('Author Name', fontsize=12)\n", "plt.ylabel('Number of words in text', fontsize=12)\n", "plt.title(\"Number of words by author\", fontsize=15)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "66833874-08f2-48b8-9561-b54de997071d", "_uuid": "abe5dcadd42cd45169e44e3e299c303d5197b8ab", "collapsed": true }, "source": [ "EAP는 MWS 및 HPL보다 단어 수가 약간 적은 것 같습니다." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "_cell_guid": "91514e74-d734-4642-a9dd-470cb8a65733", "_uuid": "1c4c0950d33e29d270fbc021aa8c5ff0e7ab205b" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "train_df['num_punctuations'].loc[train_df['num_punctuations']>10] = 10 #truncation for better visuals\n", "plt.figure(figsize=(12,8))\n", "sns.violinplot(x='author', y='num_punctuations', data=train_df)\n", "plt.xlabel('Author Name', fontsize=12)\n", "plt.ylabel('Number of puntuations in text', fontsize=12)\n", "plt.title(\"Number of punctuations by author\", fontsize=15)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "3a622bfc-da8b-4f13-b9d8-a63dd79a5218", "_uuid": "240a9ac16cf5d342e6d01cb020f1d8cea670ec84" }, "source": [ "이것도 좀 쓸만해 보입니다. 이제 몇 가지 텍스트 기반 기능을 만드는 데 집중해 보겠습니다.\n", "\n", "이러한 메타 변수가 어떻게 도움이 되는지 알아보기 위해 먼저 기본 모델을 구축해 보겠습니다." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id text author \\\n", "0 id26305 This process, however, afforded me no means of... EAP \n", "1 id17569 It never once occurred to me that the fumbling... HPL \n", "2 id11008 In his left hand was a gold snuff box, from wh... EAP \n", "3 id27763 How lovely is spring As we looked from Windsor... MWS \n", "4 id12958 Finding nothing else, not even gold, the Super... HPL \n", "... ... ... ... \n", "19574 id17718 I could have fancied, while I looked at it, th... EAP \n", "19575 id08973 The lids clenched themselves together as if in... EAP \n", "19576 id05267 Mais il faut agir that is to say, a Frenchman ... EAP \n", "19577 id17513 For an item of news like this, it strikes us i... EAP \n", "19578 id00393 He laid a gnarled claw on my shoulder, and it ... HPL \n", "\n", " num_words num_unique_words num_chars num_stopwords \\\n", "0 41 35 231 19 \n", "1 14 14 71 8 \n", "2 36 32 200 16 \n", "3 34 32 206 13 \n", "4 27 25 174 11 \n", "... ... ... ... ... \n", "19574 20 19 108 11 \n", "19575 10 10 55 6 \n", "19576 13 13 68 4 \n", "19577 15 14 74 7 \n", "19578 22 21 109 14 \n", "\n", " num_punctuations num_words_upper num_words_title mean_word_len \n", "0 7 2 3 4.658537 \n", "1 1 0 1 4.142857 \n", "2 5 0 1 4.583333 \n", "3 4 0 4 5.088235 \n", "4 4 0 2 5.481481 \n", "... ... ... ... ... 
\n", "19574 3 2 2 4.450000 \n", "19575 1 0 1 4.600000 \n", "19576 2 0 2 4.307692 \n", "19577 3 0 1 4.000000 \n", "19578 2 0 1 4.000000 \n", "\n", "[19579 rows x 11 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "_cell_guid": "d97fc321-32a9-429e-9135-ae52300332ec", "_uuid": "c27a85339ef5cf5ac3c87527249f878b615d3996" }, "outputs": [], "source": [ "## Prepare the data for modeling ###\n", "# Label encoding\n", "author_mapping_dict = {'EAP':0, 'HPL':1, 'MWS':2}\n", "train_y = train_df['author'].map(author_mapping_dict)\n", "\n", "train_id = train_df['id'].values\n", "test_id = test_df['id'].values\n", "\n", "### Recompute num_words and mean_word_len from the raw text\n", "train_df[\"num_words\"] = train_df[\"text\"].apply(lambda x: len(str(x).split()))\n", "test_df[\"num_words\"] = test_df[\"text\"].apply(lambda x: len(str(x).split()))\n", "train_df[\"mean_word_len\"] = train_df[\"text\"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))\n", "test_df[\"mean_word_len\"] = test_df[\"text\"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))\n", "\n", "# Drop the columns that are not features\n", "cols_to_drop = ['id', 'text']\n", "train_X = train_df.drop(cols_to_drop+['author'], axis=1)\n", "test_X = test_df.drop(cols_to_drop, axis=1)" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "c4198d3f-5fad-41b1-9bcf-3b2faef2287f", "_uuid": "1fddaea1621cba593f86d78851a26e1645f4954e" }, "source": [ "We can train a simple XGBoost model with these meta features alone." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " num_words num_unique_words num_chars num_stopwords \\\n", "0 41 35 231 19 \n", "1 14 14 71 8 \n", "2 36 32 200 16 \n", "3 34 32 206 13 \n", "4 27 25 174 11 \n", "... ... ... ... ... \n", "19574 20 19 108 11 \n", "19575 10 10 55 6 \n", "19576 13 13 68 4 \n", "19577 15 14 74 7 \n", "19578 22 21 109 14 \n", "\n", " num_punctuations num_words_upper num_words_title mean_word_len \n", "0 7 2 3 4.658537 \n", "1 1 0 1 4.142857 \n", "2 5 0 1 4.583333 \n", "3 4 0 4 5.088235 \n", "4 4 0 2 5.481481 \n", "... ... ... ... ... \n", "19574 3 2 2 4.450000 \n", "19575 1 0 1 4.600000 \n", "19576 2 0 2 4.307692 \n", "19577 3 0 1 4.000000 \n", "19578 2 0 1 4.000000 \n", "\n", "[19579 rows x 8 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_X" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " num_words num_unique_words num_chars num_stopwords num_punctuations \\\n", "0 19 19 110 9 3 \n", "1 62 49 330 33 7 \n", "2 33 30 189 15 3 \n", "3 41 34 223 19 5 \n", "4 11 11 53 6 1 \n", "... ... ... ... ... ... \n", "8387 9 9 42 7 1 \n", "8388 7 7 34 4 1 \n", "8389 25 24 150 11 2 \n", "8390 38 34 197 21 3 \n", "8391 38 33 247 18 5 \n", "\n", " num_words_upper num_words_title mean_word_len \n", "0 1 3 4.842105 \n", "1 1 3 4.338710 \n", "2 0 1 4.757576 \n", "3 2 3 4.463415 \n", "4 1 1 3.909091 \n", "... ... ... ... \n", "8387 0 1 3.777778 \n", "8388 1 1 4.000000 \n", "8389 0 1 5.040000 \n", "8390 2 3 4.210526 \n", "8391 0 1 5.526316 \n", "\n", "[8392 rows x 8 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_X" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " num_words num_unique_words num_chars num_stopwords \\\n", "1 14 14 71 8 \n", "2 36 32 200 16 \n", "3 34 32 206 13 \n", "5 83 66 468 43 \n", "6 21 21 128 9 \n", "... ... ... ... ... \n", "19573 27 24 143 11 \n", "19574 20 19 108 11 \n", "19575 10 10 55 6 \n", "19577 15 14 74 7 \n", "19578 22 21 109 14 \n", "\n", " num_punctuations num_words_upper num_words_title mean_word_len \n", "1 1 0 1 4.142857 \n", "2 5 0 1 4.583333 \n", "3 4 0 4 5.088235 \n", "5 6 5 5 4.650602 \n", "6 5 0 1 5.142857 \n", "... ... ... ... ... \n", "19573 4 0 4 4.333333 \n", "19574 3 2 2 4.450000 \n", "19575 1 0 1 4.600000 \n", "19577 3 0 1 4.000000 \n", "19578 2 0 1 4.000000 \n", "\n", "[15663 rows x 8 columns]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dev_X" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " num_words num_unique_words num_chars num_stopwords \\\n", "0 41 35 231 19 \n", "4 27 25 174 11 \n", "13 15 15 86 5 \n", "20 21 20 111 11 \n", "21 44 39 252 24 \n", "... ... ... ... ... \n", "19531 7 7 53 3 \n", "19543 16 16 93 6 \n", "19550 12 12 59 6 \n", "19554 21 19 126 10 \n", "19576 13 13 68 4 \n", "\n", " num_punctuations num_words_upper num_words_title mean_word_len \n", "0 7 2 3 4.658537 \n", "4 4 0 2 5.481481 \n", "13 2 0 3 4.800000 \n", "20 2 0 1 4.333333 \n", "21 4 1 2 4.750000 \n", "... ... ... ... ... \n", "19531 1 0 1 6.714286 \n", "19543 2 1 2 4.875000 \n", "19550 2 0 1 4.000000 \n", "19554 2 0 1 5.047619 \n", "19576 2 0 2 4.307692 \n", "\n", "[3916 rows x 8 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val_X" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 1\n", "2 0\n", "3 2\n", "5 2\n", "6 0\n", " ..\n", "19573 2\n", "19574 0\n", "19575 0\n", "19577 0\n", "19578 1\n", "Name: author, Length: 15663, dtype: int64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dev_y" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1.59e-075, 3.41e-264, 1.00e+000],\n", " [1.00e+000, 7.61e-087, 2.68e-102],\n", " [1.03e-015, 1.50e-049, 1.00e+000],\n", " ...,\n", " [6.17e-010, 7.47e-094, 1.00e+000],\n", " [3.44e-016, 1.00e+000, 2.69e-027],\n", " [3.68e-050, 1.82e-067, 1.00e+000]])" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pred_val_y" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3916" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(pred_val_y)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"8392" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(pred_test_y)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "_cell_guid": "1534f55b-1334-4f1e-a597-d51e4d190fb3", "_uuid": "7e736a33d7aefcc49edb0e629e57462f92e86c99" }, "outputs": [], "source": [ "# runXGB(dev_X, dev_y, val_X, val_y, test_X, seed_val=0)\n", "def runXGB(train_X, train_y, test_X, test_y=None, test_X2=None, seed_val=0, child=1, colsample=0.3):\n", " # 매개변수\n", " param = {}\n", " param['objective'] = 'multi:softprob'\n", " param['eta'] = 0.1\n", " param['max_depth'] = 3\n", " param['silent'] = 1\n", " param['num_class'] = 3\n", " param['eval_metric'] = \"mlogloss\"\n", " param['min_child_weight'] = child # 1\n", " param['subsample'] = 0.8\n", " param['colsample_bytree'] = colsample # 0.3\n", " param['seed'] = seed_val # 0\n", " num_rounds = 2000\n", "\n", " # 희소행렬 -> DMatrix으로 변환\n", " # xgb.DMatrix -> xgb.train -> xgb.train.predict -> xgb.plot_importance\n", " plst = list(param.items()) # ['multi:softprob', 0.1, 3, 1, 3, 'mlogloss', 1, 0.8, 0.3, 0]\n", " xgtrain = xgb.DMatrix(train_X, label=train_y) # dev_X는 (15663rows×8columns)인 df, dev_y는 15663인 series\n", "\n", " if test_y is not None:\n", " xgtest = xgb.DMatrix(test_X, label=test_y) # val_X는 (3916rows×8columns)인 df, val_y는 3916인 series\n", " watchlist = [ (xgtrain,'train'), (xgtest, 'test') ]\n", " model = xgb.train(plst, xgtrain, num_rounds, watchlist, early_stopping_rounds=50, verbose_eval=20)\n", " else:\n", " xgtest = xgb.DMatrix(test_X)\n", " model = xgb.train(plst, xgtrain, num_rounds)\n", "\n", " pred_test_y = model.predict(xgtest, ntree_limit = model.best_ntree_limit) # 3916인 array\n", " if test_X2 is not None:\n", " xgtest2 = xgb.DMatrix(test_X2)\n", " 
pred_test_y2 = model.predict(xgtest2, ntree_limit = model.best_ntree_limit) # 8392인 array\n", " return pred_test_y, pred_test_y2, model # " ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "1effdaf3-faca-455d-a7a2-ae4d81b5c209", "_uuid": "607c32af563879d1f7e011a438216ab8bea74ae3" }, "source": [ "커널 실행 시간을 위해 점수에 대한 k-폴드 교차 검증의 첫 번째 폴드만 확인할 수 있습니다. 로컬에서 실행하는 동안 'break' 줄을 제거하십시오." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15663\n", "3916\n", "15663\n", "3916\n", "15663\n", "3916\n", "15663\n", "3916\n", "15664\n", "3915\n" ] } ], "source": [ "for dev_index, val_index in kf.split(train_X): # train_X : (15663 rows × 8 columns)인 df\n", " print(len(dev_index))\n", " print(len(val_index))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "_cell_guid": "97965af9-da5a-4ccc-94c5-bc541ed00d49", "_uuid": "441491131b9a7272863714494dd1303022c9d630" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[13:37:24] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/learner.cc:576: \n", "Parameters: { \"silent\" } might not be used.\n", "\n", " This could be a false alarm, with some parameters getting used by language bindings but\n", " then being mistakenly passed down to XGBoost core, or some parameter actually being used\n", " but getting flagged wrongly here. 
Please open an issue if you find any such cases.\n", "\n", "\n", "[0]\ttrain-mlogloss:1.09384\ttest-mlogloss:1.09472\n", "[20]\ttrain-mlogloss:1.04663\ttest-mlogloss:1.05720\n", "[40]\ttrain-mlogloss:1.02368\ttest-mlogloss:1.03845\n", "[60]\ttrain-mlogloss:1.01109\ttest-mlogloss:1.02949\n", "[80]\ttrain-mlogloss:0.99824\ttest-mlogloss:1.01957\n", "[100]\ttrain-mlogloss:0.98938\ttest-mlogloss:1.01345\n", "[120]\ttrain-mlogloss:0.98209\ttest-mlogloss:1.00860\n", "[140]\ttrain-mlogloss:0.97603\ttest-mlogloss:1.00505\n", "[160]\ttrain-mlogloss:0.97088\ttest-mlogloss:1.00237\n", "[180]\ttrain-mlogloss:0.96619\ttest-mlogloss:1.00040\n", "[200]\ttrain-mlogloss:0.96141\ttest-mlogloss:0.99788\n", "[220]\ttrain-mlogloss:0.95745\ttest-mlogloss:0.99651\n", "[240]\ttrain-mlogloss:0.95372\ttest-mlogloss:0.99505\n", "[260]\ttrain-mlogloss:0.95045\ttest-mlogloss:0.99377\n", "[280]\ttrain-mlogloss:0.94730\ttest-mlogloss:0.99296\n", "[300]\ttrain-mlogloss:0.94402\ttest-mlogloss:0.99235\n", "[320]\ttrain-mlogloss:0.94112\ttest-mlogloss:0.99152\n", "[340]\ttrain-mlogloss:0.93846\ttest-mlogloss:0.99101\n", "[360]\ttrain-mlogloss:0.93540\ttest-mlogloss:0.99025\n", "[380]\ttrain-mlogloss:0.93291\ttest-mlogloss:0.98975\n", "[400]\ttrain-mlogloss:0.93021\ttest-mlogloss:0.98917\n", "[420]\ttrain-mlogloss:0.92797\ttest-mlogloss:0.98907\n", "[440]\ttrain-mlogloss:0.92578\ttest-mlogloss:0.98919\n", "[460]\ttrain-mlogloss:0.92365\ttest-mlogloss:0.98910\n", "[480]\ttrain-mlogloss:0.92181\ttest-mlogloss:0.98896\n", "[497]\ttrain-mlogloss:0.92026\ttest-mlogloss:0.98908\n", "cv scores : [0.9887632298301353]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\HOME\\anaconda3\\lib\\site-packages\\xgboost\\core.py:105: UserWarning: ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n", " warnings.warn(\n" ] } ], "source": [ "from sklearn.model_selection import KFold\n", "\n", "kf = KFold(n_splits=5, shuffle=True, random_state=2017)\n", "\n", "cv_scores = 
[]\n", "# pred_full_test = 0  # would accumulate the test-set predictions (array)\n", "pred_train = np.zeros([train_df.shape[0], 3])  # shape (19579, 3)\n", "\n", "# dev_index has 15663 or 15664 rows; val_index has 3916 or 3915\n", "for dev_index, val_index in kf.split(train_X):  # train_X: DataFrame (19579 rows × 8 columns)\n", "    dev_X, val_X = train_X.loc[dev_index], train_X.loc[val_index]\n", "    dev_y, val_y = train_y[dev_index], train_y[val_index]\n", "\n", "    pred_val_y, pred_test_y, model = runXGB(dev_X, dev_y, val_X, val_y, test_X, seed_val=0)\n", "\n", "    # pred_full_test = pred_full_test + pred_test_y  # array with 8392 rows\n", "    pred_train[val_index,:] = pred_val_y  # array with 3916 rows\n", "    cv_scores.append(metrics.log_loss(val_y, pred_val_y))  # (true labels, predicted probabilities)\n", "    break  # only the first fold is run; remove to run all 5 folds\n", "\n", "print(\"cv scores : \", cv_scores)" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "ef5d6aa2-9d59-48e8-87ad-7ab254e0dc8a", "_uuid": "6ebaf270a27c8b07455498111bd0a8b2c11b6b16" }, "source": [ "Using only the meta features, we get an mlogloss of about 0.989 on the first fold. That is not a bad score. Next, let's see which of these features are important." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": { "_cell_guid": "2856aa57-cce7-4f18-af56-a77ebd52eef7", "_uuid": "506938cebfedaad3e70d531e4cad554abe2a3c55" }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAyEAAALJCAYAAACjjiyoAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/MnkTPAAAACXBIWXMAAAsTAAALEwEAmpwYAABW1UlEQVR4nO3dfZxWdZ3/8dcHMFRMXUL8oaigKC43itraz0ydUrDC0O3Gn2YkUlmrbndqkaipbQu12j3lWppkJlmuQmWYa05qmopFEhjixpggSubNKig5+Pn9cR2mC5jhRpjvDMzr+XjMY875nnO+53s+c4nXe77nXBOZiSRJkiSV0q2jByBJkiSpazGESJIkSSrKECJJkiSpKEOIJEmSpKIMIZIkSZKKMoRIkiRJKsoQIknq8iLivIj4TkePQ5K6ivDvhEiSNkVENAG7AivrmvfLzMc3sc8PZuZ/b9rotjwRcREwKDPf19FjkaT24kyIJGlzeEdm7lD39aoDyOYQET068vyv1pY6bknaWIYQSVK7iIidIuLKiFgSEYsj4t8ionu1bZ+I+GVE/DUinoqIayNi52rbNcCewE8i4oWI+FRENETEojX6b4qIY6rliyLixxHx/Yj4X2Dcus7fylgviojvV8sDIiIj4rSIeCwinomIj0TEP0XEgxHxbER8o+7YcRHx64j4ekQ8FxF/jIij67bvFhEzIuLpiHgkIj60xnnrx/0R4Dzg/1XX/vtqv9Mi4qGIeD4i/hQRH67royEiFkXE2RGxtLre0+q2bxcRl0XEo9X47oqI7apt/zci7q6u6fcR0fAqftSStNEMIZKk9jIVaAYGAQcBo4APVtsCmATsBvwjsAdwEUBmjgX+zN9nV764gec7HvgxsDNw7XrOvyHeAOwL/D/gK8BE4BhgKHBiRBy1xr5/AvoAnwX+KyJ6V9uuAxZV1/pu4N/rQ8oa474S+Hfgh9W1H1jtsxQ4DtgROA34ckQcXNfH/wF2AnYHPgBMiYh/qLZdChwCvBHoDXwKeCUidgd+Bvxb1X4OcENE7LIRNZKkV8UQIknaHG6qfpv+bETcFBG7Am8DPp6ZyzJzKfBl4CSAzHwkM2/NzBWZ+RfgS8BRbXe/Qe7JzJsy8xVqb9bbPP8G+lxmvpSZvwCWAddl5tLMXAzcSS3YrLIU+EpmvpyZPwTmA6MjYg/gTcCnq75mA98BxrY27sx8sbWBZObPMvN/suZXwC+AI+p2eRm4pDr/zcALwOCI6AaMBz6WmYszc2Vm3p2ZK4D3ATdn5s3VuW8FZgFv34gaSdKr4r2nkqTN4YT6h8gj4lBgG2BJRKxq7gY8Vm3vC3yN2hvp11bbntnEMTxWt7zXus6/gZ6sW36xlfUd6tYX5+qf9PIotZmP3YCnM/P5Nba9vo1xtyoi3kZthmU/atexPTCnbpe/ZmZz3fryanx9gG2B/2ml272A90TEO+ratgFuX994JGlTGUIkSe3hMWAF0GeNN8erTAISOCAz/xoRJwDfqNu+5kc3LqP2xhuA6tmONW8bqj9mfeff3HaPiKgLInsCM4DHgd4R8dq6ILInsLju2DWvdbX1iOgJ3AC8H5iemS9HxE3Ubmlbn6eAl4B9gN+vse0x4JrM/NBaR0lSO/N2LEnSZpeZS6jdMnRZROwYEd2qh9FX3XL1Wmq3DD1bPZtw7hpdPAnsXbf+MLBtRIyOiG2A84Gem3D+za0v8NGI2CYi3kPtOZebM/Mx4G5gUkRsGxEHUHtm49p19PUkMKC6lQrgNdSu9S9AczUrMmp
DBlXdmnYV8KXqAfnuEXFYFWy+D7wjIo6t2retHnLvv/GXL0kbxxAiSWov76f2BnoetVutfgz0q7ZdDBwMPEft4ej/WuPYScD51TMm52Tmc8AZ1J6nWExtZmQR67au829u91J7iP0p4PPAuzPzr9W2k4EB1GZFbgQ+Wz1/0ZYfVd//GhG/rWZQPgpcT+063kttlmVDnUPt1q37gaeBLwDdqoB0PLVP4/oLtZmRc/G9gaQC/GOFkiRtgogYR+0PK76po8ciSVsKf9shSZIkqShDiCRJkqSivB1LkiRJUlHOhEiSJEkqyr8T0gXtvPPOOWjQoI4eRpexbNkyevXq1dHD6BKsdVnWuyzrXY61Lst6l1Wy3g888MBTmbnm33QCDCFd0q677sqsWbM6ehhdRmNjIw0NDR09jC7BWpdlvcuy3uVY67Ksd1kl6x0Rj7a1zduxJEmSJBVlCJEkSZJUlCFEkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSJBVlCJEkSZJUlCFEkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSJBVlCJEkSZJUlCFEkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSJBVlCJEkSZJUlCFEkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklRUZGZHj0GF7bn3oOx24lc7ehhdxtnDm7lsTo+OHkaXYK3Lst5lWe9yrHVZXaneTZNHd/QQaGxspKGhoci5IuKBzHx9a9ucCZEkSZJUlCFEkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSJBVlCJEkSZJUlCFEkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSJBVlCJEkSZIKGT9+PH379mXYsGGrtX/9619n8ODBDB06lE996lMANDU1sd122zFixAhGjBjBRz7ykZb9f/jDH3LAAQestn9rJk2axKBBgxg8eDC33HJL+1zUq9CjowcgSZIkdRXjxo3jrLPO4v3vf39L2+2338706dN58MEH6dmzJ0uXLm3Zts8++zB79uzV+vjrX//KueeeywMPPMAuu+zCqaeeym233cbRRx+92n7z5s1j2rRpzJ07l8cff5xjjjmGK664ol2vb0M5E9KBImJARPxhHdsbIuKnJcckSZKk9nPkkUfSu3fv1dq+9a1vMWHCBHr27AlA375919nHn/70J/bbbz922WUXAI455hhuuOGGtfabPn06J510Ej179mTgwIEMGjSIP/7xj5vpSjaNIaSgiHDmSZIkSat5+OGHufPOO3nDG97AUUcdxf3339+ybeHChRx00EEcddRR3HnnnQAtYaKpqYnm5mZuuukmHnvssbX6Xbx4MXvssUfLev/+/Xnqqafa/4I2wBbxpjgiBgAzgbuA/wv8HvgucDHQFzgFmAt8HRhO7bouyszp1bHXAL2q7s7KzLsjogG4CHgKGAY8ALwvM7OV8x8KTMjMd0bE8cA0YCdqIW5eZu4dESOAy4Htgf8BxmfmMxHRCNwNHA7MqNavApZX17OhNejVxvWNA8ZU590HuDEz17oxMCJOB04H6NNnFy4c3ryhp9Ym2nU7ONt6F2Gty7LeZVnvcqx1WV2p3o2NjQA88cQTLFu2rGX9ueeeY86cOUyePJk//vGPjBkzhh/84Ae8/PLL/OAHP2CnnXZi/vz
5vOtd7+K73/0uvXr14owzzuBtb3sb3bp1Y+jQoTz77LMt/a2yaNEiHnrooZb2JUuW0Ldv37X26whbRAipDALeQ+2N9P3Ae4E3UXsDfh4wD/hlZo6PiJ2B+yLiv4GlwMjMfCki9gWuA15f9XkQMBR4HPg1taDQWjD4bbUvwBHAH4B/ola/e6v27wH/mpm/iohLgM8CH6+27ZyZRwFExIN1+/3HRlz/xDauD2BENb4VwPyI+HpmrhaHM/MK4AqAPfcelJfN2ZJ+9Fu2s4c3Y73LsNZlWe+yrHc51rqsrlTvplMaat+bmujVqxcNDbX1wYMH89GPfpSGhgbe/OY3c+mllzJs2LCW260AGhoauO6669h11115/etfT0NDA+eddx4AV1xxBY888khLf6vcc889LcdC7SH1/v37r7VfR9iSbsdamJlzMvMVarMet1WzFnOAAcAoYEJEzAYagW2BPYFtgG9HxBzgR8CQuj7vy8xFVZ+zq37WkpnNwCMR8Y/AocCXgCOpBZI7I2InakHjV9UhU6vtq/wQoJX9rtmI62/r+qhq8VxmvkQtjO21Ef1KkiSpA51wwgn88pe/BGq3Zv3tb3+jT58+/OUvf2HlypVA7TmQBQsWsPfeewO0PLz+zDPP8M1vfpMPfvCDa/U7ZswYpk2bxooVK1i4cCELFixg//33L3RV67Ylxc4Vdcuv1K2/Qu06VgLvysz59QdFxEXAk8CB1ELXS230uZJ11+NO4G3Ay8B/A1cD3YFzNmDsy1YNB1jrdq8NFLR+fW9g465DkiRJHeTkk0+msbGRp556iv79+3PxxRczfvx4xo8fz7Bhw3jNa17D1KlTiQjuuOMOLrzwQnr06EH37t25/PLLWx5q/9jHPsbvf/97AC688EL2228/AGbMmMGsWbO45JJLGDp0KCeeeCJDhgyhR48eTJkyhe7du3fYtdfbmt6s3gL8a0T8a2ZmRByUmb+j9uzGosx8JSJOpRYcXo07qN1y9b3M/EtEvA74P8Dc6nzPRMQRmXknMBb41ZodZOazEfFcRLwpM++i9izLpl6fJEmSthDXXXddq+3f//7312p717vexbve9a6N6mfMmDGMGTOmZX3ixIlMnDixZb0zPA8CW9btWOvzOWq3Xj1Yfezt56r2bwKnRsRvgP34+6zExroX2JVaGAF4EHiw7kH2U4H/qJ75GAFc0kY/pwFTIuIe4MWNOH9b1ydJkiRtUbaImZDMbKL2CVar1se1se3DrRy7ADigrukzVXsjtWcrVu131nrG8CLQs2799DW2z6b2yV1rHtewxvoD1G4NW+WidZyzZYzV+Vu7vqup3Rq2av24tvqTJEmSOoOtaSZEkiRJ0hZgi5gJKSkibgQGrtH86cy8pR3PeSzwhTWaF2bmP7fXOSVJkqSOYghZQ0e88a8CTruFHEmSJKkz8XYsSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSJBVlCJEkSZJUlCFEkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSJBVlCJEkSZJUVI+OHoDK226b7syfPLqjh9FlNDY20nRKQ0cPo0uw1mVZ77KsdznWuizr3TU5EyJJkiSpKEOIJEmSpKIMIZIkSZKKMoRIkiRJKsoQIkmSJKkoQ4gkSZKkogwhkiRJkooyhEiSJEkqyhAiSZIkqShDiCRJkqSiDCGSJEmSijKESJIkSSrKECJJkiSpqMjMjh6DCttz70HZ7cSvdvQwuoyzhzdz2ZweHT2MLsFal2W9y7Le5VjrslbVu2ny6I4eSpfQ2NhIQ0NDkXNFxAOZ+frWtjkTIkmSJKkoQ4gkSZKkogwhkiRJkooyhEiSJEkqyhAiSZIkqShDiCRJkqSiDCGSJEmSijKESJIkSSrKECJJkiSpKEOIJEmSpKIMIZI
kSZKKMoRIkiRJKsoQIkmSJKkoQ4gkSZKkogwhkiRJkooyhEiSJEkqyhAiSZIkqShDiCRJkqSiDCGSJEmSijKESJIkSSrKECJJkiSpKEOIJEmSpKIMIZIkSZKKMoRIkiRJKsoQIkmSpE5j/Pjx9O3bl2HDhrW0XXTRRey+++6MGDGCESNGcPPNN7dsmzRpEoMGDWLw4MHccsstLe0NDQ0MHjy45ZilS5e2er62jlf76tHRA5AkSZJWGTduHGeddRbvf//7V2v/xCc+wTnnnLNa27x585g2bRpz587l8ccf55hjjuHhhx+me/fuAFx77bW8/vWvb/Nc6zte7ceZkA4WEQ0R8dOOHockSVJncOSRR9K7d+8N2nf69OmcdNJJ9OzZk4EDBzJo0CDuu+++DT7Xph6vV88QsoWLCGezJEnSVu8b3/gGBxxwAOPHj+eZZ54BYPHixeyxxx4t+/Tv35/Fixe3rJ922mmMGDGCz33uc2TmWn2u73i1ny7/BjYiBgA/B+4C3ggsBo6v2s7JzFkR0QeYlZkDImIccALQHRgGXAa8BhgLrADenplPt3GuQcDlwC7ASuA91aYdIuLHVX8PAO/LzIyIC4F3ANsBdwMfrtobq/XDgRkR8Wfgs1Wfz2Xmka2c+3TgdIA+fXbhwuHNr6pe2ni7bgdnW+8irHVZ1rss612OtS5rVb0bGxtb2p544gmWLVvW0nbAAQdw5ZVXEhFcddVVvPe97+XTn/40ixYt4qGHHmrZb8mSJcydO5c+ffpw5plnsssuu7B8+XI++9nPsnz5co499tjVzr2u47dWL7zwwmq17ihdPoRU9gVOzswPRcT1wLvWs/8w4CBgW+AR4NOZeVBEfBl4P/CVNo67FpicmTdGxLbUZqL2qPoaCjwO/JpauLgL+EZmXgIQEdcAxwE/qfraOTOPqrbNAY7NzMURsXNrJ87MK4ArAPbce1BeNscffSlnD2/Gepdhrcuy3mVZ73KsdVmr6t10SkNLW1NTE7169aKhoWGt/ffee2+OO+44GhoauOeeewBa9ps0aRKjRo3isMMOW+2YpUuXMmvWrLX629DjtyaNjY2t1rU0b8eqWZiZs6vlB4AB69n/9sx8PjP/AjzH34PBnLaOjYjXArtn5o0AmflSZi6vNt+XmYsy8xVgdl0fb46Ie6uQ8RZqQWWVH9Yt/xq4OiI+RG2GRpIkaauxZMmSluUbb7yx5ZOzxowZw7Rp01ixYgULFy5kwYIFHHrooTQ3N/PUU08B8PLLL/PTn/50tU/bWqWt49X+jPk1K+qWV1K7/amZv4e0bdex/yt166/Qdk1jI87fo5op+Sbw+sx8LCIuWmMcy1YtZOZHIuINwGhgdkSMyMy/ruN8kiRJndLJJ59MY2MjTz31FP379+fiiy+msbGR2bNnExEMGDCA//zP/wRg6NChnHjiiQwZMoQePXowZcoUunfvzrJlyzj22GN5+eWXWblyJccccwwf+tCHAJgxYwazZs3ikksuafN4tT9DSNuagEOA+4B3b2pnmfm/EbEoIk7IzJsioifrnrVYFTieiogdqjH8uLUdI2KfzLwXuDci3kHtFi9DiCRJ2uJcd911a7V94AMfaHP/iRMnMnHixNXaevXqxQMPPNDq/mPGjGHMmDHrPF7tz9ux2nYp8C8RcTewuZ5OGgt8NCIepPZg+f9pa8fMfBb4NrVbvG4C7l9Hv/8REXMi4g/AHcDvN9N4JUmSpM2uy8+EZGYTtQfNV61fWrf5gLrl86vtVwNX1+0/oG55tW2tnGsBtWc76v0JaKzb56y65fNXnXeNfhrWWH9nW+eUJEmSOhtnQiRJkiQV1eVnQtpDREyh9jG79b6amd/tiPFIkiRJnYkhpB1k5pkdPQZJkiSps/J2LEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSJBVlCJEkSZJ
UlCFEkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVFSPjh6Ayttum+7Mnzy6o4fRZTQ2NtJ0SkNHD6NLsNZlWe+yrHc51ros6901ORMiSZIkqShDiCRJkqSiDCGSJEmSijKESJIkSSrKECJJkiSpKEOIJEmSpKIMIZIkSZKKMoRIkiRJKsoQIkmSJKkoQ4gkSZKkogwhkiRJkooyhEiSJEkqyhAiSZIkqShDiCRJkqSienT0AFTeiy+vZMCEn3X0MLqMs4c3M856F2Gty7LeZVnvcjpLrZsmj+7oIUjtxpkQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSJBVlCJEkSZJUlCFEkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSJBVlCJEkSZJUlCFEkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZKkTmz8+PH07duXYcOGtbRdcMEFHHDAAYwYMYJRo0bx+OOPr3bMn//8Z3bYYQcuvfRSAJYvX87o0aPZf//9GTp0KBMmTGjzfJMmTWLQoEEMHjyYW265pX0uSl2eIUSSJKkTGzduHDNnzlyt7dxzz+XBBx9k9uzZHHfccVxyySWrbf/EJz7B2972ttXazjnnHP74xz/yu9/9jl//+tf8/Oc/X+tc8+bNY9q0acydO5eZM2dyxhlnsHLlys1/UeryDCFbsIh4oaPHIEmS2teRRx5J7969V2vbcccdW5aXLVtGRLSs33TTTey9994MHTq0pW377bfnzW9+MwCvec1rOPjgg1m0aNFa55o+fTonnXQSPXv2ZODAgQwaNIj77rtvc1+SZAjZUkREj44egyRJ6jwmTpzIHnvswbXXXtsyE7Js2TK+8IUv8NnPfrbN45599ll+8pOfcPTRR6+1bfHixeyxxx4t6/3792fx4sWbf/Dq8rrsG9uIGAD8HLgLeCOwGDi+ajsnM2dFRB9gVmYOiIhxwAlAd2AYcBnwGmAssAJ4e2Y+3cp5+gI/z8xDIuJAYDawV2b+OSL+BxgO7AJcVX3/C3Batf1q4GngIOC3EfEN4AfUfm4z687RD/ghsGO17V8y8841xnE6cDpAnz67cOHw5lddO22cXbeDs613Eda6LOtdlvUup7PUurGxsWX5iSeeYNmyZau1jRw5kpEjR3LttddyzjnncNppp/Gtb32LUaNGMWvWLJqamthuu+1WO2blypWcd955vP3tb+fPf/4zf/7zn1c756JFi3jooYdajlmyZAlz586lT58+7XadL7zwwmpjVPvqLPXusiGksi9wcmZ+KCKuB961nv2HUQsE2wKPAJ/OzIMi4svA+4GvrHlAZi6NiG0jYkfgCGAWcERE3AUszczlVbj4XmZOjYjxwNeoBR6A/YBjMnNlRMwAvpWZ34uIM+tO817glsz8fER0B7ZvZRxXAFcA7Ln3oLxsTlf/0Zdz9vBmrHcZ1ros612W9S6ns9S66ZSGvy83NdGrVy8aGhrW2m/gwIGMHj2aqVOncsEFF3DvvfcydepUnn32Wbp168bQoUM566yzgNpD7m94wxv42te+1uo577nnHoCW80yaNIlRo0Zx2GGHbdZrq9fY2Njqdal9dJZ6d/x/YR1rYWbOrpYfAAasZ//bM/N54PmIeA74SdU+BzhgHcfdDRwOHAn8O/BWIIBVsxWHAe+slq8Bvlh37I8yc9UTYYfz96B0DfCFavl+4KqI2Aa4qe6aJEnSVmjBggXsu+++AMyYMYP9998fgDvv/PuNEBdddBE77LBDSwA5//zzee655/jOd77TZr9jxozhve99L5/85Cd5/PHHWbBgAYceemg7Xom6qq4eQlbULa8EtgOa+fuzMtuuY/9X6tZfYd2
1vJPaLMhewHTg00ACP21j/6xbXraObbWGzDsi4khgNHBNRPxHZn5vHeORJElbiJNPPpnGxkaeeuop+vfvz8UXX8zNN9/M/Pnz6datG3vttReXX375OvtYtGgRn//859l///05+OCDATjrrLP44Ac/yIwZM5g1axaXXHIJQ4cO5cQTT2TIkCH06NGDKVOm0L179xKXqS6mq4eQ1jQBhwD3Ae/eTH3eAfwbcEdmvhIRTwNvBz5Tbb8bOIna7MYp1J5Tac2vq/2+X+0HQETsBSzOzG9HRC/gYMAQIknSVuC6665bq+0DH/jAeo+76KKLWpb79+9P5lq/xwRqsx9jxoxpWZ84cSITJ07c+IFKG8FPx1rbpcC/RMTdwGZ5Ciszm6rFO6rvdwHPZuYz1fpHgdMi4kFqD7p/rI2uPgacGRH3AzvVtTcAsyPid9Ru1/rq5hi3JEmS1B667ExIFQyG1a1fWre5/vmO86vtVwNX1+0/oG55tW1tnG/PuuV/p/ZsSP1Y3tLKMePWWF9I7fmRVSZX7VOBqes6vyRJktRZOBMiSZIkqaguOxPSHiJiCrVPsKr31cz8bkeMR5IkSeqMDCGbUWaeuf69JEmSpK7N27EkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSJBVlCJEkSZJUlCFEkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRPTp6ACpvu226M3/y6I4eRpfR2NhI0ykNHT2MLsFal2W9y7Le5Vhrqf05EyJJkiSpKEOIJEmSpKIMIZIkSZKKMoRIkiRJKsoQIkmSJKkoQ4gkSZKkogwhkiRJkooyhEiSJEkqyhAiSZIkqShDiCRJkqSiDCGSJEmSijKESJIkSSrKECJJkiSpqB4dPQCV9+LLKxkw4WcdPYwu4+zhzYyz3kVY67Ksd1nWu5yOrnXT5NEddm6pFGdCJEmSJBVlCJEkSZJUlCFEkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSJBVlCJEkSZJUlCFEkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSOqHx48fTt29fhg0b1tJ2wQUXcMABBzBixAhGjRrF448/DsB9993HiBEjGDFiBAceeCA33nhjyzF/+9vfOP3009lvv/3Yf//9ueGGG1o936RJkxg0aBCDBw/mlltuad+LU5dnCJEkSeqExo0bx8yZM1drO/fcc3nwwQeZPXs2xx13HJdccgkAw4YNY9asWcyePZuZM2fy4Q9/mObmZgA+//nP07dvXx5++GHmzZvHUUcdtda55s2bx7Rp05g7dy4zZ87kjDPOYOXKle1/keqyDCHrERENEfHGjh7HmiLi6oh4d0ePQ5IktY8jjzyS3r17r9a24447tiwvW7aMiABg++23p0ePHgC89NJLLe0AV111FZ/5zGcA6NatG3369FnrXNOnT+ekk06iZ8+eDBw4kEGDBnHfffdt9muSVjGErF8D0KEhJCK6d+T5JUlS5zFx4kT22GMPrr322paZEIB7772XoUOHMnz4cC6//HJ69OjBs88+C9Ru4zr44IN5z3vew5NPPrlWn4sXL2aPPfZoWe/fvz+LFy9u92tR19WjowewPhExAPg5cBe1MLAYOL5qOyczZ0VEH2BWZg6IiHHACUB3YBhwGfAaYCywAnh7Zj7dxrk+CnwEaAbmAROq9ZUR8T7gX4E/A1cBuwB/AU7LzD9HxNXAS8BQYFfgk5n504i4GZiQmQ9GxO+AGzPzkoj4HPAocCXwReBtQAL/lpk
/jIgG4LPAEmBERAwFvg68BVgItPyKIyImA2Oqcf8iM89p5dpOB04H6NNnFy4c3rze2mvz2HU7ONt6F2Gty7LeZVnvcjq61o2NjS3LTzzxBMuWLVutbeTIkYwcOZJrr72Wc845h9NOO61l25QpU3j00Uc577zz6NWrFy+++CKLFi1ip5124ktf+hLXX389Y8eO5bzzzlvtnIsWLeKhhx5qOc+SJUuYO3duq7Mmm9sLL7yw2vWpfXWWenf6EFLZFzg5Mz8UEdcD71rP/sOAg4BtgUeAT2fmQRHxZeD9wFfaOG4CMDAzV0TEzpn5bERcDryQmZcCRMRPgO9l5tSIGA98jVroARgAHAXsA9weEYOAO4AjIqKJWkg4vNr3TcD3gXcCI4ADgT7A/RFxR7XPocCwzFwYEe8EBgPDqYWcecBVEdEb+Gdg/8zMiNi5tQvLzCuAKwD23HtQXjZnS/nRb/nOHt6M9S7DWpdlvcuy3uV0dK2bTmn4+3JTE7169aKhoWGt/QYOHMjo0aOZOnXqWtuuvvpqevfuzSGHHML222/PBRdcQLdu3dhnn31461vfulZ/99xzD0BL+6RJkxg1ahSHHXbY5rqsNjU2NrZ6fWofnaXeW8rtWAszc3a1/AC1N/vrcntmPp+ZfwGeA35Stc9Zz7EPAtdWsx5t/QrkMOAH1fI11MLEKtdn5iuZuQD4E7A/cCdwZLXfz4AdImJ7YEBmzq/ar8vMlZn5JPAr4J+q/u7LzIXV8pF1+z0O/LJq/19qMzDfqYLK8nVcnyRJ2oItWLCgZXnGjBnsv//+ACxcuLDlQfRHH32U+fPnM2DAACKCd7zjHS2/+b7tttsYMmTIWv2OGTOGadOmsWLFChYuXMiCBQs49NBD2/+C1GVtKb9SWVG3vBLYjlpIWBWitl3H/q/Urb/Cuq95NLU3+2OAC6pboNYn21hetX4/8HpqoeRWarMdH6IWpqDutqpWLFvHuWoNmc0RcShwNHAScBa1W7YkSdIW7OSTT6axsZGnnnqK/v37c/HFF3PzzTczf/58unXrxl577cXll18OwF133cXkyZPZZptt6NatG9/85jdbbqX6whe+wNixY/n4xz/OLrvswne/+12gFmJmzZrFJZdcwtChQznxxBMZMmQIPXr0YMqUKXTv7iOpaj9bSghpTRNwCHAfsMmfEhUR3YA9MvP2iLgLeC+wA/A8sGPdrndTe7N/DXAKtWdVVnlPREwFBgJ7A/Mz828R8RhwIvA5as+SXFp9Qe12rQ9Xx/WmFoLOpTaLUm/Vft8D+gJvBn4QETsA22fmzRHxG2q3n0mSpC3cddddt1bbBz7wgVb3HTt2LGPHjm1121577cUdd9yxVvuYMWMYM2ZMy/rEiROZOHHiqxyttHG25BByKXB9RIzl77cmbYruwPcjYidqsxNfrp4J+Qnw44g4ntqD6R+l9izGuVQPptf1MZ/a7VS7Ah/JzJeq9juBozNzeUTcCfSv2gBupHaL1++pzXR8KjOfiIg1Q8iN1GY45gAPV+cBeC0wPSK2rcb9ic1QC0mSJKnddPoQkplN1B40X7V+ad3mA+qWz6+2Xw1cXbf/gLrl1batcZ6XWf35jlXtD69xHmj7dqdfZ+ZaISAzLwAuqJYfp+4WrMxMajMf565xTCPQuMZ+Z7VxXm/alCRJ0hZjS3kwXZIkSdJWotPPhLSHiJjC3z8qd5WvZuZ3X22fmTlukwYlSZIkdRFdMoRk5pkdPQZJkiSpq/J2LEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRGxRCImKfiOhZLTdExEcjYud2HZkkSZKkrdKGzoTcAKyMiEHAlcBA4AftNipJkiRJW60NDSGvZGYz8M/AVzLzE0C/9huWJEmSpK3VhoaQlyPiZOBU4KdV2zbtMyRJkiRJW7MNDSGnAYcBn8/MhRExEPh++w1LkiRJ0taqx4b
slJnzIuLTwJ7V+kJgcnsOTJIkSdLWaUM/HesdwGxgZrU+IiJmtOO4JEmSJG2lNvR2rIuAQ4FnATJzNrVPyJIkSZKkjbKhIaQ5M59boy0392AkSZIkbf026JkQ4A8R8V6ge0TsC3wUuLv9hiVJkiRpa7WhMyH/CgwFVlD7I4XPAR9vpzFJkiRJ2oqtdyYkIroDMzLzGGBi+w9J7W27bbozf/Lojh5Gl9HY2EjTKQ0dPYwuwVqXZb3Lst7lWGup/a13JiQzVwLLI2KnAuORJEmStJXb0GdCXgLmRMStwLJVjZn50XYZlSRJkqSt1oaGkJ9VX5IkSZK0STb0L6ZPbe+BSJIkSeoaNiiERMRCWvm7IJm592YfkSRJkqSt2obejvX6uuVtgfcAvTf/cCRJkiRt7Tbo74Rk5l/rvhZn5leAt7Tv0CRJkiRtjTb0dqyD61a7UZsZeW27jEiSJEnSVm1Db8e6rG65GVgInLj5hyNJkiRpa7ehIeQDmfmn+oaIGNgO45EkSZK0ldugZ0KAH29gmyRJkiSt0zpnQiJif2AosFNEvLNu047UPiVLkiRJkjbK+m7HGgwcB+wMvKOu/XngQ+00JkmSJElbsXWGkMycDkyPiMMy855CY5IkSZK0FdvQB9N/FxFnUrs1q+U2rMwc3y6jUrt68eWVDJjws44eRpdx9vBmxlnvIqx1Wda7LOtdTkfXumny6A47t1TKhj6Yfg3wf4BjgV8B/andkiVJkiRJG2VDQ8igzLwAWJaZU4HRwPD2G5YkSZKkrdWGhpCXq+/PRsQwYCdgQLuMSJIkSdJWbUOfCbkiIv4BuACYAewAXNhuo5IkSZK01dqgEJKZ36kWfwXs3X7DkSRJkrS126DbsSJi14i4MiJ+Xq0PiYgPtO/QJEmSJG2NNvSZkKuBW4DdqvWHgY+3w3gkSZIkbeU2NIT0yczrgVcAMrMZWNluo5IkSZK01drQELIsIl4HJEBE/F/guXYblSRJkqSt1oZ+OtYnqX0q1j4R8WtgF+Dd7TYqSZIkSVutdYaQiNgzM/+cmb+NiKOAwUAA8zPz5XUdK0mSJEmtWd/tWDfVLf8wM+dm5h8MIJIkSZJerfWFkKhb9u+DSJIkSdpk6wsh2cayJEmSJL0q63sw/cCI+F9qMyLbVctU65mZO7br6CRJkiRtddYZQjKze6mBSJIkSeoaNvTvhEiSJEnSZmEIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSOqHx48fTt29fhg0b1tJ2wQUXcMABBzBixAhGjRrF448/DsB9993HiBEjGDFiBAceeCA33nhjyzF/+9vfOP3009lvv/3Yf//9ueGGG1o936RJkxg0aBCDBw/mlltuad+LU5dnCJEkSeqExo0bx8yZM1drO/fcc3nwwQeZPXs2xx13HJdccgkAw4YNY9asWcyePZuZM2fy4Q9/mObmZgA+//nP07dvXx5++GHmzZvHUUcdtda55s2bx7Rp05g7dy4zZ87kjDPOYOXKle1/keqyumwIiYibI2Lnjh7HqxURF0XEOR09DkmS1D6OPPJIevfuvVrbjjvu2LK8bNkyIgKA7bffnh49egDw0ksvtbQDXHXVVXzmM58BoFu3bvTp02etc02fPp2TTjqJnj17MnDgQAYNGsR999232a9JWqXLhpDMfHtmPtvR49gQUdNlf1aSJOnvJk6cyB577MG1117bMhMCcO+99zJ06FCGDx/O5ZdfTo8ePXj22WeB2m1cBx98MO95z3t48skn1+pz8eLF7LHHHi3r/fv3Z/Hixe1+Leq6erRXxxExAPg5cBfwRmAxcHzVdk5mzoqIPsCszBwQEeOAE4DuwDDgMuA1wFhgBfD2zHy6jXM1rqPPMcD2wD7AjZn5qeqYJuD1mflUREwE3g88BvwFeCAzL11Hv92ByUAD0BOYkpn/2cbYvgnMzMwZEXEj8Exmjo+
IDwADM/P8iPgkML465DuZ+ZW6+t0OHAacEBHvW3Oc1Tk+CnwEaAbmZeZJrYzjdOB0gD59duHC4c2tDVftYNft4GzrXYS1Lst6l2W9y+noWjc2NrYsP/HEEyxbtmy1tpEjRzJy5EiuvfZazjnnHE477bSWbVOmTOHRRx/lvPPOo1evXrz44ossWrSInXbaiS996Utcf/31jB07lvPOO2+1cy5atIiHHnqo5TxLlixh7ty5rc6abG4vvPDCaten9tVZ6t1uIaSyL3ByZn4oIq4H3rWe/YcBBwHbAo8An87MgyLiy9TefH/lVYxhRNXnCmB+RHw9Mx9btTEiDgFOqvbpAfyW6s39OnwAeC4z/ykiegK/johfZObCVva9AzgCmAHsDvSr2t8ETKvOfxrwBiCAeyPiV8AzwGDgtMw8Yz3jnEAt0Kxo6xazzLwCuAJgz70H5WVz2vtHr1XOHt6M9S7DWpdlvcuy3uV0dK2bTmn4+3JTE7169aKhoWGt/QYOHMjo0aOZOnXqWtuuvvpqevfuzSGHHML222/PBRdcQLdu3dhnn31461vfulZ/99xzD0BL+6RJkxg1ahSHHXbY5rqsNjU2NrZ6fWofnaXe7X2Lz8LMnF0tPwAMWM/+t2fm85n5F+A54CdV+5wNOLYtt2Xmc5n5EjAP2GuN7UdQmyFZnpn/Sy0srM8o4P0RMRu4F3gdtcDVmjuBIyJiSHX+JyOiH7XZjbuphZEbM3NZZr4A/Fc1JoBHM/M3GzDOB4Frq5kSf00mSdJWasGCBS3LM2bMYP/99wdg4cKFLQ+iP/roo8yfP58BAwYQEbzjHe9o+c33bbfdxpAhQ9bqd8yYMUybNo0VK1awcOFCFixYwKGHHtr+F6Quq71j/oq65ZXAdtTeJK8KP9uuY/9X6tZfYd1j3dA+V7bRT25kvwH8a2au9/PrMnNxRPwD8FZqsyK9gROBFzLz+ah/cmxtyzZwnKOBI6ndenZBRAzNTMOIJElbsJNPPpnGxkaeeuop+vfvz8UXX8zNN9/M/Pnz6datG3vttReXX345AHfddReTJ09mm222oVu3bnzzm99suZXqC1/4AmPHjuXjH/84u+yyC9/97neBWoiZNWsWl1xyCUOHDuXEE09kyJAh9OjRgylTptC9e/cOu3Zt/TpirrEJOAS4D3h3J+jzDuDqiJhMrR7vAFY939FWv7cA/xIRv8zMlyNiP2BxZq4ZGla5B/g48BZqsyY/rr7WPH8A/0ztOZgNGmf1wPoemXl7RNwFvBfYAXh2Y4ogSZI6l+uuu26ttg984AOt7jt27FjGjm3t7QPstdde3HHHHWu1jxkzhjFjxrSsT5w4kYkTJ77K0UobpyNCyKXA9RExFvhlR/eZmb+NiB8Cs4FHqd0+tb5+v0Pt9rDfVjMZf6H2UH1b7gRGZeYjEfEotdmQO+vOfzW1oAO1B9N/Vz2YviHj7A58PyJ2ohZivrylfOqXJEmSuqZ2CyGZ2UTtQfNV65fWbT6gbvn8avvVwNV1+w+oW15tWyvn+uMG9nlcG/1/Hvg81P7+xgb0+wpwXvW1Xpl5JXBltfwy0GuN7V8CvrRGWxN19VtznGt404aMQ5IkSeoM/NsTkiRJkoraoj7rLyKmAIev0fzVzPzu5jpHZl70ao6LiOHANWs0r8jMN2zyoCRJkqStyBYVQjLzzI4eQ1sycw61v0kiSZIkaR28HUuSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSJBVlCJEkSZJUlCFEkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSJBXVo6MHoPK226Y78yeP7uhhdBmNjY00ndLQ0cPoEqx1Wda7LOtdjrWW2p8zIZIkSZKKMoRIkiRJKsoQIkmSJKkoQ4gkSZKkogwhkiRJkoo
yhEiSJEkqyhAiSZIkqShDiCRJkqSiDCGSJEmSijKESJIkSSrKECJJkiSpKEOIJEmSpKIMIZIkSZKK6tHRA1B5L768kgETftbRw+gyzh7ezDjrXYS1Lst6l9Xe9W6aPLrd+pakNTkTIkmSJKkoQ4gkSZKkogwhkiRJkooyhEiSJEkqyhAiSZIkqShDiCRJkqSiDCGSJEmSijKESJIkSSrKECJJkiSpKEOIJEmSpKIMIZIkSZKKMoRIkiRJKsoQIkmSJKkoQ4gkSZKkogwhkiRJkooyhEiSJEkqyhAiSZIkqShDiCRJkqSiDCGSJEmSijKESJIkSSrKECJJkiSpKEOIJEmSpKIMIZIkSZKKMoRIkiRJKsoQIkmSABg/fjx9+/Zl2LBhLW1PP/00I0eOZN9992XkyJE888wzLdsefPBBDjvsMIYOHcrw4cN56aWXAGhoaGDw4MGMGDGCESNGsHTp0lbPN2nSJAYNGsTgwYO55ZZb2vfiJHUqhhBJkgTAuHHjmDlz5mptkydP5uijj2bBggUcffTRTJ48GYDm5mbe9773cfnllzN37lwaGxvZZpttWo679tprmT17NrNnz6Zv375rnWvevHlMmzaNuXPnMnPmTM444wxWrlzZvhcoqdMwhLSTiHjhVRxz3hrrd1ffB0TEe+vaGyLip5s+SkmS/u7II4+kd+/eq7VNnz6dU089FYBTTz2Vm266CYBf/OIXHHDAARx44IEAvO51r6N79+4bfK7p06dz0kkn0bNnTwYOHMigQYO47777Ns+FSOr0DCGbQUT02ExdrRZCMvON1eIA4L1r7S1JUjt78skn6devHwD9+vVrubXq4YcfJiI49thjOfjgg/niF7+42nGnnXYaI0aM4HOf+xyZuVa/ixcvZo899mhZ79+/P4sXL27HK5HUmWyuN8+bVUQMAH4O3AW8EVgMHF+1nZOZsyKiDzArMwdExDjgBKA7MAy4DHgNMBZYAbw9M59u5Tx9gZ9n5iERcSAwG9grM/8cEf8DDAd2Aa6qvv8FOK3afjXwNHAQ8NuI+AbwA2o1nVl3jn7AD4Edq23/kpl3tjKWycB2ETEbmJuZp0TEC5m5AzAZ+Mdq21Tgd3XH9QK+Xo21B3BRZk5vpf/TgdMB+vTZhQuHN7dae21+u24HZ1vvIqx1Wda7rPaud2NjIwBPPPEEy5Yta1lvbm5uWa5fnz9/Pv/93//N5ZdfTs+ePTn77LPp3r07hxxyCGeeeSa77LILy5cv57Of/SzLly/n2GOPXe18ixYt4qGHHmrpe8mSJcydO5c+ffq02zVuqBdeeGG1a1b7st5ldZZ6d8oQUtkXODkzPxQR1wPvWs/+w6gFgm2BR4BPZ+ZBEfFl4P3AV9Y8IDOXRsS2EbEjcAQwCzgiIu4Clmbm8ipcfC8zp0bEeOBr1AIPwH7AMZm5MiJmAN/KzO9FxJl1p3kvcEtmfj4iugPbtzb4zJwQEWdl5ohWNk+gFr6Og9rtWHXbJgK/zMzxEbEzcF9E/HdmLluj/yuAKwD23HtQXjanM//oty5nD2/Gepdhrcuy3mW1d72bTmmofW9qolevXjQ01NZ33313Bg8eTL9+/ViyZAm77bYbDQ0NPPHEE7z44oscf/zxANx///288sorLcetsnTpUmbNmrVW+z333APQ0j5p0iRGjRrFYYcd1l6XuMEaGxvXGq/aj/Uuq7PUuzPfjrUwM2dXyw9QuyVpXW7PzOcz8y/Ac8BPqvY56zn2buBw4Ejg36vvRwCrZisOozbDAXAN8Ka6Y3+UmaueojscuK5uv1XuB06LiIuA4Zn5/HquY2ONAiZUsySN1ELYnpv5HJKkLmrMmDFMnToVgKlTp7aEjmOPPZYHH3yQ5cuX09zczK9+9SuGDBlCc3MzTz31FAAvv/wyP/3pT1f7tK36fqdNm8aKFStYuHAhCxYs4NBDDy13YZI6VGf+FdaKuuWVwHZAM38PTtuuY/9X6tZfYd3XeSe10LE
XMB34NJBAWw9+19/Yumwd22oNmXdExJHAaOCaiPiPzPzeOsazsQJ4V2bO34x9SpK6oJNPPpnGxkaeeuop+vfvz8UXX8yECRM48cQTufLKK9lzzz350Y9+BMA//MM/8MlPfpJ/+qd/IiJ4+9vfzujRo1m2bBnHHnssL7/8MitXruSYY47hQx/6EAAzZsxg1qxZXHLJJQwdOpQTTzyRIUOG0KNHD6ZMmbJRD7ZL2rJ15hDSmibgEOA+4N2bqc87gH8D7sjMVyLiaeDtwGeq7XcDJ1Gb3TiF2nMqrfl1td/3q/0AiIi9gMWZ+e3q+Y2DgbZCyMsRsU1mvrxG+/PAa9s45hbgXyPiXzMzI+KgzPxdG/tKktSm6667rtX22267rdX2973vfbzvfe9bra1Xr1488MADre4/ZswYxowZ07I+ceJEJk6c+CpHK2lL1plvx2rNpcC/VB9du1meXMvMpmrxjur7XcCzmbnqrzF9lNrtVA9Se9D9Y2109THgzIi4H9iprr0BmB0Rv6P2XMtX1zGcK4AHI+LaNdofBJoj4vcR8Yk1tn0O2KY67g/VuiRJktRpdcqZkCoYDKtbv7Ru8wF1y+dX268Grq7bf0Dd8mrb2jjfnnXL/07t2ZD6sbyllWPGrbG+kNrzI6tMrtqnUvtEq/XKzE9Tux1s1foO1feXgaPX2L2x2vYi8OEN6V+SJEnqDLa0mRBJkiRJW7hOORPSHiJiCrVPsKr31cz8bgeM5V6g5xrNYzNzTumxSJIkSaV1mRCSmWeuf68yMvMNHT0GSZIkqaN4O5YkSZKkogwhkiRJkooyhEiSJEkqyhAiSZIkqShDiCRJkqSiDCGSJEmSijKESJIkSSrKECJJkiSpKEOIJEmSpKIMIZIkSZKKMoRIkiRJKsoQIkmSJKkoQ4gkSZKkogwhkiRJkooyhEiSJEkqyhAiSZIkqShDiCRJkqSiDCGSJEmSijKESJIkSSqqR0cPQOVtt0135k8e3dHD6DIaGxtpOqWho4fRJVjrsqx3WdZb0tbEmRBJkiRJRRlCJEmSJBVlCJEkSZJUlCFEkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSJBXVo6MHoPJefHklAyb8rKOH0WWcPbyZcda7CGtdlvUua0Pr3TR5dIHRSNKmcSZEkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSJBVlCJEkSZJUlCFEkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSJBVlCJEkSZJUlCFEkqSt0Pjx4+nbty/Dhg1raXv66acZOXIk++67LyNHjuSZZ54B4NZbb+WQQw5h+PDhHHLIIfzyl79sOeaHP/whBxxwAEOHDuVTn/pUm+ebNGkSgwYNYvDgwdxyyy3td2GStgqGkFchIgZExHs3sY+PR8T2des3R8TOmzw4SZKAcePGMXPmzNXaJk+ezNFHH82CBQs4+uijmTx5MgB9+vThJz/5CXPmzGHq1KmMHTsWgL/+9a+ce+653HbbbcydO5cnn3yS2267ba1zzZs3j2nTpjF37lxmzpzJGWecwcqVK9v/IiVtsQwhr84AYJNCCPBxoCWEZObbM/PZTexTkiQAjjzySHr37r1a2/Tp0zn11FMBOPXUU7npppsAOOigg9htt90AGDp0KC+99BIrVqzgT3/6E/vttx+77LILAMcccww33HDDWueaPn06J510Ej179mTgwIEMGjSI++67rx2vTtKWrlgIqWYPHoqIb0fE3Ij4RURsFxGNEfH6ap8+EdFULY+LiJsi4icRsTAizoqIT0bE7yL
iNxHRex3naoyIr0TE3RHxh4g4tGq/KCLOqdvvD9W4Wh1btc+giPjviPh9RPw2IvYBJgNHRMTsiPhENdZv1PX704hoqJa/FRGzqn4vrto+CuwG3B4Rt1dtTRHRp1r+ZDW2P0TEx9dVv1X9RcS8iHgwIqZtjp+XJGnr8+STT9KvXz8A+vXrx9KlS9fa54YbbuCggw6iZ8+eDBo0iD/+8Y80NTXR3NzMTTfdxGOPPbbWMYsXL2aPPfZoWe/fvz+LFy9uvwuRtMXrUfh8+wInZ+aHIuJ64F3r2X8YcBCwLfAI8OnMPCgivgy8H/jKOo7tlZlvjIgjgauqvjZ2bN8HrgUmZ+aNEbEtteA2ATgnM4+DWmBaR78TM/PpiOgO3BYRB2Tm1yLik8CbM/Op+p0j4hDgNOANQAD3RsSvgGfWMcYJwMDMXNHWLV0RcTpwOkCfPrtw4fDm9ZRDm8uu28HZ1rsIa12W9S5rQ+vd2NjYsvzEE0+wbNmylrbm5ubVtq+5vnDhQs4//3y++MUvtrSfccYZvO1tb6Nbt24MHTqUZ599drVjABYtWsRDDz3U0r5kyRLmzp1Lnz59XsWVdrwXXnhhrWtU+7HeZXWWepcOIQszc3a1/AC125rW5fbMfB54PiKeA35Stc8BDljPsdcBZOYdEbHjBjxvsdbYIuK1wO6ZeWPV10sAEbGerlZzYhUAegD9gCHAg+vY/03AjZm5rDrXfwFHADNaG2O1/CBwbUTcBNzUWqeZeQVwBcCeew/Ky+aU/tF3XWcPb8Z6l2Gty7LeZW1ovZtOafj7clMTvXr1oqGh1rb77rszePBg+vXrx5IlS9htt91ati1atIjTTz+d66+/nsMPP7ylj4aGBs477zwArrjiCh555JGWY1a55557WvaF2kPqo0aN4rDDDnt1F9vBGhsb17pGtR/rXVZnqXfpZ0JW1C2vpPbGvLluHNuuY/9X6tZfYf0BKltZrz/XmudrbWwbmjZa7TciBgLnAEdn5gHAz1j7Gte0rnO2NkaA0cAU4BDggYjwXYEkaS1jxoxh6tSpAEydOpXjjz8egGeffZbRo0czadKk1QII0HLL1jPPPMM3v/lNPvjBD7ba77Rp01ixYgULFy5kwYIFHHrooe18NZK2ZJ3hwfQmam+eAd69Gfv9fwAR8Sbgucx8rjrXwVX7wcDAdXWQmf8LLIqIE6pjelafaPU88Nq6XZuAERHRLSL2AFb9y7sjsAx4LiJ2Bd5Wd8yafaxyB3BCRGwfEb2AfwbubGuMEdEN2CMzbwc+BewM7LCu65Ikbf1OPvlkDjvsMObPn0///v258sormTBhArfeeiv77rsvt956KxMmTADgG9/4Bo888gif+9znGDFiBCNGjGgJHx/72McYMmQIhx9+OBMmTGC//fYDYMaMGVx44YVA7WH2E088kSFDhvDWt76VKVOm0L179465cElbhM7wG/NLgesjYizwy/XtvBGeiYi7qQWB8VXbDcD7I2I2cD/w8Ab0Mxb4z4i4BHgZeA+125+aI+L3wNXUnk1ZSO02sT8AvwXIzN9HxO+AucCfgF/X9XsF8POIWJKZb17VmJm/jYirgVUfK/KdzPxdRAxoY3zdge9HxE7UZlG+7KdsSZKuu+66Vttb+4jd888/n/PPP3+j+hkzZgxjxoxpWZ84cSITJ058FSOV1BUVCyGZ2UTdw+GZeWnd5vrnO86vtl9N7Q3+qv0H1C2vtq0NN2TmZ9YYw4vAqDb2b3VsmbkAeEsr+x+9xvoprXWamePaaP868PW69QF1y18CvrTG/k1tjZHacySSJEnSFqEz3I4lSZIkqQvpDLdjvWoRMQU4fI3mr2ZmQwcMR5IkSdIG2KJDSGae2dFjkCRJkrRxvB1LkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSJBVlCJEkSZJUlCF
EkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVFSPjh6Ayttum+7Mnzy6o4fRZTQ2NtJ0SkNHD6NLsNZlWe+yrLekrYkzIZIkSZKKMoRIkiRJKsoQIkmSJKkoQ4gkSZKkogwhkiRJkooyhEiSJEkqyhAiSZIkqShDiCRJkqSiDCGSJEmSijKESJIkSSrKECJJkiSpKEOIJEmSpKIMIZIkSZKK6tHRA1B5L768kgETftbRw+gyzh7ezDjrXYS1Lqu96t00efRm71OS1Lk4EyJJkiSpKEOIJEmSpKIMIZIkSZKKMoRIkiRJKsoQIkmSJKkoQ4gkSZKkogwhkiRJkooyhEiSJEkqyhAiSZIkqShDiCRJkqSiDCGSJEmSijKESJIkSSrKECJJkiSpKEOIJEmSpKIMIZIkSZKKMoRIkiRJKsoQIkmSJKkoQ4gkSZKkogwhkiRJkooyhEiSJEkqyhAiSZIkqShDiCRJkqSiDCGSJEmSijKESJIkSSrKECJJ6pS+/OUvM3ToUIYNG8bJJ5/MSy+9xNNPP83IkSPZd999GTlyJM888wwAf/vb3zjttNMYPnw4Bx54II2Nja322dbxkqSyDCGSpE5n8eLFfO1rX2PWrFn84Q9/YOXKlUybNo3Jkydz9NFHs2DBAo4++mgmT54MwLe//W0A5syZw6233srZZ5/NK6+8sla/bR0vSSrLELKRIuKFjh6DJHUFzc3NvPjiizQ3N7N8+XJ22203pk+fzqmnngrAqaeeyk033QTAvHnzOProowHo27cvO++8M7NmzVqrz7aOlySVZQhZh4jo0dFjeDUiontHj0GSNsXuu+/OOeecw5577km/fv3YaaedGDVqFE8++ST9+vUDoF+/fixduhSAAw88kOnTp9Pc3MzChQt54IEHeOyxx9bqt63jJUllFX2THREDgJ8DdwFvBBYDx1dt52TmrIjoA8zKzAERMQ44AegODAMuA14DjAVWAG/PzKdbOU9f4OeZeUhEHAjMBvbKzD9HxP8Aw4FdgKuq738BTqu2Xw08DRwE/DYivgH8gFqtZtadox/wQ2DHatu/ZOadbVz3C5m5Q7X8buC4zBxXneslYCiwK/DJzPxpdd3/DPQEBgI/yMyLq+PfB3y0qsO9wBmZubKaofkScCxwdlXj+jGcDpwO0KfPLlw4vLm1oaod7LodnG29i7DWZbVXvRsbG3n++eeZOnUq3//+99lhhx246KKLmDhxIs3Nzas977FqfZ999uHWW29l//33Z9ddd2X//ffnoYceWuvZkLaO3xK88MILW8xYt3TWuizrXVZnqXdH/KZ/X+DkzPxQRFwPvGs9+w+jFgi2BR4BPp2ZB0XEl4H3A19Z84DMXBoR20bEjsARwCzgiIi4C1iamcurcPG9zJwaEeOBr1ELPAD7AcdUb+5nAN/KzO9FxJl1p3kvcEtmfr6aedj+1RQDGAAcBewD3B4Rg6r2Q6trXw7cHxE/A5YB/w84PDNfjohvAqcA3wN6AX/IzAtbO0lmXgFcAbDn3oPysjlb5CTPFuns4c1Y7zKsdVntVe+mUxr40Y9+xEEHHcQJJ5wAwOOPP85vfvMbdt99dwYPHky/fv1YsmQJu+22Gw0NDQAtt2MBvPGNb+Sd73wnQ4YMWa3vdR3f2TU2Nm4xY93SWeuyrHdZnaXeHXE71sLMnF0tP0DtTfi63J6Zz2fmX4DngJ9U7XPWc+zdwOHAkcC/V9+PAFbNVhxGbYYD4BrgTXXH/igzV1bLhwPX1e23yv3AaRFxETA8M59fz3W05frMfCUzFwB/Avav2m/NzL9m5ovAf1XjOxo4hFoomV2t713tvxK44VWOQZI6lT333JPf/OY3LF++nMzktttu4x//8R8ZM2YMU6dOBWDq1Kkcf/zxACxfvpxly5YBcOutt9KjR4+1AgjQ5vGSpLI64leGK+qWVwLbAc38PRBtu479X6l
bf4V1j/9OaqFjL2A68GkggZ+2sX/WLS9bx7ZaQ+YdEXEkMBq4JiL+IzO/twF9r3l9a/ad62gPYGpmfqaVc7xUF5wkaYv2hje8gXe/+90cfPDB9OjRg4MOOojTTz+dF154gRNPPJErr7ySPffckx/96EcALF26lGOPPZZu3bqx++67c801f/+d0Qc/+EE+8pGP8PrXv54JEya0erwkqazOct9CE7Xf8N8HvHsz9XkH8G/AHZn5SkQ8DbwdWPUG/m7gJGqzG6ewxjMUdX5d7ff9aj8AImIvYHFmfjsiegEHU7stqjVPRsQ/AvOpPetRP2vynoiYSu3Zj72rfQ4CRkZEb+BFareJjad2a9b0iPhydctZb+C1mfnoBtZEkrYYF198MRdffPFqbT179uS2225ba98BAwYwf/78Vvv5zne+07L8ute9rtXjJUlldZZPx7oU+JeIuBvoszk6zMymavGO6vtdwLOZueovU32U2u1UD1J70P1jbXT1MeDMiLgf2KmuvQGYHRG/o/Zcy1fXMZwJ1GZgfgksWWPbfOBX1B7O/0hmvlQ33muoPVR/Q2bOysx5wPnAL6px3wr0W8d5JUmSpE6n6ExIFQyG1a1fWrf5gLrl86vtVwNX1+0/oG55tW1tnG/PuuV/p/ZsSP1Y3tLKMePWWF9I7fmRVSZX7VOBqes6f10fPwZ+3MbmX2fmJ1ppX5qZZ7XS1w+pfSrXmu07bMhYJEmSpI7WWWZCJEmSJHURneWZkFctIqZQ+wSrel/NzO92wFjupfa3PeqNzcw5re2/5qxLXfvVrGeWR5IkSdpSbfEhJDPPXP9eZWTmGzp6DJIkSVJn5+1YkiRJkooyhEiSJEkqyhAiSZIkqShDiCRJkqSiDCGSJEmSijKESJIkSSrKECJJkiSpKEOIJEmSpKIMIZIkSZKKMoRIkiRJKsoQIkmSJKkoQ4gkSZKkogwhkiRJkooyhEiSJEkqyhAiSZIkqShDiCRJkqSiDCGSJEmSijKESJIkSSrKECJJkiSpqB4dPQCVt9023Zk/eXRHD6PLaGxspOmUho4eRpdgrcuy3pKkV8uZEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSJBVlCJEkSZJUlCFEkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSJBVlCJEkSZJUlCFEkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQhRJIkSVJRhhBJkiRJRRlCJEmSJBVlCJEkSZJUlCFEkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVFRkZkePQYVFxPPA/I4eRxfSB3iqowfRRVjrsqx3Wda7HGtdlvUuq2S998rMXVrb0KPQANS5zM/M13f0ILqKiJhlvcuw1mVZ77KsdznWuizrXVZnqbe3Y0mSJEkqyhAiSZIkqShDSNd0RUcPoIux3uVY67Ksd1nWuxxrXZb1LqtT1NsH0yVJkiQV5UyIJEmSpKIMIZIkSZKKMoR0IRHx1oiYHxGPRMSEjh7P1iAi9oiI2yPioYiYGxEfq9oviojFETG7+np73TGfqX4G8yPi2I4b/ZYpIpoiYk5V11lVW++IuDUiFlTf/6Fuf+v9KkTE4LrX7+yI+N+I+Liv7c0nIq6KiKUR8Ye6to1+LUfEIdV/E49ExNciIkpfy5agjXr/R0T8MSIejIgbI2Lnqn1ARLxY9zq/vO4Y670B2qj3Rv/7Yb3Xr41a/7Cuzk0RMbtq7zyv7cz0qwt8Ad2B/wH2Bl4D/B4Y0tHj2tK/gH7AwdXya4GHgSHARcA5rew/pKp
9T2Bg9TPp3tHXsSV9AU1AnzXavghMqJYnAF+w3pu15t2BJ4C9fG1v1roeCRwM/KGubaNfy8B9wGFAAD8H3tbR19YZv9qo9yigR7X8hbp6D6jfb41+rPerr/dG//thvV9drdfYfhlwYbXcaV7bzoR0HYcCj2TmnzLzb8A04PgOHtMWLzOXZOZvq+XngYeA3ddxyPHAtMxckZkLgUeo/Wy0aY4HplbLU4ET6tqt96Y7GvifzHx0HftY642UmXcAT6/RvFGv5YjoB+yYmfdk7V3E9+qOUZ3W6p2Zv8jM5mr1N0D/dfVhvTdcG6/vtvj63gTrqnU1m3EicN26+uiIWhtCuo7dgcfq1hex7jfL2kgRMQA4CLi3ajqrmuK/qu6WCn8Omy6BX0TEAxFxetW2a2YugVowBPpW7dZ78ziJ1f8H5mu7/Wzsa3n3annNdm288dR++7vKwIj4XUT8KiKOqNqs96bbmH8/rPemOwJ4MjMX1LV1ite2IaTraO2+Pj+feTOJiB2AG4CPZ+b/At8C9gFGAEuoTYWCP4fN4fDMPBh4G3BmRBy5jn2t9yaKiNcAY4AfVU2+tjtGW/W17ptBREwEmoFrq6YlwJ6ZeRDwSeAHEbEj1ntTbey/H9Z7053M6r9E6jSvbUNI17EI2KNuvT/weAeNZasSEdtQCyDXZuZ/AWTmk5m5MjNfAb7N329L8eewiTLz8er7UuBGarV9sppKXjWlvLTa3XpvurcBv83MJ8HXdgEb+1pexOq3EFn3jRQRpwLHAadUt6FQ3Rb012r5AWrPKOyH9d4kr+LfD+u9CSKiB/BO4Ier2jrTa9sQ0nXcD+wbEQOr32yeBMzo4DFt8ap7La8EHsrML9W196vb7Z+BVZ9YMQM4KSJ6RsRAYF9qD4JpA0REr4h47aplag+V/oFaXU+tdjsVmF4tW+9Nt9pv0Xxtt7uNei1Xt2w9HxH/t/r36P11x2g9IuKtwKeBMZm5vK59l4joXi3vTa3ef7Lem2Zj//2w3pvsGOCPmdlym1Vnem33aM/O1XlkZnNEnAXcQu2Tbq7KzLkdPKytweHAWGDOqo+/A84DTo6IEdSmMpuADwNk5tyIuB6YR23q/8zMXFl4zFuyXYEbq08N7AH8IDNnRsT9wPUR8QHgz8B7wHpvqojYHhhJ9fqtfNHX9uYREdcBDUCfiFgEfBaYzMa/lv8FuBrYjtozDfXPNajSRr0/Q+0TmW6t/l35TWZ+hNqnDV0SEc3ASuAjmbnqwV/rvQHaqHfDq/j3w3qvR2u1zswrWft5PuhEr+2oZh4lSZIkqQhvx5IkSZJUlCFEkiRJUlGGEEmSJElFGUIkSZIkFWUIkSRJklSUH9ErSdrqRcRKYE5d0wmZ2dRBw5GkLs+P6JUkbfUi4oXM3KHg+XpkZnOp80nSlsbbsSRJXV5E9IuIOyJidkT8ISKOqNrfGhG/jYjfR8RtVVvviLgpIh6MiN9ExAFV+0URcUVE/AL4XvWXiW+IiPurr8M78BIlqVPxdixJUlewXUTMrpYXZuY/r7H9vcAtmfn5iOgObB8RuwDfBo7MzIUR0bva92Lgd5l5QkS8BfgeMKLadgjwpsx8MSJ+AHw5M++KiD2BW4B/bLcrlKQtiCFEktQVvJiZI9ax/X7gqojYBrgpM2dHRANwR2YuBMjMp6t93wS8q2r7ZUS8LiJ2qrbNyMwXq+VjgCERseocO0bEazPz+c11UZK0pTKESJK6vMy8IyKOBEYD10TEfwDPAq09OBmttK3ab1ldWzfgsLpQIkmq+EyIJKnLi4i9gKWZ+W3gSuBg4B7gqIgYWO2z6nasO4BTqrYG4KnM/N9Wuv0FcFbdOUa00/AlaYvjTIgkSdAAnBsRLwMvAO/PzL9ExOnAf0VEN2ApMBK4CPhuRDwILAdObaPPjwJTqv16UAsvH2nXq5CkLYQf0StJkiSpKG/HkiRJklSUIUSSJElSUYYQSZIkSUUZQiRJkiQVZQiRJEmSVJQ
hRJIkSVJRhhBJkiRJRf1/rynVfd2eoKwAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "### Plot the important variables ###\n", "fig, ax = plt.subplots(figsize=(12,12))\n", "xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "739ce74e-af56-4fd6-a7b3-fe809f33bc54", "_uuid": "25225b8d3a5ed46ec90308478b3788760eb5ea1d" }, "source": [ "The number of characters, the mean word length and the number of unique words turn out to be the top 3 variables. Now let us focus on creating some text-based features.\n", "\n", "**Text-based features:**\n", "\n", "One of the basic features we can create is the tf-idf value of the words present in the text, so we can start with that." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "_cell_guid": "dd89dcab-b7a2-4b11-9564-ac8558f938e0", "_uuid": "41b4430c7e9699bd7f430c471009082bf7449928" }, "outputs": [], "source": [ "### Fit transform the tfidf vectorizer ###\n", "tfidf_vec = TfidfVectorizer(stop_words='english', ngram_range=(1,3))\n", "\n", "full_tfidf = tfidf_vec.fit_transform(train_df['text'].values.tolist() + test_df['text'].values.tolist())\n", "train_tfidf = tfidf_vec.transform(train_df['text'].values.tolist())\n", "test_tfidf = tfidf_vec.transform(test_df['text'].values.tolist())" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "fb46183a-43b6-4785-823a-f017a0821b78", "_uuid": "9c51efb2c3ff436baf8811a298b699b430b812bf" }, "source": [ "Now that we have the tfidf vectors, here is the tricky part. \n", "The tfidf output is a sparse matrix, so if we have to use it together with other dense features, we have a couple of choices.\n", "\n", "1. We can take the top 'n' features (depending on the system configuration) from the tfidf vectorizer, convert them into dense format and concatenate them with the other features.\n", "2. We can build a model using only the sparse features and then use its predictions as one of the features alongside the other dense features.\n", "\n", "Depending on the dataset, one may perform better than the other. \n", "Here we can use the second approach, since there is a very [good scoring kernel](https://www.kaggle.com/the1owl/python-tell-tale-tutorial) that uses all the tfidf features.\n", "\n", "Also, [Naive Bayes seems to perform better](https://www.kaggle.com/thomasnelson/spooky-simple-naive-bayes-scores-0-399) on this dataset. 
\n", "So, since it is also faster to train, we can build a Naive Bayes model on the tfidf features." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "_cell_guid": "7ce26604-5935-41cd-85ac-33e499a186f1", "_uuid": "924a649ddf02867e7bd1e68d8cab33d3c36bb83e" }, "outputs": [], "source": [ "from sklearn.naive_bayes import MultinomialNB\n", "\n", "# runMNB(dev_X, dev_y, val_X, val_y, test_tfidf) # test_tfidf : 8392x550841\n", "def runMNB(train_X, train_y, test_X, test_y, test_X2):\n", "    model = MultinomialNB()\n", "    model.fit(train_X, train_y)\n", "\n", "    pred_test_y = model.predict_proba(test_X)\n", "    pred_test_y2 = model.predict_proba(test_X2)\n", "\n", "    return pred_test_y, pred_test_y2, model" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "a409c693-6d31-485d-b05f-d98008a1c185", "_uuid": "f3ef5dd66afde39552eaa8f59929b263b831b190" }, "source": [ "**Naive Bayes on Word Tfidf Vectorizer:**" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<19579x550841 sparse matrix of type ''\n", "\twith 611707 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_tfidf" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 0\n", "3 2\n", "4 1\n", " ..\n", "19574 0\n", "19575 0\n", "19576 0\n", "19577 0\n", "19578 1\n", "Name: author, Length: 19579, dtype: int64" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_y" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 0\n", "3 2\n", "4 1\n", " ..\n", "19574 0\n", "19575 0\n", "19576 0\n", "19577 0\n", "19578 1\n", "Name: author, Length: 15664, dtype: int64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dev_y" ] }, { "cell_type": "code", 
"execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5 2\n", "6 0\n", "9 2\n", "22 0\n", "24 0\n", " ..\n", "19515 2\n", "19522 1\n", "19537 2\n", "19538 1\n", "19573 2\n", "Name: author, Length: 3915, dtype: int64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val_y" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<8392x550841 sparse matrix of type ''\n", "\twith 258688 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_tfidf" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.4144724 , 0.17076084, 0.41476676],\n", " [0.51718498, 0.24926282, 0.2335522 ],\n", " [0.39304418, 0.24555086, 0.36140496],\n", " ...,\n", " [0.37382573, 0.21526166, 0.41091261],\n", " [0.42593811, 0.29615677, 0.27790512],\n", " [0.31885599, 0.26637083, 0.41477318]])" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pred_val_y" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3915" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(pred_val_y)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8392" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(pred_test_y)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "_cell_guid": "61063815-507a-45f7-8866-5298da11e62d", "_uuid": "c0ed8394d5fbcfb833efa2836509583a2ec79de1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean cv score : 0.8422161983612855\n" ] } ], "source": [ "from sklearn.model_selection import KFold\n", "\n", "kf = KFold(n_splits=5, 
shuffle=True, random_state=2017)\n", "cv_scores = []\n", "pred_full_test = 0\n", "pred_train = np.zeros([train_df.shape[0], 3])\n", "\n", "for dev_index, val_index in kf.split(train_X):\n", "    dev_X, val_X = train_tfidf[dev_index], train_tfidf[val_index] # 15664x550841, 3915x550841\n", "    dev_y, val_y = train_y[dev_index], train_y[val_index] # 15664, 3915\n", "    pred_val_y, pred_test_y, model = runMNB(dev_X, dev_y, val_X, val_y, test_tfidf) # 3915,8392,model // test_tfidf : 8392x550841\n", "\n", "    pred_full_test = pred_full_test + pred_test_y # 8392\n", "    pred_train[val_index,:] = pred_val_y #3915\n", "    cv_scores.append(metrics.log_loss(val_y, pred_val_y)) # (y_true, y_pred) # 3915*5\n", "\n", "print(\"Mean cv score : \", np.mean(cv_scores))\n", "pred_full_test = pred_full_test / 5." ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "e40f18ae-9db6-43dd-a010-27a34b1aa33f", "_uuid": "1f6da22535edd4c6faf256836f925bb18d18aba0" }, "source": [ "Using the tfidf vectorizer we get a mean cross-validation mlogloss of 0.842, which is much better than the meta features. 
Let us look at the confusion matrix.\n", "\n", "Meta features -> xgb\n", "\n", "TfidfVectorizer -> MultinomialNB\n" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "_cell_guid": "586f6708-8cc1-4422-90dd-6e4bef3cf2c0", "_kg_hide-input": true, "_uuid": "2bd4a5be4f492f1eb75836a7fbaed8c4374701cf" }, "outputs": [], "source": [ "### Function to create confusion matrix ###\n", "import itertools\n", "from sklearn.metrics import confusion_matrix\n", "\n", "### From http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py #\n", "# plot_confusion_matrix(cnf_matrix, classes=['EAP', 'HPL', 'MWS'], title='Confusion matrix, without normalization')\n", "def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):\n", "    \"\"\"\n", "    This function prints and plots the confusion matrix.\n", "    Normalization can be applied by setting `normalize=True`.\n", "    \"\"\"\n", "    # Normalize each row by its total\n", "    if normalize:\n", "        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] # add an axis so the division broadcasts row-wise\n", "\n", "    plt.imshow(cm, interpolation='nearest', cmap=cmap)\n", "    plt.title(title)\n", "    plt.colorbar()\n", "    tick_marks = np.arange(len(classes))\n", "    plt.xticks(tick_marks, classes, rotation=45)\n", "    plt.yticks(tick_marks, classes)\n", "\n", "    # Write the value into each cell\n", "    fmt = '.2f' if normalize else 'd'\n", "    thresh = cm.max() / 2.\n", "    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n", "        plt.text(j, i, format(cm[i, j], fmt),\n", "                 horizontalalignment=\"center\",\n", "                 color=\"white\" if cm[i, j] > thresh else \"black\")\n", "\n", "    plt.tight_layout()\n", "    plt.ylabel('True label')\n", "    plt.xlabel('Predicted label')" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.4144724 , 0.17076084, 0.41476676],\n", " [0.51718498, 0.24926282, 0.2335522 ],\n", " [0.39304418, 0.24555086, 0.36140496],\n", " ...,\n", " 
[0.37382573, 0.21526166, 0.41091261],\n", " [0.42593811, 0.29615677, 0.27790512],\n", " [0.31885599, 0.26637083, 0.41477318]])" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pred_val_y # (3915, 3)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2, 0, 0, ..., 2, 0, 2], dtype=int64)" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.argmax(pred_val_y, axis=1)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1531, 17, 47],\n", " [ 554, 500, 44],\n", " [ 487, 11, 724]], dtype=int64)" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cnf_matrix" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1595],\n", " [1098],\n", " [1222]], dtype=int64)" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cnf_matrix.sum(axis=1)[:, np.newaxis]" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "_cell_guid": "800547fa-c070-4bdf-ad1a-1f97e7db3fdf", "_kg_hide-input": true, "_uuid": "b0792565a816242c8e29a242264c5e77ec90a7eb" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "cnf_matrix = confusion_matrix(val_y, np.argmax(pred_val_y, axis=1))\n", "np.set_printoptions(precision=2)\n", "\n", "# Plot non-normalized confusion matrix\n", "plt.figure(figsize=(8,8))\n", "plot_confusion_matrix(cnf_matrix, classes=['EAP', 'HPL', 'MWS'],\n", "                      title='Confusion matrix, without normalization')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "404695a0-c33b-4cba-9dd1-bf2a5e8c12fc", "_uuid": "e16c6d867d6bf1bb4e2661cc28841c672091fe0a" }, "source": [ "Many instances are predicted as EAP, so the model is heavily biased towards that class.\n", "\n", "**SVD on word TFIDF:**\n", "\n", "Since the tfidf vectors are sparse, another way to compress the information into a much more compact representation is SVD. SVD features have also generally worked well for me in past text-based competitions. So we can create SVD features on the word tfidf and add them to our feature set." ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "_cell_guid": "94d49082-2834-4633-8420-10f12f1b16eb", "_uuid": "95a7ae438276387fa9f2c4a2b25ee29f9b1f7d2d" }, "outputs": [], "source": [ "n_comp = 20\n", "svd_obj = TruncatedSVD(n_components=n_comp, algorithm='arpack')\n", "svd_obj.fit(full_tfidf)\n", "train_svd = pd.DataFrame(svd_obj.transform(train_tfidf))\n", "test_svd = pd.DataFrame(svd_obj.transform(test_tfidf))\n", " \n", "train_svd.columns = ['svd_word_'+str(i) for i in range(n_comp)]\n", "test_svd.columns = ['svd_word_'+str(i) for i in range(n_comp)]\n", "train_df = pd.concat([train_df, train_svd], axis=1)\n", "test_df = pd.concat([test_df, test_svd], axis=1)\n", "del full_tfidf, train_tfidf, test_tfidf, train_svd, test_svd" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "77309c44-f947-46f4-814c-7d7a01768293", "_uuid": "468ef624b3f6d752737b9c112ae6f0592a929da2" }, "source": [ "**Naive Bayes on Word Count Vectorizer:**" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "_cell_guid": "92ab9f7c-8c2c-4233-a6dc-eaa12ca7d144", "_uuid": "9cd43c20764c81a3216b8371c477ae5552e24c7d" }, "outputs": [], 
"source": [ "### Fit transform the count vectorizer ###\n", "tfidf_vec = CountVectorizer(stop_words='english', ngram_range=(1,3))\n", "\n", "tfidf_vec.fit(train_df['text'].values.tolist() + test_df['text'].values.tolist())\n", "train_tfidf = tfidf_vec.transform(train_df['text'].values.tolist())\n", "test_tfidf = tfidf_vec.transform(test_df['text'].values.tolist())" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "48d68a84-6c6d-40b3-9069-d4c324355fc6", "_uuid": "2e05fb42bcd850e74c767961f58400b0dac463ec" }, "source": [ "Now let us build a Multinomial NB model using the count vectorizer based features." ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "_cell_guid": "b7927348-b6b1-4960-b76c-ebf14fae37f0", "_uuid": "4f9354473820d59076bf991b5110fc4dd309c204" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean cv score : 0.45091841616567435\n" ] } ], "source": [ "kf = KFold(n_splits=5, shuffle=True, random_state=2017)\n", "cv_scores = []\n", "pred_full_test = 0\n", "pred_train = np.zeros([train_df.shape[0], 3])\n", "\n", "for dev_index, val_index in kf.split(train_X):\n", "    dev_X, val_X = train_tfidf[dev_index], train_tfidf[val_index]\n", "    dev_y, val_y = train_y[dev_index], train_y[val_index]\n", "\n", "    pred_val_y, pred_test_y, model = runMNB(dev_X, dev_y, val_X, val_y, test_tfidf)\n", "\n", "    pred_full_test = pred_full_test + pred_test_y\n", "    pred_train[val_index,:] = pred_val_y\n", "    cv_scores.append(metrics.log_loss(val_y, pred_val_y))\n", "\n", "print(\"Mean cv score : \", np.mean(cv_scores))\n", "pred_full_test = pred_full_test / 5.\n", "\n", "# add the predictions as new features #\n", "train_df[\"nb_cvec_eap\"] = pred_train[:,0]\n", "train_df[\"nb_cvec_hpl\"] = pred_train[:,1]\n", "train_df[\"nb_cvec_mws\"] = pred_train[:,2]\n", "test_df[\"nb_cvec_eap\"] = pred_full_test[:,0]\n", "test_df[\"nb_cvec_hpl\"] = pred_full_test[:,1]\n", "test_df[\"nb_cvec_mws\"] = pred_full_test[:,2]" ] }, { "cell_type": "code", "execution_count": 54, 
"metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id text num_words \\\n", "0 id02310 Still, as I urged our leaving Ireland with suc... 19 \n", "1 id24541 If a fire wanted fanning, it could readily be ... 62 \n", "2 id00134 And when they had broken down the frail door t... 33 \n", "3 id27757 While I was thinking how I should possibly man... 41 \n", "4 id04081 I am not sure to what limit his knowledge may ... 11 \n", "... ... ... ... \n", "8387 id11749 All this is now the fitter for my purpose. 9 \n", "8388 id10526 I fixed myself on a wide solitude. 7 \n", "8389 id13477 It is easily understood that what might improv... 25 \n", "8390 id13761 Be this as it may, I now began to feel the ins... 38 \n", "8391 id04282 Long winded, statistical, and drearily genealo... 38 \n", "\n", " num_unique_words num_chars num_stopwords num_punctuations \\\n", "0 19 110 9 3 \n", "1 49 330 33 7 \n", "2 30 189 15 3 \n", "3 34 223 19 5 \n", "4 11 53 6 1 \n", "... ... ... ... ... \n", "8387 9 42 7 1 \n", "8388 7 34 4 1 \n", "8389 24 150 11 2 \n", "8390 34 197 21 3 \n", "8391 33 247 18 5 \n", "\n", " num_words_upper num_words_title mean_word_len ... svd_word_13 \\\n", "0 1 3 4.842105 ... -0.011837 \n", "1 1 3 4.338710 ... -0.004397 \n", "2 0 1 4.757576 ... 0.006063 \n", "3 2 3 4.463415 ... 0.004783 \n", "4 1 1 3.909091 ... -0.001825 \n", "... ... ... ... ... ... \n", "8387 0 1 3.777778 ... 0.000610 \n", "8388 1 1 4.000000 ... -0.005626 \n", "8389 0 1 5.040000 ... -0.006167 \n", "8390 2 3 4.210526 ... -0.010635 \n", "8391 0 1 5.526316 ... -0.003923 \n", "\n", " svd_word_14 svd_word_15 svd_word_16 svd_word_17 svd_word_18 \\\n", "0 0.036064 -0.016591 -0.025580 -0.018785 0.031289 \n", "1 -0.000020 -0.008583 0.006335 -0.004216 0.001810 \n", "2 -0.003324 -0.009452 0.013239 0.004852 -0.007478 \n", "3 -0.006865 -0.007960 0.006763 0.002540 -0.004558 \n", "4 0.000123 -0.006504 0.002533 -0.004004 -0.002902 \n", "... ... ... ... ... ... 
\n", "8387 -0.001496 0.001063 -0.001688 -0.005155 -0.003901 \n", "8388 -0.006015 -0.008673 0.005675 0.004995 -0.004539 \n", "8389 0.000023 -0.012848 0.006857 -0.005566 0.006634 \n", "8390 -0.001895 -0.004086 -0.001623 -0.002973 -0.007277 \n", "8391 -0.006325 -0.006186 -0.007940 0.009861 0.017919 \n", "\n", " svd_word_19 nb_cvec_eap nb_cvec_hpl nb_cvec_mws \n", "0 -0.047220 0.021018 0.000595 0.978387 \n", "1 0.001767 0.999985 0.000009 0.000006 \n", "2 0.002786 0.217325 0.782527 0.000148 \n", "3 -0.000728 0.753591 0.246408 0.000001 \n", "4 0.000315 0.970950 0.021824 0.007226 \n", "... ... ... ... ... \n", "8387 0.001965 0.565949 0.106068 0.327983 \n", "8388 -0.000283 0.031598 0.033443 0.934959 \n", "8389 0.011775 0.999709 0.000120 0.000171 \n", "8390 0.004486 0.000689 0.000005 0.999307 \n", "8391 -0.013286 0.024705 0.975287 0.000008 \n", "\n", "[8392 rows x 33 columns]" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_df" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "_cell_guid": "5d412e2f-8988-4895-9434-33df93c263db", "_kg_hide-input": true, "_uuid": "75658bb392011311ffefbfc562483ac5e11a0cab" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "cnf_matrix = confusion_matrix(val_y, np.argmax(pred_val_y,axis=1))\n", "np.set_printoptions(precision=2)\n", "\n", "# Plot non-normalized confusion matrix\n", "plt.figure(figsize=(8,8))\n", "plot_confusion_matrix(cnf_matrix, classes=['EAP', 'HPL', 'MWS'],\n", "                      title='Confusion matrix of NB on word count, without normalization')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "5a5d31ea-278c-432a-9840-ca9209410ff7", "_uuid": "fb099e48fde5f3960bc2409d323749e855b2e269" }, "source": [ "Wow. Using the count vectorizer instead of the tfidf vectorizer, we get a cross-validation mlogloss of 0.451. The LB score with this model is 0.468. The confusion matrix also looks much better than the previous one.\n", "\n", "**Naive Bayes on Character Count Vectorizer:**\n", "\n", "One idea from \"data eyeballing\" is that counting the special characters might help. Instead of just counting the special characters, we can use the count vectorizer at the character level to get some features. Again we can run Multinomial NB on top of it." ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "_cell_guid": "aed10880-e717-4a04-85b9-a77f8f58aee9", "_uuid": "d26ba4563285c10418479465e10de172f527f2a5" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean cv score : 3.7507639226818825\n" ] } ], "source": [ "### Fit transform the character count vectorizer ###\n", "tfidf_vec = CountVectorizer(ngram_range=(1,7), analyzer='char')\n", "\n", "tfidf_vec.fit(train_df['text'].values.tolist() + test_df['text'].values.tolist())\n", "train_tfidf = tfidf_vec.transform(train_df['text'].values.tolist())\n", "test_tfidf = tfidf_vec.transform(test_df['text'].values.tolist())\n", "\n", "kf = KFold(n_splits=5, shuffle=True, random_state=2017)\n", "cv_scores = []\n", "pred_full_test = 0\n", "pred_train = np.zeros([train_df.shape[0], 3])\n", "\n", "for dev_index, val_index in kf.split(train_X):\n", "    dev_X, val_X = train_tfidf[dev_index], train_tfidf[val_index]\n", "    dev_y, val_y = train_y[dev_index], train_y[val_index]\n", "\n", "    pred_val_y, pred_test_y, 
model = runMNB(dev_X, dev_y, val_X, val_y, test_tfidf)\n", "\n", "    pred_full_test = pred_full_test + pred_test_y\n", "    pred_train[val_index,:] = pred_val_y\n", "    cv_scores.append(metrics.log_loss(val_y, pred_val_y))\n", "\n", "print(\"Mean cv score : \", np.mean(cv_scores))\n", "pred_full_test = pred_full_test / 5.\n", "\n", "# add the predictions as new features #\n", "train_df[\"nb_cvec_char_eap\"] = pred_train[:,0]\n", "train_df[\"nb_cvec_char_hpl\"] = pred_train[:,1]\n", "train_df[\"nb_cvec_char_mws\"] = pred_train[:,2]\n", "test_df[\"nb_cvec_char_eap\"] = pred_full_test[:,0]\n", "test_df[\"nb_cvec_char_hpl\"] = pred_full_test[:,1]\n", "test_df[\"nb_cvec_char_mws\"] = pred_full_test[:,2]" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "55e95b9a-807b-4e99-bf95-069a11e07a64", "_uuid": "227e565a4a8a6aa04dcfb648688bf0deb0853910" }, "source": [ "The cross-validation mlogloss is very high (poor) at 3.75. \n", "However, these predictions may add information that differs from the word-level features, so let us use them in the final model as well.\n", "\n", "**Naive Bayes on Character Tfidf Vectorizer:**\n", "\n", "Let us also get the Naive Bayes predictions on the character tfidf vectorizer." 
] }, { "cell_type": "code", "execution_count": 57, "metadata": { "_cell_guid": "a15097b8-f9c7-4b3b-815e-a8f4a5f2558e", "_uuid": "bfeaa62ae81e2af3fdb595422f08f560506c1f42" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean cv score : 0.790415258947421\n" ] } ], "source": [ "### Fit transform the tfidf vectorizer ###\n", "tfidf_vec = TfidfVectorizer(ngram_range=(1,5), analyzer='char')\n", "\n", "full_tfidf = tfidf_vec.fit_transform(train_df['text'].values.tolist() + test_df['text'].values.tolist())\n", "train_tfidf = tfidf_vec.transform(train_df['text'].values.tolist())\n", "test_tfidf = tfidf_vec.transform(test_df['text'].values.tolist())\n", "\n", "kf = KFold(n_splits=5, shuffle=True, random_state=2017)\n", "cv_scores = []\n", "pred_full_test = 0\n", "pred_train = np.zeros([train_df.shape[0], 3])\n", "\n", "for dev_index, val_index in kf.split(train_X):\n", "    dev_X, val_X = train_tfidf[dev_index], train_tfidf[val_index]\n", "    dev_y, val_y = train_y[dev_index], train_y[val_index]\n", "\n", "    pred_val_y, pred_test_y, model = runMNB(dev_X, dev_y, val_X, val_y, test_tfidf)\n", "\n", "    pred_full_test = pred_full_test + pred_test_y\n", "    pred_train[val_index,:] = pred_val_y\n", "    cv_scores.append(metrics.log_loss(val_y, pred_val_y))\n", "print(\"Mean cv score : \", np.mean(cv_scores))\n", "pred_full_test = pred_full_test / 5.\n", "\n", "# add the predictions as new features #\n", "train_df[\"nb_tfidf_char_eap\"] = pred_train[:,0]\n", "train_df[\"nb_tfidf_char_hpl\"] = pred_train[:,1]\n", "train_df[\"nb_tfidf_char_mws\"] = pred_train[:,2]\n", "test_df[\"nb_tfidf_char_eap\"] = pred_full_test[:,0]\n", "test_df[\"nb_tfidf_char_hpl\"] = pred_full_test[:,1]\n", "test_df[\"nb_tfidf_char_mws\"] = pred_full_test[:,2]" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "c807a434-c6d8-4a6c-a592-76c70bd043fe", "_uuid": "af6e2880a16f96304bd6556ed67d9eac5491a3bc" }, "source": [ "**SVD on Character TFIDF:**\n", "\n", "We can also create SVD features on the character tfidf features and use them for modeling."
 ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "_cell_guid": "a764f24e-b48d-4460-9f1b-d242599049bd", "_uuid": "bd34ef340648d0f9f000a916d8cbcb258dc22838" }, "outputs": [], "source": [ "n_comp = 20\n", "svd_obj = TruncatedSVD(n_components=n_comp, algorithm='arpack')\n", "svd_obj.fit(full_tfidf)\n", "train_svd = pd.DataFrame(svd_obj.transform(train_tfidf))\n", "test_svd = pd.DataFrame(svd_obj.transform(test_tfidf))\n", " \n", "train_svd.columns = ['svd_char_'+str(i) for i in range(n_comp)]\n", "test_svd.columns = ['svd_char_'+str(i) for i in range(n_comp)]\n", "train_df = pd.concat([train_df, train_svd], axis=1)\n", "test_df = pd.concat([test_df, test_svd], axis=1)\n", "del full_tfidf, train_tfidf, test_tfidf, train_svd, test_svd" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "365abbc9-d630-4075-a2aa-c025d146d6aa", "_uuid": "0577ac7e16cd158944d1ee3c27ca92c588e2d10a" }, "source": [ "**XGBoost model:**\n", "\n", "With these new features, we can now re-run the xgboost model and evaluate the results." ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " num_words num_unique_words num_chars num_stopwords num_punctuations \\\n", "0 19 19 110 9 3 \n", "1 62 49 330 33 7 \n", "2 33 30 189 15 3 \n", "3 41 34 223 19 5 \n", "4 11 11 53 6 1 \n", "... ... ... ... ... ... \n", "8387 9 9 42 7 1 \n", "8388 7 7 34 4 1 \n", "8389 25 24 150 11 2 \n", "8390 38 34 197 21 3 \n", "8391 38 33 247 18 5 \n", "\n", " num_words_upper num_words_title mean_word_len svd_word_0 svd_word_1 \\\n", "0 1 3 4.842105 0.024516 -0.010185 \n", "1 1 3 4.338710 0.022294 -0.011968 \n", "2 0 1 4.757576 0.016906 -0.008934 \n", "3 2 3 4.463415 0.013408 -0.007515 \n", "4 1 1 3.909091 0.012565 -0.003185 \n", "... ... ... ... ... ... \n", "8387 0 1 3.777778 0.006477 -0.004209 \n", "8388 1 1 4.000000 0.011401 -0.006803 \n", "8389 0 1 5.040000 0.024211 -0.013763 \n", "8390 2 3 4.210526 0.025443 -0.013794 \n", "8391 0 1 5.526316 0.021897 -0.010181 \n", "\n", " ... svd_char_10 svd_char_11 svd_char_12 svd_char_13 svd_char_14 \\\n", "0 ... -0.077185 -0.006559 0.005616 0.020747 -0.065907 \n", "1 ... 0.016415 0.011792 0.013476 0.083765 -0.027404 \n", "2 ... 0.011926 0.022561 0.042856 -0.001981 0.058037 \n", "3 ... -0.042390 -0.056576 0.053759 -0.026924 0.070788 \n", "4 ... 0.006790 0.042263 -0.018383 -0.039678 0.017349 \n", "... ... ... ... ... ... ... \n", "8387 ... 0.040577 0.057288 0.039300 -0.064571 0.013609 \n", "8388 ... 0.014870 -0.046629 0.011422 -0.052563 0.055345 \n", "8389 ... 0.022989 0.014606 0.000173 -0.044822 -0.065277 \n", "8390 ... -0.043491 0.046127 -0.041112 -0.021125 -0.030812 \n", "8391 ... -0.004355 -0.021253 0.018119 0.031552 -0.038812 \n", "\n", " svd_char_15 svd_char_16 svd_char_17 svd_char_18 svd_char_19 \n", "0 0.053055 0.040820 -0.060882 0.010134 -0.003119 \n", "1 0.055122 -0.067695 -0.003033 0.029863 0.006638 \n", "2 -0.027065 -0.004207 0.026371 -0.023764 0.021925 \n", "3 0.000571 -0.033236 -0.071819 0.017313 -0.027750 \n", "4 -0.002632 -0.014737 0.061780 -0.055344 0.054694 \n", "... ... ... ... ... ... 
\n", "8387 0.037641 -0.057116 0.055621 -0.021832 0.040692 \n", "8388 0.014132 -0.028990 -0.036881 -0.017690 0.022852 \n", "8389 0.018879 -0.007556 0.019218 0.024902 -0.025109 \n", "8390 0.020161 0.039208 -0.055308 -0.040508 -0.018515 \n", "8391 -0.014792 0.052156 0.022833 -0.002455 -0.028240 \n", "\n", "[8392 rows x 57 columns]" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_X" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "_cell_guid": "7091aede-e589-45d9-9a85-1560af442725", "_uuid": "633eb97dc8b15a7959b962c20222697590071af7" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[18:55:51] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/learner.cc:576: \n", "Parameters: { \"silent\" } might not be used.\n", "\n", " This could be a false alarm, with some parameters getting used by language bindings but\n", " then being mistakenly passed down to XGBoost core, or some parameter actually being used\n", " but getting flagged wrongly here. 
Please open an issue if you find any such cases.\n", "\n", "\n", "[0]\ttrain-mlogloss:1.00354\ttest-mlogloss:1.00354\n", "[20]\ttrain-mlogloss:0.40979\ttest-mlogloss:0.41471\n", "[40]\ttrain-mlogloss:0.33858\ttest-mlogloss:0.35061\n", "[60]\ttrain-mlogloss:0.31308\ttest-mlogloss:0.33272\n", "[80]\ttrain-mlogloss:0.29566\ttest-mlogloss:0.32390\n", "[100]\ttrain-mlogloss:0.28243\ttest-mlogloss:0.31827\n", "[120]\ttrain-mlogloss:0.27152\ttest-mlogloss:0.31475\n", "[140]\ttrain-mlogloss:0.26187\ttest-mlogloss:0.31187\n", "[160]\ttrain-mlogloss:0.25342\ttest-mlogloss:0.30982\n", "[180]\ttrain-mlogloss:0.24557\ttest-mlogloss:0.30863\n", "[200]\ttrain-mlogloss:0.23779\ttest-mlogloss:0.30748\n", "[220]\ttrain-mlogloss:0.23124\ttest-mlogloss:0.30644\n", "[240]\ttrain-mlogloss:0.22498\ttest-mlogloss:0.30562\n", "[260]\ttrain-mlogloss:0.21865\ttest-mlogloss:0.30576\n", "[280]\ttrain-mlogloss:0.21249\ttest-mlogloss:0.30538\n", "[300]\ttrain-mlogloss:0.20648\ttest-mlogloss:0.30545\n", "[320]\ttrain-mlogloss:0.20132\ttest-mlogloss:0.30555\n", "[340]\ttrain-mlogloss:0.19604\ttest-mlogloss:0.30558\n", "[344]\ttrain-mlogloss:0.19489\ttest-mlogloss:0.30562\n", "cv scores : [0.30519857905295633]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\HOME\\anaconda3\\lib\\site-packages\\xgboost\\core.py:105: UserWarning: ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n", " warnings.warn(\n" ] } ], "source": [ "cols_to_drop = ['id', 'text']\n", "train_X = train_df.drop(cols_to_drop+['author'], axis=1)\n", "test_X = test_df.drop(cols_to_drop, axis=1)\n", "\n", "kf = KFold(n_splits=5, shuffle=True, random_state=2017)\n", "cv_scores = []\n", "pred_full_test = 0\n", "pred_train = np.zeros([train_df.shape[0], 3])\n", "\n", "for dev_index, val_index in kf.split(train_X):\n", " dev_X, val_X = train_X.loc[dev_index], train_X.loc[val_index]\n", " dev_y, val_y = train_y[dev_index], train_y[val_index]\n", "\n", " pred_val_y, pred_test_y, model = 
runXGB(dev_X, dev_y, val_X, val_y, test_X, seed_val=0, colsample=0.7)\n", " \n", " pred_full_test = pred_full_test + pred_test_y\n", " pred_train[val_index,:] = pred_val_y\n", " cv_scores.append(metrics.log_loss(val_y, pred_val_y))\n", " break  # only the first fold is run here\n", "print(\"cv scores : \", cv_scores)\n", "\n", "out_df = pd.DataFrame(pred_full_test)\n", "out_df.columns = ['EAP', 'HPL', 'MWS']\n", "out_df.insert(0, 'id', test_id)\n", "out_df.to_csv(\"sub_fe.csv\", index=False)" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "7d93a2d6-435a-46bb-a522-c4d6b03d84f0", "_uuid": "4512493a6527cc244b9c47eb1ed9ff0dbce5eb5a" }, "source": [ "**This has a validation score of 0.3055 and an LB score of 0.32xx.** Running on all the folds could give a better score. Now let us check the important variables again." ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "_cell_guid": "ed36de2b-e0bb-42e6-bed8-bbbdeb513ba5", "_uuid": "ddefa6111b588449df9f2abbc541c0ca051174ab" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "### Plot the important variables ###\n", "fig, ax = plt.subplots(figsize=(12,12))\n", "xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "c4b258fe-ac27-4d5b-908c-eda43670fbf7", "_uuid": "c818e95a21559bdb4def3aed522b6616eb1a9de8" }, "source": [ "As expected, the Naive Bayes features are the most important variables. Now let us compute the confusion matrix to inspect the misclassification errors." ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "_cell_guid": "2f3aaa7b-396c-426b-89b3-1e66385946fd", "_uuid": "c195596cd7dff2de98bddec1453e30d8d5511169" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "cnf_matrix = confusion_matrix(val_y, np.argmax(pred_val_y,axis=1))\n", "np.set_printoptions(precision=2)\n", "\n", "# Plot non-normalized confusion matrix\n", "plt.figure(figsize=(8,8))\n", "plot_confusion_matrix(cnf_matrix, classes=['EAP', 'HPL', 'MWS'],\n", " title='Confusion matrix of XGB, without normalization')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "855a4e2b-d377-4a91-bac9-4202119c8666", "_uuid": "5e40512bdaf05ffbd7a1914735c46784d7c33329" }, "source": [ "EAP and MWS seem to be misclassified as each other more often than the other pairs. We could potentially create features that improve the predictions for this pair.\n", "\n", "**Next steps for this FE notebook:**\n", "* Use word-embedding-based features\n", "* Other meta features, if any\n", "* Sentiment of the sentences" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "7a43b172-b6b6-4fd2-89d6-7300672cc462", "_uuid": "117205aae679d7b6481df1c446cd42f41cf924de" }, "source": [ "**Ideas for further improvement:**\n", "* Parameter tuning for the tfidf and count vectorizers\n", "* Parameter tuning for the Naive Bayes and XGB models\n", "* Ensembling / stacking with other models" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 1 }