# Purpose
Predict which author wrote each text by extracting topic words.


Example) id, author1, author2, author3


"id02310",0.403493538995863,0.287808366106543,0.308698094897594

# Approaching (Almost) Any NLP Problem on Kaggle


### This covers:
- TF-IDF (Term Frequency-Inverse Document Frequency)
- Count Features
- Logistic Regression
- Naive Bayes
- SVM
- XGBoost
- Grid Search
- Word Vectors
- LSTM
- GRU
- Ensembling

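*Note*: This notebook is not meant to achieve a very high score on the leaderboard for this dataset. But if you follow it carefully, you can get a very high score with a little tuning. ;)

So, without wasting any time, let's start by importing some important Python modules that I will be using.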

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm

from sklearn.model_selection import train_test_split
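from sklearn.linear_model import LogisticRegression # classifier that predicts class membership as a continuous probability between 0 and 1
from sklearn.naive_bayes import MultinomialNB
# Naive Bayes classifiers are efficient because they learn parameters by treating each feature
# individually and collecting simple per-class statistics from each feature.
# GaussianNB: any continuous data; BernoulliNB: binary data;
# MultinomialNB: count data (each feature is an integer count, e.g. how often a word appears in a sentence).
# BernoulliNB and MultinomialNB are mostly used for text classification.
# MultinomialNB computes the per-class average of each feature; GaussianNB stores the per-class mean and standard deviation of each feature.
# GaussianNB is mostly used on high-dimensional data; the other two are used for sparse count data such as text.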

import xgboost as xgb
from sklearn.svm import SVC

from keras.models import Sequential
from keras.layers.recurrent import LSTM, GRU # recurrent layers: Long Short-Term Memory / Gated Recurrent Unit
from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D # Bidirectional: runs a recurrent layer in both directions
from keras.layers.core import Dense, Activation, Dropout
from tensorflow.keras.layers import BatchNormalization
from keras.layers.embeddings import Embedding # embedding: natural-language tokens mapped to numeric vectors a machine can process
from keras.utils import np_utils
from keras.preprocessing import sequence, text
from keras.callbacks import EarlyStopping


from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD # SVD: Singular Value Decomposition

from nltk import word_tokenize
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

Let's load the datasets

In [2]:
train = pd.read_csv('./input/train.csv')
test = pd.read_csv('./input/test.csv')
sample = pd.read_csv('./input/sample_submission.csv')

A quick look at the data

In [3]:
train.head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


In [4]:
test.head()

Unnamed: 0,id,text
0,id02310,"Still, as I urged our leaving Ireland with suc..."
1,id24541,"If a fire wanted fanning, it could readily be ..."
2,id00134,And when they had broken down the frail door t...
3,id27757,While I was thinking how I should possibly man...
4,id04081,I am not sure to what limit his knowledge may ...


In [5]:
sample.head()

Unnamed: 0,id,EAP,HPL,MWS
0,id02310,0.403494,0.287808,0.308698
1,id24541,0.403494,0.287808,0.308698
2,id00134,0.403494,0.287808,0.308698
3,id27757,0.403494,0.287808,0.308698
4,id04081,0.403494,0.287808,0.308698


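The problem requires us to predict the author, i.e. EAP, HPL or MWS, given the text. In simple words, it is text classification with 3 different classes.

For this particular problem, Kaggle has specified multi-class log loss as the evaluation metric. It is implemented as follows (taken from https://github.com/dnouri/nolearn/blob/master/nolearn/lasagne/util.py).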

In [6]:
def multiclass_logloss(actual, predicted, eps=1e-15):
    """Multi class version of Logarithmic Loss metric.
    :param actual: Array containing the actual target classes
    :param predicted: Matrix with class predictions, one probability per class
    """
    # Convert 'actual' to a binary (one-hot) array if it is not one already
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2

    # np.clip(array, min, max): values below min become min, values above max become max
    clip = np.clip(predicted, eps, 1 - eps)
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip))
    return -1.0 / rows * vsota
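As a quick sanity check of the metric, the sketch below evaluates the same formula on two hand-made predictions. The labels and probabilities are made-up illustration values, and `logloss_sketch` is a hypothetical compact restatement of the function above for integer labels, not a replacement for it.

```python
import numpy as np

def logloss_sketch(actual, predicted, eps=1e-15):
    """Compact restatement of multi-class log loss for integer labels."""
    actual = np.asarray(actual)
    predicted = np.clip(np.asarray(predicted, dtype=float), eps, 1 - eps)
    # pick the predicted probability of the true class in each row
    true_class_prob = predicted[np.arange(len(actual)), actual]
    return -np.mean(np.log(true_class_prob))

# hypothetical labels (0=EAP, 1=HPL, 2=MWS) and predicted probabilities
y_true = np.array([0, 2])
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.1, 0.8]])

loss = logloss_sketch(y_true, y_prob)
print("logloss: %0.3f" % loss)
```

Only the probability assigned to the true class enters the sum. A uniform prediction of 1/3 per class scores ln 3 ≈ 1.099, which is a useful baseline when reading the scores in the rest of the notebook.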

We use scikit-learn's LabelEncoder to convert the text labels to the integers 0, 1, 2.

In [7]:
lbl_enc = preprocessing.LabelEncoder()
y = lbl_enc.fit_transform(train.author.values)
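To see which integer each author receives, here is a small sketch on a hypothetical list of labels in the same format as `train.author`. `LabelEncoder` stores its classes in sorted order, so the integer codes follow the same EAP, HPL, MWS order as the columns of the sample submission.

```python
from sklearn.preprocessing import LabelEncoder

# hypothetical author labels in the same format as train.author
authors = ["MWS", "EAP", "HPL", "EAP"]

enc = LabelEncoder()
codes = enc.fit_transform(authors)

print(list(enc.classes_))                        # classes are kept in sorted order
print(list(codes))                               # integer code for each input label
print(list(enc.inverse_transform([0, 1, 2])))    # map codes back to author names
```

`inverse_transform` is handy later when predicted class indices need to be turned back into author names.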

Before going further it is important that we split the data into training and validation sets. We can do it using `train_test_split` from the `model_selection` module of scikit-learn.

In [8]:
xtrain, xvalid, ytrain, yvalid = train_test_split(train.text.values, y, 
                                                  stratify=y, 
                                                  random_state=42, 
                                                  test_size=0.1, shuffle=True)

In [9]:
print (xtrain.shape)
print (xvalid.shape)

(17621,)
(1958,)
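The `stratify=y` argument keeps the class proportions the same in both parts of the split. A minimal sketch on hypothetical labels (50/30/20 samples of three classes, not the real author counts) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical labels: 50 / 30 / 20 samples of three classes
labels = np.array([0] * 50 + [1] * 30 + [2] * 20)
features = np.arange(len(labels))

x_tr, x_va, y_tr, y_va = train_test_split(
    features, labels, stratify=labels, random_state=42, test_size=0.1, shuffle=True)

train_counts = np.bincount(y_tr)   # samples per class in the training part
valid_counts = np.bincount(y_va)   # samples per class in the validation part
print(train_counts, valid_counts)
```

Both parts keep the 5:3:2 ratio of the full label array, so the validation log loss is not distorted by an unlucky class mix.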


## Building Basic Models

Let's start building our very first model.

Our very first model is a simple TF-IDF (Term Frequency - Inverse Document Frequency) followed by a simple Logistic Regression.

In [10]:
# Build the sparse TF-IDF matrix
tfv = TfidfVectorizer(min_df=3,  max_features=None, 
            strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
            ngram_range=(1, 3), use_idf=True, smooth_idf=True, sublinear_tf=True,
            stop_words = 'english')

# Fit TF-IDF on both the training and validation sets (semi-supervised learning)
tfv.fit(list(xtrain) + list(xvalid))
xtrain_tfv =  tfv.transform(xtrain) 
xvalid_tfv = tfv.transform(xvalid)
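To make the weighting behind these settings concrete, the sketch below recomputes TF-IDF by hand with NumPy for a tiny, hypothetical count matrix. It follows the formulas scikit-learn documents for `sublinear_tf=True`, `smooth_idf=True` and the default `norm='l2'`; it illustrates the arithmetic only and does not reproduce tokenization, n-grams or `min_df`.

```python
import numpy as np

# hypothetical term counts: 3 documents (rows) x 3 terms (columns)
counts = np.array([[2, 1, 0],
                   [1, 0, 3],
                   [1, 0, 0]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                    # document frequency per term

# sublinear_tf=True: replace tf with 1 + ln(tf) for non-zero counts
tf = np.zeros_like(counts)
nonzero = counts > 0
tf[nonzero] = 1.0 + np.log(counts[nonzero])

# smooth_idf=True: idf = ln((1 + n) / (1 + df)) + 1
idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0

tfidf = tf * idf
# default norm='l2': scale each document vector to unit length
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)
print(np.round(tfidf, 3))
```

A term that occurs in every document (the first column here) gets the minimum idf of 1, so rarer terms dominate each document vector.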

In [11]:
xvalid_tfv

<1958x15102 sparse matrix of type '<class 'numpy.float64'>'
	with 22260 stored elements in Compressed Sparse Row format>

In [12]:
# Fitting a simple Logistic Regression on TFIDF
clf = LogisticRegression(C=1.0)
clf.fit(xtrain_tfv, ytrain)
predictions = clf.predict_proba(xvalid_tfv)
predictions

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


array([[0.69431515, 0.07172108, 0.23396376],
       [0.79796083, 0.08133735, 0.12070182],
       [0.61024583, 0.16350202, 0.22625215],
       ...,
       [0.30012425, 0.25166439, 0.44821136],
       [0.20335046, 0.16891226, 0.62773728],
       [0.05947419, 0.90771649, 0.03280933]])

In [13]:
print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss: 0.572 


We have our first model, with a multi-class log loss of 0.572.

But we are greedy and want a better score. Let's look at the same model with different data.

Instead of using TF-IDF, we can also use word counts as features. This is easily done using scikit-learn's CountVectorizer.

In [14]:
# Build the sparse count matrix
ctv = CountVectorizer(analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3), stop_words = 'english')

# Fitting Count Vectorizer to both training and validation sets (semi-supervised learning)
ctv.fit(list(xtrain) + list(xvalid))
xtrain_ctv =  ctv.transform(xtrain) 
xvalid_ctv = ctv.transform(xvalid)
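With `ngram_range=(1, 3)` every unigram, bigram and trigram becomes its own feature. The sketch below counts them for a hypothetical token list using a small helper, `word_ngrams`, which is an illustrative name and not part of scikit-learn; stop-word removal and the regex tokenizer are left out for brevity.

```python
from collections import Counter

def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# hypothetical sentence, lower-cased and split on whitespace for simplicity
tokens = "gold snuff box gold".split()
ngram_counts = Counter(word_ngrams(tokens))
print(ngram_counts)
```

Four tokens already yield eight distinct features. That growth, together with the absence of a `min_df` cut-off, is why the count matrix in this notebook has far more columns (400,266) than the TF-IDF matrix (15,102).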

In [15]:
# Fitting a simple Logistic Regression on Counts
clf = LogisticRegression(C=1.0)
clf.fit(xtrain_ctv, ytrain)
predictions = clf.predict_proba(xvalid_ctv)
predictions

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


array([[7.70816633e-01, 1.65952992e-02, 2.12588067e-01],
       [9.02504595e-01, 3.32687937e-02, 6.42266108e-02],
       [7.90083818e-01, 1.06503473e-01, 1.03412709e-01],
       ...,
       [3.69985111e-01, 1.83415733e-01, 4.46599156e-01],
       [1.35182535e-01, 6.35558441e-02, 8.01261621e-01],
       [1.57128155e-04, 9.99824857e-01, 1.80143913e-05]])

In [16]:
print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss: 0.527 


In [21]:
yvalid # (1958,)

array([0, 0, 0, ..., 2, 2, 1])

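Aaaaand! We just improved our first model by about 0.045 (from 0.572 to 0.527)!!!

Next, let's try a very simple model that was quite famous in ancient times: Naive Bayes.

Let's see what happens when we use naive Bayes on these two datasets.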

In [15]:
# Fitting a simple Naive Bayes on TFIDF
clf = MultinomialNB()
clf.fit(xtrain_tfv, ytrain)
predictions = clf.predict_proba(xvalid_tfv)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss: 0.578 


Good performance! But the logistic regression on counts is still better! What happens when we use this model on the count data instead?

In [16]:
# Fitting a simple Naive Bayes on Counts
clf = MultinomialNB()
clf.fit(xtrain_ctv, ytrain)
predictions = clf.predict_proba(xvalid_ctv)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss: 0.485 


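Whoa! Seems like old stuff still works well!!!! One more ancient algorithm on the list is SVM. Some people "love" SVMs, so we must try SVM on this dataset.

Since SVMs take a lot of time, we will reduce the number of features from the TF-IDF using Singular Value Decomposition before applying SVM.

Also, note that we *must* standardize the data before applying SVMs.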

In [22]:
xtrain_tfv

<17621x15102 sparse matrix of type '<class 'numpy.float64'>'
	with 198521 stored elements in Compressed Sparse Row format>

In [23]:
xtrain_ctv

<17621x400266 sparse matrix of type '<class 'numpy.int64'>'
	with 556265 stored elements in Compressed Sparse Row format>

In [17]:
# Apply SVD with 120 components. 120-200 components are good enough for SVM models.
svd = decomposition.TruncatedSVD(n_components=120)
svd.fit(xtrain_tfv)
xtrain_svd = svd.transform(xtrain_tfv) # (17621, 120)
xvalid_svd = svd.transform(xvalid_tfv) # (1958, 120)

# Scale the data obtained from SVD. The scaled arrays get new names so the unscaled ones can be reused.
scl = preprocessing.StandardScaler()
scl.fit(xtrain_svd)
xtrain_svd_scl = scl.transform(xtrain_svd)
xvalid_svd_scl = scl.transform(xvalid_svd)
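The two steps in the cell above can be sketched with plain NumPy on a small random matrix. This is an illustration under stated assumptions: the 8x5 matrix is a hypothetical stand-in for the TF-IDF data, and a dense exact SVD replaces the solver that `TruncatedSVD` actually uses on sparse input.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 5))          # hypothetical stand-in for a small TF-IDF matrix
k = 2                           # number of components to keep

# full SVD, then keep the top-k right singular vectors
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = X @ Vt[:k].T        # projection onto the top-k directions, shape (8, 2)

# standardize each reduced column to zero mean and unit variance
mean = X_reduced.mean(axis=0)
std = X_reduced.std(axis=0)
X_scaled = (X_reduced - mean) / std
print(X_reduced.shape, X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

As in the notebook cell, the projection and the scaling statistics should come from the training data only and then be applied unchanged to the validation data.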

In [21]:
xvalid_svd.shape

(1958, 120)

Now it's time to apply SVM. After running the following cell, feel free to go for a walk or talk to your girlfriend/boyfriend.

In [18]:
# Fitting a simple SVM
clf = SVC(C=1.0, probability=True) # since we need probabilities
clf.fit(xtrain_svd_scl, ytrain)
predictions = clf.predict_proba(xvalid_svd_scl)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss: 0.730 


Oops! Time to get up! Looks like SVM doesn't perform well on this data...!

Before moving further, let's apply the most popular algorithm on Kaggle: xgboost!

In [25]:
# xtrain_ctv.tocsc()
clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, nthread=10, learning_rate=0.1)
clf.fit(xtrain_ctv.tocsc(), ytrain) # convert to CSC sparse format
predictions = clf.predict_proba(xvalid_ctv.tocsc())

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss: 0.772 


In [26]:
# xtrain_svd
clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, nthread=10, learning_rate=0.1)
clf.fit(xtrain_svd, ytrain)
predictions = clf.predict_proba(xvalid_svd)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss: 0.770 


In [27]:
# xtrain_svd
clf = xgb.XGBClassifier(nthread=10)
clf.fit(xtrain_svd, ytrain)
predictions = clf.predict_proba(xvalid_svd)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss: 0.772 


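Seems like no luck with XGBoost! But that is not quite right: I haven't done any hyperparameter optimization yet. And since I'm lazy, I'll just tell you how to do it and you can do it on your own! ;) This is discussed in the next section: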


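## Grid Search

It is a technique for hyperparameter optimization. It is not especially efficient, but it can give good results if you know which grid you want to use. I specify the parameters that should generally be used in this post: http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-learning-problem-abhishek-thakur/ These are the parameters I usually use. There are many other methods of hyperparameter optimization, which may or may not be as effective.

In this section, I'll talk about grid search using logistic regression.

Before starting with grid search we need to create a scoring function. This is done using the `make_scorer` function of scikit-learn.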

In [28]:
mll_scorer = metrics.make_scorer(multiclass_logloss, greater_is_better=False, needs_proba=True)

Next we need a pipeline. For demonstration, I'll use a pipeline consisting of SVD, scaling and then logistic regression. It's easier to understand with more than one module in the pipeline ;)

In [29]:
# Initialize SVD
svd = TruncatedSVD()
    
# Initialize the standard scaler 
scl = preprocessing.StandardScaler()

# We will use logistic regression here..
lr_model = LogisticRegression()

# Create the pipeline 
clf = pipeline.Pipeline([('svd', svd),
                         ('scl', scl),
                         ('lr', lr_model)])

Next we need a grid of parameters:

In [30]:
param_grid = {'svd__n_components' : [120, 180],
              'lr__C': [0.1, 1.0, 10], 
              'lr__penalty': ['l1', 'l2']}

So, for SVD we evaluate 120 and 180 components, and for logistic regression we evaluate three different values of C with l1 and l2 penalties. Now we can start grid search on these parameters.

In [32]:
# Initialize Grid Search Model
model = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=mll_scorer,
                                 verbose=10, n_jobs=-1, refit=True, cv=2)

# Fit Grid Search Model
model.fit(xtrain_tfv, ytrain)  # we can use the full data here but im only using xtrain
print("Best score: %0.3f" % model.best_score_)
print("Best parameters set:")
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Fitting 2 folds for each of 12 candidates, totalling 24 fits


12 fits failed out of a total of 24.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
12 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\HOME\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\HOME\anaconda3\lib\site-packages\sklearn\pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "C:\Users\HOME\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "C:\Users\HOME\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 447, in

Best score: -0.737
Best parameters set:
	lr__C: 0.1
	lr__penalty: 'l2'
	svd__n_components: 180
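The 12 failed fits reported above are consistent with the `l1` half of the grid being paired with `LogisticRegression`'s default `lbfgs` solver, which does not support an `l1` penalty; the truncated traceback ends in the solver check. A minimal sketch of one way around it, assuming the `saga` solver (which accepts both penalties) and a small synthetic dataset in place of the SVD features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# small synthetic 3-class problem, purely for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(90, 6))
y_demo = np.repeat([0, 1, 2], 30)
X_demo[np.arange(90), y_demo] += 2.0   # shift one feature per class so classes differ

# assumption: 'saga' is chosen because it accepts both l1 and l2 penalties
grid = GridSearchCV(
    estimator=LogisticRegression(solver="saga", max_iter=2000),
    param_grid={"C": [0.1, 1.0, 10], "penalty": ["l1", "l2"]},
    scoring="neg_log_loss",   # built-in scorer; higher (less negative) is better
    cv=2,
)
grid.fit(X_demo, y_demo)

scores = grid.cv_results_["mean_test_score"]
print(grid.best_params_, "%0.3f" % grid.best_score_)
```

In the notebook's pipeline the equivalent change would be to set the solver on `lr_model`, so that all 12 candidates are actually scored instead of half of them being set to NaN.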


The score is similar to what we had for the SVM. This technique can also be used to fine-tune xgboost or multinomial naive Bayes, as below. We will be using the TF-IDF data here:

In [34]:
nb_model = MultinomialNB()

# Create the pipeline 
clf = pipeline.Pipeline([('nb', nb_model)])

# parameter grid
param_grid = {'nb__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

# Initialize Grid Search Model
model = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=mll_scorer,
                                 verbose=10, n_jobs=-1, refit=True, cv=2)

# Fit Grid Search Model
model.fit(xtrain_tfv, ytrain)  # we can use the full data here but im only using xtrain. 
print("Best score: %0.3f" % model.best_score_)
print("Best parameters set:")
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Fitting 2 folds for each of 6 candidates, totalling 12 fits
Best score: -0.492
Best parameters set:
	nb__alpha: 0.1


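This is an improvement of about 15% over the original naive Bayes score on TF-IDF (0.578 down to 0.492), although the grid-search figure is a cross-validated score on the training split rather than a score on the validation set.

In NLP problems, it's customary to look at word vectors. Word vectors give a lot of insights about the data. Let's dive into that.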

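## Word Vectors

Without going into too much detail, I will explain how to create sentence vectors and how to use them to build a machine learning model on top. I am a fan of GloVe vectors, word2vec and fasttext. In this post, I'll be using the GloVe vectors. You can download the GloVe vectors from here: `http://www-nlp.stanford.edu/data/glove.840B.300d.zip`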

In [35]:
import shutil

shutil.unpack_archive('./input/glove.840B.300d.zip', './input/glove.840B.300d', 'zip')

In [81]:
f = open('./input/glove.840B.300d/glove.840B.300d.txt', 'rt', encoding='UTF8')
for line in tqdm(f):
    print(line)

360it [00:00, 3597.51it/s]

, -0.082752 0.67204 -0.14987 -0.064983 0.056491 0.40228 0.0027747 -0.3311 -0.30691 2.0817 0.031819 0.013643 0.30265 0.0071297 -0.5819 -0.2774 -0.062254 1.1451 -0.24232 0.1235 -0.12243 0.33152 -0.006162 -0.30541 -0.13057 -0.054601 0.037083 -0.070552 0.5893 -0.30385 0.2898 -0.14653 -0.27052 0.37161 0.32031 -0.29125 0.0052483 -0.13212 -0.052736 0.087349 -0.26668 -0.16897 0.015162 -0.0083746 -0.14871 0.23413 -0.20719 -0.091386 0.40075 -0.17223 0.18145 0.37586 -0.28682 0.37289 -0.16185 0.18008 0.3032 -0.13216 0.18352 0.095759 0.094916 0.008289 0.11761 0.34046 0.03677 -0.29077 0.058303 -0.027814 0.082941 0.1862 -0.031494 0.27985 -0.074412 -0.13762 -0.21866 0.18138 0.040855 -0.113 0.24107 0.3657 -0.27525 -0.05684 0.34872 0.011884 0.14517 -0.71395 0.48497 0.14807 0.62287 0.20599 0.58379 -0.13438 0.40207 0.18311 0.28021 -0.42349 -0.25626 0.17715 -0.54095 0.16596 -0.036058 0.08499 -0.64989 0.075549 -0.28831 0.40626 -0.2802 0.094062 0.32406 0.28437 -0.26341 0.11553 0.071918 -0.47215 -0.18366 -0.3

1507it [00:00, 5251.78it/s]IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

derivatives -0.67485 -0.45477 -0.41128 0.21927 -0.34152 -0.037233 0.082827 -0.10035 -0.51989 0.69126 0.031576 -0.11356 0.63312 0.31546 0.54507 0.17766 -0.099448 1.3676 0.17544 -0.083142 -0.23973 0.27596 -0.17957 0.53262 -0.13027 0.12007 -0.14333 0.27964 -0.031996 -0.75652 0.53455 0.47767 0.88412 -0.091789 0.56174 -0.196 0.25055 0.2192 -0.29076 -0.13484 -0.40107 -1.2289 0.30201 0.3236 -0.49053 0.13103 0.028311 -0.74515 0.45587 0.094762 -0.10232 -0.029996 -0.27436 0.79538 0.54643 -0.23283 -0.47662 -0.53665 -0.56801 -0.23525 0.26004 -0.0053897 -0.3255 0.56613 -0.046568 -0.46943 -0.13183 0.6964 -0.12531 -0.87124 0.24906 0.55785 -0.15488 0.70834 -0.38943 -0.34799 0.010022 -0.18395 0.31025 0.36748 -0.47754 0.27898 -0.2846 0.33755 -0.44413 -0.138 0.44121 0.077065 -0.9607 0.17585 0.30379 0.352 -0.0065939 -0.55953 0.4945 0.030165 -0.025755 0.38363 -0.064433 -0.39758 -0.34919 0.14007 -0.2129 -0.14907 0.042408 -2.1783 -0.17983 -0.12922 0.0058504 0.039465 0.08451 0.22897 0.099049 -0.29156 -0.61823

KeyboardInterrupt: 

In [50]:
coefs = np.asarray(values[1:], dtype='float32')
coefs

array([-7.9690e-02, -2.2905e-01,  8.0366e-01, -7.8865e-01, -4.0567e-01,
       -1.5716e-01, -4.2302e-01,  6.4081e-01, -1.3215e-01, -1.4109e+00,
        7.3118e-01, -3.7391e-01, -3.6422e-01,  2.4199e-02, -2.4359e-01,
        1.0140e+00,  6.5176e-04, -8.9537e-01,  8.0540e-01, -7.3101e-02,
        2.0257e-01,  5.9553e-01, -3.4971e-03, -2.8126e-01,  5.8631e-01,
       -1.7115e-01,  1.2428e-01,  5.3392e-01,  4.8289e-01,  3.6989e-01,
       -9.1151e-02, -2.3874e-01,  3.8864e-01, -1.6403e-01, -8.5745e-01,
        1.9000e-01,  4.1450e-01,  3.5958e-01, -1.8726e-02,  5.5213e-01,
       -9.1331e-03, -4.8204e-01, -6.4685e-01,  6.1736e-01, -2.7128e-01,
        1.3459e-01,  9.4729e-01, -4.2939e-01, -3.2462e-01, -8.8466e-02,
        3.7337e-01,  2.9062e-01, -7.4411e-03,  1.9840e-01, -4.2686e-01,
       -7.1294e-02, -4.3443e-02, -3.3026e-03, -1.0519e-01,  2.0885e-01,
       -3.0217e-01,  2.7366e-01, -3.5602e-01, -8.9143e-01,  2.8561e-01,
       -1.1656e-01,  2.2460e-01, -2.1561e-02, -1.6219e-02, -9.62

In [54]:
# load the GloVe vectors in a dictionary:

embeddings_index = {}
# each line holds a token followed by its space-separated vector components
with open('./input/glove.840B.300d/glove.840B.300d.txt', 'rt', encoding='UTF8') as f:
    for line in tqdm(f):
        values = line.split(" ")
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

2196017it [03:29, 10506.54it/s]

Found 2196016 word vectors.





In [58]:
# this function creates a normalized vector for the whole sentence
def sent2vec(s):
    words = str(s).lower()
    words = word_tokenize(words)
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()] # keep purely alphabetic tokens
    
    M = []
    for w in words:
        try:
            M.append(embeddings_index[w])
        except KeyError: # skip words that have no GloVe vector
            continue
            
    M = np.array(M)
    v = M.sum(axis=0)
    if type(v) != np.ndarray:
        return np.zeros(300)
    return v / np.sqrt((v ** 2).sum())
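The same idea can be tried without downloading GloVe. The sketch below uses hypothetical 3-dimensional embeddings in place of the 300-dimensional vectors, and `toy_sent2vec` is an illustrative stand-in for `sent2vec` that skips tokenization and stop-word filtering.

```python
import numpy as np

# hypothetical 3-dimensional embeddings standing in for 300-dimensional GloVe vectors
toy_embeddings = {
    "gold": np.array([1.0, 0.0, 0.0]),
    "crown": np.array([0.0, 2.0, 0.0]),
}

def toy_sent2vec(tokens, embeddings, dim=3):
    """Sum the vectors of known tokens and scale the result to unit length."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)          # no known word: fall back to a zero vector
    v = np.sum(vectors, axis=0)
    return v / np.sqrt((v ** 2).sum())

vec = toy_sent2vec(["gold", "crown", "unknownword"], toy_embeddings)
empty = toy_sent2vec(["unknownword"], toy_embeddings)
print(vec, empty)
```

Because the sum is divided by its own length, long and short sentences end up on the same scale; only the direction of the summed vector is kept as the sentence representation.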

In [55]:
xtrain

array(['Her hair was the brightest living gold, and despite the poverty of her clothing, seemed to set a crown of distinction on her head.',
       '"No," he said, "oh, no a member of my family my niece, and a most accomplished woman."',
       'The magistrate appeared at first perfectly incredulous, but as I continued he became more attentive and interested; I saw him sometimes shudder with horror; at others a lively surprise, unmingled with disbelief, was painted on his countenance.',
       ...,
       'The medical testimony spoke confidently of the virtuous character of the deceased.',
       'When we arrived, after a little rest, he led me over the house and pointed out to me the rooms which my mother had inhabited.',
       'Some were destroyed; the major part escaped by quick and well ordered movements; and danger made them careful.'],
      dtype=object)

In [86]:
# inspection helper: returns the matrix of word vectors for a sentence (not summed or normalized)
def aaa(s):
    words = str(s).lower()
    words = word_tokenize(words)
    words = [w for w in words if not w in stop_words]
    words = [w for w in words if w.isalpha()] # keep purely alphabetic tokens
    
    M = []
    for w in words:
        try:
            M.append(embeddings_index[w])
        except:
            continue
    return np.array(M)

In [89]:
aaa(xtrain[0]).shape

(12, 300)

In [90]:
aaa(xtrain[0]).sum(axis=0).shape

(300,)

In [59]:
# create sentence vectors using the above function for the training and validation sets
xtrain_glove = [sent2vec(x) for x in tqdm(xtrain)]
xvalid_glove = [sent2vec(x) for x in tqdm(xvalid)]



In [60]:
xtrain_glove = np.array(xtrain_glove)
xvalid_glove = np.array(xvalid_glove)

In [62]:
xvalid_glove.shape

(1958, 300)

Let's see the performance of xgboost on glove features:

In [66]:
# Fitting a simple xgboost on glove features
clf = xgb.XGBClassifier(nthread=10)
clf.fit(xtrain_glove, ytrain)
predictions = clf.predict_proba(xvalid_glove)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss: 0.708 


In [67]:
# Fitting a simple xgboost on glove features
clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, nthread=10, learning_rate=0.1)
clf.fit(xtrain_glove, ytrain)
predictions = clf.predict_proba(xvalid_glove)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss: 0.682 


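We can see that a simple tuning of parameters can improve the xgboost score on GloVe features! Believe me, you can squeeze a lot more out of it.

## Deep Learning

But this is the era of deep learning! We cannot live without training a few neural networks. Here, we will train an LSTM and a simple dense network on the GloVe features. Let's start with the dense network first: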

In [68]:
# scale the data before any neural net:
scl = preprocessing.StandardScaler()
xtrain_glove_scl = scl.fit_transform(xtrain_glove)
xvalid_glove_scl = scl.transform(xvalid_glove)

In [69]:
# we need to binarize the labels for the neural net
ytrain_enc = np_utils.to_categorical(ytrain)
yvalid_enc = np_utils.to_categorical(yvalid)
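What "binarizing" the labels means can be shown with plain NumPy. The sketch below builds the same kind of one-hot matrix for a hypothetical label array; it illustrates the encoding and is not the Keras implementation.

```python
import numpy as np

labels_demo = np.array([0, 2, 1, 0])     # hypothetical encoded authors
num_classes = 3

# row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes)[labels_demo]
print(one_hot)
```

Each row contains a single 1 in the column of the true class, which is the target format `categorical_crossentropy` expects for a 3-unit softmax output.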

In [70]:
# create a simple 3 layer sequential neural net
model = Sequential()

model.add(Dense(300, input_dim=300, activation='relu'))
model.add(Dropout(0.2))
model.add(BatchNormalization())

model.add(Dense(300, activation='relu'))
model.add(Dropout(0.3))
model.add(BatchNormalization())

model.add(Dense(3))
model.add(Activation('softmax'))

# compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [71]:
model.fit(xtrain_glove_scl, y=ytrain_enc, batch_size=64, 
          epochs=5, verbose=1, 
          validation_data=(xvalid_glove_scl, yvalid_enc))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1367011ef70>

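To get better results, you need to keep tuning the neural network's parameters, add more layers and increase dropout. Here, I'm just showing that it's fast to implement and run, and that it gets a better result than xgboost without any optimization :)

To move further, i.e. with LSTMs, we need to tokenize the text data.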

In [72]:
# using keras tokenizer here
token = text.Tokenizer(num_words=None)
max_len = 70

token.fit_on_texts(list(xtrain) + list(xvalid))
xtrain_seq = token.texts_to_sequences(xtrain)
xvalid_seq = token.texts_to_sequences(xvalid)

# zero pad the sequences
xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = sequence.pad_sequences(xvalid_seq, maxlen=max_len)

word_index = token.word_index
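The padding step can be sketched in NumPy. This assumes the default `pad_sequences` behaviour (`padding='pre'` and `truncating='pre'`), so zeros are added on the left and sequences that are too long keep their last `maxlen` ids; `pad_pre` is a hypothetical helper name used only for this illustration.

```python
import numpy as np

def pad_pre(seqs, maxlen):
    """Left-pad with zeros and keep the last `maxlen` ids of longer sequences."""
    out = np.zeros((len(seqs), maxlen), dtype="int32")
    for row, seq in enumerate(seqs):
        if not seq:
            continue                      # an empty sequence stays all zeros
        trunc = seq[-maxlen:]             # drop ids from the front if too long
        out[row, -len(trunc):] = trunc    # right-align the remaining ids
    return out

# hypothetical word-id sequences of different lengths
padded = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

Index 0 is reserved for padding because the tokenizer's `word_index` starts at 1, which is why the embedding matrix built next needs `len(word_index) + 1` rows.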

In [73]:
# create an embedding matrix for the words we have in the dataset
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector




In [74]:
# A simple LSTM with glove embeddings and two dense layers
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(LSTM(100, dropout=0.3, recurrent_dropout=0.3))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(3))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [76]:
model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=3, verbose=1, validation_data=(xvalid_pad, yvalid_enc))

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x13681606640>

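We see that the score is now less than 0.5. I ran it for a few epochs without stopping at the best one, but you can use early stopping to stop at the best iteration. How do I use early stopping?

Well, pretty easy. Let's compile the model again: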

In [77]:
# A simple LSTM with glove embeddings and two dense layers
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(LSTM(300, dropout=0.3, recurrent_dropout=0.3))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(3))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Fit the model with early stopping callback
earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100, 
          verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])

Epoch 1/100




 7/35 [=====>........................] - ETA: 1:48 - loss: 1.1188

KeyboardInterrupt: 
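The stopping rule configured in the cell above (`monitor='val_loss'`, `patience=3`, `min_delta=0`) can be sketched in plain Python. This is a simplified illustration of the patience logic, not the Keras implementation; `early_stop_epoch` is a hypothetical helper and the validation losses are made-up values.

```python
def early_stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss          # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1            # no improvement in this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)       # never triggered: all epochs were run

# hypothetical validation losses, one per epoch
history = [1.00, 0.80, 0.70, 0.75, 0.72, 0.71, 0.60]
stopped_at = early_stop_epoch(history, patience=3)
print(stopped_at)
```

With `patience=3` the run ends after three consecutive epochs without a new best loss, so `epochs=100` acts only as an upper bound. In this made-up history a later improvement is never reached, which is the trade-off of a small patience value.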

One question could be: why do I use so much dropout? Well, fit the model with no or little dropout and you will see that it starts to overfit :)

Let's see if a bidirectional LSTM can give us better results. It's a piece of cake to do with Keras :)

In [42]:
# A simple bidirectional LSTM with glove embeddings and two dense layers
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(Bidirectional(LSTM(300, dropout=0.3, recurrent_dropout=0.3)))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(3))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Fit the model with early stopping callback
earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100, 
          verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])

Train on 17621 samples, validate on 1958 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100


<keras.callbacks.History at 0x7f1eedc11a20>

Pretty close! Let's try two layers of GRU:

In [78]:
# GRU with glove embeddings and two dense layers
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(GRU(300, dropout=0.3, recurrent_dropout=0.3, return_sequences=True))
model.add(GRU(300, dropout=0.3, recurrent_dropout=0.3))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(3))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Fit the model with early stopping callback
earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100, 
          verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])

Epoch 1/100
Epoch 2/100
 6/35 [====>.........................] - ETA: 4:12 - loss: 0.9652

KeyboardInterrupt: 
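
A note on the stacked GRU cell above: the first GRU layer sets `return_sequences=True` because the second recurrent layer needs one input per time step, not just the last state. The sketch below shows the difference with a toy recurrence (again `h_t = decay * h_{t-1} + x_t`, not a real GRU); all names here are assumptions for illustration.

```python
def toy_recurrent_layer(sequence, decay=0.5, return_sequences=False):
    """Run a toy recurrence and return every state or only the last one."""
    state = 0.0
    states = []
    for x in sequence:
        state = decay * state + x
        states.append(state)
    return states if return_sequences else state

tokens = [1.0, 2.0, 4.0]

# First layer emits one state per time step, so it can feed another recurrent layer.
first_layer_out = toy_recurrent_layer(tokens, return_sequences=True)
# Second layer only returns its final state, which then goes to the dense layers.
second_layer_out = toy_recurrent_layer(first_layer_out)

print(first_layer_out)   # [1.0, 2.5, 5.25]
print(second_layer_out)  # 6.75
```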

Nice! Much better than what we had before! If you keep optimizing, the performance will keep improving.
Worth trying: stemming and lemmatization. This is something I am skipping for now.
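
To give a feel for what stemming would do to the text before vectorization, here is a deliberately crude suffix stripper. It is a toy written for this example, not the Porter algorithm; for real experiments a library such as NLTK (`PorterStemmer`, `WordNetLemmatizer`) is the usual choice. The sample sentence is made up.

```python
def toy_stem(word, suffixes=("ing", "ed", "ly", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

sentence = "the ghosts walked slowly through crumbling halls"
stems = [toy_stem(w) for w in sentence.split()]
print(stems)
# ['the', 'ghost', 'walk', 'slow', 'through', 'crumbl', 'hall']
```

Stems such as `crumbl` are not dictionary words, which is the main difference from lemmatization: a lemmatizer maps each word to a dictionary form. Either way, the vocabulary shrinks, so TF-IDF and count features pool evidence across inflected forms.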

In the Kaggle world, getting a top score requires an ensemble of models. Let's take a quick look at ensembling!


## Ensembling

A few months ago I wrote a simple ensembler but never had the time to develop it fully. You can find it at https://github.com/abhishekkrthakur/pysembler. I'll use part of it here.

In [82]:
# this is the main ensembling class. how to use it is in the next cell!
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold, KFold
import pandas as pd
import os
import sys
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="[%(asctime)s] %(levelname)s %(message)s",
    datefmt="%H:%M:%S", stream=sys.stdout)
logger = logging.getLogger(__name__)


class Ensembler(object):
    def __init__(self, model_dict, num_folds=3, task_type='classification', optimize=roc_auc_score,
                 lower_is_better=False, save_path=None):
        """
        Ensembler init function
        :param model_dict: model dictionary, see README for its format
        :param num_folds: the number of folds for ensembling
        :param task_type: classification or regression
        :param optimize: the function to optimize for, e.g. AUC, logloss, etc. Must have two arguments y_test and y_pred
        :param lower_is_better: is lower value of optimization function better or higher
        :param save_path: path to which model pickles will be dumped to along with generated predictions, or None
        """

        self.model_dict = model_dict
        self.levels = len(self.model_dict)
        self.num_folds = num_folds
        self.task_type = task_type
        self.optimize = optimize
        self.lower_is_better = lower_is_better
        self.save_path = save_path

        self.training_data = None
        self.test_data = None
        self.y = None
        self.lbl_enc = None
        self.y_enc = None
        self.train_prediction_dict = None
        self.test_prediction_dict = None
        self.num_classes = None

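    def fit(self, training_data, y, lentrain):
        """
        :param training_data: dict keyed by level; training_data[0] is a list with one feature matrix per level-0 model
        :param y: binary, multi-class or regression
        :param lentrain: number of training rows, used to size the out-of-fold prediction arrays
        :return: dict mapping each level to its out-of-fold training predictions
        """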

        self.training_data = training_data
        self.y = y

        if self.task_type == 'classification':
            self.num_classes = len(np.unique(self.y))
            logger.info("Found %d classes", self.num_classes)
            self.lbl_enc = LabelEncoder()
            self.y_enc = self.lbl_enc.fit_transform(self.y)
            kf = StratifiedKFold(n_splits=self.num_folds)
            train_prediction_shape = (lentrain, self.num_classes)
        else:
            self.num_classes = -1
            self.y_enc = self.y
            kf = KFold(n_splits=self.num_folds)
            train_prediction_shape = (lentrain, 1)

        self.train_prediction_dict = {}
        for level in range(self.levels):
            self.train_prediction_dict[level] = np.zeros((train_prediction_shape[0],
                                                          train_prediction_shape[1] * len(self.model_dict[level])))

        for level in range(self.levels):

            if level == 0:
                temp_train = self.training_data
            else:
                temp_train = self.train_prediction_dict[level - 1]

            for model_num, model in enumerate(self.model_dict[level]):
                validation_scores = []
                foldnum = 1
                for train_index, valid_index in kf.split(self.train_prediction_dict[0], self.y_enc):
                    logger.info("Training Level %d Fold # %d. Model # %d", level, foldnum, model_num)

                    if level != 0:
                        l_training_data = temp_train[train_index]
                        l_validation_data = temp_train[valid_index]
                        model.fit(l_training_data, self.y_enc[train_index])
                    else:
                        l0_training_data = temp_train[0][model_num]
                        if isinstance(l0_training_data, list):
                            l_training_data = [x[train_index] for x in l0_training_data]
                            l_validation_data = [x[valid_index] for x in l0_training_data]
                        else:
                            l_training_data = l0_training_data[train_index]
                            l_validation_data = l0_training_data[valid_index]
                        model.fit(l_training_data, self.y_enc[train_index])

                    logger.info("Predicting Level %d. Fold # %d. Model # %d", level, foldnum, model_num)

                    if self.task_type == 'classification':
                        temp_train_predictions = model.predict_proba(l_validation_data)
                        self.train_prediction_dict[level][valid_index,
                        (model_num * self.num_classes):(model_num * self.num_classes) +
                                                       self.num_classes] = temp_train_predictions

                    else:
                        temp_train_predictions = model.predict(l_validation_data)
                        self.train_prediction_dict[level][valid_index, model_num] = temp_train_predictions
                    validation_score = self.optimize(self.y_enc[valid_index], temp_train_predictions)
                    validation_scores.append(validation_score)
                    logger.info("Level %d. Fold # %d. Model # %d. Validation Score = %f", level, foldnum, model_num,
                                validation_score)
                    foldnum += 1
                avg_score = np.mean(validation_scores)
                std_score = np.std(validation_scores)
                logger.info("Level %d. Model # %d. Mean Score = %f. Std Dev = %f", level, model_num,
                            avg_score, std_score)

            # save_path is documented as optional, so only write the CSV when it is set
            if self.save_path is not None:
                logger.info("Saving predictions for level # %d", level)
                train_predictions_df = pd.DataFrame(self.train_prediction_dict[level])
                train_predictions_df.to_csv(os.path.join(self.save_path, "train_predictions_level_" + str(level) + ".csv"),
                                            index=False, header=False)

        return self.train_prediction_dict

    def predict(self, test_data, lentest):
        self.test_data = test_data
        if self.task_type == 'classification':
            test_prediction_shape = (lentest, self.num_classes)
        else:
            test_prediction_shape = (lentest, 1)

        self.test_prediction_dict = {}
        for level in range(self.levels):
            self.test_prediction_dict[level] = np.zeros((test_prediction_shape[0],
                                                         test_prediction_shape[1] * len(self.model_dict[level])))
        self.test_data = test_data
        for level in range(self.levels):
            if level == 0:
                temp_train = self.training_data
                temp_test = self.test_data
            else:
                temp_train = self.train_prediction_dict[level - 1]
                temp_test = self.test_prediction_dict[level - 1]

            for model_num, model in enumerate(self.model_dict[level]):

                logger.info("Training Fulldata Level %d. Model # %d", level, model_num)
                if level == 0:
                    model.fit(temp_train[0][model_num], self.y_enc)
                else:
                    model.fit(temp_train, self.y_enc)

                logger.info("Predicting Test Level %d. Model # %d", level, model_num)

                if self.task_type == 'classification':
                    if level == 0:
                        temp_test_predictions = model.predict_proba(temp_test[0][model_num])
                    else:
                        temp_test_predictions = model.predict_proba(temp_test)
                    self.test_prediction_dict[level][:, (model_num * self.num_classes): (model_num * self.num_classes) +
                                                                                        self.num_classes] = temp_test_predictions

                else:
                    if level == 0:
                        temp_test_predictions = model.predict(temp_test[0][model_num])
                    else:
                        temp_test_predictions = model.predict(temp_test)
                    self.test_prediction_dict[level][:, model_num] = temp_test_predictions

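            # save_path is documented as optional, so only write the CSV when it is set
            if self.save_path is not None:
                test_predictions_df = pd.DataFrame(self.test_prediction_dict[level])
                test_predictions_df.to_csv(os.path.join(self.save_path, "test_predictions_level_" + str(level) + ".csv"),
                                           index=False, header=False)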

        return self.test_prediction_dict
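
The core idea inside `Ensembler.fit` is out-of-fold prediction: every training row is predicted by a model that never saw it, and those predictions become the features for the next level. The sketch below strips that idea down to pure Python. `PriorModel`, `kfold_indices` and the six-row dataset are assumptions made up for this example, and the folds are contiguous and unshuffled, whereas the class above uses `StratifiedKFold`.

```python
from collections import Counter

class PriorModel:
    """Toy classifier: predicts the class frequencies seen during fit."""
    def fit(self, X, y):
        counts = Counter(y)
        self.classes_ = sorted(counts)
        self.probs_ = [counts[c] / len(y) for c in self.classes_]
        return self

    def predict_proba(self, X):
        return [list(self.probs_) for _ in X]

def kfold_indices(n_rows, num_folds):
    """Yield (train_idx, valid_idx) pairs using contiguous folds."""
    sizes = [n_rows // num_folds + (1 if i < n_rows % num_folds else 0)
             for i in range(num_folds)]
    start = 0
    for size in sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, valid
        start += size

def out_of_fold_predictions(model, X, y, num_folds=3):
    """Fill one prediction row per sample, each from a model fit without it."""
    oof = [None] * len(X)
    for train_idx, valid_idx in kfold_indices(len(X), num_folds):
        model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        fold_preds = model.predict_proba([X[i] for i in valid_idx])
        for i, probs in zip(valid_idx, fold_preds):
            oof[i] = probs
    return oof

X = [[0], [1], [2], [3], [4], [5]]   # hypothetical features
y = [0, 1, 0, 1, 1, 1]               # hypothetical labels
oof = out_of_fold_predictions(PriorModel(), X, y, num_folds=3)
for row in oof:
    print(row)
```

Rows 0 to 3 receive `[0.25, 0.75]` and rows 4 and 5 receive `[0.5, 0.5]`, because each fold's model was fit on different rows. A level-1 model trained on `oof` therefore sees predictions of the same quality it will get at test time, which is what keeps stacking from leaking the labels.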


In [83]:
# specify the data to be used for every level of ensembling:
train_data_dict = {0: [xtrain_tfv, xtrain_ctv, xtrain_tfv, xtrain_ctv], 1: [xtrain_glove]}
test_data_dict = {0: [xvalid_tfv, xvalid_ctv, xvalid_tfv, xvalid_ctv], 1: [xvalid_glove]}

model_dict = {0: [LogisticRegression(), LogisticRegression(), MultinomialNB(alpha=0.1), MultinomialNB()],

              1: [xgb.XGBClassifier(silent=True, n_estimators=120, max_depth=7)]}

ens = Ensembler(model_dict=model_dict, num_folds=3, task_type='classification',
                optimize=multiclass_logloss, lower_is_better=True, save_path='')

ens.fit(train_data_dict, ytrain, lentrain=xtrain_glove.shape[0])
preds = ens.predict(test_data_dict, lentest=xvalid_glove.shape[0])

[21:24:20] INFO Found 3 classes
[21:24:20] INFO Training Level 0 Fold # 1. Model # 0
[21:24:23] INFO Predicting Level 0. Fold # 1. Model # 0
[21:24:23] INFO Level 0. Fold # 1. Model # 0. Validation Score = 0.626621
[21:24:23] INFO Training Level 0 Fold # 2. Model # 0


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[21:24:24] INFO Predicting Level 0. Fold # 2. Model # 0
[21:24:24] INFO Level 0. Fold # 2. Model # 0. Validation Score = 0.616452
[21:24:24] INFO Training Level 0 Fold # 3. Model # 0


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[21:24:26] INFO Predicting Level 0. Fold # 3. Model # 0
[21:24:26] INFO Level 0. Fold # 3. Model # 0. Validation Score = 0.619625
[21:24:26] INFO Level 0. Model # 0. Mean Score = 0.620899. Std Dev = 0.004248
[21:24:26] INFO Training Level 0 Fold # 1. Model # 1


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[21:24:47] INFO Predicting Level 0. Fold # 1. Model # 1
[21:24:47] INFO Level 0. Fold # 1. Model # 1. Validation Score = 0.573485
[21:24:47] INFO Training Level 0 Fold # 2. Model # 1
[21:25:06] INFO Predicting Level 0. Fold # 2. Model # 1
[21:25:06] INFO Level 0. Fold # 2. Model # 1. Validation Score = 0.563451
[21:25:06] INFO Training Level 0 Fold # 3. Model # 1
[21:25:27] INFO Predicting Level 0. Fold # 3. Model # 1
[21:25:27] INFO Level 0. Fold # 3. Model # 1. Validation Score = 0.567765
[21:25:27] INFO Level 0. Model # 1. Mean Score = 0.568233. Std Dev = 0.004110
[21:25:27] INFO Training Level 0 Fold # 1. Model # 2
[21:25:27] INFO Predicting Level 0. Fold # 1. Model # 2
[21:25:27] INFO Level 0. Fold # 1. Model # 2. Validation Score = 0.463292
[21:25:27] INFO Training Level 0 Fold # 2. Model # 2
[21:25:27] INFO Predicting Level 0. Fold # 2. Model # 2
[21:25:27] INFO Level 0. Fold # 2. Model # 2. Validation Score = 0.456477
[21:25:27] INFO Training Level 0 Fold # 3. Model # 2
[21:25:



[21:25:36] INFO Predicting Level 1. Fold # 1. Model # 0
[21:25:36] INFO Level 1. Fold # 1. Model # 0. Validation Score = 0.486937
[21:25:36] INFO Training Level 1 Fold # 2. Model # 0
Parameters: { "silent" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


[21:25:45] INFO Predicting Level 1. Fold # 2. Model # 0
[21:25:45] INFO Level 1. Fold # 2. Model # 0. Validation Score = 0.471962
[21:25:45] INFO Training Level 1 Fold # 3. Model # 0
Parameters: { "silent" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


[21:25:5

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[21:26:16] INFO Predicting Test Level 0. Model # 1
[21:26:16] INFO Training Fulldata Level 0. Model # 2
[21:26:16] INFO Predicting Test Level 0. Model # 2
[21:26:16] INFO Training Fulldata Level 0. Model # 3
[21:26:16] INFO Predicting Test Level 0. Model # 3
[21:26:17] INFO Training Fulldata Level 1. Model # 0


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Parameters: { "silent" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


[21:26:29] INFO Predicting Test Level 1. Model # 0


In [84]:
# check error:
multiclass_logloss(yvalid, preds[1])

0.4639972620827762
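
To put the number above in context, here is a standalone sketch of multi-class log loss. It is named `logloss_sketch` so that it does not shadow the notebook's own `multiclass_logloss`, which is defined in an earlier cell and may differ in details; the clipping constant `eps` and the list-based inputs are assumptions of this sketch.

```python
import math

def logloss_sketch(actual, predicted, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    actual: list of integer class indices
    predicted: list of rows, one probability per class
    """
    total = 0.0
    for label, row in zip(actual, predicted):
        # Clip so that a probability of exactly 0 does not produce log(0).
        p = min(max(row[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(actual)

# A model that always predicts 1/3 for each of the three authors.
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 4
baseline = logloss_sketch([0, 1, 2, 0], uniform)
print(round(baseline, 4))  # 1.0986, i.e. ln(3)
```

Uniform guessing over three authors scores ln(3), roughly 1.0986, so a log loss near 0.46 means the ensemble puts substantially more probability on the correct author than chance.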

Thus, we see that ensembling improves the score by a great extent! Since this is only meant to be a tutorial, I won't be providing a CSV that you can submit to the leaderboard.