Kaggle Challenge Day 5: Sentiment Analysis of Movie Reviews


Introduction

In this post, I'll do sentiment analysis on the Rotten Tomatoes movie review dataset.

Previous articles:

Kaggle Challenge Day 1: Reading the Data in the Titanic Problem

Kaggle Challenge Day 4: Working Through the House Prices Problem

As before, I'll be working in Google Colaboratory, so see the earlier articles for how to get set up.

Competition overview

The Rotten Tomatoes movie review dataset is a corpus of movie reviews originally collected by Pang and Lee for use in sentiment analysis. Amazon's Mechanical Turk was used to create fine-grained labels for all of the parsed phrases in the corpus. This competition gives you a chance to benchmark your sentiment analysis ideas on the Rotten Tomatoes dataset. You are asked to label each phrase with one of five values: negative, somewhat negative, neutral, somewhat positive, or positive.

(Quoted from the competition description.)

Evaluation

Submissions are evaluated on classification accuracy (the fraction of labels predicted exactly) over every parsed phrase. The sentiment labels are:

0 – negative
1 – somewhat negative
2 – neutral
3 – somewhat positive
4 – positive
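
In other words, the metric is plain accuracy. As a minimal sketch of what the scoring computes (using scikit-learn and made-up labels):

from sklearn.metrics import accuracy_score

# Hypothetical ground-truth and predicted sentiment labels (0-4)
y_true = [0, 2, 2, 3, 4, 1]
y_pred = [0, 2, 1, 3, 4, 2]

# Fraction of phrases whose label is predicted exactly
print(accuracy_score(y_true, y_pred))  # 4 correct out of 6 -> 0.666...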


What we'll do

Using Google Colaboratory and Kernels, I'll do sentiment analysis on the Rotten Tomatoes movie review dataset, i.e. classify each phrase into one of the five labels above.

・Kernel used as reference this time: Keras-models:LSTM,CNN,GRU,Bidirectional,glove

See here for how to use Google Colaboratory.

Target audience

Anyone who wants to learn machine learning through Kaggle, is curious about Kaggle, or wants to try out a variety of datasets.

 

Movie Review Sentiment Analysis (Kernels Only)

This competition is Kernels Only: rather than just uploading a file of predictions, you have to submit a Kernel containing all of the model code, including the preprocessing.

Preparing the dataset

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

id = '*************************' # replace the * part with the portion after id= in the Google Drive sharing link
downloaded = drive.CreateFile({'id': id})
downloaded.GetContentFile('train.tsv') # name to save the file as
id = '*************************' 
downloaded = drive.CreateFile({'id': id})
downloaded.GetContentFile('test.tsv')
id = '*************************'
downloaded = drive.CreateFile({'id': id})
downloaded.GetContentFile('sample_submission.csv')
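
The IDs above are deliberately masked, so fill in your own. If you want to confirm that the three files actually arrived in the Colab working directory, a quick check like this works (a minimal sketch):

import os

# The three files downloaded above should now be in the current directory
for name in ['train.tsv', 'test.tsv', 'sample_submission.csv']:
    print(name, os.path.exists(name))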

Preparing the libraries

In[1]:

import numpy as np 
import pandas as pd 
import nltk
import os
import gc
from keras.preprocessing import sequence,text
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense,Dropout,Embedding,LSTM,Conv1D,GlobalMaxPooling1D,Flatten,MaxPooling1D,GRU,SpatialDropout1D,Bidirectional
from keras.callbacks import EarlyStopping
from keras.utils import to_categorical
from keras.losses import categorical_crossentropy
from keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report,f1_score
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
#pd.set_option('display.max_colwidth',100)
pd.set_option('display.max_colwidth', -1)

Out[1]:

Using TensorFlow backend.

In[2]:

gc.collect()
Out[2]:
0

Loading the data

In[3]:

train=pd.read_csv('train.tsv',sep='\t')
print(train.shape)
train.head()
(156060, 4)
Out[3]:
PhraseId SentenceId Phrase Sentiment
0 1 1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . 1
1 2 1 A series of escapades demonstrating the adage that what is good for the goose 2
2 3 1 A series 2
3 4 1 A 2
4 5 1 series 2

In[4]:

test=pd.read_csv('test.tsv',sep='\t')
print(test.shape)
test.head()
(66292, 3)
Out[4]:
PhraseId SentenceId Phrase
0 156061 8545 An intermittently pleasing but mostly routine effort .
1 156062 8545 An intermittently pleasing but mostly routine effort
2 156063 8545 An
3 156064 8545 intermittently pleasing but mostly routine effort
4 156065 8545 intermittently pleasing but mostly routine

In[5]:

sub=pd.read_csv('sample_submission.csv')
sub.head()
Out[5]:
PhraseId Sentiment
0 156061 2
1 156062 2
2 156063 2
3 156064 2
4 156065 2

Adding a dummy Sentiment column to test and concatenating train and test for preprocessing

In[6]:

test['Sentiment']=-999
test.head()
Out[6]:
PhraseId SentenceId Phrase Sentiment
0 156061 8545 An intermittently pleasing but mostly routine effort . -999
1 156062 8545 An intermittently pleasing but mostly routine effort -999
2 156063 8545 An -999
3 156064 8545 intermittently pleasing but mostly routine effort -999
4 156065 8545 intermittently pleasing but mostly routine -999

In[7]:

df=pd.concat([train,test],ignore_index=True)
print(df.shape)
df.tail()
(222352, 4)
Out[7]:
PhraseId SentenceId Phrase Sentiment
222347 222348 11855 A long-winded , predictable scenario . -999
222348 222349 11855 A long-winded , predictable scenario -999
222349 222350 11855 A long-winded , -999
222350 222351 11855 A long-winded -999
222351 222352 11855 predictable scenario -999

In[8]:

del train,test
gc.collect()
Out[8]:
26

Cleaning the reviews

The Kernel's code as-is throws an error on Google Colaboratory, so add the two nltk.download(...) lines below (highlighted in red in the original post).

In[9]:

from nltk.tokenize import word_tokenize
from nltk import FreqDist
from nltk.stem import SnowballStemmer,WordNetLemmatizer
stemmer=SnowballStemmer('english')
lemma=WordNetLemmatizer()
from string import punctuation
import re
import nltk
nltk.download('punkt')
nltk.download('wordnet')

Out[9]:

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
True
In [10]:
def clean_review(review_col):
    review_corpus=[]
    for i in range(0,len(review_col)):
        review=str(review_col[i])
        review=re.sub('[^a-zA-Z]',' ',review)
        #review=[stemmer.stem(w) for w in word_tokenize(str(review).lower())]
        review=[lemma.lemmatize(w) for w in word_tokenize(str(review).lower())]
        review=' '.join(review)
        review_corpus.append(review)
    return review_corpus
In [11]:
df['clean_review']=clean_review(df.Phrase.values)
df.head()
Out[11]:
PhraseId SentenceId Phrase Sentiment clean_review
0 1 1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . 1 a series of escapade demonstrating the adage that what is good for the goose is also good for the gander some of which occasionally amuses but none of which amount to much of a story
1 2 1 A series of escapades demonstrating the adage that what is good for the goose 2 a series of escapade demonstrating the adage that what is good for the goose
2 3 1 A series 2 a series
3 4 1 A 2 a
4 5 1 series 2 series
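
The commented-out line in clean_review uses a SnowballStemmer instead of the WordNet lemmatizer. As a quick comparison (a sketch only; the kernel keeps the lemmatizer line active, and exact outputs depend on your NLTK version), lemmatization keeps dictionary forms such as the 'escapade' and 'amount' seen in the clean_review column above, while stemming tends to truncate suffixes more aggressively:

from nltk.stem import SnowballStemmer, WordNetLemmatizer

stemmer = SnowballStemmer('english')
lemma = WordNetLemmatizer()

# Compare the two normalizers on a few words from the first review
for w in ['escapades', 'amounts', 'amuses']:
    print(w, '-> stem:', stemmer.stem(w), '| lemma:', lemma.lemmatize(w))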

Splitting back into train and test

In[12]:

df_train=df[df.Sentiment!=-999]
df_train.shape
Out[12]:
(156060, 5)
In [13]:
df_test=df[df.Sentiment==-999]
df_test.drop('Sentiment',axis=1,inplace=True)
print(df_test.shape)
df_test.head()
(66292, 4)
Out[13]:
PhraseId SentenceId Phrase clean_review
156060 156061 8545 An intermittently pleasing but mostly routine effort . an intermittently pleasing but mostly routine effort
156061 156062 8545 An intermittently pleasing but mostly routine effort an intermittently pleasing but mostly routine effort
156062 156063 8545 An an
156063 156064 8545 intermittently pleasing but mostly routine effort intermittently pleasing but mostly routine effort
156064 156065 8545 intermittently pleasing but mostly routine intermittently pleasing but mostly routine
In [14]:
del df
gc.collect()
Out[14]:
23

Holding out 20% of the train data as a validation set

In[15]:

train_text=df_train.clean_review.values
test_text=df_test.clean_review.values
target=df_train.Sentiment.values
y=to_categorical(target)
print(train_text.shape,target.shape,y.shape)
Out[15]:
(156060,) (156060,) (156060, 5)
In [16]:
X_train_text,X_val_text,y_train,y_val=train_test_split(train_text,y,test_size=0.2,stratify=y,random_state=123)
print(X_train_text.shape,y_train.shape)
print(X_val_text.shape,y_val.shape)
Out[16]:
(124848,) (124848, 5)
(31212,) (31212, 5)

Counting the unique words in the train data

In[17]:

all_words=' '.join(X_train_text)
all_words=word_tokenize(all_words)
dist=FreqDist(all_words)
num_unique_word=len(dist)
num_unique_word
Out[17]:
13732

Finding the maximum review length in the train data

In[18]:

r_len=[]
for text in X_train_text:
    word=word_tokenize(text)
    l=len(word)
    r_len.append(l)
    
MAX_REVIEW_LEN=np.max(r_len)
MAX_REVIEW_LEN
Out[18]:
48

Building the models with Keras

In[19]:

max_features = num_unique_word
max_words = MAX_REVIEW_LEN
batch_size = 128
epochs = 3
num_classes=5

Tokenize Text

In [20]:
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(X_train_text))
X_train = tokenizer.texts_to_sequences(X_train_text)
X_val = tokenizer.texts_to_sequences(X_val_text)
X_test = tokenizer.texts_to_sequences(test_text)

sequence padding

In [21]:
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_val = sequence.pad_sequences(X_val, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)
print(X_train.shape,X_val.shape,X_test.shape)
Out[21]:
(124848, 48) (31212, 48) (66292, 48)
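
pad_sequences forces every phrase to the same length of 48 tokens; by default Keras pads with zeros at the front and truncates from the front. A tiny illustration, independent of the data above:

from keras.preprocessing import sequence

# One short and one long toy sequence, both forced to length 5
demo = [[7, 8, 9], [1, 2, 3, 4, 5, 6]]
print(sequence.pad_sequences(demo, maxlen=5))
# [[0 0 7 8 9]
#  [2 3 4 5 6]]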

1. LSTM model

In [22]:
model1=Sequential()
model1.add(Embedding(max_features,100,mask_zero=True))
model1.add(LSTM(64,dropout=0.4, recurrent_dropout=0.4,return_sequences=True))
model1.add(LSTM(32,dropout=0.5, recurrent_dropout=0.5,return_sequences=False))
model1.add(Dense(num_classes,activation='softmax'))
model1.compile(loss='categorical_crossentropy',optimizer=Adam(lr=0.001),metrics=['accuracy'])
model1.summary()
Out[22]:
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, None, 100)         1373200   
_________________________________________________________________
lstm_1 (LSTM)                (None, None, 64)          42240     
_________________________________________________________________
lstm_2 (LSTM)                (None, 32)                12416     
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 165       
=================================================================
Total params: 1,428,021
Trainable params: 1,428,021
Non-trainable params: 0
_________________________________________________________________

In[23]:

%%time
history1=model1.fit(X_train, y_train, validation_data=(X_val, y_val),epochs=epochs, batch_size=batch_size, verbose=1)
Out[23]:
Train on 124848 samples, validate on 31212 samples
Epoch 1/3
124848/124848 [==============================] - 268s 2ms/step - loss: 1.0872 - acc: 0.5741 - val_loss: 0.8716 - val_acc: 0.6456
Epoch 2/3
124848/124848 [==============================] - 264s 2ms/step - loss: 0.8357 - acc: 0.6592 - val_loss: 0.8324 - val_acc: 0.6588
Epoch 3/3
124848/124848 [==============================] - 265s 2ms/step - loss: 0.7784 - acc: 0.6791 - val_loss: 0.8204 - val_acc: 0.6656
CPU times: user 18min 48s, sys: 1min 19s, total: 20min 8s
Wall time: 13min 19s
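
EarlyStopping is imported at the top but never actually used in this kernel. If you would rather stop training when val_loss stops improving instead of hard-coding 3 epochs, something along these lines should work (a sketch, not part of the original kernel):

from keras.callbacks import EarlyStopping

# Stop once val_loss has not improved for 2 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=2, verbose=1)

# history1 = model1.fit(X_train, y_train,
#                       validation_data=(X_val, y_val),
#                       epochs=10, batch_size=batch_size,
#                       verbose=1, callbacks=[early_stop])
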
In [24]:
y_pred1=model1.predict_classes(X_test,verbose=1)
Out[24]:
66292/66292 [==============================] - 139s 2ms/step

In[25]:

sub.Sentiment=y_pred1
sub.to_csv('sub1.csv',index=False)
sub.head()
Out[25]:
PhraseId Sentiment
0 156061 2
1 156062 2
2 156063 2
3 156064 2
4 156065 2

2. CNN

In[26]:

model2= Sequential()
model2.add(Embedding(max_features,100,input_length=max_words))
model2.add(Dropout(0.2))

model2.add(Conv1D(64,kernel_size=3,padding='same',activation='relu',strides=1))
model2.add(GlobalMaxPooling1D())

model2.add(Dense(128,activation='relu'))
model2.add(Dropout(0.2))

model2.add(Dense(num_classes,activation='softmax'))


model2.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])

model2.summary()
Out[26]:
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 48, 100)           1373200   
_________________________________________________________________
dropout_1 (Dropout)          (None, 48, 100)           0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 48, 64)            19264     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 128)               8320      
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 5)                 645       
=================================================================
Total params: 1,401,429
Trainable params: 1,401,429
Non-trainable params: 0
_________________________________________________________________

In[27]:

%%time
history2=model2.fit(X_train, y_train, validation_data=(X_val, y_val),epochs=epochs, batch_size=batch_size, verbose=1)
Out[27]:
Train on 124848 samples, validate on 31212 samples
Epoch 1/3
124848/124848 [==============================] - 11s 85us/step - loss: 0.9937 - acc: 0.5998 - val_loss: 0.8541 - val_acc: 0.6526
Epoch 2/3
124848/124848 [==============================] - 9s 72us/step - loss: 0.7839 - acc: 0.6786 - val_loss: 0.8165 - val_acc: 0.6626
Epoch 3/3
124848/124848 [==============================] - 9s 72us/step - loss: 0.7037 - acc: 0.7085 - val_loss: 0.8051 - val_acc: 0.6613
CPU times: user 32.4 s, sys: 5.9 s, total: 38.3 s
Wall time: 29 s
In [28]:
y_pred2=model2.predict_classes(X_test, verbose=1)
Out[28]:
66292/66292 [==============================] - 3s 44us/step
In [29]:
sub.Sentiment=y_pred2
sub.to_csv('sub2.csv',index=False)
sub.head()
Out[29]:
PhraseId Sentiment
0 156061 2
1 156062 2
2 156063 2
3 156064 2
4 156065 2

3. CNN + GRU

In[30]:

model3= Sequential()
model3.add(Embedding(max_features,100,input_length=max_words))
model3.add(Conv1D(64,kernel_size=3,padding='same',activation='relu'))
model3.add(MaxPooling1D(pool_size=2))
model3.add(Dropout(0.25))
model3.add(GRU(128,return_sequences=True))
model3.add(Dropout(0.3))
model3.add(Flatten())
model3.add(Dense(128,activation='relu'))
model3.add(Dropout(0.5))
model3.add(Dense(5,activation='softmax'))
model3.compile(loss='categorical_crossentropy',optimizer=Adam(lr=0.001),metrics=['accuracy'])
model3.summary()
Out[30]:
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, 48, 100)           1373200   
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 48, 64)            19264     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 24, 64)            0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 24, 64)            0         
_________________________________________________________________
gru_1 (GRU)                  (None, 24, 128)           74112     
_________________________________________________________________
dropout_4 (Dropout)          (None, 24, 128)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 3072)              0         
_________________________________________________________________
dense_4 (Dense)              (None, 128)               393344    
_________________________________________________________________
dropout_5 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 5)                 645       
=================================================================
Total params: 1,860,565
Trainable params: 1,860,565
Non-trainable params: 0
_________________________________________________________________
In [31]:
%%time
history3=model3.fit(X_train, y_train, validation_data=(X_val, y_val),epochs=epochs, batch_size=batch_size, verbose=1)
Out[31]:
Train on 124848 samples, validate on 31212 samples
Epoch 1/3
124848/124848 [==============================] - 45s 364us/step - loss: 1.0375 - acc: 0.5840 - val_loss: 0.8593 - val_acc: 0.6517
Epoch 2/3
124848/124848 [==============================] - 43s 348us/step - loss: 0.8134 - acc: 0.6668 - val_loss: 0.8134 - val_acc: 0.6662
Epoch 3/3
124848/124848 [==============================] - 43s 347us/step - loss: 0.7303 - acc: 0.6963 - val_loss: 0.8016 - val_acc: 0.6710
CPU times: user 2min 46s, sys: 12.3 s, total: 2min 58s
Wall time: 2min 13s
In [32]:
y_pred3=model3.predict_classes(X_test, verbose=1)
Out[32]:
66292/66292 [==============================] - 18s 271us/step
In [33]:
sub.Sentiment=y_pred3
sub.to_csv('sub3.csv',index=False)
sub.head()
Out[33]:
PhraseId Sentiment
0 156061 2
1 156062 2
2 156063 2
3 156064 2
4 156065 3

4. Bidirectional GRU

In[34]:

model4 = Sequential()

model4.add(Embedding(max_features, 100, input_length=max_words))
model4.add(SpatialDropout1D(0.25))
model4.add(Bidirectional(GRU(128)))
model4.add(Dropout(0.5))

model4.add(Dense(5, activation='softmax'))
model4.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model4.summary()
Out[34]:
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_4 (Embedding)      (None, 48, 100)           1373200   
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 48, 100)           0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 256)               175872    
_________________________________________________________________
dropout_6 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 5)                 1285      
=================================================================
Total params: 1,550,357
Trainable params: 1,550,357
Non-trainable params: 0
_________________________________________________________________
In [35]:
%%time
history4=model4.fit(X_train, y_train, validation_data=(X_val, y_val),epochs=epochs, batch_size=batch_size, verbose=1)
Out[35]:
Train on 124848 samples, validate on 31212 samples
Epoch 1/3
124848/124848 [==============================] - 137s 1ms/step - loss: 1.0025 - acc: 0.5972 - val_loss: 0.8539 - val_acc: 0.6494
Epoch 2/3
124848/124848 [==============================] - 137s 1ms/step - loss: 0.8024 - acc: 0.6699 - val_loss: 0.8116 - val_acc: 0.6717
Epoch 3/3
124848/124848 [==============================] - 136s 1ms/step - loss: 0.7365 - acc: 0.6945 - val_loss: 0.8114 - val_acc: 0.6713
CPU times: user 8min 16s, sys: 27.2 s, total: 8min 43s
Wall time: 6min 51s
In [36]:
y_pred4=model4.predict_classes(X_test, verbose=1)
Out[36]:
66292/66292 [==============================] - 61s 915us/step
In [37]:
sub.Sentiment=y_pred4
sub.to_csv('sub4.csv',index=False)
sub.head()
Out[37]:
PhraseId Sentiment
0 156061 2
1 156062 2
2 156063 2
3 156064 2
4 156065 2

5. GloVe word embeddings

In[38]:

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')
    
def get_embed_mat(EMBEDDING_FILE, max_features,embed_dim):
    # word vectors
    embeddings_index = dict(get_coefs(*o.rstrip().rsplit(' ')) for o in open(EMBEDDING_FILE, encoding='utf8'))
    print('Found %s word vectors.' % len(embeddings_index))

    # embedding matrix
    word_index = tokenizer.word_index
    num_words = min(max_features, len(word_index) + 1)
    all_embs = np.stack(embeddings_index.values()) #for random init
    embedding_matrix = np.random.normal(all_embs.mean(), all_embs.std(), 
                                        (num_words, embed_dim))
    for word, i in word_index.items():
        if i >= max_features:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    max_features = embedding_matrix.shape[0]
    
    return embedding_matrix
In [39]:
# embedding matrix
EMBEDDING_FILE = '../input/glove6b100dtxt/glove.6B.100d.txt'
embed_dim = 100 #word vector dim
embedding_matrix = get_embed_mat(EMBEDDING_FILE,max_features,embed_dim)
print(embedding_matrix.shape)
Out[39]:
Found 400000 word vectors.
(13732, 100)
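
Note that '../input/glove6b100dtxt/glove.6B.100d.txt' is the path inside Kaggle Kernels. When running on Google Colaboratory you have to fetch the embedding file yourself first, for example like this (a sketch, assuming the standard GloVe download URL on the Stanford NLP site is still available):

# Download and unzip the 100-dimensional GloVe vectors on Colab,
# then point EMBEDDING_FILE at the extracted file
!wget -q http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip glove.6B.100d.txt
EMBEDDING_FILE = 'glove.6B.100d.txt'
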
In [40]:
model5 = Sequential()
model5.add(Embedding(max_features, embed_dim, input_length=X_train.shape[1],weights=[embedding_matrix],trainable=True))
model5.add(SpatialDropout1D(0.25))
model5.add(Bidirectional(GRU(128,return_sequences=True)))
model5.add(Bidirectional(GRU(64,return_sequences=False)))
model5.add(Dropout(0.5))
model5.add(Dense(num_classes, activation='softmax'))
model5.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model5.summary()
Out[40]:
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_5 (Embedding)      (None, 48, 100)           1373200   
_________________________________________________________________
spatial_dropout1d_2 (Spatial (None, 48, 100)           0         
_________________________________________________________________
bidirectional_2 (Bidirection (None, 48, 256)           175872    
_________________________________________________________________
bidirectional_3 (Bidirection (None, 128)               123264    
_________________________________________________________________
dropout_7 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 5)                 645       
=================================================================
Total params: 1,672,981
Trainable params: 1,672,981
Non-trainable params: 0
_________________________________________________________________
In [41]:
%%time
history5=model5.fit(X_train, y_train, validation_data=(X_val, y_val),epochs=4, batch_size=batch_size, verbose=1)
Out[41]:
Train on 124848 samples, validate on 31212 samples
Epoch 1/4
124848/124848 [==============================] - 269s 2ms/step - loss: 0.9964 - acc: 0.5902 - val_loss: 0.8401 - val_acc: 0.6543
Epoch 2/4
124848/124848 [==============================] - 265s 2ms/step - loss: 0.8454 - acc: 0.6513 - val_loss: 0.7911 - val_acc: 0.6721
Epoch 3/4
124848/124848 [==============================] - 265s 2ms/step - loss: 0.7881 - acc: 0.6742 - val_loss: 0.7741 - val_acc: 0.6796
Epoch 4/4
124848/124848 [==============================] - 265s 2ms/step - loss: 0.7476 - acc: 0.6898 - val_loss: 0.7682 - val_acc: 0.6834
CPU times: user 22min 13s, sys: 1min 5s, total: 23min 18s
Wall time: 17min 46s
In [42]:
y_pred5=model5.predict_classes(X_test, verbose=1)
Out[42]:
66292/66292 [==============================] - 124s 2ms/step
In [43]:
sub.Sentiment=y_pred5
sub.to_csv('sub5.csv',index=False)
sub.head()
Out[43]:
PhraseId Sentiment
0 156061 2
1 156062 2
2 156063 2
3 156064 2
4 156065 2

Combining the outputs of all the models

In[44]:

sub_all=pd.DataFrame({'model1':y_pred1,'model2':y_pred2,'model3':y_pred3,'model4':y_pred4,'model5':y_pred5})
pred_mode=sub_all.agg('mode',axis=1)[0].values
sub_all.head()
Out[44]:
model1 model2 model3 model4 model5
0 2 2 2 2 2
1 2 2 2 2 2
2 2 2 2 2 2
3 2 2 2 2 2
4 2 2 3 2 2
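
agg('mode', axis=1)[0] takes the row-wise mode of the five predictions, i.e. a majority vote; when there is a tie, taking column [0] keeps the smallest label. A tiny illustration with made-up predictions (a sketch):

import pandas as pd

# Two hypothetical rows of predictions from the five models
demo = pd.DataFrame({'model1': [2, 3], 'model2': [2, 3], 'model3': [3, 2],
                     'model4': [2, 4], 'model5': [2, 4]})

# Row-wise mode: ties produce extra columns, so [0] keeps the first (smallest) mode
print(demo.mode(axis=1))
print(demo.mode(axis=1)[0].values)  # -> [2., 3.]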

In[45]:

pred_mode=[int(i) for i in pred_mode]
In [46]:
sub.Sentiment=pred_mode
sub.to_csv('sub_mode.csv',index=False)
sub.head()

Out[46]:
PhraseId Sentiment
0 156061 2
1 156062 2
2 156063 2
3 156064 2
4 156065 2


In this post, we ran sentiment analysis on the Rotten Tomatoes movie review dataset. Do check for yourself whether the predicted sentiment labels actually match the phrases.

Thank you for reading to the end. If you found the article useful, sharing it would be a great encouragement.
