BERT is making waves, but honestly I still don't have a good feel for what self-attention is actually doing... So here is a paper about self-attention from EMNLP 2018.

Overview

Self-attention models were applied to two kinds of tasks, and the differences between them were examined:

- Topic classification
- Sentiment analysis

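The models are relatively shallow and simple, and the model itself is not pre-trained, so they do not match this year's SOTA (ULMFiT), but the results are in a comparable range. Attention brought no benefit for topic classification; accuracy improved only for sentiment analysis.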

Visualizing the attention layer shows a column-based matrix for the topic classification tasks and a band matrix for sentiment analysis.

Model

f:id:tmitani-tky:20181122223148p:plain

The model largely follows that of Vaswani et al. (2017).

Pre-trained GloVe vectors are used as the input embedding, and a positional embedding is added to them as in Vaswani et al. (2017).
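To make "adding a positional embedding" concrete, here is a minimal NumPy sketch. It assumes the fixed sinusoidal form from Vaswani et al. (2017); this post does not say whether the paper uses that form or a learned one. The random `word_vectors` are only a stand-in for GloVe lookups, and `sinusoidal_positions` is a name chosen for illustration.

```python
import numpy as np

def sinusoidal_positions(seq_len: int, dim: int) -> np.ndarray:
    """Sinusoidal positional encoding (assumed form); dim must be even."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(dim // 2)[None, :]               # (1, dim / 2)
    angles = pos / np.power(10000.0, 2 * i / dim)  # (seq_len, dim / 2)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

rng = np.random.default_rng(0)
seq_len, embedding_size = 6, 100                   # 100 matches the base model
word_vectors = rng.normal(size=(seq_len, embedding_size))  # stand-in for GloVe

# The positional signal is simply added element-wise to the word vectors.
x = word_vectors + sinusoidal_positions(seq_len, embedding_size)
```

Because the two are summed rather than concatenated, the embedding size stays unchanged.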

A fully connected layer projects the input from embedding_size up to hidden_size_d, and the result is fed into the self-attention block.

Two models with different parameters were built:

- base model: N=1, embedding_size=100, hidden_size_d=128
- big model: N=2, embedding_size=200, hidden_size_d=256

In addition to these attention models, a baseline model was built in which the attention layer is replaced with an FFN.

The self-attention block also follows Vaswani et al. (2017) and related work.

f:id:tmitani-tky:20181122223927p:plain

From the input matrix X, the block forms XWq, XWk and XWv and computes attention from them.
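The computation can be sketched in a few lines of NumPy. This is a single-head, scaled dot-product version in the style of Vaswani et al. (2017), written under the assumption that the paper's block has the same shape; the dimensions and random weights are placeholders, not values from the paper.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over one sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n, n) similarity scores
    A = softmax(scores, axis=-1)              # row i: where token i attends
    return A @ V, A

rng = np.random.default_rng(0)
n, d = 5, 8                                   # toy sequence length and width
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out, A = self_attention(X, Wq, Wk, Wv)
```

`A` is the n x n matrix that gets visualized later: row i is the attending word, column j is the word being attended to, and every row sums to 1.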

For more on self-attention, see Ryobot's detailed explanation of Vaswani et al. (2017). deeplearning.hatenablog.com

It is tempting to focus only on the self-attention part, but it is worth stressing again that its output is added to the original input X. This is analogous to ResNet.
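As a rough illustration of that residual form, the sketch below wraps an arbitrary sublayer so that its output is added back to its input. The helper name `residual` and the toy sublayers are made up for this example, and the layer normalization that Vaswani et al. (2017) apply around the sum is left out.

```python
import numpy as np

def residual(x: np.ndarray, sublayer) -> np.ndarray:
    """Add the sublayer's output to its own input (ResNet-style skip)."""
    return x + sublayer(x)

X = np.arange(6, dtype=float).reshape(3, 2)

# If the sublayer contributes nothing, the block reduces to the identity,
# so the original input always passes through unchanged.
passthrough = residual(X, lambda x: np.zeros_like(x))

# A sublayer that does contribute is an additive correction on top of X.
corrected = residual(X, lambda x: 0.1 * x)
```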

Because the model is specialized for classification, the output of the attention block is max-pooled down to a vector of size hidden_size_d and then fed to the final fully connected layer + softmax. This makes the model independent of sequence length.
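The sketch below shows why max-pooling over the sequence axis removes the dependence on sequence length. The weights are random placeholders, `classify` is a name chosen for illustration, and the sizes (128 hidden units, 4 classes as in AG's News) are only picked to match the numbers mentioned in this post.

```python
import numpy as np

def classify(H: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """H: (seq_len, hidden) block output. Returns class probabilities."""
    pooled = H.max(axis=0)            # (hidden,): max over the sequence axis
    logits = pooled @ W + b           # final fully connected layer
    logits = logits - logits.max()    # stabilise the softmax
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
hidden, n_classes = 128, 4
W = rng.normal(size=(hidden, n_classes))
b = np.zeros(n_classes)

# Sequences of very different lengths go through the same final layer.
probs_short = classify(rng.normal(size=(7, hidden)), W, b)
probs_long = classify(rng.normal(size=(300, hidden)), W, b)
```

Both calls return a vector of 4 probabilities, even though one input has 7 tokens and the other 300.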

Experiments

topic classification task

AG's News with 4 classes of news articles
DBPedia with 14 classes of the Wikipedia ontology
Yahoo! Answers containing 10 categories of questions/answers

sentiment analysis task

Yelp rating from 1 to 5 stars
Yelp polarity (negative for 1-2 stars, positive for 4-5 stars)
Amazon rating
Amazon polarity

The models above were trained on these seven tasks and the results examined. As noted earlier, the scores fall somewhat short of SOTA.

Results

For topic classification, the attention models did not improve on the baseline model (accuracy actually dropped), whereas for sentiment analysis they did.

f:id:tmitani-tky:20181122225305p:plain

The figure above visualizes the softmax output of the attention layer.

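On the left, for topic classification, the weight is concentrated in a few columns. That means every word attends to the same few specific words regardless of distance, which makes the model essentially close to a bag-of-words approach. (The columns are not cleanly on/off, so calling it bag-of-words feels like a bit of an overstatement to me...)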
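The link between a column-based pattern and bag-of-words can be checked on an idealized case. In the NumPy sketch below, every row of the attention matrix is the same distribution `w`, which is the extreme form of vertical stripes; the values in `w` and `V` are invented for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
V = rng.normal(size=(n, d))                  # value vectors, one per token

# Every query uses the same weights, concentrated on tokens 1 and 4.
w = np.array([0.05, 0.6, 0.05, 0.05, 0.2, 0.05])
A = np.tile(w, (n, 1))                       # identical rows: vertical stripes

out = A @ V                                  # attention output per position
bag = w @ V                                  # one weighted bag of tokens

# Each position receives exactly the same vector, so the attention output
# carries no information about which position is asking.
same_everywhere = np.allclose(out, bag)
```

In the real figures the rows are only approximately identical, so this is the limiting case rather than a description of the trained models.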

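On the right, for sentiment analysis, the matrix is close to a band matrix. This indicates that the attention mechanism here essentially behaves like short-range skip-bigrams. In sentiment analysis, self-attention seems to have succeeded in learning this kind of complex relationship between nearby words.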

band matrix: a matrix whose nonzero entries are confined to a band around the main diagonal

Reference (in Japanese): [Pythonによる科学・技術計算] 数値線形代数でヒンパンに出てくる行列の一覧 - Qiita (a list of matrices that come up frequently in numerical linear algebra)
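To see what "close to a band matrix" means for attention weights, the sketch below builds a band mask and measures how much of a matrix's weight falls inside it. `band_mass` is a hypothetical diagnostic written for this post, not a metric from the paper.

```python
import numpy as np

def band_mask(n: int, width: int) -> np.ndarray:
    """True where |i - j| <= width, i.e. within `width` of the diagonal."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= width

def band_mass(A: np.ndarray, width: int) -> float:
    """Fraction of the total weight in A that lies inside the band."""
    return float(A[band_mask(A.shape[0], width)].sum() / A.sum())

n = 6
# Idealised band attention: each word attends only to its direct neighbours.
band = band_mask(n, 1).astype(float)
band /= band.sum(axis=1, keepdims=True)      # rows sum to 1, like softmax

# For contrast, attention spread evenly over all words.
uniform = np.full((n, n), 1.0 / n)

mass_band = band_mass(band, 1)
mass_uniform = band_mass(uniform, 1)
```

A matrix with a strong diagonal band pattern gives a value near 1, while weight spread evenly over the sequence gives only the share of cells the band covers.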

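The paper includes dozens of randomly sampled attention outputs, so take a look if you are curious. As in the figure above, the "column-based pattern" appeared dominant in the topic classification tasks, and the "diagonal band pattern" appeared dominant in the sentiment analysis tasks except for Yelp Review Polarity.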