Paper #4! BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

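Right then, on to a paper that the GPT-2 paper builds on.

First, let's check how it gets used!

The other paper it cites handled RTE: the input was <Premise><SEP><Hypothesis>, and the task was to classify the pair as Neutral, Contradiction, or Entailment.

But GPT-2 does generation tasks (Question-Answering) in addition to classification tasks. This paper is apparently where the generation side comes from, so let's look into it!

...wait, this is the original BERT paper ( ﾟДﾟ)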

Introduction

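First, there are apparently two ways to apply a pre-trained model to supervised learning: Fine-Tuning and Feature-Based.

Fine-Tuning is the approach used by OpenAI GPT (the model GPT-2 grew out of): the pre-trained model is left mostly as it is, with few task-specific parameters added, and the input is adapted to the task.

Feature-Based is the approach used by ELMo: the pre-trained representations are apparently fed as extra features into an architecture built for each task.

The drawback of the existing Fine-Tuning approaches is that the language model only learns the text in one direction, so this paper changes the unsupervised pre-training objective to make it bidirectional. That seems to be the idea.

The implementation is here.

Pre-training consists of two tasks: masking words in a sentence and guessing them (Masked Language Model), and predicting whether one sentence follows another (Next Sentence Prediction).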

BERT

Model Architecture

The model is based on, of course, Attention is All You Need.

The paper says the BERT Transformer uses bidirectional self-attention, but what kind of attention is that?

Looking at the implementation, only the Encoder is there; I can't find a Decoder.

As for masks, there is only attention_mask; I can't find a subsequent_mask.

def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  """Multi-headed, multi-layer Transformer from "Attention is All You Need".
  This is almost an exact implementation of the original Transformer encoder.
  See the original paper:
  https://arxiv.org/abs/1706.03762
  Also see:
  https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py
  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
    attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
      seq_length], with 1 for positions that can be attended to and 0 in
      positions that should not be.
    hidden_size: int. Hidden size of the Transformer.
    num_hidden_layers: int. Number of layers (blocks) in the Transformer.
    num_attention_heads: int. Number of attention heads in the Transformer.
    intermediate_size: int. The size of the "intermediate" (a.k.a., feed
      forward) layer.
    intermediate_act_fn: function. The non-linear activation function to apply
      to the output of the intermediate/feed-forward layer.
    hidden_dropout_prob: float. Dropout probability for the hidden layers.
    attention_probs_dropout_prob: float. Dropout probability of the attention
      probabilities.
    initializer_range: float. Range of the initializer (stddev of truncated
      normal).
    do_return_all_layers: Whether to also return all layers or just the final
      layer.
  Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size], the final
    hidden layer of the Transformer.
  Raises:
    ValueError: A Tensor shape or parameter is invalid.
  """
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output

      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # The activation is only applied to the "intermediate" hidden layer.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Down-project back to `hidden_size` then add the residual.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output

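Reading the footnote, it is written as if the Transformer Encoder is the bidirectional Transformer and the Decoder is the left-context-only Transformer. Is that really the right way to read it?

But if so, how did GPT-2 turn the Decoder into an Encoder? I honestly have no idea.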
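
One way to picture the footnote's distinction is to look only at the mask. As far as I can tell, the layer stack can stay the same and the difference is which positions each token is allowed to attend to, using the same convention as attention_mask above (1 = may attend, 0 = may not). The sketch below is only an illustration; bidirectional_mask and causal_mask are names I made up, not functions from the BERT code.

```python
def bidirectional_mask(seq_length):
    """Every position may attend to every position (encoder-style)."""
    return [[1] * seq_length for _ in range(seq_length)]


def causal_mask(seq_length):
    """Position i may attend only to positions 0..i (left context only)."""
    return [[1 if j <= i else 0 for j in range(seq_length)]
            for i in range(seq_length)]


# Print both masks side by side for a 4-token sequence.
for full_row, causal_row in zip(bidirectional_mask(4), causal_mask(4)):
    print(full_row, causal_row)
```

The lower-triangular mask is what a subsequent_mask would produce. BERT never builds one, which fits with only attention_mask showing up in the code.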

Input/Output Representations

The input is one sentence, or two sentences joined by a <SEP> token, with a <CLS> token placed at the front.

The output at the <CLS> token seems to be used to judge whether the two sentences are consecutive.

Tokenization uses WordPiece. I barely understand WordPiece, so I want to look into it later.

The embedding is Token embedding + Segment embedding + Positional embedding.

Assuming Token embedding and Positional embedding are the same as in Attention is All You Need, what is Segment embedding?

Looking at the implementation,

        (self.embedding_output, self.embedding_table) = embedding_lookup(
            input_ids=input_ids,
            vocab_size=config.vocab_size,
            embedding_size=config.hidden_size,
            initializer_range=config.initializer_range,
            word_embedding_name="word_embeddings",
            use_one_hot_embeddings=use_one_hot_embeddings)

        # Add positional embeddings and token type embeddings, then layer
        # normalize and perform dropout.
        self.embedding_output = embedding_postprocessor(
            input_tensor=self.embedding_output,
            use_token_type=True,
            token_type_ids=token_type_ids,
            token_type_vocab_size=config.type_vocab_size,
            token_type_embedding_name="token_type_embeddings",
            use_position_embeddings=True,
            position_embedding_name="position_embeddings",
            initializer_range=config.initializer_range,
            max_position_embeddings=config.max_position_embeddings,
            dropout_prob=config.hidden_dropout_prob)

is how it is called, and the functions themselves are implemented as

def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  """Looks up words embeddings for id tensor.
  Args:
    input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
      ids.
    vocab_size: int. Size of the embedding vocabulary.
    embedding_size: int. Width of the word embeddings.
    initializer_range: float. Embedding initialization range.
    word_embedding_name: string. Name of the embedding table.
    use_one_hot_embeddings: bool. If True, use one-hot method for word
      embeddings. If False, use `tf.gather()`.
  Returns:
    float Tensor of shape [batch_size, seq_length, embedding_size].
  """
  # This function assumes that the input is of shape [batch_size, seq_length,
  # num_inputs].
  #
  # If the input is a 2D tensor of shape [batch_size, seq_length], we
  # reshape to [batch_size, seq_length, 1].
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])

  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))

  flat_input_ids = tf.reshape(input_ids, [-1])
  if use_one_hot_embeddings:
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:
    output = tf.gather(embedding_table, flat_input_ids)

  input_shape = get_shape_list(input_ids)

  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  return (output, embedding_table)


def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  """Performs various post-processing on a word embedding tensor.
  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length,
      embedding_size].
    use_token_type: bool. Whether to add embeddings for `token_type_ids`.
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      Must be specified if `use_token_type` is True.
    token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
    token_type_embedding_name: string. The name of the embedding table variable
      for token type ids.
    use_position_embeddings: bool. Whether to add position embeddings for the
      position of each token in the sequence.
    position_embedding_name: string. The name of the embedding table variable
      for positional embeddings.
    initializer_range: float. Range of the weight initialization.
    max_position_embeddings: int. Maximum sequence length that might ever be
      used with this model. This can be longer than the sequence length of
      input_tensor, but cannot be shorter.
    dropout_prob: float. Dropout probability applied to the final output tensor.
  Returns:
    float tensor with same shape as `input_tensor`.
  Raises:
    ValueError: One of the tensor shapes or input values is invalid.
  """
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]

  output = input_tensor

  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if"
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings

  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # Since the position embedding table is a learned variable, we create it
      # using a (long) sequence length `max_position_embeddings`. The actual
      # sequence length might be shorter than this, for faster training of
      # tasks that do not have long sequences.
      #
      # So `full_position_embeddings` is effectively an embedding table
      # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
      # sequence has positions [0, 1, 2, ... seq_length-1], so we can just
      # perform a slice.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # Only the last two dimensions are relevant (`seq_length` and `width`), so
      # we broadcast among the first dimensions, which is typically just
      # the batch size.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  output = layer_norm_and_dropout(output, dropout_prob)
  return output

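embedding_lookup is an ordinary embedding lookup; the embedding of the other kinds of information is done by embedding_postprocessor.

The parts to look at here are token_type and position_embeddings, both used in embedding_postprocessor.

First, token_type.

It converts token_type_ids into one-hot vectors and multiplies them by a learned parameter called token_type_table. In other words, the model learns an embedding for the type of each token on top of the embedding for its token id.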
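
To convince myself that "one-hot times table" is just a row lookup, here is a tiny pure-Python sketch. The table values are made up and the helper names one_hot and matmul are mine; only token_type_table and token_type_ids mirror names from the code above.

```python
def one_hot(index, depth):
    """Return a list of length `depth` with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both lists of lists."""
    return [[sum(a_row[k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for a_row in a]


# Hypothetical table: one vector per token type (values are made up).
token_type_table = [[0.1, 0.2, 0.3],
                    [0.4, 0.5, 0.6]]
token_type_ids = [0, 0, 1, 1]

one_hot_ids = [one_hot(i, depth=2) for i in token_type_ids]
via_matmul = matmul(one_hot_ids, token_type_table)
via_lookup = [token_type_table[i] for i in token_type_ids]

print(via_matmul == via_lookup)
```

So the matmul route and plain indexing pick the same rows. The comment in the original code says the one-hot route is used because it is faster for a small vocabulary.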

Tracing token_type_ids back, run_classifier.py builds it from something called segment_ids.

Looking at create_pretraining_data.py,

        tokens = []
        segment_ids = []
        tokens.append("[CLS]")
        segment_ids.append(0)
        for token in tokens_a:
          tokens.append(token)
          segment_ids.append(0)

        tokens.append("[SEP]")
        segment_ids.append(0)

        for token in tokens_b:
          tokens.append(token)
          segment_ids.append(1)
        tokens.append("[SEP]")
        segment_ids.append(1)

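so segment_ids is 0 from the first <CLS> through the first <SEP>, and 1 from the next token through the final <SEP>.

So token_type_ids takes only two values. In other words, the model learns one embedding for the first sentence and one for the second, and adds it to the Embedding...

Next, position_embeddings. At first I thought it was the same as the Positional Encoding, but it isn't. The code creates a learned embedding table with one row per position up to the maximum sequence length, slices out the first seq_length rows to match the current batch, and adds them to the output.

In other words, the positional encoding is a learned parameter too. But with this, can it really learn things like different lengths of the first sentence, or the positional relationship between the first and second sentences?

For example, take the two pairs <今日はどこに行ってきたの？><今日は山梨県に行ってきた。> (<Where did you go today?><I went to Yamanashi today.>) and <どこに今日は行ってきたの？><山梨県に今日は行ってきた。> (the same pair with "today" and the place word swapped in each sentence). The distance between 今日 (today) and どこ (where) in the first sentence, and the distance between 山梨県 (Yamanashi) in the second sentence and どこ (where) in the first, stay constant. Can the model record not only positions inside the first sentence but also this kind of relative position between the first and second sentences?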
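
To keep the three embeddings straight, here is a toy version of the sum in plain Python. It is a sketch under my own assumptions: the tables hold made-up numbers, embed and add_vectors are hypothetical names, and the layer norm and dropout that embedding_postprocessor applies afterwards are left out.

```python
def add_vectors(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def embed(token_ids, segment_ids, token_table, segment_table, position_table):
    """Token + segment + position embedding for one sequence."""
    seq_length = len(token_ids)
    if seq_length > len(position_table):
        raise ValueError("sequence is longer than the position table")
    # Same idea as the tf.slice call: keep only the first seq_length rows.
    positions = position_table[:seq_length]
    return [add_vectors(token_table[t], segment_table[s], positions[i])
            for i, (t, s) in enumerate(zip(token_ids, segment_ids))]


# Made-up tables: 3 tokens, 2 segments, at most 4 positions, width 2.
token_table = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
segment_table = [[10.0, 10.0], [20.0, 20.0]]
position_table = [[100.0, 0.0], [200.0, 0.0], [300.0, 0.0], [400.0, 0.0]]

output = embed([2, 0, 1], [0, 0, 1],
               token_table, segment_table, position_table)
print(output)
```

Note that the position index keeps counting across the whole input, so a token in the second sentence gets a position that depends on how long the first sentence is. Whether relative relations between the two sentences can be learned from that is still an open question for me.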

Pre-training BERT

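Masked LM

The task is to predict masked words. That alone doesn't say much, so for example: given the input <the [MASK] I love is Code:Geass>, the model should output <the anime I love is Code:Geass>.

The thing to note here is that the Loss is not computed for the unmasked words such as "I" or "is"; only the prediction for the masked word (anime) counts.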
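
A minimal sketch of "loss only on the masked positions", assuming the model has already produced a probability distribution over the vocabulary at every position. masked_lm_loss and its arguments are names I chose for illustration; this is not the loss code from the BERT repository.

```python
import math


def masked_lm_loss(probs, label_ids, loss_weights):
    """Average negative log-likelihood over positions whose weight is 1.

    probs:        one probability distribution (list of floats) per position
    label_ids:    the correct vocabulary id per position
    loss_weights: 1.0 for masked positions, 0.0 for everything else
    """
    total = 0.0
    count = 0.0
    for dist, label, weight in zip(probs, label_ids, loss_weights):
        total += -math.log(dist[label]) * weight
        count += weight
    # Guard against division by zero when nothing was masked.
    return total / max(count, 1e-5)


# Three positions, vocabulary of size 2; only the middle one was masked.
probs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
loss = masked_lm_loss(probs, label_ids=[0, 1, 1], loss_weights=[0.0, 1.0, 0.0])
print(loss)
```

However good or bad the predictions at the unmasked positions are, they get weight 0 and never reach the loss.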

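As mentioned in the GPT paper, if you train on those too, you apparently end up with a silly model that just echoes the input (lol)

In practice, of the tokens chosen for prediction (15% of the input in the paper), 80% are replaced with the [MASK] token, 10% are replaced with a random word, and the remaining 10% are left unchanged.

The reason not to replace all of them with [MASK] is apparently that [MASK] never appears in real input.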
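
The 80% / 10% / 10% rule can be sketched in a few lines. This is my own simplified version, not create_pretraining_data.py: mask_tokens is a hypothetical name, the 15% selection rate is the figure from the paper, and the tiny vocabulary is made up.

```python
import random


def mask_tokens(tokens, vocab, rng, masked_lm_prob=0.15):
    """Return (masked_tokens, labels); labels[i] is None where no loss is taken."""
    output = list(tokens)
    labels = [None] * len(tokens)
    # Special tokens are never chosen for prediction.
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    rng.shuffle(candidates)
    num_to_predict = max(1, int(round(len(candidates) * masked_lm_prob)))
    for i in candidates[:num_to_predict]:
        labels[i] = tokens[i]          # the model must recover the original
        r = rng.random()
        if r < 0.8:
            output[i] = "[MASK]"       # 80%: replace with [MASK]
        elif r < 0.9:
            output[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the original token
    return output, labels


tokens = ["[CLS]", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "[SEP]"]
masked, labels = mask_tokens(tokens, vocab=["x", "y", "z"], rng=random.Random(0))
print(masked)
print(labels)
```

The labels list doubles as the loss mask: positions with None are exactly the ones that contribute nothing to the Masked LM loss.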

In that case, I feel like you could replace every [MASK] with a random word and make the task "fix the wrong words", but maybe that doesn't work?

Next Sentence Prediction

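The task is to predict NotNext or IsNext from the output at the position where <CLS> was fed in. This apparently lets the language model learn whether two sentences follow one another.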
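
Here is how I imagine the training pairs get built: half the time the real next sentence, half the time a random sentence from another document, as described in the paper. The function make_nsp_example and the toy documents are my own assumptions, and the real data pipeline handles many details that are skipped here.

```python
import random


def make_nsp_example(documents, doc_index, sent_index, rng):
    """Return (sentence_a, sentence_b, is_next) for one training pair.

    Assumes `documents` holds at least two documents, each a list of sentences.
    """
    doc = documents[doc_index]
    sentence_a = doc[sent_index]
    has_next = sent_index + 1 < len(doc)
    if has_next and rng.random() < 0.5:
        # IsNext: the sentence that really follows sentence_a.
        return sentence_a, doc[sent_index + 1], True
    # NotNext: a random sentence taken from a different document.
    other_indices = [i for i in range(len(documents)) if i != doc_index]
    other_doc = documents[rng.choice(other_indices)]
    return sentence_a, rng.choice(other_doc), False


documents = [["a1", "a2", "a3"], ["b1", "b2"]]
print(make_nsp_example(documents, 0, 0, random.Random(0)))
```

The two returned sentences would then be packed as <CLS> A <SEP> B <SEP>, with is_next as the label for the <CLS> output.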

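So does that mean it can't learn long texts like Dialogue?

Maybe GPT wins on this point?

It apparently helps accuracy on Question-Answering and NLI.

NLI, by the way, is a task that also appeared in a paper I covered earlier:

classify the relationship between Premise and Hypothesis as contradiction, neutral, or entailment.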

Fine-tuning BERT

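Actually, this was starting to feel different from the Question-Answering task I had in mind, so I looked it up and sure enough it was different orz...

The task I expected

In Training, feed in <Question, Answer> and learn from it somehow; in Prediction, take Question as Input and output Answer.

The actual task

In Training, feed in <Question, Answer>, where Answer is a passage that contains the answer, and output a probability distribution over the tokens in Answer that form the answer to Question. In Prediction, feed in <Question, Answer> in the same way and output the same kind of distribution.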
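
To make the "distribution over tokens" concrete: as I understand the paper, the model scores every token in the passage as a possible answer start and as a possible answer end, and the predicted answer is the span with the best start score + end score where the end is not before the start. The sketch below assumes those scores are already computed; best_span, the max_answer_length cap, and the numbers are all made up for illustration.

```python
def best_span(start_scores, end_scores, max_answer_length=30):
    """Return (start, end) maximizing start_scores[start] + end_scores[end],
    subject to start <= end and a span of at most max_answer_length tokens."""
    best = None
    best_score = float("-inf")
    for start, start_score in enumerate(start_scores):
        last = min(start + max_answer_length, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


passage = ["I", "went", "to", "Yamanashi", "today"]
start_scores = [0.1, 0.2, 0.1, 3.0, 0.5]   # made-up model outputs
end_scores = [0.0, 0.1, 0.2, 2.5, 1.0]     # made-up model outputs

start, end = best_span(start_scores, end_scores)
print(passage[start:end + 1])
```

So the answer is always a slice of the passage that was fed in, which is why nothing can come out unless the passage already contains it.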

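So for Question-Answering you somehow have to prepare a passage that contains the answer along with the question.

What are you supposed to do for conversation? (lol)

I looked around and found that Google has published a paper on Meena, a human-like chatbot, so I'll read that next!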