
Google Deep Learning Notes: Recurrent Neural Networks in Practice

Source: 程序員人生  Published: 2016-08-09 08:10:46

Please credit the author when reposting: 夢里風林
GitHub project: https://github.com/ahangchen/GDLnotes
Stars are welcome; questions can be discussed in the Issues section
Official tutorial
Video/subtitle download

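Loading the data

text8 is used as the training text dataset.

text8 contains only 27 distinct characters: the lowercase letters a to z and the space character. Printed out, it reads like Wikipedia with all punctuation removed.

Call maybe_download from lesson 1 directly to download text8.zip
Read the zip contents into a string with zipfile and split it into a list of words
Count the words with the collections module and find the most common ones
This makes it possible to draw data at random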
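
The word-counting step can be sketched with collections.Counter. This is a minimal illustration rather than the code from the repository: the function name build_dataset, the "UNK" placeholder for rare words, and the toy sentence are all assumptions made for the example.

```python
from collections import Counter

def build_dataset(words, vocabulary_size):
    """Map each word to an integer id; rare words share id 0 ('UNK')."""
    # Keep the (vocabulary_size - 1) most common words; id 0 is reserved for UNK.
    most_common = Counter(words).most_common(vocabulary_size - 1)
    dictionary = {"UNK": 0}
    for word, _ in most_common:
        dictionary[word] = len(dictionary)
    # Replace every word by its id, falling back to 0 for out-of-vocabulary words.
    data = [dictionary.get(word, 0) for word in words]
    reverse_dictionary = {idx: word for word, idx in dictionary.items()}
    return data, dictionary, reverse_dictionary

words = "the cat sat on the mat the end".split()
data, dictionary, reverse_dictionary = build_dataset(words, vocabulary_size=3)
print(data)
```

Once the text is a list of integer ids, picking training samples reduces to indexing into that list.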

Building the computation units

embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

Build a vocabulary_size x embedding_size matrix as the embeddings container.
It holds vocabulary_size vectors of length embedding_size, and each vector represents one vocabulary word.
Every component of every vector is drawn uniformly at random from -1 to 1.

embed = tf.nn.embedding_lookup(embeddings, train_dataset)

Call tf.nn.embedding_lookup to index the vectors that correspond to train_dataset. In effect, each entry of train_dataset is used as an id to retrieve the matching embedding from the matrix.
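
To make the lookup concrete, here is a plain-Python sketch of the same idea. It is not TensorFlow code: the helper embedding_lookup below is a hypothetical stand-in that only mimics row selection, and the tiny sizes are chosen for illustration.

```python
import random

vocabulary_size, embedding_size = 5, 3
rng = random.Random(0)

# One row per word id, each entry drawn uniformly from [-1, 1].
embeddings = [[rng.uniform(-1.0, 1.0) for _ in range(embedding_size)]
              for _ in range(vocabulary_size)]

def embedding_lookup(table, ids):
    """Return the rows of `table` selected by `ids`, in order."""
    return [table[i] for i in ids]

train_dataset = [3, 0, 3]
embed = embedding_lookup(embeddings, train_dataset)
print(len(embed), len(embed[0]))
```

A repeated id simply returns the same row twice, so the result has one vector per input id.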

loss = tf.reduce_mean(
        tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases, embed,
                                   train_labels, num_sampled, vocabulary_size))

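Compute the training loss with sampling: instead of normalizing over the whole vocabulary, sampled softmax estimates the loss from num_sampled sampled classes.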

optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

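Adagrad (adaptive gradient) optimizer: it adjusts the entries of the embedding table so that the loss is minimized
Predict, and use the cosine of the angle between the predicted vector and the actual data as the measure of prediction accuracy (similarity)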
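
The cosine measure mentioned above can be sketched in a few lines of plain Python. The function names cosine_similarity and nearest and the toy vectors are assumptions for illustration; the repository code may compute the same quantity differently, for example with normalized matrices.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query_id, embeddings, top_k):
    """Ids of the top_k rows most similar to the query row (excluding itself)."""
    scores = [(cosine_similarity(embeddings[query_id], row), idx)
              for idx, row in enumerate(embeddings) if idx != query_id]
    return [idx for _, idx in sorted(scores, reverse=True)[:top_k]]

toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(nearest(0, toy, top_k=2))
```

Vectors pointing in nearly the same direction score close to 1, orthogonal ones score 0, and opposite ones score -1.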

Feeding data for training

Slice the data for training, where:

data_index = (data_index + 1) % len(data)

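As before, a random portion of the data is fed in each time
- Take a short piece of text at regular intervals
- Build the training set: the middle position of each window is one train_data
- Build the labels: from the rest of each window, excluding train_data, pick a few items at random into a list as the label (here only one is picked)
- This gives a mechanism that predicts the context from the target word, i.e. Skip-gram
Train for 100001 steps, printing the average loss over the last 2000 steps every 2000 steps
Every 10000 steps, compute the similarity and print the list of words closest to each word in the validation set
Use t-SNE to reduce the dimensionality and show how close the words are to each other
Plot the result with matplotlib

For the implementation, see word2vec.py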
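
The windowing described above can be sketched as follows. This is a simplified illustration, not the generate_batch logic of word2vec.py: the names skipgram_pairs, skip_window and num_skips are assumptions, and the sketch walks the data once instead of keeping a wrapping data_index.

```python
import random

def skipgram_pairs(data, skip_window, num_skips, rng):
    """Return (center, context) pairs; num_skips context words are sampled per window."""
    pairs = []
    for center in range(skip_window, len(data) - skip_window):
        # Positions inside the window, excluding the center itself.
        context_positions = [p for p in range(center - skip_window,
                                              center + skip_window + 1)
                             if p != center]
        for p in rng.sample(context_positions, num_skips):
            pairs.append((data[center], data[p]))
    return pairs

data = [10, 11, 12, 13, 14]
pairs = skipgram_pairs(data, skip_window=1, num_skips=1, rng=random.Random(0))
print(pairs)
```

Each pair holds the center word as input and one randomly chosen neighbour as its label, which is the Skip-gram direction of prediction.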

CBOW

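The model trained above is Skip-gram, which predicts the context from the target word. word2vec has another mode, CBOW, which predicts the target word from the context.

In practice this just swaps the input and output of Skip-gram.

Change how the data is sliced
- Build the labels: the middle position of each window is one train_label
- Build the training set: the rest of each window, excluding train_label, is the train_data (here only one is picked at random)
- This gives a mechanism that predicts the target word from the context, i.e. CBOW
Look up in the embeddings the vector for every word in train_data, add the vectors together with tf.reduce_sum, and compare the sum against train_label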

# Look up embeddings for inputs.
embed = tf.nn.embedding_lookup(embeddings, train_dataset)
# sum up vectors on first dimensions, as context vectors
embed_sum = tf.reduce_sum(embed, 0)

Training still adjusts the embedding parameters to optimize the loss
Plotting the training result shows how close different words are to one another

For the code, see:
cbow.py
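
The CBOW data layout and the vector sum can be sketched in plain Python. The helpers cbow_examples and sum_vectors and the toy embeddings are assumptions for illustration; in this sketch every context word in the window is used.

```python
def cbow_examples(data, skip_window):
    """Return (context_ids, center_id) for every full window in `data`."""
    examples = []
    for center in range(skip_window, len(data) - skip_window):
        context = (data[center - skip_window:center]
                   + data[center + 1:center + skip_window + 1])
        examples.append((context, data[center]))
    return examples

def sum_vectors(vectors):
    """Element-wise sum, the plain-Python counterpart of tf.reduce_sum(embed, 0)."""
    return [sum(column) for column in zip(*vectors)]

embeddings = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [0.5, 0.5]]
examples = cbow_examples([0, 1, 2, 3], skip_window=1)
context, label = examples[0]
context_vector = sum_vectors([embeddings[i] for i in context])
print(context, label, context_vector)
```

Compared with Skip-gram, the roles are swapped: the context words form the input and the center word is the label.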

Generating sentences with an RNN

The overall idea is to use one word of a text as the train data and all the words after it as the train label, so that, given a word, the model can predict the fragment that follows.

Training data

BatchGenerator
- text: the full text data
- text_size: the length of the full text string
- batch_size: the size of each piece of training data
- num_unrollings: the number of training data pieces to generate
- segment: how many training fragments the whole training dataset can be split into
- cursor: important
  - Initially it records the starting position of each training fragment, i.e. the index in text where the fragment begins
  - When next_batch generates a piece of training data, the cursor advances from its initial position until batch_size items have been taken
- last_batch: the previous training fragment
- Each call to next produces an array that starts with last_batch and is followed by num_unrollings batches
- Each batch serves as train_input and the batch after it as train_label; each step trains on num_unrolling batches
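
The cursor mechanics are easier to follow in code. The sketch below is a simplified assumption of how such a generator could look: it returns raw characters instead of one-hot vectors, and the eight-letter text is a toy input.

```python
class BatchGenerator:
    def __init__(self, text, batch_size, num_unrollings):
        self._text = text
        self._text_size = len(text)
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        # Split the text into batch_size segments; one cursor per segment start.
        segment = self._text_size // batch_size
        self._cursor = [offset * segment for offset in range(batch_size)]
        self._last_batch = self._next_batch()

    def _next_batch(self):
        """Read one character at each cursor, then advance every cursor by one."""
        batch = []
        for b in range(self._batch_size):
            batch.append(self._text[self._cursor[b]])
            self._cursor[b] = (self._cursor[b] + 1) % self._text_size
        return batch

    def next(self):
        """Return num_unrollings + 1 batches, starting with the previous last batch."""
        batches = [self._last_batch]
        for _ in range(self._num_unrollings):
            batches.append(self._next_batch())
        self._last_batch = batches[-1]
        return batches

gen = BatchGenerator("abcdefgh", batch_size=2, num_unrollings=3)
batches = gen.next()
inputs, labels = batches[:-1], batches[1:]
print(inputs, labels)
```

Shifting the list by one position gives the input/label pairing: every batch is the label of the batch before it.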

lstm-cell

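To address the vanishing gradient problem, an lstm-cell is introduced to strengthen the model's memory
The lstm-cell is designed following this paper: http://arxiv.org/pdf/1402.1128v1.pdf
There are three gates, the input gate, the forget gate and the output gate, which together make up a cell
- The input data is num_nodes words, with vocabulary_size possible kinds of word
- Input gate: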

input_gate = sigmoid(i * ix + o * im + ib)

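- Multiply the input by a matrix of size vocabulary_size * num_nodes, and the previous output by a matrix of size num_nodes * num_nodes;
- These two matrices control how much of the input data is kept or discarded
- The result is activated with the nonlinear sigmoid function

Forget gate: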

forget_gate = sigmoid(i * fx + o * fm + fb)

Same idea as the input gate; it decides how much of the historical data to keep

Output gate:

output_gate = sigmoid(i * ox + o * om + ob)

Same idea as the input gate; it decides how much of the output state to keep

Combination:

  update = i * cx + o * cm + cb
  state = forget_gate * state + input_gate * tanh(update)
  lstm_cell = output_gate * tanh(state)

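- Build the new state update in the same way
- Apply the forget gate to the historical state state
- Activate the new state update with tanh
- Apply the input gate to the new state update
- Combine the old and new states, then activate state with tanh
- Apply the output gate to state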
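
The gate formulas and the combination step can be put together into one runnable step function. This is a NumPy sketch under stated assumptions, not the TensorFlow code of the repository: the parameter dictionary p, the random initial weights and the tiny sizes are made up for the example, while the names ix, im, ib and so on follow the formulas above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(i, o, state, p):
    """One LSTM step: i is the input, o the previous output, state the cell state."""
    input_gate = sigmoid(i @ p["ix"] + o @ p["im"] + p["ib"])
    forget_gate = sigmoid(i @ p["fx"] + o @ p["fm"] + p["fb"])
    output_gate = sigmoid(i @ p["ox"] + o @ p["om"] + p["ob"])
    update = i @ p["cx"] + o @ p["cm"] + p["cb"]
    state = forget_gate * state + input_gate * np.tanh(update)
    return output_gate * np.tanh(state), state

rng = np.random.default_rng(0)
batch_size, vocabulary_size, num_nodes = 2, 27, 4
p = {}
for gate in "ifoc":
    p[gate + "x"] = rng.normal(0.0, 0.1, (vocabulary_size, num_nodes))
    p[gate + "m"] = rng.normal(0.0, 0.1, (num_nodes, num_nodes))
    p[gate + "b"] = np.zeros((1, num_nodes))

i = np.eye(vocabulary_size)[[0, 5]]        # two one-hot input rows
o = np.zeros((batch_size, num_nodes))      # no previous output yet
state = np.zeros((batch_size, num_nodes))  # empty cell state
o, state = lstm_cell(i, o, state, p)
print(o.shape, state.shape)
```

Because the output gate lies in (0, 1) and tanh lies in (-1, 1), every entry of the cell output stays strictly between -1 and 1.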

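lstm optimization

In the cell above, update, output_gate, forget_gate and input_gate are all computed the same way,
so the four sets of parameters can be merged, computed in one pass, and then split apart again: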

values = tf.split(1, gate_count, tf.matmul(i, input_weights) + tf.matmul(o, output_weights) + bias)
input_gate = tf.sigmoid(values[0])
forget_gate = tf.sigmoid(values[1])
update = values[2]

The output of the lstm-cell is then passed through a WX+b layer to produce the final output

For the implementation, see singlew_lstm.py
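
Why merging the parameters is safe can be checked numerically. The NumPy sketch below is an illustration under assumptions (random weights, toy sizes, np.split standing in for tf.split): it concatenates four separate parameter sets column-wise and confirms that one merged matrix product yields the same four pre-activations.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, vocabulary_size, num_nodes, gate_count = 2, 27, 4, 4

# Four separate (input, recurrent, bias) parameter sets, one per gate or update.
xs = [rng.normal(size=(vocabulary_size, num_nodes)) for _ in range(gate_count)]
ms = [rng.normal(size=(num_nodes, num_nodes)) for _ in range(gate_count)]
bs = [rng.normal(size=(1, num_nodes)) for _ in range(gate_count)]

# Merge them column-wise so one matmul produces all four pre-activations.
input_weights = np.concatenate(xs, axis=1)    # vocabulary_size x 4*num_nodes
output_weights = np.concatenate(ms, axis=1)   # num_nodes x 4*num_nodes
bias = np.concatenate(bs, axis=1)             # 1 x 4*num_nodes

i = rng.normal(size=(batch_size, vocabulary_size))
o = rng.normal(size=(batch_size, num_nodes))

values = np.split(i @ input_weights + o @ output_weights + bias, gate_count, axis=1)
separate = [i @ x + o @ m + b for x, m, b in zip(xs, ms, bs)]
print(all(np.allclose(v, s) for v, s in zip(values, separate)))
```

The merged form replaces eight small matrix products with two larger ones, which is typically friendlier to the hardware.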

Optimizer

One-hot encoding is used for the label prediction
Cross-entropy is used to compute the loss
Learning rate decay is introduced
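
The text does not say which decay schedule is used, so the following is only a sketch of one common choice, exponential decay. The function name exponential_decay and every number in the example (initial rate 10.0, decay every 5000 steps, factor 0.1) are assumptions for illustration.

```python
def exponential_decay(initial_rate, global_step, decay_steps, decay_rate, staircase=True):
    """Learning rate after global_step updates."""
    exponent = global_step / decay_steps
    if staircase:
        exponent = global_step // decay_steps  # decay in discrete jumps
    return initial_rate * decay_rate ** exponent

for step in (0, 4999, 5000, 10000):
    print(step, exponential_decay(10.0, step, decay_steps=5000, decay_rate=0.1))
```

With staircase=True the rate stays constant inside each block of decay_steps updates and drops by the factor decay_rate at each boundary.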

Flow

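Feed the training data into the placeholders
The accuracy on the validation set is measured with logprob, i.e. the logarithm of the predicted probability
Every 10 training steps, 5 letters are picked at random as starting points for a sentence-generation test
You may notice that the output sentence is made of sampled characters rather than the most probable ones. This is because always taking the most probable character would end up repeating it over and over

For the implementation, see lstm.py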
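
Both ideas from this section, the log-probability score and sampling instead of taking the maximum, fit in a short plain-Python sketch. The helper names logprob, sample and argmax and the toy distribution are assumptions for illustration, not the functions of lstm.py.

```python
import math
import random

def logprob(predictions, labels):
    """Average negative log-probability assigned to the true labels."""
    total = 0.0
    for probs, label in zip(predictions, labels):
        total += -math.log(max(probs[label], 1e-10))  # clip to avoid log(0)
    return total / len(labels)

def sample(probs, rng):
    """Draw an index at random, weighted by the predicted probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def argmax(probs):
    return max(range(len(probs)), key=lambda k: probs[k])

probs = [0.5, 0.3, 0.2]
rng = random.Random(0)
greedy = [argmax(probs) for _ in range(10)]        # always the same index
sampled = [sample(probs, rng) for _ in range(10)]  # drawn from the distribution
print(greedy, sampled)
```

The greedy list never changes for a fixed distribution, which mirrors the repetition problem described above; sampling still favours likely characters but lets less likely ones appear.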

Beam Search

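In the flow above, each step works on a single character. More characters can be used per prediction, taking the candidate with the highest probability, to avoid misjudgments caused by special cases

Here the number of characters is increased to 2, forming bigrams; for the code, see bigram_lstm.py

This is mainly implemented by the BigramBatchGenerator class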
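
One way to turn character pairs into ids is sketched below. It is an assumption for illustration, not the BigramBatchGenerator of bigram_lstm.py: the helpers bigram2id, id2bigram and text_to_bigram_ids are hypothetical, space is given id 0, and the text is cut into non-overlapping pairs.

```python
import string

alphabet = " " + string.ascii_lowercase   # 27 characters, space first
vocabulary_size = len(alphabet)

def char2id(ch):
    return alphabet.index(ch)

def bigram2id(bigram):
    """Encode two characters as one id in [0, 27 * 27)."""
    return char2id(bigram[0]) * vocabulary_size + char2id(bigram[1])

def id2bigram(idx):
    """Inverse of bigram2id."""
    return alphabet[idx // vocabulary_size] + alphabet[idx % vocabulary_size]

def text_to_bigram_ids(text):
    """Split text into non-overlapping character pairs and encode each pair."""
    return [bigram2id(text[k:k + 2]) for k in range(0, len(text) - 1, 2)]

ids = text_to_bigram_ids("the cat")
print(ids, [id2bigram(i) for i in ids])
```

With 27 characters there are 27 * 27 = 729 possible bigrams, so the ids run from 0 to 728.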

Embedding look up

With bigrams, vocabulary_size becomes 27*27, so predicting with one-hot encoding produces a very sparse matrix, which wastes computation and is slow

embedding_lookup is therefore introduced; for the code, see embed_bigram_lstm.py

Data input: BatchGenerator no longer produces one-hot encoded vectors as input, but directly produces the list of indices corresponding to the bigrams
The embedding lookup adjusts the embedding so that each bigram corresponds to a vector
The result of the embedding lookup is simply fed to the lstm cell
On the output side, both the label and the output need to be converted to one-hot encoding before the loss can be computed with cross-entropy and softmax
Converting data to one-hot encoding inside the graph relies mainly on the tf.gather function
Converting the valid data relies mainly on the one_hot_voc function
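
The saving from the lookup can be seen by comparing it with the one-hot product it replaces. The plain-Python sketch below uses hypothetical helpers (one_hot, matvec) and a made-up embedding table; it only illustrates that both routes return the same vector.

```python
def one_hot(index, size):
    return [1.0 if k == index else 0.0 for k in range(size)]

def matvec(row, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(r * m for r, m in zip(row, column)) for column in zip(*matrix)]

bigram_vocabulary_size, embedding_size = 27 * 27, 4
embeddings = [[float(r * embedding_size + c) for c in range(embedding_size)]
              for r in range(bigram_vocabulary_size)]

bigram_id = 42
# Dense route: 729 multiply-adds per output column, almost all of them with 0.
dense = matvec(one_hot(bigram_id, bigram_vocabulary_size), embeddings)
# Lookup route: a single row read.
looked_up = embeddings[bigram_id]
print(dense == looked_up)
```

Multiplying by a one-hot vector only ever selects one row, so indexing the row directly gives the same result without the wasted arithmetic.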

Drop out

Apply dropout to the input and output in the lstm cell
Refer to this article
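
As a reminder of what dropout does to a vector of activations, here is a plain-Python sketch of the usual "inverted" form, in which surviving values are scaled by 1 / keep_prob. The function name dropout and the numbers are assumptions for illustration; in the setup described above it would be applied to the cell's input and output only.

```python
import random

def dropout(values, keep_prob, rng):
    """Zero each value with probability 1 - keep_prob; scale survivors by 1 / keep_prob."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)
activations = [1.0] * 10000
dropped = dropout(activations, keep_prob=0.5, rng=rng)
mean = sum(dropped) / len(dropped)
print(round(mean, 2))
```

The scaling keeps the expected value of each activation unchanged, so nothing needs to be rescaled when dropout is switched off for validation or sampling.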

Seq2Seq

The last problem is to turn every word in a sentence into its reversed string, which is a seq-to-seq conversion
The straightforward approach is word 2 vector 2 lstm 2 vector 2 word
However, tensorflow already has a model for this, Seq2SeqModel; for more on this model, see this analysis
and the tensorflow example
It is enough to generate the target sequence from each batch following the string-reversal rule and feed it to the seq2seqmodel; this relies mainly on the rev_id function
For the implementation, see seq2seq.py
Note that with Seq2SeqModel, the default size and num_layer make training converge before the correct rule is learned, so I increased them a little

def create_model(sess, forward_only):
    model = seq2seq_model.Seq2SeqModel(source_vocab_size=vocabulary_size,
                                       target_vocab_size=vocabulary_size,
                                       buckets=[(20, 21)],
                                       size=256,
                                       num_layers=4,
                                       max_gradient_norm=5.0,
                                       batch_size=batch_size,
                                       learning_rate=1.0,
                                       learning_rate_decay_factor=0.9,
                                       use_lstm=True,
                                       forward_only=forward_only)
    return model

Parameter meanings
- source_vocab_size: size of the source vocabulary.
- target_vocab_size: size of the target vocabulary.
- buckets: a list of pairs (I, O), where I specifies maximum input length
  that will be processed in that bucket, and O specifies maximum output
  length. Training instances that have inputs longer than I or outputs
  longer than O will be pushed to the next bucket and padded accordingly.
  We assume that the list is sorted, e.g., [(2, 4), (8, 16)].
- size: number of units in each layer of the model.
- num_layers: number of layers in the model.
- max_gradient_norm: gradients will be clipped to maximally this norm.
- batch_size: the size of the batches used during training;
  the model construction is independent of batch_size, so it can be
  changed after initialization if this is convenient, e.g., for decoding.
- learning_rate: learning rate to start with.
- learning_rate_decay_factor: decay learning rate by this much when needed.
- use_lstm: if true, we use LSTM cells instead of GRU cells.
- num_samples: number of samples for sampled softmax.
- forward_only: if set, we do not construct the backward pass in the model.
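
The target sequences for this task, each word of the source reversed in place, can be produced with a one-line rule. The sketch below works on plain strings and uses a hypothetical helper reverse_words; it is not the rev_id function of seq2seq.py, which operates on ids.

```python
def reverse_words(sentence):
    """Reverse the characters of each space-separated word, keeping word order."""
    return " ".join(word[::-1] for word in sentence.split(" "))

source = "the quick brown fox"
target = reverse_words(source)
print(target)  # eht kciuq nworb xof
```

Applying the rule twice returns the original sentence, which makes it easy to check what the trained model outputs.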