RNN, LSTM & GRU

Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM) & Gated Recurrent Unit (GRU)

A recurrent neural network is a type of artificial neural network where connections between nodes form cycles along a temporal sequence. This allows the network to exhibit temporal dynamic behavior and to process sequences of inputs.
Three variants are covered here: the simple (vanilla) RNN, the gated recurrent unit (GRU) and the long short-term memory unit (LSTM).

RNN – Recurrent Neural Network


Notation

x_t : input vector (m \times 1).
h_t : hidden layer vector (n \times 1).
o_t : output vector (k \times 1).
b_h , b_o : bias vectors (n \times 1 and k \times 1).
U : input-to-hidden parameter matrix (n \times m).
V : hidden-to-hidden parameter matrix (n \times n).
W : hidden-to-output parameter matrix (k \times n).
\sigma_h, \sigma_y : activation functions.

Feed-Forward

    \[ h_t=\sigma_h(i_t)=\sigma_h(U_hx_t+V_hh_{t-1}+b_h) \]

    \[ o_t=\sigma_y(a_t)=\sigma_y(W_yh_t+b_o) \]
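
As a concrete illustration of the two feed-forward equations, here is a minimal NumPy sketch of a single RNN step. The dimensions m, n, k, the function name rnn_step, and the choice of \tanh for \sigma_h and softmax for \sigma_y are illustrative assumptions, not part of the text above.

    import numpy as np

    m, n, k = 4, 8, 3                         # input, hidden and output sizes (assumed)
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n, m))    # input-to-hidden weights
    V = rng.normal(scale=0.1, size=(n, n))    # hidden-to-hidden weights
    W = rng.normal(scale=0.1, size=(k, n))    # hidden-to-output weights
    b_h, b_o = np.zeros(n), np.zeros(k)       # biases

    def rnn_step(x_t, h_prev):
        """One step: h_t = tanh(U x_t + V h_{t-1} + b_h), o_t = softmax(W h_t + b_o)."""
        h_t = np.tanh(U @ x_t + V @ h_prev + b_h)
        logits = W @ h_t + b_o
        o_t = np.exp(logits - logits.max())   # softmax chosen as sigma_y (assumption)
        o_t /= o_t.sum()
        return h_t, o_t

    # Unroll over a short random input sequence.
    h = np.zeros(n)
    for x in rng.normal(size=(5, m)):
        h, o = rnn_step(x, h)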

Backpropagation

    \[ \Pi_t= \frac{\partial E_t}{\partial o_t} \frac{\partial o_t}{\partial h_t}+ \frac{\partial h_{t+1}}{\partial h_t} \Pi_{t+1} \]

    \[ \beta_t^U=\beta_{t+1}^U+\Pi_t \frac{\partial h_t}{\partial U_t} \]

    \[ \beta_t^V=\beta_{t+1}^V+\Pi_t \frac{\partial h_t}{\partial V_t} \]

    \[ \beta_t^W=\beta_{t+1}^W+ \frac{\partial E_t}{\partial o_t} \frac{\partial o_t}{\partial W_t} \]

    \[ \frac{\partial E}{\partial X} \equiv \beta_0^X \quad \text{for } X \in \{U,V,W\} \]
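
The \Pi_t and \beta_t recursions can be read off almost line by line from a hand-written backpropagation-through-time loop. The NumPy sketch below assumes an identity output activation and a squared-error loss E_t = \frac{1}{2}\|o_t-y_t\|^2 purely to keep the code short; bias gradients are omitted. The variable dh plays the role of \Pi_t, and dU, dV, dW play the role of the \beta accumulators.

    import numpy as np

    m, n, k, T = 4, 8, 3, 5
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n, m))
    V = rng.normal(scale=0.1, size=(n, n))
    W = rng.normal(scale=0.1, size=(k, n))
    b_h, b_o = np.zeros(n), np.zeros(k)
    xs, ys = rng.normal(size=(T, m)), rng.normal(size=(T, k))

    # Forward pass, caching hidden states for the backward pass.
    hs, os = [np.zeros(n)], []
    for x in xs:
        hs.append(np.tanh(U @ x + V @ hs[-1] + b_h))
        os.append(W @ hs[-1] + b_o)

    # Backward pass (backpropagation through time).
    dU, dV, dW = np.zeros_like(U), np.zeros_like(V), np.zeros_like(W)
    dh_next = np.zeros(n)                    # contribution of step t+1 to Pi_t
    for t in reversed(range(T)):
        do = os[t] - ys[t]                   # dE_t/do_t for the assumed squared error
        dW += np.outer(do, hs[t + 1])        # beta^W accumulation
        dh = W.T @ do + dh_next              # Pi_t
        dpre = (1.0 - hs[t + 1] ** 2) * dh   # back through the tanh nonlinearity
        dU += np.outer(dpre, xs[t])          # beta^U accumulation
        dV += np.outer(dpre, hs[t])          # beta^V accumulation
        dh_next = V.T @ dpre                 # (dh_t/dh_{t-1})^T Pi_t, carried to step t-1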

Example
  • Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification).
  • Sequence output (e.g. image captioning takes an image and outputs a sentence of words).
  • Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment).
  • Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French).
  • Synced sequence input and output (e.g. video classification where we wish to label each frame of the video).

LSTM – Long Short-Term Memory


Notation

h_t : hidden state vector.
C_t : cell state vector.
x_t : input vector.
b_f , b_i , b_c , b_o : bias vectors.
W_f , W_i , W_c , W_o : parameter matrices.
\sigma , \tanh : activation functions.

Feed-Forward

    \[ f_t=\sigma(W_f\cdot[h_{t-1},x_t]+b_f) \]

    \[ i_t=\sigma(W_i\cdot[h_{t-1},x_t]+b_i) \]

    \[ o_t=\sigma(W_o\cdot[h_{t-1},x_t]+b_o) \]

    \[ \tilde{C}_t=\tanh(W_c\cdot[h_{t-1},x_t]+b_c) \]

    \[ C_t=f_t\odot C_{t-1}+i_t\odot\tilde{C}_t \]

    \[ h_t=o_t\odot\tanh(C_t) \]
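
A minimal NumPy sketch of one LSTM step following these six equations; the concatenation [h_{t-1}, x_t] is implemented with np.concatenate, \odot is elementwise multiplication, and all sizes and names are illustrative assumptions.

    import numpy as np

    m, n = 4, 8                               # input and hidden sizes (assumed)
    rng = np.random.default_rng(0)
    W_f, W_i, W_c, W_o = (rng.normal(scale=0.1, size=(n, n + m)) for _ in range(4))
    b_f, b_i, b_c, b_o = (np.zeros(n) for _ in range(4))

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def lstm_step(x_t, h_prev, C_prev):
        z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
        f_t = sigmoid(W_f @ z + b_f)          # forget gate
        i_t = sigmoid(W_i @ z + b_i)          # input gate
        o_t = sigmoid(W_o @ z + b_o)          # output gate
        C_tilde = np.tanh(W_c @ z + b_c)      # candidate cell state
        C_t = f_t * C_prev + i_t * C_tilde    # cell state update
        h_t = o_t * np.tanh(C_t)              # hidden state update
        return h_t, C_t

    h, C = np.zeros(n), np.zeros(n)
    for x in rng.normal(size=(5, m)):
        h, C = lstm_step(x, h, C)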

Backpropagation

    \begin{align*} \frac{\partial C_{t+1}}{\partial h_t}&= \frac{\partial C_{t+1}}{\partial \tilde{C}_{t+1}} \frac{\partial \tilde{C}_{t+1}}{\partial h_t}+ \frac{\partial C_{t+1}}{\partial f_{t+1}} \frac{\partial f_{t+1}}{\partial h_t}+ \frac{\partial C_{t+1}}{\partial i_{t+1}} \frac{\partial i_{t+1}}{\partial h_t}\\ \frac{\partial C_{t+1}}{\partial C_t}&=f_{t+1}\\ \frac{\partial h_{t+1}}{\partial C_t}&= \frac{\partial h_{t+1}}{\partial C_{t+1}} \frac{\partial C_{t+1}}{\partial C_t}\\ \frac{\partial h_{t+1}}{\partial h_t}&= \frac{\partial h_{t+1}}{\partial C_{t+1}} \frac{\partial C_{t+1}}{\partial h_t}+ \frac{\partial h_{t+1}}{\partial o_{t+1}} \frac{\partial o_{t+1}}{\partial h_t} \end{align*}

    \[ \Pi_t= \frac{\partial E_t}{\partial h_t}+ \frac{\partial h_{t+1}}{\partial h_t} \Pi_{t+1}+ \frac{\partial C_{t+1}}{\partial h_t} \mathcal{T}_{t+1} \]

    \[ \mathcal{T}_t= \frac{\partial E_t}{\partial h_t} \frac{\partial h_t}{\partial C_t}+ \frac{\partial h_{t+1}}{\partial C_t} \Pi_{t+1}+ \frac{\partial C_{t+1}}{\partial C_t} \mathcal{T}_{t+1} \]

    \[ \beta_t^f=\beta_{t+1}^f+ \frac{\partial C_t}{\partial f_t} \frac{\partial f_t}{\partial W_t^f} ( \frac{\partial h_t}{\partial C_t} \Pi_t + \mathcal{T}_t ) \]

    \[ \beta_t^i=\beta_{t+1}^i+ \frac{\partial C_t}{\partial i_t} \frac{\partial i_t}{\partial W_t^i} ( \frac{\partial h_t}{\partial C_t} \Pi_t + \mathcal{T}_t ) \]

    \[ \beta_t^c=\beta_{t+1}^c+ \frac{\partial C_t}{\partial \tilde{C}_{t}} \frac{\partial \tilde{C}_{t}}{\partial W_t^c} ( \frac{\partial h_t}{\partial C_t} \Pi_t + \mathcal{T}_t ) \]

    \[ \beta_t^o=\beta_{t+1}^o+ \frac{\partial h_t}{\partial o_t} \frac{\partial o_t}{\partial W_t^o} ( \Pi_t ) \]
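
Recursions like these are easy to get wrong when coded by hand, so a standard sanity check is to compare analytic gradients against finite differences. The sketch below shows only the finite-difference side: it unrolls a tiny LSTM (biases omitted for brevity), perturbs a single entry of W_f and estimates \partial E/\partial W_f numerically; an implementation of the \beta^f accumulation should match this value to several decimal places. The loss choice and all names are assumptions made for the example.

    import numpy as np

    m, n, T = 3, 4, 3
    rng = np.random.default_rng(0)
    params = {name: rng.normal(scale=0.1, size=(n, n + m))
              for name in ["W_f", "W_i", "W_c", "W_o"]}
    xs = rng.normal(size=(T, m))
    target = rng.normal(size=n)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def loss(p):
        """Unroll the LSTM for T steps; return E = 1/2 * ||h_T - target||^2 (assumed loss)."""
        h, C = np.zeros(n), np.zeros(n)
        for x in xs:
            z = np.concatenate([h, x])
            f, i, o = sigmoid(p["W_f"] @ z), sigmoid(p["W_i"] @ z), sigmoid(p["W_o"] @ z)
            C = f * C + i * np.tanh(p["W_c"] @ z)
            h = o * np.tanh(C)
        return 0.5 * np.sum((h - target) ** 2)

    # Central-difference estimate of dE/dW_f[0, 0].
    eps, base = 1e-5, params["W_f"][0, 0]
    params["W_f"][0, 0] = base + eps; e_plus = loss(params)
    params["W_f"][0, 0] = base - eps; e_minus = loss(params)
    params["W_f"][0, 0] = base
    print((e_plus - e_minus) / (2 * eps))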

GRU – Gated Recurrent Unit


Notation

h_t : hidden layer vector.
x_t : input vector.
b_z , b_r , b_h : bias vectors.
W_z , W_r , W_h : parameter matrices.
\sigma , \tanh : activation functions.

Feed-Forward

    \[ z_t=\sigma(W_z \cdot[h_{t-1},x_t]+b_z) \]

    \[ r_t=\sigma(W_r \cdot [h_{t-1},x_t]+b_r) \]

    \[ \tilde{h}_t=\tanh(W_h\cdot[r_t \odot h_{t-1},x_t]+b_h) \]

    \[ h_t=(1-z_t)\odot h_{t-1}+z_t\odot \tilde{h}_t \]
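
A minimal NumPy sketch of one GRU step following these four equations; note that the reset gate r_t multiplies h_{t-1} before the candidate state is computed. Sizes and names are illustrative assumptions.

    import numpy as np

    m, n = 4, 8                               # input and hidden sizes (assumed)
    rng = np.random.default_rng(0)
    W_z, W_r, W_h = (rng.normal(scale=0.1, size=(n, n + m)) for _ in range(3))
    b_z, b_r, b_h = (np.zeros(n) for _ in range(3))

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def gru_step(x_t, h_prev):
        zx = np.concatenate([h_prev, x_t])                                  # [h_{t-1}, x_t]
        z_t = sigmoid(W_z @ zx + b_z)                                       # update gate
        r_t = sigmoid(W_r @ zx + b_r)                                       # reset gate
        h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)  # candidate state
        return (1.0 - z_t) * h_prev + z_t * h_tilde                         # new hidden state

    h = np.zeros(n)
    for x in rng.normal(size=(5, m)):
        h = gru_step(x, h)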

Backpropagation

    \[ \frac{\partial h_{t+1}}{\partial h_t}= (1-z_{t+1})+ \frac{\partial h_{t+1}}{\partial z_{t+1}} \frac{\partial z_{t+1}}{\partial h_t}+ \frac{\partial h_{t+1}}{\partial \tilde{h}_{t+1}} ( \frac{\partial \tilde{h}_{t+1}}{\partial h_t}+ \frac{\partial \tilde{h}_{t+1}}{\partial r_{t+1}} \frac{\partial r_{t+1}}{\partial h_t} ) \]

    \[ \Pi_t= \frac{\partial E_t}{\partial h_t}+ \frac{\partial h_{t+1}}{\partial h_t} \Pi_{t+1} \]

    \[ \beta_t^z=\beta_{t+1}^z+ \frac{\partial h_t}{\partial z_t} \frac{\partial z_t}{\partial W_t^z} \Pi_t \]

    \[ \beta_t^r=\beta_{t+1}^r+ \frac{\partial h_t}{\partial \tilde{h}_t} \frac{\partial \tilde{h}_t}{\partial r_t} \frac{\partial r_t}{\partial W_t^r} \Pi_t \]

    \[ \beta_t^h=\beta_{t+1}^h+ \frac{\partial h_t}{\partial \tilde{h}_t} \frac{\partial \tilde{h}_t}{\partial W_t^h} \Pi_t \]
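
For reference, the simpler factors appearing in these expressions follow directly from the feed-forward equations when written elementwise (the elementwise form here is an explicit rewriting, not notation used above):

    \[ \frac{\partial h_t}{\partial z_t}=\tilde{h}_t-h_{t-1},\qquad \frac{\partial h_t}{\partial \tilde{h}_t}=z_t,\qquad \sigma'(a)=\sigma(a)\odot(1-\sigma(a)),\qquad \tanh'(a)=1-\tanh^2(a) \]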

