RNN, LSTM & GRU

Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM) & Gated Recurrent Unit (GRU)

A recurrent neural network is a type of artificial neural network where connections between nodes form cycles along a temporal sequence. This allows the network to exhibit temporal dynamic behavior and to process sequences of inputs.
Three variants are covered here: the simple (vanilla) RNN, the gated recurrent unit (GRU) and the long short-term memory unit (LSTM).

RNN – Recurrent Neural Network


Notation

x_t : input vector (m \times 1).
h_t : hidden layer vector (n \times 1).
o_t : output vector (k \times 1).
b_h , b_o : bias vectors (n \times 1 and k \times 1).
U : input-to-hidden parameter matrix (n \times m).
V : hidden-to-hidden parameter matrix (n \times n).
W : hidden-to-output parameter matrix (k \times n).
\sigma_h, \sigma_y : activation functions.

Feed-Forward

    \[ h_t=\sigma_h(i_t)=\sigma_h(U_hx_t+V_hh_{t-1}+b_h) \]

    \[ o_t=\sigma_y(a_t)=\sigma_y(W_yh_t+b_o) \]
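
As a concrete illustration of the two feed-forward equations, here is a minimal NumPy sketch of a single RNN step. The dimensions m, n, k, the function name rnn_step, and the choice of \tanh for \sigma_h and softmax for \sigma_y are illustrative assumptions, not part of the text above.

    import numpy as np

    m, n, k = 4, 8, 3                         # input, hidden and output sizes (assumed)
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n, m))    # input-to-hidden weights
    V = rng.normal(scale=0.1, size=(n, n))    # hidden-to-hidden weights
    W = rng.normal(scale=0.1, size=(k, n))    # hidden-to-output weights
    b_h, b_o = np.zeros(n), np.zeros(k)       # biases

    def rnn_step(x_t, h_prev):
        """One step: h_t = tanh(U x_t + V h_{t-1} + b_h), o_t = softmax(W h_t + b_o)."""
        h_t = np.tanh(U @ x_t + V @ h_prev + b_h)
        logits = W @ h_t + b_o
        o_t = np.exp(logits - logits.max())   # softmax chosen as sigma_y (assumption)
        o_t /= o_t.sum()
        return h_t, o_t

    # Unroll over a short random input sequence.
    h = np.zeros(n)
    for x in rng.normal(size=(5, m)):
        h, o = rnn_step(x, h)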

Backpropagation

    \[ \Pi_t= \frac{\partial E_t}{\partial o_t} \frac{\partial o_t}{\partial h_t}+ \frac{\partial h_{t+1}}{\partial h_t} \Pi_{t+1} \]

    \[ \beta_t^U=\beta_{t+1}^U+\Pi_t \frac{\partial h_t}{\partial U_t} \]

    \[ \beta_t^V=\beta_{t+1}^V+\Pi_t \frac{\partial h_t}{\partial V_t} \]

    \[ \beta_t^W=\beta_{t+1}^W+ \frac{\partial E_t}{\partial o_t} \frac{\partial o_t}{\partial W_t} \]

    \[ \frac{\partial E}{\partial X} \equiv \beta_0^X \quad \text{for } X \in \{U,V,W\} \]
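
The \Pi_t and \beta_t recursions can be read off almost line by line from a hand-written backpropagation-through-time loop. The NumPy sketch below assumes an identity output activation and a squared-error loss E_t = \frac{1}{2}\|o_t-y_t\|^2 purely to keep the code short; bias gradients are omitted. The variable dh plays the role of \Pi_t, and dU, dV, dW play the role of the \beta accumulators.

    import numpy as np

    m, n, k, T = 4, 8, 3, 5
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n, m))
    V = rng.normal(scale=0.1, size=(n, n))
    W = rng.normal(scale=0.1, size=(k, n))
    b_h, b_o = np.zeros(n), np.zeros(k)
    xs, ys = rng.normal(size=(T, m)), rng.normal(size=(T, k))

    # Forward pass, caching hidden states for the backward pass.
    hs, os = [np.zeros(n)], []
    for x in xs:
        hs.append(np.tanh(U @ x + V @ hs[-1] + b_h))
        os.append(W @ hs[-1] + b_o)

    # Backward pass (backpropagation through time).
    dU, dV, dW = np.zeros_like(U), np.zeros_like(V), np.zeros_like(W)
    dh_next = np.zeros(n)                    # contribution of step t+1 to Pi_t
    for t in reversed(range(T)):
        do = os[t] - ys[t]                   # dE_t/do_t for the assumed squared error
        dW += np.outer(do, hs[t + 1])        # beta^W accumulation
        dh = W.T @ do + dh_next              # Pi_t
        dpre = (1.0 - hs[t + 1] ** 2) * dh   # back through the tanh nonlinearity
        dU += np.outer(dpre, xs[t])          # beta^U accumulation
        dV += np.outer(dpre, hs[t])          # beta^V accumulation
        dh_next = V.T @ dpre                 # (dh_t/dh_{t-1})^T Pi_t, carried to step t-1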

Example
  • Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification).
  • Sequence output (e.g. image captioning takes an image and outputs a sentence of words).
  • Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment).
  • Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French).
  • Synced sequence input and output (e.g. video classification where we wish to label each frame of the video).

LSTM – Long Short-Term Memory


Notation

h_t : hidden state vector.
C_t : cell state vector.
x_t : input vector.
b_f , b_i , b_c , b_o : bias vectors.
W_f , W_i , W_c , W_o : parameter matrices.
\sigma , \tanh : activation functions.

Feed-Forward

    \[ f_t=\sigma(W_f\cdot[h_{t-1},x_t]+b_f) \]

    \[ i_t=\sigma(W_i\cdot[h_{t-1},x_t]+b_i) \]

    \[ o_t=\sigma(W_o\cdot[h_{t-1},x_t]+b_o) \]

    \[ \tilde{C}_t=\tanh(W_c\cdot[h_{t-1},x_t]+b_c) \]

    \[ C_t=f_t\odot C_{t-1}+i_t\odot\tilde{C}_t \]

    \[ h_t=o_t\odot\tanh(C_t) \]
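
A minimal NumPy sketch of one LSTM step following these six equations; the concatenation [h_{t-1}, x_t] is implemented with np.concatenate, \odot is elementwise multiplication, and all sizes and names are illustrative assumptions.

    import numpy as np

    m, n = 4, 8                               # input and hidden sizes (assumed)
    rng = np.random.default_rng(0)
    W_f, W_i, W_c, W_o = (rng.normal(scale=0.1, size=(n, n + m)) for _ in range(4))
    b_f, b_i, b_c, b_o = (np.zeros(n) for _ in range(4))

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def lstm_step(x_t, h_prev, C_prev):
        z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
        f_t = sigmoid(W_f @ z + b_f)          # forget gate
        i_t = sigmoid(W_i @ z + b_i)          # input gate
        o_t = sigmoid(W_o @ z + b_o)          # output gate
        C_tilde = np.tanh(W_c @ z + b_c)      # candidate cell state
        C_t = f_t * C_prev + i_t * C_tilde    # cell state update
        h_t = o_t * np.tanh(C_t)              # hidden state update
        return h_t, C_t

    h, C = np.zeros(n), np.zeros(n)
    for x in rng.normal(size=(5, m)):
        h, C = lstm_step(x, h, C)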

Backpropagation

    \begin{align*} \frac{\partial C_{t+1}}{\partial h_t}&= \frac{\partial C_{t+1}}{\partial \tilde{C}_{t+1}} \frac{\partial \tilde{C}_{t+1}}{\partial h_t}+ \frac{\partial C_{t+1}}{\partial f_{t+1}} \frac{\partial f_{t+1}}{\partial h_t}+ \frac{\partial C_{t+1}}{\partial i_{t+1}} \frac{\partial i_{t+1}}{\partial h_t}\\ \frac{\partial C_{t+1}}{\partial C_t}&=f_{t+1}\\ \frac{\partial h_{t+1}}{\partial C_t}&= \frac{\partial h_{t+1}}{\partial C_{t+1}} \frac{\partial C_{t+1}}{\partial C_t}\\ \frac{\partial h_{t+1}}{\partial h_t}&= \frac{\partial h_{t+1}}{\partial C_{t+1}} \frac{\partial C_{t+1}}{\partial h_t}+ \frac{\partial h_{t+1}}{\partial o_{t+1}} \frac{\partial o_{t+1}}{\partial h_t} \end{align*}

    \[ \Pi_t= \frac{\partial E_t}{\partial h_t}+ \frac{\partial h_{t+1}}{\partial h_t} \Pi_{t+1}+ \frac{\partial C_{t+1}}{\partial h_t} \mathcal{T}_{t+1} \]

    \[ \mathcal{T}_t= \frac{\partial E_t}{\partial h_t} \frac{\partial h_t}{\partial C_t}+ \frac{\partial h_{t+1}}{\partial C_t} \Pi_{t+1}+ \frac{\partial C_{t+1}}{\partial C_t} \mathcal{T}_{t+1} \]

    \[ \beta_t^f=\beta_{t+1}^f+ \frac{\partial C_t}{\partial f_t} \frac{\partial f_t}{\partial W_t^f} ( \frac{\partial h_t}{\partial C_t} \Pi_t + \mathcal{T}_t ) \]

    \[ \beta_t^i=\beta_{t+1}^i+ \frac{\partial C_t}{\partial i_t} \frac{\partial i_t}{\partial W_t^i} ( \frac{\partial h_t}{\partial C_t} \Pi_t + \mathcal{T}_t ) \]

    \[ \beta_t^c=\beta_{t+1}^c+ \frac{\partial C_t}{\partial \tilde{C}_{t}} \frac{\partial \tilde{C}_{t}}{\partial W_t^c} ( \frac{\partial h_t}{\partial C_t} \Pi_t + \mathcal{T}_t ) \]

    \[ \beta_t^o=\beta_{t+1}^o+ \frac{\partial h_t}{\partial o_t} \frac{\partial o_t}{\partial W_t^o} ( \Pi_t ) \]
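
Recursions like these are easy to get wrong when coded by hand, so a standard sanity check is to compare analytic gradients against finite differences. The sketch below shows only the finite-difference side: it unrolls a tiny LSTM (biases omitted for brevity), perturbs a single entry of W_f and estimates \partial E/\partial W_f numerically; an implementation of the \beta^f accumulation should match this value to several decimal places. The loss choice and all names are assumptions made for the example.

    import numpy as np

    m, n, T = 3, 4, 3
    rng = np.random.default_rng(0)
    params = {name: rng.normal(scale=0.1, size=(n, n + m))
              for name in ["W_f", "W_i", "W_c", "W_o"]}
    xs = rng.normal(size=(T, m))
    target = rng.normal(size=n)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def loss(p):
        """Unroll the LSTM for T steps; return E = 1/2 * ||h_T - target||^2 (assumed loss)."""
        h, C = np.zeros(n), np.zeros(n)
        for x in xs:
            z = np.concatenate([h, x])
            f, i, o = sigmoid(p["W_f"] @ z), sigmoid(p["W_i"] @ z), sigmoid(p["W_o"] @ z)
            C = f * C + i * np.tanh(p["W_c"] @ z)
            h = o * np.tanh(C)
        return 0.5 * np.sum((h - target) ** 2)

    # Central-difference estimate of dE/dW_f[0, 0].
    eps, base = 1e-5, params["W_f"][0, 0]
    params["W_f"][0, 0] = base + eps; e_plus = loss(params)
    params["W_f"][0, 0] = base - eps; e_minus = loss(params)
    params["W_f"][0, 0] = base
    print((e_plus - e_minus) / (2 * eps))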

GRU – Gated Recurrent Unit


Notation

h_t : hidden layer vector.
x_t : input vector.
b_z , b_r , b_h : bias vectors.
W_z , W_r , W_h : parameter matrices.
\sigma , \tanh : activation functions.

Feed-Forward

    \[ z_t=\sigma(W_z \cdot[h_{t-1},x_t]+b_z) \]

    \[ r_t=\sigma(W_r \cdot [h_{t-1},x_t]+b_r) \]

    \[ \tilde{h}_t=\tanh(W_h\cdot[r_t \odot h_{t-1},x_t]+b_h) \]

    \[ h_t=(1-z_t)\odot h_{t-1}+z_t\odot \tilde{h}_t \]
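
A minimal NumPy sketch of one GRU step following these four equations; note that the reset gate r_t multiplies h_{t-1} before the candidate state is computed. Sizes and names are illustrative assumptions.

    import numpy as np

    m, n = 4, 8                               # input and hidden sizes (assumed)
    rng = np.random.default_rng(0)
    W_z, W_r, W_h = (rng.normal(scale=0.1, size=(n, n + m)) for _ in range(3))
    b_z, b_r, b_h = (np.zeros(n) for _ in range(3))

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def gru_step(x_t, h_prev):
        zx = np.concatenate([h_prev, x_t])                                  # [h_{t-1}, x_t]
        z_t = sigmoid(W_z @ zx + b_z)                                       # update gate
        r_t = sigmoid(W_r @ zx + b_r)                                       # reset gate
        h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)  # candidate state
        return (1.0 - z_t) * h_prev + z_t * h_tilde                         # new hidden state

    h = np.zeros(n)
    for x in rng.normal(size=(5, m)):
        h = gru_step(x, h)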

Backpropagation

    \[ \frac{\partial h_{t+1}}{\partial h_t}= (1-z_{t+1})+ \frac{\partial h_{t+1}}{\partial z_{t+1}} \frac{\partial z_{t+1}}{\partial h_t}+ \frac{\partial h_{t+1}}{\partial \tilde{h}_{t+1}} ( \frac{\partial \tilde{h}_{t+1}}{\partial h_t}+ \frac{\partial \tilde{h}_{t+1}}{\partial r_{t+1}} \frac{\partial r_{t+1}}{\partial h_t} ) \]

    \[ \Pi_t= \frac{\partial E_t}{\partial h_t}+ \frac{\partial h_{t+1}}{\partial h_t} \Pi_{t+1} \]

    \[ \beta_t^z=\beta_{t+1}^z+ \frac{\partial h_t}{\partial z_t} \frac{\partial z_t}{\partial W_t^z} \Pi_t \]

    \[ \beta_t^r=\beta_{t+1}^r+ \frac{\partial h_t}{\partial \tilde{h}_t} \frac{\partial \tilde{h}_t}{\partial r_t} \frac{\partial r_t}{\partial W_t^r} \Pi_t \]

    \[ \beta_t^h=\beta_{t+1}^h+ \frac{\partial h_t}{\partial \tilde{h}_t} \frac{\partial \tilde{h}_t}{\partial W_t^h} \Pi_t \]
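
For reference, the simpler factors appearing in these expressions follow directly from the feed-forward equations when written elementwise (the elementwise form here is an explicit rewriting, not notation used above):

    \[ \frac{\partial h_t}{\partial z_t}=\tilde{h}_t-h_{t-1},\qquad \frac{\partial h_t}{\partial \tilde{h}_t}=z_t,\qquad \sigma'(a)=\sigma(a)\odot(1-\sigma(a)),\qquad \tanh'(a)=1-\tanh^2(a) \]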

