• By now, you've seen most of the building blocks of RNNs.


  • But, there are just two more ideas that let you build much more powerful models.


  • One is bidirectional RNNs, which let you, at a given point in time, take information from both earlier and later in the sequence.


  • And the second is deep RNNs, which you'll see in the next section.


  • So let's start with Bidirectional RNNs.


  • So, to motivate bidirectional RNNs, let's look at this network which you've seen a few times before in the context of named entity recognition.


  • And one of the problems with this network is that, to figure out whether the third word, Teddy, is part of a person's name, it's not enough to just look at the first part of the sentence.


  • So to tell whether \( y^{<3>} \) should be zero or one, you need more information than just the first three words, because the first three words don't tell you whether the sentence is talking about Teddy bears or about the former US president, Teddy Roosevelt.


  • [Figure: bidirectional-rnn]


  • So this is a unidirectional, or forward-only, RNN.


  • And this comment I just made is true whether these cells are standard RNN blocks, GRU units, or LSTM blocks.


  • But all of these blocks process the sequence in the forward direction only. What a bidirectional RNN, or BRNN, does is fix this issue.


  • So, a bidirectional RNN works as follows. I'm going to use a simplified example with four inputs, or maybe a four-word sentence. So we have four inputs: \( X^{<1>}, X^{<2>}, X^{<3>}, X^{<4>} \).


  • This network will have a forward recurrent component.


  • So I'm going to call these \( {\overrightarrow{a}}^{<1>}, {\overrightarrow{a}}^{<2>}, {\overrightarrow{a}}^{<3>}, {\overrightarrow{a}}^{<4>} \). Each of these four recurrent units takes the current \( X^{<t>} \) as input and feeds in to help predict \( {\hat{y}}^{<1>}, {\hat{y}}^{<2>}, {\hat{y}}^{<3>}, {\hat{y}}^{<4>} \).


  • [Figure: bidirectional-rnn]


  • Now we add a backward recurrent layer.


  • So we'd have \( {\overleftarrow{a}}^{<1>} \) to denote a backward connection, and then \( {\overleftarrow{a}}^{<2>}, {\overleftarrow{a}}^{<3>}, {\overleftarrow{a}}^{<4>} \), where the left arrow denotes that it is a backward connection.


  • And so, we're then going to connect the network up as follows. These backward activations will be connected to each other going backward in time.
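  • Concretely, if we borrow the same kind of notation used for a standard RNN block (this is a rough sketch of the recurrences, not notation introduced in this section), each direction has its own parameters: \( {\overrightarrow{a}}^{<t>} = g(W_{\overrightarrow{a}}[{\overrightarrow{a}}^{<t-1>}, x^{<t>}] + b_{\overrightarrow{a}}) \) going left to right, and \( {\overleftarrow{a}}^{<t>} = g(W_{\overleftarrow{a}}[{\overleftarrow{a}}^{<t+1>}, x^{<t>}] + b_{\overleftarrow{a}}) \) going right to left.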


  • [Figure: bidirectional-rnn]
  • So, notice that this network defines an acyclic graph. And so, given an input sequence \( X^{<1>}, X^{<2>}, X^{<3>}, X^{<4>} \), the forward sequence will first compute \( {\overrightarrow{a}}^{<1>} \), then \( {\overrightarrow{a}}^{<2>} \), then \( {\overrightarrow{a}}^{<3>} \), then \( {\overrightarrow{a}}^{<4>} \).


  • Whereas the backward sequence would start by computing \( {\overleftarrow{a}}^{<4>} \), and then go back and compute \( {\overleftarrow{a}}^{<3>} \), and so on. And note that as you're computing these activations, this is not backprop, this is forward prop.


  • But the forward prop has part of the computation going from left to right and part of the computation going from right to left in this diagram.


  • So having computed \( {\overleftarrow{a}}^{<3>} \), you can then use that activation to compute \( {\overleftarrow{a}}^{<2>} \), and then \( {\overleftarrow{a}}^{<1>} \), and finally, having computed all of the activations, you can make your predictions.


  • And so, for example, to make the predictions, your network will compute something like \( {\hat{y}}^{<t>} = g(W_y [{\overrightarrow{a}}^{<t>}, {\overleftarrow{a}}^{<t>}] + b_y) \): an activation function applied to \( W_y \) with both the forward activation at time t and the backward activation at time t being fed in to make that prediction at time t, as in the sketch below.
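  • To make this concrete, here is a minimal NumPy sketch of the forward prop just described. The parameter names (W_ax_f, W_aa_f, W_y, and so on) are illustrative, not notation from this section, and tanh and softmax are just example choices of activation function.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def brnn_forward(xs, params):
    """xs: list of input vectors x^<1> .. x^<T>, each of shape (n_x,)."""
    T = len(xs)
    n_a = params["W_aa_f"].shape[0]

    # Forward recurrence: a_fwd[t] holds a-forward^<t+1>, computed left to right.
    a_fwd, prev = [], np.zeros(n_a)
    for t in range(T):
        prev = np.tanh(params["W_ax_f"] @ xs[t] + params["W_aa_f"] @ prev + params["b_a_f"])
        a_fwd.append(prev)

    # Backward recurrence: computed right to left, from x^<T> down to x^<1>.
    a_bwd, nxt = [None] * T, np.zeros(n_a)
    for t in reversed(range(T)):
        nxt = np.tanh(params["W_ax_b"] @ xs[t] + params["W_aa_b"] @ nxt + params["b_a_b"])
        a_bwd[t] = nxt

    # Prediction at each step: y_hat^<t> = softmax(W_y [a_fwd^<t>; a_bwd^<t>] + b_y)
    return [softmax(params["W_y"] @ np.concatenate([a_fwd[t], a_bwd[t]]) + params["b_y"])
            for t in range(T)]

# Example usage with T = 4 time steps, matching the four-word example above.
rng = np.random.default_rng(0)
n_x, n_a, n_y = 10, 8, 2
params = {
    "W_ax_f": rng.normal(size=(n_a, n_x)), "W_aa_f": rng.normal(size=(n_a, n_a)), "b_a_f": np.zeros(n_a),
    "W_ax_b": rng.normal(size=(n_a, n_x)), "W_aa_b": rng.normal(size=(n_a, n_a)), "b_a_b": np.zeros(n_a),
    "W_y": rng.normal(size=(n_y, 2 * n_a)), "b_y": np.zeros(n_y),
}
xs = [rng.normal(size=n_x) for _ in range(4)]
y_hats = brnn_forward(xs, params)   # four per-time-step probability vectors
```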


  • So, if you look at the prediction at time step three, for example, then information from \( X^{<1>} \) can flow through \( {\overrightarrow{a}}^{<1>} \) to \( {\overrightarrow{a}}^{<2>} \) to \( {\overrightarrow{a}}^{<3>} \) to \( {\hat{y}}^{<3>} \).


  • So information from \( X^{<1>}, X^{<2>}, X^{<3>} \) is all taken into account, while information from \( X^{<4>} \) can flow through \( {\overleftarrow{a}}^{<4>} \) to \( {\overleftarrow{a}}^{<3>} \) to \( {\hat{y}}^{<3>} \).


  • So this allows the prediction at time step three to take as input information from the past, information from the present, which goes into both the forward and backward activations at this step, as well as information from the future.


  • So, in particular, given a phrase like, "He said, Teddy Roosevelt..."


  • To predict whether Teddy is a part of the person's name, you take into account information from the past and from the future.


  • So this is the bidirectional recurrent neural network and these blocks here can be not just the standard RNN block but they can also be GRU blocks or LSTM blocks.


  • In fact, for a lot of NLP problems, for a lot of text and natural language processing problems, a bidirectional RNN with LSTM blocks is commonly used.


  • So, if you have an NLP problem where you have the complete sentence and you're trying to label things in the sentence, a bidirectional RNN with LSTM blocks, both forward and backward, would be pretty much the first thing to try, as in the sketch below.
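  • As an illustration, here is a minimal sketch of such a tagger, assuming a TensorFlow/Keras setup; the vocabulary size, layer sizes, and number of tags below are placeholder values, not numbers from this section.

```python
import tensorflow as tf

vocab_size = 10000   # hypothetical vocabulary size
num_tags = 2         # e.g. 1 = word is part of a person's name, 0 = not

model = tf.keras.Sequential([
    # Map each word index x^<t> to a dense embedding vector.
    tf.keras.layers.Embedding(vocab_size, 64),
    # The Bidirectional wrapper runs one LSTM left to right and a second LSTM
    # right to left, then concatenates the forward and backward activations.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    # Per-time-step prediction y_hat^<t> from the concatenated activations.
    tf.keras.layers.Dense(num_tags, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```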


  • So, that's it for the bidirectional RNN. This is a modification you can make to the basic RNN architecture, or to the GRU or the LSTM, and by making this change you get a model that is able to make predictions anywhere, even in the middle of a sequence, by taking into account information potentially from the entire sequence.


  • The disadvantage of the bidirectional RNN is that you do need the entire sequence of data before you can make predictions anywhere.


  • So, for example, if you're building a speech recognition system, then the BRNN will let you take into account the entire speech utterance. But with this straightforward implementation, you need to wait for the person to stop talking, so you have the entire utterance, before you can actually process it and make a speech recognition prediction.


  • So for real-time speech recognition applications, there are somewhat more complex modules used, rather than just the standard bidirectional RNN you've seen here.


  • But for a lot of natural language processing applications, where you can get the entire sentence all at the same time, the standard BRNN algorithm is actually very effective.


  • So, that's it for BRNNs, and next, let's talk about how to take all of these ideas (RNNs, LSTMs, GRUs, and their bidirectional versions) and construct deep versions of them.