We index into the i-th row first to get the i-th training example (represented by parentheses), then the j-th column to get the j-th word (represented by the brackets).
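
As a rough illustration of this row-then-column indexing, here is a minimal sketch. The toy dataset `X` and the helper `word_at` are hypothetical names made up for this example, and it assumes the indices in the notation are 1-based while Python lists are 0-based.

```python
# Hypothetical toy dataset: each row is one training example (a sentence),
# and each position within a row is one word.
X = [
    ["the", "cat", "sat"],
    ["dogs", "bark", "loudly"],
]

def word_at(data, i, j):
    """Return the j-th word of the i-th training example (both 1-based)."""
    example = data[i - 1]   # row first: the i-th training example
    return example[j - 1]   # then column: the j-th word of that example

print(word_at(X, 2, 1))  # first word of the second example
```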
It is appropriate when every input should be matched to an output.
Yes, in a language model we try to predict the next step based on the knowledge of all prior steps.
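
One way to picture this is to list, for each step, the prior steps the model conditions on and the step it should predict. The sketch below is only an illustration: `next_step_pairs` is a hypothetical helper, and it assumes the sequence is already tokenized into a Python list.

```python
def next_step_pairs(tokens):
    """Pair every prefix of the sequence with the token that follows it."""
    pairs = []
    for t in range(1, len(tokens)):
        context = tokens[:t]   # all prior steps
        target = tokens[t]     # the next step to predict
        pairs.append((context, target))
    return pairs

pairs = next_step_pairs(["the", "cat", "sat"])
for context, target in pairs:
    print(context, "->", target)
```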
Correct, Γu is a vector of dimension equal to the number of hidden units in the LSTM.
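
The dimension claim can be checked with a small sketch. It assumes the common formulation in which the update gate is a sigmoid applied to a weight matrix times the concatenation of the previous hidden state and the current input, plus a bias. The names `W_u`, `b_u`, `a_prev`, `x_t` and the sizes `n_a`, `n_x` are assumptions chosen for this example, and the weights are random placeholders rather than trained values.

```python
import math
import random

def update_gate(W_u, b_u, a_prev, x_t):
    """Compute sigmoid(W_u [a_prev, x_t] + b_u) using plain Python lists."""
    concat = a_prev + x_t  # list concatenation stands in for [a_prev, x_t]
    gate = []
    for row, bias in zip(W_u, b_u):
        z = sum(w * v for w, v in zip(row, concat)) + bias
        gate.append(1.0 / (1.0 + math.exp(-z)))  # element-wise sigmoid
    return gate

n_a, n_x = 4, 3  # assumed sizes: 4 hidden units, 3 input features
rng = random.Random(0)
# W_u has one row per hidden unit, so the gate has one entry per hidden unit.
W_u = [[rng.uniform(-1, 1) for _ in range(n_a + n_x)] for _ in range(n_a)]
b_u = [0.0] * n_a
a_prev = [0.0] * n_a
x_t = [1.0, 0.5, -0.5]

gamma_u = update_gate(W_u, b_u, a_prev, x_t)
print(len(gamma_u))  # number of gate entries, equal to n_a
```

Because the weight matrix has one row per hidden unit, the gate ends up with one entry per hidden unit, each strictly between 0 and 1.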