original paper : https://doi.org/10.48550/arXiv.1706.03762
Word Embedding
Positional Encoding
Self Attention
The main idea of the self-attention mechanism is that it computes the similarity of each word with every other word in the sentence. During training, this helps the model learn which words are most related to each other, which in turn helps it resolve, for example, which noun a pronoun refers to.
Say the sentence is 'The pizza came out of the oven and it tasted good.' Here 'it' could refer to either the oven or the pizza. As humans we can tell that 'it' refers to the pizza, but an algorithm cannot resolve this on its own; that is where self-attention comes into the picture.
The word embedding vector is fed into three distinct fully connected layers to generate the query, key and value vectors.
The queries and keys undergo a dot-product matrix multiplication to produce a score matrix, which gives the similarity of each word with every other word. The higher the score, the more the focus.
These scores (scaled by √d_k in the paper) are passed through a softmax function to get values between 0 and 1: the highest scores are amplified and the lowest scores are suppressed. The softmax output is the attention weight matrix, which is multiplied with the value vectors to get the output vectors.
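A minimal NumPy sketch of the single-head scaled dot-product attention described above; the projection matrices and the toy sizes (5 words, d_model = 8, d_k = 4) are assumptions chosen only for illustration, not the paper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.
    X: (seq_len, d_model) word embeddings (plus positional encodings).
    W_q, W_k, W_v: weights of the three fully connected layers.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # query, key and value vectors
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # similarity of every word with every other word
    weights = softmax(scores, axis=-1)       # attention weights between 0 and 1
    return weights @ V                       # weighted sum of the value vectors

# toy example: 5 words, d_model = 8, d_k = d_v = 4 (sizes chosen only for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (5, 4)
```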
Problem with RNN
No parallelization - slow computation
Sequential processing
Forgets the beginning of a long paragraph
Vanishing or exploding gradients
Positional encoding - encodes word order in the sentence, since attention by itself is order-agnostic
Data Flow
The sentence to translate is split into words - tokenization.
Each word is given an input id according to its position in the vocabulary used for training. This id stays fixed for a given word.
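A toy sketch of the tokenize-and-look-up step; the whitespace tokenizer and the small vocabulary here are assumptions for illustration (real Transformer pipelines typically use a subword tokenizer such as BPE).

```python
# toy whitespace tokenizer and vocabulary lookup
vocab = {"<unk>": 0, "the": 1, "pizza": 2, "came": 3, "out": 4,
         "of": 5, "oven": 6, "and": 7, "it": 8, "tasted": 9, "good": 10}

def tokenize(sentence):
    return sentence.lower().replace(".", "").split()

def to_input_ids(tokens):
    # each word maps to a fixed id -- its position in the training vocabulary
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

tokens = tokenize("The pizza came out of the oven and it tasted good.")
print(to_input_ids(tokens))  # [1, 2, 3, 4, 5, 1, 6, 7, 8, 9, 10]
```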
Each word id is converted to an embedding vector, which changes over the course of training.
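A rough sketch of the embedding lookup, reusing the toy ids from the snippet above; the embedding table is a trainable parameter, only randomly initialised here, and d_model = 512 matches the base model in the paper.

```python
import numpy as np

d_model = 512          # embedding size of the base model in the paper
vocab_size = 11        # size of the toy vocabulary from the snippet above

# the embedding table is a trainable parameter; here it is only randomly initialised
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))

input_ids = [1, 2, 3, 4, 5, 1, 6, 7, 8, 9, 10]   # ids produced by the toy tokenizer above
word_embeddings = embedding_table[input_ids]     # (seq_len, d_model); these vectors change as training updates the table
print(word_embeddings.shape)                     # (11, 512)
```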
A positional encoding vector is calculated and added to the word embedding vector; it is computed once and reused for every sentence during training and inference.
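A sketch of the sinusoidal positional encoding from the paper, PE(pos, 2i) = sin(pos/10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos/10000^(2i/d_model)); because it depends only on the position and the model size, it can be computed once and reused.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from the paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)   # even dimensions
    pe[:, 1::2] = np.cos(positions / div)   # odd dimensions
    return pe

pe = positional_encoding(seq_len=11, d_model=512)
# added element-wise to the word embeddings of the same shape:
# encoded_input = word_embeddings + pe
print(pe.shape)  # (11, 512)
```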
Masked Multi Head Attention - in the decoder we don't want a word to attend to future words, so the elements above the diagonal of the similarity matrix are set to -inf before the softmax.
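A small sketch of that causal mask, implemented here by adding an upper-triangular matrix of -inf to the scores before the softmax; the 4x4 all-zero score matrix is a toy assumption just to make the effect visible.

```python
import numpy as np

def causal_mask(seq_len):
    # -inf above the diagonal, 0 elsewhere; added to the score matrix before softmax
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.zeros((4, 4))                 # toy similarity matrix (Q @ K.T / sqrt(d_k))
masked = scores + causal_mask(4)

# after softmax the -inf entries become 0, so word i cannot attend to words after it
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
# [[1.   0.   0.   0.  ]
#  [0.5  0.5  0.   0.  ]
#  [0.33 0.33 0.33 0.  ]
#  [0.25 0.25 0.25 0.25]]
```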