Original paper: https://doi.org/10.48550/arXiv.1706.03762

Word Embedding

Positional Encoding

Self Attention

The main idea of the self-attention mechanism is that it computes the similarity of each word with every other word in the sentence. During training, this helps the model learn which words are most related to each other, which in turn helps it resolve references such as which pronoun refers to which noun.

Let’s say the sentence is ‘The pizza came out of the oven and it tasted good.’ Here ‘it’ could refer to either the oven or the pizza. As humans we can tell that ‘it’ refers to the pizza, but an algorithm can’t work this out on its own. That’s where self-attention comes into the picture.

The word embedding vector is fed into 3 distinct fully connected layers to generate the query, key and value vectors.

The queries and keys undergo a dot-product matrix multiplication to generate score vectors, which give the similarity of one word with the other words. The higher the score, the more the focus.

These scores are passed through a softmax function to get values between 0 and 1: the highest scores are heightened and the lowest scores are suppressed. The output of the softmax is the attention vector, which is multiplied with the value vectors to get the output vector.
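Here is a minimal NumPy sketch of this query/key/value flow. The random projection matrices stand in for the learned fully connected layers, and the division by sqrt(d_k) is the scaling used in the original paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(embeddings, d_k=64, seed=0):
    """Scaled dot-product self-attention over one sentence.

    embeddings: (seq_len, d_model) matrix of word embeddings (+ positional encodings).
    The three random projection matrices are placeholders for the learned layers.
    """
    rng = np.random.default_rng(seed)
    d_model = embeddings.shape[1]
    W_q = rng.normal(size=(d_model, d_k))
    W_k = rng.normal(size=(d_model, d_k))
    W_v = rng.normal(size=(d_model, d_k))

    Q = embeddings @ W_q                     # queries
    K = embeddings @ W_k                     # keys
    V = embeddings @ W_v                     # values

    scores = Q @ K.T / np.sqrt(d_k)          # similarity of every word with every other word
    attention = softmax(scores, axis=-1)     # each row sums to 1
    return attention @ V                     # weighted sum of the value vectors

# toy example: a "sentence" of 5 words with embedding size 8
out = self_attention(np.random.randn(5, 8))
print(out.shape)  # (5, 64)
```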

Problems with RNNs

No parallelization - slow computation

Sequential

Forgets the beginning of a long paragraph

Vanishing or exploding gradient

Positional encoding - captures the word order in the sentence

Data Flow

The sentence to translate is split into words - tokenization

Each word is given an input id according to its position in the vocabulary (the dictionary used for training). This id stays fixed for a given word.

The words are converted to embedding vectors, which change during the training process (see the sketch below).
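A toy sketch of the tokenize → input id → embedding lookup steps. The tiny vocabulary and embedding size are made up for illustration; a real model builds a much larger vocabulary from its training data and learns the embedding table during training:

```python
import numpy as np

# hypothetical toy vocabulary: word -> fixed input id (its position in the vocabulary)
vocab = {"<unk>": 0, "the": 1, "pizza": 2, "came": 3, "out": 4, "of": 5,
         "oven": 6, "and": 7, "it": 8, "tasted": 9, "good": 10}

def tokenize(sentence):
    # naive whitespace tokenizer for illustration
    return sentence.lower().replace(".", "").split()

def to_input_ids(tokens):
    # unknown words map to the <unk> id; known words always get the same id
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

d_model = 8  # tiny embedding size for illustration
embedding_table = np.random.randn(len(vocab), d_model)  # learned, i.e. updated during training

tokens = tokenize("The pizza came out of the oven and it tasted good.")
ids = to_input_ids(tokens)
embeddings = embedding_table[ids]   # one row per word: (num_words, d_model)
print(ids)
print(embeddings.shape)
```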

The positional encoding vector is calculated and added to the word embedding vector; it is computed once and reused for every sentence during training and inference.
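A sketch of the sinusoidal positional encoding from the paper. Because it depends only on the position and the model dimension, the matrix can be computed once and added to the word embeddings of every sentence:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding (d_model assumed even).

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(seq_len)[:, None]           # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2)
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions
    pe[:, 1::2] = np.cos(angles)                # odd dimensions
    return pe

pe = positional_encoding(seq_len=50, d_model=512)
# embedded_input = word_embeddings + pe[:sentence_length]
print(pe.shape)  # (50, 512)
```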

Masked Multi-Head Attention - we don’t want a word to attend to future words, so we set the elements above the diagonal of the similarity (score) matrix to -inf before the softmax.
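A sketch of that masking step, applied to a score matrix like the one in the self-attention sketch above. Entries above the diagonal become -inf, so after the softmax they contribute 0 attention:

```python
import numpy as np

def causal_mask(scores):
    """Set the entries above the diagonal of the score matrix to -inf.

    After softmax these positions become 0, so the word at position i
    can only attend to positions <= i (no looking at future words).
    """
    seq_len = scores.shape[-1]
    upper = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True strictly above the diagonal
    return np.where(upper, -np.inf, scores)

scores = np.random.randn(4, 4)   # toy similarity matrix for a 4-word sentence
print(causal_mask(scores))
```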

https://youtu.be/bCz4OMemCcA