Consider a single Transformer encoder layer using narrow self-attention. The self-attention projection size is 1024 (i.e. the output dimensionality of each of $W_{\{q,k,v\}}$), the feed-forward projection size is 4096, and each layer has 16 attention heads.
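To make the configuration concrete, the following sketch tallies the parameters such a layer would contain, assuming the usual narrow-attention layout in which the heads split the model dimension between them, so the combined $W_q$, $W_k$, $W_v$ and the output projection $W_o$ are each $d \times d$, the feed-forward block has two projections, and biases and normalisation parameters are ignored. The variable names are illustrative, not part of the exercise.

```python
# Minimal sketch: parameter count for one encoder layer with narrow
# self-attention, ignoring biases and normalisation parameters.
d_model = 1024    # self-attention projection size
d_ff = 4096       # feed-forward projection size
n_heads = 16      # heads split d_model between them, so they do not change the count

attention = 4 * d_model * d_model      # combined W_q, W_k, W_v plus output projection W_o
feed_forward = 2 * d_model * d_ff      # d_model -> d_ff -> d_model

layer_total = attention + feed_forward
print(f"attention:    {attention:,}")      # 4,194,304
print(f"feed-forward: {feed_forward:,}")   # 8,388,608
print(f"layer total:  {layer_total:,}")    # 12,582,912
```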
We now change the encoder layer to use wide self-attention, leaving all other parameters unchanged.
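With wide self-attention, each head is usually taken to project the full model dimension rather than a $d/h$-sized slice, so every head has its own $d \times d$ query, key and value projections, and the output projection maps the concatenated $h \cdot d$ head outputs back to $d$. A sketch of the resulting count, under the same assumptions as above:

```python
# Sketch: the same layer with wide self-attention. Each head now has its own
# full d_model x d_model projections, and W_o maps n_heads * d_model back to d_model.
d_model, d_ff, n_heads = 1024, 4096, 16

qkv = 3 * n_heads * d_model * d_model     # per-head W_q, W_k, W_v
w_o = (n_heads * d_model) * d_model       # output projection over the concatenated heads
feed_forward = 2 * d_model * d_ff         # unchanged from the narrow case

layer_total = qkv + w_o + feed_forward
print(f"{layer_total:,}")                 # 75,497,472
```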
Now we consider an encoder-decoder Transformer model for English sentence compression. The encoder and decoder each have six layers, as described above, and we also define a vocabulary of 64,000 words with corresponding embeddings. Assume that the Transformer layers use narrow self-attention, that the embedding dimensionality is equal to the self-attention projection size, and that normalisation parameters can be similarly ignored.
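For reference, a rough tally of the whole model is sketched below. It assumes each decoder layer contains a self-attention block, a cross-attention block of the same size, and a feed-forward block, and that a single 64,000-by-1024 embedding table is shared between encoder and decoder and tied with the output projection; the sharing and tying choices are assumptions, not part of the problem statement.

```python
# Rough tally for the 6+6 encoder-decoder model with narrow self-attention,
# ignoring biases and normalisation parameters.
d_model, d_ff, vocab = 1024, 4096, 64_000
n_layers = 6

attention = 4 * d_model * d_model        # W_q, W_k, W_v, W_o
feed_forward = 2 * d_model * d_ff

encoder_layer = attention + feed_forward         # self-attention + feed-forward
decoder_layer = 2 * attention + feed_forward     # + cross-attention of the same size

embeddings = vocab * d_model             # assumption: one table, shared by encoder and
                                         # decoder and tied with the output projection

total = n_layers * encoder_layer + n_layers * decoder_layer + embeddings
print(f"encoder:    {n_layers * encoder_layer:,}")   # 75,497,472
print(f"decoder:    {n_layers * decoder_layer:,}")   # 100,663,296
print(f"embeddings: {embeddings:,}")                 # 65,536,000
print(f"total:      {total:,}")                      # 241,696,768
```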