Consider a single Transformer encoder layer using narrow self-attention. The self-attention projection size is 1024 (i.e. the output dimensionality of each of $W_{\{q,k,v\}}$), the feed-forward projection size is 4096, and each layer has 16 attention heads.
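To make the configuration concrete, the following sketch tallies the parameters such a layer would contain, assuming the usual narrow-attention layout in which the heads split the model dimension between them, so the combined $W_q$, $W_k$, $W_v$ and the output projection $W_o$ are each $d \times d$, the feed-forward block has two projections, and biases and normalisation parameters are ignored. The variable names are illustrative, not part of the exercise.

```python
# Minimal sketch: parameter count for one encoder layer with narrow
# self-attention, ignoring biases and normalisation parameters.
d_model = 1024    # self-attention projection size
d_ff = 4096       # feed-forward projection size
n_heads = 16      # heads split d_model between them, so they do not change the count

attention = 4 * d_model * d_model      # combined W_q, W_k, W_v plus output projection W_o
feed_forward = 2 * d_model * d_ff      # d_model -> d_ff -> d_model

layer_total = attention + feed_forward
print(f"attention:    {attention:,}")      # 4,194,304
print(f"feed-forward: {feed_forward:,}")   # 8,388,608
print(f"layer total:  {layer_total:,}")    # 12,582,912
```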
We now change the encoder layer to use wide self-attention, leaving all other parameters unchanged.
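With wide self-attention, each head is usually taken to project the full model dimension rather than a $d/h$-sized slice, so every head has its own $d \times d$ query, key and value projections, and the output projection maps the concatenated $h \cdot d$ head outputs back to $d$. A sketch of the resulting count, under the same assumptions as above:

```python
# Sketch: the same layer with wide self-attention. Each head now has its own
# full d_model x d_model projections, and W_o maps n_heads * d_model back to d_model.
d_model, d_ff, n_heads = 1024, 4096, 16

qkv = 3 * n_heads * d_model * d_model     # per-head W_q, W_k, W_v
w_o = (n_heads * d_model) * d_model       # output projection over the concatenated heads
feed_forward = 2 * d_model * d_ff         # unchanged from the narrow case

layer_total = qkv + w_o + feed_forward
print(f"{layer_total:,}")                 # 75,497,472
```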
Now we consider an encoder-decoder Transformer model for English sentence compression. The encoder and decoder each have six layers, as described above, and we also define a vocabulary of 64,000 words with corresponding embeddings. Assume that the Transformer layers use narrow self-attention, that the embedding dimensionality is equal to the self-attention projection size, and that normalisation parameters can be similarly ignored.
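For reference, a rough tally of the whole model is sketched below. It assumes each decoder layer contains a self-attention block, a cross-attention block of the same size, and a feed-forward block, and that a single 64,000-by-1024 embedding table is shared between encoder and decoder and tied with the output projection; the sharing and tying choices are assumptions, not part of the problem statement.

```python
# Rough tally for the 6+6 encoder-decoder model with narrow self-attention,
# ignoring biases and normalisation parameters.
d_model, d_ff, vocab = 1024, 4096, 64_000
n_layers = 6

attention = 4 * d_model * d_model        # W_q, W_k, W_v, W_o
feed_forward = 2 * d_model * d_ff

encoder_layer = attention + feed_forward         # self-attention + feed-forward
decoder_layer = 2 * attention + feed_forward     # + cross-attention of the same size

embeddings = vocab * d_model             # assumption: one table, shared by encoder and
                                         # decoder and tied with the output projection

total = n_layers * encoder_layer + n_layers * decoder_layer + embeddings
print(f"encoder:    {n_layers * encoder_layer:,}")   # 75,497,472
print(f"decoder:    {n_layers * decoder_layer:,}")   # 100,663,296
print(f"embeddings: {embeddings:,}")                 # 65,536,000
print(f"total:      {total:,}")                      # 241,696,768
```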