Decompose P(y, z | x) using the chain rule:
$$ P(y, z | x) = P(y | x, z)P(z | x) = P(|y|, |z| \space|\space x) \prod_{i=1}^{|y|} P(y_i | y_1, ..., y_{i-1}, x, z) \prod_{i=1}^{|z|} P(z_i | z_1, ..., z_{i-1}, x) $$
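The first term chooses the lengths of y and z. Each word of y gets exactly one alignment variable, so |z| = |y| and this term reduces to P(|y| | x). We need to make some independence assumptions to simplify the other two terms into something we can work with: each alignment z_i depends only on the length of x, and each word y_i depends only on the word of x it is aligned to, x_{z_i}. Building the model up one factor at a time: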
$$ \textbf{Length:} \qquad P(|y|\space|\space x) $$
$$ \textbf{Length and alignment:} \qquad P(|y|\space | \space x) \prod_{i=1}^{|y|} P(z_i\space|\space|x|) $$
$$ \textbf{Full model:} \qquad P(|y|\space | \space x) \prod_{i=1}^{|y|} P(z_i\space|\space|x|)P(y_i\space|\space x_{z_i}) $$
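To make the factorization concrete, here is a minimal Python sketch that evaluates P(y, z | x) for one sentence pair. Several things in it are assumptions rather than part of the notes: the alignment term is taken to be uniform, P(z_i | |x|) = 1/|x|; the translation table `t`, the `length_prob` function, and all the numbers are hypothetical toy values; and alignment indices are 0-based, whereas the formulas count from 1.

```python
def joint_prob(y, z, x, t, length_prob):
    """P(y, z | x) = P(|y| | x) * prod_i P(z_i | |x|) * P(y_i | x_{z_i})."""
    if len(y) != len(z):
        raise ValueError("need exactly one alignment index per word of y")
    p = length_prob(len(y), x)              # length term P(|y| | x)
    for y_i, z_i in zip(y, z):
        p *= 1.0 / len(x)                   # alignment term, assumed uniform
        p *= t.get((y_i, x[z_i]), 0.0)      # translation term P(y_i | x_{z_i})
    return p


# Toy example (all values are made up for illustration).
x = ["hej", "världen"]                      # source sentence
y = ["hello", "world"]                      # target sentence
z = [0, 1]                                  # y[i] is aligned to x[z[i]]

t = {
    ("hello", "hej"): 0.9,
    ("hello", "världen"): 0.05,
    ("world", "hej"): 0.1,
    ("world", "världen"): 0.8,
}

def length_prob(n, x):
    # Hypothetical length model: half the mass on "same length as x".
    return 0.5 if n == len(x) else 0.0

p = joint_prob(y, z, x, t, length_prob)
print(p)
```

With these toy numbers the product is 0.5 × (0.5 × 0.9) × (0.5 × 0.8) = 0.09, up to floating-point rounding. Swapping the alignment to `[1, 0]` gives a much smaller value, because both translation factors become unlikely.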
<aside> 🗣️ Alternative view of this: Each training example contains a set of states (Swedish words), and a sequence of English words that we tag with those states.
</aside>
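The tagging view can be sketched in code as well. Because the full model factorizes over positions, the most probable tag for each English word can be chosen independently of the others. The sketch below assumes a uniform alignment term, so only the translation probabilities matter; the table `t` and the function name `best_alignment` are hypothetical, and indices are 0-based.

```python
def best_alignment(y, x, t):
    """For each word of y, pick the index of the x word that best explains it."""
    return [
        max(range(len(x)), key=lambda j: t.get((y_i, x[j]), 0.0))
        for y_i in y
    ]


# Toy translation table (made-up values): t[(english, swedish)] = P(english | swedish)
t = {
    ("hello", "hej"): 0.9,
    ("hello", "världen"): 0.05,
    ("world", "hej"): 0.1,
    ("world", "världen"): 0.8,
}

states = ["hej", "världen"]                 # the states: Swedish words
words = ["hello", "world"]                  # the English words to tag
tags = best_alignment(words, states, t)
print([states[j] for j in tags])
```

Ties are broken in favor of the earliest state, which is simply how `max` behaves in Python, not a property of the model.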