Use word alignments to model translation

Decompose P(y, z | x) using the chain rule:

$$ P(y, z | x) = P(y | x, z)P(z | x) = P(|y|, |z| \space|\space x) \prod_{i=1}^{|y|} P(y_i | y_1, ..., y_{i-1}, x, z) \prod_{i=1}^{|z|} P(z_i | z_1, ..., z_{i-1}, x) $$

The first factor chooses the lengths of y and z. The other two factors condition on the full history, so we need some independence assumptions to simplify them into something we can work with. We build the simplified model up in three steps:

Step 1: Draw the length of the English sentence y, conditioned on the Swedish sentence x.

$$ \textbf{Full model:} \qquad P(|y|\space|\space x) $$

Step 2: For each English position, draw a Swedish position uniformly at random, so that $P(z_i \mid |x|) = 1/|x|$. (Let $|z| = |y|$ and let $z_i$ be the position of the Swedish word aligned to $y_i$.)

$$ \textbf{Full model:} \qquad P(|y|\space | \space x) \prod_{i=1}^{|y|} P(z_i\space|\space|x|) $$

Step 3: For each English position, draw the English word $y_i$ from a word-pair translation probability conditioned on its aligned Swedish word $x_{z_i}$.

$$ \textbf{Full model:} \qquad P(|y|\space | \space x) \prod_{i=1}^{|y|} P(z_i\space|\space|x|)P(y_i|x_{z_i}) $$
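The three steps can be turned into a small scoring routine. The sketch below is illustrative only: the translation table `T`, the uniform `length_prob` placeholder, and the example sentences are all made up for this note, and positions are 0-indexed in the code whereas the formulas count from 1.

```python
import math

# Hypothetical toy table: T[(english, swedish)] = P(english | swedish).
T = {
    ("the", "huset"): 0.5,
    ("house", "huset"): 0.5,
    ("is", "är"): 1.0,
    ("small", "litet"): 1.0,
}


def length_prob(len_y, x):
    """Placeholder for P(|y| | x): uniform over lengths 1..2|x| (an assumption)."""
    max_len = 2 * len(x)
    return 1.0 / max_len if 1 <= len_y <= max_len else 0.0


def joint_log_prob(y, z, x, t=T):
    """log P(y, z | x) = log P(|y| | x) + sum_i [log P(z_i | |x|) + log P(y_i | x_{z_i})]."""
    if len(y) != len(z):
        raise ValueError("need exactly one alignment variable per English word")
    p_len = length_prob(len(y), x)
    if p_len == 0.0:
        return float("-inf")
    logp = math.log(p_len)                       # Step 1: length of y
    for y_i, z_i in zip(y, z):
        p_word = t.get((y_i, x[z_i]), 0.0)       # Step 3: P(y_i | x_{z_i})
        if p_word == 0.0:
            return float("-inf")
        logp += math.log(1.0 / len(x))           # Step 2: uniform P(z_i | |x|)
        logp += math.log(p_word)
    return logp


x = ["huset", "är", "litet"]            # Swedish source
y = ["the", "house", "is", "small"]     # English target
z = [0, 0, 1, 2]                        # z[i] = Swedish position aligned to y[i]
print(math.exp(joint_log_prob(y, z, x)))
```

With these toy numbers the joint probability is (1/6) · (1/3)^4 · 0.25 = 1/1944. Working in log space avoids underflow on longer sentences, where the product of many small factors would otherwise round to zero.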

<aside> 🗣️ Alternative view: each training example contains a set of states (the Swedish words) and a sequence of English words that we tag with those states.

</aside>
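The tagging view can be sketched in a few lines. Because the alignment prior is uniform, the factor $1/|x|$ cancels when we normalise, so the posterior over the state for $y_i$ is proportional to $P(y_i | x_j)$ alone. The table `T` and the function names below are hypothetical, chosen only for illustration, and positions are 0-indexed.

```python
# Hypothetical toy table: T[(english, swedish)] = P(english | swedish).
T = {
    ("the", "huset"): 0.4,
    ("house", "huset"): 0.6,
    ("is", "är"): 0.9,
    ("house", "är"): 0.1,
    ("small", "litet"): 1.0,
}


def alignment_posterior(y_i, x, t=T):
    """P(z_i = j | y_i, x) for every Swedish position j."""
    scores = [t.get((y_i, x_j), 0.0) for x_j in x]
    total = sum(scores)
    if total == 0.0:
        raise ValueError(f"{y_i!r} cannot be generated by any word in x")
    return [s / total for s in scores]


def tag_with_states(y, x, t=T):
    """Tag each English word with its most probable Swedish position."""
    tags = []
    for y_i in y:
        post = alignment_posterior(y_i, x, t)
        tags.append(max(range(len(x)), key=lambda j: post[j]))
    return tags


x = ["huset", "är", "litet"]            # the states
y = ["the", "house", "is", "small"]     # the sequence to tag
print(alignment_posterior("house", x))
print(tag_with_states(y, x))
```

Under this model each $z_i$ is independent of the others given x and y, so picking the best state position by position also yields the best alignment for the whole sentence.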