<aside> 🗣️ Background:

Three factors help large language modeling:

  1. More compute
  2. More training data
  3. Larger models

Can we be more rigorous about the relationship among these three factors?

Scaling laws answer this question.

</aside>
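As a concrete sketch of such a relationship, the Chinchilla-style parametric fit (Hoffmann et al., 2022) models loss as a function of parameter count N and training-token count D. The functional form is from that paper; the constants below are its reported fits, reproduced here purely for illustration, not re-derived:

```python
import math


def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted pretraining loss for N parameters trained on D tokens.

    Functional form L(N, D) = E + A / N^alpha + B / D^beta, where E is the
    irreducible loss floor. Constants are the fits reported by Hoffmann et
    al. (2022), used here for illustration only.
    """
    return E + A / N**alpha + B / D**beta


def training_flops(N, D):
    """Standard C ~= 6 * N * D approximation for training compute."""
    return 6 * N * D


# Scaling both model size and data by 10x lowers the predicted loss,
# but it can never drop below the irreducible floor E.
small = chinchilla_loss(1e9, 2e10)    # ~1B params, ~20B tokens
large = chinchilla_loss(1e10, 2e11)   # ~10B params, ~200B tokens
assert large < small
assert large > 1.69
```

The two loss terms fall off as power laws in N and D independently, which is why, at a fixed compute budget C ≈ 6ND, there is a compute-optimal trade-off between model size and data rather than "bigger model always wins".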

Scaling Laws

GPU Hardware

Double Descent