How machines learn to mimic our understanding of language

In AI engineering, the initial training of a machine learning model is referred to as "pre-training". This process requires extensive amounts of training data.

As Andrej Karpathy explains in his video on LLMs:

At a high level, internet documents are broken down into tokens (small chunks of information) that a model learns to predict in sequence. Trillions of tokens go into this process.
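To make this concrete, here is a minimal sketch of tokenisation in Python, assuming the tiktoken package is installed; the "cl100k_base" vocabulary is chosen purely as an example, and any byte-pair-encoding tokeniser would illustrate the same idea.

```python
import tiktoken

# Load a byte-pair-encoding tokeniser; "cl100k_base" is one publicly
# available vocabulary, used here only for illustration.
enc = tiktoken.get_encoding("cl100k_base")

text = "Internet documents are broken down into tokens."
token_ids = enc.encode(text)                    # the integer ids the model actually sees
chunks = [enc.decode([t]) for t in token_ids]   # the small piece of text behind each id

print(token_ids)
print(chunks)
```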

Knowing the exact list of data used to pre-train these models is virtually impossible. Obtaining informed consent from content authors for the use of their work as training data would be judicious, but it is not always practical or consistently implemented.

Data extraction, cleaning and transformation are a huge part of this process.
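As a rough illustration of what "cleaning" can mean in practice, the sketch below strips leftover HTML markup, normalises whitespace and drops exact duplicates; real pipelines are far larger, and the function and document names here are hypothetical.

```python
import re

def clean_document(text: str) -> str:
    """Hypothetical minimal cleaning: strip HTML tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)        # drop leftover HTML markup
    text = re.sub(r"\s+", " ", text).strip()    # normalise runs of whitespace
    return text

raw_docs = [
    "<p>Hello,   world!</p>",
    "Plain   text\n\nwith   gaps",
    "<p>Hello,   world!</p>",                   # an exact duplicate after cleaning
]
cleaned = [clean_document(doc) for doc in raw_docs]
deduplicated = list(dict.fromkeys(cleaned))     # remove exact duplicates, keep order

print(deduplicated)
```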

To train LLMs, these machine learning models are progressively tasked with the following (the first two objectives are sketched in code after the list):

  • Predicting the last word of a sentence
  • Guessing missing words in a sentence (through masking sentences)
  • Predicting the next sentence
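The sketch below, using PyTorch tensors and made-up token ids, shows how the first two training signals are derived from a single token sequence; the mask id and the sequence itself are invented for illustration.

```python
import torch

# A toy sequence of token ids standing in for a tokenised sentence.
tokens = torch.tensor([12, 7, 55, 3, 91, 40, 8, 22])

# Objective 1: next-word prediction.
# The input is the sequence without its last token; the target is the same
# sequence shifted by one, so every position must predict what comes next.
next_word_input = tokens[:-1]
next_word_target = tokens[1:]

# Objective 2: masked-word prediction.
# Hide a few positions behind a reserved mask id; the model must fill them in.
MASK_ID = 999                          # hypothetical id reserved for the mask token
mask = torch.tensor([False, True, False, False, True, False, False, False])
masked_input = tokens.clone()
masked_input[mask] = MASK_ID
masked_target = tokens[mask]           # only the hidden tokens are scored

print(next_word_input, next_word_target)
print(masked_input, masked_target)
```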

No manual labelling of the data is needed, and none of the model's parameters are set by hand in advance; they are learned from the data itself.

Through algorithms and large datasets, the model makes an enormous number of guesses and adjusts its parameters through round after round of computation.
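A minimal training loop, sketched here with PyTorch and a toy two-layer stand-in for a real LLM, shows that guess-and-adjust cycle: the model predicts the next token, the error is measured, and every parameter is nudged slightly to reduce it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 100

# A deliberately tiny stand-in for an LLM: an embedding layer plus an output head.
model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Toy "dataset": random token sequences standing in for real text.
tokens = torch.randint(0, vocab_size, (4, 12))
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict each next token

for step in range(100):                           # rounds and rounds of computation
    logits = model(inputs)                        # the model's current guesses
    loss = nn.functional.cross_entropy(           # how wrong the guesses are
        logits.reshape(-1, vocab_size), targets.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()                               # work out how to adjust each parameter
    optimizer.step()                              # apply the small adjustment
```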

The outcome of this stage is a base model: a network whose parameters have been set by pre-training.

The base model is then fine-tuned towards specific tasks in post-training, where many models also undergo reinforcement learning from human feedback (RLHF).