How machines learn to mimic our understanding of language

In AI engineering, the initial training of a machine learning model is referred to as "pre-training". This process requires extensive amounts of training data.

As Andrej Karpathy explains in his video on LLMs:

At a high level, internet documents are broken down into tokens (small chunks of information) that a model learns to predict in sequence. Trillions of tokens go into this process.
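To make this concrete, here is a minimal sketch of tokenisation in Python, assuming the tiktoken package is installed; the "cl100k_base" vocabulary is chosen purely as an example, and any byte-pair-encoding tokeniser would illustrate the same idea.

```python
import tiktoken

# Load a byte-pair-encoding tokeniser; "cl100k_base" is one publicly
# available vocabulary, used here only for illustration.
enc = tiktoken.get_encoding("cl100k_base")

text = "Internet documents are broken down into tokens."
token_ids = enc.encode(text)                    # the integer ids the model actually sees
chunks = [enc.decode([t]) for t in token_ids]   # the small piece of text behind each id

print(token_ids)
print(chunks)
```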

Knowing the exact list of data used to pre-train these models is virtually impossible. Obtaining informed consent from content authors for the use of their work as training data would be judicious, but it is not always practical or consistently implemented.

Data extraction, cleaning and transformation are a huge part of this process.
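As a rough illustration of what "cleaning" can mean in practice, the sketch below strips leftover HTML markup, normalises whitespace and drops exact duplicates; real pipelines are far larger, and the function and document names here are hypothetical.

```python
import re

def clean_document(text: str) -> str:
    """Hypothetical minimal cleaning: strip HTML tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)        # drop leftover HTML markup
    text = re.sub(r"\s+", " ", text).strip()    # normalise runs of whitespace
    return text

raw_docs = [
    "<p>Hello,   world!</p>",
    "Plain   text\n\nwith   gaps",
    "<p>Hello,   world!</p>",                   # an exact duplicate after cleaning
]
cleaned = [clean_document(doc) for doc in raw_docs]
deduplicated = list(dict.fromkeys(cleaned))     # remove exact duplicates, keep order

print(deduplicated)
```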

To train LLMs, these machine learning models are progressively tasked with the following (the first two objectives are sketched in code after the list):

  • Predicting the last word of a sentence
  • Guessing missing words in a sentence (through masking sentences)
  • Predicting the next sentence
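The sketch below, using PyTorch tensors and made-up token ids, shows how the first two training signals are derived from a single token sequence; the mask id and the sequence itself are invented for illustration.

```python
import torch

# A toy sequence of token ids standing in for a tokenised sentence.
tokens = torch.tensor([12, 7, 55, 3, 91, 40, 8, 22])

# Objective 1: next-word prediction.
# The input is the sequence without its last token; the target is the same
# sequence shifted by one, so every position must predict what comes next.
next_word_input = tokens[:-1]
next_word_target = tokens[1:]

# Objective 2: masked-word prediction.
# Hide a few positions behind a reserved mask id; the model must fill them in.
MASK_ID = 999                          # hypothetical id reserved for the mask token
mask = torch.tensor([False, True, False, False, True, False, False, False])
masked_input = tokens.clone()
masked_input[mask] = MASK_ID
masked_target = tokens[mask]           # only the hidden tokens are scored

print(next_word_input, next_word_target)
print(masked_input, masked_target)
```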

No manual labelling of the data is needed, and none of the model's parameters are set by hand in advance; they are learned from the data itself.

Through algorithms and large datasets, the model makes an enormous number of guesses and adjusts its parameters through round after round of computation.
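A minimal training loop, sketched here with PyTorch and a toy two-layer stand-in for a real LLM, shows that guess-and-adjust cycle: the model predicts the next token, the error is measured, and every parameter is nudged slightly to reduce it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 100

# A deliberately tiny stand-in for an LLM: an embedding layer plus an output head.
model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Toy "dataset": random token sequences standing in for real text.
tokens = torch.randint(0, vocab_size, (4, 12))
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict each next token

for step in range(100):                           # rounds and rounds of computation
    logits = model(inputs)                        # the model's current guesses
    loss = nn.functional.cross_entropy(           # how wrong the guesses are
        logits.reshape(-1, vocab_size), targets.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()                               # work out how to adjust each parameter
    optimizer.step()                              # apply the small adjustment
```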

The outcome of this stage is a base model: a network whose parameters have been set by pre-training.

The base model is then fine-tuned towards specific tasks in post-training, where many models also undergo reinforcement learning from human feedback (RLHF).