What is a Language Model (LM)?
A language model (LM) is a probabilistic model that assigns probabilities to sequences of tokens (words, characters, or subwords). Its key function is to define a probability distribution over token sequences, which lets the model judge how likely a sequence of words is to appear together and, in turn, predict what comes next.
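To make this concrete, here is a minimal Python sketch of a toy bigram model. All the probability values are invented for illustration (a real model estimates them from data), and a bigram model conditions only on the previous word, a simplification of the full chain rule discussed below:

```python
# Toy bigram language model: P(sequence) = product of P(word | previous word).
# These probabilities are made up for illustration, not learned from data.
bigram_probs = {
    ("<s>", "the"): 0.40, ("the", "cat"): 0.30, ("cat", "sat"): 0.50,
    ("<s>", "sat"): 0.01, ("sat", "the"): 0.05,
}

def sequence_prob(words):
    prob = 1.0
    prev = "<s>"  # start-of-sequence marker
    for word in words:
        prob *= bigram_probs.get((prev, word), 1e-6)  # small floor for unseen pairs
        prev = word
    return prob

print(sequence_prob(["the", "cat", "sat"]))  # natural order -> higher probability
print(sequence_prob(["sat", "the", "cat"]))  # scrambled order -> lower probability
```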
Here’s how it works in more detail:
How Do Language Models Generate Text?
Consider a sequence of tokens X1, X2, X3, ..., XL drawn from a vocabulary V. The probability of the whole sequence, and hence of a word coming after "The monsoon rains have", can be decomposed using the chain rule of probability:

P(X1, X2, ..., XL) = P(X1) · P(X2 | X1) · P(X3 | X1, X2) · ... · P(XL | X1, ..., X(L-1))
Example:
Let’s take the partial sentence “The monsoon rains have” and calculate the probability of the next word, say X5, coming after this phrase using the chain rule. Assume that:
X1 = "The"
X2 = "monsoon"
X3 = "rains"
X4 = "have"
Now, the probability of the sequence "The monsoon rains have X5" is given by:

P(X1, X2, X3, X4, X5) = P(X1) · P(X2 | X1) · P(X3 | X1, X2) · P(X4 | X1, X2, X3) · P(X5 | X1, X2, X3, X4)
In practical terms, this is how we compute the probability of a word X5 coming after the sequence "The monsoon rains have" (a short code sketch after the list works through the arithmetic):
- P(X1): The probability of the first word being “The”.
- P(X2 | X1): The probability of the second word being “monsoon” given that the first word is “The”.
- P(X3 | X1, X2): The probability of the third word being “rains” given the first two words are “The monsoon”.
- P(X4 | X1, X2, X3): The probability of the fourth word being “have” given the previous words are “The monsoon rains”.
- P(X5 | X1, X2, X3, X4): The probability of the next word X5 given the preceding sequence "The monsoon rains have".
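Here is a minimal Python sketch that multiplies these factors together. Every probability below is a hypothetical value chosen only to illustrate the arithmetic, not the output of a real model:

```python
# Chain rule: P(X1..X5) = P(X1) * P(X2|X1) * P(X3|X1,X2) * P(X4|X1..X3) * P(X5|X1..X4).
# All factor values are hypothetical, chosen only to demonstrate the product.
factors = [
    ('P("The")', 0.20),
    ('P("monsoon" | "The")', 0.05),
    ('P("rains" | "The monsoon")', 0.40),
    ('P("have" | "The monsoon rains")', 0.30),
    ('P("started" | "The monsoon rains have")', 0.60),
]

sequence_prob = 1.0
for name, p in factors:
    print(f"{name} = {p}")
    sequence_prob *= p

print(f'P("The monsoon rains have started") = {sequence_prob:.6f}')
```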
Illustration with Next Word Prediction:
Now, let’s say we want to predict the next word after “The monsoon rains have”. We would use the language model to compute, for every candidate word X5 in the vocabulary V:

P(X5 | X1, X2, X3, X4) = P(X5 | "The", "monsoon", "rains", "have")
The model could suggest different options for X5, such as:
- “started”
- “subsided”
- “caused”
Each of these options would have an associated probability, and the word with the highest probability would be selected as the next word.
For example, suppose the model assigns “started” a probability of 0.6, with “subsided” and “caused” receiving lower probabilities. Since “started” has the highest probability (0.6), the model would predict “started” as the next word.
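In practice, a neural language model produces this distribution over its entire vocabulary in one forward pass. Below is a sketch using the Hugging Face transformers library with the small pretrained GPT-2 model; the model choice is arbitrary, and the printed probabilities will differ from the illustrative 0.6 above:

```python
# Next-word prediction with a real pretrained model. Assumes the
# `transformers` and `torch` packages are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The monsoon rains have"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Probability distribution over the vocabulary for the token after the prompt.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

# Show the five most probable next tokens; greedy prediction picks the argmax.
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id.item())!r}: {prob.item():.3f}")
```

Greedy decoding selects the argmax at every step; sampling strategies instead draw from the distribution, which is why the same prompt can produce different continuations.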