What is a Language Model (LM)?

RAHULRAJ P V
3 min read · Sep 30, 2024

A language model (LM) is a probabilistic model that assigns probabilities to sequences of tokens (words, characters, or subwords). The key function of a language model is to provide a probability distribution over a sequence of tokens, which allows the model to predict how likely a sequence of words is to appear together.
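As a concrete sketch, here is a toy count-based bigram model, which assigns a probability distribution over the next token given the previous one. This is an illustrative simplification (real language models are neural networks trained on far larger corpora), and the tiny corpus below is made up:

```python
from collections import Counter, defaultdict

# Toy bigram language model: estimate P(next_word | prev_word)
# from raw counts in a tiny made-up corpus.
corpus = "the monsoon rains have started . the monsoon rains have subsided .".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_distribution(prev_word):
    """Return a probability distribution over words following prev_word."""
    counts = bigram_counts[prev_word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_distribution("have"))
# {'started': 0.5, 'subsided': 0.5}
```

Even this crude model captures the core idea: given a context, the model outputs a probability for each candidate next token.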

Here’s how it works in more detail:

How do language models generate text?

Consider a sequence of tokens {X1, X2, X3, …, XL}, where each token belongs to a vocabulary V. The probability of the full sequence (and hence of a word coming after "The monsoon rains have") can be determined using the chain rule of probability:

P(X1, X2, …, XL) = P(X1) · P(X2 | X1) · P(X3 | X1, X2) · … · P(XL | X1, X2, …, XL−1)

Example:

Let’s take the partial sentence “The monsoon rains have” and calculate the probability of the next word, say X5, coming after this phrase using the chain rule. Assume that:

  • X1 = "The"
  • X2 = "monsoon"
  • X3 = "rains"
  • X4 = "have"

Now, by the chain rule, the probability of the sequence "The monsoon rains have X5" is given by:

P(X1, X2, X3, X4, X5) = P(X1) · P(X2 | X1) · P(X3 | X1, X2) · P(X4 | X1, X2, X3) · P(X5 | X1, X2, X3, X4)

In practical terms, this is how we compute the probability of a word X5 coming after the sequence "The monsoon rains have":

  1. P(X1): The probability of the first word being “The”.
  2. P(X2 | X1): The probability of the second word being “monsoon” given that the first word is “The”.
  3. P(X3 | X1, X2): The probability of the third word being “rains” given the first two words are “The monsoon”.
  4. P(X4 | X1, X2, X3): The probability of the fourth word being “have” given the previous words are “The monsoon rains”.
  5. P(X5 | X1, X2, X3, X4): The probability of the next word X5 given the preceding sequence "The monsoon rains have".
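Multiplying these five factors together gives the probability of the whole sequence. A minimal sketch, using made-up conditional probabilities purely for illustration:

```python
# Chain-rule factorization of a sequence probability.
# The conditional probabilities below are illustrative values,
# not the outputs of a real model.
conditionals = [
    0.05,  # P(X1 = "The")
    0.10,  # P(X2 = "monsoon" | "The")
    0.40,  # P(X3 = "rains"   | "The monsoon")
    0.30,  # P(X4 = "have"    | "The monsoon rains")
    0.60,  # P(X5 = "started" | "The monsoon rains have")
]

sequence_probability = 1.0
for p in conditionals:
    sequence_probability *= p

print(sequence_probability)  # ≈ 0.00036
```

Note how quickly the product shrinks: long sequences have tiny probabilities, which is why implementations usually work with sums of log-probabilities instead of raw products.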

Illustration with Next Word Prediction:

Now, let’s say we want to predict the next word after “The monsoon rains have”. We would ask the language model to compute:

P(X5 | X1 = "The", X2 = "monsoon", X3 = "rains", X4 = "have")

The model could suggest different options for X5, such as:

  • “started”
  • “subsided”
  • “caused”

Each of these options would have an associated probability, and the word with the highest probability would be selected as the next word.

For example, suppose the model assigns “started” a probability of 0.6, higher than that of “subsided” or “caused”. Since “started” has the highest probability, the model would predict “started” as the next word.
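In code, this greedy selection is just an argmax over the candidate probabilities. Only the 0.6 for “started” comes from the example above; the values for “subsided” and “caused” are made up for illustration:

```python
# Greedy next-word selection: pick the candidate with the highest
# conditional probability. Values are illustrative.
candidate_probs = {
    "started": 0.6,
    "subsided": 0.25,
    "caused": 0.15,
}

next_word = max(candidate_probs, key=candidate_probs.get)
print(next_word)  # started
```

Always taking the argmax is called greedy decoding; in practice, generators often sample from the distribution instead, which produces more varied text.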

