Saturday, June 21, 2025

What Makes a Model Large and Language


Introduction to Large Language Models

If you’ve ever asked ChatGPT or Claude to draft an email, you’ve already met an LLM in action. LLMs, or Large Language Models, have come a long way from research previews to dinner-table topics. They’re changing how we interact with information, and even how we think about intelligence.

What Makes a Model "Large"?

But what makes a model “large”, and why call it a “language model” if it can also generate images? In the last article, we talked about the Transformer. Now, we’ll build on that foundation and explore what happens when you scale that architecture to billions of parameters and let it learn without supervision.

Foundational Overview

This article is a foundational overview of what defines an LLM and how it’s trained, and we’ll compare that approach to the traditional NLP pipeline it replaced. The term “Large Language Model” gets used a lot, and each word in it points to something specific about what these models are and why they work.

"Large"

It’s mostly about scale. The term “large” refers to the number of parameters, which are the internal weights that a model learns during training. The parameters are like dials, and each of them adjusts how the model understands patterns in the data. To give a sense of scale:

  • When GPT-3 first came out, it had 175 billion parameters
  • GPT-4 is speculated to have approximately 1.76 trillion (not disclosed)
  • Llama 4 Maverick has 400 billion

"Language"

LLMs are typically trained on natural language data: books, Reddit, Wikipedia, code, news articles, and other web pages. They’re not just memorizing grammar rules; they’re learning statistical patterns in how words and phrases relate to one another. They’re called language models because they learn to model the probability distribution of language, meaning they predict what comes next in a sentence based on what came before.

"Model"

The model part refers to the underlying architecture used to learn those probabilities, and modern LLMs are almost universally built on the Transformer. The key ideas that make Transformers well suited for LLMs are listed below, followed by a small code sketch of self-attention:

  • Self-attention lets the model weigh relationships between words regardless of how far apart they are
  • Positional embeddings help capture word order without recurrence
  • Transformers can be efficiently parallelized, which makes them practical for massive datasets and hardware clusters
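
To make the self-attention bullet concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The matrix sizes and random weights are invented for illustration; a real Transformer uses many heads, learned parameters, masking, and positional information.

```python
# Minimal single-head scaled dot-product self-attention (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_head)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # every token scores every other token
    weights = softmax(scores, axis=-1)       # attention weights sum to 1 per token
    return weights @ V                       # each output is a weighted mix of values

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))      # stand-in for token + position embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (5, 8): one mixed vector per token
```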

Understanding Large Language Models

So when you hear the phrase “large language model”, think:

  • Billions of learned parameters (Large)
  • Trained on massive text corpora (Language)
  • Built on the Transformer architecture (Model)

How LLMs Become Language-Fluent

How does a model become language-fluent without anyone explicitly labelling the data? The answer is pretraining with self-supervised learning. This is a paradigm where the training signal comes from the data itself, not from human annotations.

Self-Supervised Learning

In traditional supervised learning, we usually need labelled data. But with language, the structure provides its own supervision. For example: “The quick brown ___ jumps over the lazy dog”. We don’t need a human to label it because the missing word is part of the sentence itself. By masking parts of a sentence or predicting the next word, the model learns deep patterns of language structure and semantics.
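
A tiny Python sketch of that idea, using the sentence above and a toy whitespace tokenizer (real models use subword tokenizers over enormous corpora): the text itself supplies the training target.

```python
# Self-supervision: hide a word, and the original word becomes the label.
sentence = "The quick brown fox jumps over the lazy dog"
tokens = sentence.split()        # toy whitespace "tokenizer"

blank_index = 3                  # hide "fox"
target = tokens[blank_index]     # the label comes from the text itself
masked = tokens.copy()
masked[blank_index] = "___"

print(" ".join(masked), "->", target)
# The quick brown ___ jumps over the lazy dog -> fox
```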

Pretraining Objectives

Two Common Pretraining Objectives

  • Autoregressive (Next-token prediction): Used by GPT models to predict the next token given the previous ones.
  • Masked Language Modeling (MLM): Used by BERT to predict masked tokens within a sequence (both objectives are sketched below).
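
A rough sketch contrasting how the two objectives turn the same toy sentence into a prediction problem; no real model or tokenizer is involved here.

```python
# Two ways to build a training example from the same tokens.
import random

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Autoregressive (GPT-style): predict each token from everything before it.
autoregressive = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
print(autoregressive[2])   # (['the', 'cat', 'sat'], 'on')

# Masked language modeling (BERT-style): hide a random token and predict it
# using context from BOTH sides of the gap.
random.seed(0)
idx = random.randrange(len(tokens))
masked, target = tokens.copy(), tokens[idx]
masked[idx] = "[MASK]"
print(masked, "->", target)
```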

Why Pretraining Works Well

  • Massive scale: Pretraining uses trillions of words to capture the diversity of human knowledge and communication
  • No need for labels: Any text corpus becomes a training set
  • General-purpose knowledge: The model learns syntax, semantics, reasoning patterns, and even world knowledge all at once.

Comparison to Traditional NLP

The Traditional NLP Stack

A typical NLP system looked something like this (a code sketch of these stages follows the list):

  • Tokenization: Break text into words or subwords
  • Part-of-Speech Tagging: Label each word as noun, verb, adjective, etc
  • Parsing: Analyze grammatical structure
  • Named Entity Recognition: Detect entities like names, dates, locations
  • Feature Engineering: Design inputs for task-specific models
  • Task-Specific Models: Apply statistical classifiers for tasks like sentiment analysis
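
For a sense of what that stack looks like in code, here is a hedged sketch using spaCy, assuming its small English model is installed; each annotated line corresponds to one stage that used to be built and tuned separately, with a task-specific classifier trained on features derived from these annotations.

```python
# Traditional pipeline stages via spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a new office in Paris on Monday.")

tokens   = [t.text for t in doc]                         # tokenization
pos_tags = [(t.text, t.pos_) for t in doc]               # part-of-speech tagging
parse    = [(t.text, t.dep_, t.head.text) for t in doc]  # dependency parsing
entities = [(e.text, e.label_) for e in doc.ents]        # named entity recognition

print(entities)  # e.g. [('Apple', 'ORG'), ('Paris', 'GPE'), ('Monday', 'DATE')]
# Feature engineering + a task-specific model (e.g. a sentiment classifier)
# would then be built on top of annotations like these.
```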

The LLM Approach

LLMs like GPT changed the paradigm. Now we feed in raw text and the model handles everything: no part-of-speech tagging, no dependency parsing, no feature engineering. Instead, LLMs learn representations of language that encode grammar, syntax, and even real-world knowledge, all during pretraining.
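
By contrast, here is a minimal sketch of the LLM approach using the Hugging Face transformers library. The model choice (gpt2) is just a small public example and downloads weights on first run; raw text goes in, generated text comes out, with no explicit pipeline stages.

```python
# Raw text in, text out: no tagging, parsing, or feature engineering.
# Assumes: pip install transformers torch
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("The quick brown fox", max_new_tokens=10)
print(result[0]["generated_text"])
```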

Emergent Abilities and Zero-Shot Learning

Probably the most impressive part is that LLMs can generalize to new tasks without ever being explicitly trained on them. For example, prompt a model with “Translate good morning to Chinese” and it answers “早上好”. That’s zero-shot learning, an ability that emerges when scale, data, and architecture align.
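
One runnable flavor of zero-shot behavior is zero-shot classification with the transformers library. This swaps the translation prompt above for an NLI-based classifier, and the model name is just the commonly used public checkpoint; the point is the same: the labels were never part of the model’s training objective.

```python
# Zero-shot classification: the candidate labels are chosen at inference time.
# Assumes: pip install transformers torch
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "The battery dies after two hours and the screen flickers.",
    candidate_labels=["product review", "weather report", "sports news"],
)
print(result["labels"][0])  # highest-scoring label, likely "product review"
```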

Examples of LLMs

GPT (OpenAI)

  • Model Type: Autoregressive decoder-only
  • Known For: ChatGPT, few-shot prompting, strong generative ability
  • Key Idea: Predicts the next token in a sequence using only past context

BERT (Google)

  • Model Type: Bidirectional encoder
  • Known For: Deep language understanding tasks
  • Key Idea: Uses Masked Language Modeling to learn from both left and right context (see the sketch below)
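
A small sketch of that masked objective in action via the transformers fill-mask pipeline; bert-base-uncased is the standard public checkpoint and is used here only as an example.

```python
# BERT-style masked prediction: fill in [MASK] using context from both sides.
# Assumes: pip install transformers torch
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill("The quick brown [MASK] jumps over the lazy dog.")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))
```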

Other Notable Models

  • Claude by Anthropic: Claude 4 Sonnet, Claude 4 Opus
  • Gemini by Google: Gemini 2.5 Pro
  • Grok by xAI: Grok 3, Grok 2

Conclusion

LLMs’ influence reaches far beyond chatbots and productivity tools: they’re reshaping how we search, learn, and more. With their ability to learn from vast amounts of data without explicit labeling, LLMs are opening new avenues in artificial intelligence and natural language processing. As the technology continues to evolve, the impact of Large Language Models will only grow, changing the way we interact with information and with each other.
