This is a 3-part series.
Part-1 is an overview explaining what machine learning models, LLMs, etc. are.
Part-2 goes into the details of how you can create (which essentially means 'train') a new machine learning model. Together, these 2 parts provide a solid baseline for understanding the basics of LLMs.
Part-3 covers how LLMs work.
TL;DR:
Let’s take a step-by-step approach to understanding what a model is → what a language model is → and finally what a large language model is.
DETAILS:
A model can be thought of as a computational system that can predict the output for a given input, on a certain topic. The point to note is that a model is not a set of predefined rules; rather, it's like a trained brain that can predict. Let's decipher:
A computational system: in simple terms, this means that a model is a system that can do some calculations.
Predict the output for a given input: these calculations are done to predict the probability of something happening or not happening. For example, what is the probability of a given email being a spam email?
… on a certain topic: models are contextual, meaning a model can be used for a certain or specific topic or context (for example, here the context is ‘spam emailing’)
So, is a model restricted to only one context?
Good that you asked. In the early days of machine learning, most models were single-context, meaning they were trained on (and hence 'knew') only one topic or context. For example, a model about the stock market didn't know anything about spam emailing.
But as time passed and computer science advanced, newer models were trained on multiple contexts or topics. Our LLMs fall under this category. So, we can say that machine learning models have been evolving from a rigid, single-context state into much more flexible ones.
Fun fact: nowadays, computer scientists create models trained on so many topics that, rather than calling them 'multi-contextual', we call them 'general-purpose models'. And yes, most of our LLMs are general-purpose models.
So, what does it mean that a model is ‘trained’?
Training a model = feeding labeled data into a mathematical function (model) + optimizing that function using a training algorithm.
Wait, wait… where is this mathematical function suddenly coming from? And what's an optimization algorithm now? My head is spinning… 🤯
Great questions … let’s take this step-by-step.
‘Training a model’ can be thought of as one part of creating a model.
Hence, let’s go to the topic below “how to create a machine learning model” and we’ll automatically understand what “training a model” means.
Once we fully understand what a model is (read "how to create a machine learning model" first), let's continue with what a 'language model' is and then what a 'large language model' is.
> A 'language model' or LM is basically a machine learning model that has been trained on “natural languages” (a natural language means the languages that humans speak or write). And just like different people speak different languages, different language models are trained on different languages (or combinations of them).
For example, in a mobile app, when you type "I'm feeling" and the app suggests "tired" or "happy" – that's a language model at play. Hence, sentence "autocomplete" is a use case of a language model.
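To make this concrete, here's a tiny, illustrative sketch in Python (real autocomplete uses far more sophisticated models; this just counts which word tends to follow which word in a made-up sample text and suggests the most frequent follower):

```python
from collections import Counter, defaultdict

# A tiny, illustrative "language model": count which word tends to follow
# which word in some sample text, then suggest the most frequent follower.
corpus = "i am feeling happy . i am feeling tired . i am feeling happy today".split()

next_word_counts = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    next_word_counts[current_word][next_word] += 1

def suggest(word):
    # Return the word most often seen right after `word` in the corpus
    followers = next_word_counts.get(word)
    return followers.most_common(1)[0][0] if followers else None

print(suggest("feeling"))  # -> 'happy' (seen twice vs 'tired' once in this toy corpus)
```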
Here are some of the Language Models trained on different languages (some of the below models are LLMs where some are LMs):
Models that are only LMs:
> A large language model is just a very big language model — trained on billions of words from books, websites, articles, and more, using huge computing power.
What makes it "large" is:
Massive data: it is trained on a huge portion of the text available on the internet.
Massive size: it has billions of internal parameters (weights connecting artificial neurons) that store knowledge.
More general intelligence: It can answer questions, write code, summarize text, write poetry — it's not just autocomplete anymore.
LLM vs LM:
All LLMs are LMs but not all LMs are LLMs!
Some of the well-known LLMs are:
Q) How to create a new machine learning model? What does it mean to ‘train a model with data’ OR what does it mean that ‘the model is learning’?
> TL;DR:
Training a model means helping it learn patterns from labeled data by feeding that data into a base ML model and optimizing it using training algorithms (like gradient descent).
So to summarize, what does it mean to ‘train a model with data’ is:
Well, it simply means that this new model you're building is learning patterns from the data you've fed into its optimization functions (aka training algorithms, like gradient descent).
Steps to create a new machine learning model are:
Figure: Simplistic steps to create a new Machine Learning Model
DETAILS:
Steps to create a new ML model:
Prepare labeled training data
Choose base model type (e.g., logistic regression)
Choose optimization algorithm
Train the model (learn patterns)
Evaluate and use it
Let’s decipher:
First, you create the training data set by proper labeling.
For example, say you somehow collected 1 million emails (meaning the subject, body, to, from, and attachments of each email) and labeled each of them as either 'spam' or 'not spam'. This is the "training data set" for the model you are trying to build and train.
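As a toy illustration (the email texts below are made up, and a real dataset would have ~1 million rows with subject, sender, attachments, etc. as separate fields), labeled training data could look like this:

```python
# A minimal sketch of labeled training data: (input text, label) pairs.
training_data = [
    ("Congratulations! You won a FREE iPhone, click here", "spam"),
    ("Meeting moved to 3pm tomorrow, see updated invite",   "not spam"),
    ("Lowest prices on meds!!! limited time offer",         "spam"),
    ("Here are the quarterly numbers you asked for",        "not spam"),
]

texts  = [text  for text, label in training_data]   # model inputs (X)
labels = [label for text, label in training_data]   # model targets (y)
```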
Next, you choose your base Machine Learning Model Type:
Identify the base mathematical function, aka the base machine learning model. Every new machine learning model is created from an existing machine learning model type like 'logistic regression', 'decision tree', 'neural networks', etc. So, you choose one. Choosing the base ML model is the core design decision when creating a new ML model.
So … is it always necessary to choose an existing base machine learning model (like logistic regression) in order to create a new ML model?
Well, not really. You can either use one of the base machine learning models that have been tested by hundreds of people and are known to work, OR create your own base model (and follow the same lengthy, complex steps to make it dependable that the existing, verified models have already gone through).
Next, choose optimization techniques
Depending on the base ML model type (i.e., the mathematical function) you have chosen, you need to choose the algorithm to train the model (or rather, an optimization technique to train that model type) on the training data set. For example, here you'll need to decide whether to use something like gradient descent (and, for neural networks, backpropagation to compute those gradients).
Fun fact: not all ML models support all optimization algorithms. For example, decision trees use criteria like information gain and entropy instead of gradient descent.
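Here's a minimal scikit-learn sketch (with made-up toy features) showing a decision tree trained with an entropy / information-gain splitting criterion rather than gradient descent:

```python
from sklearn.tree import DecisionTreeClassifier

# Decision trees are trained by choosing splits that maximize information gain
# (an entropy-based criterion), not by gradient descent.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)

# Toy numeric features per email: [number_of_links, number_of_spammy_words] (made up)
X = [[5, 8], [0, 1], [7, 6], [1, 0]]
y = ["spam", "not spam", "spam", "not spam"]

tree.fit(X, y)                 # split selection happens here; no gradients involved
print(tree.predict([[6, 7]]))  # -> most likely ['spam'], given this toy data
```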
So, in a nutshell, training a model = feeding labeled data into a mathematical function (base ML model) + optimizing that function using a training algorithm. This optimization is the step where the model learns patterns by adjusting its internal parameters (such as 'weights').
Fun Facts:
Training = Learning = Fitting
(model.fit(X, y) is the step where training happens)
So, basically, during the learning period, while working on the training dataset, the model predicts whether or not an email is spam and then checks the prediction against the label; if they don't match, the model adjusts its weights and tries again. This is how the model 'learns'. Of course, this isn't done manually; it happens automatically (doing it by hand on a training set of a million emails isn't possible).
Libraries like TensorFlow, PyTorch, or even scikit-learn handle all this for us.
Scikit-learn is a bit more rigid and opinionated and can't be tweaked much outside of its boundaries, whereas TensorFlow and PyTorch give you far more control. Hence, TensorFlow and PyTorch are generally preferred when your machine learning model needs more customization.
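To tie the steps together, here's a minimal sketch of the full flow in scikit-learn, using toy emails in place of the 1 million labeled ones (the texts and the predicted labels here are illustrative, not guaranteed outputs):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Step 1: toy labeled data standing in for the "1 million labeled emails"
emails = [
    "win a free prize now, click this link",
    "team lunch is at noon on friday",
    "cheap loans, act now, limited offer",
    "please review the attached project report",
]
labels = ["spam", "not spam", "spam", "not spam"]

# Steps 2+3: the base model type (logistic regression) and its optimizer are
# bundled for us by scikit-learn; the vectorizer turns text into numbers first.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Step 4: training -- this is the model.fit(X, y) call mentioned above
model.fit(emails, labels)

# Step 5: evaluate / use the trained model on unseen input
print(model.predict(["free prize, click now"]))        # likely -> ['spam']
print(model.predict(["notes from today's meeting"]))   # likely -> ['not spam']
```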
Q) What are some of the base machine learning models used today? And when to use which model? For example, when to use logistic regression vs when not to?
> Base ML models are predefined functions (like logistic regression, decision tree, etc.) used to map input to output. Each has specific strengths and compatible optimization techniques.
> Some of the commonly used base machine learning models are:
Some use cases and which ML model type to use (this is a repetition of the above table, included only for convenience):
Q) What are some of the optimization algorithms? For example, when to use backpropagation vs when not to?
> Optimization algorithms help models adjust their internal parameters during training to minimize error. Different model types support different algorithms.
> Some of the commonly used optimization algorithms are:
Q) Are there any restrictions or recommendations on which base machine learning model can use which algorithm for training?
> Yes. Not every algorithm works with every model type. Libraries like TensorFlow or scikit-learn automatically choose compatible algorithms.
Each of the base machine learning model types internally supports one or more optimization algorithms.
And when you use a certain model type, you'll need to pick from its available set of optimization algorithms.
When you use libraries like TensorFlow, PyTorch, or scikit-learn, these optimization algorithms are already implemented and made available as part of the ML model type you choose.
For example, when you use LogisticRegression() in scikit-learn, it automatically picks a compatible gradient-based optimizer internally (its default solver, lbfgs).
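Scikit-learn exposes this choice through the `solver` parameter of `LogisticRegression` (the snippet below is just an illustration of that API, not a recommendation of one solver over another):

```python
from sklearn.linear_model import LogisticRegression

# scikit-learn exposes the optimization algorithm as the `solver` parameter.
# If you don't specify one, it picks a compatible default ('lbfgs') for you.
default_model = LogisticRegression()                 # solver='lbfgs' by default
explicit_model = LogisticRegression(solver="saga")   # or choose one explicitly
```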
Q) Databases vs Models: a sneak peek at how models differ from databases
> Some differences between databases and models are:
A database just 'stores' the data – it cannot make 'sense' of it, whereas a model can make 'sense' of the data on which it was trained.
We, users, can query the database and get the data that’s stored there. But we cannot query a model and get back the data on which it was trained.
We use SQL to query a database and get the data stored in it. Similarly, we use an API or an interface (or a UI like ChatGPT) to query a model.
The most important difference between databases and models is that a database cannot predict anything; it can only return the current state of the data stored in it. A model, on the other hand, can predict because it has been 'trained' to do so.
A 'trained model' is a pattern-aware function that predicts (the prediction can be of any type, from a binary output like 'spam' / 'not spam' to the next token) based on the parameters tuned during its training period.
Note: for “tuning of parameters?” check the part on “How to create a new machine learning model?”
So, if models don’t store the data, how do they compare new input and predict?
> Great question. Models don’t store data and hence they don’t ‘compare’ the input either. Rather they take the below route:
Convert the input (e.g., an email) into a vector of numbers (internal representation)
Pass it through their trained network of weights (which encode the knowledge from training)
And output a prediction based on how similar this input is to patterns it has seen during training
So it's not like "Is this identical to an email I've seen before?" Rather, it's more like "Does this seem similar to patterns I've learned from millions of examples?"
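A hand-wavy numeric sketch of the route above, with completely made-up weights standing in for what training would have learned:

```python
import numpy as np

# A trained model is just learned weights plus a function -- no stored emails.
# The numbers below are invented; in a real model they come from training.
weights = np.array([1.8, 2.1, -0.7])   # learned importance of each feature
bias = -1.0

def predict_spam_probability(features):
    # features: e.g. [count of "free", count of links, count of known contacts]
    score = np.dot(weights, features) + bias
    return 1 / (1 + np.exp(-score))     # logistic (sigmoid) squashes the score into 0..1

new_email_features = np.array([3, 2, 0])              # vectorized version of a new email
print(predict_spam_probability(new_email_features))   # close to 1.0 -> "spam-like"
```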
TL;DR:
Newer models like LLMs have been trained on internet-scale data, use a better and more complex architecture (called "Transformers", based on a mechanism known as "self-attention"), and support parallel processing.
DETAILS:
Let's understand how GPT models are made, and along the way we'll understand how they work and why they are so much more capable – easy 🙂
Evolution from earlier to newer (ie. GPT) models
Let’s see how earlier models evolved from basic classification tasks (like spam detection) into being generative models capable of next-token prediction — and how that evolution eventually led to LLMs.
The journey:
Figure: Evolution of Pre-GPT era Machine Learning Models to GPT-era Models
Pre-GPT era Models vs GPT Models:
In a super simplistic way, it can be said that the newer models like GPT are better – m-u-c-h better – at generating the next token than the earlier models were.
So… how do these newer models make better predictions?
The 3 big factors behind this radical shift are:
LLMs are context-aware
Training data is internet-scale large
Prediction happens in-parallel
Let’s understand each of these 3 pillars one by one.
The new protagonist in town is “Attention is all you need” 🙂
LLMs can consider long-range dependencies and nuance — this is the heart of “attention is all you need”
What is attention?
It’s a method that allows the model to "look back" at all tokens in the input and decide which parts are most relevant for generating the next token. For example:
In the sentence “The animal didn’t cross the street because it was too tired,”
the word “it” could refer to “animal” or “street” — attention mechanisms help the model understand that “animal” is the likely reference.
Earlier models like RNNs or LSTMs had limited memory and hence couldn't "see" the whole input at once.
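Here's a stripped-down sketch of the core attention computation in plain NumPy. It is deliberately simplified: a real Transformer first projects each token into separate query/key/value vectors with learned weights and uses multiple heads, while this version just uses the raw token vectors directly.

```python
import numpy as np

def self_attention(X):
    """Minimal single-head self-attention sketch (no learned projections).
    X has shape (num_tokens, d)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # how much each token "looks at" each other token
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ X                               # each token becomes a weighted mix of all tokens

# 4 toy tokens, each represented by a 3-dimensional vector (made-up numbers)
tokens = np.array([[1.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
print(self_attention(tokens).round(2))
```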
This 'attention' mechanism came along with a new architecture that is used to create LLMs, known as the 'Transformer'.
So, what is a Transformer?
Tying back to this course’s knowledge here, where does a Transformer stand? Is it a base ML model that is used to create other models like LLMs or is it another optimization algorithm?
Well, a Transformer is not an optimization algorithm, and it's not exactly a base ML model either.
Rather, a Transformer is an architecture that is used to build base models such as deep neural networks.
So, a Transformer is a model's architecture – here, the model is a deep neural network. The Transformer architecture defines:
How input data flows through the model
How information is processed and learned
What kind of layers are used (like self-attention, feed-forward, residuals, etc.)
Figure: Where does Transformer come in
Q) So, what are some of the other model architectures used for neural networks in the pre-LLM era? What were they good at? Why were the earlier architectures not so good at handling large language tasks? And why is the Transformer architecture so much better?
> Other Architectures Used to Build Deep Neural Networks (Pre-Transformer Era)
What these models were good at
FNN / MLP: Straightforward predictions; not good for language.
CNNs: Capture local dependencies well (n-grams, nearby words); used in early sentence classifiers and some translation models.
RNNs / LSTMs / GRUs: Could “remember” sequences — useful for early chatbots, language models, autocomplete systems.
Example: Google's Smart Compose used LSTM-based language models before Transformer-based models took over.
And why did these earlier models struggle at large-scale language tasks?
Q) What makes the Transformer architecture 'context-aware'? And how does this support the idea of “attention is all you need”?
> The Transformer architecture is context-aware because of its unique mechanism called self-attention, which allows every word (or token) in an input sequence to dynamically “look at” and weigh the importance of all the other words — regardless of their position. Unlike older models (like RNNs or LSTMs), which process data sequentially and can struggle to retain long-range dependencies, Transformers can instantly access the full context in parallel.
And how does this support the idea of ‘attention is all you need’?
> The phrase “attention is all you need” comes from the title of the 2017 paper that introduced Transformers. The authors argued (and proved) that:
You can eliminate recurrence (like in RNNs)
You don’t need convolution (like in CNNs)
You just need attention — specifically, self-attention, repeated in layers — to model relationships and meaning in sequences effectively.
And they were right: the success of GPTs, BERT, and most modern LLMs proves that attention, when scaled and layered correctly, is indeed sufficient.
Richer data = better generalization, more fluent generation, and emergent knowledge
The radical leap in language modeling was made possible not just by better architecture, but by feeding these models a massive portion of human digital knowledge. That’s why modern LLMs seem to be so fluent, informed, and versatile — they’ve been trained on far more text than any one human could ever read in a lifetime.
What does "internet-scale" mean?
It means LLMs are trained on hundreds of billions to trillions of words.
These include books (fiction/non-fiction), news articles, web pages (Wikipedia, blogs, forums), code repositories (for coding models), public datasets, etc.
GPT-3, for example, was trained on a corpus of roughly 500 billion tokens; GPT-4 is believed to have been trained on trillions of tokens.
Q) Why does the scale matter so much?
More diverse, large-scale data leads to:
This is where we start seeing things like “in-context learning” and “zero-shot reasoning” – behaviors that were not programmed or trained for, but emerged purely from the scale of data and model size.
Earlier models could do well in tasks like spam detection or sentiment analysis, but only if trained specifically for it. LLMs can answer questions, summarize text, write poems, translate, code, and more — without needing task-specific training.
“Zero-shot reasoning” means that the model was not trained specifically on the task.
Q) What is a Token?
A token is the smallest unit that a model processes.
A token can be a word, a portion of a word, or even a special character.
Different LLMs have different strategies or algorithms to create tokens. These are called tokenizers.
Hence, the same sentence may produce a different number of tokens for different LLMs, if their tokenizers are different.
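For example, with the `tiktoken` package (which implements the tokenizers used by several OpenAI models; other LLMs will split the same text differently), you can see how a sentence turns into tokens:

```python
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # one of the encodings used by OpenAI models

text = "Tokenization isn't always intuitive!"
token_ids = enc.encode(text)
print(len(token_ids))                         # number of tokens, not words
print([enc.decode([t]) for t in token_ids])   # the text piece behind each token
```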
Q) Are there any downsides of training on this massive data?
Yes, there are a few:
Not all the data is verified, so the model may pick up biases, misinformation, and toxic content.
These models have a knowledge cut-off, meaning the most recent data will not be available to the model unless tools are used to fetch it.
Massive speed-up and better learning of relationships between all parts of input text
Next token generation in pre-LLM vs LLM era:
In the pre-LLM era, predictions happened sequentially.
Earlier models (such as RNNs, LSTMs) worked on one token at a time, passing info forward like a chain.
To summarize:
Pre-LLM models like RNNs must process all earlier words one by one to get to "mat" (say, in the sentence "The cat sat on the mat").
And Transformers like GPT can process the sentence in one go (in parallel across layers), so they build the context faster and predict "mat" based on richer, more complete context.
But even in LLMs, prediction of the next token ("mat") is still done one token at a time — it's just that the encoding of the context happens more efficiently thanks to parallelism.
Thus, context building is parallel in LLMs, though next-token prediction/generation is still done step-by-step — but faster and smarter because of richer context.
Q) What are “Layers” in the Transformer model that make the context-building parallel?
A layer in the Transformer model is one block of operations (stacked among many such blocks) that helps build the model's understanding of the input.
The more layers, the deeper (and typically richer) the understanding.
Each layer uses techniques such as multi-head self-attention, a feedforward neural network, and layer norm + residual connections.
Below is a table of how many layers are used by which LLM:
Here’s a simple list of important components of Transformers:
One way to relate how a Transformer works to a real-world scenario is a round-table discussion: each token (word) hears what all the others have to say (via attention), forms a new opinion (via the feedforward net), keeps memory of old ideas (via residuals), and moves to the next round.
Encoder and Decoder as part of the Transformer Architecture
Why are the encoder and decoder relevant?
The original Transformer (in the “Attention is All You Need” paper) was designed with both encoder and decoder components — mainly for translation (e.g., English → French).
But
BERT uses only the encoder part → optimized for understanding tasks.
GPT uses only the decoder part → optimized for generation tasks.
T5, BART, etc., use both → encoder handles input, decoder generates output.
So when we say “GPT is built using Transformer architecture,” we mean:
It uses the decoder stack from the Transformer blueprint.
Each decoder layer includes multi-head self-attention, feedforward, layer norm, etc.
TL;DR:
Figure: Sample Steps GPT takes to generate a response
DETAILS:
Let’s understand the generative process with this example.
Say the output from GPT here is: "William Shakespeare was a renowned English playwright, poet, and actor of the 16th century."
We'll now break down how GPT generated this sentence using the Transformer architecture (specifically, the decoder stack)
How GPT Generated That Sentence
So, let’s say the first sentence of the response is “William Shakespeare was a renowned English playwright, poet, and actor of the 16th century.”
Step-1: Tokenization
The input prompt "Write an essay on Shakespeare" is broken down into tokens. For example, using GPT-3's tokenizer, roughly the following tokens would be generated:
["Write", " an", " essay", " on", " Shakespeare"]
Each of these tokens is mapped to an embedding vector (e.g., a 768- or 12288-dimensional vector, depending on the model)
Step-2: Positional Encoding
Since Transformers process tokens in parallel and don’t inherently know word order, a positional encoding vector is added to each token embedding to indicate its position in the sentence.
So the embedding for "Write" gets position 0 added, "an" gets position 1, and so on.
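Here's a small NumPy sketch of the sinusoidal positional encoding from the original Transformer paper. (GPT models actually learn their position embeddings instead, but the core idea, adding a position-dependent vector to each token embedding, is the same.)

```python
import numpy as np

def positional_encoding(num_positions, d_model):
    """Sinusoidal positional encodings as in the original Transformer paper."""
    positions = np.arange(num_positions)[:, None]               # 0, 1, 2, ...
    dims = np.arange(d_model)[None, :]
    angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((num_positions, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])                 # even dimensions get sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])                 # odd dimensions get cosine
    return encoding

# 5 tokens ("Write", " an", " essay", " on", " Shakespeare") with toy 8-dim embeddings
token_embeddings = np.random.randn(5, 8)
model_input = token_embeddings + positional_encoding(5, 8)      # position info added in
```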
Step-3: First Transformer Decoder Layer
Each token goes through a stack of Transformer decoder layers. In each layer:
Masked Multi-Head Self-Attention: The model looks at all previous tokens (but not future ones, since it’s generating text) and calculates which ones are important for predicting the next word. For example, to predict "William", the model focuses on "essay" and "Shakespeare".
Feedforward Neural Network: Each token representation is further processed individually to enhance features.
Residual Connections + Layer Norm:
Residual Connections and Layer Normalization are like safety nets inside the transformer.
Residual connections help the model “remember” the original input, even after passing through complex layers.
Layer normalization keeps the signals stable, so learning doesn’t become chaotic.
Without them, the model could either forget important context as it goes deeper, or the training could become unstable and never converge to good results.
A real-world analogy can be:
We can think of residual connections like a whisper from the original sentence that travels alongside all the layers. It makes sure that no matter how many layers deep the model goes, it still “remembers” what the sentence was about.
Layer normalization is like a leveler — it smooths out the information between layers, preventing any one signal from overpowering others.
This process happens across multiple layers (e.g., 12 in the smallest GPT-2, 96 in GPT-3; GPT-4's layer count hasn't been published), gradually refining the representation of the prompt.
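To see how these pieces fit together, here's a stripped-down PyTorch sketch of one GPT-style decoder layer (masked self-attention, feedforward, residual connections, and layer norm). Real implementations differ in details such as pre- vs post-normalization and dropout; this is only a sketch of the structure.

```python
import torch
import torch.nn as nn

class MiniDecoderBlock(nn.Module):
    """A simplified sketch of one GPT-style decoder layer:
    masked self-attention -> add & norm -> feedforward -> add & norm."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        seq_len = x.size(1)
        # Causal mask: each token may only attend to itself and earlier tokens
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)          # residual connection + layer norm
        x = self.norm2(x + self.ff(x))        # residual connection + layer norm
        return x

block = MiniDecoderBlock()
tokens = torch.randn(1, 5, 64)                # batch of 1, 5 tokens, 64-dim embeddings
print(block(tokens).shape)                    # torch.Size([1, 5, 64])
```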
Step-4: Output Projection
At the final layer, the model has a dense representation of the prompt and its context. This is passed through a linear layer and a softmax function that outputs probabilities over the entire vocabulary (e.g., 50,000+ words/tokens).
For example:
"William" = 0.82 probability
"The" = 0.05
"Romeo" = 0.02
The model chooses "William".
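Here's a tiny NumPy illustration of that last step, using a made-up 5-token vocabulary and made-up logits (real models work over 50,000+ tokens):

```python
import numpy as np

# Hypothetical raw scores (logits) from the final layer for a tiny vocabulary.
vocab  = ["William", "The", "Romeo", "A", "In"]
logits = np.array([4.2, 1.4, 0.5, 0.9, 0.3])         # made-up numbers

probs = np.exp(logits) / np.exp(logits).sum()         # softmax: scores -> probabilities
for token, p in sorted(zip(vocab, probs), key=lambda x: -x[1]):
    print(f"{token!r}: {p:.2f}")

next_token = vocab[int(np.argmax(probs))]             # greedy choice; real decoders often sample
print("chosen:", next_token)                          # -> 'William'
```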
Step-5: Iterative Generation
Now the prompt becomes: ["Write", " an", " essay", " on", " Shakespeare", "William"]
This new sequence is re-fed into the model to generate the next token: "Shakespeare", then "was", "a", "renowned", and so on — one token at a time, building the sentence incrementally.
Eventually, the essay is generated, one token at a time, using the exact same process.
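In Python, the whole loop looks roughly like this sketch; `model_next_token_probs` is a hypothetical stand-in for the full tokenize → decoder stack → softmax pass described above:

```python
# A high-level sketch of the generation loop (illustrative, not a real API).
END_OF_TEXT = "<|endoftext|>"
MAX_TOKENS = 200

def generate(prompt_tokens, model_next_token_probs):
    tokens = list(prompt_tokens)
    while len(tokens) < MAX_TOKENS:
        probs = model_next_token_probs(tokens)      # probabilities over the vocabulary
        next_token = max(probs, key=probs.get)      # greedy pick (real systems often sample)
        if next_token == END_OF_TEXT:
            break                                   # the model signals it is done
        tokens.append(next_token)                   # feed the sequence back in and repeat
    return tokens
```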
This loop continues until… GPT stops
Q) How do LLMs (say, GPT) know when to stop?
GPT uses a special token called the end-of-text token (often written as <|endoftext|> in training). During training, the model learned that certain kinds of text should end at specific points.
It may stop when:
It predicts the end-of-text token with high enough probability.
It reaches a predefined token limit (e.g., 128 or 2048 tokens).
External constraints (e.g., max_tokens in API call) tell it to stop.
The model is trained to “wrap up” when it sees semantic signals like:
“In conclusion…”
“This shows that…”
Reaching a natural paragraph length.
🎉 Congratulations!
You now understand what a machine learning model is, how it’s trained, and how it evolved into modern LLMs like GPT. Along the way, you’ve explored key breakthroughs like the Transformer architecture, context-aware learning, internet-scale training data, tokenization, and in-parallel prediction — giving you a solid foundation to explore and apply large language models with confidence.
Keep learning and building 🚀