Played Plinko? Then You Already Understand How LLMs Work
Most people have heard of large language models. We nod along as if we understand them, yet when pressed to explain, we often shrug and say, “It’s like ChatGPT.” That answer is as vague as calling water “the stuff you drink.” We know that water is two hydrogen atoms plus one oxygen atom; my goal for this post is to give you a similarly precise understanding of LLMs.
When trying to learn how these systems actually work, it’s easiest to think of an LLM as a giant Plinko machine. Except this machine is loaded: every time we drop a ball, we think we know where it will land. Picture the board, but super-sized, with billions of pegs instead of a few dozen, and every ball not round but its own unique shape. Each ball you drop is a piece of text moving through those pegs until it lands in a slot, giving you a generated response.
There are four foundational steps to explain how a model travels through this metaphorical Plinko machine to go from a user prompt to an answer. By grounding yourself in these first principles, you can see why models behave as they do (and maybe even impress someone at your next dinner party).
Labeling
We first need to understand how the Plinko machine was built, so let us roll back to before a single ball drops.
We start with a process called labeling. Companies like Scale AI have built multi-billion-dollar businesses by providing this service. The idea is simple: for each piece of input text, we decide what the correct output should be. For example, given the Plinko ball, or model input, “Translate hola into English,” we would expect the ball to fall into the pocket called “hello”, that is, to produce the output “hello.”
We repeat this process hundreds of thousands of times with different text and gather a catalog, or database, of all these pairs. We then feed that catalog of pairs into the next phase, where we will tweak the pegs on our Plinko board, the model’s internal settings, until dropping a new ball (a fresh question) reliably bounces it into the pocket with the right answer.
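If it helps to see this as data, here is a rough sketch of what a few of those pairs might look like. The “prompt” and “completion” field names are just one common convention I’m using for illustration, not how any particular labeling tool stores things.

```python
# A tiny, illustrative catalog of labeled pairs (input text -> expected output).
# Real labeling pipelines collect hundreds of thousands of these; the field
# names "prompt" and "completion" are an assumption made for this example.
labeled_pairs = [
    {"prompt": "Translate hola into English", "completion": "hello"},
    {"prompt": "What is 2 + 2?", "completion": "4"},
    {"prompt": "Name the capital of France.", "completion": "Paris"},
]

for pair in labeled_pairs:
    print(f"{pair['prompt']!r} -> {pair['completion']!r}")
```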
Training
Now the real work begins: we drop millions of balls over and over again. We know where we want them to land, but at first they miss their mark. After each drop we nudge every peg a hair left or right, inching the next ball closer to its proper pocket. In the LLM world these pegs are called parameters. Repeat this loop long enough and the board aligns so that almost every ball lands exactly where it should. Remember, we are dealing with billions and sometimes trillions of parameters (pegs) here.
Behind the scenes, those tiny nudges run through complex calculus equations that analyze how far off the generation was, and why. This is why training an LLM is so costly: the model repeats this process until it gets it right. When the math finally settles, the board is tuned so well that even balls it has never seen before glide into the correct slots.
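To make the “nudging” idea concrete, here is a toy sketch with a single peg and made-up numbers. Real training uses backpropagation across billions of parameters, but the rhythm is the same: check how far off you were, then move each peg a little in the direction that shrinks the miss.

```python
# A toy version of the training loop: measure how far off the output is,
# then nudge one "peg" (parameter) a hair in the direction that reduces the
# error. An LLM does this with calculus across billions of pegs at once;
# here a single number and a simple squared error stand in.

peg = 0.0            # one peg; LLMs have billions of these
target = 3.0         # where we want the ball to land
learning_rate = 0.1  # how big each nudge is

for step in range(25):
    prediction = peg                 # drop the ball with the current peg setting
    error = prediction - target      # how far off did it land?
    gradient = 2 * error             # slope of the squared error with respect to the peg
    peg -= learning_rate * gradient  # nudge the peg toward the right pocket

print(round(peg, 3))  # ends up very close to 3.0
```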
Predicting
Now that you have a properly trained Plinko board (model), it’s time to use it. To understand how a model predicts the output for a given input, you must first understand the concept of tokens.
A token is a bite-sized chunk of text: it could be a full word, part of a word, or even just a punctuation mark. Think of these as the potential pockets a ball can fall into on our Plinko board. The model has a predefined set of tokens (pockets) at its disposal, learned during training, usually numbering in the tens of thousands.
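Here is a deliberately over-simplified sketch of that idea. Real tokenizers use learned sub-word pieces rather than splitting on spaces, and the tiny vocabulary below is invented purely for this example.

```python
# A deliberately naive "tokenizer": real LLM tokenizers (e.g. byte-pair
# encoding) break text into learned sub-word pieces from a vocabulary of tens
# of thousands of tokens, but whitespace splitting is enough to show the idea.
vocab = ["Translate", "hola", "into", "English", "hello", "<stop>"]  # toy pocket list

def tokenize(text: str) -> list[int]:
    # Map each piece of text to its position (ID) in the vocabulary.
    return [vocab.index(word) for word in text.split()]

print(tokenize("Translate hola into English"))  # [0, 1, 2, 3]
```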
When you give an input prompt to an LLM, it looks at the input as a group of individual tokens (Plinko balls). It then uses that group of input tokens to predict the next token, one at a time, until the answer is complete. It does this by dropping the metaphorical ball (the group of tokens) into the trained Plinko board, letting it bounce around the pegs (parameters) we set up, and seeing which pocket (output token) it lands in. In our example from before, the model sees a group of four input tokens, “Translate,” “hola,” “into,” and “English,” and returns a single output token, “hello,” as the response.
Before the model actually picks the next token, it assigns a probability to every token in its entire set. Remember, this set is the full list of options the model can pick from. Think of it like dropping the same Plinko ball as before, but instead of the ball landing in just one pocket, the board instantly gives you the odds of it landing in each and every pocket. We then take our list of probabilities and head to the generation step!
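If you like seeing a little math, here is a small sketch of how raw scores become those odds. The scores and the four-token vocabulary are made up for illustration; the conversion step (a softmax) is the standard way scores are turned into probabilities that add up to one.

```python
import math

# The model's last step produces a raw score for every token in its
# vocabulary; a softmax turns those scores into probabilities that sum to 1.
# The vocabulary and scores below are invented purely for illustration.
vocab = ["hello", "goodbye", "hola", "<stop>"]
scores = [4.0, 1.0, 0.5, 0.2]  # hypothetical raw scores ("logits")

exp_scores = [math.exp(s) for s in scores]
total = sum(exp_scores)
probs = [e / total for e in exp_scores]

for token, p in zip(vocab, probs):
    print(f"{token:8s} {p:.2%}")  # "hello" gets the lion's share of the odds
```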
Generating
With our fresh list of probabilities, the model selects the token with the highest probability and generates that. It then feeds this new group of tokens, the original group plus the one it just generated, back into the model to generate the next token, continuing this process until it hits what is called a stop token. The stop token is exactly what it sounds like: a marker that tells the model to halt. Sometimes it arrives after one word, sometimes after several paragraphs. This works just like any other token; the only difference is that, once selected, it tells the model to stop instead of continuing the sequence.
A key thing to understand about the generation step is that there is a level of randomness cooked into how the next token is selected. Sometimes the model will choose the second-highest-probability token instead of the top one, which means the new group of input tokens will be different, creating a downstream effect that looks like a brand-new answer to the same question. The actual substance of the generation is typically the same, but this is where model inconsistency comes from!
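Here is a rough sketch of that whole generate-and-feed-back loop. The `next_token_probs` function is a stand-in for the real trained model, and its probabilities are invented; the point is the loop itself, plus how a temperature knob (a common setting in real systems) controls whether the model always grabs the top pocket or sometimes a lower one.

```python
import random

# A sketch of the generation loop, assuming a made-up next_token_probs()
# function that stands in for the trained model: it returns a probability for
# every token. Sampling (rather than always taking the top token) is where
# run-to-run variation comes from; temperature = 0 always takes the top token.

def next_token_probs(tokens):
    # Placeholder for the real model: here the odds simply shift as the answer grows.
    if tokens[-1] == "hello":
        return {"hello": 0.05, "there": 0.15, "<stop>": 0.80}
    return {"hello": 0.85, "there": 0.10, "<stop>": 0.05}

def generate(prompt_tokens, temperature=1.0):
    tokens = list(prompt_tokens)
    while True:
        probs = next_token_probs(tokens)
        if temperature == 0:
            choice = max(probs, key=probs.get)  # no randomness: always the top pocket
        else:
            # Higher temperature flattens the odds (more randomness); lower sharpens them.
            weights = [p ** (1 / temperature) for p in probs.values()]
            choice = random.choices(list(probs), weights=weights)[0]
        if choice == "<stop>":                  # the stop token halts the loop
            return tokens[len(prompt_tokens):]
        tokens.append(choice)

print(generate(["Translate", "hola", "into", "English"], temperature=0))  # ['hello']
```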
Model inconsistency, the slight, random shifts in wording each time you run the same prompt, matters because it affects reliability. In high-stakes settings like legal drafting or data extraction, those small variations can introduce errors or require extra human review, so understanding and managing this randomness is crucial for trustworthy outputs.
Why this matters and what is next!
Now you can confidently say LLMs aren’t just “like ChatGPT”; they’re gigantic, finely tuned Plinko boards. First, we label thousands of “ball-to-pocket” examples. Then we train the board, nudging billions of pegs until new balls reliably land in the right pockets. During prediction, the model assigns a probability to every possible next token (pocket), and it generates text by repeatedly picking the most likely (or sometimes second-most-likely) pocket until it hits the one that tells it to stop.
With this foundation, we can dig into what models can’t do, where they fall short, and why. Stick around for simple tips you can use to get better responses from the tools you already use, like ChatGPT and Perplexity. Next week, we’ll explore their current limitations and whether anything on the horizon might change them.