The output of a large language model is often described, in simple terms, as next token prediction. While that is true, have you ever wondered how, specifically, this works? There are layers of nuance in how the next token actually gets chosen, and that's what led to this question…
What is sampling in LLMs and how does it relate to ethics?
Before I get started, I want to give some props to the book AI Engineering by Chip Huyen, as a fantastic resource. Chip unpacks technical ideas with clarity and her book was a key resource for this month’s column.
Data and AI Ethics Courses
Explore the ethical considerations and standards implicit in the data industry and the emerging realm of AI.
(Use code DATAEDU for 25% off!)
Sampling 101
You might already know that AI is not deterministic; it's probabilistic. This means that when you enter the same prompt into a large language model, you will not always get the same output. Instead, you will get an output that is likely, drawn from a probability distribution. Sampling is the process that makes the outputs probabilistic. Different sampling strategies and sampling variables will yield different types of results. Here's how it works:
“For a language model, to generate the next token, the model first computes the probability distribution over all tokens in the vocabulary.” –Huyen, 2025
A quick aside on vocabulary. The vocabulary is the total set of distinct tokens the model can use. It can include words, parts of words, phrases, and punctuation, and it's set as a hyperparameter by whoever builds the model. For example, Llama 3 has 128,000 tokens in its vocabulary according to this article by Meta.
Let’s walk through an example. Suppose we start with this:
My favourite pet is a
Let’s assume the following probabilities based on its vocabulary for these tokens (which I’ve made whole words for ease of explanation):
dog – 40%
cat – 30%
turtle – 20%
reptile – 5%
blue – 0.05%
jello* – 0.01%
The probabilities listed above are fictional, for the purposes of this article, but in reality these scores would be computed as the probability distribution over all the tokens in the vocabulary. So, if the model in question was Llama 3, imagine that another 127,994 other possible tokens are in that vocabulary, not just the six ones above.
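To make that step concrete, here's a minimal Python sketch of how a model's raw scores (called logits) become a probability distribution via the softmax function. The six-token vocabulary and the logit values below are made up for illustration, standing in for the full 128,000-token vocabulary:

```python
import math

# Toy vocabulary with made-up raw scores (logits).
# A real model produces one logit per token in a ~128,000-token vocabulary.
logits = {"dog": 4.0, "cat": 3.7, "turtle": 3.3,
          "reptile": 1.9, "blue": -0.7, "jello": -2.3}

def softmax(scores):
    """Convert raw logits into probabilities that sum to 1."""
    exps = {tok: math.exp(s) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

probs = softmax(logits)
# Higher logits get higher probability; the values sum to 1.
```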
It should also be noted that the model isn’t making a direct value judgement based on the question – as in, it prefers dogs to turtles. The model has no concept of dogs or turtles. It’s a probabilistic assessment based on the training data. If the overall training data was largely from the “Turtle lovers of America” you can imagine a different set of probabilities!
We can also note that some of the choices don’t readily make sense – like “jello” and a whole bunch of other tokens, that we didn’t include from the vocabulary (e.g., the other 127,994). That idea – making sense – is somewhat irrelevant. These are all just tokens to the model, they don’t really have meanings in and of themselves. It comes down to math and probability, which is why sampling strategies are important. That said, linguistic syntax – knowing where to put parts of speech – does become part of the pattern recognition that is encoded by probabilities. In other words, “a” is very unlikely to be followed by “the” and more likely to be followed by a noun like “dog.”
One sampling strategy is to select the highest ranked probability every time. This is known as greedy sampling. This might make for a very predictable model, but it would also be pretty boring. It may also leave your model stuck in a loop. Consider this example:
If “sorry” is the most likely token to follow “I’m” and vice versa, you might get this kind of looping output:
I’m sorry. I’m sorry. I’m sorry. I’m sorry. I’m sorry. I’m sorry…
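Greedy sampling itself is trivial to sketch. Using the made-up probabilities from the pet example, it just takes the single highest-ranked token every time (the `greedy_sample` helper name is hypothetical, for illustration):

```python
# Made-up probabilities from the pet example.
probs = {"dog": 0.40, "cat": 0.30, "turtle": 0.20,
         "reptile": 0.05, "blue": 0.0005, "jello": 0.0001}

def greedy_sample(probs):
    """Greedy decoding: always pick the single highest-probability token."""
    return max(probs, key=probs.get)

# Deterministic: the same input always yields the same token,
# which is exactly what makes loops like "I'm sorry. I'm sorry." possible.
next_token = greedy_sample(probs)
```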
A large language model’s utility relies on generating a plausible but novel response. In the case of “My favorite pet is a,” some people like dogs, others like turtles, some folks will prefer reptiles – it is subjective, there is no right or wrong. All of those choices might work perfectly well because they are all plausible types of pets. Jello, on the other hand, would be a highly creative choice, but also nonsensical. Blue is plausible, though less likely.
Another method might be random sampling. This means that the token “jello” has a shot (pun intended), albeit a very small one. Yet, if that happens and the model does select jello, then the whole process plays out again, but now the word “jello” is our starting point and things will progress from there. This can go off the rails quickly. We won’t go down that “jello shot” pathway.
More workable methods bridge the gap by allowing some randomness, but with constraints. For example, the Top-K method ranks all tokens by probability and keeps only the K most likely ones, discarding everything below that cutoff. In our case, with K set to 3, that would be dog, cat, and turtle. The model then renormalizes the probabilities among this smaller subset of token choices – again, dog, cat, turtle – and samples from that set.
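Here's a rough sketch of that idea, using the made-up pet probabilities and a hypothetical `top_k_sample` helper. Keeping only the three most likely tokens means dog, cat, and turtle are the only possible outputs:

```python
import random

# Made-up probabilities from the pet example.
probs = {"dog": 0.40, "cat": 0.30, "turtle": 0.20,
         "reptile": 0.05, "blue": 0.0005, "jello": 0.0001}

def top_k_sample(probs, k, rng=random):
    """Keep only the k highest-probability tokens, renormalize, and sample."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    tokens = [tok for tok, _ in top]
    weights = [p / total for _, p in top]  # renormalized so they sum to 1
    return rng.choices(tokens, weights=weights, k=1)[0]

# With k=3, "jello" and "blue" can never be selected.
```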
The number K is fixed, so it's not that adaptable … enter Top-P (aka nucleus sampling)! Top-P also starts by looking at the probabilities of tokens from highest to lowest, but it cumulatively adds them up until they meet a predefined threshold (P).
Let's set that threshold P at 95% for our example. This means that reptile is now in the consideration set of probabilities, along with dog, cat, and turtle. In our example, only four choices make the cut, but in other contexts, the distribution might be more spread out and include more possible choices. It should be noted that both methods – Top-K and Top-P – can reduce, but not necessarily eliminate, nonsensical choices. That is to say, if you set the threshold at 95%, some nonsensical tokens might get swept up in the process, and gibberish may be selected rather than a coherent next word.
That said, Top-P hits a happy medium, making it a popular choice for many large language models. The P threshold can be lowered for more control (e.g., 70% would leave just dog and cat in our example) or raised for more variety.
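A sketch of nucleus sampling, again with the made-up pet probabilities and a hypothetical `top_p_sample` helper. Tokens are added from most to least likely until their cumulative probability reaches P, then the model samples only from that "nucleus":

```python
import random

# Made-up probabilities from the pet example.
probs = {"dog": 0.40, "cat": 0.30, "turtle": 0.20,
         "reptile": 0.05, "blue": 0.0005, "jello": 0.0001}

def top_p_sample(probs, p, rng=random):
    """Nucleus sampling: keep the smallest top-ranked set of tokens whose
    cumulative probability reaches p, renormalize, and sample from it."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for tok, prob in ranked:
        nucleus.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in nucleus)
    tokens = [tok for tok, _ in nucleus]
    weights = [prob / total for _, prob in nucleus]
    return rng.choices(tokens, weights=weights, k=1)[0]

# p=0.95 keeps dog, cat, turtle, and reptile; p=0.70 keeps just dog and cat.
```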
All of these methods work with the existing probability distribution. But temperature sampling is a technique that changes the shape of the probability distribution itself. There are more details in this explainer, but basically, think of it as a dial that can make the model's token output more wild and unpredictable, or more conservative and restrained. Temperature can be combined with the other methods as another means of creative control.
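Mechanically, temperature divides the logits before the softmax step. Here's a sketch using the same made-up logits as before; temperatures below 1 sharpen the distribution toward the top token, while temperatures above 1 flatten it so long-shot tokens get more weight:

```python
import math

# Made-up logits for the toy pet vocabulary.
logits = {"dog": 4.0, "cat": 3.7, "turtle": 3.3,
          "reptile": 1.9, "blue": -0.7, "jello": -2.3}

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply softmax.
    T < 1 sharpens the distribution (more conservative);
    T > 1 flattens it (more wild and unpredictable)."""
    scaled = {tok: s / temperature for tok, s in logits.items()}
    exps = {tok: math.exp(s) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

cold = softmax_with_temperature(logits, 0.5)  # top token dominates
hot = softmax_with_temperature(logits, 2.0)   # long tail gains weight
```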
These are not all the sampling methods, but they are some of the more popular ones. The key point is that choices are being made that will have implications for the output and therefore, can also have ethical implications.
The Ethics Part
Most people are now aware that large language models can output inaccurate information, also called hallucinations. The output can also be inconsistent. A large language model can provide very different outputs for inputs that are quite similar.
Sampling helps explain why this happens. The choices made about sampling strategies have direct implications for the types of tokens that are output by the model. A more constrained approach might yield more restrained responses which might appear to us as being more factually correct or accurate. A model with a lower temperature would appear to be more consistent while one with a higher temperature might be thought of as more creative. These choices are mostly invisible to the user, though some models do allow the users to control the temperature. The key point is that all of these technical details actually encode ethical choices.
Think about context: Do we want an unconstrained, high-temperature, “creative” model answering questions in a high-stakes context, or a more grounded and consistent one? Getting to that outcome isn’t just about the choice of training data. It can also come down to the sampling strategy, which should be selected based on its suitability to the application.
Another consideration that will often be a factor is budget. Some of these methods are more computationally intensive and thus more costly.
*Absolutely complete aside: Not sure why jello came to mind as I wrote this, but I got on this whole jello thing, which led me to think about Jello Biafra and deep into the Dead Kennedys discography. Temperature settings high!
Send Me Your Questions!
I would love to hear about your data dilemmas or AI ethics questions and quandaries. You can send me a note at [email protected] or connect with me on LinkedIn. I will keep all inquiries confidential and remove any potentially sensitive information – so please feel free to keep things high level and anonymous as well.
This column is not legal advice. The information provided is strictly for educational purposes. AI and data regulation is an evolving area and anyone with specific questions should seek advice from a legal professional.
AI Governance Training
Gain the practical frameworks and tools to govern AI effectively.
Use code DATAEDU for 25% off through March 31.