The Zero Temperature Myth: Why “Greedy” Doesn’t Always Mean “Same”

It’s one of the most common assumptions in AI: set the Temperature ($T$) to 0, and you’ll get a perfectly deterministic, robotic, and consistent output every single time. It feels logical. In a world of probabilities, $T = 0$ should mean “stop guessing and just give me the top answer.”

But if you’ve spent enough time prompting, you’ve likely noticed the “ghost in the machine.” Even at absolute zero, the model might occasionally swap a “the” for an “a” or change the structure of a sentence.

Let’s dive into the math, the hardware, and the quirks of probability to understand why $T = 0$ isn’t the “lockdown” we think it is.


1. The Math: What Temperature Actually Does

In a Large Language Model, the final layer doesn’t output words; it outputs logits (raw scores for every possible word in its vocabulary). To turn these scores into a probability distribution, we use the Softmax function.

The Temperature hyperparameter is a scalar divisor applied to those logits before they hit the Softmax. The formula looks like this:

$$P_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$

Where:

  • $P_i$ is the probability of token $i$
  • $z_i$ is the logit (score) for that token
  • $T$ is the Temperature
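As a concrete sketch, here is the formula implemented with NumPy (the logit values are made up for illustration):

```python
import numpy as np

def softmax_with_temperature(logits, T):
    """P_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    z = np.asarray(logits, dtype=np.float64) / T
    z -= z.max()                  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits for three tokens
print(softmax_with_temperature([2.0, 1.0, 0.5], T=1.0))
```

Subtracting the maximum logit before exponentiating doesn’t change the result, but it prevents overflow when scores are large or $T$ is small.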

The Behavior of $T$

  • High $T$ ($T > 1$): Flattens the distribution. The gap between the “best” word and the “okay” word shrinks, making the model more creative (and chaotic).
  • Low $T$ ($0 < T < 1$): Sharpens the distribution. The “best” word gets a massive boost in probability compared to everything else.
  • $T = 0$: Mathematically, you can’t divide by zero, but in the limit as $T \to 0$ all of the probability mass collapses onto the single highest-scoring token. In practice, setting $T = 0$ tells the model to perform Greedy Decoding: it bypasses sampling entirely and simply picks the token with the highest logit.
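The three regimes above can be captured in one small decode helper (a sketch; real inference engines are more involved):

```python
import numpy as np

def sample_token(logits, T, rng=None):
    """Greedy argmax at T == 0; otherwise sample from the
    temperature-scaled softmax distribution."""
    logits = np.asarray(logits, dtype=np.float64)
    if T == 0:
        return int(np.argmax(logits))          # greedy decoding
    z = logits / T
    z -= z.max()                               # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(logits), p=p))

print(sample_token([3.2, 1.1, 0.4], T=0))      # always picks index 0
```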

2. Why the Inconsistency? The “GPU Chaos”

If the model is just picking the highest score, why would it ever change its mind? The answer usually lies in floating-point math and parallel processing.

Non-Associative Summation

Computers handle numbers using “floating-point” representation. Unlike pure math, in computer science, $(A + B) + C$ is not always exactly equal to $A + (B + C)$.
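You can verify this in any language that uses IEEE 754 doubles; in Python, for example:

```python
# Grouping changes the rounding, so the two sums differ in the last bit.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)    # False
print(left, right)      # 0.6000000000000001 0.6
```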

When a model runs on a GPU, it performs billions of operations in parallel. Because the order of these operations can vary slightly depending on the GPU’s workload or how the threads are scheduled, tiny rounding errors accumulate in the last few significant digits (around the 7th for 32-bit floats, the 16th for 64-bit floats).

The Butterfly Effect: If two words (e.g., “Apple” and “Banana”) have scores that are nearly identical, a tiny rounding error in the 16th decimal place can cause “Banana” to suddenly have a higher logit than “Apple.” At $T = 0$, the model picks the winner, even if it won by a microscopic margin.
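A toy demonstration of that flip, using made-up near-tied logits and a perturbation on the order of a single rounding error:

```python
import numpy as np

# Near-tied logits for "Apple" and "Banana" (illustrative values).
logits = np.array([12.000000000000002, 12.0])
print(np.argmax(logits))                 # 0 -> "Apple"

# Add a rounding-error-sized nudge to the runner-up.
noisy = logits + np.array([0.0, 4e-15])
print(np.argmax(noisy))                  # 1 -> "Banana"
```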

The Problem of “Ties”

Sometimes, the model genuinely cannot decide. If two tokens have the exact same logit score, the system has to break the tie. Depending on the implementation, this might fall back to a random choice or to the token’s position in the vocabulary, adding another layer of potential variance.
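For instance, NumPy’s argmax resolves exact ties by taking the lowest index, but other frameworks or GPU kernels are free to behave differently:

```python
import numpy as np

logits = np.array([5.0, 5.0, 3.0])   # exact tie between tokens 0 and 1
print(np.argmax(logits))             # 0: NumPy returns the first maximum
```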


3. Architecture and Service Quirks

Beyond the hardware, the way AI is served to you can cause shifts:

  • Mixture of Experts (MoE): Modern models (reportedly including some versions of Gemini and GPT-4) use a “routing” system where different parts of the model handle different parts of a query. If the router has any internal variance, different “experts” might handle your prompt, leading to slightly different logits.
  • Quantization: Many models are compressed to lower numerical precision so they run faster. That compression squeezes nearby logit scores even closer together (or collapses them entirely), increasing the likelihood that a tiny hardware fluctuation will flip the “top” choice.
  • Batching: When servers handle multiple requests at once, the way inputs are grouped can subtly affect the numerical precision of the calculations.
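To see how reduced precision squeezes logits together, here is a sketch that uses a cast to 16-bit floats as a crude stand-in for quantization (the values are illustrative):

```python
import numpy as np

# Full-precision logits separated by a tiny margin.
logits = np.array([8.0005, 8.0001], dtype=np.float64)
print(np.argmax(logits))                 # 0: the margin is preserved

# Half precision can only resolve steps of ~0.008 near 8.0,
# so the two scores collapse into an exact tie.
quantized = logits.astype(np.float16)
print(quantized[0] == quantized[1])      # True
```

Once the scores are tied, the output depends entirely on the tie-breaking behavior described earlier.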