Likelihood vs. Probability: What’s the Difference (and Why It Matters in Machine Learning)?

“Given a model, what’s the chance of this data?” vs. “Given this data, what’s the best model?”

Whether you’re diving into machine learning or statistics, or just curious about coin tosses, understanding the difference between probability and likelihood is foundational — yet it’s often misunderstood.

In this blog, we’ll explore both concepts through simple, visual, and intuitive examples — with a little help from our old friend: the coin.


Part 1: Understanding Probability

What Is Probability?

Probability is about predicting how likely a certain event is before it happens — assuming we already know the rules of the system.

Think Forward:

“If I know the coin is fair, what’s the chance of getting 3 heads in a row?”

This is a forward problem. You already know the model — now you’re trying to calculate the probability of an outcome.


Example: Fair Coin Toss

Assume the coin is fair:

  • p = 0.5 → Probability of Head
  • 1 - p = 0.5 → Probability of Tail

You toss the coin 3 times.

What’s the chance of getting HHH?

Since each flip is independent:

P(HHH | p = 0.5) = 0.5 × 0.5 × 0.5 = 0.125

That’s a 12.5% chance — only 1 of 8 possible outcomes (since 2^3 = 8).
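
If you want to sanity-check the arithmetic, here is a tiny Python sketch (my own illustration, not part of the original example) that enumerates all 2^3 sequences and computes the forward probability:

```python
from itertools import product

p = 0.5  # assumed probability of Heads for a fair coin

# Enumerate all 2**3 = 8 equally likely three-toss sequences
outcomes = list(product("HT", repeat=3))
print(len(outcomes))   # 8

# Probability of H, H, H under independence: p * p * p
print(p ** 3)          # 0.125
```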


Part 2: Understanding Likelihood

What Is Likelihood?

Likelihood flips the question.

“I’ve seen 3 heads in a row — what’s the most likely value of p?”

Now, instead of assuming the model and asking about outcomes, you assume the data and ask:

What model (what value of p) makes this data most plausible?

This is the reverse direction — and it’s the foundation of Maximum Likelihood Estimation (MLE).


Example 1: You Observe HHH (3 Heads)

Let’s say you toss a coin 3 times, and observe: Result = H, H, H

You don’t know if the coin is fair. You want to estimate p — the probability of Heads — that makes this sequence most likely.

Likelihood Function:

L(p) = p × p × p = p^3

Why? Because each Head has probability p, and:

L(p) = P(H) × P(H) × P(H) = p^3

Maximize It:

Maximize L(p) = p^3 ⇒ since p^3 keeps increasing as p grows from 0 to 1, it is highest when p = 1

So:

  • MLE estimate: p̂ = 1

In other words, based on HHH, you’d conclude the coin is fully biased toward Heads — even though that conclusion is shaky with so little data.

Note: Likelihood doesn’t judge; it just finds the best fit for the data you’ve got.
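
To see that maximization numerically, here is a minimal sketch in Python; the function name likelihood_hhh and the grid of candidate values are my own illustration, not part of the post:

```python
import numpy as np

def likelihood_hhh(p):
    """Likelihood of observing H, H, H when P(Heads) = p."""
    return p ** 3

ps = np.linspace(0, 1, 101)                    # candidate values of p
p_hat = ps[np.argmax(likelihood_hhh(ps))]      # value with the highest likelihood
print(p_hat)                                   # 1.0 -> the MLE after seeing HHH
```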


Example 2: You Observe HHT (2 Heads, 1 Tail)

Now let’s say you observe: Result = H, H, T

This time, the likelihood function becomes:

L(p) = p^2 × (1 - p)

  • 2 Heads → p^2
  • 1 Tail → (1 - p)

Evaluate L(p) for Different Values:

p     | L(p) = p^2 × (1 - p)
0.1   | 0.01 × 0.9 = 0.009
0.3   | 0.09 × 0.7 = 0.063
0.5   | 0.25 × 0.5 = 0.125
0.6   | 0.36 × 0.4 = 0.144
0.66  | 0.4356 × 0.34 ≈ 0.148
0.9   | 0.81 × 0.1 = 0.081

Result:

Across these candidates, the likelihood peaks near p ≈ 0.66; the exact maximum is at p = 2/3 ≈ 0.67

So:

  • MLE estimate: p̂ = 2/3

Why? Because HHT happened (Heads on 2 of 3 tosses), and p = 2/3 is the value that makes that sequence most plausible.
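
The same grid-search idea confirms the table above. Calculus agrees: setting the derivative of p^2 × (1 - p), which is 2p - 3p^2, to zero gives p = 2/3. Below is a minimal sketch (the function name is again my own):

```python
import numpy as np

def likelihood_hht(p):
    """Likelihood of observing H, H, T when P(Heads) = p."""
    return p ** 2 * (1 - p)

ps = np.linspace(0, 1, 1001)                   # candidate values of p
p_hat = ps[np.argmax(likelihood_hht(ps))]
print(round(p_hat, 3))                         # 0.667, i.e. p = 2/3
print(likelihood_hht(2 / 3))                   # ~0.148 (exactly 4/27), the maximum
```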


Putting It All Together

Concept     | Direction     | Assumes     | Solves for           | Example
Probability | Forward       | Known model | Chance of the data   | What’s P(HHH) if p = 0.5?
Likelihood  | Reverse (MLE) | Known data  | Best-fitting model   | What’s the best p if HHH is observed?

Why This Matters in Machine Learning

In machine learning, especially during model training:

  • We assume the data is fixed (it’s our training set)
  • We optimize the model parameters to maximize the likelihood

So every time your model “learns,” it’s usually maximizing a likelihood function.
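
As a deliberately simplified sketch of what that looks like in code, here is the coin example rewritten the way a loss function sees it: minimizing a negative log-likelihood over a fixed dataset. The grid search stands in for a real optimizer; actual training loops use gradients, but the objective is the same idea.

```python
import numpy as np

# Fixed "training set": 1 = Heads, 0 = Tails (the HHT example from above)
data = np.array([1, 1, 0])

def neg_log_likelihood(p, data):
    """Negative log-likelihood of a Bernoulli(p) model for the observed tosses."""
    return -np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

# The best parameter is the one that minimizes the loss,
# which is the same as maximizing the likelihood.
ps = np.linspace(0.01, 0.99, 99)
p_hat = ps[np.argmin([neg_log_likelihood(p, data) for p in ps])]
print(round(p_hat, 2))   # ~0.67 -- the same MLE as the HHT example
```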


Final Thought

Probability predicts. Likelihood explains.

They may look like twins, but they walk in opposite directions.
Mastering both helps you think clearly about data, models, and uncertainty — whether you’re flipping coins or training neural networks.