Hypergeometric Distribution

📊 Statistics · ⏱️ 15 min read · 📅 Last updated: 01/14/2025

Introduction

The hypergeometric distribution is a discrete probability distribution that models the number of successes in a sample drawn without replacement from a finite population. Because each draw changes the composition of what remains, the probability of success shifts from one draw to the next; this dependence between draws is what separates it from the binomial distribution. This article covers the definition, the probability function, the mean and variance, worked examples, and the binomial approximation.

What is the Hypergeometric Distribution?

The hypergeometric distribution describes the number of successes in n draws made without replacement from a finite population of size N, in which exactly K elements are classified as successes.

Conditions for Hypergeometric Distribution

For a situation to be modeled by a hypergeometric distribution, the following conditions must be met:

  1. Finite population (N): The population has a fixed and known size
  2. Successes in population (K): Exactly K of the N elements are successes; the other N - K are failures
  3. Sample without replacement (n): We draw n elements without returning them
  4. Variable probability: The probability of success on each draw depends on the outcomes of the earlier draws
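
To make the fourth condition concrete, the sketch below tracks the chance of a success on the next draw as items leave the population. The population (N = 10, K = 4) and the helper name `next_draw_success_probability` are illustrative assumptions, not values from the article.

```python
from fractions import Fraction


def next_draw_success_probability(successes_left: int, items_left: int) -> Fraction:
    """Chance that the next draw is a success, given what remains in the population."""
    return Fraction(successes_left, items_left)


# Hypothetical population: N = 10 items, K = 4 of them successes.
N, K = 10, 4

# First draw: nothing has been removed yet.
p_first = next_draw_success_probability(K, N)

# Second draw, after the first draw was a success (one success removed).
p_second_after_success = next_draw_success_probability(K - 1, N - 1)

# Second draw, after the first draw was a failure (one failure removed).
p_second_after_failure = next_draw_success_probability(K, N - 1)

print(p_first, p_second_after_success, p_second_after_failure)
```

With replacement, all three values would equal K/N; without replacement they differ, which is the dependence the hypergeometric distribution accounts for.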

Binomial vs Hypergeometric Difference

⚠️ Crucial Difference

Binomial

  • With replacement
  • Constant probability
  • Independent trials
  • Infinite (or very large) population

Hypergeometric

  • Without replacement
  • Variable probability
  • Dependent trials
  • Finite population

Formula

The probability function of the hypergeometric distribution is given by:

Probability Function

P(X = k) = [C(K,k) × C(N-K, n-k)] / C(N,n)
  • X: Random variable (number of successes in sample)
  • k: Number of successes observed in the sample (max(0, n - (N-K)) ≤ k ≤ min(n, K))
  • N: Size of total population
  • K: Number of successes in population
  • n: Sample size
  • C(a,b): Binomial coefficient (combinations)
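
The probability function translates almost directly into code. The following is a minimal sketch using Python's standard `math.comb`; the function name `hypergeom_pmf` and the small test population (N = 10, K = 4, n = 3) are assumptions chosen for illustration.

```python
from math import comb


def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """P(X = k) for X ~ Hypergeometric(N, K, n)."""
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("need 0 <= K <= N and 0 <= n <= N")
    # Values of k outside the support have probability zero.
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    # Favorable samples divided by all possible samples.
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


# Hypothetical small population: the probabilities over all k should sum to 1.
probabilities = [hypergeom_pmf(k, N=10, K=4, n=3) for k in range(0, 4)]
print(probabilities, sum(probabilities))
```

Working with exact integers from `comb` and dividing only once at the end avoids accumulating rounding error in the intermediate counts.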

Notation

Standard Notation

X ~ Hypergeometric(N, K, n)

Read as: "X follows a hypergeometric distribution with parameters N, K and n"

Parameters N, K and n

Parameter N

Population size: Total number of elements in the population. Must be a positive integer.

Parameter K

Successes in population: Number of elements of interest in the population. Must satisfy 0 ≤ K ≤ N.

Parameter n

Sample size: Number of elements drawn without replacement. Must satisfy 0 ≤ n ≤ N.
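
The parameter constraints can be enforced up front, before any probability is computed. The sketch below is one possible way to do that; the class name `HypergeomParams` and its `support` method are hypothetical names introduced here, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HypergeomParams:
    """Validated parameters (N, K, n) of a hypergeometric distribution."""

    N: int  # population size
    K: int  # successes in the population
    n: int  # sample size

    def __post_init__(self) -> None:
        if self.N < 1:
            raise ValueError("N must be a positive integer")
        if not 0 <= self.K <= self.N:
            raise ValueError("K must satisfy 0 <= K <= N")
        if not 0 <= self.n <= self.N:
            raise ValueError("n must satisfy 0 <= n <= N")

    def support(self) -> range:
        """Values of k that have nonzero probability."""
        low = max(0, self.n - (self.N - self.K))
        high = min(self.n, self.K)
        return range(low, high + 1)


print(list(HypergeomParams(N=100, K=20, n=10).support()))
print(list(HypergeomParams(N=10, K=8, n=5).support()))
```

The second call shows that the lower bound is not always 0: with only 2 failures in the population, a sample of 5 must contain at least 3 successes.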

Mean and Variance

Measures of central tendency and dispersion of the hypergeometric distribution:

Fundamental Statistics

Mean (Expected Value)

E[X] = μ = n × (K/N)

The expected number of successes in the sample.

Variance

Var(X) = σ² = n × (K/N) × ((N-K)/N) × ((N-n)/(N-1))

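The factor (N-n)/(N-1) is the finite population correction. It arises from sampling without replacement and makes the variance smaller than the binomial variance n × p × (1-p) with p = K/N whenever n > 1.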

Standard Deviation

σ = √Var(X)
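
As a sanity check on these formulas, the closed-form mean and variance can be compared with values computed directly from the probability function. This is a sketch; the function names are assumptions, and the parameters (N = 100, K = 20, n = 10) are those of the quality control example in this article.

```python
from math import comb, sqrt


def hypergeom_moments(N: int, K: int, n: int) -> tuple[float, float, float]:
    """Closed-form mean, variance, and standard deviation."""
    p = K / N
    mean = n * p
    # Finite population correction; defined as 0 when N == 1 to avoid 0/0.
    fpc = (N - n) / (N - 1) if N > 1 else 0.0
    variance = n * p * (1 - p) * fpc
    return mean, variance, sqrt(variance)


def moments_from_pmf(N: int, K: int, n: int) -> tuple[float, float]:
    """Mean and variance obtained by summing over the probability function."""
    total = comb(N, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so values of k outside the support contribute nothing.
    probs = {k: comb(K, k) * comb(N - K, n - k) / total for k in range(n + 1)}
    mean = sum(k * p for k, p in probs.items())
    variance = sum((k - mean) ** 2 * p for k, p in probs.items())
    return mean, variance


mean, variance, std_dev = hypergeom_moments(N=100, K=20, n=10)
mean_check, variance_check = moments_from_pmf(N=100, K=20, n=10)
print(mean, variance, std_dev)
print(mean_check, variance_check)
```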

Interpretation

The hypergeometric formula can be understood as:

Combinatorial Interpretation

  • C(K,k): Ways to choose k successes from the K available
  • C(N-K, n-k): Ways to choose (n-k) failures from the (N-K) available
  • C(N,n): Total ways to choose n elements from N
  • Quotient: Probability = (Favorable cases) / (Possible cases)
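
The counting argument above can be verified by brute force on a population small enough to enumerate. The sketch below lists every possible sample and tallies the successes in each; the population (N = 8, K = 3, n = 4) is a hypothetical choice kept small so the enumeration stays cheap.

```python
from itertools import combinations
from math import comb

# Hypothetical population: 3 successes (1) and 5 failures (0).
N, K, n = 8, 3, 4
population = [1] * K + [0] * (N - K)

# Enumerate every sample of n distinct positions and count its successes.
counts: dict[int, int] = {}
for sample in combinations(range(N), n):
    k = sum(population[i] for i in sample)
    counts[k] = counts.get(k, 0) + 1

total_samples = comb(N, n)
for k in sorted(counts):
    by_formula = comb(K, k) * comb(N - K, n - k)
    print(k, counts[k], by_formula, counts[k] / total_samples)
```

For each k, the enumerated count matches C(K,k) × C(N-K, n-k), and the counts add up to C(N,n).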

Examples

Example 1: Quality Control

Problem

In a batch of 100 products, 20 are defective. If we inspect 10 products randomly without replacement, what is the probability of finding exactly 3 defective products?

Solution

  • N = 100, K = 20, n = 10, k = 3
  • P(X = 3) = [C(20,3) × C(80,7)] / C(100,10)
  • P(X = 3) ≈ 0.209

The probability of finding exactly 3 defective products is approximately 20.9%.
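
The arithmetic behind this result can be reproduced step by step. The sketch below uses only the numbers stated in the problem; the variable names are illustrative.

```python
from math import comb

# Parameters from the problem statement.
N, K, n, k = 100, 20, 10, 3

ways_defective = comb(K, k)          # choose 3 of the 20 defective products
ways_good = comb(N - K, n - k)       # choose 7 of the 80 non-defective products
ways_total = comb(N, n)              # choose any 10 of the 100 products

probability = ways_defective * ways_good / ways_total
print(ways_defective, ways_good, ways_total)
print(round(probability, 3))
```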

Example 2: Cards

Problem

From a deck of 52 cards, 13 are spades. If we draw 5 cards without replacement, what is the probability of getting exactly 2 spades?

Solution

  • N = 52, K = 13, n = 5, k = 2
  • P(X = 2) = [C(13,2) × C(39,3)] / C(52,5)
  • P(X = 2) ≈ 0.274

The probability of getting exactly 2 spades is approximately 27.4%.
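
A result like this can also be checked empirically by dealing many random hands. The following is a rough Monte Carlo sketch, assuming a deck encoded as 13 ones (spades) and 39 zeros; the function name, trial count, and seed are arbitrary choices, and the estimate will fluctuate slightly around the exact value.

```python
import random
from math import comb


def simulate_exactly_two_spades(trials: int, seed: int = 0) -> float:
    """Estimate P(exactly 2 spades in 5 cards) by repeated dealing."""
    rng = random.Random(seed)
    deck = [1] * 13 + [0] * 39  # 1 = spade, 0 = any other suit
    hits = 0
    for _ in range(trials):
        hand = rng.sample(deck, 5)  # sample() draws without replacement
        if sum(hand) == 2:
            hits += 1
    return hits / trials


exact = comb(13, 2) * comb(39, 3) / comb(52, 5)
estimate = simulate_exactly_two_spades(100_000)
print(round(exact, 3), estimate)
```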

Example 3: Poll

Problem

In a city with 10,000 voters, 4,000 intend to vote for candidate A. If we conduct a poll with 100 randomly selected voters (without replacement), what is the probability that exactly 45 vote for candidate A?

Solution

  • N = 10,000, K = 4,000, n = 100, k = 45
  • P(X = 45) = [C(4000,45) × C(6000,55)] / C(10000,100)
  • P(X = 45) ≈ 0.048

The probability of exactly 45 voters choosing candidate A is approximately 4.8%.
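
The binomial coefficients in this example are enormous. Python's integers handle them exactly, but in environments limited to floating point a common workaround is to work with logarithms. The sketch below compares both routes; `log_comb` and `hypergeom_pmf_log` are hypothetical helper names, and the log version assumes k lies inside the support.

```python
from math import comb, exp, lgamma


def log_comb(a: int, b: int) -> float:
    """Natural log of C(a, b), via the log-gamma function (requires 0 <= b <= a)."""
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)


def hypergeom_pmf_log(k: int, N: int, K: int, n: int) -> float:
    """P(X = k) computed in log space; assumes k is inside the support."""
    log_p = log_comb(K, k) + log_comb(N - K, n - k) - log_comb(N, n)
    return exp(log_p)


# Exact route: big-integer arithmetic, one division at the end.
exact = comb(4000, 45) * comb(6000, 55) / comb(10000, 100)

# Log-space route: only ordinary floats are involved.
via_logs = hypergeom_pmf_log(45, N=10_000, K=4_000, n=100)

print(round(exact, 3), round(via_logs, 3))
```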

Comparison Table

When to Use Each

Criterion   | Hypergeometric              | Binomial
Sampling    | Without replacement         | With replacement
Population  | Finite (small)              | Infinite (or very large)
Probability | Varies after each draw      | Constant in each trial
Variance    | Smaller (correction factor) | Larger

Binomial Approximation

When the population is very large relative to the sample, the hypergeometric distribution can be approximated by the binomial distribution:

Approximation Condition

If n/N < 0.05 (sample is less than 5% of population), we can use:

X ≈ Binomial(n, p = K/N)

In this case, sampling without replacement behaves approximately like sampling with replacement.
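
To see how the quality of the approximation depends on n/N, the sketch below measures the largest pointwise gap between the two probability functions in two settings taken from this article's examples: the poll (n/N = 0.01) and the quality control batch (n/N = 0.10). The function names are assumptions made for illustration.

```python
from math import comb


def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so values of k outside the support give probability 0.
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def binom_pmf(k: int, n: int, p: float) -> float:
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def max_abs_gap(N: int, K: int, n: int) -> float:
    """Largest |hypergeometric - binomial| difference over k = 0..n."""
    p = K / N
    return max(abs(hypergeom_pmf(k, N, K, n) - binom_pmf(k, n, p)) for k in range(n + 1))


gap_poll = max_abs_gap(N=10_000, K=4_000, n=100)  # n/N = 0.01
gap_batch = max_abs_gap(N=100, K=20, n=10)        # n/N = 0.10

print(gap_poll, gap_batch)
```

The gap is much smaller for the poll than for the batch, in line with the 5% rule of thumb.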

Applications

Application Areas

  • Quality Control: Inspection of finite lots without replacement
  • Audit: Verification of documents in a finite population
  • Opinion Polls: Sampling without replacement from finite populations
  • Compliance Testing: Verification of items in lots
  • Card Games: Probabilities in games where cards are not returned
  • Process Sampling: When population is limited

When NOT to Use

⚠️ When NOT to Use the Hypergeometric Distribution

  • Sampling with replacement: Use binomial distribution
  • Very large population: If n/N < 0.05, the hypergeometric model is still valid, but the simpler binomial approximation is usually adequate
  • Sample larger than population: Mathematically impossible
  • More than two outcomes: Use multivariate hypergeometric distribution

Multivariate Extension

Multivariate Hypergeometric Distribution

When there are more than two categories in the population, the multivariate hypergeometric distribution extends the concept to multiple categories simultaneously, giving the joint probability of drawing a specified number of elements from each category.
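
The joint probability follows the same counting logic as the two-category case: multiply the ways to choose the required number from each category, then divide by the total number of samples. The sketch below is one way to write it; the function name and the card example (2 spades, 2 hearts, 1 diamond, 0 clubs in a 5-card hand) are assumptions chosen for illustration.

```python
from math import comb, prod


def multivariate_hypergeom_pmf(sample_counts: list[int], population_counts: list[int]) -> float:
    """Joint probability of drawing sample_counts[i] items from category i."""
    if len(sample_counts) != len(population_counts):
        raise ValueError("both lists need one entry per category")
    N = sum(population_counts)  # population size
    n = sum(sample_counts)      # sample size
    # Ways to pick the required number from every category at once.
    favorable = prod(comb(K_i, k_i) for K_i, k_i in zip(population_counts, sample_counts))
    return favorable / comb(N, n)


# Four suits of 13 cards each; hand of 2 spades, 2 hearts, 1 diamond, 0 clubs.
p_hand = multivariate_hypergeom_pmf([2, 2, 1, 0], [13, 13, 13, 13])

# With two categories the formula reduces to the ordinary hypergeometric case.
p_two_spades = multivariate_hypergeom_pmf([2, 3], [13, 39])

print(p_hand, p_two_spades)
```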

Conclusion

The hypergeometric distribution is essential for modeling situations where sampling without replacement affects probabilities. It is fundamental in quality control, audit, electoral polls, and many other areas where we work with finite populations.

Understanding the difference between the hypergeometric and binomial distributions is crucial for choosing the appropriate model. When the sample is small relative to the population (n/N < 0.05), both give similar results; for larger sampling fractions, the difference becomes significant.

Remember: the hypergeometric distribution is appropriate when we have sampling without replacement from a finite population. If the population is very large or sampling is with replacement, consider using the binomial distribution.

Hypergeometric Distribution - Articles | SevenCoins