Hypergeometric Distribution
Introduction
The hypergeometric distribution is a discrete probability distribution that models the number of successes in a sample drawn without replacement from a finite population. Because drawn items are not returned, each draw changes the composition of what remains, so the probability of success shifts from one draw to the next. This is what separates it from the binomial distribution. This article covers the formula, parameters, mean and variance, worked examples, and the binomial approximation.
What is the Hypergeometric Distribution?
The hypergeometric distribution describes the number of successes in n draws made without replacement from a finite population of size N, in which exactly K elements are classified as successes and the remaining N-K as failures.
Conditions for Hypergeometric Distribution
For a situation to be modeled by a hypergeometric distribution, the following conditions must be met:
- 1. Finite population (N): The population has a fixed and known size
- 2. Successes in population (K): Exactly K elements of the population are successes; the other N-K are failures
- 3. Sample without replacement (n): We draw n elements without returning them
- 4. Variable probability: The probability of success on each draw depends on the outcomes of the earlier draws
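To make the fourth condition concrete, here is a minimal sketch that tracks the chance of success on the next draw as the population shrinks. The deck numbers (52 cards, 13 spades) and the helper name `success_probability` are illustrative assumptions, not part of any library.

```python
from fractions import Fraction

def success_probability(successes_left: int, items_left: int) -> Fraction:
    """Chance that the next draw is a success, given what remains."""
    return Fraction(successes_left, items_left)

N, K = 52, 13  # assumed example: 52 cards, 13 of them spades

first_draw = success_probability(K, N)
# If the first card was a spade, one success and one card are gone.
second_after_success = success_probability(K - 1, N - 1)
# If the first card was not a spade, only the card count drops.
second_after_failure = success_probability(K, N - 1)

print(first_draw, second_after_success, second_after_failure)
# 1/4 4/17 13/51
```

With replacement, all three values would stay at 1/4, which is the binomial setting.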
Binomial vs Hypergeometric Difference
⚠️ Crucial Difference
Binomial
- • With replacement
- • Constant probability
- • Independent trials
- • Infinite (or very large) population
Hypergeometric
- • Without replacement
- • Variable probability
- • Dependent trials
- • Finite population
Formula
The probability function of the hypergeometric distribution is given by:
Probability Function
P(X = k) = [C(K,k) × C(N-K, n-k)] / C(N,n)
- • X: Random variable (number of successes in sample)
- • k: Number of successes whose probability we want, with max(0, n-(N-K)) ≤ k ≤ min(n, K)
- • N: Size of total population
- • K: Number of successes in population
- • n: Sample size
- • C(a,b): Binomial coefficient, the number of ways to choose b items from a: C(a,b) = a! / [b! × (a-b)!]
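The probability function translates almost directly into code. The sketch below uses `math.comb` from the Python standard library for C(a,b); the function name `hypergeom_pmf` and the small test population (N = 10, K = 4, n = 3) are assumptions chosen for illustration.

```python
from math import comb

def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """P(X = k) for X ~ Hypergeometric(N, K, n)."""
    # Outside the support the probability is zero: we cannot draw more
    # successes than exist, nor more failures than exist.
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Sanity check: the probabilities over every possible k add up to 1.
total = sum(hypergeom_pmf(k, N=10, K=4, n=3) for k in range(0, 4))
print(total)
```

The explicit support check keeps `comb` from being called with a negative argument, which would raise a `ValueError`.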
Notation
Standard Notation
X ~ Hypergeometric(N, K, n)
Read as: "X follows a hypergeometric distribution with parameters N, K and n"
Parameters N, K and n
Parameter N
Population size: Total number of elements in the population. Must be a positive integer.
Parameter K
Successes in population: Number of elements of interest in the population. Must satisfy 0 ≤ K ≤ N.
Parameter n
Sample size: Number of elements drawn without replacement. Must satisfy 0 ≤ n ≤ N.
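The three parameter constraints can be checked up front before any probability is computed. This is a minimal sketch: the helper name `check_parameters` is hypothetical, and it also returns the range of attainable k values as a convenience.

```python
def check_parameters(N: int, K: int, n: int) -> range:
    """Validate (N, K, n) and return the attainable values of k."""
    for name, value in (("N", N), ("K", K), ("n", n)):
        if not isinstance(value, int):
            raise TypeError(f"{name} must be an integer")
    if N <= 0:
        raise ValueError("N must be a positive integer")
    if not 0 <= K <= N:
        raise ValueError("K must satisfy 0 <= K <= N")
    if not 0 <= n <= N:
        raise ValueError("n must satisfy 0 <= n <= N")
    # At least n - (N - K) successes are forced once the failures run out.
    lowest = max(0, n - (N - K))
    highest = min(n, K)
    return range(lowest, highest + 1)

support = check_parameters(N=10, K=7, n=5)
print(list(support))  # [2, 3, 4, 5]
```

In the example only 3 failures exist, so a sample of 5 must contain at least 2 successes.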
Mean and Variance
Measures of central tendency and dispersion of the hypergeometric distribution:
Fundamental Statistics
Mean (Expected Value)
E[X] = μ = n × (K/N)
The expected number of successes in the sample.
Variance
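Var(X) = σ² = n × (K/N) × ((N-K)/N) × ((N-n)/(N-1))
The finite population correction factor (N-n)/(N-1) appears because sampling is without replacement. Without it, the expression is the binomial variance n × p × (1-p) with p = K/N.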
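The closed-form mean and variance can be cross-checked against their definitions by summing over the probability function. The sketch below does this for an assumed batch of N = 100 items with K = 20 successes and a sample of n = 10; the variable names are illustrative.

```python
from math import comb

def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """P(X = k) for k inside the support."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

N, K, n = 100, 20, 10
support = range(max(0, n - (N - K)), min(n, K) + 1)

# Definitions: E[X] = sum of k * P(X = k), Var(X) = sum of (k - mean)^2 * P(X = k)
mean_by_sum = sum(k * hypergeom_pmf(k, N, K, n) for k in support)
var_by_sum = sum((k - mean_by_sum) ** 2 * hypergeom_pmf(k, N, K, n) for k in support)

# Closed forms from the text
mean_formula = n * (K / N)
var_formula = n * (K / N) * ((N - K) / N) * ((N - n) / (N - 1))

print(mean_by_sum, mean_formula)
print(var_by_sum, var_formula)
```

Both pairs should agree up to floating-point rounding, and the variance is smaller than the binomial value 10 × 0.2 × 0.8 = 1.6 because of the correction factor 90/99.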
Standard Deviation
σ = √Var(X)
Interpretation
The hypergeometric formula can be understood as:
Combinatorial Interpretation
- • C(K,k): Ways to choose k successes from the K available
- • C(N-K, n-k): Ways to choose (n-k) failures from the (N-K) available
- • C(N,n): Total ways to choose n elements from N
- • Quotient: Probability = (Favorable cases) / (Possible cases)
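The counting argument can be verified by brute force on a population small enough to enumerate. This sketch assumes a toy population of N = 6 items with K = 3 successes and lists every possible sample of n = 3.

```python
from itertools import combinations
from math import comb

N, K, n, k = 6, 3, 3, 1
# Items 0..K-1 are successes (1), the rest are failures (0).
population = [1] * K + [0] * (N - K)

# Enumerate samples by item index so that every item stays distinct.
samples = list(combinations(range(N), n))
favorable = sum(1 for s in samples if sum(population[i] for i in s) == k)

print(favorable, len(samples))                    # 9 20
print(comb(K, k) * comb(N - K, n - k), comb(N, n))  # 9 20
```

The enumeration and the product C(K,k) × C(N-K, n-k) give the same count of favorable samples, so the probability is 9/20 either way.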
Examples
Example 1: Quality Control
Problem
In a batch of 100 products, 20 are defective. If we inspect 10 products randomly without replacement, what is the probability of finding exactly 3 defective products?
Solution
- • N = 100, K = 20, n = 10, k = 3
- • P(X = 3) = [C(20,3) × C(80,7)] / C(100,10)
- • P(X = 3) ≈ 0.209
The probability of finding exactly 3 defective products is approximately 20.9%.
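The arithmetic behind this result can be reproduced step by step. The sketch below computes each binomial coefficient separately with `math.comb`; the variable names are illustrative.

```python
from math import comb

N, K, n, k = 100, 20, 10, 3

ways_defective = comb(K, k)      # choose 3 of the 20 defective items
ways_good = comb(N - K, n - k)   # choose 7 of the 80 good items
ways_total = comb(N, n)          # choose any 10 of the 100 items

p = ways_defective * ways_good / ways_total

print(ways_defective, ways_good, ways_total)
print(round(p, 3))  # 0.209
```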
Example 2: Cards
Problem
From a deck of 52 cards, 13 are spades. If we draw 5 cards without replacement, what is the probability of getting exactly 2 spades?
Solution
- • N = 52, K = 13, n = 5, k = 2
- • P(X = 2) = [C(13,2) × C(39,3)] / C(52,5)
- • P(X = 2) ≈ 0.274
The probability of getting exactly 2 spades is approximately 27.4%.
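A simulation offers an independent check on this figure. The sketch below deals many random 5-card hands with `random.sample`, which draws without replacement, and compares the observed frequency with the exact value. The trial count and seed are arbitrary assumptions.

```python
import random
from math import comb

def simulate_two_spades(trials: int, seed: int = 0) -> float:
    """Fraction of simulated 5-card hands with exactly 2 spades."""
    rng = random.Random(seed)
    deck = [1] * 13 + [0] * 39  # 1 marks a spade
    hits = 0
    for _ in range(trials):
        hand = rng.sample(deck, 5)  # sampling without replacement
        if sum(hand) == 2:
            hits += 1
    return hits / trials

exact = comb(13, 2) * comb(39, 3) / comb(52, 5)
estimate = simulate_two_spades(50_000)

print(round(exact, 3))  # 0.274
print(estimate)
```

The estimate fluctuates with the seed and trial count, but it should land close to the exact probability.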
Example 3: Poll
Problem
In a city with 10,000 voters, 4,000 intend to vote for candidate A. If we conduct a poll of 100 randomly selected voters (without replacement), what is the probability that exactly 45 of them intend to vote for candidate A?
Solution
- • N = 10,000, K = 4,000, n = 100, k = 45
- • P(X = 45) = [C(4000,45) × C(6000,55)] / C(10000,100)
- • P(X = 45) ≈ 0.048
The probability of exactly 45 voters choosing candidate A is approximately 4.8%.
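The binomial coefficients in this example are enormous (C(10000,100) has more than 240 digits). Python's arbitrary-precision integers handle them directly, but in fixed-width arithmetic a common workaround is to compute in log space. The sketch below shows one such approach using `math.lgamma`; the helper names `log_comb` and `hypergeom_pmf_log` are assumptions for illustration.

```python
from math import comb, exp, lgamma

def log_comb(a: int, b: int) -> float:
    """Natural log of C(a, b), using ln(m!) = lgamma(m + 1)."""
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)

def hypergeom_pmf_log(k: int, N: int, K: int, n: int) -> float:
    """P(X = k) computed via logarithms (k assumed inside the support)."""
    log_p = log_comb(K, k) + log_comb(N - K, n - k) - log_comb(N, n)
    return exp(log_p)

N, K, n, k = 10_000, 4_000, 100, 45

p_log = hypergeom_pmf_log(k, N, K, n)
# Exact big-integer computation for comparison.
p_exact = comb(K, k) * comb(N - K, n - k) / comb(N, n)

print(p_log, p_exact)
```

The two values should match to many decimal places; the log-space version trades a little floating-point accuracy for never forming the huge intermediate integers.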
Comparison Table
When to Use Each
| Criterion | Hypergeometric | Binomial |
|---|---|---|
| Sampling | Without replacement | With replacement |
| Population | Finite, with n a sizable fraction of N | Infinite, or very large relative to n |
| Probability | Varies after each draw | Constant in each trial |
| Variance | Smaller, by the factor (N-n)/(N-1) | Larger: n × p × (1-p) |
Binomial Approximation
When the population is very large relative to the sample, the hypergeometric distribution can be approximated by the binomial distribution:
Approximation Condition
As a common rule of thumb, if n/N < 0.05 (the sample is less than 5% of the population), we can use:
X ≈ Binomial(n, p = K/N)
In this case, sampling without replacement behaves approximately like sampling with replacement.
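To see how the quality of the approximation depends on n/N, the sketch below measures the largest pointwise gap between the two probability functions for the poll example (n/N = 0.01) and the quality control example (n/N = 0.10). The function names and the choice of "largest absolute gap" as the error measure are assumptions made for illustration.

```python
from math import comb

def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def binom_pmf(k: int, n: int, p: float) -> float:
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def max_gap(N: int, K: int, n: int) -> float:
    """Largest |hypergeometric - binomial| probability over k = 0..n."""
    p = K / N
    return max(
        abs(hypergeom_pmf(k, N, K, n) - binom_pmf(k, n, p))
        for k in range(n + 1)
    )

gap_poll = max_gap(10_000, 4_000, 100)  # n/N = 0.01, inside the rule of thumb
gap_batch = max_gap(100, 20, 10)        # n/N = 0.10, outside it

print(gap_poll, gap_batch)
```

The gap for the poll should be far smaller than the gap for the batch, consistent with the 5% guideline.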
Applications
Application Areas
- • Quality Control: Inspection of finite lots without replacement
- • Audit: Verification of documents in a finite population
- • Opinion Polls: Sampling without replacement from finite populations
- • Compliance Testing: Verification of items in lots
- • Card Games: Probabilities in games where cards are not returned
- • Process Sampling: When population is limited
When NOT to Use
⚠️ When NOT to Use the Hypergeometric Distribution
- • Sampling with replacement: Use binomial distribution
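- • Very large population: If n/N < 0.05, the hypergeometric model is still exact, but the simpler binomial approximation is usually accurate enough
- • Sample larger than population (n > N): Impossible when drawing without replacement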
- • More than two outcomes: Use multivariate hypergeometric distribution
Multivariate Extension
Multivariate Hypergeometric Distribution
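When the population contains more than two categories, the distribution extends to count how many sampled elements fall into each category at once. With c categories of sizes K1, ..., Kc (summing to N) and a sample of n elements containing k1, ..., kc from each category (summing to n):
P(X1 = k1, ..., Xc = kc) = [C(K1,k1) × C(K2,k2) × ... × C(Kc,kc)] / C(N,n)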
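In its standard form, the multivariate probability is the product of one binomial coefficient per category divided by C(N, n). The sketch below implements that form; the function name `multivariate_hypergeom_pmf` and the three-category deck (13 spades, 13 hearts, 26 other cards) are assumptions chosen for illustration.

```python
from math import comb, prod

def multivariate_hypergeom_pmf(counts: list[int], sizes: list[int]) -> float:
    """Probability of drawing counts[i] items from category i.

    sizes[i] is the number of items of category i in the population.
    The sample size is sum(counts) and the population size is sum(sizes).
    """
    if len(counts) != len(sizes):
        raise ValueError("counts and sizes must have the same length")
    if any(k < 0 or k > K for k, K in zip(counts, sizes)):
        return 0.0
    ways = prod(comb(K, k) for K, k in zip(sizes, counts))
    return ways / comb(sum(sizes), sum(counts))

# 5 cards: exactly 2 spades, 2 hearts and 1 card of another suit.
p = multivariate_hypergeom_pmf(counts=[2, 2, 1], sizes=[13, 13, 26])
print(round(p, 4))  # 0.0609

# With two categories it reduces to the ordinary hypergeometric case.
p_two = multivariate_hypergeom_pmf(counts=[2, 3], sizes=[13, 39])
print(round(p_two, 3))  # 0.274
```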
Conclusion
The hypergeometric distribution is essential for modeling situations where sampling without replacement affects probabilities. It is fundamental in quality control, audit, electoral polls, and many other areas where we work with finite populations.
Understanding the difference between hypergeometric and binomial distribution is crucial for choosing the appropriate model. When the sample is small relative to the population (n/N < 0.05), both provide similar results, but for larger samples, the difference becomes significant.
Remember: the hypergeometric distribution is appropriate when we sample without replacement from a finite population. If the population is very large relative to the sample, or sampling is with replacement, the binomial distribution is the natural choice.