Let (X_1, X_2, \ldots, X_{30}) denote a random sample of size 30 from a random variable (X) with the probability mass function (pmf): [ f(x) = \left(\frac{5}{6}\right)^{x-1} \frac{1}{6}, \quad x = 1, 2, \ldots ]
(a) Write an expression (use (\Sigma) notation) that calculates the probability (P(X_1 + \ldots + X_{30} > 170)). Then write a code evaluating this value.
(b) Find an approximation of the probability in (a) by the Central Limit Theorem and the half-unit correction.
The given pmf is for a Geometric distribution, (X \sim \text{Geom}(p=1/6)), representing the number of trials needed to get the first success.
- Mean of (X_i): (E[X_i] = \mu_X = 1/p = 1/(1/6) = 6)
- Variance of (X_i): (\text{Var}(X_i) = \sigma_X^2 = (1-p)/p^2 = (5/6)/(1/6)^2 = 30)
Let (S_n = X_1 + X_2 + \ldots + X_n). The sum of (n=30) i.i.d. Geometric(p) random variables follows a Negative Binomial distribution, where (S_{30}) represents the total number of trials required to achieve (r=30) successes. The pmf is ( P(S_{30}=k) = \binom{k-1}{r-1} p^r (1-p)^{k-r} ) for (k = r, r+1, \ldots).
Mean and Variance of (S_{30}):
- Mean: (E[S_{30}] = \mu_S = r/p = 30/(1/6) = 180)
- Variance: (\text{Var}(S_{30}) = \sigma_S^2 = r(1-p)/p^2 = 30(5/6)/(1/6)^2 = 900)
- Standard Deviation: (\sigma_S = \sqrt{900} = 30)
We need to calculate (P(S_{30} > 170)). SciPy's stats.nbinom distribution is typically parameterized by the number of successes (r) and the probability of success (p), but its random variable represents the number of failures (F) that occur before achieving (r) successes.
The relationship between the total number of trials (S_r) and the number of failures (F) is (S_r = F + r). Thus, the event (S_{30} > 170) is equivalent to (F + 30 > 170), which means (F > 170 - 30 = 140).
So, we need to calculate (P(F > 140)) where (F \sim \text{NB_failures}(r=30, p=1/6)).
Using Python's scipy.stats.nbinom.sf(f, r, p) function, which computes (P(F > f)) for the failure-parameterized Negative Binomial:
[ P(S_{30} > 170) = P(F > 140) = \text{stats.nbinom.sf}(140, 30, 1/6) \approx 0.6031603187 ]
It's important to understand the nuances of calculating probabilities, especially when using libraries like SciPy:
-
Equivalence of Approaches: The probability (P(S_{30} > 170)) can be conceptualized in multiple equivalent ways:
- Directly as (P(S_{30} > 170)).
- As the complement: (1 - P(S_{30} \le 170)).
- When translated to SciPy's
nbinom(which models failures (F)): (P(F > 140)). - The complement in terms of failures: (1 - P(F \le 140)). All these lead to the same theoretical probability.
-
SciPy's
nbinomand its Methods:scipy.stats.nbinomis the object representing the Negative Binomial distribution. To get probabilities, we use its methods:.sf(f, r, p)calculates the Survival Function, (P(F > f))..cdf(f, r, p)calculates the Cumulative Distribution Function, (P(F \le f))..pmf(f, r, p)calculates the Probability Mass Function, (P(F = f)). Our choice ofstats.nbinom.sf(140, 30, 1/6)directly computes (P(F > 140)).
-
Numerical Precision (
.sfvs.1 - .cdf): For right-tail probabilities like (P(F > f)), using the survival function (.sf()) is generally preferred over1 - .cdf(f). While mathematically equivalent,.sf()is often implemented to provide better numerical precision, especially when the tail probability is very small (though for (P(F > 140) \approx 0.603), the practical difference here might be negligible, using.sf()is good practice and more direct). -
The Constant Probability
p: The probability of success (p=1/6) is a fundamental parameter of the underlying Bernoulli trials. This value remains constant whether we are analyzing the number of failures, total trials, or using different methods of calculation for the resulting Negative Binomial distribution.
The Central Limit Theorem (CLT) states that for a sufficiently large (n) (here (n=30)), the sum (S_n) can be approximated by a Normal distribution: [ S_{30} \approx N(\mu_S, \sigma_S^2) = N(180, 900) ] To approximate (P(S_{30} > 170)), we use a half-unit correction because the Negative Binomial distribution is discrete. (P(S_{30} > 170)) is equivalent to (P(S_{30} \ge 171)). This is approximated by (P(Y > 171 - 0.5) = P(Y > 170.5)), where (Y \sim N(180, 900)).
To find this probability, we standardize (Y) to a Standard Normal Distribution (Z \sim N(0,1)): [ Z = \frac{Y - \mu_S}{\sigma_S} = \frac{Y - 180}{30} ] So, [ P(Y > 170.5) = P\left(Z > \frac{170.5 - 180}{30}\right) = P\left(Z > \frac{-9.5}{30}\right) \approx P(Z > -0.31667) ] Using the symmetry of the Normal distribution, (P(Z > -0.31667) = P(Z < 0.31667)). This is (\Phi(0.31667)), where (\Phi) is the Cumulative Distribution Function (CDF) of the Standard Normal Distribution (N(0,1)). [ \Phi(0.31667) \approx 0.6247 ]
The calculations and visualizations will be implemented in project_ec2.py.
- The exact probability will use
scipy.stats.nbinom.sf(140, 30, 1/6). - The CLT approximation will calculate (Z = (170.5 - 180)/30) and then compute
scipy.stats.norm.cdf(-Z). - Visualizations will show the PMF of (S_{30} \sim NB(30, 1/6)) (representing total trials) and the PDF of its Normal approximation (N(180, 900)), highlighting the calculated probabilities.