Abdi Moalim

Probability theory reference

Probability theory quantifies uncertainty and makes predictions under conditions of incomplete information. The theory rests on several essential axioms.

Consider a random experiment with a set of all possible outcomes Ω, called the sample space. Events are subsets of Ω and the collection of all events forms a σ-algebra F.

A probability measure P assigns each event A in F a number P(A) satisfying Kolmogorov's axioms.

$P(A) \ge 0$ for all $A \in F$

$$P(\Omega) = 1$$

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) \quad \text{for disjoint events } A_i$$

The triple $(\Omega, F, P)$ forms a probability space, which is the primary object of interest in probability theory.

The complement rule states the following.

$$P(A^c) = 1 - P(A)$$

For any event $A$ and its complement $A^c = \Omega \setminus A$.

The union of two events follows the inclusion-exclusion principle.

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

This extends to multiple events.

$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n-1} P(A_1 \cap A_2 \cap \cdots \cap A_n)$$

Conditional probability captures how the occurrence of one event affects the probability of another. The conditional probability of A given B is defined by the following relation.

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

The definition applies when P(B)>0.

Two events A and B are independent if and only if they satisfy the following.

$$P(A \cap B) = P(A)\,P(B)$$

This generalizes to multiple events. Events $A_1, A_2, \ldots, A_n$ are mutually independent if they satisfy the following.

$$P\left(\bigcap_{i \in S} A_i\right) = \prod_{i \in S} P(A_i)$$

For any subset $S$ of the indices $\{1, 2, \ldots, n\}$.

The law of total probability expresses the probability of an event as a weighted sum over a partition.

$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)$$

For any partition $B_1, B_2, \ldots, B_n$ of the sample space.

Bayes' theorem relates conditional probabilities.

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Using the law of total probability, we expand the denominator.

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{\sum_i P(B \mid A_i)\,P(A_i)}$$
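As a quick numerical sketch of these two formulas, consider a hypothetical diagnostic test; the prevalence, sensitivity, and false-positive rate below are invented for illustration.

```python
# Hypothetical diagnostic-test example: P(disease | positive) via Bayes' theorem.
p_disease = 0.01            # prior P(A): prevalence (assumed value)
p_pos_given_disease = 0.95  # P(B | A): sensitivity (assumed value)
p_pos_given_healthy = 0.10  # P(B | A^c): false-positive rate (assumed value)

# The law of total probability expands the denominator P(B).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.4f}")  # ~0.0876
```

Even with a fairly accurate test, the low prior keeps the posterior below 9%, the classic base-rate effect.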

Random variables formalize the mapping from outcomes to numerical values. A random variable X is a measurable function from Ω to the real numbers.

The cumulative distribution function (CDF) of a random variable X is defined by the following.

$$F_X(x) = P(X \le x)$$

The CDF completely characterizes the distribution of $X$ and satisfies three key properties: $F_X$ is non-decreasing; $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$; and $F_X$ is right-continuous, meaning $\lim_{y \downarrow x} F_X(y) = F_X(x)$.

Discrete random variables take countably many values. The probability mass function (PMF) is defined as follows.

$$p_X(x) = P(X = x)$$

The PMF satisfies $p_X(x) \ge 0$ and $\sum_x p_X(x) = 1$.

Continuous random variables have a probability density function (PDF) fX(x) that satisfies the following relation.

$$P(a \le X \le b) = \int_a^b f_X(x)\,dx$$

The PDF satisfies $f_X(x) \ge 0$ and $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$.

The relationship between CDF and PDF is defined by these equations.

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$$

$$f_X(x) = \frac{d}{dx} F_X(x)$$

The expected value (mean) of a random variable X is calculated differently for discrete and continuous random variables.

$$E[X] = \begin{cases} \sum_{x} x\, p_X(x) & \text{for discrete } X \\ \int_{-\infty}^{\infty} x\, f_X(x)\,dx & \text{for continuous } X \end{cases}$$

The variance measures the spread around the mean and is calculated using this formula.

$$\operatorname{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$$
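As a minimal worked example of both formulas, here is the mean and variance of a fair six-sided die computed directly from its PMF (the die is my choice of toy distribution).

```python
# Mean and variance of a fair six-sided die from its PMF.
values = [1, 2, 3, 4, 5, 6]
pmf = [1 / 6] * 6

mean = sum(x * p for x, p in zip(values, pmf))              # E[X]
second_moment = sum(x**2 * p for x, p in zip(values, pmf))  # E[X^2]
variance = second_moment - mean**2                          # E[X^2] - (E[X])^2

print(mean)      # 3.5
print(variance)  # 2.9166... (= 35/12)
```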

The moment generating function (MGF) encapsulates all moments of X.

$$M_X(t) = E[e^{tX}]$$

When it exists in a neighborhood of zero, the MGF uniquely determines the distribution. The $k$-th moment can be extracted by differentiation.

$$E[X^k] = \left.\frac{d^k}{dt^k} M_X(t)\right|_{t=0}$$
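A small symbolic sketch, assuming SymPy is available: starting from the exponential MGF $M_X(t) = \lambda/(\lambda - t)$ (valid for $t < \lambda$), differentiating and evaluating at zero recovers the moments $E[X^k] = k!/\lambda^k$.

```python
import sympy as sp

t, lam = sp.symbols("t lambda", positive=True)

# MGF of the exponential distribution Exp(lambda): M(t) = lambda / (lambda - t), t < lambda.
M = lam / (lam - t)

# k-th moment = k-th derivative of the MGF evaluated at t = 0.
for k in (1, 2, 3):
    moment = sp.simplify(sp.diff(M, t, k).subs(t, 0))
    print(k, moment)  # 1/lambda, 2/lambda**2, 6/lambda**3
```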

The characteristic function always exists and is defined by this formula.

$$\varphi_X(t) = E[e^{itX}]$$

The Bernoulli distribution models binary outcomes with parameter $p \in [0, 1]$. Its PMF is given by:

$$P(X = 1) = p, \qquad P(X = 0) = 1 - p$$

Its mean is $E[X] = p$ and its variance is $\operatorname{Var}(X) = p(1 - p)$.

The binomial distribution Bin(n,p) counts successes in n independent Bernoulli trials. Its PMF is given by:

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$

The mean is $E[X] = np$ and the variance is $\operatorname{Var}(X) = np(1 - p)$.
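A minimal sketch of the binomial PMF using only the standard library; the values $n = 10$ and $p = 0.3$ are arbitrary.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3
print(binomial_pmf(3, n, p))                             # ~0.2668
print(sum(binomial_pmf(k, n, p) for k in range(n + 1)))  # ~1.0 (PMF sums to one)
print(n * p, n * p * (1 - p))                            # mean 3.0, variance 2.1
```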

The geometric distribution models the number of independent Bernoulli($p$) trials needed to obtain the first success. Its PMF is given by:

$$P(X = k) = (1 - p)^{k - 1} p$$

The mean is $E[X] = \frac{1}{p}$ and the variance is $\operatorname{Var}(X) = \frac{1 - p}{p^2}$.

The Poisson distribution Pois(λ) models the number of events in a fixed interval. Its PMF is given by:

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

Both its mean and variance equal λ.

The uniform distribution Unif(a, b) has constant density on an interval. Its PDF is given by:

$$f_X(x) = \begin{cases} \frac{1}{b - a} & \text{if } a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$

The mean is $E[X] = \frac{a + b}{2}$ and the variance is $\operatorname{Var}(X) = \frac{(b - a)^2}{12}$.

The normal distribution $N(\mu, \sigma^2)$ has the following probability density function.

$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

The mean is $\mu$ and the variance is $\sigma^2$.

The standard normal distribution N(0,1) has a special CDF denoted Φ(x) and defined by this integral.

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt$$

For any normal random variable $X \sim N(\mu, \sigma^2)$, the standardization $Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$.
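Numerically, $\Phi$ can be evaluated through the error function, since $\Phi(x) = \frac{1}{2}\left(1 + \operatorname{erf}(x/\sqrt{2})\right)$. The sketch below standardizes an interval probability for an arbitrary choice of $\mu$ and $\sigma$.

```python
from math import erf, sqrt

def phi(x: float) -> float:
    """Standard normal CDF via the error function: Phi(x) = (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# P(a <= X <= b) for X ~ N(mu, sigma^2) by standardizing the endpoints.
mu, sigma = 100.0, 15.0   # assumed example parameters
a, b = 85.0, 130.0
prob = phi((b - mu) / sigma) - phi((a - mu) / sigma)

print(phi(1.96))  # ~0.9750
print(prob)       # P(85 <= X <= 130) ~ 0.8186
```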

The exponential distribution Exp(λ) models the time between events in a Poisson process. Its PDF is given by:

$$f_X(x) = \lambda e^{-\lambda x} \quad \text{for } x \ge 0$$

The mean is $E[X] = \frac{1}{\lambda}$ and the variance is $\operatorname{Var}(X) = \frac{1}{\lambda^2}$.

The gamma distribution Gamma(α,β) generalizes the exponential. The probability density function is given by this relation.

$$f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x} \quad \text{for } x > 0$$

The mean is $E[X] = \frac{\alpha}{\beta}$ and the variance is $\operatorname{Var}(X) = \frac{\alpha}{\beta^2}$.

Joint distributions describe the behavior of multiple random variables. For discrete random variables X and Y, the joint PMF is defined as follows.

$$p_{X,Y}(x, y) = P(X = x, Y = y)$$

For continuous random variables, the joint PDF $f_{X,Y}(x, y)$ satisfies:

$$P((X, Y) \in A) = \iint_A f_{X,Y}(x, y)\,dx\,dy$$

Marginal distributions are obtained by summing or integrating out variables.

$$p_X(x) = \sum_y p_{X,Y}(x, y)$$

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy$$

Independence of random variables X and Y is characterized by the following.

$$f_{X,Y}(x, y) = f_X(x)\,f_Y(y)$$

The covariance measures the linear relationship between random variables. It is calculated using this formula.

$$\operatorname{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\,E[Y]$$

The correlation coefficient normalizes the covariance with the following.

$$\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}$$

Values range from $-1$ to $1$: $\rho = -1$ indicates a perfect negative linear relationship, $\rho = 1$ a perfect positive linear relationship, and $\rho = 0$ means the variables are uncorrelated, which is weaker than independence.

Transformations of random variables follow specific rules. For a function g and a random variable X, the expected value is:

$$E[g(X)] = \begin{cases} \sum_{x} g(x)\, p_X(x) & \text{for discrete } X \\ \int_{-\infty}^{\infty} g(x)\, f_X(x)\,dx & \text{for continuous } X \end{cases}$$

For linear transformations Y=aX+b, the mean and variance transform according to these equations.

$$E[Y] = a\,E[X] + b$$

$$\operatorname{Var}(Y) = a^2 \operatorname{Var}(X)$$

For a strictly monotonic function $g$ with inverse $g^{-1}$, the PDF of $Y = g(X)$ is given by:

$$f_Y(y) = f_X(g^{-1}(y)) \left|\frac{d}{dy} g^{-1}(y)\right|$$

This generalizes to multivariate transformations through the Jacobian.

Sums of independent random variables exhibit special properties. If $X$ and $Y$ are independent, their means and variances both add (expectations add even without independence).

$$E[X + Y] = E[X] + E[Y]$$

$$\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$$

The MGF of their sum is the product of the individual MGFs.

$$M_{X+Y}(t) = M_X(t)\,M_Y(t)$$

The convolution relation gives the PDF of Z=X+Y through this integral.

$$f_Z(z) = \int_{-\infty}^{\infty} f_X(z - y)\,f_Y(y)\,dy$$

The central limit theorem (CLT) is one of the most important results in probability theory. Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed random variables with mean $\mu$ and variance $\sigma^2 < \infty$. The CLT states the following convergence.

$$\frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} N(0, 1)$$

The symbol $\xrightarrow{d}$ denotes convergence in distribution as $n \to \infty$.

The CLT explains the ubiquity of the normal distribution in natural phenomena that result from many small, independent effects.
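A small Monte Carlo sketch of the CLT using Uniform(0, 1) summands (the sample size and number of trials are arbitrary choices): the standardized sums should behave approximately like draws from $N(0, 1)$.

```python
import random

# Monte Carlo sketch of the CLT: standardized sums of i.i.d. Uniform(0, 1) draws.
# Uniform(0, 1) has mean mu = 1/2 and variance sigma^2 = 1/12.
mu, sigma2 = 0.5, 1 / 12
n, trials = 30, 20_000   # arbitrary choices for this illustration

random.seed(0)
standardized = []
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    standardized.append((s - n * mu) / (n * sigma2) ** 0.5)

# If the standardized sums are approximately N(0, 1), roughly 68% of them
# should fall within one standard deviation of zero.
within_one = sum(abs(z) <= 1 for z in standardized) / trials
print(within_one)  # close to 0.6827
```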

The law of large numbers (LLN) comes in two forms. The weak LLN states this convergence.

$$\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{p} \mu$$

The strong LLN makes a stronger statement.

$$P\left(\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} X_i = \mu\right) = 1$$

Both laws formalize the intuition that the sample mean converges to the expected value as the sample size increases.

Conditional expectations extend the concept of conditional probability to random variables. For discrete random variables X and Y, the conditional expectation is defined as follows.

$$E[X \mid Y = y] = \sum_x x\, p_{X \mid Y}(x \mid y)$$

For continuous random variables, the definition is an integral.

$$E[X \mid Y = y] = \int_{-\infty}^{\infty} x\, f_{X \mid Y}(x \mid y)\,dx$$

The law of total expectation states the following.

$$E[X] = E[E[X \mid Y]]$$

The conditional variance formula decomposes the variance.

$$\operatorname{Var}(X) = E[\operatorname{Var}(X \mid Y)] + \operatorname{Var}(E[X \mid Y])$$
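As a check of this decomposition, consider a two-component normal mixture, where both sides can be computed in closed form; the mixture parameters below are arbitrary.

```python
# Check Var(X) = E[Var(X | Y)] + Var(E[X | Y]) for a two-component normal mixture.
# Y ~ Bernoulli(p); X | Y=0 ~ N(mu0, s0^2), X | Y=1 ~ N(mu1, s1^2). Values are arbitrary.
p, mu0, mu1, s0, s1 = 0.3, 0.0, 4.0, 1.0, 2.0

# Within-group and between-group pieces of the decomposition.
expected_cond_var = (1 - p) * s0**2 + p * s1**2   # E[Var(X | Y)]
var_cond_mean = p * (1 - p) * (mu1 - mu0) ** 2    # Var(E[X | Y])

# Total variance computed directly from E[X^2] - (E[X])^2.
mean = (1 - p) * mu0 + p * mu1
second_moment = (1 - p) * (s0**2 + mu0**2) + p * (s1**2 + mu1**2)
total_var = second_moment - mean**2

print(expected_cond_var + var_cond_mean)  # 5.26
print(total_var)                          # 5.26
```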

Markov chains model stochastic processes with the Markov property:

The future depends on the past only through the present state.

A discrete-time Markov chain has transition probabilities defined by:

$$P(X_{n+1} = j \mid X_n = i) = p_{ij}$$

The transition matrix $P = (p_{ij})$ completely describes the chain's dynamics.

The n-step transition probabilities are given by the n-th power of P.

$$P(X_n = j \mid X_0 = i) = (P^n)_{ij}$$

For an irreducible and aperiodic Markov chain on a finite state space, a unique stationary distribution $\pi$ exists. It satisfies these equations.

$$\pi = \pi P$$

$$\sum_i \pi_i = 1$$

The chain converges to this distribution regardless of the initial state.
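A minimal sketch of finding $\pi$ by repeatedly applying $\pi \leftarrow \pi P$ (power iteration); the two-state transition matrix is an arbitrary example.

```python
# Stationary distribution of a small Markov chain by repeatedly applying pi <- pi P.
# The 2-state transition matrix below is an arbitrary example.
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]

pi = [1.0, 0.0]  # any initial distribution works for an irreducible, aperiodic chain
for _ in range(200):
    pi = [sum(pi[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]

print(pi)  # ~[0.8333, 0.1667], which satisfies pi = pi P and sums to 1
```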

Order statistics examine the distributions of ranked random variables. Given i.i.d. random variables $X_1, X_2, \ldots, X_n$ with CDF $F_X$, the $k$-th order statistic $X_{(k)}$ has the following CDF.

$$F_{X_{(k)}}(x) = \sum_{j=k}^{n} \binom{n}{j} [F_X(x)]^j [1 - F_X(x)]^{n - j}$$

The minimum $X_{(1)}$ and maximum $X_{(n)}$ have special forms.

$$F_{X_{(1)}}(x) = 1 - [1 - F_X(x)]^n$$

$$F_{X_{(n)}}(x) = [F_X(x)]^n$$
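A quick simulation check of the maximum's CDF for Uniform(0, 1) samples, where $F_{X_{(n)}}(x) = x^n$; the values of $n$ and $x$ are arbitrary.

```python
import random

# For X_i i.i.d. Uniform(0, 1), F_max(x) = x^n. Compare the formula to a simulation.
n, x, trials = 5, 0.8, 100_000   # arbitrary illustration values

random.seed(1)
hits = sum(max(random.random() for _ in range(n)) <= x for _ in range(trials))

print(hits / trials)  # empirical P(max <= x)
print(x**n)           # theoretical value: 0.8^5 = 0.32768
```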

Bayesian statistics treats unknown parameters as random variables with prior distributions. Given data X and parameter θ, Bayes' theorem gives the posterior distribution.

$$f_{\Theta \mid X}(\theta \mid x) = \frac{f_{X \mid \Theta}(x \mid \theta)\,f_\Theta(\theta)}{f_X(x)}$$

The posterior distribution $f_{\Theta \mid X}(\theta \mid x)$ updates our belief about $\theta$ after observing data $x$.

A conjugate prior yields a posterior in the same family, simplifying computation. For example, the Beta distribution is conjugate to the Bernoulli and binomial likelihoods.
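A minimal sketch of the Beta-Bernoulli update: a Beta$(a, b)$ prior combined with $k$ successes in $n$ trials gives a Beta$(a + k,\, b + n - k)$ posterior. The prior and data below are illustrative.

```python
# Beta-Bernoulli conjugate update: Beta(a, b) prior + k successes in n trials
# gives a Beta(a + k, b + n - k) posterior. The numbers below are illustrative.
a, b = 2.0, 2.0      # prior pseudo-counts (assumed prior)
k, n = 7, 10         # observed successes and trials (assumed data)

a_post, b_post = a + k, b + (n - k)
posterior_mean = a_post / (a_post + b_post)

print(a_post, b_post)  # 9.0, 5.0
print(posterior_mean)  # ~0.643, pulled between the prior mean 0.5 and the sample mean 0.7
```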

Sufficiency reduces data without losing information about parameters. A statistic T(X) is sufficient for parameter θ if the conditional distribution of X given T(X) does not depend on θ.

The Fisher-Neyman factorization theorem provides a way to identify sufficient statistics. T(X) is sufficient for θ if and only if the likelihood can be factored as follows.

$$f_X(x \mid \theta) = g(T(x), \theta)\,h(x)$$

Here $g$ may depend on the data only through $T(x)$ and on $\theta$, while $h$ depends on $x$ alone.

Maximum likelihood estimation (MLE) finds the parameter value θ that maximizes the likelihood function. The likelihood is defined by the following.

$$L(\theta \mid x) = f_X(x \mid \theta)$$

Often, we maximize the log-likelihood instead.

$$\ell(\theta \mid x) = \log L(\theta \mid x)$$

For i.i.d. observations, the log-likelihood has this form.

$$\ell(\theta \mid x_1, \ldots, x_n) = \sum_{i=1}^{n} \log f_X(x_i \mid \theta)$$
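A small sketch of MLE for the exponential rate: the log-likelihood $\ell(\lambda) = n \log \lambda - \lambda \sum_i x_i$ is maximized at $\hat{\lambda} = n / \sum_i x_i$. The simulated data and grid check below are illustrative.

```python
import math
import random

# MLE sketch for the exponential rate: lam_hat = n / sum(x) maximizes
# l(lambda) = n * log(lambda) - lambda * sum(x). Data are simulated for illustration.
random.seed(0)
true_rate = 2.0
data = [random.expovariate(true_rate) for _ in range(1_000)]

n, s = len(data), sum(data)
lam_hat = n / s  # closed-form maximizer of the log-likelihood

def log_likelihood(lam: float) -> float:
    return n * math.log(lam) - lam * s

# Crude grid check that the closed form really is the maximizer.
grid = [0.5 + 0.01 * i for i in range(400)]   # candidate rates from 0.5 to 4.49
grid_best = max(grid, key=log_likelihood)

print(lam_hat)    # close to the true rate 2.0
print(grid_best)  # agrees with lam_hat to within the grid spacing
```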

The method of moments estimates parameters by equating sample moments to theoretical moments.

$$\frac{1}{n}\sum_{i=1}^{n} X_i^k = E[X^k]$$

Hypothesis testing evaluates evidence against a null hypothesis H0 in favor of an alternative H1. The p-value is the probability of observing a test statistic at least as extreme as the one observed, assuming H0 is true.

Type I error occurs when rejecting a true H0. Type II error occurs when failing to reject a false H0. The power of a hypothesis test is the probability of correctly rejecting a false H0.

The likelihood ratio test compares the maximized likelihoods under H0 and H1. The test statistic is defined as follows.

$$\Lambda(x) = \frac{\sup_{\theta \in \Theta_0} L(\theta \mid x)}{\sup_{\theta \in \Theta} L(\theta \mid x)}$$

Under certain regularity conditions, $-2 \log \Lambda(X)$ converges in distribution to a chi-squared distribution as the sample size increases, with degrees of freedom equal to the difference in dimension between $\Theta$ and $\Theta_0$.
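A minimal sketch of a likelihood ratio test for a binomial proportion, testing $H_0\!: p = 0.5$ against an unrestricted alternative; the counts are invented, and 3.841 is the 5% critical value of the chi-squared distribution with one degree of freedom.

```python
import math

# Likelihood ratio test sketch for H0: p = 0.5 vs H1: p unrestricted,
# given k successes in n Bernoulli trials (illustrative numbers).
k, n = 62, 100
p0 = 0.5
p_hat = k / n  # MLE under the unrestricted alternative

def log_lik(p: float) -> float:
    # Binomial log-likelihood up to the constant log C(n, k), which cancels in the ratio.
    return k * math.log(p) + (n - k) * math.log(1 - p)

lrt_stat = -2 * (log_lik(p0) - log_lik(p_hat))  # -2 log Lambda

print(lrt_stat)          # ~5.8
print(lrt_stat > 3.841)  # True: reject H0 at the 5% level (chi-squared, 1 df)
```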