Abdi Moalim

Probability theory reference

Probability theory quantifies uncertainty and makes predictions under conditions of incomplete information. The theory rests on several essential axioms.

Consider a random experiment with a set of all possible outcomes Ω, called the sample space. Events are subsets of Ω and the collection of all events forms a σ-algebra F.

A probability measure P assigns each event A in F a number P(A) satisfying Kolmogorov's axioms.

$P(A) \ge 0$ for all $A \in F$

$$P(\Omega) = 1$$

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) \quad \text{for disjoint events } A_i$$

The triple $(\Omega, F, P)$ forms a probability space, which is the primary object of interest in probability theory.

The complement rule states the following.

$$P(A^c) = 1 - P(A)$$

For any event $A$ and its complement $A^c = \Omega \setminus A$.

The union of two events follows the inclusion-exclusion principle.

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

This extends to multiple events.

$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n-1} P(A_1 \cap A_2 \cap \cdots \cap A_n)$$

Conditional probability captures how the occurrence of one event affects the probability of another. The conditional probability of A given B is defined by the following relation.

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

The definition applies when P(B)>0.

Two events A and B are independent if and only if they satisfy the following.

$$P(A \cap B) = P(A)\,P(B)$$

This generalizes to multiple events. Events $A_1, A_2, \ldots, A_n$ are mutually independent if they satisfy the following.

$$P\left(\bigcap_{i \in S} A_i\right) = \prod_{i \in S} P(A_i)$$

For any subset $S$ of the indices $\{1, 2, \ldots, n\}$.

The law of total probability expresses the probability of an event as a weighted sum over a partition.

$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)$$

For any partition $B_1, B_2, \ldots, B_n$ of the sample space.

Bayes' theorem relates conditional probabilities.

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Using the law of total probability, we expand the denominator.

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{\sum_i P(B \mid A_i)\,P(A_i)}$$
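As a quick numerical sketch of these two formulas, consider a hypothetical diagnostic test; the prevalence, sensitivity, and false-positive rate below are invented for illustration.

```python
# Hypothetical diagnostic-test example: P(disease | positive) via Bayes' theorem.
p_disease = 0.01            # prior P(A): prevalence (assumed value)
p_pos_given_disease = 0.95  # P(B | A): sensitivity (assumed value)
p_pos_given_healthy = 0.10  # P(B | A^c): false-positive rate (assumed value)

# The law of total probability expands the denominator P(B).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.4f}")  # ~0.0876
```

Even with a fairly accurate test, the low prior keeps the posterior below 9%, the classic base-rate effect.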

Random variables formalize the mapping from outcomes to numerical values. A random variable X is a measurable function from Ω to the real numbers.

The cumulative distribution function (CDF) of a random variable X is defined by the following.

$$F_X(x) = P(X \le x)$$

The CDF completely characterizes the distribution of $X$ and satisfies three key properties: $F_X$ is non-decreasing; $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$; and $F_X$ is right-continuous, meaning $\lim_{y \downarrow x} F_X(y) = F_X(x)$.

Discrete random variables take countably many values. The probability mass function (PMF) is defined as follows.

$$p_X(x) = P(X = x)$$

The PMF satisfies $p_X(x) \ge 0$ and $\sum_x p_X(x) = 1$.

Continuous random variables have a probability density function (PDF) fX(x) that satisfies the following relation.

$$P(a \le X \le b) = \int_a^b f_X(x)\,dx$$

The PDF satisfies $f_X(x) \ge 0$ and $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$.

The relationship between CDF and PDF is defined by these equations.

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$$

$$f_X(x) = \frac{d}{dx} F_X(x)$$

The expected value (mean) of a random variable X is calculated differently for discrete and continuous random variables.

$$E[X] = \begin{cases} \sum_{x} x\, p_X(x) & \text{for discrete } X \\ \int_{-\infty}^{\infty} x\, f_X(x)\,dx & \text{for continuous } X \end{cases}$$

The variance measures the spread around the mean and is calculated using this formula.

$$\operatorname{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$$
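As a minimal worked example of both formulas, here is the mean and variance of a fair six-sided die computed directly from its PMF (the die is my choice of toy distribution).

```python
# Mean and variance of a fair six-sided die from its PMF.
values = [1, 2, 3, 4, 5, 6]
pmf = [1 / 6] * 6

mean = sum(x * p for x, p in zip(values, pmf))              # E[X]
second_moment = sum(x**2 * p for x, p in zip(values, pmf))  # E[X^2]
variance = second_moment - mean**2                          # E[X^2] - (E[X])^2

print(mean)      # 3.5
print(variance)  # 2.9166... (= 35/12)
```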

The moment generating function (MGF) encapsulates all moments of X.

$$M_X(t) = E[e^{tX}]$$

When it exists in a neighborhood of zero, the MGF uniquely determines the distribution. The $k$-th moment can be extracted by differentiation.

$$E[X^k] = \left.\frac{d^k}{dt^k} M_X(t)\right|_{t=0}$$
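A small symbolic sketch, assuming SymPy is available: starting from the exponential MGF $M_X(t) = \lambda/(\lambda - t)$ (valid for $t < \lambda$), differentiating and evaluating at zero recovers the moments $E[X^k] = k!/\lambda^k$.

```python
import sympy as sp

t, lam = sp.symbols("t lambda", positive=True)

# MGF of the exponential distribution Exp(lambda): M(t) = lambda / (lambda - t), t < lambda.
M = lam / (lam - t)

# k-th moment = k-th derivative of the MGF evaluated at t = 0.
for k in (1, 2, 3):
    moment = sp.simplify(sp.diff(M, t, k).subs(t, 0))
    print(k, moment)  # 1/lambda, 2/lambda**2, 6/lambda**3
```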

The characteristic function always exists and is defined by this formula.

$$\varphi_X(t) = E[e^{itX}]$$

The Bernoulli distribution models binary outcomes with parameter $p \in [0, 1]$. Its PMF is given by:

$$P(X = 1) = p, \qquad P(X = 0) = 1 - p$$

Its mean is $E[X] = p$ and its variance is $\operatorname{Var}(X) = p(1 - p)$.

The binomial distribution Bin(n,p) counts successes in n independent Bernoulli trials. Its PMF is given by:

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$

The mean is $E[X] = np$ and the variance is $\operatorname{Var}(X) = np(1 - p)$.
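A minimal sketch of the binomial PMF using only the standard library; the values $n = 10$ and $p = 0.3$ are arbitrary.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3
print(binomial_pmf(3, n, p))                             # ~0.2668
print(sum(binomial_pmf(k, n, p) for k in range(n + 1)))  # ~1.0 (PMF sums to one)
print(n * p, n * p * (1 - p))                            # mean 3.0, variance 2.1
```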

The geometric distribution models the number of independent Bernoulli($p$) trials needed to obtain the first success. Its PMF is given by:

$$P(X = k) = (1 - p)^{k - 1} p$$

The mean is $E[X] = \frac{1}{p}$ and the variance is $\operatorname{Var}(X) = \frac{1 - p}{p^2}$.

The Poisson distribution Pois(λ) models the number of events in a fixed interval. Its PMF is given by:

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

Both its mean and variance equal λ.

The uniform distribution Unif(a, b) has constant density on an interval. Its PDF is given by:

$$f_X(x) = \begin{cases} \frac{1}{b - a} & \text{if } a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$

The mean is $E[X] = \frac{a + b}{2}$ and the variance is $\operatorname{Var}(X) = \frac{(b - a)^2}{12}$.

The normal distribution $N(\mu, \sigma^2)$ has the following probability density function.

$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

The mean is $\mu$ and the variance is $\sigma^2$.

The standard normal distribution N(0,1) has a special CDF denoted Φ(x) and defined by this integral.

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt$$

For any normal random variable $X \sim N(\mu, \sigma^2)$, the standardization $Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$.
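Numerically, $\Phi$ can be evaluated through the error function, since $\Phi(x) = \frac{1}{2}\left(1 + \operatorname{erf}(x/\sqrt{2})\right)$. The sketch below standardizes an interval probability for an arbitrary choice of $\mu$ and $\sigma$.

```python
from math import erf, sqrt

def phi(x: float) -> float:
    """Standard normal CDF via the error function: Phi(x) = (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# P(a <= X <= b) for X ~ N(mu, sigma^2) by standardizing the endpoints.
mu, sigma = 100.0, 15.0   # assumed example parameters
a, b = 85.0, 130.0
prob = phi((b - mu) / sigma) - phi((a - mu) / sigma)

print(phi(1.96))  # ~0.9750
print(prob)       # P(85 <= X <= 130) ~ 0.8186
```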

The exponential distribution Exp(λ) models the time between events in a Poisson process. Its PDF is given by:

$$f_X(x) = \lambda e^{-\lambda x} \quad \text{for } x \ge 0$$

The mean is $E[X] = \frac{1}{\lambda}$ and the variance is $\operatorname{Var}(X) = \frac{1}{\lambda^2}$.

The gamma distribution Gamma(α,β) generalizes the exponential. The probability density function is given by this relation.

$$f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x} \quad \text{for } x > 0$$

The mean is $E[X] = \frac{\alpha}{\beta}$ and the variance is $\operatorname{Var}(X) = \frac{\alpha}{\beta^2}$.

Joint distributions describe the behavior of multiple random variables. For discrete random variables X and Y, the joint PMF is defined as follows.

$$p_{X,Y}(x, y) = P(X = x, Y = y)$$

For continuous random variables, the joint PDF $f_{X,Y}(x, y)$ satisfies:

$$P((X, Y) \in A) = \iint_A f_{X,Y}(x, y)\,dx\,dy$$

Marginal distributions are obtained by summing or integrating out variables.

$$p_X(x) = \sum_y p_{X,Y}(x, y)$$

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy$$

Independence of random variables X and Y is characterized by the following.

$$f_{X,Y}(x, y) = f_X(x)\,f_Y(y)$$

The covariance measures the linear relationship between random variables. It is calculated using this formula.

$$\operatorname{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\,E[Y]$$

The correlation coefficient normalizes the covariance with the following.

$$\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}$$

Values range from $-1$ to $1$: $\rho = -1$ indicates a perfect negative linear relationship, $\rho = 1$ a perfect positive linear relationship, and $\rho = 0$ means the variables are uncorrelated, which is weaker than independence.

Transformations of random variables follow specific rules. For a function g and a random variable X, the expected value is:

$$E[g(X)] = \begin{cases} \sum_{x} g(x)\, p_X(x) & \text{for discrete } X \\ \int_{-\infty}^{\infty} g(x)\, f_X(x)\,dx & \text{for continuous } X \end{cases}$$

For linear transformations Y=aX+b, the mean and variance transform according to these equations.

$$E[Y] = a\,E[X] + b$$

$$\operatorname{Var}(Y) = a^2 \operatorname{Var}(X)$$

For a strictly monotonic function $g$ with inverse $g^{-1}$, the PDF of $Y = g(X)$ is given by:

$$f_Y(y) = f_X(g^{-1}(y)) \left|\frac{d}{dy} g^{-1}(y)\right|$$

This generalizes to multivariate transformations through the Jacobian.

Sums of independent random variables exhibit special properties. If $X$ and $Y$ are independent, their means and variances both add (expectations add even without independence).

$$E[X + Y] = E[X] + E[Y]$$

$$\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$$

The MGF of their sum is the product of the individual MGFs.

$$M_{X+Y}(t) = M_X(t)\,M_Y(t)$$

The convolution relation gives the PDF of Z=X+Y through this integral.

$$f_Z(z) = \int_{-\infty}^{\infty} f_X(z - y)\,f_Y(y)\,dy$$

The central limit theorem (CLT) is one of the most important results in probability theory. Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed random variables with mean $\mu$ and variance $\sigma^2 < \infty$. The CLT states the following convergence.

$$\frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} N(0, 1)$$

The symbol $\xrightarrow{d}$ denotes convergence in distribution as $n \to \infty$.

The CLT explains the ubiquity of the normal distribution in natural phenomena that result from many small, independent effects.
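A small Monte Carlo sketch of the CLT using Uniform(0, 1) summands (the sample size and number of trials are arbitrary choices): the standardized sums should behave approximately like draws from $N(0, 1)$.

```python
import random

# Monte Carlo sketch of the CLT: standardized sums of i.i.d. Uniform(0, 1) draws.
# Uniform(0, 1) has mean mu = 1/2 and variance sigma^2 = 1/12.
mu, sigma2 = 0.5, 1 / 12
n, trials = 30, 20_000   # arbitrary choices for this illustration

random.seed(0)
standardized = []
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    standardized.append((s - n * mu) / (n * sigma2) ** 0.5)

# If the standardized sums are approximately N(0, 1), roughly 68% of them
# should fall within one standard deviation of zero.
within_one = sum(abs(z) <= 1 for z in standardized) / trials
print(within_one)  # close to 0.6827
```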

The law of large numbers (LLN) comes in two forms. The weak LLN states this convergence.

$$\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{p} \mu$$

The strong LLN makes a stronger statement.

$$P\left(\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} X_i = \mu\right) = 1$$

Both laws formalize the intuition that the sample mean converges to the expected value as the sample size increases.

Conditional expectations extend the concept of conditional probability to random variables. For discrete random variables X and Y, the conditional expectation is defined as follows.

$$E[X \mid Y = y] = \sum_x x\, p_{X \mid Y}(x \mid y)$$

For continuous random variables, the definition is an integral.

$$E[X \mid Y = y] = \int_{-\infty}^{\infty} x\, f_{X \mid Y}(x \mid y)\,dx$$

The law of total expectation states the following.

$$E[X] = E[E[X \mid Y]]$$

The conditional variance formula decomposes the variance.

$$\operatorname{Var}(X) = E[\operatorname{Var}(X \mid Y)] + \operatorname{Var}(E[X \mid Y])$$
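As a check of this decomposition, consider a two-component normal mixture, where both sides can be computed in closed form; the mixture parameters below are arbitrary.

```python
# Check Var(X) = E[Var(X | Y)] + Var(E[X | Y]) for a two-component normal mixture.
# Y ~ Bernoulli(p); X | Y=0 ~ N(mu0, s0^2), X | Y=1 ~ N(mu1, s1^2). Values are arbitrary.
p, mu0, mu1, s0, s1 = 0.3, 0.0, 4.0, 1.0, 2.0

# Within-group and between-group pieces of the decomposition.
expected_cond_var = (1 - p) * s0**2 + p * s1**2   # E[Var(X | Y)]
var_cond_mean = p * (1 - p) * (mu1 - mu0) ** 2    # Var(E[X | Y])

# Total variance computed directly from E[X^2] - (E[X])^2.
mean = (1 - p) * mu0 + p * mu1
second_moment = (1 - p) * (s0**2 + mu0**2) + p * (s1**2 + mu1**2)
total_var = second_moment - mean**2

print(expected_cond_var + var_cond_mean)  # 5.26
print(total_var)                          # 5.26
```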

Markov chains model stochastic processes with the Markov property:

The future depends on the past only through the present state.

A discrete-time Markov chain has transition probabilities defined by:

$$P(X_{n+1} = j \mid X_n = i) = p_{ij}$$

The transition matrix $P = (p_{ij})$ completely describes the chain's dynamics.

The n-step transition probabilities are given by the n-th power of P.

$$P(X_n = j \mid X_0 = i) = (P^n)_{ij}$$

For an irreducible and aperiodic Markov chain on a finite state space, a unique stationary distribution $\pi$ exists. It satisfies these equations.

$$\pi = \pi P$$

$$\sum_i \pi_i = 1$$

The chain converges to this distribution regardless of the initial state.
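A minimal sketch of finding $\pi$ by repeatedly applying $\pi \leftarrow \pi P$ (power iteration); the two-state transition matrix is an arbitrary example.

```python
# Stationary distribution of a small Markov chain by repeatedly applying pi <- pi P.
# The 2-state transition matrix below is an arbitrary example.
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]

pi = [1.0, 0.0]  # any initial distribution works for an irreducible, aperiodic chain
for _ in range(200):
    pi = [sum(pi[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]

print(pi)  # ~[0.8333, 0.1667], which satisfies pi = pi P and sums to 1
```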

Order statistics examine the distributions of ranked random variables. Given i.i.d. random variables $X_1, X_2, \ldots, X_n$ with CDF $F_X$, the $k$-th order statistic $X_{(k)}$ has the following CDF.

$$F_{X_{(k)}}(x) = \sum_{j=k}^{n} \binom{n}{j} [F_X(x)]^j [1 - F_X(x)]^{n - j}$$

The minimum $X_{(1)}$ and maximum $X_{(n)}$ have special forms.

$$F_{X_{(1)}}(x) = 1 - [1 - F_X(x)]^n$$

$$F_{X_{(n)}}(x) = [F_X(x)]^n$$
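A quick simulation check of the maximum's CDF for Uniform(0, 1) samples, where $F_{X_{(n)}}(x) = x^n$; the values of $n$ and $x$ are arbitrary.

```python
import random

# For X_i i.i.d. Uniform(0, 1), F_max(x) = x^n. Compare the formula to a simulation.
n, x, trials = 5, 0.8, 100_000   # arbitrary illustration values

random.seed(1)
hits = sum(max(random.random() for _ in range(n)) <= x for _ in range(trials))

print(hits / trials)  # empirical P(max <= x)
print(x**n)           # theoretical value: 0.8^5 = 0.32768
```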

Bayesian statistics treats unknown parameters as random variables with prior distributions. Given data X and parameter θ, Bayes' theorem gives the posterior distribution.

$$f_{\Theta \mid X}(\theta \mid x) = \frac{f_{X \mid \Theta}(x \mid \theta)\,f_\Theta(\theta)}{f_X(x)}$$

The posterior distribution $f_{\Theta \mid X}(\theta \mid x)$ updates our belief about $\theta$ after observing data $x$.

A conjugate prior yields a posterior in the same family, simplifying computation. For example, the Beta distribution is conjugate to the Bernoulli and binomial likelihoods.
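A minimal sketch of the Beta-Bernoulli update: a Beta$(a, b)$ prior combined with $k$ successes in $n$ trials gives a Beta$(a + k,\, b + n - k)$ posterior. The prior and data below are illustrative.

```python
# Beta-Bernoulli conjugate update: Beta(a, b) prior + k successes in n trials
# gives a Beta(a + k, b + n - k) posterior. The numbers below are illustrative.
a, b = 2.0, 2.0      # prior pseudo-counts (assumed prior)
k, n = 7, 10         # observed successes and trials (assumed data)

a_post, b_post = a + k, b + (n - k)
posterior_mean = a_post / (a_post + b_post)

print(a_post, b_post)  # 9.0, 5.0
print(posterior_mean)  # ~0.643, pulled between the prior mean 0.5 and the sample mean 0.7
```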

Sufficiency reduces data without losing information about parameters. A statistic T(X) is sufficient for parameter θ if the conditional distribution of X given T(X) does not depend on θ.

The Fisher-Neyman factorization theorem provides a way to identify sufficient statistics. T(X) is sufficient for θ if and only if the likelihood can be factored as follows.

$$f_X(x \mid \theta) = g(T(x), \theta)\,h(x)$$

Here $g$ may depend on the data only through $T(x)$ and on $\theta$, while $h$ depends on $x$ alone.

Maximum likelihood estimation (MLE) finds the parameter value θ that maximizes the likelihood function. The likelihood is defined by the following.

$$L(\theta \mid x) = f_X(x \mid \theta)$$

Often, we maximize the log-likelihood instead.

$$\ell(\theta \mid x) = \log L(\theta \mid x)$$

For i.i.d. observations, the log-likelihood has this form.

$$\ell(\theta \mid x_1, \ldots, x_n) = \sum_{i=1}^{n} \log f_X(x_i \mid \theta)$$
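A small sketch of MLE for the exponential rate: the log-likelihood $\ell(\lambda) = n \log \lambda - \lambda \sum_i x_i$ is maximized at $\hat{\lambda} = n / \sum_i x_i$. The simulated data and grid check below are illustrative.

```python
import math
import random

# MLE sketch for the exponential rate: lam_hat = n / sum(x) maximizes
# l(lambda) = n * log(lambda) - lambda * sum(x). Data are simulated for illustration.
random.seed(0)
true_rate = 2.0
data = [random.expovariate(true_rate) for _ in range(1_000)]

n, s = len(data), sum(data)
lam_hat = n / s  # closed-form maximizer of the log-likelihood

def log_likelihood(lam: float) -> float:
    return n * math.log(lam) - lam * s

# Crude grid check that the closed form really is the maximizer.
grid = [0.5 + 0.01 * i for i in range(400)]   # candidate rates from 0.5 to 4.49
grid_best = max(grid, key=log_likelihood)

print(lam_hat)    # close to the true rate 2.0
print(grid_best)  # agrees with lam_hat to within the grid spacing
```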

The method of moments estimates parameters by equating sample moments to theoretical moments.

$$\frac{1}{n}\sum_{i=1}^{n} X_i^k = E[X^k]$$

Hypothesis testing evaluates evidence against a null hypothesis H0 in favor of an alternative H1. The p-value is the probability of observing a test statistic at least as extreme as the one observed, assuming H0 is true.

Type I error occurs when rejecting a true H0. Type II error occurs when failing to reject a false H0. The power of a hypothesis test is the probability of correctly rejecting a false H0.

The likelihood ratio test compares the maximized likelihoods under H0 and H1. The test statistic is defined as follows.

$$\Lambda(x) = \frac{\sup_{\theta \in \Theta_0} L(\theta \mid x)}{\sup_{\theta \in \Theta} L(\theta \mid x)}$$

Under certain regularity conditions, $-2 \log \Lambda(X)$ converges in distribution to a chi-squared distribution as the sample size increases, with degrees of freedom equal to the difference in dimension between $\Theta$ and $\Theta_0$.
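A minimal sketch of a likelihood ratio test for a binomial proportion, testing $H_0\!: p = 0.5$ against an unrestricted alternative; the counts are invented, and 3.841 is the 5% critical value of the chi-squared distribution with one degree of freedom.

```python
import math

# Likelihood ratio test sketch for H0: p = 0.5 vs H1: p unrestricted,
# given k successes in n Bernoulli trials (illustrative numbers).
k, n = 62, 100
p0 = 0.5
p_hat = k / n  # MLE under the unrestricted alternative

def log_lik(p: float) -> float:
    # Binomial log-likelihood up to the constant log C(n, k), which cancels in the ratio.
    return k * math.log(p) + (n - k) * math.log(1 - p)

lrt_stat = -2 * (log_lik(p0) - log_lik(p_hat))  # -2 log Lambda

print(lrt_stat)          # ~5.8
print(lrt_stat > 3.841)  # True: reject H0 at the 5% level (chi-squared, 1 df)
```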