Tutorial 1: Introduction to Advanced Probability for Graphical Models
Overview
- Basics
- Probability rules
- Exponential family models
- Maximum likelihood
- Conjugate Bayesian inference (time permitting)
Notation
A random variable X represents outcomes or states of the world. In this class, we will write p(x) to mean p(X = x), the probability of the random variable X taking on state x.
Tip
See here for a helpful list of notational norms in probability.
The sample space is the space of all possible outcomes, which may be discrete, continuous, or mixed.
p(x) is the probability mass (or density) function (PMF/PDF); it assigns a non-negative number to each point in the sample space, and must sum (or integrate) to 1. Intuitively, we can understand the PMF/PDF at x as representing how often x occurs, or how much we believe in x.
Note
There is, however, no requirement that the PMF/PDF cannot take values greater than 1. A commonly cited and intuitive example is the uniform distribution on the interval [0, \frac{1}{2}]. While the value of the pdf f_X(x) is 2 for 0 \le x \le \frac{1}{2}, the area under the graph of f_X(x) is rectangular, and therefore equal to base \times height = \frac{1}{2} \times 2 = 1.
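A minimal sketch of this example, assuming scipy is available:

```python
from scipy import stats

# Uniform distribution on [0, 1/2]: loc is the left endpoint, scale the width
u = stats.uniform(loc=0.0, scale=0.5)

print(u.pdf(0.25))              # 2.0 -- the density exceeds 1 on the interval
print(u.cdf(0.5) - u.cdf(0.0))  # 1.0 -- but the total probability is still 1
```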
Probability Distributions
1. Joint Probability Distribution
The joint probability distribution for random variables X, Y is a probability distribution that gives the probability that each of X, Y falls in any particular range or discrete set of values specified for that variable:

p(x, y) = p(X = x, Y = y)

which is read as "the probability of X taking on x and Y taking on y".
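As a concrete sketch (hypothetical numbers, assuming numpy), a joint PMF over two discrete variables is just a table of non-negative entries that sum to 1; the same table is reused in the snippets below:

```python
import numpy as np

# Hypothetical joint PMF p(x, y): rows index x in {0, 1}, columns index y in {0, 1, 2}
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])

print(p_xy.sum())  # 1.0 -- a valid joint distribution
print(p_xy[0, 2])  # 0.1 -- p(X=0, Y=2)
```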
2. Conditional Probability Distribution
The conditional probability distribution of Y given X is the probability distribution of Y when X is known to be a particular value:

p(y \mid x) = \frac{p(x, y)}{p(x)}

which is read as "the probability of Y taking on y given that X is x".
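Continuing the hypothetical table from above (repeated so the snippet runs on its own), conditioning on X = x means taking row x of the joint and renormalizing by p(x):

```python
import numpy as np

p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])

x = 0
p_y_given_x = p_xy[x] / p_xy[x].sum()  # p(y | x) = p(x, y) / p(x)
print(p_y_given_x)                     # [0.25 0.5  0.25]
print(p_y_given_x.sum())               # 1.0 -- conditionals also normalize
```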
3. Marginal Probability Distribution
The marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset.

p(x) = \sum_y p(x, y) \quad \text{if } X, Y \text{ are discrete}

p(x) = \int p(x, y) \, dy \quad \text{if } X, Y \text{ are continuous}

which is read as "the probability of X taking on x" or "the probability of Y taking on y".
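With the same hypothetical table, marginalizing is just summing out the other variable along the corresponding axis:

```python
import numpy as np

p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])

p_x = p_xy.sum(axis=1)  # p(x) = sum_y p(x, y)
p_y = p_xy.sum(axis=0)  # p(y) = sum_x p(x, y)
print(p_x)              # [0.4 0.6]
print(p_y)              # [0.35 0.35 0.3 ]
```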
Probability Rules
Some important rules in probability include
1. Rule of Sum (marginalization)
The rule of sum gives the situations in which the probability of a union of events can be calculated by summing probabilities together. It is often used on mutually exclusive events, meaning events that cannot both happen at the same time. In its marginalization form,

p(x) = \sum_y p(x, y)
2. Chain Rule
The chain rule permits the calculation of any member of the joint distribution of a set of random variables using only conditional probabilities:

p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})

The rule is useful in the study of Bayesian networks, which describe a probability distribution in terms of conditional probabilities.
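A small numerical sketch (assuming numpy, with a randomly generated three-variable joint) verifying the factorization:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint p(x, y, z) over three binary variables
p = rng.random((2, 2, 2))
p /= p.sum()

x, y, z = 1, 0, 1
p_x = p[x].sum()                           # p(x) = sum_{y,z} p(x, y, z)
p_y_given_x = p[x, y].sum() / p_x          # p(y | x) = p(x, y) / p(x)
p_z_given_xy = p[x, y, z] / p[x, y].sum()  # p(z | x, y) = p(x, y, z) / p(x, y)

# Chain rule: p(x, y, z) = p(x) p(y | x) p(z | x, y)
print(np.isclose(p[x, y, z], p_x * p_y_given_x * p_z_given_xy))  # True
```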
3. Bayes' Rule
Bayes' theorem is a formula that describes how to update the probabilities of hypotheses when given evidence. It follows simply from the axioms of conditional probability, but can be used to powerfully reason about a wide range of problems involving belief updates.

p(y \mid x) = \frac{p(x \mid y) \, p(y)}{p(x)}

which is read as "the posterior is equal to the likelihood times the prior divided by the evidence".
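Reusing the hypothetical table from the probability distributions section, a sketch confirming that Bayes' rule recovers the same posterior as direct conditioning:

```python
import numpy as np

p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])

x, y = 0, 1
prior = p_xy.sum(axis=0)[y]      # p(y)
likelihood = p_xy[x, y] / prior  # p(x | y)
evidence = p_xy.sum(axis=1)[x]   # p(x)

posterior = likelihood * prior / evidence  # Bayes' rule
direct = p_xy[x, y] / p_xy[x].sum()        # p(y | x) by direct conditioning
print(np.isclose(posterior, direct))       # True
```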
Warning
Skipped some slides here. Come back and finish them.
Exponential Family
An exponential family is a set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, based on some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural sets of distributions to consider.
Most of the commonly used distributions are in the exponential family, including
- Bernoulli
- Binomial/multinomial
- Normal (Gaussian)
def. Exponential family: The exponential family of distributions over x, given parameter \eta (eta), is the set of distributions of the form

p(x \mid \eta) = h(x) \, g(\eta) \exp\{\eta^\top T(x)\}

where
- x is a scalar or a vector and is either continuous or discrete
- \eta is the natural (or canonical) parameter, which may be a vector
- T(x) is a vector of sufficient statistics
- h(x) is the scaling constant or base measure
- g(\eta) is the normalizing constant that guarantees the sum/integral equals 1.
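A minimal sketch of this definition (assuming numpy; exp_family_pdf and the lambdas below are illustrative names, not a library API):

```python
import numpy as np

def exp_family_pdf(x, eta, h, g, T):
    """Evaluate p(x | eta) = h(x) * g(eta) * exp(eta . T(x))."""
    return h(x) * g(eta) * np.exp(np.dot(eta, T(x)))

# Bernoulli as a one-parameter exponential family (derived in the next section):
mu = 0.3
eta = np.log(mu / (1 - mu))
sigmoid = lambda a: 1 / (1 + np.exp(-a))

p1 = exp_family_pdf(1, np.array([eta]),
                    h=lambda x: 1.0,
                    g=lambda e: sigmoid(-e[0]),
                    T=lambda x: np.array([x]))
print(np.isclose(p1, mu))  # True: p(x=1 | eta) = mu
```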
Let's show some examples of rearranging distributions into their exponential family forms.
Bernoulli
The Bernoulli distribution is given by

\text{Bern}(x \mid \mu) = \mu^x (1 - \mu)^{1 - x}

re-arranging into exponential family form, we get

\text{Bern}(x \mid \mu) = \exp\{x \ln \mu + (1 - x) \ln(1 - \mu)\} = (1 - \mu) \exp\{\ln(\frac{\mu}{1 - \mu}) x\}
from here, it is clear that
- \eta = \ln(\frac{\mu}{1 - \mu})
- T(x) = x
- h(x) = 1
noting that \eta = \ln(\frac{\mu}{1 - \mu}) can be inverted to give \mu = \sigma(\eta), where \sigma(\eta) = \frac{1}{1 + \exp(-\eta)} is the logistic sigmoid,
we can see that g(\eta) = (1 - \mu) = \sigma(-\eta).
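A quick numerical check of this rearrangement (a sketch assuming numpy):

```python
import numpy as np

mu = 0.3
eta = np.log(mu / (1 - mu))  # natural parameter
sigmoid = lambda a: 1 / (1 + np.exp(-a))

for x in (0, 1):
    standard = mu**x * (1 - mu)**(1 - x)        # Bern(x | mu)
    exp_form = sigmoid(-eta) * np.exp(eta * x)  # g(eta) * exp(eta * T(x)), h(x) = 1
    print(x, np.isclose(standard, exp_form))    # True for both values of x
```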
Multinomial
The multinomial distribution (for a single observation x, encoded as a one-hot vector over M states) is given by

p(x \mid \mu) = \prod_{k=1}^{M} \mu_k^{x_k}

re-arranging into exponential family form, we get

p(x \mid \mu) = \exp\{\sum_{k=1}^{M} x_k \ln \mu_k\}
from here, it is clear that
- \eta = \begin{bmatrix}\ln(\mu_1) & \cdots & \ln(\mu_M)\end{bmatrix}^\top
- T(x) = x
- h(x) = 1
- g(\eta) = 1 (normalization is already ensured by \sum_{k=1}^{M} \mu_k = 1)
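A quick check (assuming numpy, with hypothetical probabilities and a single one-hot observation):

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.3])  # event probabilities, sum to 1
eta = np.log(mu)                # natural parameters

x = np.array([0, 1, 0])         # one-hot observation (category 2 of 3)
standard = np.prod(mu**x)       # prod_k mu_k^{x_k}
exp_form = np.exp(eta @ x)      # h(x) = 1, g(eta) = 1
print(np.isclose(standard, exp_form))  # True
```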
Gaussian
The univariate normal or Gaussian distribution is given by

\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\{-\frac{1}{2\sigma^2}(x - \mu)^2\}

expanding the square in the exponent gives \exp\{-\frac{x^2}{2\sigma^2} + \frac{\mu}{\sigma^2} x - \frac{\mu^2}{2\sigma^2}\}, and from here, it is clear that
- \eta = \begin{bmatrix}\frac{\mu}{\sigma^2} \\ -\frac{1}{2\sigma^2}\end{bmatrix}
- T(x) = \begin{bmatrix}x \\ x^2\end{bmatrix}
re-writing in terms of \eta, we get

p(x \mid \eta) = (2\pi)^{-\frac{1}{2}} (-2\eta_2)^{\frac{1}{2}} \exp\{\frac{\eta_1^2}{4\eta_2}\} \exp\{\eta_1 x + \eta_2 x^2\}

noting that
- h(x) = (2\pi)^{-\frac{1}{2}}
- g(\eta) = (-2\eta_2)^{\frac{1}{2}} \exp\{\frac{\eta_1^2}{4\eta_2}\}
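A quick check against scipy's Gaussian density (a sketch, assuming numpy and scipy):

```python
import numpy as np
from scipy import stats

mu, sigma2 = 1.5, 0.8
eta = np.array([mu / sigma2, -1 / (2 * sigma2)])  # natural parameters

x = 0.7
h = (2 * np.pi) ** (-0.5)                                     # base measure
g = (-2 * eta[1]) ** 0.5 * np.exp(eta[0]**2 / (4 * eta[1]))   # normalizer
T = np.array([x, x**2])                                       # sufficient statistics

exp_form = h * g * np.exp(eta @ T)
print(np.isclose(exp_form, stats.norm(mu, np.sqrt(sigma2)).pdf(x)))  # True
```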
Tip
Section 9.2 of K. Murphy, Machine Learning: A Probabilistic Perspective fleshes out these examples in more detail.