Week 2: Introduction to Probabilistic Models¶
Assigned Reading¶
- Murphy: Chapters 3, 4, 7-9 (excluding * sections)
- Chapter 3 of David MacKay's textbook
Overview¶
- Overview of probabilistic models
- Sufficient statistics
- Likelihood
- Maximum likelihood estimation (MLE)
- Classification
Overview of probabilistic models¶
In general, we have variables that can be observed or unobserved (latent variables are, by definition, always unobserved!). The job of a probabilistic model is to relate these variables (observed or unobserved). More specifically, a probabilistic model learns a joint probability distribution over the variables, e.g. p(x_1, x_2, ..., x_N). Because these distributions are parameterized, learning is essentially joint density estimation.
In a generative model, we assume our data is generated by some distribution & then try to learn that distribution. The joint distribution (or joint density function) is the central object we use to define our model. From this perspective, we can think about the basic tasks we care about in machine learning (ML) as operations on joint distributions. One such task is classification:
- Model: p(X, C)
- Task: p(C=c | x) = \frac{p(c, x)}{p(x)}
where X are our inputs, C our classes, and p(x) is the probability of our data (sometimes called the evidence). p(x) can be re-written as the marginal probability
\[p(x) = \sum_c p(x, C=c)\]
What happens if c is never observed? Then we call this clustering. Clustering allows us to compute p(C=c|x) ("the probability that the input belongs to some cluster") even if c is unobserved.
If our inputs and targets are continuous (rather than discrete classes), we call this regression.
Note
The sum over our target/classes in the marginal distribution p(x) is replaced by an integral.
In general,
- if a variable is always observed, we may not want to model its density (regression / classification)
- if a variable is never observed (always unobserved) then we call it a hidden or latent variable and we may want to model its density (clustering, density estimation)
In fact, we can mostly classify (no pun intended) the problems we care about into four types:
- Classification: p(c | x) = \frac{p(c, x)}{p(x)} = \frac{p(c, x)}{\sum_C p(c, x)}
- Clustering: p(c | x) = \frac{p(c, x)}{p(x)} \ ; \ c \text{ is unobserved}
- Regression: p(y | x) = \frac{p(y, x)}{p(x)} = \frac{p(y, x)}{\int_Y p(x, y)dy}
- Density Estimation: p(y | x) = \frac{p(y, x)}{p(x)} \ ; \ y \text{ is unobserved}
Operations on Probabilistic Models¶
The fundamental operations we will perform on a probabilistic model are:
- Generate Data: For this you need to know how to sample from local models (directed) or how to do Gibbs or other sampling (undirected).
- Compute probabilities: When all nodes are either observed or marginalized the result is a single number which is the probability of the configuration.
- Inference: Compute expectations of some things given others which are observed or marginalized.
- Learning: Set the parameters of the joint distribution given some (partially) observed data to maximize the probability of seeing the data.
Tip
I believe a "configuration" refers to an assignment of specific values to all the variables of our model, so the probability of a configuration is the probability that the variables jointly take on those particular values.
Goals of a Probabilistic Model¶
We want to build prediction systems automatically based on data, and as little as possible on expert information. In this course, we’ll use probability to combine evidence from data and make predictions. We’ll use graphical models as a visual shorthand language to express and reason about families of model assumptions, structures, dependencies and information flow, without specifying exact distributional forms or parameters. In this case learning means setting parameters of distributions given a model structure.
Note
"Structure learning" is also possible but we won’t consider it now.
More specifically, we want two things of our probabilistic model:
- Compact representation: we don't want the number of parameters to scale poorly with the dimensionality of the data.
- Efficient computation: we need to be able to compute marginal and conditional probabilities from the joint distribution efficiently.
Multiple observations, Complete IID data¶
A single observation of the data, X_i, is rarely useful on its own. Generally we have data comprising many observations, which form a set of random variables: \mathcal D = \{X_1, ..., X_n\}
To achieve our above-listed goals, we will make assumptions. Often, we assume the following:
1. Observations are independent and identically distributed (i.i.d.) according to the joint distribution of the model.
- This reduces the computation of the joint probability of the data to a product over the individual probabilities of the observations (written out explicitly just after this list).
- We don't always assume full independence. Sometimes, we assume only some level of independence between variables (e.g. variable 3 depends on variable 1 but not on variable 2).
2. We observe all random variables in the domain on each observation (i.e. complete data, or fully observed model).
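For concreteness (using the notation that appears later in these notes), under assumption 1 the joint probability of the whole dataset factorizes as
\[p(\mathcal D \mid \theta) = p(x^{(1)}, \ldots, x^{(N)} \mid \theta) = \prod_{m=1}^{N} p(x^{(m)} \mid \theta)\]
which is exactly the per-observation product referred to above.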
Important
We typically shade the nodes in a probabilistic graphical model to indicate they are observed. (Later we will work with unshaded nodes corresponding to missing data or latent variables.)
Learning Parameters for a Distribution¶
Let's take an example with two discrete random variables.
\[ T: \text{Temperature} \; ; \; h = \text{"hot" or } c = \text{"cold"} \] \[ W: \text{Weather} \; ; \; s = \text{"sunny" or } r = \text{"raining"} \]
We know that
\[ P(T=h) = 0.40 \] \[ P(T=c) = 0.60 \]
\[ P(W=s) = 0.70 \] \[ P(W=r) = 0.30 \]
and that these states define valid probability distributions, so
\[P(T=h) + P(T=c) = 1\] \[P(W=s) + P(W=r) = 1\]
We could create a parameterized probabilistic model P(T, W) over these states
\[P(T | \theta_T) \ ; \ \theta_T = \begin{bmatrix} 0.4 \\ 0.6 \end{bmatrix}\] \[P(W | \theta_W) \ ;\ \theta_W = \begin{bmatrix} 0.7 \\ 0.3 \end{bmatrix}\]
Notice that \theta_T and \theta_W are just the probability distributions of our random variables stored as vectors: the parameters specify the probability of each state explicitly.
We can represent the joint distribution P(T, W), our model, as:
| T | W | P |
|---|---|---|
| h | s | 0.28 |
| c | r | 0.18 |
| h | r | 0.12 |
| c | s | 0.42 |
From the joint distribution (which is, again, essentially our model) we can compute the marginals, e.g.
\[P(T=h) = \sum_w P(T=h, W=w) = 0.28 + 0.12 = 0.40\]
We could also ask questions about conditional probabilities, like
\[P(W=s \mid T=h) = \frac{P(T=h, W=s)}{P(T=h)} = \frac{0.28}{0.40} = 0.70\]
Why did we do this?¶
The whole point of the above example was to show that from a probabilistic model, which itself is just a joint distribution represented as a matrix (or tensor), we can compute both the marginal and conditional probabilities. This will allow us to compute probabilities, generate data and perform inference.
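To make this concrete, here is a minimal sketch (in Python/NumPy; the array encoding and variable names are our own, not from the lecture) of how the joint P(T, W) from the table above supports the operations listed earlier:

```python
import numpy as np

# Rows index T (0 = hot, 1 = cold); columns index W (0 = sunny, 1 = raining).
theta_T = np.array([0.4, 0.6])      # parameters of P(T)
theta_W = np.array([0.7, 0.3])      # parameters of P(W)

# Joint distribution P(T, W). Since T and W were assumed independent,
# the joint is just the outer product of the two marginals.
joint = np.outer(theta_T, theta_W)  # [[0.28, 0.12], [0.42, 0.18]]

# Compute probabilities / marginals: sum out the other variable.
p_T = joint.sum(axis=1)             # [0.4, 0.6]
p_W = joint.sum(axis=0)             # [0.7, 0.3]

# Inference (conditioning): P(W | T = hot) = P(T = hot, W) / P(T = hot)
p_W_given_hot = joint[0] / p_T[0]   # [0.7, 0.3]

# Generate data: sample a (T, W) pair from the joint.
rng = np.random.default_rng(0)
idx = rng.choice(joint.size, p=joint.ravel())
t, w = np.unravel_index(idx, joint.shape)
```

Because the model is just a table of numbers, computing probabilities, inference and data generation all reduce to indexing, summing and sampling from this array.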
Joint Dimensionality¶
Let's take our previous example and expand on it. First, it is helpful to think of the joint distribution over a set of random variables as a grid with k^n cells, where n is the number of variables and k the number of states each variable can take.
For our running example, this means our joint distribution is parameterized by a 2^2 = 4 dimensional vector containing the probabilities of seeing each pair of states. We could, of course, add more random variables to our model. Imagine we add B, for whether or not we bike into work, and H, for overall health, each with two states. The dimensionality of our parameters (for the joint distribution over T, W, B, H) then becomes 2^4 = 16.
It is important to note that our joint distribution will be computed based on the assumptions we make about independence between variables. For example, we could assume that while T and W are independent of one another, H is dependent on T and W as well as B. From the chain rule, we get
\[P(T, W, B, H) = P(T)\,P(W \mid T)\,P(B \mid T, W)\,P(H \mid T, W, B) = P(T)\,P(W)\,P(B \mid T, W)\,P(H \mid T, W, B)\]
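To see why this matters for the "compact representation" goal above, here is a tiny sketch (plain Python, nothing model-specific) of how quickly the full joint table grows with the number of variables:

```python
# Size of the full joint distribution table: k ** n entries,
# where k is the number of states per variable and n the number of variables.
k = 2
for n in (2, 4, 10, 30):
    print(f"{n} binary variables -> {k ** n} joint probabilities")
# 2 binary variables -> 4 joint probabilities
# 4 binary variables -> 16 joint probabilities
# 10 binary variables -> 1024 joint probabilities
# 30 binary variables -> 1073741824 joint probabilities
```

Factorizations like the one above, justified by independence assumptions, are what keep the number of parameters manageable.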
Likelihood function¶
So far, we have focused on the probability function p(x|\theta) which assigns a probability (density) to any joint configuration of variables x given fixed parameters \theta. But our goal is to learn \theta, which we do not start with and which is not fixed.
This is the opposite of how we want to think. Really, we have some fixed data and we want to find parameters \theta which maximize the likelihood of that data.
Note
We are asking "given x, how do I choose \theta?".
To do this, we define some function of \theta for a fixed x,
\[\ell(\theta ; x) = \log p(x \mid \theta)\]
which we call the log likelihood function.
Note
The likelihood function is essentially a notational trick: it lets us treat the probability of our (fixed) data as a function of our parameters.
The process of learning is choosing \theta to optimize some cost or objective function L(\theta) which includes \ell (\theta). This can be done in a couple of ways, including:
- Maximum likelihood estimation (MLE): L(\theta) = \ell (\theta; \mathcal D)
- Maximum a posteriori (MAP): L(\theta) = \ell (\theta; \mathcal D) + r(\theta), where r(\theta) is a regularization term
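Written explicitly as maximization problems (interpreting r(\theta), for example, as a log-prior \log p(\theta)):
\[\theta^{\star}_{MLE} = \underset{\theta}{\arg\max} \ \ell(\theta; \mathcal D)\] \[\theta^{\star}_{MAP} = \underset{\theta}{\arg\max} \ \big[\ell(\theta; \mathcal D) + r(\theta)\big]\]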
Maximum likelihood estimation¶
The basic idea behind maximum likelihood estimation (MLE) is to pick the values of our parameters that were most likely to have generated the data we saw.
Note
MLE is commonly used in statistics, and often leads to "intuitive", "appealing" or "natural" estimators.
For IID data
\[p(\mathcal D | \theta) = \prod_m p(x^{(m)} | \theta)\] \[\ell (\theta ; D) = \sum_m \log p(x^{(m)} | \theta)\]
The IID assumption turns the log likelihood into a sum, making the derivative easy to compute term by term.
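Concretely, because differentiation is linear, the gradient of the log likelihood also decomposes into a per-observation sum:
\[\frac{\partial}{\partial \theta} \ell(\theta ; \mathcal D) = \sum_m \frac{\partial}{\partial \theta} \log p(x^{(m)} \mid \theta)\]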
Note
The negative log likelihood, NLL(\theta ; D), simply introduces a negative sign so our optimization problem becomes a minimization, that is, maximizing \ell (\theta ; D) is equivalent to minimizing NLL(\theta ; D).
Sufficient statistics¶
A statistic is a (possibly vector-valued) deterministic function of a (set of) random variable(s). A sufficient statistic is a statistic that conveys exactly the same information about the data-generating process as the entire dataset itself. In other words, once we know the sufficient statistic T(x), our inferences are the same as those we would obtain from the entire dataset. More formally, we say that T(X) is a sufficient statistic for X if
\[T(x^{(1)}) = T(x^{(2)}) \implies L(\theta ; x^{(1)}) = L(\theta ; x^{(2)}) \ , \ \forall \theta\]
Put another way,
\[P(\theta \mid T(x)) = P(\theta \mid x)\]
Note
Why is this useful? Well, if we have a particularly large data sample, a lot of the data may be redundant. If we knew the sufficient statistic for that sample, we could use it in place of the full data sample.
Equivalently (by the Neyman factorization theorem) we can write
\[p(x \mid \theta) = h(x, T(x)) \, g(T(x), \theta)\]
An example is the exponential family,
\[p(x \mid \theta) = h(x) \exp\{\eta(\theta)^T T(x) - A(\theta)\}\]
or, equivalently,
\[p(x \mid \theta) = h(x) \, g(\theta) \exp\{\eta(\theta)^T T(x)\} \ , \quad \text{where } g(\theta) = \exp\{-A(\theta)\}\]
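For instance, the Bernoulli distribution used in the next example can be written in this form (a standard manipulation, shown here for illustration):
\[p(x \mid \theta) = \theta^{x}(1-\theta)^{1-x} = \exp\left\{x \log\frac{\theta}{1-\theta} + \log(1-\theta)\right\}\]
so that h(x) = 1, the natural parameter is \eta(\theta) = \log\frac{\theta}{1-\theta}, the sufficient statistic is T(x) = x, and A(\theta) = -\log(1-\theta).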
Sufficient statistics example: Bernoulli Trials¶
Let us take the example of flipping a coin with unknown bias. The process that generates our data can be modeled as a Bernoulli distribution,
\[p(x_i \mid \theta) = \theta^{x_i}(1-\theta)^{1-x_i}\]
where X is a random variable and x_i represents the result of the ith coin flip
\[x_i = 0 \text{ , if tails}\] \[x_i = 1 \text{ , if heads}\]
The likelihood (assuming independence between flips of the coin) is
\[L = \prod_{i=1}^N \theta^{x_i}(1-\theta)^{1-x_i}\] \[= \theta^{\sum_{i=1}^N x_i}(1-\theta)^{N-\sum_{i=1}^N x_i}\]
So we notice here that our likelihood depends on the data only through \sum_{i=1}^N x_i. In other words, our data enters the likelihood only in this particular form. This tells us that if we know this summary statistic, which we will call T(x) = \sum_{i=1}^N x_i, then we essentially know everything from our sample that is useful for inference.
To perform inference with T(x), we define the log likelihood
\[\ell(\theta ; \mathcal D) = \log L = T(x) \log\theta + (N - T(x))\log(1-\theta)\]
then we take the derivative with respect to \theta and set it to 0 to find the maximum
\[\frac{\partial \ell}{\partial \theta} = \frac{T(x)}{\theta} - \frac{N - T(x)}{1-\theta} = 0 \quad \Longrightarrow \quad \theta^{\star}_{MLE} = \frac{T(x)}{N} = \frac{1}{N}\sum_{i=1}^N x_i\]
This is our maximum likelihood estimate of the parameter \theta, denoted \theta^{\star}_{MLE}.
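As a quick numerical sanity check of this result, here is a sketch with simulated coin flips (theta_true and the sample size are arbitrary choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(42)
theta_true = 0.3
x = rng.binomial(n=1, p=theta_true, size=10_000)  # simulated Bernoulli trials

# Closed-form MLE derived above: theta* = T(x) / N = mean of the data.
theta_mle = x.mean()

# Brute-force check: evaluate the log likelihood on a grid of theta values
# and confirm its maximizer agrees with the closed-form estimate.
thetas = np.linspace(0.01, 0.99, 999)
log_lik = x.sum() * np.log(thetas) + (len(x) - x.sum()) * np.log(1 - thetas)
theta_grid = thetas[np.argmax(log_lik)]

print(theta_mle, theta_grid)  # both should be close to theta_true
```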
Note
See Lecture 2 slides 10-13 for more examples.
Summary of Probabilistic Models¶
In general, how we learn the parameters of a probabilistic model depends on whether our variables are fully or partially observed, and continuous or discrete:
| | Continuous | Discrete |
|---|---|---|
| Fully observed variables | Bespoke estimates from calculus | Normalized counts |
| Partially observed variables | Variational inference, recognition networks, MCMC | Message passing, variable elimination, junction tree |
Appendix¶
Useful Resources¶
- Helpful video on sufficient statistics.