
Week 5: Exact Inference

Assigned Reading

  • Murphy: Chapter 20
  • MacKay: Chapter 21.1 (worked example with numbers)
  • MacKay: Chapter 16 (Message Passing, including soldier intuition)
  • MacKay: Chapter 26 (Exact Inference on Factor Graphs)

Overview

  • Variable elimination (VE)
  • Complexity of VE

Inference in PGM

In this lecture, we will explore inference in probabilistic graphical models (PGM).

Let

X_F = \text{the unobserved variables we want to infer (the query)} \\ X_E = \text{the observed evidence} \\ X_R = X - \{X_F, X_E\}

where X_R is the set of random variables in our model that are neither part of the query nor the evidence. We will focus on computing the conditional probability distribution

p(X_F | X_E) = \frac{p(X_F, X_E)}{p(X_E)}

Each of the distributions we need can be computed by marginalizing over the other variables:

p(X_F, X_E) = \sum_{X_R}p(X_F, X_E, X_R) \\ p(X_E) = \sum_{X_F, X_R}p(X_F, X_E, X_R)

However, naively marginalizing over all unobserved variables requires a number of computations exponential in the number of random variables, N, in our model.
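To make that cost concrete, here is a minimal sketch (not from the lecture) of the naive approach on a toy model with three binary variables: the full joint table is stored explicitly, the evidence is fixed, and the remaining variables are summed out. The joint entries are random placeholders.

```python
import numpy as np

# Minimal sketch of naive exact inference on a toy joint table.
# Assume three binary variables F (query), E (evidence), R (the rest),
# and a hypothetical full joint p(F, E, R) stored as a 2x2x2 array
# with axes ordered (F, E, R).
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))
joint /= joint.sum()                # normalize into a valid distribution

e_obs = 1                           # the observed evidence X_E = 1

# p(F, E = e_obs) = sum_R p(F, E = e_obs, R)
p_F_E = joint[:, e_obs, :].sum(axis=-1)

# p(E = e_obs) = sum_{F, R} p(F, E = e_obs, R)
p_E = p_F_E.sum()

# p(F | E = e_obs) = p(F, E = e_obs) / p(E = e_obs)
p_F_given_E = p_F_E / p_E
print(p_F_given_E)                  # sums to 1

# The catch: building and summing the full joint costs O(k^n) time and
# memory, which is exactly what variable elimination avoids.
```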

Variable elimination

Variable elimination is a simple and general exact inference algorithm in any probabilistic graphical model.

The running time will depend on the graph structure of the model, but by using dynamic programming we can avoid enumerating all assignments.

Simple Example: Chain

Let's start with the example of a simple chain

A \rightarrow B \rightarrow C \rightarrow D

where we want to compute P(D). We have

X_F = \{D\}, \ X_E = \{\}, \ X_R = \{A, B, C\}

and

\begin{align} P(X_F) &= \sum_{X_R} p(X_F, X_R) \\ \Rightarrow P(D) &= \sum_{A, B, C}p(A, B, C, D) \\ & = \sum_C \sum_B \sum_A p(A)p(B | A) p(C | B) p(D | C) \end{align}

Clearly, this is exponential in the number of variables, \mathcal O(k^n), where k is the number of states each variable can take. But by reordering the sums in the joint distribution

P(D) = \sum_C p(D | C) \sum_B p(C | B) \sum_A p(A)p(B | A)

we can begin to simplify the summation

\begin{align} P(D) &= \sum_C p(D | C) \sum_B p(C | B) \sum_A p(A)p(B | A) \\ &= \sum_C p(D | C) \sum_B p(C | B) P(B) \\ &= \sum_C p(D | C) P(C) \end{align}

So, by using dynamic programming to do the computation inside out instead of outside in, we have done inference over the joint distribution represented by the chain without generating it explicitly. The cost of performing inference on the chain in this manner is \mathcal O(nk^2). In comparison, generating the full joint distribution and marginalizing over it has complexity \mathcal O(k^n)!

Tip

See slides 7-13 for a full derivation of this example, including the runtime \mathcal O(nk^2).
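As a sanity check, here is a minimal sketch of the inside-out computation on the chain, assuming k states per variable and hypothetical random CPTs: each elimination is a single vector-matrix product of cost \mathcal O(k^2).

```python
import numpy as np

# Minimal sketch of variable elimination on the chain A -> B -> C -> D,
# assuming k states per variable and hypothetical (random) CPTs.
k = 3
rng = np.random.default_rng(0)

def random_cpt(k):
    """Random conditional p(child | parent); rows index the parent."""
    cpt = rng.random((k, k))
    return cpt / cpt.sum(axis=1, keepdims=True)

p_A = rng.random(k); p_A /= p_A.sum()   # p(A)
p_B_A = random_cpt(k)                   # p(B | A)
p_C_B = random_cpt(k)                   # p(C | B)
p_D_C = random_cpt(k)                   # p(D | C)

# Working inside out, each elimination is one vector-matrix product, O(k^2).
p_B = p_A @ p_B_A                       # sum_A p(A) p(B | A)
p_C = p_B @ p_C_B                       # sum_B p(B) p(C | B)
p_D = p_C @ p_D_C                       # sum_C p(C) p(D | C)
print(p_D, p_D.sum())                   # p(D); sums to 1

# n - 1 such products give the O(n k^2) cost, versus O(k^n) for the
# full joint table.
```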

Simple Example: DGM

Let's take the DGM we saw in lecture 3, whose joint distribution factorizes as p(x_{1:6}) = p(x_1)p(x_2 | x_1)p(x_3 | x_1)p(x_4 | x_2)p(x_5 | x_3)p(x_6 | x_2, x_5).

What is p(x_1 | \bar x_6)? We have

Note

The \bar x means that the variable is observed.

X_F = \{x_1\}, \ X_E = \{x_6\}, \ X_R = \{x_2, x_3, x_4, x_5\}

and

\begin{align} p(X_F | X_E) &= \frac{\sum_{X_R} p(X_F, X_E, X_R)}{\sum_{X_F, X_R} p(X_F, X_E, X_R)} \\ \Rightarrow p(x_1 | \bar x_6) &= \frac{p(x_1, \bar x_6)}{p(\bar x_6)} = \frac{p(x_1, \bar x_6)}{\sum_{x_1}p(x_1, \bar x_6)} \end{align}

To compute p(x_1, \bar x_6), we use variable elimination:

\begin{align} p(x_1, \bar x_6) &= p(x_1) \sum_{x_2} \sum_{x_3} \sum_{x_4} \sum_{x_5} p(x_2 | x_1)p(x_3 | x_1)p(x_4 | x_2)p(x_5 | x_3)p(\bar x_6 | x_2, x_5) \\ &= p(x_1) \sum_{x_2} p(x_2 | x_1) \sum_{x_3} p(x_3 | x_1) \sum_{x_4} p(x_4 | x_2) \sum_{x_5} p(x_5 | x_3)p(\bar x_6 | x_2, x_5) \\ &= p(x_1) \sum_{x_2} p(x_2 | x_1) \sum_{x_3} p(x_3 | x_1) \sum_{x_4} p(x_4 | x_2) p(\bar x_6 | x_2, x_3) \\ &= p(x_1) \sum_{x_2} p(x_2 | x_1) \sum_{x_3} p(x_3 | x_1) p(\bar x_6 | x_2, x_3) \sum_{x_4} p(x_4 | x_2) \\ &= p(x_1) \sum_{x_2} p(x_2 | x_1) \sum_{x_3} p(x_3 | x_1) p(\bar x_6 | x_2, x_3) \\ &= p(x_1) \sum_{x_2} p(x_2 | x_1) p(\bar x_6 | x_1, x_2) \\ &= p(x_1) p(\bar x_6 | x_1) \end{align}
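The same elimination can be written as a handful of array contractions. Below is a minimal sketch assuming all six variables are binary and the CPTs are random placeholders; only the graph structure and the elimination order (x_5, x_4, x_3, x_2) are taken from the derivation above.

```python
import numpy as np

# Minimal sketch of the elimination above with binary variables and
# hypothetical (random) CPTs; only the structure matters here.
rng = np.random.default_rng(0)

def cpt(*shape):
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)   # normalize over the child (last axis)

p1    = cpt(2)          # p(x1)
p2_1  = cpt(2, 2)       # p(x2 | x1), axes (x1, x2)
p3_1  = cpt(2, 2)       # p(x3 | x1), axes (x1, x3)
p4_2  = cpt(2, 2)       # p(x4 | x2), axes (x2, x4)
p5_3  = cpt(2, 2)       # p(x5 | x3), axes (x3, x5)
p6_25 = cpt(2, 2, 2)    # p(x6 | x2, x5), axes (x2, x5, x6)

x6_obs = 1              # the observed value of x6

# Eliminate x5: tau(x2, x3) = sum_{x5} p(x5 | x3) p(x6 = x6_obs | x2, x5)
tau_23 = np.einsum('ce,be->bc', p5_3, p6_25[:, :, x6_obs])
# Eliminate x4: sum_{x4} p(x4 | x2) = 1, so that factor drops out.
# Eliminate x3: tau(x1, x2) = sum_{x3} p(x3 | x1) tau(x2, x3)
tau_12 = np.einsum('ac,bc->ab', p3_1, tau_23)
# Eliminate x2: p(x1, x6 = x6_obs) = p(x1) sum_{x2} p(x2 | x1) tau(x1, x2)
p_x1_x6 = p1 * np.einsum('ab,ab->a', p2_1, tau_12)

# Finally, condition on the evidence.
p_x1_given_x6 = p_x1_x6 / p_x1_x6.sum()
print(p_x1_given_x6)
```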

Summary so far

A worst-case analysis says that exact inference, i.e. marginalizing over the joint distribution, is NP-hard in general. In practice, some subexpressions in the joint depend only on a subset of variables because of the structure of the Bayesian network. We can therefore use dynamic programming to reduce the number of computations; however, the savings depend on choosing a good variable elimination ordering.

Sum-product inference

We want an algorithm to compute P(Y) for directed and undirected models. This can be reduced to the following sum-product inference task

\tau(Y) = \sum_Z \prod_{\psi \in \Psi} \psi(z_{\text{Scope}[\psi] \cap Z}, \, y_{\text{Scope}[\psi] \cap Y}) \quad \forall Y

where \Psi is a set of potentials or factors.

For directed models, \Psi is given by the conditional probability distributions for all variables

\Psi = \{\psi_{x_i}\}^N_{i=1} = \{p(x_i | x_{\pi_i})\}^N_{i=1}

where the sum is over the set Z = X - X_F.

For undirected models, \Psi is given by the set of potentials. Because the sum-product task returns an unnormalized distribution, we must normalize by \sum_Y \tau(Y).
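As an illustration, here is a minimal sketch of the sum-product task for a small set of factors, using a (scope, table) representation; the names and the two toy potentials are hypothetical, and the einsum call is a direct transcription of the formula rather than an optimized elimination.

```python
import numpy as np
import string

# Minimal sketch of the sum-product task: multiply all factors and sum out
# every variable not in the query. Factors are stored as (scope, table),
# with one table axis per variable in the scope; all names here are
# illustrative, not from a particular library.

def sum_product(factors, query_vars):
    variables = sorted({v for scope, _ in factors for v in scope})
    letter = {v: string.ascii_lowercase[i] for i, v in enumerate(variables)}
    inputs = ','.join(''.join(letter[v] for v in scope) for scope, _ in factors)
    output = ''.join(letter[v] for v in query_vars)
    # Direct transcription of tau(Y) = sum_Z prod_psi psi(...);
    # einsum performs the sums, but no clever ordering is used.
    return np.einsum(f'{inputs}->{output}', *[table for _, table in factors])

# Tiny undirected example: two binary potentials psi(A, B) and psi(B, C).
rng = np.random.default_rng(0)
factors = [(('A', 'B'), rng.random((2, 2))),
           (('B', 'C'), rng.random((2, 2)))]

tau_A = sum_product(factors, ('A',))
p_A = tau_A / tau_A.sum()   # undirected models need the explicit normalization
print(p_A)
```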

Example: Directed Graph

Take the following directed graph as an example.

The joint distribution is given by

p(C, D, I, G, S, L, H, J) = p(C)p(D|C)p(I)p(G | D, I)p(L | G)p(S | I)p(J | S, L)p(H | J, G)

with factors

\Psi = \{\psi_C(C), \psi_D(C, D), \psi_I(I), \psi_G(G, D, I), \psi_L(L, G), \psi_S(S, I), \psi_J(J, S, L), \psi_H(H, J, G)\}

Let's do variable elimination with the elimination ordering \prec = \{C, D, I, H, G, S, L\}.
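Since this ordering covers every variable except J, eliminating in this order leaves a single factor over J. Below is a minimal sketch of that elimination, assuming binary variables and random placeholder CPTs; the factor representation and the eliminate helper are illustrative, not from a particular library.

```python
import numpy as np
import string

# Minimal sketch of variable elimination with the ordering above, assuming
# binary variables and random placeholder CPTs. Factors are (scope, table)
# pairs; the first variable in each scope is the child, normalized over.
rng = np.random.default_rng(0)

def cpt(scope):
    t = rng.random((2,) * len(scope))
    return t / t.sum(axis=0, keepdims=True)   # normalize over the child (axis 0)

scopes = [('C',), ('D', 'C'), ('I',), ('G', 'D', 'I'), ('L', 'G'),
          ('S', 'I'), ('J', 'S', 'L'), ('H', 'J', 'G')]
factors = [(s, cpt(s)) for s in scopes]

def eliminate(factors, var):
    """Multiply all factors whose scope contains var, then sum var out."""
    involved = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    new_scope = []
    for scope, _ in involved:
        for v in scope:
            if v != var and v not in new_scope:
                new_scope.append(v)
    letters = {v: string.ascii_lowercase[i]
               for i, v in enumerate([var] + new_scope)}
    inputs = ','.join(''.join(letters[v] for v in scope) for scope, _ in involved)
    output = ''.join(letters[v] for v in new_scope)
    tau = np.einsum(f'{inputs}->{output}', *[table for _, table in involved])
    return rest + [(tuple(new_scope), tau)]

for var in ['C', 'D', 'I', 'H', 'G', 'S', 'L']:   # the elimination ordering
    factors = eliminate(factors, var)

# Every variable except J has been eliminated, so one factor over J remains;
# because all the original factors were CPDs, it is already normalized: p(J).
(scope, p_J), = factors
print(scope, p_J, p_J.sum())
```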