Hidden Markov Models & Three Key Problems and Algorithms

Contents

Introduction
Hidden Markov Models
Three Key HMM Problems

1) Forward-Backward algorithm
2) Viterbi algorithm
3) Baum-Welch algorithm

Introduction

This is a compilation of a series of videos in which Prof. Patterson describes the Hidden Markov Model, starting with the Markov Model and proceeding to the three core questions for HMMs.
A Hidden Markov Model is a machine learning model for predicting sequences of states from indirect observations.

Hidden Markov Model

Markov property: a mathematical modeling constraint

Discrete Markov model
: a model is discrete if it has a set of distinct states that it can be in at any given moment

At discrete times the system changes state (possibly to the same state) according to the transition probabilities

\(P(q_t = S_z | q_{t-1} = S_y, … , q_2 = S_j, q_1 = S_i)\)

A full probabilistic description of a given system would require knowledge of the current state and all previous states

But not so with the Markov property: only the previous state matters

\(P(q_t = S_j | q_{t-1} = S_i)\)

If we assume that our transition probabilities don’t vary over time then

\(a_{ij} = P(q_t = S_j | q_{t-1} = S_i), \quad 1 \leq i, j \leq N\)

\(a_{ij} \geq 0\)

\(\sum_{j=1}^{N} a_{ij} = 1\)

A transition probability of zero means we never make that transition.
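As a tiny illustration of these constraints (a Python/NumPy sketch with a hypothetical 3-state transition matrix, not one used elsewhere in these notes):

```python
import numpy as np

# Hypothetical 3-state transition matrix: rows = current state, columns = next state.
A = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],   # the 0.0 entry: this transition is never made
])

assert np.all(A >= 0)                   # a_ij >= 0
assert np.allclose(A.sum(axis=1), 1.0)  # each row sums to 1
```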

An Observable Markov Model

Given the model is in a known state, what is the probability that it will stay in that state for d days?

The formal observation sequence is

\(O = \{S_i, S_i, S_i, …, S_i, S_j \neq S_i\}\) (d occurrences of \(S_i\), followed by a different state \(S_j\))

\(P(O|Model, q_1 = S_i) = (a_{ii})^{d-1}(1-a_{ii}) = p_i(d)\)

This is \(p_i(d)\), the discrete probability density function of the duration d spent in state i (a geometric distribution)

From this we can calculate the average stay in any state conditioned on the model and starting in the state

\(\bar{d_i} = \sum_{d=1}^{\infty}dp_i(d)\)

\(\bar{d_i} = \sum_{d=1}^{\infty}d(a_{ii})^{d-1}(1-a_{ii}) = \frac{1}{1-a_{ii}}\)

So, if the sunny state has \(a_{ii} = 0.8\), the expected number of consecutive sunny days (given the model) is \(\frac{1}{1-a_{ii}} = \frac{1}{0.2} = 5\)
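A quick numerical check of this result (a Python/NumPy sketch, using the implied \(a_{ii} = 0.8\)): the duration distribution sums to 1 and its mean matches \(\frac{1}{1-a_{ii}}\).

```python
import numpy as np

a_ii = 0.8                          # self-transition probability of the sunny state
d = np.arange(1, 1000)              # durations 1, 2, 3, ...
p = a_ii ** (d - 1) * (1 - a_ii)    # p_i(d) = a_ii^(d-1) (1 - a_ii)

print(p.sum())        # ~1.0 -> p_i(d) is a valid probability distribution
print((d * p).sum())  # ~5.0 -> matches 1 / (1 - a_ii)
```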

Imagine a person flipping a coin behind a curtain
Each result is called out and becomes an observation

  • How should we model this situation?
  • Should we assume a single fair coin is being flipped?
  • Should we assume the caller is accurate?

A Hidden Markov Model has N states
They are hidden, but they typically correspond to something we know about the world
The states are generally fully connected (the model is ergodic): S = \(\{S_1, S_2, S_3, S_4, S_5, …, S_N\}\)
The state at time t is \(q_t\), with \(q_i \in S\) for all \(1 \leq i \leq t\)

A Hidden Markov Model has M observable symbols
They are what can be observed
They are a discrete alphabet
V = \(\{V_1, V_2, V_3, V_4, V_5, …, V_M\}\)

A Hidden Markov Model has State transition matrix A

\(A = \{a_{ij}\}\)

\(a_{ij} = P(q_t = S_j | q_{t-1} = S_i), \quad 1 \leq i, j \leq N\)

If \(a_{ij} = 0\) then a transition from \(S_i\) to \(S_j\) is not possible

\(\sum_{j=1}^{N}a_{ij} = 1\)

A Hidden Markov Model has Observation probabilities

\(B = \{b_j(k)\}\)

\(b_j(k) = P(v_k \text{ at } t | q_t = S_j)\), 1 ≤ j ≤ N, 1 ≤ k ≤ M

A Hidden Markov Model has Initial state probabilities

\(\pi = \{\pi_i\}\)

\(\pi_i = P(q_1 = S_i), \quad 1 \leq i \leq N\)

HMM = {N, M, A, B, π}

\(O = \{O_1, O_2, O_3, \ldots, O_T\}\)

To generate a sequence of observations:

1) Choose \(q_1 = S_i\) according to \(\pi_i\) and set t = 1
2) Choose \(O_t = v_k\) according to B, i.e. \(b_i(k)\)
3) Transition to a new state \(q_{t+1} = S_j\) according to A, i.e. \(a_{ij}\)
4) Set t = t + 1 and repeat from step 2 until t > T
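A minimal sampling sketch of this procedure (Python/NumPy; the two-state, two-symbol parameters below are hypothetical stand-ins for the coin-behind-the-curtain example, and `generate` is just an illustrative name):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-symbol HMM: two biased coins behind the curtain.
A  = np.array([[0.9, 0.1],
               [0.1, 0.9]])   # state transition matrix
B  = np.array([[0.5, 0.5],
               [0.8, 0.2]])   # B[j, k] = b_j(k), observation probabilities
pi = np.array([0.6, 0.4])     # initial state probabilities

def generate(T):
    """Sample a state sequence Q and an observation sequence O of length T."""
    q = rng.choice(len(pi), p=pi)                    # choose q_1 according to pi
    states, obs = [], []
    for _ in range(T):
        states.append(q)
        obs.append(rng.choice(B.shape[1], p=B[q]))   # choose O_t according to b_q(k)
        q = rng.choice(len(pi), p=A[q])              # move to q_{t+1} according to a_qj
    return states, obs

print(generate(10))
```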

A Markov Model is a special case of a Hidden Markov Model:

  • One in which the observation distribution has one non-zero entry
  • There is one observation symbol for each state

Three Key HMM Problems

What is the probability that a model generated a sequence of observations?

\(O = \{O_1, O_2, O_3, \ldots, O_T\}, \lambda = (A, B, \pi)\)
\(P(O|\lambda) ?\)

  • Given a sequence of observations, which apartment is the robot in?
  • Which model best explains the observations?

Forward-Backward algorithm

Let’s start by imagining all possible state sequences

\(Q = q_1, q_2, q_3, …, q_T\)

Probability of seeing observations given those states is

\(P(O|Q, \lambda) = \prod_{t=1}^{T}P(O_t|q_t, \lambda)\)

\(P(O|Q, \lambda) = b_{q_1}(O_1) \cdot b_{q_2}(O_2) \cdots b_{q_T}(O_T)\)

Probability of seeing those state transitions is \(P(Q|\lambda) = \pi_{q_1} a_{q_1 q_2} a_{q_2 q_3} \cdots a_{q_{T-1} q_T}\)

The probability of seeing those observations and those state transitions together is P(O, Q|λ) = P(O|Q, λ)P(Q|λ)

But we want the probability of the observations regardless of the particular state sequence: \(P(O|\lambda) = \sum_{\text{all } Q} P(O|Q, \lambda)P(Q|\lambda)\)

Calculating this directly is infeasible P(O|λ) = \(\sum_{q_1,q_2,…,q_T}\pi_{q_1}b_{q_1}(O_1)a_{q_1q_2}b_{q_2}(O_2)a_{q_2q_3} \cdots a_{q_{T-1}q_T}b_{q_T}(O_T)\)

How many state sequences are there? \(N^T\)

How many multiplications per state sequence? 2T-1

Total number of operations? \((2T-1)N^T + (N^T-1)\)

T = 100 and N = 5, How many operations?

\((2T-1)N^T + (N^T-1) = (2(100) - 1)\,5^{100} + (5^{100} - 1) = 199 \cdot 5^{100} + 5^{100} - 1 = 200 \cdot 5^{100} - 1 \approx 5^{103} \approx 10^{72}\)
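A one-line sanity check of that arithmetic (Python):

```python
T, N = 100, 5
ops = (2 * T - 1) * N ** T + (N ** T - 1)   # multiplications plus additions
print(f"{ops:.2e}")                         # ~1.6e+72 operations
```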

We need a better way: the forward-backward algorithm.

Let’s introduce a helper function

\(\alpha_t(i) = P(O_1, O_2, O_3, …, O_t, q_t = S_i | \lambda)\)

Let’s solve for it inductively

  • base case: \(\alpha_1(i) = \pi_i b_i(O_1)\), 1 ≤ i ≤ N
  • inductive step:

\(\alpha_{t+1}(j) = [\sum_{i=1}^{N}\alpha_t(i)a_{ij}]b_j(O_{t+1})\), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N

final step: P(O|λ) = \(\sum_{i=1}^{N}\alpha_T(i)\)

Direct calculation: \(O(TN^T)\), i.e. \((2T-1)N^T + (N^T-1) \approx 10^{72}\) calculations

Forward algorithm: \(O(N^2 T) \approx 3000\) calculations for N = 5, T = 100

The forward variable \(\alpha_t(i)\) is all we need to solve the first problem.
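Here is a minimal sketch of the forward pass (Python/NumPy; the two-state model and the name `forward` are hypothetical illustrations, not part of the lecture):

```python
import numpy as np

# Hypothetical model and observation sequence (symbol indices).
A  = np.array([[0.9, 0.1],
               [0.1, 0.9]])
B  = np.array([[0.5, 0.5],
               [0.8, 0.2]])
pi = np.array([0.6, 0.4])
O  = [0, 1, 1, 0, 0]

def forward(A, B, pi, O):
    """alpha[t, i] = P(O_1 ... O_t, q_t = S_i | lambda)."""
    N, T = A.shape[0], len(O)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                      # base case
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]  # inductive step
    return alpha

alpha = forward(A, B, pi, O)
print(alpha[-1].sum())   # P(O | lambda): sum of alpha_T(i) over i
```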

But let’s do the same thing backward to help solve the other problems

\(\beta_t(i) = P(O_{t+1}, O_{t+2}, … O_T | q_t = S_i, \lambda)\)

Let’s solve for it inductively

  • base case: \(\beta_T(i) = 1\), 1 ≤ i ≤ N
  • inductive step:

\(\beta_t(i) = \sum_{j=1}^{N}a_{ij}b_j(O_{t+1})\beta_{t+1}(j)\), t = T-1, T-2, …, 1, 1 ≤ i ≤ N

\(\beta_t(i) = P(O_{t+1}, O_{t+2}, \ldots, O_T | q_t = S_i, \lambda)\)
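And a matching sketch of the backward pass (same hypothetical model; `backward` is an illustrative name). As a cross-check, \(\sum_i \pi_i b_i(O_1)\beta_1(i)\) gives the same \(P(O|\lambda)\) as the forward pass:

```python
import numpy as np

A  = np.array([[0.9, 0.1],
               [0.1, 0.9]])
B  = np.array([[0.5, 0.5],
               [0.8, 0.2]])
pi = np.array([0.6, 0.4])
O  = [0, 1, 1, 0, 0]

def backward(A, B, O):
    """beta[t, i] = P(O_{t+1} ... O_T | q_t = S_i, lambda)."""
    N, T = A.shape[0], len(O)
    beta = np.ones((T, N))                            # base case: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])  # inductive step
    return beta

beta = backward(A, B, O)
print((pi * B[:, O[0]] * beta[0]).sum())   # also P(O | lambda)
```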

Motivating the Viterbi algorithm

Q1. What is the probability that a model generated a sequence of observations?

\(O = \{O_1, O_2, O_3, …, O_T\}, \lambda = (A, B, \pi)\)
\(P(O|\lambda) ?\)

P(O|λ) = \(\sum_{i=1}^{N}\alpha_T(i)\)

\(\alpha_t(i) = P(O_1, O_2, O_3, \ldots, O_t, q_t = S_i | \lambda)\)
\(\beta_t(i) = P(O_{t+1}, O_{t+2}, \ldots, O_T | q_t = S_i, \lambda)\)

The Forward-Backward Algorithm

Q2. What sequence of states best explains a sequence of observations?

\(O = \{O_1, O_2, O_3, …, O_T\}\)
\(Q = \{q_1, q_2, q_3, …, q_T\} ?\)

What does “best” mean?

1) choose states that are individually most likely \(\gamma_t(i) = P(q_t = S_i | O, \lambda)\)

\( \gamma_t(i) = \frac{\alpha_t(i)\beta_t(i)}{P(O|\lambda)} = \frac{\alpha_t(i)\beta_t(i)}{\sum_{j=1}^{N}\alpha_t(j)\beta_t(j)}\)

\(\sum_{i=1}^{N}\gamma_t(i) = 1\)

\(q_t = argmax_{1 \leq i \leq N}[\gamma_t(i)]\), 1 ≤ t ≤ T

(1) This will maximize the expected number of individually correct states, but … the resulting sequence might not make sense as a path (it can include transitions with \(a_{ij} = 0\))
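A small sketch of option (1) (Python/NumPy, reusing the hypothetical two-state model and the forward/backward passes from the sketches above): the individually most likely states are just the per-time argmax of \(\gamma_t(i)\).

```python
import numpy as np

A  = np.array([[0.9, 0.1],
               [0.1, 0.9]])
B  = np.array([[0.5, 0.5],
               [0.8, 0.2]])
pi = np.array([0.6, 0.4])
O  = [0, 1, 1, 0, 0]
N, T = A.shape[0], len(O)

# Forward and backward passes, as sketched earlier.
alpha = np.zeros((T, N)); alpha[0] = pi * B[:, O[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
beta = np.ones((T, N))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])

# gamma[t, i] = P(q_t = S_i | O, lambda) = alpha_t(i) beta_t(i) / P(O | lambda)
gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)

print(gamma.argmax(axis=1))   # individually most likely state at each time t
```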

(2) choose a path sequence that maximizes P(Q|O, λ)

Maximizing P(Q|O, λ) is equivalent to maximizing the joint probability P(Q, O|λ), since P(O|λ) does not depend on Q:
\(P(\{q_1, q_2, q_3 … q_T\}, \{O_1, O_2, O_3, …, O_T\}|\lambda)\)

The Viterbi Algorithm

What is the path with the highest probability that accounts for the first t observations and ends at state \(S_i\)?

\(\delta_t(i) = max_{q_1,q_2,\ldots,q_{t-1}}P(q_1, q_2, q_3, \cdots, q_{t-1}, q_t = S_i, O_1, O_2, O_3, \cdots, O_t|\lambda)\)

induction step

\(\delta_{t+1}(j) = [max_i \delta_t(i)a_{ij}]\cdot b_j(O_{t+1})\)

We need to keep track of which i maximized the result at each time step

Initialization \(\delta_1(i) = \pi_i b_i(O_1)\), \(\psi_1(i) = 0\)

Inductive step

\(\delta_t(j) = max_{1 \leq i \leq N}[\delta_{t-1}(i)a_{ij}]\cdot b_j(O_t)\)
\(\psi_t(j) = argmax_{1 \leq i \leq N}[\delta_{t-1}(i)a_{ij}]\), 2 ≤ t ≤ T, 1 ≤ j ≤ N

Termination

\(P^{*} = max_{1 \leq i \leq N}[\delta_T(i)]\), \(q_T^{*} = argmax_{1 \leq i \leq N}[\delta_T(i)]\)

The path sequence that maximizes P(Q, O|λ)

\(\delta_t(j) = max_{1 \leq i \leq N}[\delta_{t-1}(i)a_{ij}]\cdot b_j(O_t)\)

\(\alpha_t(j) = [\sum_{i=1}^{N}\alpha_{t-1}(i)a_{ij}] b_j(O_t)\)

Note that the Viterbi recursion has the same structure as the forward recursion, with the sum over i replaced by a maximization.
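A compact sketch of the full Viterbi recursion with backtracking (Python/NumPy; the model parameters and the name `viterbi` are illustrative):

```python
import numpy as np

A  = np.array([[0.9, 0.1],
               [0.1, 0.9]])
B  = np.array([[0.5, 0.5],
               [0.8, 0.2]])
pi = np.array([0.6, 0.4])
O  = [0, 1, 1, 0, 0]

def viterbi(A, B, pi, O):
    """Return the state path that maximizes P(Q, O | lambda), and its probability."""
    N, T = A.shape[0], len(O)
    delta = np.zeros((T, N))             # delta[t, j]: best score of a path ending in S_j at t
    psi   = np.zeros((T, N), dtype=int)  # psi[t, j]: argmax predecessor state
    delta[0] = pi * B[:, O[0]]           # initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A         # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, O[t]]
    # Termination and path backtracking
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], delta[-1].max()

print(viterbi(A, B, pi, O))
```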

Q3. Given a set of observation sequences, how do we learn the model probabilities that would generate them?

Review our Resources

\(O = \{O_1, O_2, O_3, …, O_T\}\)
\(\lambda = (A, B, \pi) ?\)

\(\alpha_t(i) = P(O_1, O_2, O_3, …, O_t, q_t = S_i | \lambda)\)

\(\beta_t(i) = P(O_{t+1}, O_{t+2}, \ldots, O_T | q_t = S_i, \lambda)\)

\(\gamma_t(i) = P(q_t = S_i | O, \lambda)\)

Baum-Welch Algorithm

There is no known way to solve analytically for the globally optimal model parameters \(\lambda\)

We will search for a locally optimal result: one that converges to a stable, good answer but isn't guaranteed to be the best possible answer.

This is an EM method
– Expectation-Maximization (EM)
– Iterative, converges to a local optimum

loosely analogous to gradient "descent" (here we are actually climbing the likelihood \(P(O|\lambda)\))

We need a new mathematical tool

\(\xi_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, \lambda)\)

\(\xi_t(i, j) = \frac{\alpha_t(i)a_{ij}b_j(O_{t+1})\beta_{t+1}(j)}{P(O|\lambda)}\)

\(\xi_t(i, j) = \frac{\alpha_t(i)a_{ij}b_j(O_{t+1})\beta_{t+1}(j)}{\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_t(i)a_{ij}b_j(O_{t+1})\beta_{t+1}(j)}\)

\(\xi_t(i, j)\) is related to \(\gamma_t(i)\)

\(\gamma_t(i) = \sum_{j=1}^{N}\xi_t(i, j)\)

\(\gamma_t(i)\) is the probability of being in \(S_i\) at time t

if we sum over all t then we have a number that can be treated as the expected number of times \(S_i\) is ever visited

\(\xi_t(i, j)\) is the probability of transitioning from \(S_i\) to \(S_j\) at time t

if we sum over all t then we have a number that can be treated as the expected number of times \(S_i\) ever transitions to \(S_j\)

\(\sum_{t=1}^{T-1}\gamma_t(i)\) = expected number of transitions from \(S_i\)

\(\sum_{t=1}^{T-1}\xi_t(i, j)\) = expected number of transitions from \(S_i\) to \(S_j\)

So now
– We have an existing model, λ = (A, B, π)
– We have a set of observations, O
– We have a set of tools \(\alpha_t(i), \beta_t(i), \gamma_t(i), \xi_t(i, j)\)

How do we use these to improve our model? \(\bar{\lambda}\) = ?

\(\bar{\pi}_i = \gamma_1(i)\) = expected frequency in \(S_i\) at time (t = 1)

\(\bar{a_{ij}} = \frac{\text{expected number of transitions from } S_i \text{ to } S_j}{\text{expected number of transitions from } S_i}\)

\(\bar{a_{ij}} = \frac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\gamma_t(i)}\)

\(\bar{b_j(k)} = \frac{\text{expected number of times in state j and observing } v_k}{\text{expected number of times in state j}}\)

\(\bar{b}_j(k) = \frac{\sum_{t=1\ \text{s.t.}\ O_t=v_k}^{T}\gamma_t(j)}{\sum_{t=1}^{T}\gamma_t(j)}\)

Given λ = (A, B, π) and O we can produce \(\alpha_t(i), \beta_t(i), \gamma_t(i), \xi_t(i, j)\)

Given \(\alpha_t(i), \beta_t(i), \gamma_t(i), \xi_t(i, j)\) we can produce \(\bar{\lambda} = (\bar{A}, \bar{B}, \bar{\pi})\)

Iterating these two steps is the Baum-Welch algorithm: each re-estimation is guaranteed not to decrease \(P(O|\lambda)\), so the procedure converges to a locally optimal model.
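Putting the pieces together, here is a minimal single-sequence Baum-Welch sketch (Python/NumPy; the initial \(A\), \(B\), \(\pi\) and the observation sequence are hypothetical, and a practical implementation would add scaling or log-space arithmetic to avoid underflow on long sequences):

```python
import numpy as np

# Hypothetical initial guess lambda = (A, B, pi) and a short observation sequence O.
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.6, 0.4],
               [0.3, 0.7]])
pi = np.array([0.5, 0.5])
O  = [0, 0, 1, 1, 0, 1, 1, 1]
N, M, T = A.shape[0], B.shape[1], len(O)

for iteration in range(20):
    # E-step: forward, backward, gamma, xi
    alpha = np.zeros((T, N)); alpha[0] = pi * B[:, O[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()                   # P(O | lambda)

    gamma = alpha * beta / likelihood              # gamma[t, i] = P(q_t = S_i | O, lambda)
    xi = np.zeros((T - 1, N, N))                   # xi[t, i, j] = P(q_t = S_i, q_{t+1} = S_j | O, lambda)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, O[t + 1]] * beta[t + 1]
    xi /= likelihood

    # M-step: re-estimate pi, A, B from the expected counts
    pi = gamma[0]
    A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B = np.zeros((N, M))
    for k in range(M):
        B[:, k] = gamma[np.array(O) == k].sum(axis=0) / gamma.sum(axis=0)

    print(iteration, likelihood)   # P(O | lambda) never decreases across iterations
```

Each loop iteration uses the current λ to compute the expected counts (the E-step) and then replaces λ with \(\bar{\lambda}\) (the M-step).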
