### Contents

- Introduction
- Hidden Markov Model
- Three Key HMM Problems
  1) Forward-Backward algorithm
  2) Viterbi algorithm
  3) Baum-Welch algorithm

## Introduction

This is a compilation of a series of videos in which Prof. Patterson describes the Hidden Markov Model, starting with the Markov Model and proceeding to the three core questions for HMMs.

A Hidden Markov Model is a machine learning model for predicting sequences of states from indirect observations.

## Hidden Markov Model

Markov property: a mathematical modeling constraint

Discrete Markov model

: a model is discrete if it has distinct states that it can be in at a given moment

At discrete time steps the system changes state (possibly back to the same state) according to the transition probabilities

\(P(q_t = S_z | q_{t-1} = S_y, … , q_2 = S_j, q_1 = S_i)\)

A full probabilistic description of a given system would require knowledge of the current state and all previous states

But not so with the Markov property

\(P(q_t = S_j | q_{t-1} = S_i)\)

If we assume that our transition probabilities don’t vary over time then

\(a_{ij} = P(q_t = S_j | q_{t-1} = S_i), \quad 1 \leq i, j \leq N\)

\(a_{ij} \geq 0\)

\(\sum_{j=1}^{N} a_{ij} = 1\)

zero means we’re not going to make that transition ever

##### An Observable Markov Model

Given the model is in a known state, what is the probability that it will stay in that state for d days?

The formal observation sequence is

\(O = \{S_i, S_i, S_i, …, S_i, S_j \neq S_i\}\)

\(P(O|Model, q_1 = S_i) = (a_{ii})^{d-1}(1-a_{ii}) = p_i(d)\)

This is the discrete probability distribution \(p_i(d)\) of the duration d in state i

From this we can calculate the average stay in any state conditioned on the model and starting in the state

\(\bar{d_i} = \sum_{d=1}^{\infty}dp_i(d)\)

\(\bar{d_i} = \sum_{d=1}^{\infty}d(a_{ii})^{d-1}(1-a_{ii}) = \frac{1}{1-a_{ii}}\)

So, if the self-transition probability of the sunny state is \(a_{ii} = 0.8\), the expected number of consecutive **sunny** days (given the model) is \(\frac{1}{1-a_{ii}} = \frac{1}{0.2} = 5\)
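As a sanity check, the duration distribution and its mean can be verified numerically. A minimal sketch; the value \(a_{ii} = 0.8\) is the assumed self-transition probability from the example above:

```python
# Check that the state-duration distribution p_i(d) = a_ii^(d-1) * (1 - a_ii)
# has mean 1 / (1 - a_ii).  a_ii = 0.8 is the assumed "sunny day"
# self-transition probability.

a_ii = 0.8

def p_duration(d, a=a_ii):
    """Probability of staying in state i for exactly d steps."""
    return a ** (d - 1) * (1 - a)

# Truncate the infinite sum at a large horizon; the tail is negligible.
mean_duration = sum(d * p_duration(d) for d in range(1, 10_000))

print(mean_duration)   # close to 1 / (1 - 0.8) = 5
```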

Imagine a person flipping a coin behind a curtain

Each result is called out and becomes an observation

- How should we model this situation?
- Should we assume a single fair coin is being flipped?
- Should we assume the caller is accurate?

A Hidden Markov Model has N states

They are hidden, but they typically correspond to something we know about the world

The states are typically fully connected (every state is reachable from every other), a topology known as **ergodic**: S = \(\{S_1, S_2, S_3, S_4, S_5, …, S_N\}\)

The state at time t is \(q_t\), with \(q_t \in S\) for all \(1 \leq t \leq T\)

A Hidden Markov Model has M observable symbols

They are what can be observed

They are a discrete alphabet

V = \(\{V_1, V_2, V_3, V_4, V_5, …, V_M\}\)

A Hidden Markov Model has State transition matrix A

A = \(\{a_{ij}\}\)

\(a_{ij} = P(q_t = S_j | q_{t-1} = S_i), \quad 1 \leq i, j \leq N\)

if \(a_{ij} = 0\) then a transition from \(S_i\) to \(S_j\) is not possible

\(\sum_{j=1}^{N}a_{ij} = 1\)

A Hidden Markov Model has Observation probabilities

B = \(\{b_j(k)\}\)

\(b_j(k) = P(O_t = v_k | q_t = S_j)\), 1 ≤ j ≤ N, 1 ≤ k ≤ M

A Hidden Markov Model has Initial state probabilities

\(\pi = \{\pi_i\}\)

\(\pi_i = P(q_1 = S_i), \quad 1 \leq i \leq N\)

HMM = \(\{N, M, A, B, \pi\}\)

\(O = \{O_1, O_2, O_3, …, O_T\}\)

To generate observations:

1. Choose \(q_1 = S_i\) according to \(\pi_i\)
2. Set t = 1
3. Choose \(O_t = v_k\) according to B, i.e. \(b_i(k)\)
4. Transition to a new state \(q_{t+1} = S_j\) according to A, i.e. \(a_{ij}\)
5. Set t = t + 1 and repeat from step 3 until t > T
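The generation procedure above can be sketched as follows; this is a toy 2-state, 2-symbol model whose parameter values in `A`, `B`, and `pi` are made up for illustration:

```python
import random

# Toy 2-state, 2-symbol HMM; all parameter values are made up.
A  = [[0.7, 0.3], [0.4, 0.6]]   # state transition probabilities a_ij
B  = [[0.9, 0.1], [0.2, 0.8]]   # observation probabilities b_j(k)
pi = [0.6, 0.4]                 # initial state probabilities

def generate(T, rng=random):
    """Generate T observations (and the hidden states that produced them)."""
    states, obs = [], []
    q = rng.choices(range(len(pi)), weights=pi)[0]          # q_1 ~ pi
    for _ in range(T):
        states.append(q)
        obs.append(rng.choices(range(2), weights=B[q])[0])  # O_t ~ b_q(.)
        q = rng.choices(range(2), weights=A[q])[0]          # q_{t+1} ~ a_q.
    return states, obs

states, obs = generate(10)
```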

A Markov Model is a special case of a Hidden Markov Model

- One in which the observation distribution has one non-zero entry
- There is one observation symbol for each state

#### Three Key HMM Problems

What is the probability that a model generated a sequence of observations?

\(O = \{O_1, O_2, O_3, …, O_T\}, \lambda = (A, B, \pi)\)

\(P(O|\lambda) ?\)

- Given a sequence of observations, which apartment is the robot in?
- Which model best explains the observations?

### Forward-Backward algorithm

Let’s start by imagining all possible state sequences

\(Q = q_1, q_2, q_3, …, q_T\)

Probability of seeing observations given those states is

\(P(O|Q, \lambda) = \prod_{t=1}^{T}P(O_t|q_t, \lambda)\)

\(P(O|Q, \lambda) = b_{q1}(O_1) \cdot b_{q2}(O_2) \cdots b_{qT}(O_T)\)

Probability of seeing those state transitions is \(P(Q|\lambda) = \pi_{q_1}a_{q_1q_2}a_{q_2q_3} \cdots a_{q_{T-1}q_T}\)

Probability of those seeing observations **and** those state transitions is P(O, Q|λ) = P(O|Q, λ)P(Q|λ)

But we want the probability of the observations regardless of the particular state sequence: P(O|λ) = \(\sum_{\text{all } Q}P(O|Q, \lambda)P(Q|\lambda)\)

Calculating this directly is infeasible P(O|λ) = \(\sum_{q_1,q_2,…,q_T}\pi_{q_1}b_{q_1}(O_1)a_{q_1q_2}b_{q_2}(O_2)a_{q_2q_3} \cdots a_{q_{T-1}q_T}b_{q_T}(O_T)\)

How many state sequences are there? \(N^T\)

How many multiplications per state sequence? 2T-1

Total number of operations? \((2T-1)N^T + (N^T-1)\)

With T = 100 and N = 5, how many operations?

\((2T-1)N^T + (N^T-1) = (2(100) - 1)5^{100} + (5^{100} - 1) = 199 \cdot 5^{100} + 5^{100} - 1 = 200 \cdot 5^{100} - 1 \approx 5^{103} \approx 10^{72}\)
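Python's arbitrary-precision integers make it easy to check this arithmetic exactly:

```python
import math

# Exact operation count for direct enumeration, using Python big ints.
T, N = 100, 5
ops = (2 * T - 1) * N ** T + (N ** T - 1)

print(math.log10(ops))   # about 72.2, i.e. roughly 10^72 operations
```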

We need a better way: the forward-backward algorithm

Let’s introduce a helper function

\(\alpha_t(i) = P(O_1, O_2, O_3, …, O_t, q_t = S_i | \lambda)\)

Let’s solve for it inductively

- base case: \(\alpha_1(i) = \pi_i b_i(O_1)\), 1 ≤ i ≤ N
- inductive step:

\(\alpha_{t+1}(j) = [\sum_{i=1}^{N}\alpha_t(i)a_{ij}]b_j(O_{t+1})\), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N

- final step: P(O|λ) = \(\sum_{i=1}^{N}\alpha_T(i)\)

Direct method: \(O(TN^T)\), i.e. \((2T-1)N^T + (N^T-1) \approx 10^{72}\) calculations

Forward algorithm: \(O(N^2T) \approx 3000\) calculations for N = 5, T = 100

The forward part is all we need to solve the first problem \(\alpha_t(i)\)
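A minimal sketch of the forward recursion, checked against the direct \(N^T\)-term enumeration; the toy model values are made up for illustration:

```python
from itertools import product

# Toy 2-state, 2-symbol model; all parameter values are made up.
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
O  = [0, 1, 1, 0]
N  = 2

def forward(O):
    """alpha[t][i] = P(O_1..O_{t+1}, q_{t+1} = S_i | lambda), 0-indexed in t."""
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]           # base case
    for t in range(1, len(O)):                                 # inductive step
        alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])
    return alpha

def brute_force(O):
    """Sum P(O, Q | lambda) over all N^T state sequences."""
    total = 0.0
    for Q in product(range(N), repeat=len(O)):
        p = pi[Q[0]] * B[Q[0]][O[0]]
        for t in range(1, len(O)):
            p *= A[Q[t - 1]][Q[t]] * B[Q[t]][O[t]]
        total += p
    return total

p_forward = sum(forward(O)[-1])   # final step: P(O | lambda)
```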

But let’s do the same thing backward to help solve the other problems

\(\beta_t(i) = P(O_{t+1}, O_{t+2}, … O_T | q_t = S_i, \lambda)\)

Let’s solve for it inductively

- base case: \(\beta_T(i) = 1\), 1 ≤ i ≤ N
- inductive step:

\(\beta_t(i) = \sum_{j=1}^{N}a_{ij}b_j(O_{t+1})\beta_{t+1}(j)\), t = T-1, T-2, …, 1, 1 ≤ i ≤ N

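A matching sketch of the backward recursion (same kind of toy model with made-up values). A useful sanity check is that \(\sum_i \pi_i b_i(O_1)\beta_1(i)\) equals the \(P(O|\lambda)\) computed by the forward pass:

```python
# Toy 2-state, 2-symbol model; all parameter values are made up.
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
O  = [0, 1, 1, 0]
N  = 2

def backward(O):
    """beta[t][i] = P(O_{t+2}..O_T | q_{t+1} = S_i, lambda), 0-indexed in t."""
    beta = [[1.0] * N]                     # base case: beta_T(i) = 1
    for t in range(len(O) - 2, -1, -1):    # inductive step, t = T-1, ..., 1
        beta.insert(0, [sum(A[i][j] * B[j][O[t + 1]] * beta[0][j]
                            for j in range(N)) for i in range(N)])
    return beta

def forward_prob(O):
    """P(O | lambda) via the forward recursion, for comparison."""
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]
    for t in range(1, len(O)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                 for j in range(N)]
    return sum(alpha)

beta = backward(O)
p_backward = sum(pi[i] * B[i][O[0]] * beta[0][i] for i in range(N))
```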

### Motivating the Viterbi algorithm

**Q1. What is the probability that a model generated a sequence of observations?**

\(O = \{O_1, O_2, O_3, …, O_T\}, \lambda = (A, B, \pi)\)

\(P(O|\lambda) ?\)

P(O|λ) = \(\sum_{i=1}^{N}\alpha_T(i)\)

\(\alpha_t(i) = P(O_1, O_2, O_3, …, O_t, q_t = S_i | \lambda)\)

\(\beta_t(i) = P(O_{t+1}, O_{t+2}, …, O_T | q_t = S_i, \lambda)\)

**The Forward-Backward Algorithm**

**Q2. What sequence of states best explains a sequence of observations?**

\(O = \{O_1, O_2, O_3, …, O_T\}\)

\(Q = \{q_1, q_2, q_3, …, q_T\} ?\)

What does “best” mean?

1) choose states that are individually most likely \(\gamma_t(i) = P(q_t = S_i | O, \lambda)\)

\( \gamma_t(i) = \frac{\alpha_t(i)\beta_t(i)}{P(O|\lambda)} = \frac{\alpha_t(i)\beta_t(i)}{\sum_{j=1}^{N}\alpha_t(j)\beta_t(j)}\)

\(\sum_{i=1}^{N}\gamma_t(i) = 1\)

\(q_t = argmax_{1 \leq i \leq N}[\gamma_t(i)]\), 1 ≤ t ≤ T
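Combining the forward and backward passes gives \(\gamma_t(i)\) and the individually-most-likely states. A sketch on a toy model with made-up values:

```python
# Toy 2-state, 2-symbol model; all parameter values are made up.
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
O  = [0, 1, 0]
N, T = 2, 3

# Forward pass: alpha[t][i]
alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
                  for j in range(N)])

# Backward pass: beta[t][i]
beta = [[1.0] * N]
for t in range(T - 2, -1, -1):
    beta.insert(0, [sum(A[i][j] * B[j][O[t + 1]] * beta[0][j]
                        for j in range(N)) for i in range(N)])

# gamma_t(i) = alpha_t(i) * beta_t(i) / P(O | lambda); each row sums to 1.
p_O = sum(alpha[-1])
gamma = [[alpha[t][i] * beta[t][i] / p_O for i in range(N)] for t in range(T)]

# Individually most likely state at each time step.
best_states = [max(range(N), key=lambda i: g[i]) for g in gamma]
```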

1) This will maximize the expected number of correct states, but … the resulting sequence might not make sense as a path (it could include transitions that have zero probability)

2) choose a **path sequence** that maximizes P(Q|O, λ)

Maximizing P(Q|O, λ) is equivalent to maximizing P(Q, O|λ), since the two differ only by the constant factor P(O|λ)

\(P(\{q_1, q_2, q_3, …, q_T\}, \{O_1, O_2, O_3, …, O_T\}|\lambda)\)

**The Viterbi Algorithm**

What is the path with the highest probability that accounts for the first t observations and ends at state \(S_i\)?

\(\delta_t(i) = max_{q_1,q_2,…,q_{t-1}}P(q_1, q_2, q_3, \cdots, q_{t-1}, q_t = S_i, O_1, O_2, O_3, \cdots, O_t|\lambda)\)

induction step

\(\delta_{t+1}(j) = [max_i \delta_t(i)a_{ij}]\cdot b_j(O_{t+1})\)

We need to keep track of which i maximized the result at each time step

Initialization: \(\delta_1(i) = \pi_i b_i(O_1)\), \(\psi_1(i) = 0\)

Inductive step

\(\delta_t(j) = max_{1 \leq i \leq N}[\delta_{t-1}(i)a_{ij}]\cdot b_j(O_t)\)

\(\psi_t(j) = argmax_{1 \leq i \leq N}[\delta_{t-1}(i)a_{ij}]\), 2 ≤ t ≤ T, 1 ≤ j ≤ N

Termination

\(P^{*} = max_{1 \leq i \leq N}[\delta_T(i)]\), \(q_T^{*} = argmax_{1 \leq i \leq N}[\delta_T(i)]\)

The **path sequence** that maximizes P(Q, O|λ)

\(\delta_t(j) = max_{1 \leq i \leq N}[\delta_{t-1}(i)a_{ij}]\cdot b_j(O_t)\)

\(\alpha_t(j) = [\sum_{i=1}^{N}\alpha_{t-1}(i)a_{ij}] b_j(O_t)\)
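A sketch of the full Viterbi recursion with backtracking, checked against brute-force enumeration of all state sequences (toy model; parameter values are made up):

```python
from itertools import product

# Toy 2-state, 2-symbol model; all parameter values are made up.
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
O  = [0, 1, 1, 0]
N  = 2

def viterbi(O):
    """Return (P*, best path) via the delta/psi recursion."""
    delta = [pi[i] * B[i][O[0]] for i in range(N)]   # delta_1(i)
    psi = []
    for t in range(1, len(O)):
        step, back = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            step.append(delta[best_i] * A[best_i][j] * B[j][O[t]])
            back.append(best_i)                      # psi_t(j)
        delta = step
        psi.append(back)
    # Termination, then backtrack through psi to recover the path.
    q = max(range(N), key=lambda i: delta[i])
    path = [q]
    for back in reversed(psi):
        q = back[q]
        path.insert(0, q)
    return max(delta), path

def best_by_enumeration(O):
    """Maximize P(Q, O | lambda) over all N^T state sequences."""
    def joint(Q):
        p = pi[Q[0]] * B[Q[0]][O[0]]
        for t in range(1, len(O)):
            p *= A[Q[t - 1]][Q[t]] * B[Q[t]][O[t]]
        return p
    return max((joint(Q), list(Q)) for Q in product(range(N), repeat=len(O)))

p_star, path = viterbi(O)
```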

**Q3. Given a set of observation sequences, how do we learn the model probabilities that would generate them?**

Review our Resources

\(O = \{O_1, O_2, O_3, …, O_T\}\)

\(\lambda = (A, B, \pi) ?\)

\(\alpha_t(i) = P(O_1, O_2, O_3, …, O_t, q_t = S_i | \lambda)\)

\(\beta_t(i) = P(O_{t+1}, O_{t+2}, …, O_T | q_t = S_i, \lambda)\)

\(\gamma_t(i) = P(q_t = S_i | O, \lambda)\)

#### Baum-Welch Algorithm

There is no known way to solve for the globally optimal parameters of lambda

We will search for a locally optimal result

A result that converges to a stable good answer but isn’t guaranteed to be the best answer.

This is an EM method

– Expectation-Maximization (EM)

– Iterative, converges to a local optimum

loosely analogous to gradient "descent" in that it iteratively improves until reaching a local optimum

We need a new mathematical tool

\(\xi_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, \lambda)\)

\(\xi_t(i, j) = \frac{\alpha_t(i)a_{ij}b_j(O_{t+1})\beta_{t+1}(j)}{P(O|\lambda)}\)

\(\xi_t(i, j) = \frac{\alpha_t(i)a_{ij}b_j(O_{t+1})\beta_{t+1}(j)}{\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_t(i)a_{ij}b_j(O_{t+1})\beta_{t+1}(j)}\)

\(\xi_t(i, j)\) is related to \(\gamma_t(i)\)

\(\gamma_t(i) = \sum_{j=1}^{N}\xi_t(i, j)\)

\(\gamma_t(i)\) is the probability of being in \(S_i\) at time t

if we sum over all t then we have a number that can be treated as the expected number of times \(S_i\) is ever visited

\(\xi_t(i, j)\) is the probability of transitioning from \(S_i\) to \(S_j\) at time t

if we sum over all t then we have a number that can be treated as the expected number of times \(S_i\) ever transitions to \(S_j\)

\(\sum_{t=1}^{T-1}\gamma_t(i)\) = expected number of transitions from \(S_i\)

\(\sum_{t=1}^{T-1}\xi_t(i, j)\) = expected number of transitions from \(S_i\) to \(S_j\)

So now

– We have an existing model, λ = (A, B, π)

– We have a set of observations, O

– We have a set of tools \(\alpha_t(i), \beta_t(i), \gamma_t(i), \xi_t(i, j)\)

How do we use these to improve our model? \(\bar{\lambda}\) = ?

\(\bar{\pi}_i = \gamma_1(i)\) = expected frequency in \(S_i\) at time (t = 1)

\(\bar{a_{ij}} = \frac{\text{expected number of transitions from } S_i \text{ to } S_j}{\text{expected number of transitions from } S_i}\)

\(\bar{a_{ij}} = \frac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\gamma_t(i)}\)

\(\bar{b}_j(k) = \frac{\text{expected number of times in state } S_j \text{ and observing } v_k}{\text{expected number of times in state } S_j}\)

\(\bar{b}_j(k) = \frac{\sum_{t=1\ \text{s.t.}\ O_t=v_k}^{T}\gamma_t(j)}{\sum_{t=1}^{T}\gamma_t(j)}\)

Given λ = (A, B, π) and O we can produce \(\alpha_t(i), \beta_t(i), \gamma_t(i), \xi_t(i, j)\)

Given \(\alpha_t(i), \beta_t(i), \gamma_t(i), \xi_t(i, j)\) we can produce \(\bar{\lambda} = (\bar{A}, \bar{B}, \bar{\pi})\)
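One full re-estimation step can be sketched end to end (toy model; all parameter values are made up). EM guarantees the re-estimated model is no worse, i.e. \(P(O|\bar{\lambda}) \geq P(O|\lambda)\), so comparing `p_O_new` with `p_O` exercises that guarantee:

```python
# One Baum-Welch re-estimation step on a toy 2-state, 2-symbol model.
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
O  = [0, 1, 1, 0, 0, 1]
N, M, T = 2, 2, len(O)

def forward(A, B, pi):
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])
    return alpha

def backward(A, B):
    beta = [[1.0] * N]
    for t in range(T - 2, -1, -1):
        beta.insert(0, [sum(A[i][j] * B[j][O[t + 1]] * beta[0][j]
                            for j in range(N)) for i in range(N)])
    return beta

alpha, beta = forward(A, B, pi), backward(A, B)
p_O = sum(alpha[-1])                      # P(O | lambda)

# gamma_t(i) and xi_t(i, j); dividing by P(O | lambda) is equivalent to
# dividing by the double sum in the xi formula.
gamma = [[alpha[t][i] * beta[t][i] / p_O for i in range(N)] for t in range(T)]
xi = [[[alpha[t][i] * A[i][j] * B[j][O[t + 1]] * beta[t + 1][j] / p_O
        for j in range(N)] for i in range(N)] for t in range(T - 1)]

# Re-estimation formulas for pi-bar, a-bar, b-bar.
pi_new = [gamma[0][i] for i in range(N)]
A_new = [[sum(xi[t][i][j] for t in range(T - 1)) /
          sum(gamma[t][i] for t in range(T - 1)) for j in range(N)]
         for i in range(N)]
B_new = [[sum(gamma[t][j] for t in range(T) if O[t] == k) /
          sum(gamma[t][j] for t in range(T)) for k in range(M)]
         for j in range(N)]

p_O_new = sum(forward(A_new, B_new, pi_new)[-1])   # P(O | lambda-bar)
```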

### References

- Prof. Donald J. Patterson(djp3), (1/12) Hidden Markov Models 01: The Markov Property, Mar 20, 2020 – Apr 10, 2020, https://youtu.be/J_y5hx_ySCg?si=EXAPxw9ZDfNdgSWN
- ~
- Prof. Donald J. Patterson(djp3), (12/12) Hidden Markov Models 12: the Baum-Welch algorithm, Apr 10, 2020, https://youtu.be/JRsdt05pMoI?si=VYCOu8N6Ps6Akbll