Contents
- Introduction
- Hidden Markov Models
- Three Key HMM Problems
  1) Forward-Backward algorithm
  2) Viterbi algorithm
  3) Baum-Welch algorithm
Introduction
This is a compilation of a series of videos in which Prof. Patterson describes the Hidden Markov Model, starting with the Markov Model and proceeding to the three core questions for HMMs.
A Hidden Markov Model is a machine learning model for predicting sequences of states from indirect observations.
Hidden Markov Model
Markov property: a mathematical modeling constraint
Discrete Markov model
: a model is discrete if it has distinct states that the system can be in at any given moment
At discrete times the system changes state (possibly to the same state) according to the transition probabilities
\(P(q_t = S_z | q_{t-1} = S_y, … , q_2 = S_j, q_1 = S_i)\)
A full probabilistic description of a given system would require knowledge of the current state and all previous states
But with the Markov property, only the previous state matters
\(P(q_t = S_j | q_{t-1} = S_i)\)
If we assume that our transition probabilities don’t vary over time then
\(a_{ij} = P(q_t = S_j | q_{t-1} = S_i)\), 1 ≤ i, j ≤ N
\(a_{ij} \geq 0\)
\(\sum_{j=1}^{N} a_{ij} = 1\)
zero means we’re not going to make that transition ever
An Observable Markov Model
Given the model is in a known state, what is the probability that it will stay in that state for d days?
The formal observation sequence is
\(O = \{S_i, S_i, S_i, …, S_i, S_j\}\), where \(S_i\) appears d times and \(S_j \neq S_i\)
\(P(O|Model, q_1 = S_i) = (a_{ii})^{d-1}(1-a_{ii}) = p_i(d)\)
This is the discrete probability density function of duration d in state i \(p_i(d)\)
From this we can calculate the average stay in any state conditioned on the model and starting in the state
\(\bar{d_i} = \sum_{d=1}^{\infty}dp_i(d)\)
\(\bar{d_i} = \sum_{d=1}^{\infty}d(a_{ii})^{d-1}(1-a_{ii}) = \frac{1}{1-a_{ii}}\)
So, with \(a_{ii} = 0.8\) for the sunny state, the expected number of consecutive sunny days (given the model) is \(\frac{1}{1-a_{ii}} = \frac{1}{0.2} = 5\)
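These duration formulas are easy to check numerically. A minimal sketch (the function names and the truncation of the infinite sum are my own; \(a_{ii} = 0.8\) is assumed to match the sunny-day example):

```python
# Geometric duration distribution p_i(d) for self-transition probability
# a_ii, and a numerical check that its mean is 1 / (1 - a_ii).
def p_duration(a_ii, d):
    """Probability of staying in state i for exactly d steps."""
    return (a_ii ** (d - 1)) * (1 - a_ii)

def expected_duration(a_ii, max_d=10_000):
    """Approximate the infinite sum sum_d d * p_i(d) by truncating it."""
    return sum(d * p_duration(a_ii, d) for d in range(1, max_d + 1))

# With a_ii = 0.8, the mean stay is 1 / (1 - 0.8) = 5 steps.
```

The truncation at `max_d = 10_000` is safe here because the tail terms shrink geometrically.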
Imagine a person flipping a coin behind a curtain
Each result is called out and becomes an observation
- How should we model this situation?
- Should we assume a single fair coin is being flipped?
- Should we assume the caller is accurate?
A Hidden Markov Model has N states
They are hidden, but they typically correspond to something we know about the world
The states are generally fully connected (an ergodic model): any state can be reached from any other. S = \(\{S_1, S_2, S_3, S_4, S_5, …, S_N\}\)
state at time t is \(q_t\), \(\forall i : q_i \in S (1 \leq i \leq t)\)
A Hidden Markov Model has M observable symbols
They are what can be observed
They are a discrete alphabet
V = \(\{V_1, V_2, V_3, V_4, V_5, …, V_M\}\)
A Hidden Markov Model has State transition matrix A
A = \(\{a_{ij}\}\)
\(a_{ij} = P(q_t = S_j | q_{t-1} = S_i)\), 1 ≤ i, j ≤ N
If \(a_{ij} = 0\) then a transition from \(S_i\) to \(S_j\) is not possible
\(\sum_{j=1}^{N}a_{ij} = 1\)
A Hidden Markov Model has Observation probabilities
B = \(\{b_j(k)\}\)
\(b_j(k) = P(O_t = v_k | q_t = S_j)\), 1 ≤ j ≤ N, 1 ≤ k ≤ M
A Hidden Markov Model has Initial state probabilities
\(\pi = \{\pi_i\}\)
\(\pi_i = P(q_1 = S_i)\), 1 ≤ i ≤ N
The complete specification is HMM = \(\{N, M, A, B, \pi\}\), commonly abbreviated \(\lambda = (A, B, \pi)\)
\(O = \{O_1, O_2, O_3, …, O_T\}\)
To generate observations
1) Choose \(q_1 = S_i\) according to \(\pi_i\)
2) Set t = 1
3) Choose \(O_t = v_k\) according to B, i.e., \(b_i(k)\)
4) Transition to a new state \(q_{t+1} = S_j\) according to A, i.e., \(a_{ij}\)
5) Set t = t + 1; repeat from step 3 until t > T
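The generation procedure can be sketched directly. The 2-state, 2-symbol model below (the values of `A`, `B`, and `pi`) is hypothetical, invented purely for illustration:

```python
import random

# Hypothetical 2-state, 2-symbol HMM for illustration only.
A  = [[0.7, 0.3], [0.4, 0.6]]   # A[i][j]  = P(q_{t+1} = S_j | q_t = S_i)
B  = [[0.9, 0.1], [0.2, 0.8]]   # B[j][k]  = P(O_t = v_k | q_t = S_j)
pi = [0.6, 0.4]                 # pi[i]    = P(q_1 = S_i)

def sample(weights, rng):
    """Draw an index according to a discrete probability distribution."""
    r, acc = rng.random(), 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    return len(weights) - 1

def generate(T, seed=0):
    """Follow the procedure: pick q_1 from pi, then alternately emit O_t
    from B and transition via A until T observations are produced."""
    rng = random.Random(seed)
    q = sample(pi, rng)
    obs = []
    for _ in range(T):
        obs.append(sample(B[q], rng))
        q = sample(A[q], rng)
    return obs
```

For example, `generate(10)` yields a list of 10 symbol indices drawn from the model.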
Markov Model is a special case of a Hidden Markov Model
- One in which the observation distribution has one non-zero entry
- There is one observation symbol for each state
Three Key HMM Problems
What is the probability that a model generated a sequence of observations?
\(O = \{O_1, O_2, O_3, …, O_T\}, \lambda = (A, B, \pi)\)
\(P(O|\lambda) ?\)
- Given a sequence of observations, which apartment is the robot in?
- Which model best explains the observations?
Forward-Backward algorithm
Let’s start by imagining all possible state sequences
\(Q = q_1, q_2, q_3, …, q_T\)
Probability of seeing observations given those states is
\(P(O|Q, \lambda) = \prod_{t=1}^{T}P(O_t|q_t, \lambda)\)
\(P(O|Q, \lambda) = b_{q1}(O_1) \cdot b_{q2}(O_2) \cdots b_{qT}(O_T)\)
Probability of seeing those state transitions is \(P(Q|\lambda) = \pi_{q_1}a_{q_1 q_2}a_{q_2 q_3} \cdots a_{q_{T-1} q_T}\)
Probability of those seeing observations and those state transitions is P(O, Q|λ) = P(O|Q, λ)P(Q|λ)
But we want the probability of the observations regardless of the particular state sequence: P(O|λ) = \(\sum_{\text{all } Q}P(O|Q, \lambda)P(Q|\lambda)\)
Calculating this directly is infeasible P(O|λ) = \(\sum_{q_1,q_2,…,q_T}\pi_{q_1}b_{q_1}(O_1)a_{q_1q_2}b_{q_2}(O_2)a_{q_2q_3} \cdots a_{q_{T-1}q_T}b_{q_T}(O_T)\)
How many state sequences are there? \(N^T\)
How many multiplications per state sequence? 2T-1
Total number of operations? \((2T-1)N^T + (N^T-1)\)
T = 100 and N = 5, How many operations?
\((2T-1)N^T + (N^T-1) = (2(100) - 1)5^{100} + (5^{100} - 1) = 199 \cdot 5^{100} + 5^{100} - 1 = 200 \cdot 5^{100} - 1 \approx 5^{103} \approx 10^{72}\)
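A quick check of that arithmetic (Python integers are arbitrary precision, so \(5^{100}\) is computed exactly):

```python
# Operation count for the naive sum over all state sequences,
# with T = 100 observations and N = 5 states.
T, N = 100, 5
ops = (2 * T - 1) * N**T + (N**T - 1)   # simplifies to 200 * 5**100 - 1
print(f"{float(ops):.2e}")              # on the order of 10**72
```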
We need a better way: the forward-backward algorithm
Let’s introduce a helper function
\(\alpha_t(i) = P(O_1, O_2, O_3, …, O_t, q_t = S_i | \lambda)\)
Let’s solve for it inductively
- base case: \(\alpha_1(i) = \pi_i b_i(O_1)\), 1 ≤ i ≤ N
- inductive step:
\(\alpha_{t+1}(j) = [\sum_{i=1}^{N}\alpha_t(i)a_{ij}]b_j(O_{t+1})\), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N
final step: P(O|λ) = \(\sum_{i=1}^{N}\alpha_T(i)\)
Direct method: \((2T-1)N^T + (N^T-1)\), i.e., \(O(TN^T) \approx 10^{72}\) calculations
Forward algorithm: \(O(N^2 T) \approx 3000\) calculations for N = 5, T = 100
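A sketch of the forward recursion. The 2-state, 2-symbol model values below are invented for illustration:

```python
# Hypothetical 2-state, 2-symbol HMM for illustration only.
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
N  = 2

def forward(O):
    """alpha[t][i] = P(O_1..O_{t+1}, q_{t+1} = S_i | lambda), 0-indexed in t."""
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]       # base case
    for t in range(1, len(O)):                             # inductive step
        alpha.append([
            sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
            for j in range(N)
        ])
    return alpha

def likelihood(O):
    """P(O | lambda) = sum_i alpha_T(i)."""
    return sum(forward(O)[-1])
```

Each step touches every (i, j) pair once, which is where the \(O(N^2 T)\) cost comes from.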
The forward part is all we need to solve the first problem \(\alpha_t(i)\)
But let’s do the same thing backward to help solve the other problems
\(\beta_t(i) = P(O_{t+1}, O_{t+2}, … O_T | q_t = S_i, \lambda)\)
Let’s solve for it inductively
- base case: \(\beta_T(i) = 1\), 1 ≤ i ≤ N
- inductive step:
\(\beta_t(i) = \sum_{j=1}^{N}a_{ij}b_j(O_{t+1})\beta_{t+1}(j)\), t = T-1, T-2, …, 1, 1 ≤ i ≤ N
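A matching sketch of the backward recursion (the 2-state model values are invented for illustration):

```python
# Hypothetical 2-state, 2-symbol HMM for illustration only.
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
N  = 2

def backward(O):
    """beta[t][i] = P(O_{t+2}..O_T | q_{t+1} = S_i, lambda), 0-indexed in t."""
    T = len(O)
    beta = [[1.0] * N for _ in range(T)]          # base case: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                # walk backward in time
        beta[t] = [
            sum(A[i][j] * B[j][O[t + 1]] * beta[t + 1][j] for j in range(N))
            for i in range(N)
        ]
    return beta
```

As a consistency check, \(\sum_i \pi_i b_i(O_1)\beta_1(i)\) equals \(P(O|\lambda)\), the same value the forward pass produces.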
Motivating the Viterbi algorithm
Q1. What is the probability that a model generated a sequence of observations?
\(O = \{O_1, O_2, O_3, …, O_T\}, \lambda = (A, B, \pi)\)
\(P(O|\lambda) ?\)
P(O|λ) = \(\sum_{i=1}^{N}\alpha_T(i)\)
\(\alpha_t(i) = P(O_1, O_2, O_3, …, O_t, q_t = S_i | \lambda)\)
\(\beta_t(i) = P(O_{t+1}, O_{t+2}, …, O_T | q_t = S_i, \lambda)\)
The Forward-Backward Algorithm
Q2. What sequence of states best explains a sequence of observations?
\(O = \{O_1, O_2, O_3, …, O_T\}\)
\(Q = \{q_1, q_2, q_3, …, q_T\} ?\)
What does “best” mean?
1) choose states that are individually most likely \(\gamma_t(i) = P(q_t = S_i | O, \lambda)\)
\( \gamma_t(i) = \frac{\alpha_t(i)\beta_t(i)}{P(O|\lambda)} = \frac{\alpha_t(i)\beta_t(i)}{\sum_{j=1}^{N}\alpha_t(j)\beta_t(j)}\)
\(\sum_{i=1}^{N}\gamma_t(i) = 1\)
\(q_t = argmax_{1 \leq i \leq N}[\gamma_t(i)]\), 1 ≤ t ≤ T
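A sketch of computing \(\gamma_t(i)\) from the forward and backward variables (the 2-state model values are invented for illustration):

```python
# Hypothetical 2-state, 2-symbol HMM for illustration only.
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
N  = 2

def forward(O):
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, len(O)):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])
    return alpha

def backward(O):
    T = len(O)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][O[t + 1]] * beta[t + 1][j] for j in range(N))
                   for i in range(N)]
    return beta

def gamma(O):
    """gamma[t][i] = P(q_t = S_i | O, lambda); each row sums to 1."""
    al, be = forward(O), backward(O)
    out = []
    for t in range(len(O)):
        z = sum(al[t][i] * be[t][i] for i in range(N))   # = P(O | lambda)
        out.append([al[t][i] * be[t][i] / z for i in range(N)])
    return out
```

The per-row normalization mirrors the denominator \(\sum_j \alpha_t(j)\beta_t(j)\) in the equation above.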
(1) This will maximize the expected number of individually correct states, but … it might not make sense as a path: it can pick consecutive states whose transition probability is zero
(2) choose the single path that maximizes P(Q|O, λ), which is equivalent to maximizing P(Q, O|λ)
\(P(\{q_1, q_2, q_3, …, q_T\}, \{O_1, O_2, O_3, …, O_T\}|\lambda)\)
The Viterbi Algorithm
What is the path with the highest probability that accounts for the first t observations and ends at state \(S_i\)?
\(\delta_t(i) = max_{q_1,q_2,…,q_{t-1}}P(q_1, q_2, …, q_{t-1}, q_t = S_i, O_1, O_2, …, O_t|\lambda)\)
induction step
\(\delta_{t+1}(j) = [max_i \delta_t(i)a_{ij}]\cdot b_j(O_{t+1})\)
We need to keep track of which i maximized the result at each time step
Initialization \(\delta_1(i) = \pi_i b_i(O_1)\), \(\psi_1(i) = 0\), 1 ≤ i ≤ N
Inductive step
\(\delta_t(j) = max_{1 \leq i \leq N}[\delta_{t-1}(i)a_{ij}]\cdot b_j(O_t)\)
\(\psi_t(j) = argmax_{1 \leq i \leq N}[\delta_{t-1}(i)a_{ij}]\), 2 ≤ t ≤ T, 1 ≤ j ≤ N
Termination
\(P^{*} = max_{1 \leq i \leq N}[\delta_T(i)]\), \(q_T^{*} = argmax_{1 \leq i \leq N}[\delta_T(i)]\)
Path backtracking \(q_t^{*} = \psi_{t+1}(q_{t+1}^{*})\), t = T-1, T-2, …, 1
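The full procedure can be sketched as follows (hypothetical 2-state model, values invented for illustration; `viterbi` returns \(P^{*}\) and the backtracked state path):

```python
# Hypothetical 2-state, 2-symbol HMM for illustration only.
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
N  = 2

def viterbi(O):
    """Return (P*, best state path) for the observation sequence O."""
    delta = [pi[i] * B[i][O[0]] for i in range(N)]   # initialization
    psi = []                                          # backpointers
    for t in range(1, len(O)):
        new_delta, back = [], []
        for j in range(N):
            scores = [delta[i] * A[i][j] for i in range(N)]
            best = max(range(N), key=lambda i: scores[i])
            back.append(best)                         # psi_t(j)
            new_delta.append(scores[best] * B[j][O[t]])
        delta = new_delta
        psi.append(back)
    # termination: best final state, then backtrack through psi
    q = max(range(N), key=lambda i: delta[i])
    p_star = delta[q]
    path = [q]
    for back in reversed(psi):
        path.append(back[path[-1]])
    return p_star, path[::-1]
```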
The Viterbi recursion for the path that maximizes P(Q, O|λ) has the same structure as the forward recursion, with the sum over previous states replaced by a max:
\(\delta_t(j) = max_{1 \leq i \leq N}[\delta_{t-1}(i)a_{ij}]\cdot b_j(O_t)\)
\(\alpha_t(j) = [\sum_{i=1}^{N}\alpha_{t-1}(i)a_{ij}] b_j(O_t)\)
Q3. Given a set of observation sequences, how do we learn the model probabilities that would generate them?
Review our Resources
\(O = \{O_1, O_2, O_3, …, O_T\}\)
\(\lambda = (A, B, \pi) ?\)
\(\alpha_t(i) = P(O_1, O_2, O_3, …, O_t, q_t = S_i | \lambda)\)
\(\beta_t(i) = P(O_{t+1}, O_{t+2}, …, O_T | q_t = S_i, \lambda)\)
\(\gamma_t(i) = P(q_t = S_i | O, \lambda)\)
Baum-Welch Algorithm
There is no known way to solve for the globally optimal parameters of \(\lambda\)
We will search for a locally optimal result
A result that converges to a stable good answer but isn’t guaranteed to be the best answer.
This is an EM method
– Expectation-Maximization (EM)
– Iterative, converges to a local optimum
aka gradient “descent”: each iteration improves the likelihood, but it only reaches a local optimum
We need a new mathematical tool
\(\xi_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, \lambda)\)
\(\xi_t(i, j) = \frac{\alpha_t(i)a_{ij}b_j(O_{t+1})\beta_{t+1}(j)}{P(O|\lambda)}\)
\(\xi_t(i, j) = \frac{\alpha_t(i)a_{ij}b_j(O_{t+1})\beta_{t+1}(j)}{\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_t(i)a_{ij}b_j(O_{t+1})\beta_{t+1}(j)}\)
\(\xi_t(i, j)\) is related to \(\gamma_t(i)\)
\(\gamma_t(i) = \sum_{j=1}^{N}\xi_t(i, j)\)
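A sketch of \(\xi_t(i, j)\) built from the forward and backward variables (hypothetical 2-state model; values invented for illustration). Each time slice is a joint distribution over (i, j), so it sums to 1:

```python
# Hypothetical 2-state, 2-symbol HMM for illustration only.
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
N  = 2

def forward(O):
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, len(O)):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])
    return alpha

def backward(O):
    T = len(O)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][O[t + 1]] * beta[t + 1][j] for j in range(N))
                   for i in range(N)]
    return beta

def xi(O):
    """xi[t][i][j] = P(q_t = S_i, q_{t+1} = S_j | O, lambda), t = 1..T-1."""
    al, be = forward(O), backward(O)
    p_obs = sum(al[-1])                                   # P(O | lambda)
    return [[[al[t][i] * A[i][j] * B[j][O[t + 1]] * be[t + 1][j] / p_obs
              for j in range(N)] for i in range(N)]
            for t in range(len(O) - 1)]
```

Summing `xi[t][i]` over j recovers \(\gamma_t(i)\), matching the relation above.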
\(\gamma_t(i)\) is the probability of being in \(S_i\) at time t
if we sum over all t then we have a number that can be treated as the expected number of times \(S_i\) is ever visited
\(\xi_t(i, j)\) is the probability of transitioning from \(S_i\) to \(S_j\) at time t
if we sum over all t then we have a number that can be treated as the expected number of times \(S_i\) ever transitions to \(S_j\)
\(\sum_{t=1}^{T-1}\gamma_t(i)\) = expected number of transitions from \(S_i\)
\(\sum_{t=1}^{T-1}\xi_t(i, j)\) = expected number of transitions from \(S_i\) to \(S_j\)
So now
– We have an existing model, λ = (A, B, π)
– We have a set of observations, O
– We have a set of tools \(\alpha_t(i), \beta_t(i), \gamma_t(i), \xi_t(i, j)\)
How do we use these to improve our model? \(\bar{\lambda}\) = ?
\(\bar{\pi}_i = \gamma_1(i)\) = expected frequency in \(S_i\) at time (t = 1)
\(\bar{a_{ij}} = \frac{\text{expected number of transitions from } S_i \text{ to } S_j}{\text{expected number of transitions from } S_i}\)
\(\bar{a_{ij}} = \frac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\gamma_t(i)}\)
\(\bar{b_j(k)} = \frac{\text{expected number of times in state j and observing } v_k}{\text{expected number of times in state j}}\)
\(\bar{b}_j(k) = \frac{\sum_{t=1,\,O_t=v_k}^{T}\gamma_t(j)}{\sum_{t=1}^{T}\gamma_t(j)}\)
Given λ = (A, B, π) and O we can produce \(\alpha_t(i), \beta_t(i), \gamma_t(i), \xi_t(i, j)\)
Given \(\alpha_t(i), \beta_t(i), \gamma_t(i), \xi_t(i, j)\) we can produce \(\bar{\lambda} = (\bar{A}, \bar{B}, \bar{\pi})\)
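One full re-estimation iteration can be sketched as below (hypothetical 2-state, 2-symbol model; all numeric values invented). The E-step computes \(\gamma\) and \(\xi\); the M-step applies the expected-count ratio formulas:

```python
# Hypothetical 2-state, 2-symbol HMM for illustration only.
N, M = 2, 2
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]

def forward(O, A, B, pi):
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, len(O)):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])
    return alpha

def backward(O, A, B):
    T = len(O)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][O[t + 1]] * beta[t + 1][j] for j in range(N))
                   for i in range(N)]
    return beta

def baum_welch_step(O, A, B, pi):
    """One EM iteration: E-step computes gamma and xi under (A, B, pi);
    M-step re-estimates the parameters as expected-count ratios."""
    T = len(O)
    al, be = forward(O, A, B, pi), backward(O, A, B)
    p_obs = sum(al[-1])                                   # P(O | lambda)
    gam = [[al[t][i] * be[t][i] / p_obs for i in range(N)] for t in range(T)]
    xi = [[[al[t][i] * A[i][j] * B[j][O[t + 1]] * be[t + 1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gam[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gam[t][i] for t in range(T - 1)) for j in range(N)]
             for i in range(N)]
    new_B = [[sum(gam[t][j] for t in range(T) if O[t] == k) /
              sum(gam[t][j] for t in range(T)) for k in range(M)]
             for j in range(N)]
    return new_A, new_B, new_pi
```

Iterating `baum_welch_step` never decreases \(P(O|\lambda)\), which is the EM convergence guarantee mentioned above.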
References
- Prof. Donald J. Patterson(djp3), (1/12) Hidden Markov Models 01: The Markov Property, Mar 20, 2020 – Apr 10, 2020, https://youtu.be/J_y5hx_ySCg?si=EXAPxw9ZDfNdgSWN
- ~
- Prof. Donald J. Patterson(djp3), (12/12) Hidden Markov Models 12: the Baum-Welch algorithm, Apr 10, 2020, https://youtu.be/JRsdt05pMoI?si=VYCOu8N6Ps6Akbll