TIL: Hidden Markov Model
While doing research for a consulting project that sadly went nowhere (classifying rock type by core samples if you’re asking), I discovered Hidden Markov Models (HMMs).
Weather example
Imagine we want to classify the state of weather, as us British are wont to do, based on the Temperature (C) and Humidity (%).
We might design a naive model that could classify the three different states (Sunny, Cloudy and Rainy) based on thresholds in the temperature and humidity variables. The scatter points in the plot below represent the observed temperature and humidity measurements for 60 time steps, which clearly have some noise as they have a good amount of overlap. The dotted black lines show the decision boundaries of the Naive classifier, which does an OK job of classifying states, but noisy data could lead to the predicted state flipping back and forth in an un-natural way.
We can avoid this with the genius of a Hidden Markov Model that allows us to define both:
- How likely the transition between states is
- What is the probability of seeing parameters T and H, given a state $s$?
In technical lingo, the HMM allows us to encode our special British knowledge into the “transition matrix” and “emission probabilities”.
Transition matrix
TRANSITION = np.array([
# Sunny Cloudy Rainy
[0.75, 0.20, 0.05], # from Sunny
[0.25, 0.55, 0.20], # from Cloudy
[0.05, 0.35, 0.60], # from Rainy
])
The transition probabilities can be see above. As an example how to read the matrix, we are saying if it is currently Sunny there is a 75% chance it will remain Sunny, 20% it will become cloudy and only 5% it will transition straight to rainy.
This is a really elegant design, since we could even update our knowledge. Maybe we think the chance of transitioning from state Sunny -> Rainy is more probable in April over August.
Emission probabilities
The emission probabilities capture how likely a certain measurement is, given we are in the titular “hidden markov state”.
EMISSION_MEANS = np.array([
[24.0, 45.0], # Sunny
[19.0, 65.0], # Cloudy
[14.0, 80.0], # Rainy
])
EMISSION_COVS = np.array([
[[4.5**2, 0.0 ], [0.0, 14.0**2]], # Sunny
[[4.5**2, 0.0 ], [0.0, 14.0**2]], # Cloudy
[[4.0**2, 0.0 ], [0.0, 12.0**2]], # Rainy
])
In the example above, the probability of observing a temperature T when it is Sunny follows a Gaussian with $\mu = 24C$ and a $\sigma = 4.5C$
Note: I am just using Gaussian probabilities because it makes things easy here. The model can work with any distribution
This can be visualised by plotting the same scatter points as above, but this time with the shaded regions showing the 65% confidence interval of each state
Viterbi algorithm
So how do we find the most probable path through all of these states? There are a few options, but the most ingenious is the Viterbi algorithm which goes a little something like this…
At the first time step, we see the emission, $o_1$. Therefore, we can represent the probability of the first state $s_1$ with the following product:
\[P(s_1{=}s,\; o_1) \;=\; \underbrace{\pi_s}_{\text{start in } s} \;\times\; \underbrace{f(o_1 \mid s)}_{\text{which then emits } o_1}\]where $\pi_s$ is the starting probabilities of each state and $f$ is our Gaussian function from earlier.
Using the Markov property that “each state is only depends on the one before it”, we can create a recursive equation for the score of the best path ending in state $s$ at each step (i.e. the joint probability of that path having produced all observations $o_1,\ldots,o_t$).
\[D_t(s) \;=\; \max_{s'} \Big[ \, D_{t-1}(s') \cdot A_{s' \to s} \, \Big] \cdot f(o_t \mid s)\]- $D_{t-1}(s’)$ is the “probability of the best path ending in state $s’$ at the previous step, having explained $o_1, \ldots, o_{t-1}$.” This is the running joint probability of one specific path through the states.
- $\times \; A_{s’ \to s}$ represents the transition from $s’$ to $s$. Given the Markov property, we just read this from the transition matrix
- $\max_{s’}\;[\ldots]$ keeps the winning path at each step.
- $\cdot \; f(o_t \mid s)$ captures the probability of this state, $s$, emitting observation $o_t$.”
However, before we try and compute this we need to realise that we are going to be multiplying many small numbers together. If we are multiply probabilities like 0.05*0.03 (~$10^{-4}$) at each time step we end up trying to compute numbers like $10^{-240}$ for our 60-step sequence. Obviously this is madness, so instead we will take the log-probabilities so we end up with the following equation:
\[\delta_t(s) = \max_{s'}\big[\delta_{t-1}(s') + \log A_{s' \to s}\big] + \log f(o_t \mid s)\]For each destination state, we can recored which previous state $s’$ led to the maximum score at that step.
\[\psi_t(s) \;=\; \arg\max_{s'} \Big[ \delta_{t-1}(s') + \log A_{s' \to s} \Big]\]i.e. “for each destination state $s$, $\psi_t(s)$ records which previous state $s’$ achieved the max.”
At the end of the sequence we pick the highest-scoring final cell: \(\hat s_T \;=\; \arg\max_s \, \delta_T(s)\) and then walk the stored backpointers right-to-left to reconstruct the full path: \(\hat s_{t-1} \;=\; \psi_t(\hat s_t), \quad t = T, T-1, \ldots, 2\)
Better explanation here
Run through
Let’s walk through the first two steps of this algorithm, with the temperature and humidity readings, along with their true state, in the table below:
| $T$ (°C) | $H$ (%) | true state | |
|---|---|---|---|
| $o_1$ | 17.77 | 69.18 | Cloudy |
| $o_2$ | 19.54 | 38.63 | Sunny |
At step 1 there is no previous state to transition from, so we initialise with just the prior on the starting state and the likelihood of the first observation
| State | $\log \pi_s$ | $\log f(o_1 \mid s)$ | $\delta_1(s)$ |
|---|---|---|---|
| Sunny | $-0.69$ | $-8.43$ | $-9.13$ |
| Cloudy | $-1.20$ | $-6.06$ | $\mathbf{-7.27}$ |
| Rainy | $-1.61$ | $-6.56$ | $-8.17$ |
For step 2, we compute three maxes (one per destination state), each giving us 3 new scores $\delta_2(s)$. Alongside each score we store a backpointer $\psi_2(s)$ that remembers which previous state achieved that max:
\[\psi_2(s) \;=\; \arg\max_{s'} \Big[ \delta_1(s') + \log A_{s' \to s} \Big]\]Working table — every candidate $\delta_1(s’) + \log A_{s’\to s}$
Using the transition matrix:
A = np.array([
# Sunny Cloudy Rainy
[0.75, 0.20, 0.05], # from Sunny
[0.25, 0.55, 0.20], # from Cloudy
[0.05, 0.35, 0.60], # from Rainy
])
For each destination, the bold entry wins. The two right-hand columns record the winning score and which previous state achieved it:
| destination $s$ | from Sunny | from Cloudy | from Rainy | winning score | $\psi_2(s)$ |
|---|---|---|---|---|---|
| Sunny | $-9.13 + (-0.29) = -9.42$ | $\mathbf{-7.27 + (-1.39) = -8.65}$ | $-8.17 + (-3.00) = -11.16$ | $\mathbf{-8.65}$ | Cloudy |
| Cloudy | $-9.13 + (-1.61) = -10.73$ | $\mathbf{-7.27 + (-0.60) = -7.87}$ | $-8.17 + (-1.05) = -9.22$ | $\mathbf{-7.87}$ | Cloudy |
| Rainy | $-9.13 + (-3.00) = -12.12$ | $-7.27 + (-1.61) = -8.88$ | $\mathbf{-8.17 + (-0.51) = -8.68}$ | $\mathbf{-8.68}$ | Rainy |
So the backpointers are split between Cloudy (if step 2’s state is Sunny or Cloudy) and Rainy (if step 2 is Rainy).
Finishing $\delta_2(s)$
To get to $\delta_2(s)$, we need to add the emission term for today’s observation:
\[\delta_2(s) \;=\; \underbrace{\max_{s'}\big[\delta_1(s') + \log A_{s'\to s}\big]}_{\text{winning score above}} \;+\; \underbrace{\log f(o_2 \mid s)}_{\text{how well } s \text{ explains } o_2}\]So the emission can change which destination is the overall best at this step, but it never changes the backpointers $\psi_2(s)$.
| $s$ | winning score | $\log f(o_2 \mid s)$ | $\delta_2(s)$ | $\psi_2(s)$ |
|---|---|---|---|---|
| Sunny | $-8.65$ | $-6.58$ | $\mathbf{-15.23}$ | Cloudy |
| Cloudy | $-7.87$ | $-7.76$ | $-15.63$ | Cloudy |
| Rainy | $-8.68$ | $-12.61$ | $-21.29$ | Rainy |
If we were to end the algorithm here, we would see that step = 2 has the highest score for Sunny. We can look back at $\psi_2(s)$ to see that if Sunny was step 2, then Cloudy would be step 1.
However, its very important to note that if we continue moving through the observations we could see a different path emerge as the global best score. In this case, the full sequence can be seen in the plot below, where it turns out the best path starts with the same Cloudy → Sunny as in our tabular example above
Rather than flipping between states like the naive model, the HMM through Viterbi’s algorithm is able to smoothly transition between states in a natural way and does very well when compared to the true states.
Modern uses
So there you go, its always fun to stumble across an algorithm that has been quietly decoding digital communications and tracking my sleep cycles and more.