Svanik Sharma's Website

Bellman Optimality

May 14, 2026

I just wanted to write some notes down on Bellman optimality equations in detail, since some of the sources I have read don't write the details out.

Bellman Optimality Equations

For a policy π\pi, we will determine:

  1. Computing vπv_\pi in terms of qπq_\pi.
  2. Computing qπq_\pi in terms of vπv_\pi.
  3. Computing vπv_\pi in terms of vπv_\pi (Bellman optimality)
  4. Computing qπq_\pi in terms of qπq_\pi (Bellman optimality)

Iterated Expectation

The subsequent derivations rely on iterated expectations. In particular, given random variables XX, YY, ZZ, we need to prove:

E[XZ]=E[E[XY,Z]Z] \begin{align*} \mathbb{E}[X|Z] = \mathbb{E}[\mathbb{E}[X|Y, Z]|Z] \end{align*}

We will assume these variables are continuous (in the discrete case replace integrals with sums):

E[XZ]=xp(xz)dx=xp(x,yz)dydxSum rule=xp(xy,z)p(yz)dydxBayes’ Theorem=xp(xy,z)p(yz)dxdyFubini’s Theorem=p(yz)xp(xy,z)dxdy=E[E[XY,Z]Z] \begin{align*} \mathbb{E}[X|Z] &= \int x p(x|z) dx \\ &= \int x \int p(x, y | z) dy dx & \text{Sum rule} \\ &= \int x \int p(x|y, z) p(y|z) dy dx & \text{Bayes' Theorem} \\ &= \int x \int p(x|y, z) p(y|z) dx dy & \text{Fubini's Theorem} \\ &= \int p(y|z) \int x p(x|y, z) dx dy \\ &= \mathbb{E}[\mathbb{E}[X|Y, Z]|Z] \end{align*}

Computing vπv_\pi in terms of qπq_\pi

vπ(s)=Eπ[GtSt=s]=E[Eπ[GtSt=s,At]St=s]=aA(s)π(as)Eπ[GtSt=s,At=a]=aA(s)π(as)qπ(as) \begin{align*} v_\pi(s) &= \mathbb{E}_\pi[G_t | S_t = s] \\ &= \mathbb{E}[\mathbb{E}_\pi[G_t | S_t = s, A_t] | S_t = s] \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a|s) \mathbb{E}_\pi[G_t | S_t = s, A_t = a] \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a|s) q_\pi(a|s) \end{align*}

Computing qπq_\pi in terms of vπv_\pi

qπ(s,a)=Eπ[GtSt=s,At=a]=Eπ[Rt+1+γGt+1St=s,At=a]=E[Eπ[Rt+1+γGt+1St=s,At=a,St+1]St=s,At=a]=E[Rt+1+γEπ[Gt+1St=s,At=a,St+1]St=s,At=a]Rt+1 is only dependent on StAt, not on π=E[Rt+1+γEπ[Gt+1St+1]St=s,At=a]Markov Property=E[Rt+1+γvπ(St+1)St=s,At=a]=s,rp(s,rs,a)[r+γvπ(s)] \begin{align*} q_\pi(s, a) &= \mathbb{E}_\pi[G_t | S_t = s, A_t = a] \\ &= \mathbb{E}_\pi [R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a] \\ &= \mathbb{E}[\mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a, S_{t+1}] | S_t = s, A_t = a] \\ &= \mathbb{E}[R_{t+1} + \gamma \mathbb{E}_\pi[G_{t+1} | S_t = s, A_t = a, S_{t+1}]| S_t = s, A_t = a] & \text{$R_{t+1}$ is only dependent on $S_t$, $A_t$, not on $\pi$} \\ &= \mathbb{E}[R_{t+1} + \gamma \mathbb{E}_\pi[G_{t+1} | S_{t+1}]| S_t = s, A_t = a] & \text{Markov Property} \\ &= \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s, A_t = a] \\ &= \sum_{s', r} p(s', r | s, a) \Bigl[r + \gamma v_\pi(s')\Bigl] \end{align*}

Computing vπv_\pi in terms of vπv_\pi

vπ(s)=aA(s)π(as)qπ(as)Based on first section=aA(s)π(as)Eπ[GtSt=s,At=a]=aA(s)π(as)Eπ[Rt+1+γGt+1St=s,At=a]=aA(s)π(as)E[Eπ[Rt+1+γGt+1St=s,At=a,St+1]St=s,At=a]=aA(s)π(as)E[Rt+1+γEπ[Gt+1St=s,At=a,St+1]St=s,At=a]Rt+1 is only dependent on StAt, not on π=aA(s)π(as)E[Rt+1+γEπ[Gt+1St+1]St=s,At=a]=aA(s)π(as)E[Rt+1+γvπ(St+1)St=s,At=a]Markov Property=aA(s)π(as)s,rp(s,rs,a)[r+γvπ(s)] \begin{align*} v_\pi(s) &= \sum_{a \in \mathcal{A}(s)} \pi(a|s) q_\pi(a|s) & \text{Based on first section} \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a|s) \mathbb{E}_\pi[G_t | S_t = s, A_t = a] \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a|s) \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a] \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a|s) \mathbb{E}[\mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a, S_{t+1}]|S_t = s, A_t = a] \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a|s) \mathbb{E}[R_{t+1} + \gamma \mathbb{E}_\pi[G_{t+1} | S_t = s, A_t = a, S_{t+1}]| S_t = s, A_t = a] & \text{$R_{t+1}$ is only dependent on $S_t$, $A_t$, not on $\pi$} \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a|s) \mathbb{E}[R_{t+1} + \gamma \mathbb{E}_\pi[G_{t+1} | S_{t+1}]|S_t = s, A_t = a] \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a|s) \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1})|S_t = s, A_t = a] & \text{Markov Property} \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a|s) \sum_{s', r} p(s', r|s, a) \Bigl[r + \gamma v_\pi(s')\Bigl] \end{align*}

Computing qπq_\pi in terms of qπq_\pi

qπ(s,a)=s,rp(s,rs,a)[r+γvπ(s)]Based on second section=s,rp(s,rs,a)[r+γaA(s)π(as)qπ(s,a)]Based on first section \begin{align*} q_\pi(s, a) &= \sum_{s', r} p(s', r | s, a) \Bigl[r + \gamma v_\pi(s')\Bigl] & \text{Based on second section} \\ &= \sum_{s', r} p(s', r | s, a) \Bigl[r + \gamma \sum_{a' \in \mathcal{A}(s')} \pi(a'|s') q_\pi(s', a')\Bigl] & \text{Based on first section} \\ \end{align*}