I just wanted to write some notes on the Bellman optimality equations in detail, since some of the sources I have read don't spell the details out.
# Bellman Optimality Equations
For a policy $\pi$, we will derive:

- $v_\pi$ in terms of $q_\pi$.
- $q_\pi$ in terms of $v_\pi$.
- $v_\pi$ in terms of $v_\pi$ (Bellman optimality).
- $q_\pi$ in terms of $q_\pi$ (Bellman optimality).
## Iterated Expectation
The subsequent derivations rely on the law of iterated expectations. In particular, given random variables $X$, $Y$, $Z$, we need to prove:

$$\mathbb{E}[X \mid Z] = \mathbb{E}\big[\mathbb{E}[X \mid Y, Z] \mid Z\big]$$

We will assume these variables are continuous (in the discrete case, replace integrals with sums):
$$
\begin{aligned}
\mathbb{E}[X \mid Z] &= \int x\, p(x \mid z)\, dx \\
&= \int x \int p(x, y \mid z)\, dy\, dx && \text{sum rule} \\
&= \int x \int p(x \mid y, z)\, p(y \mid z)\, dy\, dx && \text{Bayes' theorem} \\
&= \int \int x\, p(x \mid y, z)\, p(y \mid z)\, dx\, dy && \text{Fubini's theorem} \\
&= \int p(y \mid z) \int x\, p(x \mid y, z)\, dx\, dy \\
&= \mathbb{E}\big[\mathbb{E}[X \mid Y, Z] \mid Z\big]
\end{aligned}
$$
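As a sanity check, the identity can be verified numerically on a small discrete joint distribution. The table below is made up for illustration; any normalized $p(x, y, z)$ works.

```python
import itertools

# Made-up joint distribution p(x, y, z) over {0,1}^3; the weights
# (1 + x + 2y + 3z) sum to 32 over the 8 cells, so dividing normalizes it.
vals = [0, 1]
p = {
    (x, y, z): (1 + x + 2 * y + 3 * z) / 32.0
    for x, y, z in itertools.product(vals, vals, vals)
}

def cond_exp_x_given_z(z):
    # E[X | Z=z] computed directly from the joint table
    num = sum(x * p[(x, y, z)] for x in vals for y in vals)
    den = sum(p[(x, y, z)] for x in vals for y in vals)
    return num / den

def cond_exp_x_given_yz(y, z):
    # inner expectation E[X | Y=y, Z=z]
    num = sum(x * p[(x, y, z)] for x in vals)
    den = sum(p[(x, y, z)] for x in vals)
    return num / den

def iterated(z):
    # outer expectation: sum_y p(y | z) * E[X | Y=y, Z=z]
    den = sum(p[(x, y, z)] for x in vals for y in vals)
    return sum(
        (sum(p[(x, y, z)] for x in vals) / den) * cond_exp_x_given_yz(y, z)
        for y in vals
    )

for z in vals:
    assert abs(cond_exp_x_given_z(z) - iterated(z)) < 1e-12
```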
## Computing $v_\pi$ in terms of $q_\pi$
$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \mathbb{E}\big[\mathbb{E}_\pi[G_t \mid S_t = s, A_t] \mid S_t = s\big] && \text{iterated expectation} \\
&= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \\
&= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, q_\pi(s, a)
\end{aligned}
$$
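Concretely, the last line is just a policy-weighted average of action values. A tiny example with made-up numbers (the policy and $q$ values are not from any particular MDP):

```python
# Hypothetical single-state, two-action example:
# v_pi(s) = sum_a pi(a|s) * q_pi(s, a)
pi = {"s0": {"left": 0.25, "right": 0.75}}
q = {("s0", "left"): 1.0, ("s0", "right"): 3.0}

def v_from_q(s):
    return sum(pi[s][a] * q[(s, a)] for a in pi[s])

# 0.25 * 1.0 + 0.75 * 3.0 = 2.5
print(v_from_q("s0"))
```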
## Computing $q_\pi$ in terms of $v_\pi$
$$
\begin{aligned}
q_\pi(s, a) &= \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&= \mathbb{E}\big[\mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a, S_{t+1}] \mid S_t = s, A_t = a\big] \\
&= \mathbb{E}\big[R_{t+1} + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a, S_{t+1}] \mid S_t = s, A_t = a\big] && R_{t+1}\ \text{depends only on}\ S_t, A_t,\ \text{not on}\ \pi \\
&= \mathbb{E}\big[R_{t+1} + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1}] \mid S_t = s, A_t = a\big] && \text{Markov property} \\
&= \mathbb{E}\big[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s, A_t = a\big] \\
&= \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma\, v_\pi(s')\big]
\end{aligned}
$$
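The final line is a one-step backup through the dynamics. A minimal sketch with invented $p(s', r \mid s, a)$ stored as (probability, next state, reward) triples, and made-up state values:

```python
# Hypothetical dynamics and values, for illustration only
gamma = 0.9
dynamics = {("s0", "a0"): [(0.8, "s1", 1.0), (0.2, "s2", 0.0)]}
v = {"s1": 2.0, "s2": -1.0}

def q_from_v(s, a):
    # q_pi(s, a) = sum_{s', r} p(s', r | s, a) * (r + gamma * v_pi(s'))
    return sum(prob * (r + gamma * v[s_next])
               for prob, s_next, r in dynamics[(s, a)])

# 0.8 * (1.0 + 0.9 * 2.0) + 0.2 * (0.0 + 0.9 * (-1.0)) = 2.24 - 0.18 = 2.06
print(q_from_v("s0", "a0"))
```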
## Computing $v_\pi$ in terms of $v_\pi$
$$
\begin{aligned}
v_\pi(s) &= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, q_\pi(s, a) && \text{based on first section} \\
&= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \\
&= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, \mathbb{E}\big[\mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a, S_{t+1}] \mid S_t = s, A_t = a\big] \\
&= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, \mathbb{E}\big[R_{t+1} + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a, S_{t+1}] \mid S_t = s, A_t = a\big] && R_{t+1}\ \text{depends only on}\ S_t, A_t,\ \text{not on}\ \pi \\
&= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, \mathbb{E}\big[R_{t+1} + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1}] \mid S_t = s, A_t = a\big] && \text{Markov property} \\
&= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, \mathbb{E}\big[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s, A_t = a\big] \\
&= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma\, v_\pi(s')\big]
\end{aligned}
$$
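Since $v_\pi$ now appears on both sides, the equation can be solved by fixed-point iteration (iterative policy evaluation). A sketch on an invented two-state MDP with a single deterministic action:

```python
# Invented two-state MDP: s0 -> s1 with reward 1, s1 -> s0 with reward 0
gamma = 0.9
pi = {"s0": {"a": 1.0}, "s1": {"a": 1.0}}
# (s, a) -> list of (probability, next_state, reward)
dynamics = {
    ("s0", "a"): [(1.0, "s1", 1.0)],
    ("s1", "a"): [(1.0, "s0", 0.0)],
}

v = {s: 0.0 for s in pi}
for _ in range(1000):
    # one sweep of v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a)[r + gamma v(s')]
    v = {
        s: sum(
            pi[s][a] * sum(p * (r + gamma * v[sn]) for p, sn, r in dynamics[(s, a)])
            for a in pi[s]
        )
        for s in pi
    }
```

The exact fixed point here is $v_\pi(s_0) = 1 + 0.81\, v_\pi(s_0)$, i.e. $v_\pi(s_0) = 1/0.19 \approx 5.26$, which the iteration converges to.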
## Computing $q_\pi$ in terms of $q_\pi$
$$
\begin{aligned}
q_\pi(s, a) &= \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma\, v_\pi(s')\big] && \text{based on second section} \\
&= \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma \sum_{a' \in \mathcal{A}(s')} \pi(a' \mid s')\, q_\pi(s', a')\Big] && \text{based on first section}
\end{aligned}
$$
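The same fixed-point iteration works directly on $q_\pi$. Reusing the invented two-state MDP from above (a deterministic single-action policy, so $q_\pi$ matches $v_\pi$ at the fixed point):

```python
# Invented two-state MDP, same as the v_pi example
gamma = 0.9
pi = {"s0": {"a": 1.0}, "s1": {"a": 1.0}}
dynamics = {
    ("s0", "a"): [(1.0, "s1", 1.0)],
    ("s1", "a"): [(1.0, "s0", 0.0)],
}

q = {(s, a): 0.0 for s in pi for a in pi[s]}
for _ in range(1000):
    # q(s,a) <- sum_{s',r} p(s',r|s,a)[r + gamma * sum_a' pi(a'|s') q(s',a')]
    q = {
        (s, a): sum(
            p * (r + gamma * sum(pi[sn][an] * q[(sn, an)] for an in pi[sn]))
            for p, sn, r in dynamics[(s, a)]
        )
        for s in pi
        for a in pi[s]
    }
```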