== Derivation ==

=== A dynamic decision problem ===

Let <math>x_t</math> be the state at time <math>t</math>. For a decision that begins at time 0, we take as given the initial state <math>x_0</math>. At any time, the set of possible actions depends on the current state; we express this as <math> a_{t} \in \Gamma (x_t)</math>, where a particular action <math>a_t</math> represents particular values for one or more control variables, and <math>\Gamma (x_t)</math> is the set of actions available to be taken at state <math>x_t</math>. It is also assumed that the state changes from <math>x</math> to a new state <math>T(x,a)</math> when action <math>a</math> is taken, and that the current payoff from taking action <math>a</math> in state <math>x</math> is <math>F(x,a)</math>. Finally, we assume impatience, represented by a [[discount factor]] <math>0<\beta<1</math>.

Under these assumptions, an infinite-horizon decision problem takes the following form:

:<math> V(x_0) \; = \; \max_{ \left \{ a_{t} \right \}_{t=0}^{\infty} } \sum_{t=0}^{\infty} \beta^t F(x_t,a_{t}), </math>

subject to the constraints

:<math> a_{t} \in \Gamma (x_t), \; x_{t+1}=T(x_t,a_t), \; \forall t = 0, 1, 2, \dots </math>

Notice that we have defined notation <math>V(x_0)</math> to denote the optimal value that can be obtained by maximizing this objective function subject to the assumed constraints. This function is the ''value function''. It is a function of the initial state variable <math>x_0</math>, since the best value obtainable depends on the initial situation.

=== Bellman's principle of optimality ===

The dynamic programming method breaks this decision problem into smaller subproblems. Bellman's ''principle of optimality'' describes how to do this:

<blockquote>Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. (See Bellman, 1957, Chap. III.3.)<ref name=BellmanDP /><ref name=dreyfus /><ref name=BellmanTheory>{{cite journal |first=R |last=Bellman |pmc=1063639 |title=On the Theory of Dynamic Programming |journal=Proc Natl Acad Sci U S A |date=August 1952 |volume=38 |issue=8 |pages=716–9 |pmid=16589166 |doi=10.1073/pnas.38.8.716 |bibcode=1952PNAS...38..716B |doi-access=free }}</ref></blockquote>

In computer science, a problem that can be broken apart like this is said to have [[optimal substructure]]. In the context of dynamic [[game theory]], this principle is analogous to the concept of [[subgame perfect equilibrium]], although what constitutes an optimal policy in this case is conditioned on the decision-maker's opponents choosing similarly optimal policies from their points of view.

As suggested by the ''principle of optimality'', we will consider the first decision separately, setting aside all future decisions (we will start afresh from time 1 with the new state <math>x_1</math>). Collecting the future decisions in brackets on the right, the above infinite-horizon decision problem is equivalent to:{{Clarify|date=September 2017}}

:<math> \max_{ a_0 } \left \{ F(x_0,a_0) + \beta \left[ \max_{ \left \{ a_{t} \right \}_{t=1}^{\infty} } \sum_{t=1}^{\infty} \beta^{t-1} F(x_t,a_{t}): a_{t} \in \Gamma (x_t), \; x_{t+1}=T(x_t,a_t), \; \forall t \geq 1 \right] \right \}</math>

subject to the constraints

:<math> a_0 \in \Gamma (x_0), \; x_1=T(x_0,a_0). </math>

Here we are choosing <math>a_0</math>, knowing that our choice will cause the time 1 state to be <math>x_1=T(x_0,a_0)</math>. That new state will then affect the decision problem from time 1 on. The whole future decision problem appears inside the square brackets on the right.{{Clarify|date=September 2017}}{{Explain|date=September 2017}}

=== The Bellman equation ===

So far it seems we have only made the problem uglier by separating today's decision from future decisions. But we can simplify by noticing that what is inside the square brackets on the right is ''the value'' of the time 1 decision problem, starting from state <math>x_1=T(x_0,a_0)</math>. Therefore, the problem can be rewritten as a [[Recursion|recursive]] definition of the value function:

:<math>V(x_0) = \max_{ a_0 } \{ F(x_0,a_0) + \beta V(x_1) \}, </math>

subject to the constraints:

:<math> a_0 \in \Gamma (x_0), \; x_1=T(x_0,a_0). </math>

This is the Bellman equation. It may be simplified even further if the time subscripts are dropped and the value of the next state is plugged in:

:<math>V(x) = \max_{a \in \Gamma (x) } \{ F(x,a) + \beta V(T(x,a)) \}.</math>

The Bellman equation is classified as a [[functional equation]], because solving it means finding the unknown function <math>V</math>, which is the ''value function''. Recall that the value function describes the best possible value of the objective, as a function of the state <math>x</math>. By calculating the value function, we will also find the function <math>a(x)</math> that describes the optimal action as a function of the state; this is called the ''policy function''.
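Because the Bellman equation characterizes <math>V</math> as a fixed point, it can be approximated numerically by value function iteration when the state and action sets are finite. The following sketch is purely illustrative: the state space, the payoff function <code>F</code>, the transition function <code>T</code>, and the feasibility correspondence <code>Gamma</code> are placeholder choices, not part of the formal statement above.

<syntaxhighlight lang="python">
# Value function iteration for V(x) = max_{a in Gamma(x)} { F(x, a) + beta * V(T(x, a)) }
# on a small finite state space. F, T and Gamma below are illustrative placeholders.

beta = 0.95
states = range(10)                 # finite state space {0, ..., 9}

def Gamma(x):                      # feasible actions in state x
    return range(x + 1)

def F(x, a):                       # current payoff from taking action a in state x
    return (x - a) ** 0.5

def T(x, a):                       # deterministic transition to the next state
    return min(a + 1, 9)

V = {x: 0.0 for x in states}       # initial guess for the value function

for _ in range(1000):              # apply the Bellman operator until (approximate) convergence
    V_new = {x: max(F(x, a) + beta * V[T(x, a)] for a in Gamma(x)) for x in states}
    if max(abs(V_new[x] - V[x]) for x in states) < 1e-8:
        V = V_new
        break
    V = V_new

# The maximizing action at each state approximates the policy function a(x).
policy = {x: max(Gamma(x), key=lambda a: F(x, a) + beta * V[T(x, a)]) for x in states}
</syntaxhighlight>

For bounded payoffs and <math>0<\beta<1</math>, the Bellman operator is a contraction, so this iteration converges to the unique fixed point from any initial guess.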
=== In a stochastic problem ===

{{See also|Markov decision process}}

In the deterministic setting, other techniques besides dynamic programming can be used to tackle the above [[optimal control]] problem. However, the Bellman equation is often the most convenient method of solving ''stochastic'' optimal control problems.

For a specific example from economics, consider an infinitely-lived consumer with initial wealth endowment <math>{\color{Red}a_0}</math> at period <math>0</math>. They have an instantaneous [[utility function]] <math>u(c)</math>, where <math>c</math> denotes consumption, and they discount the next period's utility at a rate of <math>0< \beta<1 </math>. Assume that what is not consumed in period <math>t</math> carries over to the next period with interest rate <math>r</math>. Then the consumer's utility maximization problem is to choose a consumption plan <math>\{{\color{OliveGreen}c_t}\}</math> that solves

:<math>\max \sum_{t=0} ^{\infty} \beta^t u ({\color{OliveGreen}c_t})</math>

subject to

:<math>{\color{Red}a_{t+1}} = (1 + r) ({\color{Red}a_t} - {\color{OliveGreen}c_t}), \; {\color{OliveGreen}c_t} \geq 0,</math>

and

:<math>\lim_{t \rightarrow \infty} {\color{Red}a_t} \geq 0.</math>

The first constraint is the capital accumulation/law of motion specified by the problem, while the second constraint is a [[Transversality (mathematics)|transversality condition]] that the consumer does not carry debt at the end of their life. The Bellman equation is

:<math>V(a) = \max_{ 0 \leq c \leq a } \{ u(c) + \beta V((1+r) (a - c)) \}.</math>

Alternatively, one can treat the sequence problem directly using, for example, the [[Hamiltonian (control theory)|Hamiltonian equations]].
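As a rough numerical illustration of this deterministic consumer problem, the value function can again be approximated by iterating the Bellman operator on a discretized asset grid. The utility function <math>u(c)=\sqrt{c}</math>, the interest rate, and the grid bounds below are arbitrary choices made only for this sketch.

<syntaxhighlight lang="python">
import numpy as np

# Value function iteration for the deterministic consumer problem
#   V(a) = max_{0 <= c <= a} { u(c) + beta * V((1 + r)(a - c)) }.
# The utility function, interest rate and asset grid are illustrative choices.

beta, r = 0.95, 0.04
grid = np.linspace(0.0, 10.0, 201)       # discretized asset holdings

def u(c):
    return np.sqrt(c)

V = np.zeros(len(grid))                  # initial guess V(a) = 0

for _ in range(2000):
    V_new = np.empty_like(V)
    for i, a in enumerate(grid):
        # Choosing next-period assets a' on the grid implies c = a - a'/(1 + r);
        # the constraint 0 <= c <= a corresponds to 0 <= a' <= (1 + r) a.
        c = a - grid / (1.0 + r)
        feasible = c >= 0.0
        candidate = u(np.where(feasible, c, 0.0)) + beta * V
        V_new[i] = np.max(np.where(feasible, candidate, -np.inf))
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
</syntaxhighlight>

Restricting next-period assets to the same grid avoids interpolation of <math>V</math>; finer grids or interpolation give a more accurate approximation at greater computational cost.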
Now, if the interest rate varies from period to period, the consumer is faced with a stochastic optimization problem. Let the interest rate ''r'' follow a [[Markov process]] with probability transition function <math>Q(r, d\mu_r)</math>, where <math>d\mu_r</math> denotes the [[probability measure]] governing the distribution of the interest rate next period if the current interest rate is <math>r</math>. In this model the consumer decides their current period consumption after the current period interest rate is announced.

Rather than simply choosing a single sequence <math>\{{\color{OliveGreen}c_t}\}</math>, the consumer now must choose a sequence <math>\{{\color{OliveGreen}c_t}\}</math> for each possible realization of <math>\{r_t\}</math> in such a way that their lifetime expected utility is maximized:

:<math>\max_{ \left \{ c_{t} \right \}_{t=0}^{\infty} } \mathbb{E}\bigg( \sum_{t=0} ^{\infty} \beta^t u ({\color{OliveGreen}c_t}) \bigg).</math>

The expectation <math>\mathbb{E}</math> is taken with respect to the appropriate probability measure given by ''Q'' on the sequences of ''r''{{'}}s. Because ''r'' is governed by a Markov process, dynamic programming simplifies the problem significantly. The Bellman equation is then simply

:<math>V(a, r) = \max_{ 0 \leq c \leq a } \{ u(c) + \beta \int V((1+r) (a - c), r') Q(r, d\mu_r) \}.</math>

Under some reasonable assumptions, the resulting optimal policy function ''g''(''a'',''r'') is [[measurable]].

For a general stochastic sequential optimization problem with Markovian shocks, where the agent is faced with their decision ''[[ex-post]]'', the Bellman equation takes a very similar form:

:<math>V(x, z) = \max_{c \in \Gamma(x,z)} \{F(x, c, z) + \beta \int V( T(x,c), z') d\mu_z(z')\}.</math>
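When the shock takes only finitely many values, the integral in the stochastic Bellman equation reduces to a weighted sum over next-period states, and the same value function iteration applies with the shock carried as a second state variable. The two-state interest rate chain, transition matrix, and utility function below are purely illustrative.

<syntaxhighlight lang="python">
import numpy as np

# Value function iteration for the stochastic consumer problem
#   V(a, r) = max_{0 <= c <= a} { u(c) + beta * E[ V((1 + r)(a - c), r') | r ] }
# with a two-state Markov chain for the interest rate. All numbers are illustrative.

beta = 0.95
r_states = np.array([0.02, 0.06])              # possible interest rates
Q = np.array([[0.9, 0.1],                      # Q[j, k] = P(r' = r_states[k] | r = r_states[j])
              [0.2, 0.8]])
grid = np.linspace(0.0, 10.0, 201)             # discretized asset holdings

def u(c):
    return np.sqrt(c)

V = np.zeros((len(grid), len(r_states)))       # V[i, j] approximates V(grid[i], r_states[j])

for _ in range(2000):
    EV = V @ Q.T                               # EV[i, j] = E[ V(grid[i], r') | r = r_states[j] ]
    V_new = np.empty_like(V)
    for j, r in enumerate(r_states):
        for i, a in enumerate(grid):
            c = a - grid / (1.0 + r)           # consumption implied by next-period assets on the grid
            feasible = c >= 0.0
            candidate = u(np.where(feasible, c, 0.0)) + beta * EV[:, j]
            V_new[i, j] = np.max(np.where(feasible, candidate, -np.inf))
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
</syntaxhighlight>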