Editing Reinforcement learning (section)

=== Brute force ===
The [[brute-force search|brute force]] approach entails two steps:
* For each possible policy, sample returns while following it
* Choose the policy with the largest expected discounted return

One problem with this is that the number of policies can be large, or even infinite. Another is that the variance of the returns may be large, which requires many samples to accurately estimate the discounted return of each policy.

These problems can be ameliorated if we assume some structure and allow samples generated from one policy to influence the estimates made for others. The two main approaches for achieving this are [[#Value function|value function estimation]] and [[#Direct policy search|direct policy search]].