A Comparison of Temporal-Difference(0) and Constant-α Monte Carlo Methods on the Random Walk Task
Monte Carlo (MC) and Temporal-Difference (TD) methods are both fundamental techniques in reinforcement learning; they solve the prediction problem from experience gathered by interacting with the environment rather than from a model of the environment. TD, however, combines ideas from MC methods and Dynamic Programming (DP), and it differs from MC in its update rule, its use of bootstrapping, and its bias/variance trade-off. In practice, TD methods have also usually been found to learn faster and converge sooner than MC methods.
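Concretely, the difference in update rule comes down to the target each method moves its estimate toward: MC waits for the full sampled return, while TD(0) bootstraps from the current estimate of the next state's value. In their standard forms:

$$\text{MC target: } G_t, \qquad \text{TD(0) target: } R_{t+1} + \gamma V(S_{t+1})$$

Because the TD(0) target uses the current estimate $V(S_{t+1})$, it is biased but has lower variance than the full return $G_t$.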
In this post, we'll compare TD and MC, or more specifically, the TD(0) and constant-α MC methods, on a simple grid environment and on the more comprehensive Random Walk [2] environment. We hope this post helps readers interested in reinforcement learning better understand how each method updates the state-value function and how their performance differs in the same testing environment.
We will implement the algorithms and comparisons in Python; the libraries used in this post are as follows:
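A minimal setup for this kind of experiment might look like the following (NumPy for the numerics and Matplotlib for the plots; this import list is our assumption, not necessarily the post's exact one):

```python
import numpy as np               # array math for value tables and returns
import matplotlib.pyplot as plt  # plotting the learning curves
```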
The constant-α MC method is a regular MC method with a constant step-size parameter α; this constant parameter makes the value estimate more sensitive to recent experience. In practice, the choice of α is a trade-off between stability and adaptability. The MC method updates the state-value function at time t as follows:
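$$V(S_t) \leftarrow V(S_t) + \alpha \bigl[ G_t - V(S_t) \bigr]$$

where $G_t$ is the return actually observed following time $t$. As a minimal sketch of this update (the function name, the episode format, and the default α and γ below are our own illustrative choices, not the post's):

```python
def constant_alpha_mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Apply the constant-alpha MC update to V from one finished episode.

    `episode` is a list of (state, reward) pairs, where `reward` is the
    reward received after leaving `state`; V maps states to value estimates.
    """
    G = 0.0
    # Iterate backwards so the return G_t can be accumulated incrementally:
    # G_t = R_{t+1} + gamma * G_{t+1}.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        # V(S_t) <- V(S_t) + alpha * (G_t - V(S_t))
        V[state] += alpha * (G - V[state])
    return V


# Example: a three-state episode ending with a terminal reward of 1.
V = {s: 0.5 for s in "ABC"}
constant_alpha_mc_update(V, [("A", 0.0), ("B", 0.0), ("C", 1.0)], alpha=0.1)
```

Note that because MC needs the complete return $G_t$, the update can only be applied after the episode has terminated; this is exactly the property TD(0) relaxes via bootstrapping.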