Advantage Actor Critic
Created on 2022-06-14T23:29:44-05:00
State: which state of the state machine you are presently in.
Critic: attempts to predict the amount of reward that can be obtained from this state.
Actor: selects which transition (action) to take.
Advantage: how the reward actually observed compares with the critic's prediction of what was obtainable; even when every available choice is bad, it credits picking the least bad one.
Entropy: a bonus (equivalently, a penalty on overly confident policies) used to encourage exploration.
Discounts: a discount rate shrinks the amount of reward anticipated in the future so that rewards arriving sooner are preferred (a small numeric sketch of these terms follows this list).
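A minimal numeric sketch of those last three terms, in plain Python; the rewards, critic estimate, and action probabilities below are made-up numbers, not values from the note.

```python
import math

gamma = 0.99                  # discount rate: later rewards count for less
rewards = [0.0, 0.0, 1.0]     # hypothetical rewards observed along a short journey

# Discounted return from the first state: r0 + gamma*r1 + gamma^2*r2 + ...
discounted_return = sum(r * gamma**t for t, r in enumerate(rewards))

critic_estimate = 0.90        # hypothetical critic prediction for that state
advantage = discounted_return - critic_estimate   # positive means better than predicted

# Entropy of the actor's action probabilities; adding a bonus proportional to it
# keeps the policy from collapsing onto one action too early (exploration).
action_probs = [0.7, 0.2, 0.1]
entropy = -sum(p * math.log(p) for p in action_probs)

print(discounted_return, advantage, entropy)
```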
Record the states and transitions taken for a journey of some length.
Monte-Carlo Fox: runs the entire simulation and propagates reward scores backwards perfectly.
A2C instead uses the critic to estimate how many points can still be earned from the current state and uses that estimate to propagate scores back across the last several state changes, as sketched below.
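A sketch of that bootstrapping step, assuming a short logged window of rewards: the critic's value for the state where the window ends stands in for everything that would have happened afterwards, so scores can be pushed backwards without running the episode to completion.

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Propagate scores backwards through the logged window.

    rewards: rewards observed over the last N transitions.
    bootstrap_value: critic's estimate of the reward still obtainable
        from the state where the window ended.
    """
    returns = []
    running = bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

# Hypothetical window of four transitions; the critic guesses 2.5 more points
# are available from wherever the window ended.
print(n_step_returns([0.0, 1.0, 0.0, 0.0], bootstrap_value=2.5))
```

A pure Monte-Carlo pass would instead wait for the episode to finish and use bootstrap_value = 0, trading the critic's bias for higher variance.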
Algorithm
Be in a state.
Sample the next action from the actor's probability distribution.
Observe the reward from the transition.
Log transitions and states over time.
Every so often, stop and use the critic to estimate the future rewards obtainable from the current state, then propagate this estimate and the observed rewards back through the historical state changes (a condensed sketch of this loop follows).
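A condensed sketch of that loop as a single update step, written with PyTorch (an assumption; the note names no framework). `ActorCritic`, the layer sizes, and the dummy data at the bottom are hypothetical and only illustrate the structure: act, log, then bootstrap with the critic and push the logged rewards backwards.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, n_states=8, n_actions=3, hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_states, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, n_actions)   # picks the transition to take
        self.critic = nn.Linear(hidden, 1)          # predicts reward from this state

    def forward(self, state):
        h = self.body(state)
        dist = torch.distributions.Categorical(logits=self.actor(h))
        return dist, self.critic(h).squeeze(-1)

def train_step(model, optimizer, states, actions, rewards, last_state,
               gamma=0.99, entropy_coef=0.01):
    # Critic estimates the reward still obtainable from the current state...
    with torch.no_grad():
        _, bootstrap = model(last_state)
    # ...and that estimate plus the observed rewards is propagated backwards.
    returns, running = [], bootstrap
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns = torch.stack(list(reversed(returns)))

    dist, values = model(torch.stack(states))
    log_probs = dist.log_prob(torch.stack(actions))
    advantages = returns - values                   # better or worse than predicted

    actor_loss = -(log_probs * advantages.detach()).mean()
    critic_loss = advantages.pow(2).mean()
    entropy_bonus = dist.entropy().mean()           # subtracted from the loss to encourage exploration

    loss = actor_loss + 0.5 * critic_loss - entropy_coef * entropy_bonus
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Hypothetical invocation with dummy data, just to show the shapes involved.
model = ActorCritic()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
states = [torch.randn(8) for _ in range(5)]
actions = [torch.tensor(a) for a in [0, 2, 1, 0, 1]]
train_step(model, opt, states, actions,
           rewards=[0.0, 0.0, 1.0, 0.0, 0.0], last_state=torch.randn(8))
```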