SARSA
Created on 2022-06-16T20:58:24-05:00
for every episode:
    initialize state S
    sample an action A from the policy (e.g. epsilon-greedy on Q)
    for each step, until the episode ends:
        take action A; observe reward R and new state S'
        sample the next action A' from the policy at S'
        update the expected return for (S, A):
            Q(S, A) <- Q(S, A) + alpha * [R + k * Q(S', A') - Q(S, A)]
        S <- S'; A <- A'
where alpha is the learning-rate step size and k is the discount factor (conventionally written gamma).
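The loop above can be sketched in Python. This is a minimal tabular SARSA on a hypothetical toy environment (a 1-D chain where moving right toward the last state earns reward 1 and ends the episode); the environment, epsilon-greedy policy, and all parameter values are illustrative assumptions, not part of the original note.

```python
import random

def sarsa(n_states=6, n_actions=2, episodes=500,
          alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """Tabular SARSA on a toy 1-D chain (hypothetical environment):
    action 0 moves left, action 1 moves right; reaching the last
    state yields reward 1 and terminates the episode."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def step(s, a):
        # deterministic chain dynamics, clipped at the ends
        s2 = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
        done = (s2 == n_states - 1)
        return s2, (1.0 if done else 0.0), done

    def policy(s):
        # epsilon-greedy action selection over Q
        if rng.random() < eps:
            return rng.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[s][a])

    for _ in range(episodes):
        s = 0              # initialize state
        a = policy(s)      # sample an action
        done = False
        while not done:    # each step until the episode ends
            s2, r, done = step(s, a)   # take action; observe R, S'
            a2 = policy(s2)            # sample next action A'
            # SARSA update: Q(S,A) += alpha * [R + gamma*Q(S',A') - Q(S,A)]
            target = r + (0.0 if done else gamma * Q[s2][a2])
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s2, a2
    return Q
```

Because the next action A' is drawn from the same policy being improved, SARSA is on-policy; swapping `Q[s2][a2]` for `max(Q[s2])` in the target would turn this into off-policy Q-learning.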