Initial/basic question on Temporal Difference Reinforcement Learning. Do these algorithms (Q-Learning, SARSA) assume that the current state and the action taken will influence what the next state will be? The obvious feedback for the agent is the reward and the next state, but does this mean the next state can eventually be predicted, so that the agent not only adapts but also predicts future states from the current state?
Hey, I'm no expert, but I'm pretty sure the main reason for using Q-Learning is that you don't have a model of the transition function. You could learn one, but then it wouldn't be Q-Learning; it would be more like model-based learning, where you learn the transition function and then solve it with something like value iteration.

If we look at a single update of Q-Learning:

s     - previous state (the one you just came from)
a     - action taken
s'    - state you ended up in
a'    - actions available from the new state
r     - the reward for getting to s'
alpha - learning rate
gamma - discount factor for future rewards

Q(s, a) = (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a'))

It's not that the next state can be predicted at some point; it's the expected return (the long-run discounted reward) that gets estimated. I can't comment on SARSA without doing a bit of looking around, but I'm pretty sure they're very similar.
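If it helps, here's a minimal sketch of that update in Python. The toy chain environment, the hyperparameter values, and the names `q_update` and `step` are all made up for illustration; the point is that the agent only ever uses the sampled (s, a, r, s') transition and never builds a model of the dynamics.

```python
import random
from collections import defaultdict

# Hypothetical hyperparameter values, just for the sketch
alpha = 0.1    # learning rate
gamma = 0.9    # discount factor
epsilon = 0.1  # exploration rate for epsilon-greedy action selection

ACTIONS = ["left", "right"]

# Q-table: (state, action) -> estimated return; unseen pairs default to 0.0
Q = defaultdict(float)

def q_update(s, a, r, s_next):
    """One Q-Learning step: Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

def step(s, a):
    """Toy chain environment (invented for this sketch): states 0..4, reward 1 on reaching state 4."""
    s_next = max(0, s - 1) if a == "left" else min(4, s + 1)
    r = 1.0 if s_next == 4 else 0.0
    return s_next, r

for episode in range(500):
    s = 0
    while s != 4:
        # epsilon-greedy: mostly exploit current Q estimates, sometimes explore
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next, r = step(s, a)
        q_update(s, a, r, s_next)  # uses only the observed (s, a, r, s'), no model of step()
        s = s_next

print(Q[(3, "right")])  # approaches 1.0, the reward one step away
```

Notice that `q_update` never calls `step` to look ahead; it just reacts to the transition it observed. That's the model-free part: the Q-values estimate expected return directly, without ever predicting which state comes next.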
Thanks mate, that's what I wanted to know.