An agent is a decision-maker that makes decisions based on rewards and punishments. The agent needs to:
- learn from the data it already has
- experiment to gather the new data it needs
- plan for desired outcomes over the time horizon
The agent makes decisions under uncertainty, with long-term consequences, by interacting with the environment. The environment presents some notion of a state; based on this state, the agent takes an action and receives feedback for that action. The feedback is partial because the agent only observes the outcome of the one action it chose among the many options it could have chosen. The ultimate goal of the agent is to maximize the sum of future rewards, and it repeats this whole loop to reach that goal.
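A minimal sketch of this interaction loop; the toy environment and the random action choice below are made-up illustrations, not part of any particular library:

```python
import random

class ToyEnv:
    """A made-up 1-D environment: the agent starts at position 0 and gets +1
    for reaching position 3 (purely illustrative)."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):              # action: -1 (left) or +1 (right)
        self.pos += action
        done = (self.pos == 3)
        reward = 1.0 if done else 0.0
        return self.pos, reward, done

env = ToyEnv()
state = env.reset()                      # the environment reveals a state
total_reward = 0.0
for t in range(10):
    action = random.choice([-1, 1])      # the agent picks one action among many
    next_state, reward, done = env.step(action)  # feedback only for the chosen action
    total_reward += reward               # the agent tries to maximize this sum
    state = next_state
    if done:
        break
```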
Success in RL
The most famous example of RL is AlphaGo (2016). Go has very large state and action spaces, so AlphaGo approximates what is represented on the board with a neural network. After a long period of training, the model defeated a professional player. Since then, AlphaZero, which is even stronger than the original AlphaGo, has come out.
- Temporal Difference Learning and TD-Gammon (1995): the first time a neural network was used to approximate the value function of a state. Although it was not called RL at the time, it laid a foundation for RL.
- Playing Atari with Deep Reinforcement Learning (2013), DQN on Atari 2600: showed superhuman performance on many different Atari games using a single architecture called DQN.
- Algorithms for StarCraft, Chess, Go, Robotics, etc.
Characteristics of RL
It is crucial in RL not just to find the underlying structure of the data we already have, but also to actively collect better data and better information so that we can maximize the sum of rewards. As an agent, we should go out into the world and collect the right information, the information that gives us the right signal.
1. Reward
Rewards are real-valued feedback signals that measure how well the agent is behaving. The agent's goal is to maximize a discounted* sum of rewards over time. 'Discounted' means, as in economics, that a reward now is worth more than the same reward in the future. The reward is given by the environment, but it can also come from the agent's internal interpretation of the outcome.
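A minimal sketch of the discounted sum of rewards, assuming a discount factor gamma between 0 and 1:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of future rewards, each discounted by how far away it is.

    rewards: list of real-valued rewards [r_0, r_1, r_2, ...]
    gamma:   discount factor; smaller gamma means 'now matters much more'.
    """
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# The same +1 reward is worth less the later it arrives.
print(discounted_return([1, 0, 0]))   # 1.0
print(discounted_return([0, 0, 1]))   # 0.9801 (= 0.99**2)
```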
2. Policy
In RL, the policy is the agent's behavior. It is usually a function mapping a state to an action. A policy can be either deterministic or stochastic: a deterministic policy always returns the same action for the same state, while a stochastic policy includes randomness, which helps the agent get out of sub-optimal behavior.
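A sketch of the two kinds of policies; the state encoding and the action set below are just illustrative assumptions:

```python
import random

ACTIONS = ["left", "right"]

def deterministic_policy(state):
    # The same state always maps to the same action.
    return "right" if state >= 0 else "left"

def stochastic_policy(state):
    # Samples an action from a probability distribution over actions;
    # the randomness can help the agent escape sub-optimal behavior.
    p_right = 0.8 if state >= 0 else 0.2
    return random.choices(ACTIONS, weights=[1 - p_right, p_right])[0]

print(deterministic_policy(1))   # always "right" for state 1
print(stochastic_policy(1))      # "right" about 80% of the time
```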
3. Value Function
The value function V^pi(s) is the expectation (or estimate) of future rewards from a state s: the sum of discounted future rewards obtained by following the policy pi from s. It measures how good or bad each state is. Actions are chosen based on the value function rather than on the immediate reward. However, without rewards there could be no values, and the only purpose of estimating values is to collect more rewards!
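One simple way to estimate V^pi(s) is Monte Carlo: run the policy from s several times and average the discounted returns. The sketch below assumes a hypothetical `env` with `reset(start_state)` and `step(action)` methods and a `policy` function; none of these names come from the original text.

```python
def estimate_value(env, policy, start_state, gamma=0.99, n_episodes=100, max_steps=200):
    """Monte Carlo estimate of V^pi(start_state): the average discounted
    return obtained by following `policy` from `start_state`."""
    returns = []
    for _ in range(n_episodes):
        state = env.reset(start_state)
        total, discount = 0.0, 1.0
        for _ in range(max_steps):
            action = policy(state)
            state, reward, done = env.step(action)
            total += discount * reward
            discount *= gamma
            if done:
                break
        returns.append(total)
    return sum(returns) / len(returns)
```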
4. (optional) Model of environment
The model describes how the environment behaves. The environment is the oracle that tells us the next state and the reward after the agent takes an action in the current state, and a model tries to mimic this. Models are used for 'planning': determining a course of action by considering possible future situations. The model is optional because there are two families of RL methods, model-free and model-based.
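A toy sketch of a tabular model and one-step lookahead planning; the tiny transition table and the state values are made-up examples:

```python
# A tabular model: (state, action) -> (next_state, reward). The table is
# purely illustrative.
MODEL = {
    ("s0", "a0"): ("s1", 0.0),
    ("s0", "a1"): ("s2", 1.0),
    ("s1", "a0"): ("s2", 5.0),
    ("s1", "a1"): ("s0", 0.0),
}

def plan_one_step(state, value, gamma=0.99):
    """Look ahead one step with the model and pick the action with the best
    predicted reward plus discounted value of the predicted next state."""
    best_action, best_score = None, float("-inf")
    for (s, a), (s_next, r) in MODEL.items():
        if s != state:
            continue
        score = r + gamma * value.get(s_next, 0.0)
        if score > best_score:
            best_action, best_score = a, score
    return best_action

# With these (made-up) state values, looking ahead prefers "a0" from "s0".
print(plan_one_step("s0", value={"s1": 10.0, "s2": 0.0}))   # "a0"
```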
Markov Decision Process, MDP
A Markov decision process has four elements: the state s, the action a, the reward r, and the policy pi, which maps states to actions.
The key to the Markov decision process is the transition model, which gives the next state based on the current state and the action the agent takes. This transition model is Markovian: given the current state and action, the next state is independent of all previous states.
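In symbols, the Markov property says P(s_{t+1} | s_t, a_t, s_{t-1}, ..., s_0) = P(s_{t+1} | s_t, a_t). Below is a sketch of a stochastic transition model written as a lookup table of distributions; the states, actions, and probabilities are made-up:

```python
import random

# P(s' | s, a) written as a lookup table of (next_state, probability) pairs.
# The entries are purely illustrative.
TRANSITIONS = {
    ("s0", "a0"): [("s0", 0.1), ("s1", 0.9)],
    ("s0", "a1"): [("s2", 1.0)],
    ("s1", "a0"): [("s0", 0.5), ("s2", 0.5)],
}

def sample_next_state(state, action):
    """Markovian sampling: the next state depends only on (state, action),
    not on any earlier history."""
    next_states, probs = zip(*TRANSITIONS[(state, action)])
    return random.choices(next_states, weights=probs)[0]

print(sample_next_state("s0", "a0"))   # "s1" about 90% of the time
```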
RL is the science of decision making
RL is about sequences of decisions. The result or solution that RL produces can be a total surprise; we may get outcomes we did not even expect. RL is cool, but it is a difficult problem in general and takes many more skills than other frameworks.
- Exploration-exploitation dilemma: We only receive feedback for the actions we actually take, and this partial feedback creates the dilemma. We need to search enough of the state-action space to find the right signal, so we need to explore. But while we explore, we may lose the chance to collect the potentially good rewards suggested by the data we already have. So there is a trade-off between exploration and exploitation (a simple epsilon-greedy compromise is sketched after this list).
- Non-independent data: In statistics, the basic assumption we make is that data are IID (independent and identically distributed). In RL, however, the data are not IID: successive samples are correlated with each other rather than independent, and this makes estimation difficult when the correlation structure is unknown.
- Generalization: In a decision-making problem, we want the solution to generalize. However, for many RL algorithms, even on the same task, slightly changing the environment makes the situation totally different. Generalization is therefore another key challenge in RL.
- Sample & computational complexities: RL is really 'online' decision making that happens in real time, so we have to keep the sample and computational complexities reasonably small to deploy a solution in the real world.
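A common, simple compromise for the exploration-exploitation dilemma mentioned above is epsilon-greedy action selection. The sketch below assumes tabular action-value estimates stored as Q[(state, action)]; the specific states and values are made-up.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (pick a random action); otherwise
    exploit (pick the action with the highest estimated value)."""
    if random.random() < epsilon:
        return random.choice(actions)                               # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))       # exploit

# Made-up action-value estimates for illustration.
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.5}
print(epsilon_greedy(Q, "s0", ["left", "right"]))   # usually "right"
```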
Next
- Tabular solution methods: Bandit algorithms, Dynamic programming, Temporal-difference learning (SARSA, Q-learning)
- Approximate solution methods: On-policy & off-policy with function approximation, Policy gradient methods
- Recent advances in deep RL
- Real-world RL problems