# Difference between revisions of "learn what not to learn"

(→Action Elimination) |
(→Action Elimination) |
||

Line 25: | Line 25: | ||

Action elimination with contextual bandits: | Action elimination with contextual bandits: | ||

− | Let <math display="inline">x(s_t)\in R^d </math> be the feature representation of <math display="inline">s_t </math>. We assume that under this representation there exists a set of parameters <math display="inline">\theta_a^*\in R_d </math> such that the elimination signal in state <math display="inline">s_t </math> is <math display="inline">e_t(s_t,a) = \theta_a^Tx(s_t)+\eta_t </math>, where <math display="inline"> \Vert\theta_a^*\Vert_2\leq S</math>. <math display="inline">\eta_t</math> is an R-subgaussian random variable with zero mean that models additive noise to the elimination signal. | + | Let <math display="inline">x(s_t)\in R^d </math> be the feature representation of <math display="inline">s_t </math>. We assume that under this representation there exists a set of parameters <math display="inline">\theta_a^*\in R_d </math> such that the elimination signal in state <math display="inline">s_t </math> is <math display="inline">e_t(s_t,a) = \theta_a^Tx(s_t)+\eta_t </math>, where <math display="inline"> \Vert\theta_a^*\Vert_2\leq S</math>. <math display="inline">\eta_t</math> is an R-subgaussian random variable with zero mean that models additive noise to the elimination signal. When there is no noise in the elimination signal, then R=0. Otherwise, <math display="inline">R\leq 1</math> since the elimination signal is bounded in [0,1]. We will relax the assumption and let elimination signal to be: <math display="inline">0\leq E[\e_t(s_t,a)]\leq l </math> for any valid action and <math display="inline"> u\leq E[e_t(s_t, a)]\leq 1</math> for any invalid action. And <math display="inline"> l\leq u</math>. Next we denote by <math display="inline">X_{t,a}</math> as the matrix whose rows are the observed state representation vectors in which action a was chosen, up to time t. <math display="inline">E_{t,a}</math> as the vector whose elements are the observed state representation elimination signals in which action a was chosen, up to time t. Denote the solution to the regularized linear regression <math display="inline">\Vert X_{t,a}\theta_{t,a}-E_{t,a}\Vert_2^2+\lambda\Vert \theta_{t,a}\Vert_2^2 </math> (for some <math display="inline">\lambda>0</math>) by <math display="inline">\hat{\theta}_{t,a}=\bar{V}_{t,a}^{-1}X_{t,a}^TE_{t,a} </math> where <math display="inline">\bar{V}_{t,a}=\lambda I + X_{t,a}^TX_{t,a}</math>. |

## Revision as of 17:29, 19 November 2018

# Introduction

Learning how to act when the action space are large is challenging for reinforcement learning. For a specific case that many actions are irrelevant, it is sometimes easier for the algorithm to learn which action not to take. The paper propose a new reinforcement learning approach for dealing with large action spaces by restricting the available actions in each state to a subset of the most likely ones. More specifically, it propose a system that learns the approximation of Q-function and concurrently leans to eliminate actions. The method need to utilize an additional elimination signal which incorporates domain-specific prior knowledge. For example, in parser-based text games, the parser gives feedback regarding irrelevant actions after the action is played. (e.g., Player: "Climb the tree." Parser: "There are no trees to climb") Then a machine learning model can be trained to generalize to unseen states.

The paper focus mainly on tasks where both states and the actions are natural language. It introduce a novel deep reinforcement learning approach which has a DQN network and an Action Elimination Network(AEN), both using the CNN which is suitable to NLP tasks. The AEN is trained to predict invalid actions, supervised by the elimination signal from the environment. Note that the core assumption is that it is easy to predict which actions are invalid or inferior in each state and leverage that information for control.

# Related Work

Text-Based Games(TBG): The state of environment in TBG is described by simple language. The player interact with the environment with text command as action which respect a pre-defined grammar. An popular example is Zork which has been tested in the paper.

Representations for TBG: Good word representation is necessary in order to learn control policies from text. Previous work on TBG used pre-trained embeddings directly for control. other works combined pre-trained embeddings with neural networks.

DRL with linear function approximation: DRL methods such as the DQN have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This is mainly because neural networks can learn rich domain representations for value function and policy.

RL in Large Action Spaces: Prior work concentrated on factorizing the action space into binary subspace(Pazis and Parr, 2011; Dulac-Arnold et al., 2012; Lagoudakis and Parr, 2003), other works proposed to embed the discrete actions into a continuous space, then choose the nearest discrete action according to the optimal actions in the continuous space(Dulac-Arnold et al., 2015; Van Hasselt and Wiering, 2009). He et. al. (2015)extended DQN to unbounded(natural language) action spaces. Learning to eliminate actions was first mentioned by (Even-Dar, Mannor, and Mansour, 2003). They proposed to learn confidence intervals around the value function in each state. Lipton et al.(2016a) proposed to learn a classifier that dtects hazardous state and then use it to shape the reward. Fulda et al.(2017) presented a method for affordance extraction via inner products of pre-trained word embeddings.

# Action Elimination

The approach in the paper builds on the standard RL formulation: At each time step t, the agent observes state [math]s_t [/math] and chooses a discrete action [math]a_t\in\{1,...,|A|\} [/math]. Then the agent obtains a reward [math]r_t(s_t,a_t) [/math] and next state [math]s_{t+1} [/math]. The goal of the algorithm is to learn a policy [math]\pi(a|s) [/math] which maximizes the expected future discount return [math]V^\pi(s)=E^\pi[\sum_{t=0}^{\infty}\gamma^tr(s_t,a_t)|s_0=s]. [/math]

After executing an action, the agent observes a binary elimination signal e(s,a), which equals to 1 if action a can be eliminated for state s, 0 otherwise.

Definition 1. Valid state-action pairs with respect to an elimination signal are state action pairs which the elimination process should not eliminate.

Definition 2. Admissible state-action pairs with respect to an elimination algorithm are state action pairs which the elimination algorithm does not eliminate.

Action elimination with contextual bandits: Let [math]x(s_t)\in R^d [/math] be the feature representation of [math]s_t [/math]. We assume that under this representation there exists a set of parameters [math]\theta_a^*\in R_d [/math] such that the elimination signal in state [math]s_t [/math] is [math]e_t(s_t,a) = \theta_a^Tx(s_t)+\eta_t [/math], where [math] \Vert\theta_a^*\Vert_2\leq S[/math]. [math]\eta_t[/math] is an R-subgaussian random variable with zero mean that models additive noise to the elimination signal. When there is no noise in the elimination signal, then R=0. Otherwise, [math]R\leq 1[/math] since the elimination signal is bounded in [0,1]. We will relax the assumption and let elimination signal to be: [math]0\leq E[\e_t(s_t,a)]\leq l [/math] for any valid action and [math] u\leq E[e_t(s_t, a)]\leq 1[/math] for any invalid action. And [math] l\leq u[/math]. Next we denote by [math]X_{t,a}[/math] as the matrix whose rows are the observed state representation vectors in which action a was chosen, up to time t. [math]E_{t,a}[/math] as the vector whose elements are the observed state representation elimination signals in which action a was chosen, up to time t. Denote the solution to the regularized linear regression [math]\Vert X_{t,a}\theta_{t,a}-E_{t,a}\Vert_2^2+\lambda\Vert \theta_{t,a}\Vert_2^2 [/math] (for some [math]\lambda\gt 0[/math]) by [math]\hat{\theta}_{t,a}=\bar{V}_{t,a}^{-1}X_{t,a}^TE_{t,a} [/math] where [math]\bar{V}_{t,a}=\lambda I + X_{t,a}^TX_{t,a}[/math].