Geometry of Policy Improvement

07/11/2017
Publication GSI2017
OAI : oai:www.see.asso.fr:17410:22433
 

Abstract

We investigate the geometry of optimal memoryless time independent decision making in relation to the amount of information that the acting agent has about the state of the system. We show that the expected long term reward, discounted or per time step, is maximized by policies that randomize among at most k actions whenever at most k world states are consistent with the agent's observation. Moreover, we show that the expected reward per time step can be studied in terms of the expected discounted reward. Our main tool is a geometric version of the policy improvement lemma, which identifies a polyhedral cone of policy changes in which the state value function increases for all states.



License

Creative Commons: None (All rights reserved)

<resource  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xmlns="http://datacite.org/schema/kernel-4"
                xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4/metadata.xsd">
        <identifier identifierType="DOI">10.23723/17410/22433</identifier><creators><creator><creatorName>Guido Montúfar</creatorName></creator><creator><creatorName>Johannes Rauh</creatorName></creator></creators><titles>
            <title>Geometry of Policy Improvement</title></titles>
        <publisher>SEE</publisher>
        <publicationYear>2018</publicationYear>
        <resourceType resourceTypeGeneral="Text">Text</resourceType><subjects><subject>Partially Observable Markov Decision Process</subject><subject>Reinforcement Learning</subject><subject>memoryless stochastic policy</subject><subject>policy gradient theorem</subject></subjects><dates>
	    <date dateType="Created">Sat 3 Mar 2018</date>
	    <date dateType="Updated">Sat 3 Mar 2018</date>
            <date dateType="Submitted">Tue 13 Nov 2018</date>
	</dates>
        <alternateIdentifiers>
	    <alternateIdentifier alternateIdentifierType="bitstream">c842663606873ddefbe05a6ab118e3ce8630156a</alternateIdentifier>
	</alternateIdentifiers>
        <formats>
	    <format>application/pdf</format>
	</formats>
	<version>37116</version>
        <descriptions>
            <description descriptionType="Abstract">We investigate the geometry of optimal memoryless time independent decision making in relation to the amount of information that the acting agent has about the state of the system. We show that the expected long term reward, discounted or per time step, is maximized by policies that randomize among at most k actions whenever at most k world states are consistent with the agent's observation. Moreover, we show that the expected reward per time step can be studied in terms of the expected discounted reward. Our main tool is a geometric version of the policy improvement lemma, which identifies a polyhedral cone of policy changes in which the state value function increases for all states.
</description>
        </descriptions>
    </resource>

Geometry of Policy Improvement

Guido Montúfar^{1,2} and Johannes Rauh^{1}

1 Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, 04103 Leipzig, Germany
2 Departments of Mathematics and Statistics, UCLA, CA 90095-1555, USA

Abstract. We investigate the geometry of optimal memoryless time independent decision making in relation to the amount of information that the acting agent has about the state of the system. We show that the expected long term reward, discounted or per time step, is maximized by policies that randomize among at most $k$ actions whenever at most $k$ world states are consistent with the agent's observation. Moreover, we show that the expected reward per time step can be studied in terms of the expected discounted reward. Our main tool is a geometric version of the policy improvement lemma, which identifies a polyhedral cone of policy changes in which the state value function increases for all states.

Keywords: Partially Observable Markov Decision Process, Reinforcement Learning, memoryless stochastic policy, policy gradient theorem

1 Introduction

We are interested in the amount of randomization that is needed in action selection mechanisms in order to maximize the expected value of a long term reward, depending on the uncertainty of the acting agent about the system state. It is known that in a Markov Decision Process (MDP), the optimal policy may always be chosen deterministic (see, e.g., [5]), in the sense that the action $a$ that the agent chooses is a deterministic function of the world state $w$ the agent observes. This is no longer true in a Partially Observable MDP (POMDP), where the agent does not observe $w$ directly, but only the value $s$ of a sensor. In general, optimal memoryless policies for POMDPs are stochastic. However, the more information the agent has about $w$, the less stochastic an optimal policy needs to be. As shown in [4], if a particular sensor value $s$ uniquely identifies $w$, then the optimal policy may be chosen such that, on observing $s$, the agent always chooses the same action. We generalize this as follows: The agent may choose an optimal policy such that, if a given sensor value $s$ can be observed from at most $k$ world states, then the agent chooses an action probabilistically among a set of at most $k$ actions.

Such characterizations can be used to restrict the search space when searching for an optimal policy. In [1], it was proposed to construct a low-dimensional manifold of policies that contains an optimal policy in its closure and to restrict the learning algorithm to this manifold. In [4], it was shown how to do this in the POMDP setting when it is known that the optimal policy can be chosen deterministic in certain sensor states. This construction can be generalized and gives manifolds of even smaller dimension when the randomization of the policy can be further restricted.

As in [4], we study the case where at each time step the agent receives a reward that depends on the world state $w$ and the chosen action $a$. We are interested in the long term reward in either the average or the discounted sense [6]. Discounted rewards are often preferred in theoretical analysis, because of the properties of the dynamic programming operators. In [4], the analysis of average rewards was much more involved than the analysis of discounted rewards. While the case of discounted rewards follows from a policy improvement argument, an elaborate geometric analysis was needed for the case of average rewards.

Various works have compared average and discounted rewards [8, 3, 2]. Here, we develop a tool that allows us to transfer properties of optimal policies from the discounted case to the average case. Namely, the average case can be seen as the limit of the discounted case when the discount factor $\gamma$ approaches 1. If the Markov chain is irreducible and aperiodic, this limit is uniform, and the optimal policies of the discounted case converge to optimal policies of the average case.

2 Optimal Policies for POMDPs

A (discrete time) partially observable Markov decision process (POMDP) is defined by a tuple $(W, S, A, \alpha, \beta, R)$, where $W, S, A$ are finite sets of world states, sensor states, and actions, $\beta\colon W \to \Delta_S$ and $\alpha\colon W \times A \to \Delta_W$ are Markov kernels describing sensor measurements and world state transitions, and $R\colon W \times A \to \mathbb{R}$ is a reward signal. We consider stationary (memoryless and time independent) action selection mechanisms, described by Markov kernels of the form $\pi\colon S \to \Delta_A$. We denote the set of stationary policies by $\Delta_{S,A}$. We write $p^\pi(a|w) = \sum_s \beta(s|w)\,\pi(a|s)$ for the effective world state policy. Standard reference texts are [6, 5].

We assume that the Markov chain starts with a distribution $\mu \in \Delta_W$ and then progresses according to $\alpha$, $\beta$ and a fixed policy $\pi$. We denote by $\mu^t_\pi \in \Delta_W$ the distribution of the world state at time $t$. It is well known that the limit $p^\pi_\mu := \lim_{T\to\infty} \frac{1}{T} \sum_{t=0}^{T-1} \mu^t_\pi$ exists and is a stationary distribution of the Markov chain. The following technical assumption is commonly made:

(∗) For all $\pi$, the Markov chain over world states is aperiodic and irreducible.

The most important implication of irreducibility is that the limit distribution $p^\pi_\mu$ is independent of $\mu$. If the chain has period $s$, then $p^\pi_\mu = \lim_{T\to\infty} \frac{1}{s} \sum_{t=1}^{s} \mu^{T+t}_\pi$. In particular, under assumption (∗), $\mu^t_\pi \to p^\pi_\mu$ for any $\mu$. (Since we assume finite sets, all notions of convergence of probability distributions are equivalent.)

The objective of learning is to maximize the expected value of a long term reward. The (normalized) discounted reward with discount factor $\gamma \in [0,1)$ is
$$\mathcal{R}^\gamma_\mu(\pi) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \sum_w \mu^t_\pi(w) \sum_a p^\pi(a|w)\, R(w,a) = (1-\gamma)\, \mathbb{E}_{\pi,\mu}\Big[ \sum_{t=0}^{\infty} \gamma^t R(w_t, a_t) \Big].$$
The average reward is
$$\mathcal{R}_\mu(\pi) = \sum_w p^\pi_\mu(w) \sum_a p^\pi(a|w)\, R(w,a).$$
Under assumption (∗), $\mathcal{R}_\mu$ is independent of the choice of $\mu$ and depends continuously on $\pi$, as we show next. Since $\Delta_{S,A}$ is compact, the existence of optimal policies is guaranteed. Without assumption (∗), optimal policies for $\mathcal{R}_\mu$ need not exist. On the other hand, the expected discounted reward $\mathcal{R}^\gamma_\mu$ is always continuous, so that, for this, optimal policies always exist.

Lemma 1. Under assumption (∗), $\mathcal{R}_\mu(\pi)$ is continuous as a function of $\pi$.

Proof. By (∗), $p^\pi_\mu$ is the unique solution to a linear system of equations that smoothly depends on $\pi$. Thus, $\mathcal{R}_\mu$ is continuous as a function of $\pi$. ∎

Lemma 2. For fixed $\mu$ and $\gamma \in [0,1)$, $\mathcal{R}^\gamma_\mu(\pi)$ is continuous as a function of $\pi$.

Proof. Fix $\varepsilon > 0$. There exists $l > 0$ such that $(1-\gamma) \sum_{t=l}^{\infty} \gamma^t R \le \varepsilon/4$, where $R = \max_{w,a} |R(w,a)|$. For each $t$, the distribution $\mu^t_\pi$ depends continuously on $\pi$. For fixed $\pi$, let $U$ be a neighborhood of $\pi$ such that $|\mu^t_\pi(w) - \mu^t_{\pi'}(w)| \le \frac{\varepsilon}{2|W|R}$ for $t = 0, \dots, l-1$, $w \in W$ and $\pi' \in U$. Then, for all $\pi' \in U$,
$$|\mathcal{R}^\gamma_\mu(\pi) - \mathcal{R}^\gamma_\mu(\pi')| \le \frac{\varepsilon}{2} + (1-\gamma) \sum_{t=0}^{l-1} \gamma^t \sum_w |\mu^t_\pi(w) - \mu^t_{\pi'}(w)|\, R \le \frac{\varepsilon}{2} + |W| \frac{\varepsilon}{2|W|R} R = \varepsilon. \qquad ∎$$
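
The quantities just defined are straightforward to compute for small finite POMDPs. The following is a minimal sketch (ours, not code from the paper) under an assumed array layout: beta as a |W|×|S| matrix with beta[w, s] = β(s|w), pi as |S|×|A|, alpha as |W|×|A|×|W| with alpha[w, a, w'] = α(w'|w, a), R as |W|×|A|, and mu as a length-|W| vector.

```python
import numpy as np

def effective_policy(beta, pi):
    """Effective world state policy p^pi(a|w) = sum_s beta(s|w) pi(a|s)."""
    return beta @ pi

def world_transition(alpha, p_eff):
    """Transition matrix of the world state chain, T[w, w'] = sum_a p^pi(a|w) alpha(w'|w, a)."""
    return np.einsum('wa,wav->wv', p_eff, alpha)

def discounted_reward(mu, alpha, beta, R, pi, gamma, tol=1e-12):
    """Normalized discounted reward R^gamma_mu(pi), truncating the series once gamma^t < tol."""
    p_eff = effective_policy(beta, pi)
    T = world_transition(alpha, p_eff)
    r = (p_eff * R).sum(axis=1)            # expected one-step reward per world state
    total, mu_t, disc = 0.0, np.asarray(mu, dtype=float), 1.0
    while disc > tol:
        total += disc * (mu_t @ r)
        mu_t = mu_t @ T
        disc *= gamma
    return (1.0 - gamma) * total

def average_reward(mu, alpha, beta, R, pi, iters=10_000):
    """Average reward R_mu(pi), approximating p^pi_mu by a Cesaro average of mu^t_pi."""
    p_eff = effective_policy(beta, pi)
    T = world_transition(alpha, p_eff)
    r = (p_eff * R).sum(axis=1)
    mu_t = np.asarray(mu, dtype=float)
    acc = np.zeros_like(mu_t)
    for _ in range(iters):
        acc += mu_t
        mu_t = mu_t @ T
    return (acc / iters) @ r
```

Truncating the geometric series and Cesàro-averaging the state distributions are crude approximations that suffice for illustration; under assumption (∗) one could instead solve directly for the stationary distribution.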

The following refinement of the analysis of [4] is our main result.

Theorem 1. Consider a POMDP $(W, S, A, \alpha, \beta, R)$, and let $\mu \in \Delta_W$ and $\gamma \in [0,1)$. There is a stationary (memoryless, time independent) policy $\pi^* \in \Delta_{S,A}$ with $|\operatorname{supp}(\pi^*(\cdot|s))| \le |\operatorname{supp}(\beta(s|\cdot))|$ for all $s \in S$ and $\mathcal{R}^\gamma_\mu(\pi^*) \ge \mathcal{R}^\gamma_\mu(\pi)$ for all $\pi \in \Delta_{S,A}$. Under assumption (∗), the same holds true for $\mathcal{R}_\mu$ in place of $\mathcal{R}^\gamma_\mu$.

We prove the discounted case in Section 3 and the average case in Section 4.

3 Discounted Rewards from Policy Improvement

The state value function $V^\pi$ of a policy $\pi$ is defined as the unique solution of the Bellman equation
$$V^\pi(w) = \sum_a p^\pi(a|w) \Big[ R(w,a) + \gamma \sum_{w'} \alpha(w'|w,a)\, V^\pi(w') \Big], \qquad w \in W.$$
It is useful to write $V^\pi(w) = \sum_a p^\pi(a|w)\, Q^\pi(w,a)$, where
$$Q^\pi(w,a) = R(w,a) + \gamma \sum_{w'} \alpha(w'|w,a)\, V^\pi(w'), \qquad w \in W,\ a \in A,$$
is the state action value function. Observe that $\mathcal{R}^\gamma_\mu(\pi) = (1-\gamma) \sum_w \mu(w)\, V^\pi(w)$. If two policies $\pi, \pi'$ satisfy $V^{\pi'}(w) \ge V^\pi(w)$ for all $w$, then $\mathcal{R}^\gamma_\mu(\pi') \ge \mathcal{R}^\gamma_\mu(\pi)$ for all $\mu$. The following is a more explicit version of a lemma from [4]:

Lemma 3 (Policy improvement lemma). Let $\pi, \pi' \in \Delta_{S,A}$ and $\varepsilon(w) = \sum_a p^{\pi'}(a|w)\, Q^\pi(w,a) - V^\pi(w)$ for all $w \in W$. Then
$$V^{\pi'}(w) = V^\pi(w) + \mathbb{E}_{\pi', w_0 = w}\Big[ \sum_{t=0}^{\infty} \gamma^t \varepsilon(w_t) \Big] \qquad \text{for all } w \in W.$$
If $\varepsilon(w) \ge 0$ for all $w \in W$, then $V^{\pi'}(w) \ge V^\pi(w) + d^{\pi'}(w)\, \varepsilon(w)$ for all $w \in W$, where $d^{\pi'}(w) = \sum_{t=0}^{\infty} \gamma^t \Pr(w_t = w \mid \pi', w_0 = w) \ge 1$ is the discounted expected number of visits to $w$.

Proof.
$$\begin{aligned}
V^\pi(w) &= \sum_a p^{\pi'}(a|w)\, Q^\pi(w,a) - \varepsilon(w) \\
&= \mathbb{E}_{\pi', w_0 = w}\Big[ \big( R(w_0, a_0) - \varepsilon(w_0) \big) + \gamma V^\pi(w_1) \Big] \\
&= \mathbb{E}_{\pi', w_0 = w}\Big[ \big( R(w_0, a_0) - \varepsilon(w_0) \big) + \gamma \Big( \sum_a p^{\pi'}(a|w_1)\, Q^\pi(w_1, a) - \varepsilon(w_1) \Big) \Big] \\
&= \mathbb{E}_{\pi', w_0 = w}\Big[ \sum_{t=0}^{\infty} \gamma^t \big( R(w_t, a_t) - \varepsilon(w_t) \big) \Big]
 = V^{\pi'}(w) - \mathbb{E}_{\pi', w_0 = w}\Big[ \sum_{t=0}^{\infty} \gamma^t \varepsilon(w_t) \Big]. \qquad ∎
\end{aligned}$$

Lemma 3 allows us to find policy changes that increase $V^\pi(w)$ for all $w \in W$ and thereby $\mathcal{R}^\gamma_\mu(\pi)$ for any $\mu$.

Definition 1. Fix a policy $\pi \in \Delta_{S,A}$. For each sensor state $s \in S$ consider the set $\operatorname{supp}(\beta(s|\cdot)) = \{ w \in W : \beta(s|w) > 0 \} = \{ w^s_1, \dots, w^s_{k_s} \}$, and define the linear forms
$$l^{\pi,s}_i\colon \Delta_A \to \mathbb{R};\quad q \mapsto \sum_a q(a)\, Q^\pi(w^s_i, a), \qquad i = 1, \dots, k_s.$$
The policy improvement cone at policy $\pi$ and sensation $s$ is
$$L^{\pi,s} = \big\{ q \in \Delta_A : l^{\pi,s}_i(q) \ge l^{\pi,s}_i(\pi(\cdot|s)) \text{ for all } i = 1, \dots, k_s \big\}.$$
The (total) policy improvement cone at policy $\pi$ is
$$L^\pi = \big\{ \pi' \in \Delta_{S,A} : \pi'(\cdot|s) \in L^{\pi,s} \text{ for all } s \in S \big\}.$$
$L^{\pi,s}$ and $L^\pi$ are intersections of $\Delta_A$ and $\Delta_{S,A}$ with intersections of affine halfspaces (see Fig. 1). Since $\pi \in L^\pi$, the policy improvement cones are never empty.

Lemma 4. Let $\pi \in \Delta_{S,A}$ and $\pi' \in L^\pi$. Then, for all $w$,
$$V^{\pi'}(w) - V^\pi(w) \ge d^{\pi'}(w) \sum_s \beta(s|w) \sum_a \big( \pi'(a|s) - \pi(a|s) \big) Q^\pi(w,a) \ge 0.$$

Proof. Fix $w \in W$. In the notation from Lemma 3, suppose that $\operatorname{supp}(\beta(\cdot|w)) = \{ s_1, \dots, s_l \}$ and that $w = w^{s_j}_{i_j}$ for $j = 1, \dots, l$. Then
$$\varepsilon(w) = \sum_a p^{\pi'}(a|w)\, Q^\pi(w,a) - \sum_a p^\pi(a|w)\, Q^\pi(w,a) = \sum_{j=1}^{l} \beta(s_j|w)\, l^{\pi,s_j}_{i_j}\big( \pi'(\cdot|s_j) - \pi(\cdot|s_j) \big) \ge 0,$$
since $\pi' \in L^\pi$. The statement now follows from Lemma 3. ∎

Remark 1. Lemma 4 relates to the policy gradient theorem [7], which says that
$$\frac{\partial V^\pi(w)}{\partial \pi(a'|s')} = d^\pi(w) \sum_s \beta(s|w) \sum_a \frac{\partial \pi(a|s)}{\partial \pi(a'|s')}\, Q^\pi(w,a). \qquad (1)$$
Our result adds that, for each $w$, the value function $V^{\pi'}(w)$ is bounded from below by a linear function of $\pi'$ that takes value at least $V^\pi(w)$ within the entire policy improvement cone $L^\pi$. See Fig. 1.
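
Lemma 3 and Lemma 4 suggest a direct numerical check: since the Bellman equation is linear in $V^\pi$, one can solve for $V^\pi$ and $Q^\pi$ and then test whether a candidate policy $\pi'$ lies in the policy improvement cone $L^\pi$. The sketch below is our own illustration, reusing the array layout assumed in the previous snippet.

```python
import numpy as np

def value_functions(alpha, beta, R, pi, gamma):
    """Solve V = r_pi + gamma * T_pi V for the state value function V^pi, then form Q^pi."""
    p_eff = beta @ pi                                   # p^pi(a|w)
    T = np.einsum('wa,wav->wv', p_eff, alpha)           # T_pi[w, w']
    r = (p_eff * R).sum(axis=1)                         # expected one-step reward
    V = np.linalg.solve(np.eye(len(r)) - gamma * T, r)  # Bellman equation is linear in V
    Q = R + gamma * np.einsum('wav,v->wa', alpha, V)    # Q^pi(w, a)
    return V, Q

def in_improvement_cone(beta, Q, pi, pi_new, tol=1e-10):
    """Check pi_new in L^pi: for every s and every w in supp(beta(s|.)),
    the linear form q -> sum_a q(a) Q^pi(w, a) must not decrease."""
    for s in range(pi.shape[0]):
        for w in np.nonzero(beta[:, s] > 0)[0]:
            if Q[w] @ pi_new[s] < Q[w] @ pi[s] - tol:
                return False
    return True
```

Whenever in_improvement_cone returns True, Lemma 4 guarantees $V^{\pi'}(w) \ge V^\pi(w)$ for every $w$, and hence $\mathcal{R}^\gamma_\mu(\pi') \ge \mathcal{R}^\gamma_\mu(\pi)$ for every $\mu$.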

Now we show that there is an optimal policy with small support.

Lemma 5. Let $P$ be a polytope, and let $l_1, \dots, l_k$ be linear forms on $P$. For any $p \in P$, let $L_{i,+} = \{ q \in P : l_i(q) \ge l_i(p) \}$. Then $\bigcap_{i=1}^{k} L_{i,+}$ contains an element $q$ that belongs to a face of $P$ of dimension at most $k-1$.

Proof. The argument is by induction. For $k = 1$, the maximum of $l_1$ on $P$ is attained at a vertex $q$ of $P$. Clearly, $l_1(q) \ge l_1(p)$, and so $q \in L_{1,+}$. Now suppose that $k > 1$. Let $P' := P \cap L_{k,+}$. Each face of $P'$ is a subset of a face of $P$ of at most one more dimension. By induction, $\bigcap_{i=1}^{k-1} L_{i,+} \cap P'$ contains an element $q$ that belongs to a face of $P'$ of dimension at most $k-2$. ∎

Proof (of Theorem 1 for discounted rewards). By Lemma 5, each policy improvement cone $L^{\pi,s}$ contains an element $q$ that belongs to a face of $\Delta_A$ of dimension at most $k-1$ (that is, the support of $q$ has cardinality at most $k$), where $k = |\operatorname{supp}(\beta(s|\cdot))|$. Putting these together, we find a policy $\pi'$ in the total policy improvement cone that satisfies $|\operatorname{supp}(\pi'(\cdot|s))| \le |\operatorname{supp}(\beta(s|\cdot))|$ for all $s$. By Lemma 4, $V^{\pi'}(w) \ge V^\pi(w)$ for all $w$, and so $\mathcal{R}^\gamma_\mu(\pi') \ge \mathcal{R}^\gamma_\mu(\pi)$. ∎

Remark 2. The $|\operatorname{supp}(\beta(s|\cdot))|$ positive probability actions at sensation $s$ do not necessarily correspond to the actions that the agent would choose if she knew the identity of the world state, as our example in Section 5 shows.

4 Average Rewards from Discounted Rewards

The average reward per time step can be written in terms of the discounted reward as $\mathcal{R}(\pi) = \mathcal{R}^\gamma_{p^\pi_\mu}(\pi)$. However, the hypothesis $V^{\pi'}(w) \ge V^\pi(w)$ for all $w$ does not directly imply any relation between $\mathcal{R}(\pi')$ and $\mathcal{R}(\pi)$, since they compare the value function against different stationary distributions. We show that results for discounted rewards translate nonetheless to results for average rewards.

Fig. 1. Left: Illustration of the policy improvement cone. Right: Illustration of the state value function $V^\pi(w)$ for some fixed $w$, showing the linear lower bound over the policy improvement cone $L^{\pi,s}$. This numerical example is discussed further in Section 5.

Lemma 6. Let $\mu$ be fixed, and assume (∗). For any $\varepsilon > 0$ there exists $l > 0$ such that for all $\pi$ and all $t \ge l$, $|\mu^t_\pi(w) - p^\pi_\mu(w)| \le \varepsilon$ for all $w$.

Proof. By (∗), the transition matrix of the Markov chain has the eigenvalue one with multiplicity one, with left eigenvector denoted by $p^\pi_\mu$. Let $p_2, \dots, p_{|W|}$ be orthonormal left eigenvectors to the other eigenvalues $\lambda_2, \dots, \lambda_{|W|}$, ordered such that $\lambda_2$ has the largest absolute value. There is a unique expansion $\mu = c_1 p^\pi_\mu + c_2 p_2 + \dots + c_{|W|} p_{|W|}$. Then $\mu^t_\pi = c_1 p^\pi_\mu + \sum_{i=2}^{|W|} c_i \lambda_i^t p_i$. Letting $t \to \infty$, it follows that $c_1 = 1$. By orthonormality, $|c_i|^2 \le \sum_{i=2}^{|W|} c_i^2 \le \|\mu\|_2^2 \le 1$ and $|p_i(w)| \le 1$ for $i = 2, \dots, |W|$. Therefore, $|\mu^t_\pi(w) - p^\pi_\mu(w)| = \big| \sum_{i=2}^{|W|} c_i \lambda_i^t p_i(w) \big| \le |W| |\lambda_2|^t$. $|\lambda_2|$ depends continuously on the transition matrix, which itself depends continuously on $\pi$. Since $\Delta_{S,A}$ is compact, $|\lambda_2| = |\lambda_2(\pi)|$ has a maximum $d$, and $d < 1$ due to (∗). Therefore, $|\mu^t_\pi(w) - p^\pi_\mu(w)| \le |W| d^t$ for all $\pi$. The statement follows from this. ∎

Proposition 1. For fixed $\mu$, under assumption (∗), $\mathcal{R}^\gamma_\mu(\pi) \to \mathcal{R}_\mu(\pi)$ uniformly in $\pi$ as $\gamma \to 1$.

Proof. For fixed $\mu$ and $\varepsilon$, let $l$ be as in Lemma 6. Let $R = \max_{w,a} |R(w,a)|$. Then
$$\mathcal{R}^\gamma_\mu(\pi) = (1-\gamma) \sum_{k=0}^{l-1} \gamma^k \sum_w \mu^k_\pi(w) \sum_a p^\pi(a|w) R(w,a) + (1-\gamma) \gamma^l \sum_{k=0}^{\infty} \gamma^k \sum_w p^\pi_\mu(w) \sum_a p^\pi(a|w) R(w,a) + O(\varepsilon R)(1-\gamma) \sum_{k=0}^{\infty} \gamma^k = O\big((1-\gamma) l R\big) + O(\varepsilon R) + \gamma^l \mathcal{R}_\mu(\pi)$$
for all $\pi$. For given $\delta > 0$, we can choose $\varepsilon > 0$ such that the term $O(\varepsilon R)$ is smaller in absolute value than $\delta/3$. This also fixes $l = l(\varepsilon)$. Then, for any $\gamma < 1$ large enough, the term $O((1-\gamma) l R)$ is smaller than $\delta/3$, and also $|(\gamma^l - 1)\mathcal{R}_\mu(\pi)| \le \delta/3$. This shows that for $\gamma < 1$ large enough, $|\mathcal{R}^\gamma_\mu(\pi) - \mathcal{R}_\mu(\pi)| \le \delta$, independent of $\pi$. The statement follows since $\delta > 0$ was arbitrary. ∎
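
Proposition 1 can be observed numerically on any small POMDP satisfying (∗). The sketch below reuses discounted_reward and average_reward from the earlier snippet; the kernels, rewards, and policy are arbitrary illustrative choices of ours, not the example of Section 5.

```python
import numpy as np

# Hypothetical 2-state, 1-sensation, 2-action POMDP (illustrative only).
alpha = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])   # alpha[w, a, w']
beta = np.array([[1.0], [1.0]])                # both world states give the same sensation
R = np.array([[1.0, 0.0], [0.0, 2.0]])         # R[w, a]
mu = np.array([1.0, 0.0])
pi = np.array([[0.3, 0.7]])                    # pi[s, a]

r_avg = average_reward(mu, alpha, beta, R, pi)
for gamma in (0.5, 0.9, 0.99, 0.999):
    gap = abs(discounted_reward(mu, alpha, beta, R, pi, gamma) - r_avg)
    print(f"gamma = {gamma}: |R^gamma_mu - R_mu| = {gap:.2e}")
```

The printed gap shrinks as $\gamma$ approaches 1, in line with Proposition 1.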

Theorem 2. For any $\gamma \in [0,1)$, let $\hat\pi_\gamma$ be a policy that maximizes $\mathcal{R}^\gamma_\mu$. Let $\hat\pi$ be a limit point of a convergent subsequence as $\gamma \to 1$. Then $\hat\pi$ maximizes $\mathcal{R}_\mu$, and $\lim_{\gamma \to 1} \mathcal{R}^\gamma_\mu(\hat\pi_\gamma) = \mathcal{R}_\mu(\hat\pi)$.

Proof. For any $\varepsilon > 0$, there is $\delta > 0$ such that $\gamma \ge 1 - \delta$ implies $|\mathcal{R}_\mu(\pi) - \mathcal{R}^\gamma_\mu(\pi)| \le \varepsilon$ for all $\pi$. Thus $|\max_\pi \mathcal{R}_\mu(\pi) - \max_\pi \mathcal{R}^\gamma_\mu(\pi)| \le \varepsilon$, whence $\lim_{\gamma \to 1} \max_\pi \mathcal{R}^\gamma_\mu(\pi) = \max_\pi \mathcal{R}_\mu(\pi)$. Moreover, $|\max_\pi \mathcal{R}_\mu(\pi) - \mathcal{R}_\mu(\hat\pi_\gamma)| \le 2\varepsilon + |\max_\pi \mathcal{R}^\gamma_\mu(\pi) - \mathcal{R}^\gamma_\mu(\hat\pi_\gamma)| = 2\varepsilon$. By continuity, the limit value of $\mathcal{R}_\mu$ applied to a convergent subsequence of the $\hat\pi_\gamma$ is the maximum of $\mathcal{R}_\mu$. ∎

Corollary 1. Fix a sensor state $s$, and let $r \ge 0$. If there exists for each $\gamma \in [0,1)$ a policy $\hat\pi_\gamma$ that is optimal for $\mathcal{R}^\gamma_\mu$ with $|\operatorname{supp}(\hat\pi_\gamma(\cdot|s))| \le r$, then there exists a policy $\hat\pi$ with $|\operatorname{supp}(\hat\pi(\cdot|s))| \le r$ that is optimal for $\mathcal{R}_\mu$.

Proof. Take a limit point of the family $\hat\pi_\gamma$ as $\gamma \to 1$ and apply Theorem 2. ∎

Remark 3. Without (∗), one can show that $\mathcal{R}^\gamma_\mu(\pi)$ still converges to $\mathcal{R}_\mu(\pi)$ for each fixed $\pi$, but convergence is no longer uniform. Also, $\mathcal{R}_\mu$ need not be continuous in $\pi$, and so an optimal policy need not exist.

5 Example

We illustrate our results on an example from [4]. Consider an agent with sensor states $S = \{1,2,3\}$ and actions $A = \{1,2,3\}$. The system has world states $W = \{1,2,3,4\}$ with the transitions and rewards illustrated in Fig. 2. At $w = 1, 4$ all actions produce the same outcomes. States $w = 2, 3$ are observed as $s = 2$. Hence we can focus on $\pi(\cdot|s=2) \in \Delta_A$. We evaluate 861 evenly spaced policies in this 2-simplex. Fig. 2 shows color maps of the expected reward (interpolated between evaluations), with lighter colors corresponding to higher values. As in Fig. 1, red vectors are the gradients of the linear forms (corresponding to $Q^\pi(w, \cdot)$, $w = 2, 3$), and dashed blue lines limit the policy improvement cones $L^{\pi, s=2}$. Stepping into the improvement cone always increases $V^\pi(w)$, and hence $\mathcal{R}^\gamma_{\mu = \delta_w}(\pi)$, for all $w \in W$. Note that each cone contains a policy at an edge of the simplex, i.e., assigning positive probability to at most two actions. The convergence of $\mathcal{R}^\gamma_\mu$ to $\mathcal{R}_\mu$ as $\gamma \to 1$ is visible. Note also that for $\gamma = 0.6$ the optimal policy requires two positive probability actions, so that our upper bound $|\operatorname{supp}(\pi(\cdot|s))| \le |\operatorname{supp}(\beta(s|\cdot))|$ is attained.

Acknowledgment: We thank Nihat Ay for support and insightful comments.

Fig. 2. Illustration of the example from Section 5. Top: State transitions and reward signal, with reward matrix
$$R = \begin{pmatrix} 0 & 0 & 0 \\ -1 & +1 & -0.1 \\ +1 & -1 & -0.1 \\ +1 & +1 & +1 \end{pmatrix}$$
(rows indexed by world states $1, \dots, 4$, columns by actions $1, 2, 3$). Bottom: Numerical evaluation of the expected long term reward.

References

1. N. Ay, G. Montúfar, and J. Rauh. Selection criteria for neuromanifolds of stochastic dynamics. In Advances in Cognitive Neurodynamics (III), pages 147–154. Springer Netherlands, 2013.
2. M. Hutter. General discounting versus average reward. In Algorithmic Learning Theory 17, pages 244–258. Springer Berlin Heidelberg, 2006.
3. S. Kakade. Optimizing average reward using discounted rewards. In Computational Learning Theory 14, pages 605–615. Springer Berlin Heidelberg, 2001.
4. G. Montúfar, K. Ghazi-Zahedi, and N. Ay. Geometry and determinism of optimal stationary control in partially observable Markov decision processes. arXiv:1503.07206, 2015.
5. S. M. Ross. Introduction to Stochastic Dynamic Programming. Academic Press, Inc., 1983.
6. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
7. R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063. MIT Press, 2000.
8. J. N. Tsitsiklis and B. Van Roy. On average versus discounted reward temporal-difference learning. Machine Learning, 49(2):179–191, 2002.
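
The policy scan described in Section 5 can be reproduced with the routines sketched earlier (discounted_reward in particular): 861 evenly spaced points correspond to a step of 1/40 on the 2-simplex of policies $\pi(\cdot|s=2)$. Since the transition and sensor kernels of the example are given only in Fig. 2 and are not reproduced here, alpha, beta, R, mu, and base_pi below are placeholders to be filled in; only the grid construction and the scan itself are spelled out.

```python
import numpy as np

def simplex_grid(n):
    """Evenly spaced points on the 2-simplex with step 1/n; n = 40 gives 861 points."""
    return np.array([(i / n, j / n, (n - i - j) / n)
                     for i in range(n + 1) for j in range(n + 1 - i)])

def scan_policies(alpha, beta, R, mu, gamma, base_pi, s=1, n=40):
    """Evaluate R^gamma_mu over policies that agree with base_pi except in row s
    (sensor state s = 2 of the example, using 0-based indexing), where pi(.|s)
    ranges over the simplex grid."""
    grid = simplex_grid(n)
    values = np.empty(len(grid))
    for idx, q in enumerate(grid):
        pi = base_pi.copy()
        pi[s] = q
        values[idx] = discounted_reward(mu, alpha, beta, R, pi, gamma)
    return grid, values

# Usage (with the example's kernels filled in):
#   grid, values = scan_policies(alpha, beta, R, mu, gamma=0.6, base_pi=base_pi)
#   print(grid[values.argmax()])   # best pi(.|s=2) on the grid
```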