License: Creative Commons Attribution 3.0 Unported license (CC BY 3.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.CONCUR.2018.8
URN: urn:nbn:de:0030-drops-95468
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2018/9546/


Kretínský, Jan; Pérez, Guillermo A.; Raskin, Jean-François

Learning-Based Mean-Payoff Optimization in an Unknown MDP under Omega-Regular Constraints

Full text (PDF): LIPIcs-CONCUR-2018-8.pdf (0.5 MB)


Abstract

We formalize the problem of maximizing the mean-payoff value with high probability while satisfying a parity objective in a Markov decision process (MDP) with unknown probabilistic transition function and unknown reward function. Assuming the support of the unknown transition function and a lower bound on the minimal transition probability are known in advance, we show that in MDPs consisting of a single end component, two combinations of guarantees on the parity and mean-payoff objectives can be achieved, depending on how much memory one is willing to use. (i) For all epsilon and gamma, we can construct an online-learning finite-memory strategy that almost surely satisfies the parity objective and achieves an epsilon-optimal mean payoff with probability at least 1 - gamma. (ii) Alternatively, for all epsilon and gamma, there exists an online-learning infinite-memory strategy that satisfies the parity objective surely and achieves an epsilon-optimal mean payoff with probability at least 1 - gamma. We extend the above results to MDPs consisting of more than one end component in a natural way. Finally, we show that the aforementioned guarantees are tight, i.e., there are MDPs for which stronger combinations of the guarantees cannot be ensured.
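
To make the learning setting concrete, below is a minimal Python sketch, not the authors' construction: it simulates an MDP whose transition support is known in advance while the exact probabilities and rewards are hidden and must be estimated online from observed samples. All identifiers (UnknownMDP, EmpiricalModel, the two-state example) are invented for illustration; in the paper's setting, the known lower bound on the minimal transition probability is what makes it possible to attach confidence bounds, quantified by epsilon and gamma, to estimates of this kind.

# Illustrative sketch only; not the algorithm from the paper.
import random
from collections import defaultdict

class UnknownMDP:
    """Simulator with hidden transition probabilities and rewards.

    transitions: (state, action) -> {successor: probability}
    rewards:     (state, action) -> float (deterministic, for simplicity)
    """
    def __init__(self, transitions, rewards):
        self._transitions = transitions
        self._rewards = rewards

    def support(self, state, action):
        # Known in advance, per the paper's assumption.
        return set(self._transitions[(state, action)])

    def step(self, state, action):
        # Sample a successor from the hidden distribution.
        dist = self._transitions[(state, action)]
        succs, probs = zip(*dist.items())
        nxt = random.choices(succs, weights=probs)[0]
        return nxt, self._rewards[(state, action)]

class EmpiricalModel:
    """Online frequency estimates of the unknown quantities."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.reward_sum = defaultdict(float)
        self.visits = defaultdict(int)

    def record(self, state, action, nxt, reward):
        self.counts[(state, action)][nxt] += 1
        self.reward_sum[(state, action)] += reward
        self.visits[(state, action)] += 1

    def estimate(self, state, action):
        # Empirical transition distribution and mean reward so far.
        n = self.visits[(state, action)]
        probs = {s: c / n for s, c in self.counts[(state, action)].items()}
        return probs, self.reward_sum[(state, action)] / n

# Example: a two-state MDP forming a single end component.
mdp = UnknownMDP(
    transitions={
        ("s0", "a"): {"s0": 0.3, "s1": 0.7},
        ("s1", "a"): {"s0": 0.5, "s1": 0.5},
    },
    rewards={("s0", "a"): 1.0, ("s1", "a"): 0.0},
)
model, state = EmpiricalModel(), "s0"
for _ in range(10_000):
    nxt, r = mdp.step(state, "a")
    model.record(state, "a", nxt, r)
    state = nxt
print(model.estimate("s0", "a"))  # estimates approach ({s0: 0.3, s1: 0.7}, 1.0)

In the paper, such estimation must additionally be interleaved with play that keeps the parity objective satisfied (almost surely or surely), which is where the finite- versus infinite-memory trade-off in guarantees (i) and (ii) arises.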

BibTeX - Entry

@InProceedings{kretnsk_et_al:LIPIcs:2018:9546,
  author =	{Jan Kret{\'i}nsk{\'y} and Guillermo A. P{\'e}rez and Jean-Fran{\c{c}}ois Raskin},
  title =	{{Learning-Based Mean-Payoff Optimization in an Unknown MDP under Omega-Regular Constraints}},
  booktitle =	{29th International Conference on Concurrency Theory (CONCUR 2018)},
  pages =	{8:1--8:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-087-3},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{118},
  editor =	{Sven Schewe and Lijun Zhang},
  publisher =	{Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{http://drops.dagstuhl.de/opus/volltexte/2018/9546},
  URN =		{urn:nbn:de:0030-drops-95468},
  doi =		{10.4230/LIPIcs.CONCUR.2018.8},
  annote =	{Keywords: Markov decision processes, Reinforcement learning, Beyond worst case}
}

Keywords: Markov decision processes, Reinforcement learning, Beyond worst case
Collection: 29th International Conference on Concurrency Theory (CONCUR 2018)
Issue Date: 2018
Date of publication: 31.08.2018

