dynamic programming performance comparison

up vote
-2
down vote

favorite

Assume we have some Markov decision process (MDP) with state space $s$, cost function $c:smapstomathbb R_+ $ and transition probabilities $[textPr(s'|s,a)]$, where $aina$ corresponds to some action. Consider a infinite horizon dynamic programming problem with discount factor $gamma$, for some state dependent deterministic policy $pi:smapstoa$, the value $V_pi(s)$ of state $s$ under policy $pi$ (according to various text books) is defined to be
$$
V_pi(s)=mathbb E_s_t,pi(s_t))left[left.sum_t=0^inftygamma^tc(s_t)right|s_0=sright].
$$
According to Proposition 5.4.1 in Dynamic Programming and Optimal Control (4th edition, volume 1, and by Dr. Bertsekas), policy $pi$ is optimal if and only if for every $sins$, $V_pi(s)$ attains the minimum.

My question is, for two policies $pi_1$ and $pi_2$ not necessarily being optimal, how do we compare the performance? I've considered the following two metrics:

For each $sins$, we calculate $V_pi_1(s)$ and $V_pi_2(s)$ with the same set of sequences of events that is large enough to evaluate $V_pi_1(s)$ and $V_pi_2(s)$. If for each $s$, $V_pi_1(s)geq V_pi_2(s)$, then $pi_1$ is better than $pi_2$. However, I'm not sure what to conclude if the inequality only holds for a part of the states.

We simulate $pi_1$ and $pi_2$ with the same sample path (sequence of events) that is long enough, say, with length $T$ and make sure they are both in the steady state (stationary distribution) from time step $T/2$. Then from $T/2$, we calculate the following
$$
J(pi,epsilon)=frac2Tsum_t=T/2^T-T'sum_i=0^T'gamma^i c(s_t+i)
$$
with $gamma^T'<epsilon$ where $epsilon$ is chosen to be small (e.g. $10^-20$). If $J(pi_1,epsilon)>J(pi_2,epsilon)$, then $pi_1$ is better than $pi_2$. However, this method requires very fine selection of $T$ and $epsilon$ to guarantee the precision.

I would appreciate any help with this issue for the selection of an appropriate performance metric.

edited Aug 2 at 20:25

asked Aug 2 at 4:58

Queue

Performance metric is generally application dependent. You may consider MSE under normal circumstances.
â€“Â Creator
Aug 2 at 5:14

add a commentÂ |Â

up vote
-2
down vote

favorite

My question is, for two policies $pi_1$ and $pi_2$ not necessarily being optimal, how do we compare the performance? I've considered the following two metrics:

For each $sins$, we calculate $V_pi_1(s)$ and $V_pi_2(s)$ with the same set of sequences of events that is large enough to evaluate $V_pi_1(s)$ and $V_pi_2(s)$. If for each $s$, $V_pi_1(s)geq V_pi_2(s)$, then $pi_1$ is better than $pi_2$. However, I'm not sure what to conclude if the inequality only holds for a part of the states.

We simulate $pi_1$ and $pi_2$ with the same sample path (sequence of events) that is long enough, say, with length $T$ and make sure they are both in the steady state (stationary distribution) from time step $T/2$. Then from $T/2$, we calculate the following
$$
J(pi,epsilon)=frac2Tsum_t=T/2^T-T'sum_i=0^T'gamma^i c(s_t+i)
$$
with $gamma^T'<epsilon$ where $epsilon$ is chosen to be small (e.g. $10^-20$). If $J(pi_1,epsilon)>J(pi_2,epsilon)$, then $pi_1$ is better than $pi_2$. However, this method requires very fine selection of $T$ and $epsilon$ to guarantee the precision.

I would appreciate any help with this issue for the selection of an appropriate performance metric.

edited Aug 2 at 20:25

asked Aug 2 at 4:58

Queue

Performance metric is generally application dependent. You may consider MSE under normal circumstances.
â€“Â Creator
Aug 2 at 5:14

add a commentÂ |Â

up vote
-2
down vote

favorite

My question is, for two policies $pi_1$ and $pi_2$ not necessarily being optimal, how do we compare the performance? I've considered the following two metrics:

For each $sins$, we calculate $V_pi_1(s)$ and $V_pi_2(s)$ with the same set of sequences of events that is large enough to evaluate $V_pi_1(s)$ and $V_pi_2(s)$. If for each $s$, $V_pi_1(s)geq V_pi_2(s)$, then $pi_1$ is better than $pi_2$. However, I'm not sure what to conclude if the inequality only holds for a part of the states.

We simulate $pi_1$ and $pi_2$ with the same sample path (sequence of events) that is long enough, say, with length $T$ and make sure they are both in the steady state (stationary distribution) from time step $T/2$. Then from $T/2$, we calculate the following
$$
J(pi,epsilon)=frac2Tsum_t=T/2^T-T'sum_i=0^T'gamma^i c(s_t+i)
$$
with $gamma^T'<epsilon$ where $epsilon$ is chosen to be small (e.g. $10^-20$). If $J(pi_1,epsilon)>J(pi_2,epsilon)$, then $pi_1$ is better than $pi_2$. However, this method requires very fine selection of $T$ and $epsilon$ to guarantee the precision.

I would appreciate any help with this issue for the selection of an appropriate performance metric.

edited Aug 2 at 20:25

asked Aug 2 at 4:58

Queue

My question is, for two policies $pi_1$ and $pi_2$ not necessarily being optimal, how do we compare the performance? I've considered the following two metrics:

For each $sins$, we calculate $V_pi_1(s)$ and $V_pi_2(s)$ with the same set of sequences of events that is large enough to evaluate $V_pi_1(s)$ and $V_pi_2(s)$. If for each $s$, $V_pi_1(s)geq V_pi_2(s)$, then $pi_1$ is better than $pi_2$. However, I'm not sure what to conclude if the inequality only holds for a part of the states.

We simulate $pi_1$ and $pi_2$ with the same sample path (sequence of events) that is long enough, say, with length $T$ and make sure they are both in the steady state (stationary distribution) from time step $T/2$. Then from $T/2$, we calculate the following
$$
J(pi,epsilon)=frac2Tsum_t=T/2^T-T'sum_i=0^T'gamma^i c(s_t+i)
$$
with $gamma^T'<epsilon$ where $epsilon$ is chosen to be small (e.g. $10^-20$). If $J(pi_1,epsilon)>J(pi_2,epsilon)$, then $pi_1$ is better than $pi_2$. However, this method requires very fine selection of $T$ and $epsilon$ to guarantee the precision.

I would appreciate any help with this issue for the selection of an appropriate performance metric.

edited Aug 2 at 20:25

asked Aug 2 at 4:58

Queue

edited Aug 2 at 20:25

asked Aug 2 at 4:58

Queue

asked Aug 2 at 4:58

Queue

asked Aug 2 at 4:58

Queue

Performance metric is generally application dependent. You may consider MSE under normal circumstances.
â€“Â Creator
Aug 2 at 5:14

add a commentÂ |Â

Performance metric is generally application dependent. You may consider MSE under normal circumstances.
â€“Â Creator
Aug 2 at 5:14

Performance metric is generally application dependent. You may consider MSE under normal circumstances.
â€“Â Creator
Aug 2 at 5:14

add a commentÂ |Â

active

oldest

votes

Your Answer

StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\$","\$"]]);
);
);
, "mathjax-editing");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "69"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
noCode: true, onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f2869727%2fdynamic-programming-performance-comparison%23new-answer', 'question_page');

);

Post as a guest

Name

active

oldest

votes

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

Search This Blog

ukmuiik