Dynamic programming performance comparison

Assume we have a Markov decision process (MDP) with state space $\mathcal S$, cost function $c:\mathcal S\to\mathbb R_+$, and transition probabilities $[\Pr(s'\mid s,a)]$, where $a\in\mathcal A$ is an action. Consider an infinite-horizon dynamic programming problem with discount factor $\gamma$. For a state-dependent deterministic policy $\pi:\mathcal S\to\mathcal A$, the value $V_\pi(s)$ of state $s$ under policy $\pi$ is (according to various textbooks) defined to be
$$
V_\pi(s)=\mathbb E_{(s_t,\pi(s_t))}\left[\left.\sum_{t=0}^\infty\gamma^t c(s_t)\,\right|\,s_0=s\right].
$$
According to Proposition 5.4.1 in Dynamic Programming and Optimal Control (4th edition, volume 1) by D. Bertsekas, a policy $\pi$ is optimal if and only if $V_\pi(s)$ attains the minimum over all policies for every $s\in\mathcal S$.



My question is: for two policies $\pi_1$ and $\pi_2$, neither necessarily optimal, how do we compare their performance? I've considered the following two metrics:



  1. For each $s\in\mathcal S$, we estimate $V_{\pi_1}(s)$ and $V_{\pi_2}(s)$ on the same set of event sequences, taken large enough to evaluate both accurately. If $V_{\pi_1}(s)\leq V_{\pi_2}(s)$ for every $s$, then $\pi_1$ is at least as good as $\pi_2$ (recall we are minimizing cost). However, I'm not sure what to conclude if the inequality holds only for some of the states.


  2. We simulate $\pi_1$ and $\pi_2$ on the same sample path (sequence of events) that is long enough, say of length $T$, and make sure both chains have reached steady state (the stationary distribution) by time step $T/2$. Starting from $T/2$, we compute
    $$
    J(\pi,\epsilon)=\frac{2}{T}\sum_{t=T/2}^{T-T'}\sum_{i=0}^{T'}\gamma^i c(s_{t+i}),
    $$
    where $T'$ satisfies $\gamma^{T'}<\epsilon$ for a small $\epsilon$ (e.g. $10^{-20}$). If $J(\pi_1,\epsilon)<J(\pi_2,\epsilon)$, then $\pi_1$ is better than $\pi_2$. However, this method requires a very careful choice of $T$ and $\epsilon$ to guarantee precision.
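The state-wise comparison of metric 1 can be sketched with a small Monte Carlo evaluation. Everything in the toy MDP below (state/action counts, costs, transition probabilities, seeds, path counts) is a hypothetical illustration, not taken from the problem above; reusing the same seed for both policies implements the "same set of sequences of events" idea as common random numbers:

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions (all numbers are assumptions).
n_states, gamma = 3, 0.9
c = np.array([1.0, 2.0, 0.5])                             # cost c(s)
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, 2))  # Pr(s'|s,a)
pi1 = np.array([0, 1, 0])                                 # deterministic policies:
pi2 = np.array([1, 0, 1])                                 # state -> action

def mc_value(pi, s0, n_paths=300, horizon=100):
    """Monte Carlo estimate of V_pi(s0) = E[sum_t gamma^t c(s_t) | s_0 = s0]."""
    total = 0.0
    for k in range(n_paths):
        path_rng = np.random.default_rng(k)   # same seeds for both policies
        s, disc, ret = s0, 1.0, 0.0           # (common random numbers)
        for _ in range(horizon):              # gamma^horizon truncation error
            ret += disc * c[s]
            disc *= gamma
            s = path_rng.choice(n_states, p=P[s, pi[s]])
        total += ret
    return total / n_paths

V1 = np.array([mc_value(pi1, s) for s in range(n_states)])
V2 = np.array([mc_value(pi2, s) for s in range(n_states)])
if np.all(V1 <= V2):
    print("pi1 is at least as good as pi2 (lower cost in every state)")
elif np.all(V2 <= V1):
    print("pi2 is at least as good as pi1")
else:
    print("state-wise comparison is inconclusive")
```

The final branch makes the open issue explicit: when neither $V_{\pi_1}\le V_{\pi_2}$ nor the reverse holds everywhere, the state-wise order is only partial.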


I would appreciate any help with selecting an appropriate performance metric.
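One observation: when the transition probabilities $[\Pr(s'\mid s,a)]$ are known explicitly, as assumed above, $V_\pi$ can be computed exactly by solving the linear policy-evaluation system $V_\pi = c + \gamma P_\pi V_\pi$, which sidesteps the simulation-precision issues (the choice of $T$ and $\epsilon$) entirely. A minimal sketch, on a hypothetical toy MDP whose numbers are purely illustrative:

```python
import numpy as np

# Illustrative toy MDP (all numbers are assumptions, not from the question).
gamma = 0.9
c = np.array([1.0, 2.0, 0.5])                 # cost c(s)
# Pr(s'|s,a), shape (states, actions, next states); each row sums to 1.
P = np.array([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
              [[0.3, 0.3, 0.4], [0.5, 0.25, 0.25]],
              [[0.2, 0.2, 0.6], [0.4, 0.4, 0.2]]])
pi1 = np.array([0, 1, 0])                     # deterministic policies
pi2 = np.array([1, 0, 1])

def exact_value(pi):
    """Solve V = c + gamma * P_pi V, i.e. V = (I - gamma P_pi)^(-1) c."""
    P_pi = P[np.arange(len(pi)), pi]          # row s is Pr(. | s, pi(s))
    return np.linalg.solve(np.eye(len(pi)) - gamma * P_pi, c)

V1, V2 = exact_value(pi1), exact_value(pi2)
print(V1, V2)  # pi1 at least as good as pi2 iff V1 <= V2 component-wise
```

Since $\gamma<1$, the matrix $I-\gamma P_\pi$ is always invertible, so this solve is well defined for any policy.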







  • Performance metric is generally application dependent. You may consider MSE under normal circumstances.
    – Creator
    Aug 2 at 5:14














edited Aug 2 at 20:25
asked Aug 2 at 4:58
Queue 11









