Dynamic programming performance comparison
Assume we have a Markov decision process (MDP) with state space $\mathcal S$, cost function $c:\mathcal S\to\mathbb R_+$, and transition probabilities $[\Pr(s'\mid s,a)]$, where $a\in\mathcal A$ denotes an action. Consider an infinite-horizon dynamic programming problem with discount factor $\gamma$. For a state-dependent deterministic policy $\pi:\mathcal S\to\mathcal A$, the value $V_\pi(s)$ of state $s$ under policy $\pi$ is (according to various textbooks) defined to be
$$
V_\pi(s)=\mathbb E_{s_{t+1}\sim\Pr(\cdot\mid s_t,\pi(s_t))}\left[\left.\sum_{t=0}^\infty\gamma^t c(s_t)\,\right|\,s_0=s\right].
$$
According to Proposition 5.4.1 in Dynamic Programming and Optimal Control (4th edition, volume 1, by D. P. Bertsekas), a policy $\pi$ is optimal if and only if $V_\pi(s)$ attains the minimum for every $s\in\mathcal S$.
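(As an aside: for a finite MDP, $V_\pi$ can be computed exactly rather than estimated by simulation, by solving the linear system $V_\pi=c+\gamma P_\pi V_\pi$. A minimal sketch in Python; the representation of `P` as one transition matrix per action is an illustrative assumption, not a standard API.)

```python
import numpy as np

def policy_value(P, c, pi, gamma):
    """Exact value of a deterministic policy on a finite MDP.

    P[a][s, s'] -- transition probability Pr(s' | s, a), one matrix per action
    c[s]        -- per-state cost
    pi[s]       -- action chosen by the policy in state s
    Solves V = c + gamma * P_pi @ V, i.e. (I - gamma * P_pi) V = c.
    """
    n = len(c)
    # Row s of P_pi is the transition row of the action that pi takes in s.
    P_pi = np.array([P[pi[s]][s] for s in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, c)
```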
My question is: for two policies $\pi_1$ and $\pi_2$, neither necessarily optimal, how do we compare their performance? I've considered the following two metrics.
First, for each $s\in\mathcal S$, we estimate $V_{\pi_1}(s)$ and $V_{\pi_2}(s)$ on the same set of event sequences, taken large enough for both estimates to be accurate. Since costs are being minimized, if $V_{\pi_1}(s)\le V_{\pi_2}(s)$ for every $s$, then $\pi_1$ is at least as good as $\pi_2$. However, I'm not sure what to conclude if the inequality holds only for some of the states.
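(If the inequality holds only on part of the state space, the two policies are simply incomparable in the pointwise order. A common workaround is to fix an initial-state distribution $\mu$ and compare the scalars $\sum_s\mu(s)V_\pi(s)$. A sketch building on `policy_value` above; the uniform default for $\mu$ is an assumption for illustration.)

```python
import numpy as np

def compare_policies(P, c, pi1, pi2, gamma, mu=None):
    """Compare two policies; costs are minimized, so smaller values win."""
    V1 = policy_value(P, c, pi1, gamma)
    V2 = policy_value(P, c, pi2, gamma)
    if np.all(V1 <= V2):
        return "pi1 is at least as good as pi2 in every state"
    if np.all(V2 <= V1):
        return "pi2 is at least as good as pi1 in every state"
    # Neither policy dominates: fall back to the mu-weighted average value.
    mu = np.full(len(c), 1.0 / len(c)) if mu is None else np.asarray(mu)
    return "pi1 wins on average" if mu @ V1 < mu @ V2 else "pi2 wins on average"
```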
Second, we simulate $\pi_1$ and $\pi_2$ on the same sample path (sequence of events), long enough, say of length $T$, and make sure both chains have reached steady state (the stationary distribution) by time step $T/2$. Then, from $T/2$ onward, we calculate
$$
J(\pi,\epsilon)=\frac{2}{T}\sum_{t=T/2}^{T-T'}\sum_{i=0}^{T'}\gamma^i\, c(s_{t+i}),
$$
where $T'$ satisfies $\gamma^{T'}<\epsilon$ and $\epsilon$ is chosen small (e.g. $10^{-20}$). Since costs are minimized, if $J(\pi_1,\epsilon)<J(\pi_2,\epsilon)$, then $\pi_1$ is better than $\pi_2$. However, this method requires a very careful choice of $T$ and $\epsilon$ to guarantee precision.
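(A sketch of this estimator, assuming a hypothetical one-step simulator `step(s, a)` that returns the next state, with $T'$ taken as the smallest integer satisfying $\gamma^{T'}<\epsilon$:)

```python
import math
import numpy as np

def estimate_J(step, c, pi, s0, gamma, T, eps=1e-20):
    """Estimate J(pi, eps) from a single simulated path of length T.

    step(s, a) -- hypothetical one-step simulator returning the next state
    c(s)       -- per-state cost; pi(s) -- the policy's action in state s
    """
    # Smallest integer T' with gamma**T' < eps.
    Tp = math.ceil(math.log(eps) / math.log(gamma))
    if T // 2 + Tp > T:
        raise ValueError("T too short for this eps: need T/2 + T' <= T")
    # Roll out one trajectory under pi.
    path = [s0]
    for _ in range(T):
        path.append(step(path[-1], pi(path[-1])))
    costs = np.array([c(s) for s in path])
    discounts = gamma ** np.arange(Tp + 1)
    # Average of truncated discounted cost sums over the tail of the path.
    tail = [discounts @ costs[t:t + Tp + 1] for t in range(T // 2, T - Tp + 1)]
    return 2.0 / T * np.sum(tail)
```

(For $\gamma=0.9$ and $\epsilon=10^{-20}$ this already forces $T'\approx 440$, so $T$ must run to several thousand steps, which is exactly the sensitivity to $T$ and $\epsilon$ described above.)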
I would appreciate any help with selecting an appropriate performance metric.
dynamic-programming
asked Aug 2 at 4:58 · edited Aug 2 at 20:25 · Queue
Performance metric is generally application dependent. You may consider MSE under normal circumstances. – Creator, Aug 2 at 5:14