Does a Markov Reward Process with “attracting state” have well-defined state-values?

I have recently learned some things about Reinforcement Learning and have now been thinking on my own about Markov Reward Processes (MRPs), without engaging much with the literature. I have almost no background in probability theory, which may explain the naivety of my question.



Usually in an MRP one uses discounting to keep the mathematics "finite". I tried to generalize this in the following way: instead of discounting, I designate one state as "attracting", in the sense that from every state there is a path to the attracting state. The attracting state is also absorbing, so no arrow leaves it except one pointing back to itself, and the reward for that self-transition is always $0$.



Question 1: Is there always a well-defined, finite expected value for the sum of rewards starting in a specific state?



Question 2: Does the notion of an "attracting state" also appear in the literature, and if so, what is it called there?



What follows just makes the description above more explicit.
I consider an MRP $(S, R, P)$, where $S = \{s_1, \dots, s_{n+1}\}$ is the set of states, $R = \{r_1, \dots, r_{n+1}\}$ is the set of rewards, where $r_i$ is the reward received for leaving state $s_i$, and $P \in \mathbb{R}^{(n+1) \times (n+1)}$ is the state-transition probability matrix with entries $p_{ij} = \Pr(X_{t+1} = s_j \mid X_t = s_i)$.



Furthermore, the state $s_{n+1}$ is attracting, which means the following three conditions hold (a small checker for them is sketched right after):



(i) For all $i$ there is a sequence $i = k_1, \dots, k_s = n + 1$ such that $\prod_{j = 1}^{s-1} p_{k_j, k_{j+1}} > 0$.



(ii) $p_{n+1,n+1} = 1$ and $p_{n+1,j} = 0$ for all $j \neq n+1$.



(iii) $r_{n+1} = 0$.
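
For illustration, here is a minimal sketch (in Python/NumPy; the function name and the example matrix are made up for this post, not standard routines) that checks conditions (i)-(iii) for a given transition matrix and reward vector:

```python
import numpy as np

def is_attracting(P, r, a=-1):
    """Check conditions (i)-(iii) for state index `a` (by default the last state).
    A rough sketch with made-up names, not a standard library routine."""
    n = P.shape[0]
    a = a % n                          # allow negative indexing
    # (ii) the state is absorbing: it transitions only to itself
    absorbing = np.isclose(P[a, a], 1.0) and np.allclose(np.delete(P[a], a), 0.0)
    # (iii) its reward is zero
    zero_reward = np.isclose(r[a], 0.0)
    # (i) every state has a positive-probability path to `a`:
    # grow the set of states from which `a` is reachable (backward search)
    reach = {a}
    changed = True
    while changed:
        changed = False
        for i in range(n):
            if i not in reach and any(P[i, j] > 0 for j in reach):
                reach.add(i)
                changed = True
    return absorbing and zero_reward and len(reach) == n

P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])
r = np.array([1.0, -2.0, 0.0])
print(is_attracting(P, r))             # True for this 3-state example
```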



It is easy to see that this setting is strictly more general than that of MRPs with discounting: given an MRP with discount factor $\gamma$, we can build an MRP with an attracting state from it by adding one additional state that every other state reaches with probability $1-\gamma$ at every step (and by rescaling the other transition probabilities by $\gamma$ accordingly).
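
A minimal sketch of that reduction, under the same made-up conventions as above: the original transition probabilities are rescaled by $\gamma$ and the remaining mass $1-\gamma$ is routed to the new absorbing state.

```python
import numpy as np

def add_attracting_state(P, r, gamma):
    """Build an undiscounted MRP with one extra attracting state from a
    discounted MRP (P, r, gamma). A sketch of the reduction described above;
    the function name is made up."""
    n = P.shape[0]
    P_new = np.zeros((n + 1, n + 1))
    P_new[:n, :n] = gamma * P          # original transitions, rescaled by gamma
    P_new[:n, n] = 1.0 - gamma         # remaining mass goes to the new state
    P_new[n, n] = 1.0                  # the new state is absorbing
    r_new = np.append(r, 0.0)          # and yields reward 0 forever after
    return P_new, r_new

P = np.array([[0.5, 0.5],
              [0.2, 0.8]])
r = np.array([1.0, -2.0])
P_new, r_new = add_attracting_state(P, r, gamma=0.9)
assert np.allclose(P_new.sum(axis=1), 1.0)   # rows still sum to 1
```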



The value of state $s_i$ is defined as



$$V(i) = \mathrm{E}\left[\sum_{j = 0}^{\infty} R_j \,\middle|\, X_0 = s_i\right],$$



where $R_j$ is the reward received in time-step $j$ for leaving state $X_j$.
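
To illustrate the definition, here is a small sketch that estimates $V(i)$ by sampling trajectories (the example numbers are mine; under condition (i) each rollout is absorbed almost surely, so the step cap is just a safety net):

```python
import numpy as np

rng = np.random.default_rng(0)

# The toy 3-state MRP from the checker sketch; state 2 is the attracting state.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])
r = np.array([1.0, -2.0, 0.0])

def estimate_value(i, num_rollouts=10000, max_steps=10000):
    """Monte Carlo estimate of V(i) = E[sum of rewards | X_0 = s_i]."""
    total = 0.0
    for _ in range(num_rollouts):
        state, ret = i, 0.0
        for _ in range(max_steps):
            if state == 2:             # absorbed: all further rewards are 0
                break
            ret += r[state]
            state = rng.choice(3, p=P[state])
        total += ret
    return total / num_rollouts

print(estimate_value(0))
```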



I believe that this expected value is always a finite, well-defined number, but I am not quite able to prove it. What I was able to show is the following: assuming the expected value exists (and is a real number), we can find it as usual by dynamic programming with the help of the Bellman equation. The Bellman operator is not a contraction in this case, but composing it with itself sufficiently often yields a contraction, and the usual proof via the Banach fixed-point theorem then leads to a solution. However, this only gives a procedure for computing the expected values assuming they exist, and it is the existence part that I do not know how to prove. Do you have any ideas?
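
For concreteness, this is the kind of computation I have in mind (a sketch only, with hypothetical names): repeatedly applying the undiscounted Bellman operator $V \mapsto r + PV$, whose $k$-th iterate is the partial sum $\sum_{j<k} P^j r$.

```python
import numpy as np

# The same toy 3-state MRP as above: the last state is attracting.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])
r = np.array([1.0, -2.0, 0.0])

def bellman_iteration(P, r, num_iters=2000):
    """Repeatedly apply the undiscounted Bellman operator V <- r + P V.
    The k-th iterate equals the expected sum of the first k rewards; the
    value of the attracting state stays pinned at 0."""
    V = np.zeros(len(r))
    for _ in range(num_iters):
        V = r + P @ V
    return V

print(bellman_iteration(P, r))
```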



It may be that the answer will already be clear once I fully understand the proof in the discounting case. Unfortunately, the only text on this that I am aware of, Algorithms for Reinforcement Learning, also does not seem to resolve my question. (In that case the expectation is taken over returns that are all finite, whereas in my case there can be sample paths, of probability $0$, with infinite or even undefined total reward, for example if the sequence of rewards is something like $-1, 1, -1, 1, -1, 1, \dots$)



Best regards



Leon







  • Answer to Question 2: these are normally (in what I’ve read) called “absorbing states” (see this Wikipedia article, for example). Also, the book by Puterman on Markov Decision Processes might be a helpful reference (it was written before reinforcement learning was really “hot”, but has nice coverage of theory. It’s about MDPs, not MRPs, but helpful if you’re interested)
    – David M.
    Aug 2 at 23:21











  • Also, if you think about the case where $R_i=1$ for all transient states (and $R_i=0$ for absorbing states, as you suggest), then the total expected reward is just the expected number of times you visit each transient state. This value is well-defined, finite and easy to compute (see the Wikipedia article in the previous comment). From this, you might be able to think of a way to build the general proof.
    – David M.
    Aug 3 at 0:07
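
A small sketch of the computation hinted at in this comment (names and numbers are illustrative): for an absorbing chain with transient block $Q$, the fundamental matrix $N = (I - Q)^{-1}$ contains the expected visit counts, so $Nr$ is the total expected reward from each transient state.

```python
import numpy as np

# Transient block Q of the toy example above (drop the attracting state),
# and the rewards for leaving each transient state.
Q = np.array([[0.6, 0.3],
              [0.2, 0.5]])
r = np.array([1.0, -2.0])

N = np.linalg.inv(np.eye(2) - Q)   # fundamental matrix: expected visit counts
V = N @ r                          # total expected reward per starting state
print(N)
print(V)                           # finite values, with no discounting involved
```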
asked Aug 2 at 22:45 by Leon Lang; edited Aug 2 at 22:51 by joriki