Does a Markov Reward Process with “attracting state” have well-defined state-values?

I have recently learned some things about Reinforcement Learning and have now been thinking on my own about Markov Reward Processes (MRPs), without engaging much with the literature. I have almost no background in probability theory, which may explain the naivety of my question.



Usually in an MRP one uses discounting to keep the mathematics "finite". I tried to generalize this in the following way: instead of discounting, I designate one state as "attracting", in the sense that from every state there is a path to the attracting state. The attracting state is also absorbing, so no arrow leaves it except one pointing back to itself, and the reward for that self-transition is always $0$.



Question 1: Is there always a well-defined, finite expected value for the sum of rewards starting in a specific state?



Question 2: Does the notion of an "attracting state" also appear in the literature, and if so, what is it called there?



What follows just makes the description above more explicit.
I consider an MRP $(S, R, P)$, where $S = \{s_1, \dots, s_{n+1}\}$ is the set of states, $R = \{r_1, \dots, r_{n+1}\}$ is the set of rewards, where $r_i$ is the reward received for leaving state $s_i$, and $P \in \mathbb{R}^{(n+1) \times (n+1)}$ is the state-transition probability matrix with entries $p_{ij} = \Pr(X_{t+1} = s_j \mid X_t = s_i)$.



Furthermore, the state $s_{n+1}$ is attracting, which means the following three conditions hold (a small checker for them is sketched right after):



(i) For all $i$ there is a sequence $i = k_1, \dots, k_s = n + 1$ such that $\prod_{j = 1}^{s-1} p_{k_j, k_{j+1}} > 0$.



(ii) $p_{n+1,n+1} = 1$ and $p_{n+1,j} = 0$ for all $j \neq n+1$.



(iii) $r_{n+1} = 0$.
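
For illustration, here is a minimal sketch (in Python/NumPy; the function name and the example matrix are made up for this post, not standard routines) that checks conditions (i)-(iii) for a given transition matrix and reward vector:

```python
import numpy as np

def is_attracting(P, r, a=-1):
    """Check conditions (i)-(iii) for state index `a` (by default the last state).
    A rough sketch with made-up names, not a standard library routine."""
    n = P.shape[0]
    a = a % n                          # allow negative indexing
    # (ii) the state is absorbing: it transitions only to itself
    absorbing = np.isclose(P[a, a], 1.0) and np.allclose(np.delete(P[a], a), 0.0)
    # (iii) its reward is zero
    zero_reward = np.isclose(r[a], 0.0)
    # (i) every state has a positive-probability path to `a`:
    # grow the set of states from which `a` is reachable (backward search)
    reach = {a}
    changed = True
    while changed:
        changed = False
        for i in range(n):
            if i not in reach and any(P[i, j] > 0 for j in reach):
                reach.add(i)
                changed = True
    return absorbing and zero_reward and len(reach) == n

P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])
r = np.array([1.0, -2.0, 0.0])
print(is_attracting(P, r))             # True for this 3-state example
```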



It is easy to see that this setting is strictly more general than that of MRPs with discounting: given an MRP with discount factor $\gamma$, we can build an MRP with an attracting state from it by adding one additional state that every other state reaches with probability $1-\gamma$ at every step (and by rescaling the other transition probabilities by $\gamma$ accordingly).
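
A minimal sketch of that reduction, under the same made-up conventions as above: the original transition probabilities are rescaled by $\gamma$ and the remaining mass $1-\gamma$ is routed to the new absorbing state.

```python
import numpy as np

def add_attracting_state(P, r, gamma):
    """Build an undiscounted MRP with one extra attracting state from a
    discounted MRP (P, r, gamma). A sketch of the reduction described above;
    the function name is made up."""
    n = P.shape[0]
    P_new = np.zeros((n + 1, n + 1))
    P_new[:n, :n] = gamma * P          # original transitions, rescaled by gamma
    P_new[:n, n] = 1.0 - gamma         # remaining mass goes to the new state
    P_new[n, n] = 1.0                  # the new state is absorbing
    r_new = np.append(r, 0.0)          # and yields reward 0 forever after
    return P_new, r_new

P = np.array([[0.5, 0.5],
              [0.2, 0.8]])
r = np.array([1.0, -2.0])
P_new, r_new = add_attracting_state(P, r, gamma=0.9)
assert np.allclose(P_new.sum(axis=1), 1.0)   # rows still sum to 1
```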



The value of state $s_i$ is defined as



$$V(i) = \mathrm{E}\left[\sum_{j = 0}^{\infty} R_j \,\middle|\, X_0 = s_i\right],$$



where $R_j$ is the reward received in time-step $j$ for leaving state $X_j$.
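
To illustrate the definition, here is a small sketch that estimates $V(i)$ by sampling trajectories (the example numbers are mine; under condition (i) each rollout is absorbed almost surely, so the step cap is just a safety net):

```python
import numpy as np

rng = np.random.default_rng(0)

# The toy 3-state MRP from the checker sketch; state 2 is the attracting state.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])
r = np.array([1.0, -2.0, 0.0])

def estimate_value(i, num_rollouts=10000, max_steps=10000):
    """Monte Carlo estimate of V(i) = E[sum of rewards | X_0 = s_i]."""
    total = 0.0
    for _ in range(num_rollouts):
        state, ret = i, 0.0
        for _ in range(max_steps):
            if state == 2:             # absorbed: all further rewards are 0
                break
            ret += r[state]
            state = rng.choice(3, p=P[state])
        total += ret
    return total / num_rollouts

print(estimate_value(0))
```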



I believe that this expected value is always a finite, well-defined number, but I am not quite able to prove it. What I was able to show is the following: assuming the expected value exists (and is a real number), we can find it as usual by dynamic programming with the help of the Bellman equation. The Bellman operator is not a contraction in this case, but composing it with itself sufficiently often yields a contraction, and the usual proof via the Banach fixed-point theorem then leads to a solution. However, this only gives a procedure for computing the expected values assuming they exist, and it is the existence part that I do not know how to prove. Do you have any ideas?
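
For concreteness, this is the kind of computation I have in mind (a sketch only, with hypothetical names): repeatedly applying the undiscounted Bellman operator $V \mapsto r + PV$, whose $k$-th iterate is the partial sum $\sum_{j<k} P^j r$.

```python
import numpy as np

# The same toy 3-state MRP as above: the last state is attracting.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])
r = np.array([1.0, -2.0, 0.0])

def bellman_iteration(P, r, num_iters=2000):
    """Repeatedly apply the undiscounted Bellman operator V <- r + P V.
    The k-th iterate equals the expected sum of the first k rewards; the
    value of the attracting state stays pinned at 0."""
    V = np.zeros(len(r))
    for _ in range(num_iters):
        V = r + P @ V
    return V

print(bellman_iteration(P, r))
```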



It may be that the answer will already be clear once I fully understand the proof in the discounting case. Unfortunately, the only text on this that I am aware of, Algorithms for Reinforcement Learning, also does not seem to resolve my question. (In that case the expectation is taken over returns that are all finite, whereas in my case there can be sample paths, of probability $0$, with infinite or even undefined total reward, for example if the sequence of rewards is something like $-1, 1, -1, 1, -1, 1, \dots$)



Best regards



Leon







  • Answer to Question 2: these are normally (in what I’ve read) called “absorbing states” (see this Wikipedia article, for example). Also, the book by Puterman on Markov Decision Processes might be a helpful reference (it was written before reinforcement learning was really “hot”, but has nice coverage of theory. It’s about MDPs, not MRPs, but helpful if you’re interested)
    – David M.
    Aug 2 at 23:21











  • Also, if you think about the case where $R_i=1$ for all transient states (and $R_i=0$ for absorbing states, as you suggest), then the total expected reward is just the expected number of times you visit each transient state. This value is well-defined, finite and easy to compute (see the Wikipedia article in the previous comment). From this, you might be able to think of a way to build the general proof.
    – David M.
    Aug 3 at 0:07
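
A small sketch of the computation hinted at in this comment (names and numbers are illustrative): for an absorbing chain with transient block $Q$, the fundamental matrix $N = (I - Q)^{-1}$ contains the expected visit counts, so $Nr$ is the total expected reward from each transient state.

```python
import numpy as np

# Transient block Q of the toy example above (drop the attracting state),
# and the rewards for leaving each transient state.
Q = np.array([[0.6, 0.3],
              [0.2, 0.5]])
r = np.array([1.0, -2.0])

N = np.linalg.inv(np.eye(2) - Q)   # fundamental matrix: expected visit counts
V = N @ r                          # total expected reward per starting state
print(N)
print(V)                           # finite values, with no discounting involved
```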
asked Aug 2 at 22:45 by Leon Lang; edited Aug 2 at 22:51 by joriki