Interventionist decision theory without interventions
Causal models are a useful tool for reasoning about causal relations. Meek and Glymour 1994 suggested that they also provide new resources to formulate causal decision theory. The suggestion has been endorsed by Pearl 2009, Hitchcock 2016, and others. I will discuss three problems with this proposal and suggest that fixing them leads us back to more or less the decision theory of Lewis 1981 and Skyrms 1982.
But first let me explain Meek and Glymour's proposal.
Causal models encode causal information by combining a directed acyclic graph with a probability measure over its variables. The nodes in the graph are random variables whose values stand for relevant (possible) events in the world; the probability measure stands for the objective chance (or frequency) of various values and combinations of values. In many cases one can assume the "Causal Markov Condition", which ensures that conditional on values for its causal parents, any variable is probabilistically independent of everything except its effects.
For the application to decision theory, it is important that an adequate model need not explicitly represent all causally relevant factors. If a variable X can be influenced through multiple paths, one may represent only some of these and fold the others into an "error term". The error term must, however, be "d-separated" from the explicitly represented causal ancestors of X, which effectively means that it is probabilistically independent of those other causes.
In causal reasoning, we often need to distinguish two ways of updating on a change of a given variable. To illustrate, suppose we know that there's a lower incidence of disease X among people who take substance Y. One hypothesis that would explain this observation is that there's a common cause of reduced X incidence and taking Y. For instance, those who take Y might be generally more concerned about their health and therefore exercise more, which is the real cause of the reduced incidence of X. On this hypothesis, taking Y is evidence that an agent is less likely to have disease X, but if we ran a controlled experiment in which we gave some people Y and others a placebo, the correlation would be expected to disappear. That's how we would test the present hypothesis. To predict what will happen in the experiment on the assumption of the hypothesis, we have to treat taking or not taking Y as an "intervention" that breaks the correlation with possible common causes. (The fact that somebody in the treatment group of the experiment takes Y is no evidence that they're more concerned with health than people in the control group.)
In general, an intervention on a variable makes it independent of its parent variables. What makes this possible are error terms. In the X and Y example, agents in the treatment group take Y because they are paid to do so as part of the experiment. This causal factor is an error term in the model. As required, the error term is probabilistically independent of the other explicitly represented cause of taking Y, namely general concern for one's health.
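To make the contrast vivid, here is a minimal simulation sketch of the X and Y example. All the numbers are made up for illustration: being health-conscious is the common cause, and Y has no effect on X at all. Observationally, Y-takers show a lower incidence of X; under intervention (the randomized assignment plays the role of the error term), the difference disappears.

```python
import random

def sample(intervene=None):
    # Common cause: general concern for one's health.
    health_conscious = random.random() < 0.5
    if intervene is None:
        # Observational regime: health-conscious people are more likely to take Y.
        takes_y = random.random() < (0.8 if health_conscious else 0.2)
    else:
        # Interventional regime: taking Y is set by the experimenter (the error term).
        takes_y = intervene
    # On the common-cause hypothesis, X depends only on health-consciousness, not on Y.
    disease_x = random.random() < (0.1 if health_conscious else 0.4)
    return takes_y, disease_x

def incidence(samples):
    return sum(x for _, x in samples) / len(samples)

obs = [sample() for _ in range(100_000)]
print(incidence([s for s in obs if s[0]]))                  # takes Y:      ~0.16
print(incidence([s for s in obs if not s[0]]))              # doesn't:      ~0.34
print(incidence([sample(True) for _ in range(100_000)]))    # do(Y=1):      ~0.25
print(incidence([sample(False) for _ in range(100_000)]))   # do(Y=0):      ~0.25
```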
Now Meek and Glymour's suggestion is that everyone should use Jeffrey's formula for computing expected utilities via conditional probabilities. The disagreement between Evidential and Causal Decision Theory (EDT and CDT), they suggest, is not a normative disagreement about rational choice, but a disagreement over whether the relevant acts are treated as interventions.
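In the notation used below (with Cr for the agent's credences, V for the value of outcomes, and O ranging over outcomes), Jeffrey's formula is, roughly, \[ EU(A) = \sum_O Cr(O / A) V(O), \]and the proposal amounts to replacing the condition A by do(A).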
For example, in Newcomb's problem, there is a correlation (due to a common cause) between one-boxing and the opaque box containing a million dollars. Let B=1 express that the agent chooses to one-box. Conditional on B=1, there is a high probability that there's a million in the box. However, conditional on an intervention to one-box, the probability of the million is equal to its unconditional probability: the correlation disappears, just as it does in the X and Y example.
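Here is a small sketch of the two readings, with made-up numbers: a predictor whose prediction correlates with the choice to degree 0.9, a prior of 0.5 for the million, and the usual payoffs of $1M for the opaque box and $1K for the transparent one. Conditionalizing on B favours one-boxing; conditionalizing on do(B) favours two-boxing.

```python
M, K = 1_000_000, 1_000      # contents of the opaque and the transparent box

cr_million = {
    "B=1": 0.9,   # Cr(million / B=1): evidential correlation with one-boxing
    "B=2": 0.1,   # Cr(million / B=2)
    "do":  0.5,   # Cr(million / do(B=x)) = Cr(million): correlation broken
}

def eu(p_million, take_both):
    # Jeffrey's formula, given a probability for the million.
    return p_million * (M + K * take_both) + (1 - p_million) * (K * take_both)

# Conditionalizing on the act itself: one-boxing comes out ahead.
print(eu(cr_million["B=1"], take_both=False))   # ~900000
print(eu(cr_million["B=2"], take_both=True))    # ~101000

# Conditionalizing on do(B=x), as Meek and Glymour suggest: two-boxing wins.
print(eu(cr_million["do"], take_both=False))    # ~500000
print(eu(cr_million["do"], take_both=True))     # ~501000
```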
Now for the problems.
The first (and most obvious) problem is that there is no guarantee that interventions of the relevant kind are available. We can't just assume that for every value x of any variable A that represents an act, there is an intervention event do(A=x) distinct from A=x.
The required assumption is obscured by misleading terminology. If an agent faces a genuine choice between A=1 and A=2, then one naturally thinks that she must be free to "intervene" on the value of A; that she can make do(A=1) true or false at will. But 'intervening' and 'do(A=1)' are technical terms, and in the required technical sense it is not at all obvious that genuine choices are always choices between interventions.
Return to Newcomb's problem. The obvious hypothesis about the causal relationships in Newcomb's problem is captured in the following graph.
Here, B is the variable for one-boxing or two-boxing, P is the prediction, O is the outcome, and C is the common cause of prediction and choice: the agent's disposition to one-box or two-box. Let's assume that the predictor is fallible. How does the fallibility come about? There are two possibilities (which could be combined). Either the predictor has imperfect access to the common cause C, or C does not determine B. Suppose the fallibility is of the first kind. That is, we assume that there are causal factors C which fully determine the agent's choice, but the predictor does not have full access to these factors. That's easy to imagine. The causal factors C cause the predictor's evidence E which in turn causes her prediction, but E is an imperfect sign of C: it is possible that E=1 even though C=2, or that E=2 even though C=1. We could model this by introducing an error term on E, or directly on P (if we don't mention E explicitly).
In this version of Newcomb's Problem, there is no error term on B. So there is no possibility of "intervening" on B in the technical sense of causal models. This does not mean that the agent has no real choice. To be sure, the agent doesn't have strong libertarian freedom, since her choice is fully determined by the causal factors C. But who cares? It's highly contentious whether the idea of strong libertarian freedom is even coherent. It's even more contentious that ordinary humans are free in this sense. And almost nobody believes that robots have that kind of freedom. But robots still face decisions. Many are interested in decision theory precisely because they want to program intelligent artificial agents. An adequate decision theory should not presuppose that the relevant agent has libertarian free will.
That's the first problem. Here is the second. Suppose there are error terms on B in Newcomb's problem. More specifically, let C be the agent's general disposition to follow CDT or EDT, and suppose acts of one-boxing can be caused not just by C but also by random electromagnetic fluctuations in the agent's muscles. These fluctuations are proper error terms because they decorrelate B from C. That's just what the interventionist seems to want. But if that's the causal story, it would be wrong to assess the choiceworthiness of one-boxing and two-boxing by conditionalizing on do(B=1) and do(B=2) respectively. For that effectively means conditionalizing on the relevant electromagnetic fluctuation events, which are in no sense under the agent's control. They are not even sensitive to the agent's beliefs and desires (we may assume).
Here the technical nature of the expressions 'intervention' and 'do' becomes obvious. In the technical sense, the random electromagnetic fluctuations are interventions, and they realize do(B=1). But they are not interventions or doings on the part of the agent in any ordinary sense.
The third problem is pointed out in Stern 2017. I'll try to make it a little more explicit than Stern does.
Consider the following causal structure.
Here A represents the agent's possible actions, which may be smoking and not smoking. These are evidentially correlated with some desirable or undesirable outcome O (cancer or no cancer) via a common cause C (as in Fisher's hypothesis about the relationship between smoking and cancer). I is an intervention variable, which, we assume, decorrelates A from C and therefore from O. Think of I as something like the agent's libertarian free will.
The depicted structure is not yet a causal model because it doesn't specify the chances. Suppose the agent's credence is evenly divided between two hypotheses about the relevant chances, H1 and H2. According to H1, I=1 and O=1 both have probability 0.9; according to H2 they both have probability 0.1. (It doesn't matter what else H1 and H2 say.)
By the Principal Principle, \begin{align} Cr(O=1) &= Cr(O=1 / H1)Cr(H1) + Cr(O=1 / H2)Cr(H2)\\ &= .9 * .5 + .1 * .5 = .5\\ Cr(I=1) &= Cr(I=1 / H1)Cr(H1) + Cr(I=1 / H2)Cr(H2)\\ &= .9 * .5 + .1 * .5 = .5 \end{align}
Since both H1 and H2 treat O and I as independent, it follows again from the Principal Principle that
\[ Cr(O=1 / I=1 \land H1) = Cr(O=1 / H1) = .9\\ Cr(O=1 / I=1 \land H2) = Cr(O=1 / H2) = .1 \]By Bayes' Theorem,
\[ Cr(H1 / I=1) = Cr(I=1 / H1) Cr(H1) / Cr(I=1) = .9 * .5 / .5 = .9\\ Cr(H2 / I=1) = Cr(I=1 / H2) Cr(H2) / Cr(I=1) = .1 * .5 / .5 = .1 \]Finally, by the probability calculus,
\begin{align} Cr(O=1 / I=1) =&\; Cr(O=1 / I=1 \land H1)Cr(H1 / I=1)\\ &\;+ Cr(O=1 / I=1 \land H2)Cr(H2 / I=1). \end{align}Putting all this together, we have
\[ Cr(O=1) = .5\\ Cr(O=1 / I=1) = .9 * .9 + .1 * .1 = .82 \]So although the agent assigns credence 1 to causal hypotheses on which I and O are probabilistically independent, the two variables are not independent in her beliefs.
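Here is a quick sketch to check this arithmetic, and to compute the corresponding value conditional on the other value of I (treating do(A=2) as realized by I=2):

```python
cr_h  = {"H1": 0.5, "H2": 0.5}    # the agent's credence in the two chance hypotheses
ch_o1 = {"H1": 0.9, "H2": 0.1}    # chance of O=1 according to each hypothesis
ch_i1 = {"H1": 0.9, "H2": 0.1}    # chance of I=1 according to each hypothesis

cr_i1 = sum(cr_h[h] * ch_i1[h] for h in cr_h)                    # Cr(I=1) = 0.5
cr_h_i1 = {h: ch_i1[h] * cr_h[h] / cr_i1 for h in cr_h}          # Bayes: 0.9, 0.1
cr_o1_i1 = sum(ch_o1[h] * cr_h_i1[h] for h in cr_h)              # Cr(O=1 / I=1) ~0.82

cr_i2 = 1 - cr_i1
cr_h_i2 = {h: (1 - ch_i1[h]) * cr_h[h] / cr_i2 for h in cr_h}    # Bayes: 0.1, 0.9
cr_o1_i2 = sum(ch_o1[h] * cr_h_i2[h] for h in cr_h)              # Cr(O=1 / I=2) ~0.18

print(cr_o1_i1, cr_o1_i2)
```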
This means that conditional on do(A=1), which is tantamount to I=1, the agent assigns much greater probability to O=1 than conditional on do(A=2). Assuming O=1 is the desirable outcome, Meek and Glymour's proposal therefore says the agent should choose A=1 (via I=1). But this means acting on a spurious correlation.
(The argument does not require an explicit intervention variable. An evidential correlation between A and H1 would do just as well as the assumed correlation between I and H1.)
Stern's observation puts the nail in the coffin of Meek and Glymour's conjecture that CDT and EDT agree on the validity of Jeffrey's formula for calculating expected utilities, but disagree over whether the relevant acts are understood as interventions or ordinary events. In the present example, conditionalizing on interventions in Jeffrey's formula doesn't yield a recognizably causal decision theory.
As a corollary, we can see that there's an important difference between conditionalizing on do(A=1) and subjunctively supposing A=1, what Joyce 1999 would write as P( * \ A=1), with a backslash. Joyce 2010 suggests that if P( * \ A=1) is understood in terms of imaging or expected chance then there's a close connection between P( * / do(A=1)) and P( * \ A=1), so that the operation of conditionalizing on do(A=1) may actually be understood as subjunctive supposition rather than conditionalizing on an intervention event. But the discussion presupposes that we are certain of the objective probabilities. If we are not, conditionalizing on do(A=1) is not at all the same as subjunctively supposing A=1.
To get around the third problem, Stern proposes to use Lewis's K-partition formula for calculating expected utilities, on which Jeffrey's formula is applied locally within each "dependency hypothesis" K and expected utility is the weighted average of the results, weighted by the agent's credence in the relevant dependency hypotheses. In Stern's "interventionist decision theory", the dependency hypotheses are identified with causal models. So expected utility is computed as follows (again, I'm slightly more explicit here than Stern himself):
\[ EU(A) = \sum_K Cr(K) \sum_O Cr(O / do(A) \land K) V(O) \](Since causal models are effectively hypotheses about chance, this account is perhaps even closer to Skyrms's version of CDT than to Lewis's.)
This gets around the problem because any evidence that A may provide for or against a particular causal model becomes irrelevant.
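To see how, here is a sketch of the formula applied to the smoking example from above, with made-up values V(O=1)=1 (no cancer) and V(O=2)=0, and with H1 and H2 playing the role of the dependency hypotheses K:

```python
cr_k  = {"H1": 0.5, "H2": 0.5}    # credence in the two causal models
ch_o1 = {"H1": 0.9, "H2": 0.1}    # chance of O=1 (no cancer) according to each model
value = {"O=1": 1.0, "O=2": 0.0}  # made-up desirabilities

def eu(act):
    # Within each model K, O is independent of do(A), so Cr(O / do(A) & K)
    # is just the chance of O according to K; 'act' makes no difference.
    return sum(cr_k[k] * (ch_o1[k] * value["O=1"] + (1 - ch_o1[k]) * value["O=2"])
               for k in cr_k)

print(eu("A=1"), eu("A=2"))   # 0.5 0.5: the spurious correlation has dropped out
```

Whatever evidence the act provides about which model is correct never enters the calculation.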
Notice that Stern's proposal is "doubly causal", as it were. First, it replaces Jeffrey's formula by the Lewis-Skyrms formula, in order to factor out spurious correlations between acts and causal hypotheses. Second, it replaces ordinary acts A by interventions, do(A). Do we really need both?
Arguably not. Return to Newcomb's problem. Here the Lewis-Skyrms approach already recommends two-boxing because it distinguishes two relevant dependency hypotheses. According to the first, the opaque box is empty and so there's a high chance of getting $0 through one-boxing; according to the second, the opaque box contains $1M and so there's a high chance of getting $1M through one-boxing.
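With c standing for the agent's credence in the second dependency hypothesis (the box contains the million), B=1 for one-boxing as before, and assuming for simplicity that within each hypothesis the contents of the box settle the payoffs, the calculation is straightforward: \begin{align} EU(B=1) &= (1-c) * 0 + c * 1000000\\ EU(B=2) &= (1-c) * 1000 + c * 1001000 = EU(B=1) + 1000, \end{align}so two-boxing comes out ahead whatever the value of c.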
Can the interventionist also treat these as two different causal models? Yes. Easily. The two models would have the same causal graph, but different objective probabilities. In one model, it is certain that the predictor predicts one-boxing; in the other, it is certain that the predictor predicts two-boxing. This may not fit the frequentist interpretation of probabilities in causal models, but that interpretation spells trouble for interventionist accounts of decision anyway, since (a) the Principal Principle for frequencies is much more problematic than for more substantive chances, and (b) population-level statistics make it even harder to find suitable error terms for intervening (as Papineau 2000 points out). If instead we think of the probabilities more along the lines of objective chance (though it could be statistical mechanical chance), it is quite natural to think that at the time of the decision, the contents of the box are no longer a matter of non-trivial chance.
So there are good reasons for the interventionist to follow Lewis and Skyrms and model Newcomb's problem as involving two relevant causal hypotheses K. And then we get two-boxing as the recommendation even if, conditional on each hypothesis, we conditionalize on B rather than do(B).
This is nice because it also takes care of the first two problems for the interventionist, which concerned the availability and eligibility of interventions. On the revised version of Stern's account, we don't need interventions any more.
Of course, the revised version of Stern's account is basically the decision theory of Lewis and Skyrms. The only difference is that dependency hypotheses are spelled out as causal models.
Upshot: The theory of causal models can indeed be useful for thinking about rational choice, because causal models are natural candidates to play the role of dependency hypotheses in K-partition accounts of expected utility. The supposedly central concept of an intervention, however, is not only problematic in this context, but also redundant. We can do better without it.