Contexts of utterance without utterance tokens
I think one should not define "context of utterance" so that a context of utterance for an expression must always contain an utterance of the expression (or "truth in a context of utterance" so that a sentence can only be true in a context where it is uttered).
This obviously depends on how or where the term is meant to be used. The use I have mostly in mind is in the semantics/pragmatics of context-dependence, or indexicality.
Competent speakers of English know how to determine the semantic value(s) of a sentence uttered in a given context. Take truth value: we know that
I) an utterance of "I am hungry" is true in a context of utterance C (roughly) iff the speaker in C is hungry at the time of C.
Now the question is: does this rule cover only situations C where some speaker actually utters "I am hungry" or also situations where the speaker utters something else or even where there is no speaker at all?
Here is a misguided reason for thinking it applies only to situations where someone utters "I am hungry": In a sense, different utterances of "I am hungry" mean different things -- that you are hungry if uttered by you, but that I am hungry if uttered by me. But then rule (I) isn't quite correct. For if we don't hold fixed the meaning of the uttered words, then "I am hungry" is true in a context C where "hungry" means sleepy and the speaker is sleepy.
On this conception, a sentence is true in a context of utterance C iff whatever the sentence, individuated phonetically or orthographically, means at C is true. Then of course it makes little sense to assume that a sentence may be true in a context of utterance where it isn't uttered.
But this conception is useless for the semantics/pragmatics of English indexicals: the corrected rule for "I am hungry" would go like this:
I*) An utterance of "I am hungry" is true in a context of utterance C iff either 1) it means that the speaker in C is hungry and the speaker in C is hungry, or 2) it means that the speaker in C is sleepy and the speaker in C is sleepy, or 3) it means that 2+2=4 and 2+2=4, etc.
Yet this doesn't capture what competent speakers of English know about "I am hungry". For one, the rule for every other sentence ("I am sleepy", "2+2=4", etc.) would be exactly the same. Moreover, (I*) is a priori and thus known by people who know nothing at all about English.
So much for this misguided conception of contexts of utterance. Now some positive reasons for accepting contexts without utterance tokens.
Since English has infinitely many, and arbitrarily long, indexical sentences, rules like (I) had better be determined compositionally. Consider the clause for negation. It might go like this:
NEG) "not p" is true in a context C iff p is not true in C.
But this doesn't work if a sentence cannot be true in a context where it isn't uttered. For then "it's raining" turns out to be true in all and only those contexts where 1) it's raining and 2) someone says "it's raining". But clearly "it's not raining" is not true in all and only the contexts where either (1) or (2) fails. But that's what (NEG) would demand.
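The mismatch can be checked in a toy model. What follows is only a sketch under assumptions of my own: a context is reduced to two features (whether it is raining, and which sentence, if any, is uttered), and the names Context, true_only_if_uttered and neg are made up for the illustration.

```python
from dataclasses import dataclass
from typing import Optional

RAIN = "it's raining"
NOT_RAIN = "it's not raining"

@dataclass(frozen=True)
class Context:
    raining: bool            # the weather in the context
    uttered: Optional[str]   # the sentence uttered there, or None

# All six combinations of weather and utterance (toy assumption).
CONTEXTS = frozenset(
    Context(raining, uttered)
    for raining in (True, False)
    for uttered in (RAIN, NOT_RAIN, None)
)

def true_only_if_uttered(sentence: str) -> frozenset:
    """Truth set on the restrictive conception: the weather fits AND the sentence is uttered."""
    wanted_weather = {RAIN: True, NOT_RAIN: False}[sentence]
    return frozenset(
        c for c in CONTEXTS
        if c.raining == wanted_weather and c.uttered == sentence
    )

def neg(truth_set: frozenset) -> frozenset:
    """(NEG): 'not p' is true in exactly the contexts where p is not true."""
    return CONTEXTS - truth_set

predicted = neg(true_only_if_uttered(RAIN))   # what (NEG) delivers
intended = true_only_if_uttered(NOT_RAIN)     # what the restrictive conception wants

print(len(predicted), len(intended))
print(Context(True, NOT_RAIN) in predicted)
```

In this model (NEG) counts "it's not raining" as true in five of the six contexts, including one where it is raining and someone says "it's not raining", whereas the restrictive conception wants it true in just one.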
This cannot be fixed by some clever change in (NEG). The problem is that from the set of contexts satisfying both (1) and (2) there is no way back to the set of all contexts satisfying (1). If there were such a route, we could say that (NEG) maps "it's not raining" to the intersection of that set with the set of contexts where someone says "it's not raining". But since there is no such route, the set of contexts where "not p" is true is not a function of the set of contexts where p is true, and compositionality fails.
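To see that no route back exists, it is enough to find two scenarios that agree on where "it's raining" is true (on the restrictive conception) but disagree on where "it's not raining" is true. Here is a minimal sketch, assuming just two contexts c1 and c2; the names UTTERED, weather_a, weather_b and restricted_truth_set are hypothetical and chosen only for this illustration.

```python
RAIN = "it's raining"
NOT_RAIN = "it's not raining"

# Which sentence is uttered in which context (toy assumption).
UTTERED = {"c1": RAIN, "c2": NOT_RAIN}

# Two scenarios that differ only in the weather at c2,
# a context where "it's raining" is not uttered.
weather_a = {"c1": True, "c2": True}
weather_b = {"c1": True, "c2": False}

def restricted_truth_set(sentence, weather):
    """Contexts where the sentence is uttered and the weather makes it true."""
    wanted_weather = (sentence == RAIN)
    return frozenset(
        c for c, uttered in UTTERED.items()
        if uttered == sentence and weather[c] == wanted_weather
    )

# Same truth set for p in both scenarios ...
same_input = restricted_truth_set(RAIN, weather_a) == restricted_truth_set(RAIN, weather_b)
# ... but different truth sets for "not p".
same_output = restricted_truth_set(NOT_RAIN, weather_a) == restricted_truth_set(NOT_RAIN, weather_b)

print(same_input, same_output)
```

Since the two scenarios give the same set for p and different sets for "not p", nothing that takes only the set for p as input can return the right set for "not p" in both.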
Here is a related, but more general argument. Suppose negation is always inserted somewhere in the middle of the negated sentence, as in "it's not raining". Then if the negation of p is uttered in a context C, p itself is never uttered in C. Hence if p only has a semantic value in contexts where it is uttered, the semantic value of ~p in C -- no matter if it's a truth value or some kind of proposition -- cannot be a function of the semantic value of p in C, because p will have no semantic value at all in C.
Thirdly, if our linguistic rules somehow determine, for each sentence S, a function X(S) from contexts of utterance to semantic values (propositions, truth values, or whatever), it is very natural to say, as Kaplan does, that two sentences S1 and S2 are synonymous if the corresponding functions X(S1) and X(S2) are identical. But if X(S1) is only defined for contexts where S1 is uttered and X(S2) for contexts where S2 is uttered, the functions X(S1) and X(S2) will be identical if and only if S1 = S2. It follows that no two sentences are ever synonymous. But even philosophers who believe that indeed no two sentences are ever synonymous would probably consider that to be an important Quinean insight, not a trivial consequence of the definition of synonymy.
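The point about function identity can be made concrete with a small sketch. It is a toy model, not Kaplan's own formalism: X(S) is represented as a Python dict from contexts to truth values, a context is a pair of (raining, uttered sentence), and "rain is falling" is simply assumed to have the same truth conditions as "it's raining". The names character and synonymous are mine.

```python
S1 = "it's raining"
S2 = "rain is falling"   # assumed, for illustration, to be true in the same circumstances as S1

# A context is a pair (raining, uttered sentence or None).
CONTEXTS = [(raining, uttered)
            for raining in (True, False)
            for uttered in (S1, S2, None)]

def character(sentence, require_utterance):
    """X(S) as a dict from contexts to truth values.

    If require_utterance is True, X(S) is defined only for contexts
    in which the sentence itself is uttered.
    """
    return {
        c: c[0]   # both toy sentences are true iff it is raining in c
        for c in CONTEXTS
        if not require_utterance or c[1] == sentence
    }

def synonymous(a, b, require_utterance):
    """Two sentences are synonymous iff their functions are identical."""
    return character(a, require_utterance) == character(b, require_utterance)

print(synonymous(S1, S2, require_utterance=False))
print(synonymous(S1, S2, require_utterance=True))
print(synonymous(S1, S1, require_utterance=True))
```

Once the functions are restricted to contexts of actual utterance, the domains of X(S1) and X(S2) no longer coincide, so the two functions come apart even though the sentences never differ in truth value.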
Fourthly, for many expressions, their evaluation in different contexts of utterance corresponds closely to their evaluation at different counterfactual situations. For instance, when we describe other times and possible worlds, we apply "round", "bumpy", "beautiful", "transparent" and "mountain" to exactly the same things we would apply the words to if it turned out that the relevant time is present and the relevant world is actual. This is not a coincidence. Properties like roundness and bumpiness don't have a hidden essence we could rigidly pick out. To be round just means to have a certain shape, for actual and counterfactual things alike. Thus in a systematic theory, we might want to say things like: if a sentence S contains only words that are in this sense semantically stable or non-twin-earthable, then "at world w, time t and place p, S" is true iff S is true in the utterance context <w,t,p> (the context located at world w, time t and place p). But this won't work if S is not true in a context where it isn't uttered.
I'm not sure if you have Kaplan right. His contexts are divorced entirely from utterances. Utterances play no part in his logic or semantics. The reason is that he is suspicious of the possibility of validity cast in terms of utterances. His semantics is for sentences evaluated in contexts. As far as he is concerned, he is only dealing with types (sentences). This is one of the reasons that he restricts contexts to proper contexts (i.e. ones where the agent is at the location in the world at the time). Kaplan would definitely agree that contexts of use should not be defined to require an utterance tokening.
There is some discussion of separating utterances and contexts in John Perry's afterword to his "Problem of the Essential Indexical" in the expanded edition of his essays with that name. He talks about Stalnaker's notion of diagonal proposition (evaluating some utterance u across different worlds) in relation to his own work. I think the Stalnaker citation is in that volume.
Kaplan's student Stefano Predelli has some things to say about the proper relation of utterances, sentences, and contexts in his book Contexts. If I remember right, he thinks that you can get an adequate semantics whether you start with utterances or with sentences. His book is a fruitful exploration and defense of the assumptions built into Kaplan's work.