A Note on Optimal Experimentation Under Risk Aversion

This paper solves the two-armed bandit problem when decision makers are risk averse. It shows, counterintuitively, that a more risk-averse decision maker might be more willing to take risky actions. The reason relates to the fact that pulling the risky arm in bandit models produces information on the environment, thereby reducing the risk that a decision maker will face in the future. This finding gives reason for caution when inferring risk preferences from observed actions: in a bandit setup, observing a greater appetite for risky actions can actually be indicative of more risk aversion, not less. Studies which do not take this into account may produce biased estimates.

1 Introduction

This paper analyzes how a rational, risk-averse decision maker solves the two-armed bandit problem of having to choose between a safe alternative that yields a known reward and a risky one that generates an unknown payoff.
At first sight, it seems intuitive that decision makers who are more risk averse will be less willing to take the risky action. Indeed, an earlier paper (Chancelier, De Lara, and De Palma, 2009) arrives at such a conclusion. Below, however, we show that there exists a previously overlooked part of the parameter space where this result is overturned.
Our model is based upon the exponential bandit model of Keller, Rady, and Cripps (2005). Following Roberts and Weitzman (1981) and Bolton and Harris (1999), who in turn built upon the seminal work of Rothschild (1974a,b), it uses a continuous-time framework. We extend the standard model by allowing for risk aversion on the part of the decision maker. In doing so, we uncover the previously overlooked result that a more risk-averse decision maker might be more willing to pull the risky arm than a less risk-averse colleague.
The reason for this counterintuitive result relates to the notion that risk in bandit models can be reduced through experimentation with the risky arm. It is most likely to arise in settings where information arrives at a high frequency, which makes our finding of specific relevance to the machine learning literature (where reinforcement learning algorithms have the bandit problem at their core; see Sutton and Barto (1998)).
It is furthermore important to understand how risk aversion of decision makers affects the decisions they make. Willingness to take risks has been linked to the success of entrepreneurs (Cantillon, 1755; Knight, 1921; Kihlstrom and Laffont, 1979; Herranz, Krasa, and Villamil, 2015), while it has also been studied in a principal-agent setup, for example analyzing decision making by CEOs (Bandiera et al., 2011) and politicians (Lilienfeld et al., 2012). A common narrative that can be found in this literature is that more risk-averse individuals can be expected to take less risky actions. The point of this paper is to show that the introduction of learning and experimentation can overturn this wisdom: appointing a more risk-averse decision maker is no guarantee for the implementation of less risky actions.
Finally, our findings imply that it is not straightforward to infer risk preferences from observed actions: when there is scope for experimentation, observing a greater appetite for risky actions might actually be indicative of more risk aversion, not less.

2 A Keller-Rady-Cripps model with risk aversion
In this section, we employ the exponential bandit model of Keller, Rady, and Cripps (2005) but extend it by allowing for risk-averse decision makers (henceforth "DMs").
Time t ∈ [0, ∞) is continuous and there are N ≥ 1 players (who discount the future at rate r > 0), all facing a two-armed bandit problem. Each player n can allocate a fraction k_{n,t} ∈ [0, 1] of his "informational" resource to the risky arm R, leaving (1 − k_{n,t}) for the safe arm S. The aggregate intensity of experimentation is denoted by K_t ≡ Σ_{n=1}^{N} k_{n,t}.

The safe arm provides lump-sum payoffs s > 0 (with the value of s being known to all agents) according to a Poisson process with parameter 1 (which is known as well, hence why this arm is safe). So if player n uses the safe arm S with intensity (1 − k_{n,t}) over a short time interval of length dt, he receives the lump sum s with probability (1 − k_{n,t}) dt.

The source of risk in the risky arm R lies in the fact that its type (indicated by θ) is unknown to all agents at t = 0. They know that there are two possibilities: the arm is either "good" (θ = 1) or "bad" (θ = 0). At time t, each player n holds a common belief p_t that the risky arm is of good quality. The DM's learning process on the risky arm's type is obstructed by the presence of noise in the associated payoff stream. When the arm is of good quality, it yields a lump-sum payoff equal to h with Poisson parameter λ > 0 (both h and λ are known to all players). When its quality is bad, it will never pay off. As more information arrives over time, the belief p_t is revised according to Bayes' rule. When the DM plays the risky arm but no payment is observed, the DM's belief that the risky arm is "good" is revised downward according to:

dp_t = −K_t λ p_t (1 − p_t) dt.    (1)

Upon arrival of a payoff h, p_t jumps to 1 (as a bad arm would never pay off).
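To make these belief dynamics concrete, the following minimal Python sketch simulates equation (1) for a single player with k_t = 1; all parameter values (lam, p0, dt, T) and the hidden type are illustrative choices of ours, not values from the paper.

```python
# Sketch: belief dynamics of the exponential bandit, cf. equation (1).
import numpy as np

rng = np.random.default_rng(0)

lam = 2.0    # Poisson arrival rate of the lump sum h on a good risky arm
p0 = 0.6     # prior belief that the risky arm is good
dt = 1e-3    # discretization step
T = 5.0      # simulation horizon
good = True  # true (hidden) type of the risky arm

p, t = p0, 0.0
while t < T:
    # With a good arm, a lump sum arrives with probability lam*dt in [t, t+dt).
    if good and rng.random() < lam * dt:
        p = 1.0  # observing a payoff reveals that the arm is good
        break
    # No payoff observed: Bayes' rule (1) drifts the belief downward.
    p -= lam * p * (1.0 - p) * dt
    t += dt

print(f"belief after t = {t:.3f}: p = {p:.4f}")
```

On runs where no payoff arrives early, the printed belief drifts below the prior, illustrating how experimentation resolves uncertainty in one direction or the other.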
Consequently, the utility that a DM derives from the aforementioned payoffs is equal to u(s) (for the safe arm) and u(h) (in case the risky arm pays off).
To analyze the impact of risk aversion, we distinguish between two DM-types (indexed by i), where the DM with i = 1 is less risk averse than the DM with i = 2. In particular, the more risk-averse DM "2" has an increasing, concave utility function u_2(·) which is a concave transformation of u_1(·) (the utility function of the less risk-averse DM "1"). We make the problem meaningful for both DM-types by assuming that:

Assumption 1. λ u_i(h) > u_i(s) for i = 1, 2.

This assumption ensures that a DM pulls the risky arm if he knows that it is of the good type. (If λ u_i(h) were smaller than u_i(s), a DM would never play the risky arm.) At this stage, we are able to define the expected current (myopic) payoff from R, where the expectation is formed given current belief p_t, as:

p_t λ u_i(h).

3 The single-agent case

We proceed by focusing on the case in which there is only a single agent of type i = 1, 2 who takes decisions. This simplest-possible case illustrates the essence of our argument about why higher risk aversion might increase a DM's willingness to pull the risky arm.
Section 4 will subsequently demonstrate that our core result can be generalized to a setup in which N > 1 agents cooperate in a team.
In the single-agent case, the objective of the DM is to choose {k_t}_{t≥0} so as to maximize the total expected discounted payoff:

E[ ∫_0^∞ r e^{−rt} ((1 − k_t) u_i(s) + k_t θ λ u_i(h)) dt ],

where θ ∈ {0, 1} is the unknown type of the risky arm, the expectation is taken over θ and the induced belief process, and the factor r normalizes the value of playing S forever to u_i(s). From this point onward, the solution procedure is very much analogous to that in Keller, Rady, and Cripps (2005), the only difference being the presence of risk aversion.
As in Keller, Rady, and Cripps (2005), the Principle of Optimality implies that the value function V_i for DM-type i satisfies:

V_i(p) = max_{k∈[0,1]} { ((1 − k) u_i(s) + k p λ u_i(h)) r dt + e^{−r dt} E[V_i(p_{t+dt}) | p_t = p] }.    (2)

To eliminate the expectations operator, observe that with subjective probability pkλ dt, a payoff is observed, revealing to the DM that the risky arm is of the good type. In that case, the value function jumps to V_i(1) = λ u_i(h). With complementary probability 1 − pkλ dt, no payoff is observed. Then, application of Bayes' rule (1) enables us to write:

E[V_i(p_{t+dt}) | p_t = p] = pkλ dt · λ u_i(h) + (1 − pkλ dt)(V_i(p) − kλp(1 − p)V_i′(p) dt).

Combining this with equation (2), while approximating e^{−r dt} by (1 − r dt) and dropping terms of order (dt)², leads to the following Bellman equation:

V_i(p) = u_i(s) + max_{k∈[0,1]} k [ pλu_i(h) − u_i(s) + (λp/r)(λu_i(h) − V_i(p)) − (λp(1 − p)/r) V_i′(p) ].

At this stage, it is interesting to note that the maximand in the Bellman equation continues to be linear in k, which implies that the solution will be of a bang-bang nature.
As a result, and exactly as in Keller, Rady, and Cripps (2005), it is always optimal for the DM to play either k = 0 or k = 1. In the former case, V_i(p) = u_i(s). In the latter case, V_i satisfies the following first-order ordinary differential equation:

λp(1 − p)V_i′(p) + (r + λp)V_i(p) = (r + λ)λu_i(h)p,

whose solution is given by:

V_i(p) = λu_i(h)p + C_i (1 − p) ((1 − p)/p)^{r/λ},

where C_i is a constant of integration. This solution has the exact same structure as that in Keller, Rady, and Cripps (2005).
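As a quick sanity check on this reconstruction, the sketch below plugs the closed form into the differential equation using central finite differences; the constant C, the utility value u_h = u_i(h), and the parameters are arbitrary placeholders of ours.

```python
# Sketch: verify that V(p) = lam*u_h*p + C*(1-p)*((1-p)/p)**(r/lam)
# solves  lam*p*(1-p)*V'(p) + (r + lam*p)*V(p) = (r + lam)*lam*u_h*p.
r, lam, u_h, C = 0.5, 2.0, 1.3, 0.7  # illustrative values
mu = r / lam

def V(p):
    return lam * u_h * p + C * (1.0 - p) * ((1.0 - p) / p) ** mu

eps = 1e-6
for p in (0.2, 0.5, 0.8):
    dV = (V(p + eps) - V(p - eps)) / (2 * eps)      # numerical V'(p)
    lhs = lam * p * (1 - p) * dV + (r + lam * p) * V(p)
    rhs = (r + lam) * lam * u_h * p
    print(f"p = {p}: ODE residual = {lhs - rhs:.2e}")  # ~0 up to FD error
```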
It therefore inherits the feature that there exists a cutoff belief p*_i below which it is optimal for the DM to play the safe arm S, while playing the risky arm R becomes optimal when the DM's belief p > p*_i. By imposing value matching (V_i(p*_i) = u_i(s)) and smooth pasting (V_i′(p*_i) = 0), we can derive the cutoff belief p*_i as:

p*_i = μ u_i(s) / ((μ + 1) λ u_i(h) − u_i(s)),  where μ ≡ r/λ.    (3)

After some algebra, it then emerges that the difference p*_2 − p*_1 is equal to:

p*_2 − p*_1 = μ(μ + 1)λ (u_1(h)u_2(s) − u_1(s)u_2(h)) / [ ((μ + 1)λu_1(h) − u_1(s)) ((μ + 1)λu_2(h) − u_2(s)) ].

Hence:

sgn(p*_2 − p*_1) = sgn(u_1(h)u_2(s) − u_1(s)u_2(h)) = sgn(u_2(s)/u_1(s) − u_2(h)/u_1(h)).

Noting that u_2(x)/u_1(x) is decreasing in x (see Lemma 1 in Appendix A), we can establish our main result:

Proposition 1. In the single-agent problem, the ordering of the cutoff beliefs for DM-types 1 and 2 is as follows: (a) when h > s, p*_2 > p*_1; (b) when h < s, p*_2 < p*_1; (c) when h = s, p*_2 = p*_1.

Part (a) of Proposition 1 implies that the more risk-averse DM needs a more optimistic belief on the quality of the risky arm to become willing to play R. In case (b), however, the more risk-averse DM has the lower threshold p*_2, implying that he will play R at more pessimistic beliefs than the less risk-averse DM. (Case (b), as well as part (c) of the proposition, requires λ to be high enough that Assumption 1 is not violated. If it were violated, the DM would never pull the risky arm, even if the latter were known to be of good quality.) Following suggestions by a referee, we show in Appendix B that one can generalize this proposition to any utility function for which risk attitudes can be measured by the absolute measure of risk aversion, by studying an infinitesimal increase in risk aversion using Pratt's (1964) representation.
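To illustrate Proposition 1 numerically, the following sketch evaluates the cutoff formula (3) for two illustrative utility functions, u_1(x) = x^0.9 and u_2(x) = x^0.5, so that u_2 is a concave transformation of u_1; these functional forms and all parameter values are our own choices.

```python
# Sketch: cutoff beliefs p*_i from equation (3) for the three cases of
# Proposition 1, under illustrative CRRA-style utilities.
r, lam = 0.5, 4.0
mu = r / lam

def u1(x): return x ** 0.9
def u2(x): return x ** 0.5   # a concave transformation of u1

def cutoff(u, s, h):
    assert lam * u(h) > u(s), "Assumption 1 violated"
    return mu * u(s) / ((mu + 1.0) * lam * u(h) - u(s))

for s, h, label in [(1.0, 2.0, "(a) h > s"),
                    (1.0, 0.5, "(b) h < s"),
                    (1.0, 1.0, "(c) h = s")]:
    p1, p2 = cutoff(u1, s, h), cutoff(u2, s, h)
    print(f"{label}: p*_1 = {p1:.4f}, p*_2 = {p2:.4f}")
```

The printed output shows p*_2 > p*_1 in case (a), p*_2 < p*_1 in case (b), and p*_1 = p*_2 in case (c), matching the proposition.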
To gain intuition for Proposition 1, start with part (c). When h = s, the safe arm gives rise to the exact same payoff as a good-quality risky arm (only at a different frequency, given that λ > 1). As a result, h/s = u_i(h)/u_i(s) = 1 for i = 1, 2 and all payoff-related terms disappear from the cutoff formula (3). The cutoffs for both DM-types collapse to:

p*_1 = p*_2 = μ/((μ + 1)λ − 1) = r/(λ(r + λ − 1)).    (4)

From (4), one can see that pushing λ up (making the risky arm more attractive) lowers the cutoff belief (thus increasing the DM's willingness to try the risky arm). Crucially, however, when h = s, the cutoff belief falls at the same rate for all DMs irrespective of their degree of risk aversion (because u_i(h)/u_i(s) = 1 for i = 1, 2 and transformations of payoffs no longer affect the cutoff location), thereby keeping p*_1 = p*_2.

This no longer holds true when we increase h slightly to h > s. Again, we have made the risky arm more attractive, in response to which both p*_1 and p*_2 fall (cf. equation (3)). But since the marginal utility of a more risk-averse DM decreases at a faster rate when the payoff rises, he gains fewer utils from the increase in h than his less risk-averse counterpart. As a result, the less risk-averse DM's cutoff p*_1 falls by more than the more risk-averse DM's cutoff p*_2, putting us in case (a) of Proposition 1, where the less risk-averse DM is more willing to pull the risky arm.
Further understanding of the difference between parts (a) and (b) of Proposition 1 can be gained by taking learning incentives into account and by realizing that our DM solves two fundamental trade-offs:

1. Choosing between an arm of known quality (the safe one, S) and an arm of unknown quality (the risky one, R), where information on the latter's quality can be gathered through experimentation. (This is the learning dimension of the problem.)

2. Choosing between an arm that provides a relatively frequent stream of modest payoffs, and an arm that provides a less frequent stream of larger payoffs. (This is the dimension along which curvature in the utility function plays a role.)

When h < s, the risky arm's payoff frequency λ has to be rather high by Assumption 1 (otherwise the DM will always choose S and the problem is not meaningful). A high λ implies that pulling the risky arm is relatively informative: if the arm is of the good type, a payoff h should be observed soon; if not, the belief about the nature of the risky arm will quickly be revised downward by equation (1), ending the experimentation process once the belief p falls below the cutoff p*_i. So when λ is high, uncertainty about the quality of the risky arm (captured under point 1) is likely to be short-lived. Consequently, the consideration under point 2 becomes more important. Along this dimension, a more risk-averse DM prefers a frequent stream of modest payoffs to an infrequent stream of larger payoffs. In the case where h < s (but λ is high enough to meet Assumption 1), arm R is the one that offers a relatively frequent stream of modest payoffs (provided the arm is of good quality). The more risk-averse DM does not like the fact that this arm is risky (its quality is initially unknown and may turn out to be low, in which case it will never pay), but when λ is high, this risk is likely to be resolved soon and is therefore of subordinate importance.

The preference of a more risk-averse DM for frequent, modest payoffs follows from the concavity of his utility function: he values increases in h less and less (due to decreasing marginal utility, which a risk-neutral DM, for example, does not experience). To see this mathematically, consider two cases, indexed A and B. Assume that λ_A > λ_B and h_B > h_A = 1 (a normalization), but λ_A h_A = λ_B h_B. Hence, option A represents a rather frequent stream of relatively minor payments, while option B represents a rarer stream of larger payments. Given that the product λh is equal for both options, a risk-neutral DM is indifferent between them. A DM whose utility is a concave transformation u(·) of payoffs (with u(0) = 0) prefers the frequent, modest flow provided by option A: his expected flow utility for option A is given by U_A = λ_A u(h_A) = λ_B h_B u(1) ≥ λ_B u(h_B) = U_B, where the inequality follows from concavity (u(h_B) ≤ h_B u(1)). More generally, this implies that a more risk-averse DM prefers a relatively frequent flow of modest payoffs to rarer arrivals of larger amounts. We thank a referee for pointing this out to us.
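The following minimal sketch reproduces this frequency-versus-size comparison with concrete numbers; all values are our own choices, and u(x) = √x serves as one concave utility with u(0) = 0.

```python
# Sketch: two options with the same expected payoff flow lam*h. A pays small
# amounts often, B pays large amounts rarely.
import math

lam_A, h_A = 4.0, 1.0
lam_B, h_B = 1.0, 4.0  # lam_A*h_A == lam_B*h_B: a risk-neutral DM is indifferent

def u(x):              # a concave utility with u(0) = 0
    return math.sqrt(x)

U_A = lam_A * u(h_A)   # expected flow utility of option A
U_B = lam_B * u(h_B)   # expected flow utility of option B
print(f"U_A = {U_A:.3f}, U_B = {U_B:.3f}")  # prints U_A = 4.000, U_B = 2.000
```

As claimed, the concave DM strictly prefers the frequent, modest stream of option A.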
It is thus the trade-off between these two forces that determines a DM's decision to pull R or S. On the one hand, a more risk-averse DM is drawn towards the safe arm (the fact that the risky arm is of unknown quality introduces extra uncertainty into its payoff stream, which he dislikes). But, on the other hand, the DM realizes that pulling the risky arm enables him to reduce risk (which a risk-averse DM particularly likes). When λ is large, there is a high "informational return" to pulling the risky arm, as there is a good chance that pulling R eliminates risk (which happens when a payoff h is observed, no matter how small h is). Subsequently, the DM is able to enjoy the (higher) utility stream provided by R in a world that no longer exhibits uncertainty about the nature of R. This explains the counterintuitive part of our result that a more risk-averse DM might be more willing to pull the risky arm than a less risk-averse DM.

The counterintuitive part (b) of Proposition 1 is seemingly at odds with the result of Chancelier, De Lara, and De Palma (2009), who conclude that more risk-averse DMs are always more likely to pull the safe arm in bandit problems. Closer inspection of Theorem 1 in Chancelier, De Lara, and De Palma (2009), however, reveals that the assumption made there restricts payoffs in such a way that it only covers case (a) of our Proposition 1; there, we obtain the same result. Since the continuous-time framework employed in this paper makes the existence of different regimes more transparent, it becomes apparent that there is a part of the parameter space (with h < s and λ sufficiently high) in which the intuitive result does not arise.
Instances where λ is high are particularly likely to occur in online settings, where information abounds and arrives at a high frequency. Websites often use machine learning algorithms (such as reinforcement learning), for example to customize the display of advertisements to user preferences. Such algorithms have the bandit problem at their core (cf. Sutton and Barto (1998)), and our result points out that increasing the degree of risk aversion in the objective function might actually give rise to a greater appetite for the risky alternative.
Proposition 1 continues to hold in the more general framework of Keller and Rady (2010): in their setup, even bad arms generate occasional payoffs equal to h, only at a lower frequency than good arms. More specifically, a good arm pays off according to a Poisson process with parameter λ_H, while this parameter equals λ_L for a bad arm (with λ_H > λ_L). Setting λ_L = 0 puts us back into the framework of Keller, Rady, and Cripps (2005) and simplifies the algebra considerably.

Our result can also be rephrased in terms of entropy reduction: the entropy of a Poisson distribution is increasing in its parameter λ, as a result of which the expected entropy reduction (= uncertainty reduction = information production) is higher when the λ of the risky arm is higher. This makes it more attractive for a risk-averse DM to pull that arm.
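The entropy claim is easy to verify numerically; the sketch below uses scipy's Poisson distribution with arbitrary λ values of our choosing.

```python
# Sketch: the entropy of a Poisson distribution increases in its parameter.
from scipy.stats import poisson

for lam in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(f"lam = {lam}: entropy = {poisson(lam).entropy():.4f} nats")
```

The printed entropies are strictly increasing in λ, consistent with the interpretation that a higher-λ risky arm produces more information per unit of time.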
To see why Theorem 1 of Chancelier, De Lara, and De Palma (2009) only covers case (a), note that it can be rewritten in our notation/model as: "Assume that there exists a concave increasing function ϕ : R → R such that u_2(s) ≥ ϕ(u_1(s)) and u_2(h) ≤ ϕ(u_1(h))...". Starting from a situation where h > s (our case (a)), it is not possible to respect the concave increasing function ϕ(·) and move to a situation in which h < s (our case (b)) while still satisfying the assumption that u_2(s) ≥ ϕ(u_1(s)) and u_2(h) ≤ ϕ(u_1(h)).
Alternatively, our counterintuitive result can arise in a labor-market setup. Consider an interpretation of the two-armed bandit model which captures the career choice between becoming a worker or becoming an entrepreneur; see Kerr, Nanda, and Rhodes-Kropf (2014) and Manso (2015) for examples of this interpretation. When going down the latter route, the DM will face greater uncertainty about his long-run payoffs, at least initially (less so after he has learned the popularity of his product). Our DM is uncertain about, say, his organizational talent, which can be either high or low. If it is high, he will obtain a higher utility level as an entrepreneur; if it is low, his firm will never take off and he is better off as a worker. Following the seminal paper by Kihlstrom and Laffont (1979), most earlier papers featuring this choice have started from the (widely accepted) premise that more risk-averse individuals will choose to become workers (the safer option, which immediately gives greater clarity on long-run payoffs), while the less risk-averse ones will choose to start a business. In this literature, the narrative of the "risk-tolerant entrepreneur" has been proposed as a parsimonious and plausible fix to the puzzling observation that entrepreneurs tend to earn less and bear more risk than salaried workers (cf. Hamilton (2000) and Moskowitz and Vissing-Jorgensen (2002)).
But by taking learning and experimentation dynamics into account, this paper demonstrates that this popular narrative does not necessarily hold true. It opens up the possibility that more risk-averse individuals might be more willing to start with the riskier action (setting up a business) if taking such a risk produces a sufficient amount of information on their organizational talent, i.e., if the information arrival rate λ is high enough. In this respect, Bonatti and Hörner (2017) apply their bandit model to the labor market and argue that the information arrival rate λ is increasing in the amount of effort exerted by the DM, which seems intuitive. This suggests that high-effort-exerting, relatively risk-averse DMs are more likely to display a greater preference for the risky arm than their less risk-averse counterparts (especially those who are not inclined to exert much effort). This theoretical ambiguity could explain why previous empirical studies have reported mixed results regarding the effect of risk aversion on the decision to become an entrepreneur. In a principal-agent setup, it furthermore implies that appointing a more risk-averse agent is no guarantee that the principal will see more cautious decision making.
On these mixed empirical results, compare Schiller and Crewson (1997), who report mixed results themselves; Barsky et al. (1997), who find no significant effect (Andersen et al. (2014) also falls into this category); and Cramer et al. (2002), who do find a significant negative effect of risk aversion on the probability of becoming self-employed, but conclude that they are not able to make statements on causality.
More generally, our findings illustrate that it is not straightforward to infer risk preferences from observed actions when the setup is dynamic and offers scope for experimentation. Risk-aversion estimates obtained from game shows (such as Deal or No Deal), which neglect the point made by this paper, might suffer from a serious bias (not only along the quantitative dimension, but even along the qualitative one).

4 Generalization to a team of N > 1 agents
In this section, we show that Proposition 1 generalizes to a setup in which N > 1 agents cooperate in a team (with all team members having the same utility function). The problem now consists of a social planner setting {(k_{1,t}, ..., k_{N,t})}_{t≥0} so as to maximize the average expected discounted payoff:

E[ ∫_0^∞ r e^{−rt} (1/N) Σ_{n=1}^{N} ((1 − k_{n,t}) u_i(s) + k_{n,t} θ λ u_i(h)) dt ].

Taking the same steps as before (and as in Keller, Rady, and Cripps (2005)) leads to the following expression for the cutoff belief:

p*_{i,N} = μ_N u_i(s) / ((μ_N + 1) λ u_i(h) − u_i(s)),  where μ_N ≡ r/(Nλ).

Consequently, we have:

p*_{2,N} − p*_{1,N} = μ_N(μ_N + 1)λ (u_1(h)u_2(s) − u_1(s)u_2(h)) / [ ((μ_N + 1)λu_1(h) − u_1(s)) ((μ_N + 1)λu_2(h) − u_2(s)) ],

and it still is the case that:

sgn(p*_{2,N} − p*_{1,N}) = sgn(u_1(h)u_2(s) − u_1(s)u_2(h)) = sgn(u_2(s)/u_1(s) − u_2(h)/u_1(h)).
Consequently, Proposition 1 continues to apply to the N -agent cooperative problem.
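As a numerical illustration (not part of the paper's formal argument), the sketch below evaluates the N-agent cutoff formula above for a case with h < s, using the same illustrative utilities as in the single-agent example; the formula with μ_N = r/(Nλ) is the reconstruction stated above.

```python
# Sketch: N-agent cooperative cutoffs, case (b) with h < s.
r, lam, s, h = 0.5, 4.0, 1.0, 0.5

def u1(x): return x ** 0.9
def u2(x): return x ** 0.5   # more risk averse than u1

def team_cutoff(u, N):
    mu_N = r / (N * lam)
    return mu_N * u(s) / ((mu_N + 1.0) * lam * u(h) - u(s))

for N in (1, 2, 5, 10):
    p1, p2 = team_cutoff(u1, N), team_cutoff(u2, N)
    print(f"N = {N:2d}: p*_1 = {p1:.5f}, p*_2 = {p2:.5f}")
```

For every N, the output shows p*_2 < p*_1, i.e. the more risk-averse team plays the risky arm at more pessimistic beliefs, exactly as in the single-agent case (b).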

5 Conclusion
We have shown that increasing a decision maker's degree of risk aversion in a two-armed bandit setup may make him more willing to experiment with the risky arm. The reason relates to the fact that experimenting with the risky action produces information on the environment, which reduces the risk that a decision maker will face going forward.
This insight has important implications. In a principal-agent setup, for example, it suggests that appointing a more risk-averse decision maker is no guarantee of less risky decisions, while it also implies that it is not straightforward to infer risk preferences from observed actions: when there is scope for experimentation, observing a greater appetite for risky actions might be indicative of more risk aversion, not less. This suggests that risk-aversion estimates stemming from dynamic game shows might suffer from serious biases, not only along the quantitative dimension, but even along the qualitative one.
Appendix A: u_2(x)/u_1(x) is decreasing in x

Lemma 1. u_2(x)/u_1(x) is decreasing in x, where u_2(x) is a concave transformation of u_1(x) for all x ∈ [0, ∞) and u_1(x) is a positive, increasing, concave function which is normalized such that u_1(0) = 0.
Proof. Write u_2(x) = φ(u_1(x)) for some increasing, concave transformation φ(·), normalized such that φ(0) = 0. The first derivative of the fraction is given by:

d/dx [u_2(x)/u_1(x)] = (u_1′(x)/u_1(x)²) (φ′(u_1(x)) u_1(x) − φ(u_1(x))).

Since u_1(x) is an increasing function, the first term of this expression is positive. Consequently, the sign of this first derivative is equal to the sign of F(x) ≡ φ′(u_1(x)) u_1(x) − φ(u_1(x)). At this stage, it is easily verified that F(0) = 0, while:

F′(x) = φ″(u_1(x)) u_1′(x) u_1(x) ≤ 0,

by the concavity of φ(·). Hence F(x) ≤ 0 for all x ≥ 0, so u_2(x)/u_1(x) is decreasing in x. ∎

Appendix B: Generalizing Proposition 1 using Pratt's representation

Using Pratt's (1964) representation, we can write u(x) = ∫_0^x exp(−∫_0^y r(z) dz) dy, where r(x) = −u″(x)/u′(x) is the absolute measure of risk aversion associated with u(·). Let us consider an increase in the absolute measure of risk aversion by ε > 0. This brings us to u(x; ε) ≈ ∫_0^x exp(−∫_0^y (r(z) + ε) dz) dy.
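A numerical sketch of this construction follows: it builds u(x; ε) on a grid by discretizing the two integrals and then recomputes the single-agent cutoff from equation (3) for a case with h < s. The constant absolute-risk-aversion profile and all parameter values are our own illustrative choices.

```python
# Sketch: Pratt-representation utilities u(x; eps) and the resulting cutoffs.
import numpy as np

r_disc, lam, s, h = 0.5, 4.0, 1.0, 0.5  # discount rate and Poisson rate
mu = r_disc / lam

grid = np.linspace(0.0, 2.0, 200001)
dx = grid[1] - grid[0]
ara = lambda z: 1.0 + 0.0 * z           # a constant absolute-risk-aversion profile

def pratt_u(eps):
    # u(x; eps) = int_0^x exp( -int_0^y (ara(z) + eps) dz ) dy, on the grid
    inner = np.cumsum(ara(grid) + eps) * dx
    return np.cumsum(np.exp(-inner)) * dx

def cutoff(u_s, u_h):
    assert lam * u_h > u_s, "Assumption 1 violated"
    return mu * u_s / ((mu + 1.0) * lam * u_h - u_s)

for eps in (0.0, 0.5, 1.0):
    u = pratt_u(eps)
    u_s, u_h = np.interp(s, grid, u), np.interp(h, grid, u)
    print(f"eps = {eps}: p* = {cutoff(u_s, u_h):.5f}")
```

Consistent with part (b) of Proposition 1, the printed cutoff falls as ε (and hence risk aversion) rises, since here h < s.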