Distinct roles for DAT and COMT in regulating dopamine transients and reward-guided decision making

Mechanisms for regulation of dopamine transmission are critical to its effects on behavior and vary by region. Recycling via the dopamine transporter (DAT) predominates in striatum, while degradation by catechol-O-methyltransferase (COMT) predominates in cortex. However, questions remain about whether and how each mechanism affects fast fluctuations in dopamine transmission in these regions and influences behavior. To address this issue, we used pharmacological blockade of each clearance mechanism to assess their roles in reward-guided decision making and in regulating sub-second dopamine transmission in striatum and cortex. We found that DAT and COMT selectively influence reward value updating in opposite directions, with DAT blockade impairing and COMT inhibition improving learning in a multi-step decision making task that requires mice to monitor changes in both optimal response strategy and reward probabilities. By contrast, neither drug influenced the speed of reversals following a change in action-state transition probabilities. In addition, DAT but not COMT influenced task engagement and motivation to work for reward in both the decision making task and a progressive ratio paradigm. Fast scan cyclic voltammetry recordings of evoked dopamine release in anesthetized mice revealed that DAT but not COMT blockade enhanced dopamine transients in nucleus accumbens. Unexpectedly, neither manipulation had an effect on evoked release in medial frontal cortex. Together, these data refine our understanding of how dopamine clearance mechanisms operate in different regions and at distinct timescales to shape aspects of reward-guided decision making.

Significance Statement
Dopamine transmission is tightly regulated by clearance mechanisms and these clearance mechanisms exhibit regional specificity. However, while we know a lot about how the activity of dopamine neurons relates to reward-guided behavior, the precise role of different clearance mechanisms in shaping these processes is much less understood. This is important, as dysfunctional clearance mechanisms have been implicated in many neuropsychiatric disorders. Here, we show specific and distinct roles for two clearance mechanisms – the dopamine transporter and catechol-O-methyltransferase – in reward value, but not action-state probability, updating during multi-step decision making, and in regulating striatal and cortical dopamine transients. Our findings demonstrate how regulation of dopamine transmission, over distinct timescales and in different brain regions, can influence multiple aspects of reward-guided behavior.


Introduction
Many studies have examined the relationship between the activity of dopamine neurons and behavior. However, clearance mechanisms, which are critical for the temporal and spatial regulation of dopamine's actions, have been much less well studied. Clearance mechanisms exhibit regional specificity within the mesocorticolimbic system. Recycling of dopamine via the dopamine transporter (DAT) predominates in nucleus accumbens (NAc) and other striatal regions (Sulzer et al., 2016). In medial frontal cortex (MFC), where DAT is sparse and located relatively distantly from dopamine terminals (Ciliax et al., 1999; Sesack et al., 1998), enzymatic degradation, particularly by catechol-O-methyltransferase (COMT), is a more prominent method of clearance (Karoum et al., 1994; Tunbridge et al., 2006).
However, the function and effects of DAT and COMT are likely not so neatly separable. For instance, reward anticipation and receipt drive increases in dopamine levels in both striatum and frontal cortex (Bassareo et al., 2002, 2007; Ellwood et al., 2017), and COMT influences the magnitude of this effect in frontal cortex (Lapish et al., 2009). In more complex settings, dopamine levels and COMT mediate the balance between model-free and model-based reinforcement learning systems (Doll et al., 2016; Sharp et al., 2016; Wunderlich et al., 2012).
The circuitry involved in these behavioral effects is unclear because, although DAT predominates in striatum and COMT in cortex, both molecules are expressed to some degree in both regions (Laatikainen et al., 2013; Matsumoto et al., 2003; Sesack et al., 1998). Moreover, reciprocal interactions between cortical and striatal dopamine (Clarke et al., 2014; Kellendonk et al., 2006; Pycock et al., 1980) raise the possibility of indirect effects.
Thus, questions remain about whether the actions of DAT and COMT are truly separable. In the behavioral realm, it is unclear to what extent DAT affects learning about rewards as well as the motivation to pursue them, particularly during more complex decision making tasks that recruit both striatal and cortical regions. While some studies have suggested that individual variation in DAT levels is related to development of reward-based response biases (Kaiser et al., 2018), direct pharmacological or genetic manipulations of DAT do not necessarily cause evident changes in reward learning (Cagniard et al., 2006a, 2006b; Costa et al., 2014). The effects of COMT manipulations on reward-guided behavior, meanwhile, remain largely unexplored. At the level of transmission, while fast and transient fluctuations in dopamine transmission are known to produce prediction error-like signals in both striatum and cortex (Ellwood et al., 2017; Hart et al., 2014), the extent to which DAT and COMT regulate this fast transmission, as opposed to slower minute-by-minute changes in dopamine levels, requires further study, particularly in cortex.
Here, we used behavioral testing, in vivo electrochemistry, and pharmacology to investigate the influence of DAT and COMT on reward-guided behavior and dopamine transmission in mice. Systemic administration of a DAT blocker (GBR-12909) or a COMT inhibitor (tolcapone) allowed us to directly contrast the role of each clearance mechanism. We investigated how each agent influenced both motivation to work for reward and flexible multi-step decision making that required mice to adapt to changes in optimal response strategy and reward probabilities. Finally, to understand how DAT blockade and COMT inhibition affect the dynamics of fast fluctuations in dopamine, we recorded evoked dopamine transients in the NAc and MFC in anesthetized animals using fast scan cyclic voltammetry.

Animals
Male C57BL/6 mice were obtained from Envigo (formerly Harlan). Male mice were used in this study because COMT has previously been shown to exhibit sexually dimorphic effects (Harrison and Tunbridge, 2008). Mice were aged 9-26 weeks for behavioral experiments and 10-16 weeks for voltammetry experiments. Animals were housed on a 12/12-hr light/dark cycle; all behavioral tests were conducted during the light phase. Mice were food deprived to 85-90% of free feeding weight for the locomotor activity, progressive ratio, and multi-step decision making tasks, and were water deprived for 3hrs prior to test sessions for the sucrose preference test.
Food and water were provided ad libitum in all other cases. All mice were habituated to handling, including the restraint position used during injections, before experiments began.
Care and testing of all animals was conducted under the auspices of the UK Home Office laws and guidelines for the treatment of animals under scientific procedures and of the local ethical review board at the University of Oxford.

Drugs
Mice were administered the selective, brain-penetrant COMT inhibitor tolcapone (30mg/kg; TRC Inc) (Barkus et al., 2016; Männistö and Kaakkola, 1999) and/or the selective DAT blocker GBR-12909 dihydrochloride (Tocris) (Izenwasser et al., 1990; Rothman et al., 1989). GBR-12909 dihydrochloride was administered at 6mg/kg, a dose that in pilot experiments produced increased locomotor activity but not stereotypy, as assessed using criteria adapted from Creese and Iversen (1974) for use in mice (data not shown). D-amphetamine (in sulfate formulation; Tocris) was used at 4mg/kg (Avelar et al., 2013; Daberkow et al., 2013), and the NET blocker atomoxetine (in hydrochloride formulation; Tocris) at 1mg/kg (Bymaster et al., 2002; Koda et al., 2010). All drugs were dissolved in 20% hydroxypropyl-beta-cyclodextrin (Acros Organics) in 0.9% saline (AquPharm), which served as a vehicle control in all experiments. All drugs were delivered by intraperitoneal injection, with an injection volume of either 5mL/kg or 10mL/kg (multi-step decision making task only). Drug administration timings were designed to account for differences in drug time courses of action.

Sucrose preference test
Sucrose preference was assessed in open-top cages equipped with two water bottles. Mice (previously run on locomotor/stereotypy test for drug dosage assessment and counterbalanced for prior drug exposure) were tested for 5hrs each day for a total of 7 days (Days 1-3: water exposure only; Days 4-7: sucrose exposure). Bottles were weighed immediately before and after testing to determine consumption. On day 7, mice received two injections: the first (tolcapone or vehicle) 1hr before testing and the second (GBR-12909 or vehicle) immediately before testing. Preference for sucrose solution (10% weight/volume; Sigma Aldrich) was assessed as a ratio of sucrose consumption to total consumption.

Progressive ratio task
A total of 24 mice (two cohorts: n = 12 previously used for locomotor activity and sucrose preference tests, counterbalanced for prior drug exposure; n = 12 test naïve, given a vehicle injection 3 days prior to testing) were tested on the progressive ratio task. Of these, 3 mice were excluded from the analysis: 2 due to failed injections and 1 that was unable to learn the complete task. The timing of experimental drug administration differed between the two cohorts.
Cohort 1 received the first injection 120min and the second injection 60min before the start of the session, whereas cohort 2 received the first injection 105min and the second injection 15min before the session. No notable effects of cohort were observed (data not shown), so findings from the two cohorts are reported together.
The task was conducted as previously described (Sharma et al., 2012) in standard operant chambers (Med Associates Inc). Rewards consisted of 60µL of 10% sucrose solution. Animals were trained on increasing fixed ratio (FR) schedules until they were able to earn ≥ 15 rewards and achieved an active:inactive lever press ratio of ≥ 3:1 on an FR5 schedule over 2 consecutive days. During progressive ratio (PR) sessions, the number of active lever presses required to obtain each subsequent reward was increased according to the following equation: number of required lever presses = $5e^{0.16i} - 5$, where $i$ is the trial number. Drug effects on behavior were assessed by giving mice two systemic injections prior to PR test sessions: tolcapone or vehicle, followed by GBR-12909 or vehicle. Each mouse received all possible drug combinations over four PR test sessions according to a counterbalanced within-subjects design. Drug testing days were interleaved with two washout days during which no injections were given: one day of testing on an FR5 schedule and one day of testing on a PR schedule.
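The progressive ratio equation above can be tabulated with a few lines of Python. This is only an illustrative sketch: the function name is hypothetical, and both the rounding to the nearest whole press and the 1-indexed trial number are assumptions, since the text gives only the raw formula.

```python
import math


def pr_requirement(i: int) -> int:
    """Active lever presses required on trial i under 5*e^(0.16*i) - 5.

    Assumptions (not stated in the text): i starts at 1 and the result
    is rounded to the nearest integer.
    """
    return round(5 * math.exp(0.16 * i) - 5)


# Response requirements for the first ten rewards of a session.
schedule = [pr_requirement(i) for i in range(1, 11)]
print(schedule)
```

Under these assumptions the requirement grows slowly at first and then accelerates, which is what makes the number of presses completed a usable index of motivation.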

Multi-step decision making task
The task was adapted from the two-step task developed by Daw et al. (2011) for dissociating model-based and model-free reinforcement learning in humans, as reported in Akam et al. (2015, 2017). A total of 16 mice began training on the task; after 6 days the 8 animals that had performed the most trials during that time were selected for continued training and subsequent drug testing. The task was run in 8 custom-made 12x12cm operant boxes controlled using pyControl (https://pycontrol.readthedocs.io). The behavioral apparatus consisted of 4 nose poke ports; a 'high' and a 'low' poke in the center flanked by 'left' and 'right' pokes (Figure 2A). Each trial started with the high and low pokes lighting up. The subject chose high or low, causing either the left or right poke to light up. The subject then poked the illuminated side for a probabilistic reward (20% weight/volume sucrose solution; Sigma Aldrich). At any point in time, one reward port had a high probability of giving reward (0.8) and the other a low reward probability (0.2).
Similarly, a particular first-step action (high or low) usually led to a particular second-step state (left or right port active) ("common" transitions, 80% of trials), though sometimes led to the opposite state ("rare" transitions, 20% of trials).
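To make the trial structure concrete, a single trial can be simulated as below. This is a sketch rather than the pyControl task code: the function and argument names are hypothetical, and it assumes that the 'low' poke commonly leads to whichever port the 'high' poke does not.

```python
import random

OTHER = {"left": "right", "right": "left"}


def run_trial(choice, high_common_state, good_port, rng,
              p_common=0.8, p_reward_good=0.8, p_reward_bad=0.2):
    """Simulate one trial; returns (second_step_state, was_common, rewarded).

    choice            -- 'high' or 'low' first-step action
    high_common_state -- port that 'high' commonly leads to ('left' or 'right');
                         'low' is assumed to commonly lead to the other port
    good_port         -- port currently carrying the high reward probability
    """
    common_state = high_common_state if choice == "high" else OTHER[high_common_state]
    was_common = rng.random() < p_common
    state = common_state if was_common else OTHER[common_state]
    p_reward = p_reward_good if state == good_port else p_reward_bad
    rewarded = rng.random() < p_reward
    return state, was_common, rewarded


rng = random.Random(1)
trials = [run_trial("high", "left", "left", rng) for _ in range(20000)]
reward_rate = sum(r for _, _, r in trials) / len(trials)
print(f"reward rate when always choosing the correct action: {reward_rate:.3f}")
```

With the final task parameters, always choosing the correct first-step action yields an expected reward rate of 0.8 × 0.8 + 0.2 × 0.2 = 0.68, which is the ceiling against which observed reward rates can be read.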
Unlike other recent rodent adaptations of multi-step decision tasks (Groman et al., 2018;Hasz and Redish, 2018;Miller et al., 2017), both the reward probabilities in the second-step states and the transition probabilities linking the first-step actions to the second-step states reversed in blocks. Block transitions were triggered based on the subject's behavior, occurring 20 trials after an exponential moving average (tau = 8 trials) of choices crossed a 75% correct threshold.
Reversals in reward probability occurred twice as often as reversals in transition probability.
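The behavior-dependent block transition rule can be sketched as follows. The class and function names are hypothetical, and several details are assumptions not given in the text: the moving average is updated with a step size of 1 − exp(−1/tau), it starts (and restarts after each reversal) at 0.5, and the 2:1 ratio of reversal types is implemented by weighted random choice.

```python
import math
import random

TAU = 8           # moving-average time constant in trials (from the text)
THRESHOLD = 0.75  # fraction-correct criterion (from the text)
DELAY = 20        # trials between threshold crossing and reversal (from the text)


class BlockTracker:
    """Tracks choices and signals when a block transition is due."""

    def __init__(self, start=0.5):
        self.alpha = 1.0 - math.exp(-1.0 / TAU)  # assumed parameterization
        self.start = start                       # assumed starting value
        self.reset()

    def reset(self):
        self.ema = self.start
        self.countdown = None  # None until the threshold has been crossed

    def update(self, correct):
        """Register one trial; return True if a block transition occurs now."""
        self.ema += self.alpha * (float(correct) - self.ema)
        if self.countdown is None:
            if self.ema > THRESHOLD:
                self.countdown = DELAY
            return False
        self.countdown -= 1
        if self.countdown == 0:
            self.reset()
            return True
        return False


def next_reversal_type(rng):
    """Reward reversals are drawn twice as often as transition reversals."""
    return rng.choices(["reward", "transition"], weights=[2, 1])[0]


tracker = BlockTracker()
flags = [tracker.update(True) for _ in range(30)]
print("first reversal on trial", flags.index(True) + 1)
```

Delaying the reversal until 20 trials after the criterion is met means the animal cannot predict the block change from a fixed trial count, while still guaranteeing a period of stable performance before each reversal.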
Subjects encountered the full trial structure from the first day of training. The only task parameters that were changed over the course of training were the state and reward transition probabilities and the reward sizes; reward size was gradually reduced and the reward and transition probabilities gradually adjusted over 38 days of training as mice became progressively more engaged with the task and learned to perform it better (see Table 1 for details). All animals had at least 12 sessions with the final task parameters prior to drug administration.
Animals were considered fully trained and ready for pharmacological testing when the group average met the following criteria: (1) a 3-day average of >400 trials per session, (2) a 3-day average of >4 reversals per session, and (3) a 3-day average combined reversal learning speed of <30 trials.
Pharmacological manipulations were performed every second day of testing using a within-subjects design. On intervening days mice were run on the task but received no injection. Tolcapone or its vehicle control was administered 90min prior to the start of the session; GBR-12909 or its vehicle control was administered 15min prior to the start of the session. Subjects received a total of 8 injections of each drug and 5 injections of each vehicle, with order counterbalanced across animals.

Fast-scan cyclic voltammetry
A total of 100 mice were used for FCV recordings (NAc n = 43; MFC n = 57). Data from 57 (NAc n = 23; MFC n = 34) of these animals were included in the final analysis; group sizes for each region and drug treatment of the between-subjects design are shown in Table 2. Animals were excluded due to death prior to the completion of the experiment (NAc n = 3; MFC n = 1), failed drug injection (NAc n = 6; MFC n = 4), or failure to meet data quality control criteria (NAc n = 9; MFC n = 13). Quality control criteria were (1) a ratio of standard deviation of prestimulation noise to peak height ≤ 0.5, (2) a detectable peak consistently occurring ≤ 3sec after stimulation, and (3) peak timing varying by ≤ 2sec across 30min bins. Finally, 2 NAc and 5 MFC animals were excluded due to poor fit ($R^2$ < 0.5) of the exponential decay function in the kinetics analysis.

Electrode fabrication and implantation
Recording and reference electrodes were made in-house (Papageorgiou et al., 2016; Syed et al., 2016) and pre-calibrated in a flow cell (Sinkala et al., 2012) to allow conversion of recorded signals from current (nA) into concentration (nM). When calibration factors were unavailable (9 electrodes), a mean calibration factor was used. The stimulating electrodes were untwisted, bipolar, stainless steel electrodes measuring 0.15mm in diameter (PlasticsOne). 4.80 DV from brain surface) (Lammel et al., 2008, 2011, 2014). A reference electrode was placed in the contralateral hemisphere (+4.80 AP and +1.00 ML from bregma).

FCV recordings
Voltammetric recordings were made as previously described (Syed et al., 2016). Dopamine release was induced by passing current through the stimulating electrode via a DS3 Stimulator (Digitimer) (stimulation parameters: pulse number = 60 pulses, frequency = 50Hz, amplitude = 300µA, pulse width = 2ms, pulse phase = biphasic), based on previous literature (Yavich et al., 2007;Yorgason et al., 2011) and on pilot experiments that established the parameters required to reliably evoke detectable dopamine release in both the NAc and MFC. Note that as the cortex is innervated by significant noradrenergic as well as dopaminergic fibers (Lindvall et al., 1978;Slopsema et al., 1982), and because dopamine and noradrenaline have very similar cyclic voltammograms (Adams, 1976;Michael and Wightman, 1999), signals recorded in cortex using FCV can only be identified as catecholaminergic, not as definitively dopaminergic. However, given that previous studies have shown that electrical stimulation of the VTA predominantly evokes dopamine rather than noradrenaline release in MFC (Shnitko and Robinson, 2014), we will refer to such signals as 'dopamine' even though a contribution of noradrenaline cannot be entirely ruled out. Stimuli were generated and recordings collected using Tarheel CV (National Instruments).
Evoked signals decayed over time and so were allowed to stabilize (stimulating every 5-10mins for ~2.5hrs during NAc, and ~1hr during MFC, recordings). After the stabilization period stimulations were made every 5min. Following a 30min pre-drug baseline period, tolcapone or vehicle was administered. GBR-12909 or vehicle was then given after a further 90min of recording, and recordings continued for several hours. Amphetamine was tested in two groups of animals: those that had received tolcapone or vehicle and naïve animals. There were no differences in signal decay between these groups (data not shown), so they were combined. Atomoxetine was only administered to drug naïve animals. Data were averaged into 15min bins for analysis: the effects of tolcapone were assessed at 85mins, of GBR-12909 and amphetamine at 30mins, and of atomoxetine at 60min post-administration. Once recording was complete, electrode placement was ascertained as previously described (Syed et al., 2016).

Experimental design and statistical analyses
Statistical analyses were conducted in SPSS versions 20 and 24 (IBM Computing), with the exception of the multi-step decision-making task (described further below), with significance set at α = 0.05. With the exceptions noted below, data were analyzed using analysis of variance (ANOVA), with drug group(s) (and time, where relevant) as factors. For repeated-measures ANOVAs, Greenhouse-Geisser corrections were applied where data failed Mauchly's test of sphericity. Simple main effects analyses were conducted as necessary when significant interactions were found. Least significant difference pairwise comparison tests were used to assess which groups were driving any significant main or interactive effects.

Progressive ratio task analysis
The main outcome measure was cumulative active lever presses over the session. We also examined cumulative inactive lever presses as a measure of general activity levels, the average reward collection latency, and the average re-engagement latency (the interval between the animal exiting the magazine after reward delivery and its next lever press). The significance of drug effects on these measures was assessed using repeated-measures ANOVAs, with drug 1 (tolcapone or vehicle) and drug 2 (GBR-12909 or vehicle) as within-subjects factors.

Multi-step decision making task analysis
For the multi-step decision making task, analysis of pharmacological manipulations was restricted to the first 90 minutes of each session. Except where stated otherwise, drug effects were evaluated using repeated-measures ANOVAs. As the two different drugs each had a corresponding vehicle condition (see above), within-subject factors were vehicle/drug (differentiating both vehicle conditions from both drug conditions) and GBR-12909/tolcapone (differentiating GBR-12909 and its respective vehicle from tolcapone and its respective vehicle). probabilities and compared using a repeated-measures ANOVA with reversal type (reward or transition) and drug condition as within-subject factors. To get a more fine-grained picture of how adaptation to reversals was affected by the drugs, the choice probability trajectory following reversals was fit by a sum of two exponential decays defined by the equation:

$$p_t = p_\infty - (p_\infty - p_0)\left[w\,e^{-t/\tau_f} + (1 - w)\,e^{-t/\tau_s}\right]$$

where $p_t$ is the probability on trial $t$ of choosing the option that was correct following the reversal, $p_\infty$ is the asymptotic probability of choosing the correct option, defined as the cross-subject mean fraction of correct choices over the last 15 trials of all blocks, $p_0$ is the initial probability of choosing the correct option (calculated as $1 - p_{\mathrm{pre}}$, where $p_{\mathrm{pre}}$ is the fraction of correct choices at the end of blocks preceding reversals of the type being analyzed), $\tau_f$ is the time constant of the fast exponential decay, $\tau_s$ is the time constant of the slow exponential decay, and $w$ is the weighting applied to the fast component relative to the slow component.
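A curve of this kind, together with the squared error cost used to fit it, can be written compactly. The sketch below assumes one natural parameterization of a sum of two exponential decays, in which the fast and slow components are weighted by w and 1 − w and the curve runs from the initial to the asymptotic choice probability; the function and argument names are hypothetical.

```python
import math


def double_exp_trajectory(t, p_inf, p_0, tau_f, tau_s, w):
    """Probability of choosing the newly correct option t trials after a reversal.

    Assumed form: starts at p_0 when t = 0 and approaches p_inf as t grows,
    with a fast (tau_f) and a slow (tau_s) component mixed by weight w.
    """
    decay = w * math.exp(-t / tau_f) + (1.0 - w) * math.exp(-t / tau_s)
    return p_inf - (p_inf - p_0) * decay


def squared_error(params, observed, p_inf, p_0):
    """Squared error between the curve and an observed trajectory.

    params   -- (tau_f, tau_s, w), the three free parameters
    observed -- sequence of mean choice probabilities, index = trial since reversal
    """
    tau_f, tau_s, w = params
    return sum(
        (double_exp_trajectory(t, p_inf, p_0, tau_f, tau_s, w) - y) ** 2
        for t, y in enumerate(observed)
    )


# Example: a trajectory generated from known parameters has zero cost at those parameters.
true_params = (5.0, 40.0, 0.6)
synthetic = [double_exp_trajectory(t, 0.8, 0.2, *true_params) for t in range(60)]
print(squared_error(true_params, synthetic, 0.8, 0.2))
```

Because the asymptote and starting point are fixed from the data, only the two time constants and the mixing weight are free, so any general-purpose minimizer can be applied to `squared_error`.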
We used permutation testing to evaluate whether differences between drug and corresponding vehicle condition were significant, with the analysis performed independently for GBR-12909 and tolcapone. The curve was fit using a squared error cost function to the cross-subject mean choice probability trajectory for drug and vehicle conditions, and the difference $\Delta x_{\mathrm{true}}$ between drug and vehicle conditions was evaluated for each of the three fitted parameters (the fast time constant, the slow time constant, and the weighting of the fast component). We then constructed an ensemble of 5000 permuted datasets in which the assignments of sessions to the drug and vehicle conditions were randomized. Randomization was performed within subjects, such that the number of sessions from each subject in each condition was preserved. For each permuted dataset we re-ran the analysis and evaluated the difference in each parameter between the two conditions, to give a distribution of $\Delta x_{\mathrm{perm}}$, which in the limit of many permutations is the distribution of $\Delta x$ under the null hypothesis that there is no difference between the conditions. The two-tailed p value for the observed difference is given by:

$$P = 2\min\left(\frac{M}{N},\; 1 - \frac{M}{N}\right)$$

where $N$ is the number of permutations and $M$ is the number of permutations for which $\Delta x_{\mathrm{perm}} > \Delta x_{\mathrm{true}}$.

The statistical significance of trial-to-trial learning, and its modulation by drug treatment, was assessed using a logistic regression model. The model predicted repeating the previous choice as a function of trial outcome (rewarded or not), transition (common or rare), and their interaction. We additionally included two predictors capturing choice biases: one for a bias towards the high/low poke, and one for rotational bias, i.e. a tendency to choose high / low following trials that ended in the left / right second-step, which is observed in some animals on this task (Akam et al., 2017). We further included a predictor that promotes repeating correct choices. This prevents correlation between action values at the start of the trial and subsequent trial events from biasing the transition-outcome interaction predictor loading (Akam et al., 2015).
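The within-subject permutation procedure used for the reversal curve fits can be illustrated with a small, generic sketch. The function name and data layout are hypothetical, and the test statistic is passed in as a callable: in the analysis described above it would be the drug-minus-vehicle difference in a fitted curve parameter, whereas the usage example substitutes a simple difference of means purely for illustration.

```python
import random


def permutation_p_value(sessions, statistic, n_perm=5000, seed=0):
    """Two-tailed permutation test with labels shuffled within each subject.

    sessions  -- dict: subject -> list of (condition, value), with condition
                 'drug' or 'vehicle' (hypothetical layout)
    statistic -- callable(drug_values, vehicle_values) -> float
    Returns (observed difference, p value), with p = 2 * min(M/N, 1 - M/N).
    """
    rng = random.Random(seed)

    def split(data):
        drug = [v for subj in data.values() for c, v in subj if c == "drug"]
        vehicle = [v for subj in data.values() for c, v in subj if c == "vehicle"]
        return drug, vehicle

    observed = statistic(*split(sessions))
    m = 0
    for _ in range(n_perm):
        permuted = {}
        for subject, subj_sessions in sessions.items():
            labels = [c for c, _ in subj_sessions]
            rng.shuffle(labels)  # preserves each subject's count per condition
            values = [v for _, v in subj_sessions]
            permuted[subject] = list(zip(labels, values))
        if statistic(*split(permuted)) > observed:
            m += 1
    return observed, 2 * min(m / n_perm, 1 - m / n_perm)


def mean_difference(drug, vehicle):
    return sum(drug) / len(drug) - sum(vehicle) / len(vehicle)


example = {
    "m1": [("drug", 30.0), ("drug", 32.0), ("vehicle", 20.0), ("vehicle", 22.0)],
    "m2": [("drug", 28.0), ("drug", 31.0), ("vehicle", 19.0), ("vehicle", 21.0)],
    "m3": [("drug", 33.0), ("drug", 29.0), ("vehicle", 23.0), ("vehicle", 18.0)],
}
observed, p = permutation_p_value(example, mean_difference, n_perm=2000)
print(observed, p)
```

Shuffling labels within rather than across subjects keeps between-animal differences out of the null distribution, so the test is sensitive only to the session-level effect of the drug.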
The regression model was fit to a dataset comprising tolcapone, GBR-12909, and their corresponding vehicle sessions. Drug effects were modeled by interacting each predictor with vehicle/drug (differentiating both vehicle conditions from both drug conditions) and GBR-12909/tolcapone (differentiating GBR-12909 and its respective vehicle from tolcapone and its respective vehicle). All coefficients were treated as independent random effects across subjects and the resulting hierarchical regression was fit using the lme4 mixed effects package (Bates et al., 2007) in the R statistical language (R Development Core Team, 2010). Random effects whose variance fit to zero were removed from the model to enable the fit to converge. This did not remove random effects for any predictors with significant fixed effects. P values were calculated using the lmerTest package (Kuznetsova et al., 2017) using Satterthwaite's method for approximating degrees of freedom.
The regression model fit to the complete dataset indicated significant three-way interactions between model predictors, vehicle/drug and GBR-12909/tolcapone. To unpack what was driving this interaction, we subsequently performed separate model fits for each drug with its corresponding vehicle, interacting the base predictors with vehicle/drug condition.
We also explored whether fitting reinforcement learning (RL) models to the multi-step decision making task data could provide insight into the behavioral strategies used by the mice and how the drugs affected these. We first modeled behavioral data from baseline sessions (when no injections were given). To do this, we compared a set of models generated by adding or removing single features from the model found to best describe behavior on this task in Akam et al. (2017). These features included: forgetting about the values and state transitions for not-chosen actions, action perseveration effects spanning multiple trials, and representation of actions both at the level of the choice they represent (e.g. high poke) and the motor action they require (e.g. left → high movement) (for full details see Akam et al., 2017). Models were compared using the integrated Bayesian Information Criterion (BIC) score. In addition to modeling baseline session data, we compared the signed difference between maximum likelihood parameter estimates after administration of GBR-12909 or tolcapone with their respective vehicles. However, although some changes following drug administration were found in the reinforcement learning model analysis, the complexity of the model meant that no effects survived multiple comparison correction for the number of model parameters. Therefore, the results of the RL modeling analysis on drug session data are not presented.

FCV recording analysis
FCV recordings were processed using software written in LabVIEW and custom Matlab scripts.
Dopamine levels were extracted using a chemometric approach based on training sets from individual animals (Heien et al., 2005; Keithley et al., 2009, 2010). Cyclic voltammograms were low-pass filtered at 2kHz and background subtracted using the 5 scans prior to stimulation. Drug effects on evoked dopamine were assessed by quantifying several features of the evoked transients, including: the peak height; the latency from the start of stimulation to the peak; and the rate of decay of the signal from the peak to T50 (the time when the signal had decayed to half the peak height) (Figure 4B). (In cases where the signal did not fall to half its peak height, the decay over the 3sec following the peak was used.) Prior to statistical analysis, parameters were normalized to the pre-drug baseline signal and binned across three individual recordings at the time of interest. The significance of drug effects on each parameter was determined using repeated-measures ANOVAs, with time as the within-subjects factor and drug 1 (tolcapone or vehicle) and drug 2 (GBR-12909 or vehicle) as between-subjects factors. As we found no interactive effects of the two experimental drugs, the effects of COMT inhibition and DAT blockade are presented separately.
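The three transient features described above can be extracted from a background-subtracted trace along the following lines. This is a simplified sketch, not the LabVIEW or Matlab code used for the recordings: the function name is hypothetical, and the decay rate is assumed here to be the average slope between the peak and T50 (or between the peak and the last sample within 3 s of it when the signal never falls to half its peak height).

```python
def transient_features(times, signal, stim_time=0.0, fallback_window=3.0):
    """Peak height, latency to peak and decay rate of one evoked transient.

    times, signal -- equal-length sequences (seconds, signal units)
    stim_time     -- time at which stimulation starts
    Decay rate is an assumed definition: (peak - end value) / elapsed time.
    """
    post_stim = [i for i, t in enumerate(times) if t >= stim_time]
    peak_i = max(post_stim, key=lambda i: signal[i])
    peak, peak_t = signal[peak_i], times[peak_i]
    half = peak / 2.0

    # First sample after the peak at or below half the peak height (T50).
    end_i = next(
        (i for i in range(peak_i + 1, len(signal)) if signal[i] <= half), None
    )
    reached_t50 = end_i is not None
    if not reached_t50:
        # Fallback from the text: use the decay over the 3 s following the peak.
        in_window = [
            i for i in range(peak_i + 1, len(signal))
            if times[i] <= peak_t + fallback_window
        ]
        end_i = in_window[-1] if in_window else peak_i

    elapsed = times[end_i] - peak_t
    decay_rate = (peak - signal[end_i]) / elapsed if elapsed > 0 else float("nan")
    return {
        "peak_height": peak,
        "latency_to_peak": peak_t - stim_time,
        "decay_rate": decay_rate,
        "reached_t50": reached_t50,
    }


# Synthetic trace sampled at 10 Hz: linear rise to 10 at 0.5 s, then linear fall.
times = [i / 10 for i in range(21)]
signal = [2.0 * i if i <= 5 else 10.0 - (i - 5) for i in range(21)]
print(transient_features(times, signal))
```

Each feature would then be divided by its pre-drug baseline value before entering the statistical analysis, as described in the text.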

DAT blockade, but not COMT inhibition, increases motivation to work for reward
Before beginning behavioral testing, we assessed the effects of DAT blockade and COMT inhibition on basic reward processing using a sucrose preference test. We found no influence of either tolcapone or GBR-12909 on either sucrose preference or absolute sucrose or water consumption (all F < 1.7, p > 0.21, univariate ANOVAs). These results indicate that drug effects on the hedonic properties of sucrose rewards did not influence subsequent behavioral experiments, although it is possible that the sucrose preference test was not sensitive enough to rule out such confounding effects.
We next investigated the influence of DAT blockade and COMT inhibition on reward guided behavior using a progressive ratio (

Both DAT blockade and COMT inhibition modulate value updating during multi-step decision making
To assess how regulation of dopamine transmission affects flexible reward-guided decision making, we used a multi-step decision task in which mice had to learn which of two options to select (high / low nose pokes) to gain access to a high probability reward port (left / right ports) (Figure 2A). Maximizing the reward rate on the task requires choosing the action at the first step that commonly leads to the second step state with high reward probability, and tracking this correct action across reversals in the reward and transition probabilities.
As can be seen in Figure 2C, mice learned to do this proficiently. By the end of training, animals were performing 442.2 ± 28.8 trials and 5.7 ± 0.6 blocks per session, completing 70.5 ± 5.2 trials per block, and obtaining reward on 51.6 ± 0.5 percent of trials (mean ± SEM across animals over the 3 days before the first injection). We fit reinforcement learning models using

To obtain a more fine-grained picture of drug effects on reversal behavior we fit a double exponential curve to the choice probability trajectories following reversals in reward and transition probabilities. We then used a permutation test for each drug and reversal type independently to assess whether the fits to each reversal type were changed by the drug manipulations. DAT blockade again significantly increased the time constant of adaptation to reversals in the reward probabilities (p = 0.0012, permutation test, Table 3) but did not affect how quickly subjects adapted to reversals in the transition probabilities (permutation test p > 0.23 for all fit parameters, Table 3) (Figure 2F). Therefore, while normal DAT function is important for updating of action values, it has limited influence on updating of action-state transition probabilities. In addition, this curve fitting analysis indicated that COMT inhibition reduced the time constant for adaptation to reversals in the second-step state reward probabilities (p = 0.049, permutation test, Table 4) (Figure 3C). This effect was again specific to reward probability reversals, as COMT inhibition had no effect on how fast subjects adapted to reversals in the transition probabilities (permutation test p > 0.3 for all fit parameters, Table 4).

Analysis of stay probabilities indicated that, consistent with our previous work with this task (Akam et al., 2017), both rewarded outcomes and common transitions promoted repeating choices (Figure 2G, 3D, P < 0.001, mixed effects logistic regression), but the transition-outcome interaction did not (P = 0.78). In online learning of action-state transition probabilities, a model-based reinforcement learning strategy tends to generate a main effect of transition rather than a transition-outcome interaction (Akam et al., 2015, 2017). Therefore, the regression analysis supports the RL modeling described earlier in demonstrating that the behavior of the mice involved a model-based component and was not simply driven by model-free reinforcement learning. However, when we assessed drug effects, we found that neither DAT nor COMT manipulations

In contrast, COMT inhibition had no effect on any index of evoked dopamine transmission in NAc core (Figure 4E,G,I). There were no main or interactive effects involving the COMT inhibitor on peak height, latency to peak, or decay from peak (all F < 2.4, p > 0.14).
Given the prominent role of COMT degradation in regulating cortical dopamine transmission, and the potential involvement of cortical as well as striatal regions in behavioral tests such as the multi-step decision making task, we followed up our recordings of VTA-stimulation evoked dopamine release in the NAc with similar recordings in the prelimbic MFC (Figure 5A). Evoked release in the MFC was, in general, considerably smaller than in the NAc (signal sizes at the start of the pre-drug baseline recording period, mean ± standard deviation: NAc = 11.48 ± 12.36 nA, MFC = 2.26 ± 0.96 nA). Nevertheless, we were able to record clear signals in our cortical experiments (Figure 5B).
Unexpectedly, we found no influence of either DAT blockade ( Figure 5D,

Discussion
Here we demonstrate distinct roles for DAT and COMT in reward-guided behavior. DAT blockade affected multiple facets of reward-guided behavior: it increased motivational drive in the PR task, stimulated task engagement during the multi-step decision making task, and selectively impaired updating of reward values during the latter task. In contrast, COMT inhibition did not alter PR behavior but did improve reward value updating in the decision making task. FCV recordings confirmed a role for DAT recycling, but not COMT degradation, in regulating fast fluctuations in dopamine transmission in NAc. Unexpectedly, neither DAT blockade nor COMT inhibition affected evoked dopamine transients in prelimbic MFC, indicating that clearance mechanisms other than DAT and COMT contribute to regulation of cortical dopamine at sub-second timescales.

The significance of DAT for reward-guided learning and motivation

Our findings demonstrate that DAT influences several distinct aspects of reward-guided behavior. In the PR task, DAT blockade increased active lever presses and shortened the latency to re-engage with the task, consistent with evidence demonstrating its importance for motivational drive and the exertion of effort to obtain reward (Cagniard et al., 2006a, 2006b; Young and Geyer, 2010; Zhuang et al., 2001). However, the nature of the PR task makes it difficult to determine whether differences in learning might also contribute to these behavioral effects.
We therefore also employed a more complex multi-step sequential decision making task as a means of disentangling this confound. Unlike other similar multi-step paradigms in rodents (Groman et al., 2018; Hasz and Redish, 2018; Miller et al., 2017), the version we used included reversals in both reward and action-state transition probabilities. This not only reduced the opportunity for the animals to rely on a sophisticated habit-like strategy (Akam et al., 2015) but also allowed us to examine the influence of DAT and COMT on different aspects of behavioral flexibility. Consistent with previous work using the same task (Akam et al., 2017), the mice were sensitive to the transition structure of the task and exhibited behavior consistent with the use of a mixture of model-based and model-free reinforcement learning.
Our findings suggest that DAT is important for multiple features of reward-guided behavior. In addition to increased trial rate and speeded responding following DAT blockade, we found a selective effect on reward updating: specifically, DAT blockade decreased subjects' ability to adapt following reversals in second-step reward probabilities. This was not caused by differences in performance on and off the drug prior to reversals. Moreover, as the task included reversals in transitions as well as reward probabilities, we were able to demonstrate that, strikingly, the effect of DAT blockade on learning was not observed following reversals in the transition probabilities linking first-step actions to second-step states, even though these also provided an opportunity for animals to adapt their behavior (in this case, at the first-step choice) in order to maximize rewards obtained. Therefore, the deficit cannot be attributed to a general behavioral inflexibility. Instead, our data are consistent with DAT influencing rapid reward-driven alternations in behavioral strategies and motivational components of reward-guided behavior.
While numerous studies have implicated DAT in the regulation of motivation, there is limited evidence linking it to reward learning, with many studies finding no clear effects of genetic or pharmacological disruption of DAT function on acquisition of either instrumental or Pavlovian associations (Cagniard et al., 2006a, 2006b; Costa et al., 2014; Kaiser et al., 2018; Peciña et al., 2003; Yin et al., 2006). Fast fluctuations in striatal dopamine correlate with reward prediction error signals, which are strongly linked to animals' ability to form certain reward-related associations (Day et al., 2007; Flagel et al., 2011; Pessiglione et al., 2006; Saddoris et al., 2015). In agreement with previous studies (Budygin et al., 1999; Huotari et al., 1999, 2002; Nomikos et al., 1990; Raevskii et al., 2002), we found that DAT blockade both increased and extended evoked NAc dopamine transients. Theoretically, these extended striatal dopamine release events could promote reinforcement learning by boosting reward prediction error-like signals. Alternatively, an increase in the duration of such transients could corrupt the precision of this encoding, reducing reinforcement learning efficiency. It is therefore likely that the behavioral consequences of DAT blockade on reward learning will depend on the paradigm.
For instance, impairments might be more likely in paradigms like our multi-step decision making task, where trials are frequent, associations are changeable, and rewards are uncertain, because performing adaptively on such tasks is facilitated if subjects use precise contingent learning of choice-state-outcome associations rather than non-contingent approximations based on recent choice or reward histories (Walton et al., 2011).
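To make the link between prediction-error-driven updating and reversal speed concrete, the sketch below uses a generic delta-rule learner. It is a textbook illustration under stated assumptions, not the reinforcement learning model fit to the mice, and it makes no claim about how DAT blockade maps onto any model parameter: the function name, the reward probabilities (0.8 before and 0.2 after the reversal) and the 0.5 criterion are all hypothetical, and the update uses the expected prediction error rather than sampled rewards.

```python
def trials_to_adapt(alpha, p_before=0.8, p_after=0.2, criterion=0.5, max_trials=1000):
    """Trials after a reversal until the expected value estimate crosses `criterion`.

    alpha is the learning rate scaling each prediction-error update. The
    estimate is assumed to have converged to p_before when the reward
    probability switches to p_after.
    """
    value = p_before
    for trial in range(1, max_trials + 1):
        prediction_error = p_after - value  # expected reward prediction error
        value += alpha * prediction_error   # delta-rule update
        if value <= criterion:
            return trial
    return None
```

With these assumed values, a learner with `alpha = 0.4` crosses the criterion after 2 trials, whereas one with `alpha = 0.1` needs 7. Anything that reduces how effectively each prediction error updates the value estimate would therefore be expected to lengthen the time constant of adaptation to a reward reversal.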
Our voltammetry recordings came from anesthetized animals and used supraphysiological stimulation parameters to compare release in striatum with that in cortex (where the lower density of dopamine terminals reduces signal-to-noise compared to striatum). Nevertheless, the effects of DAT blockade that we observed are similar to those seen on spontaneous dopamine transients in freely-moving animals following administration of nomifensine, a catecholamine transporter blocker, or stimulant drugs such as amphetamine (at least at low to moderate doses) (Daberkow et al., 2013; Robinson and Wightman, 2004).
We found no effect of DAT blockade on evoked release in prelimbic MFC, consistent with the sparse DAT content in this region (Sesack et al., 1998). DAT might still regulate cortical dopamine levels over longer timescales: some previous studies of DAT blockade have observed an effect on cortical dopamine levels measured with microdialysis over minutes (Carboni et al., 2006; Cass and Gerhardt, 1995; Tanda et al., 1997; Valentini et al., 2004), although evidence is mixed (Mazei et al., 2002; Pozzi et al., 1994; Weikop et al., 2007). Nonetheless, our data indicate that DAT recycling plays a role in regulating fast dopamine transmission only in striatum.

The role of COMT degradation
Research on COMT's role in behavior has largely focused on cognitive functions, while its significance for reward-guided behavior remains relatively unstudied.
Here, we tested the effects of inhibiting COMT in two reward-guided tasks. COMT inhibition had no effect on motivational aspects of behavior, in line with its absence of effect on NAc dopamine transmission (Acquas et al., 1992; Budygin et al., 1999; Garris and Wightman, 1995).

Nonetheless, there was a selective effect of COMT inhibition on value updating in the multi-step sequential decision making task. In contrast to DAT blockade, COMT inhibition speeded reversals. Although acute pharmacological inhibition of COMT differs from the chronic changes in enzymatic activity arising from the human COMT Val/Met polymorphism, this finding is concordant with reports of faster reinforcement learning in Met allele carriers, who have lower COMT activity than Val allele homozygotes, an effect that was statistically significant in a meta-analysis (Corral-Frías et al., 2016). Moreover, our data align with the proposal by Frank and colleagues that the behavior of Met allele carriers is more sensitive to single instances of negative feedback (Frank et al., 2007).
Given that COMT inhibition had little to no effect on the size or kinetics of evoked dopamine transients in either NAc or prelimbic MFC, COMT appears not to shape sub-second reinforcement signals. This might at first appear surprising given COMT's well-established role in regulating cortical dopamine transmission (Gogos et al., 1998; Käenmäki et al., 2010; Slifstein et al., 2008; Tunbridge et al., 2004). Indeed, the only previous study that investigated the influence of COMT on fast catecholamine transmission reported an increase in dopamine overflow compared to wild-type controls (Yavich et al., 2007). However, there are a number of important differences between this previous study and ours, notably, the method of COMT manipulation (a constitutive knock-out versus an acute pharmacological challenge) and the analysis approach (amperometric currents combined with periodic cyclic voltammograms versus principal component regression). The latter may be particularly important given the difficulty of separating dopamine from other potential chemical contaminants in cortex.
In addition, COMT's regulation of cortical dopamine transmission is complex and context-dependent: effects of COMT on cortical dopamine are typically only observed under conditions of potentiated dopamine transmission (Lapish et al., 2009; Tammimaki et al., 2016; Tunbridge et al., 2004). While the precise synaptic location of COMT is not fully determined, it is likely situated on postsynaptic membranes or even inside postsynaptic neurons, possibly extrasynaptically (Chen et al., 2011; Myöhänen et al., 2010). This would limit its ability to directly modulate fast fluctuations in dopamine. Furthermore, given the lack of autoreceptors in cortically-projecting dopamine neurons (Gainetdinov and Caron, 2003; Lammel et al., 2008), there may also be less scope in cortex for indirect effects of tonic dopamine levels on evoked transients, as occurs in striatum (Grace, 1991; Sulzer et al., 2016). Instead, our data suggest that clearance mechanisms other than DAT and COMT regulate the kinetics of shorter-lived dopamine transients in the cortex. While FCV currently lacks chemical selectivity to separate dopamine from noradrenaline, the key point is that neither DAT nor COMT appears to be a major regulator of cortical catecholamine levels at the type of fast timescales required for precise reinforcement learning.
In conclusion, we demonstrate that both DAT and COMT regulate specific and distinct aspects of reward-guided behavior, although they had little influence on the balance of reinforcement learning strategies. While DAT regulates fast fluctuations of dopamine in NAc, such fluctuations in MFC were unaffected by manipulation of either DAT or COMT. Taken together, our findings demonstrate the complex role of dopamine in multiple aspects of reward-guided learning, which appears to operate over distinct timescales and in different brain regions to mediate its effects.

Figure Legends

[...] DAT blocker (purple). Each data point shows the lever press session total for one animal. Boxplots show median and 25th and 75th percentiles; whiskers extend from the minimum to maximum value. Lever press data are shown on a log10 scale for clarity. D) As in (C), but for responses on the inactive lever. Data points again show lever press session totals for individual animals and are displayed on a log10 scale. E) As in (C), but for the latency to collect reward following its delivery. Each data point shows the cross-trial average latency for one animal. F) As in (C), but for the latency to re-engage with the task by recommencing lever pressing following the consumption of reward. Data points show cross-trial average latencies for individual animals.

[...] The set of models considered was generated by adding or removing single features from the model found to best describe behavior on this task in Akam et al., 2017. Models are labeled on the x-axis by the feature that was added or removed.

[...] release in the NAc in animals given the COMT inhibitor as drug 1 (red) compared to release in those given vehicle as drug 1 (black) 85 min after the first injection. Release is normalized to the average pre-drug baseline peak height (equivalent to 100% on the y axis), binned over 15 min centered on the time point of interest, and presented as mean ± SEM across animals within each drug group. Timing and duration of stimulation indicated by thick black bar. F) As in (E), but comparing release in animals given the DAT blocker as drug 2 (blue) with release in those given vehicle as drug 2 (black) 30 min after the second injection. G) Quantification of the peak height of evoked dopamine release in NAc following administration of the COMT inhibitor. Left: peak height at the same time point shown in (E). Each animal's data is shown individually and is normalized to its average pre-drug baseline peak height (equivalent to 100% on the y axis). Box plots show median and 25th and 75th percentiles; whiskers extend from the minimum to maximum value. Right: normalized peak height (mean ± SEM for each drug group) over the 90 min following the first injection in animals that received the COMT inhibitor compared with those that received vehicle as drug 1. H) As in (G), but comparing data from animals that received the DAT blocker with data from those that received vehicle as drug 2; right-hand plot shows peak height over the 90 min following the second injection. I) As in the left-hand plot of (G), but showing the quantification of the latency from stimulation to peak (left) and of the decay from the peak to T50 (right). Decay constant data are shown on a log10 scale for clarity. J) As in (I), but for the DAT blocker.

[...] dopamine release in the MFC in animals given the COMT inhibitor as drug 1 (red) compared to release in those given vehicle as drug 1 (black) 85 min after the first injection. Release is normalized to the average pre-drug baseline peak height (equivalent to 100% on the y axis), binned over 15 min centered on the time point of interest, and presented as mean ± SEM across animals within each drug group. Timing and duration of stimulation indicated by thick black bar.

D) As in (C), but comparing release in animals given the DAT blocker as drug 2 (blue) with release in those given vehicle as drug 2 (black) 30 min after the second injection. E) Quantification of the peak height of evoked dopamine release in MFC following administration of the COMT inhibitor. Left: peak height at the same time point shown in (C). Each animal's data is shown individually and is normalized to its average pre-drug baseline peak height (equivalent to 100% on the y axis). Box plots show median and 25th and 75th percentiles; whiskers extend from the minimum to maximum value. Right: normalized peak height (mean ± SEM for each drug group) over the 90 min following the first injection in animals that received the COMT inhibitor compared with those that received vehicle as drug 1. F) As in (E), but comparing data from animals that received the DAT blocker with data from those that received vehicle as drug 2; right-hand plot shows peak height over the 90 min following the second injection. G) As in the left-hand plot of (E), but showing the quantification of the latency from stimulation to peak (left) and of the decay from the peak to T50 (right).

[...] presented as mean ± SEM across animals within each group. Right: quantification of the signal decay, with each animal's data shown individually and normalized to its average pre-drug baseline decay constant.

[...] Left: schematic of experiment structure. Middle: evoked dopamine release normalized to the average pre-drug baseline peak height, binned over 15 min centered on the time point of interest (60 min after drug injection), and presented as mean ± SEM across animals within each group. Right: quantification of the signal decay, with each animal's data shown individually and normalized to its average pre-drug baseline decay constant.
Tables

Table 1: Multi-step decision making task parameter changes over training.