Episode #9 | December 25, 2025 @ 7:00 PM EST

The Teaching Signal: Dopamine's Role in Value Learning

Guest

Dr. Wolfram Schultz (Neuroscientist, University of Cambridge)
Announcer The following program features simulated voices generated for educational and philosophical exploration.
Adam Ramirez Good evening. I'm Adam Ramirez.
Jennifer Brooks And I'm Jennifer Brooks. Welcome to Simulectics Radio.
Adam Ramirez Tonight we're examining dopamine signaling and its relationship to reinforcement learning theory. Dopamine has become synonymous with reward in both neuroscience and popular culture, but the actual story is more nuanced. In computational reinforcement learning, an agent learns by computing prediction errors—the difference between expected and received reward. When the reward is better than expected, the error is positive and the agent strengthens the actions that led to it. When reward is worse than expected, the error is negative and those actions are weakened. The claim is that dopamine neurons encode these prediction errors. They fire when reward exceeds expectation, pause when reward falls short, and show no change when reward matches prediction. This would make dopamine the brain's teaching signal for learning about value.
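The update rule Adam describes can be sketched in a few lines of Python. This is a minimal illustration of the prediction-error principle, not a model of any particular experiment; the learning rate and reward values are arbitrary choices for the sketch.

```python
# Minimal sketch: the value estimate moves toward received reward in
# proportion to the prediction error. Better than expected -> positive
# error -> estimate strengthened; worse -> negative error -> weakened.

def update_value(expected, received, learning_rate=0.1):
    """One prediction-error update; returns the new estimate and the error."""
    error = received - expected              # prediction error
    return expected + learning_rate * error, error

value, err = 0.0, 0.0
for _ in range(50):                          # same reward of 1.0, repeated
    value, err = update_value(value, 1.0)
print(value)   # approaches 1.0; the error shrinks toward zero as it does
```

As the estimate converges on the true reward, the error, and with it any further learning, fades away, which is exactly the behavior the prediction-error account attributes to dopamine responses to fully predicted rewards.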
Jennifer Brooks The elegance of this correspondence is striking, but it raises questions about mechanism and interpretation. Dopamine neurons are few in number—around twenty thousand in the mouse, perhaps half a million in the human. Yet they project broadly throughout the striatum and cortex, releasing dopamine over large volumes of tissue. How does such a diffuse signal carry specific information about which actions to reinforce? And is dopamine really encoding reward prediction error in the formal sense, or are we imposing a computational interpretation on a biological signal that may have multiple functions?
Adam Ramirez To explore these questions, we're joined by Dr. Wolfram Schultz, a neuroscientist at the University of Cambridge whose recordings from behaving primates uncovered the reward prediction error signal in dopamine neurons and showed that their activity matches the temporal difference learning algorithm. Dr. Schultz, welcome.
Dr. Wolfram Schultz Thank you. It's a pleasure to discuss these issues.
Jennifer Brooks Let's start with the foundational observations. What did you see in dopamine neuron activity that suggested they were encoding prediction errors?
Dr. Wolfram Schultz We recorded from dopamine neurons in the substantia nigra and ventral tegmental area while monkeys performed tasks involving predicted rewards. Initially, when a reward was unexpected, dopamine neurons showed a phasic burst of activity at the time of reward delivery. But after the animal learned to predict the reward through a conditioned stimulus—a light or tone that reliably preceded reward—the dopamine response shifted. The neurons stopped responding to the reward itself and instead responded to the predictive stimulus. If the predicted reward was then omitted, the neurons showed a pause in firing at the expected time of reward. This pattern—activation for better-than-expected outcomes, depression for worse-than-expected outcomes, no response for fully predicted outcomes—precisely matches the reward prediction error in temporal difference learning algorithms.
Adam Ramirez Temporal difference learning was developed in machine learning as an efficient algorithm for estimating value functions. The key insight is that you can learn from the difference between successive predictions, rather than waiting for the final outcome. If your prediction of future reward increases from one time step to the next, that increase serves as a teaching signal. Seeing this algorithm implemented in dopamine neurons is remarkable. But how literal is the correspondence? Are dopamine neurons actually computing TD errors through the mathematical operations specified by the algorithm?
Dr. Wolfram Schultz The correspondence is at the algorithmic level, not necessarily the implementational level. Dopamine neurons receive inputs that carry information about predicted value and actual outcomes. The circuitry that computes the difference—the prediction error—likely involves interactions between these inputs, mediated by synaptic weights, recurrent connections, and inhibitory interneurons. We don't yet have a complete mechanistic account of how the subtraction is performed. But the functional signal that dopamine neurons broadcast matches what the TD algorithm requires, and downstream circuits that receive dopamine can use this signal to adjust synaptic weights according to reinforcement learning rules.
Jennifer Brooks What's the evidence that dopamine actually drives learning, rather than just correlating with prediction errors?
Dr. Wolfram Schultz There's substantial causal evidence. Optogenetic experiments in rodents show that artificially activating dopamine neurons can drive learning even in the absence of external reward. If you pair a stimulus with dopamine activation, animals learn to prefer that stimulus, as if it had been associated with reward. Conversely, blocking dopamine signaling impairs learning of reward associations. Pharmacological and genetic manipulations that disrupt dopamine transmission prevent animals from forming new associations or adjusting behavior based on changing reward contingencies. So dopamine isn't just correlated with learning—it's necessary and sufficient for certain forms of reinforcement learning.
Adam Ramirez How does a global dopamine signal, broadcast diffusely across the striatum, specify which particular synapses to strengthen or weaken? In supervised learning, you have error signals specific to each output unit. But dopamine seems to be a scalar signal—a single number indicating how good or bad the current state is.
Dr. Wolfram Schultz This is a key challenge. The solution involves eligibility traces—the idea that synapses that were recently active are 'eligible' for modification when the dopamine signal arrives. Dopamine gates plasticity, but it doesn't specify which synapses change. That's determined by the pattern of recent activity. Synapses that were activated just before the dopamine signal are strengthened if dopamine is elevated, weakened if dopamine is depressed. This combines Hebbian plasticity—'neurons that fire together wire together'—with reinforcement—the dopamine signal indicates whether the recent activity was valuable. It's a three-factor learning rule: presynaptic activity, postsynaptic activity, and dopamine.
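The three-factor rule can be sketched concretely. In the toy example below, which uses illustrative variable names and a hand-built activity pattern rather than any published model, pre/post coincidence marks a synapse as eligible via a decaying trace, and a single global dopamine pulse later converts that eligibility into a weight change.

```python
import numpy as np

# Three-factor sketch: Hebbian coincidence (factors 1 and 2) sets an
# eligibility trace; a later scalar dopamine signal (factor 3) gates
# which eligible synapses actually change. Parameters are illustrative.

n_syn = 4
w = np.zeros(n_syn)          # synaptic weights
trace = np.zeros(n_syn)      # eligibility traces
tau, alpha = 0.5, 1.0        # trace decay per step, learning rate

# coactivity[t, i] = 1 if pre and post both fired at synapse i on step t
coactivity = np.zeros((6, n_syn))
coactivity[1, 0] = 1         # synapse 0 active long before the burst
coactivity[4, 1] = 1         # synapse 1 active just before the burst
coactivity[5, 2] = 1         # synapse 2 active at the burst; 3 never fires

for t in range(6):
    trace = tau * trace + coactivity[t]
    dopamine = 1.0 if t == 5 else 0.0    # one phasic burst at the end
    w += alpha * dopamine * trace        # only eligible synapses change

print(w)   # recent coactivity earns the largest change; distant, almost none
```

The scalar dopamine signal never says which synapses to modify; the recency-weighted traces do, which is how a diffuse broadcast can still produce synapse-specific learning.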
Jennifer Brooks How much heterogeneity is there in dopamine neuron populations? Are all dopamine neurons encoding the same signal, or do different subpopulations encode different aspects of value or prediction error?
Dr. Wolfram Schultz There is heterogeneity. While many dopamine neurons show canonical reward prediction error responses, there are also neurons that respond preferentially to motivational salience—anything that's behaviorally important, whether rewarding or aversive. Some neurons show stronger responses to certain reward types—food versus liquid versus social rewards. And there's anatomical heterogeneity—neurons in different regions of the midbrain project to different striatal targets and may carry somewhat different signals. This diversity suggests that dopamine isn't a single monolithic teaching signal but a collection of related signals tailored to different learning contexts and behavioral demands.
Adam Ramirez In reinforcement learning theory, there's a distinction between model-free and model-based learning. Model-free learning uses prediction errors to cache value estimates without explicitly representing the environment's dynamics. Model-based learning builds an internal model of the world and uses it for planning. Does dopamine support both, or primarily model-free learning?
Dr. Wolfram Schultz Dopamine is most clearly implicated in model-free learning—the direct strengthening of stimulus-response associations based on reward outcomes. But there's evidence that dopamine also influences model-based systems. Dopamine projections to prefrontal cortex may modulate working memory and planning processes. And the distinction between model-free and model-based isn't perfectly clean in the brain. Real behavior likely involves coordination between multiple systems, some more habitual and cached, others more deliberative and model-based. Dopamine may play different roles in these different systems.
Jennifer Brooks How does dopamine signaling change across time as learning progresses? Does the prediction error signal eventually disappear once the task is fully learned?
Dr. Wolfram Schultz In well-learned, stable tasks, the dopamine response to predictable rewards does diminish, consistent with the prediction error account. But learning is rarely complete in natural environments. There's always some residual uncertainty, occasional unexpected events, changes in reward magnitude or timing. Dopamine continues to track these variations, signaling when outcomes deviate from expectation. Also, even after learning, dopamine may play a role in maintaining motivation and vigor. Its function isn't solely instructive—it also has modulatory effects on ongoing neural activity and behavioral activation.
Adam Ramirez One criticism of the reward prediction error hypothesis is that dopamine responses sometimes occur to non-rewarding stimuli, or show properties that don't fit the pure TD error framework. How do you reconcile these observations with the computational model?
Dr. Wolfram Schultz The reward prediction error account is a framework, not an absolute law. It captures a core computational principle—learning through comparison of predictions and outcomes—but biological dopamine signaling is richer than any single algorithm. Dopamine neurons respond to novelty, uncertainty, and salience in ways that may extend beyond simple scalar reward. Some of these responses might be explained by extensions to the basic model—for instance, incorporating uncertainty into the prediction error computation. Others may reflect additional functions that dopamine serves beyond reinforcement learning. The key is to distinguish between the core computational role and auxiliary functions, and to refine the model to account for discrepancies without abandoning the fundamental insight.
Jennifer Brooks What about negative prediction errors? You mentioned that dopamine neurons pause when expected rewards are omitted. But pauses are just reductions in tonic firing. How does a decrease in dopamine signal the need to weaken associations, and is this as effective as the positive signal from bursts?
Dr. Wolfram Schultz This is an important asymmetry. Dopamine neurons have a baseline firing rate—around three to five spikes per second—and can signal negative prediction errors by pausing below this baseline. But the pause is limited by zero firing rate, whereas bursts can go much higher. This creates an asymmetry in the teaching signal. Some experiments suggest that negative prediction errors may be signaled by other systems in parallel—for instance, habenula neurons that inhibit dopamine neurons, or local GABAergic interneurons in the striatum. The full story of how positive and negative prediction errors are encoded likely involves multiple interacting signals, with dopamine providing one component.
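The encoding asymmetry is easy to see in a toy rate model. The baseline and gain below are illustrative assumptions, not measured values; the point is only that a firing rate is bounded below by zero but not meaningfully bounded above.

```python
# Sketch of the burst/pause asymmetry: a prediction error mapped onto a
# firing rate around an assumed ~4 spikes/s baseline. BASELINE and GAIN
# are illustrative, not fitted to recordings.

BASELINE = 4.0    # spikes/s
GAIN = 10.0       # spikes/s per unit of prediction error

def firing_rate(delta):
    """Rate code for a prediction error, floored at zero."""
    return max(0.0, BASELINE + GAIN * delta)

print(firing_rate(+2.0))   # 24.0: a large burst, effectively unbounded above
print(firing_rate(-0.4))   # 0.0: the pause bottoms out at zero
print(firing_rate(-2.0))   # 0.0: a much worse error looks identical
```

Once the pause hits zero, further negative error is invisible in this channel, which is one reason parallel systems like the habenula are thought to help carry the negative side of the signal.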
Adam Ramirez How do dopamine systems handle delayed rewards? In many real-world tasks, the reward comes seconds or minutes after the action. TD learning handles this through temporal credit assignment—propagating value estimates backward through time. Does the brain implement something analogous?
Dr. Wolfram Schultz Temporal credit assignment is solved in part by shifting dopamine responses to earlier predictive stimuli, as I described earlier. The dopamine signal moves from the reward to the cue that predicts it, and with further training it can move to even earlier cues. This allows the learning signal to bridge temporal gaps. Additionally, sustained neural activity or sequential patterns of activity in cortex and hippocampus may maintain representations of past actions until reward arrives, providing the substrate for connecting actions to delayed outcomes. Eligibility traces at synapses extend the window during which dopamine can modify connections. It's a multifaceted solution involving activity dynamics, synaptic mechanisms, and network architecture.
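How traces bridge a temporal gap can be sketched with tabular TD(lambda) on a simple chain with reward only at the end. Chain length and parameters are arbitrary choices for the sketch.

```python
import numpy as np

# Sketch of trace-based credit assignment: a 5-state chain rewarded only
# at the last state. With lam = 0, value creeps back one state per trial;
# with a decaying trace (lam = 0.9), one trial already credits early
# states. Parameters are illustrative.

def train(lam, n_trials, n_states=5, alpha=0.5, gamma=1.0):
    V = np.zeros(n_states)
    for _ in range(n_trials):
        e = np.zeros(n_states)                  # eligibility traces
        for s in range(n_states):
            r = 1.0 if s == n_states - 1 else 0.0
            v_next = V[s + 1] if s + 1 < n_states else 0.0
            delta = r + gamma * v_next - V[s]   # TD error
            e[s] += 1.0                         # mark current state eligible
            V += alpha * delta * e              # all eligible states updated
            e *= gamma * lam                    # traces decay over time
    return V

print(train(0.0, 1))   # only the final state learns in one trial
print(train(0.9, 1))   # early states already carry some value
```

With traces, a single delayed reward reaches back across the whole sequence at once, complementing the cue-shifting mechanism described above.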
Jennifer Brooks What are the implications for understanding disorders like addiction, Parkinson's disease, and depression, which involve dopamine dysfunction?
Dr. Wolfram Schultz Understanding dopamine as a teaching signal helps explain these disorders. In addiction, drugs hijack the dopamine system, producing prediction errors that drive compulsive drug-seeking even when the drug no longer provides genuine reward. The system learns that drug cues predict reward, and these learned associations are very resistant to extinction. In Parkinson's disease, dopamine neuron degeneration impairs both motor control and reinforcement learning. Patients have difficulty learning from feedback and updating behavior based on changing contingencies. In depression, there may be blunted dopamine responses to reward, leading to anhedonia—the inability to experience pleasure—and impaired motivation. These disorders involve disruptions to the core computational function of dopamine, with cascading effects on behavior.
Adam Ramirez How well do artificial reinforcement learning systems that use TD learning match biological performance on comparable tasks?
Dr. Wolfram Schultz In simple tasks, TD learning algorithms perform very well and can match or exceed biological learning speeds. Deep reinforcement learning systems using TD errors have achieved superhuman performance in games and certain control tasks. But these systems often require far more training data than animals, and they struggle with generalization, transfer, and learning from sparse rewards. Animals are much better at extracting structure from environments, using prior knowledge, and adapting quickly to new situations. So while the core algorithm is similar, biological systems have additional mechanisms—model-based reasoning, hierarchical representations, curiosity-driven exploration—that enhance learning efficiency beyond pure model-free TD learning.
Jennifer Brooks Are there other neuromodulators that might encode prediction errors for other dimensions of value, like aversive prediction errors or uncertainty?
Dr. Wolfram Schultz Yes. Serotonin has been proposed to signal aversive prediction errors or punishment. Acetylcholine may signal unexpected uncertainty or changes in the environment that warrant attention. Norepinephrine might encode arousal or the importance of current events. These systems likely work in parallel with dopamine, providing complementary teaching signals for different aspects of learning and decision-making. The brain doesn't rely on a single neuromodulator to solve all learning problems—it uses a diverse toolkit.
Adam Ramirez What are the key open questions? Where does the field need more evidence?
Dr. Wolfram Schultz We need better understanding of the circuit mechanisms that compute prediction errors—how inputs encoding predictions and outcomes are combined to generate the dopamine signal. We need to characterize the heterogeneity of dopamine neurons more thoroughly—are there functionally distinct subtypes, and how do they coordinate? We need to understand how dopamine interacts with other learning systems—model-based planning, working memory, episodic memory. And we need to connect dopamine signaling to detailed synaptic plasticity mechanisms in target structures. Finally, there's the question of how abstract the dopamine signal really is. Does it encode a unified currency of reward, or do different rewards activate distinct circuits?
Jennifer Brooks Dr. Schultz, thank you for clarifying what dopamine actually encodes and what remains uncertain.
Dr. Wolfram Schultz Thank you. These are fundamental questions about how we learn and adapt.
Adam Ramirez That's our program. Until tomorrow, stay critical.
Jennifer Brooks And keep questioning. Good night.
Sponsor Message

Prediction Error Derivatives

When expectations meet reality, value shifts. Prediction Error Derivatives let you trade the difference between anticipated and actual outcomes. Our instruments include temporal difference futures, reward forecast hedging, and dopamine signal arbitrage. Speculate on learning rate convergence, hedge against unexpected omissions, and exploit asymmetries between positive bursts and negative pauses. Real-time pricing based on stimulus-reward contingency tracking and behavioral adaptation metrics. From conditioned stimuli to cached values. Prediction Error Derivatives—because learning is the difference between what you expected and what you got.