Our brains make decisions based on what each choice offers above and beyond other possible propositions


UNIGE researchers demonstrate that our brains do not make decisions based on the inherent value of each option but on what each option offers above and beyond the other possible propositions.

Our brains are constantly faced with different choices:

Should I have a chocolate éclair or macaroon?

Should I take the bus or go by car?

What should I wear: a woollen sweater or one made of cashmere?

When the difference in quality between two choices is great, the choice is made very quickly.

But when this difference is negligible, we can get stuck for minutes at a time – or even longer – before we’re capable of making a decision.

Why is it so difficult to make up our mind when faced with two or more choices?

Is it because our brains are not optimised for taking decisions?

In an attempt to answer these questions, neuroscientists from the University of Geneva (UNIGE), Switzerland – in partnership with Harvard Medical School – developed a mathematical model of the optimal choice strategy.

They demonstrated that optimal decisions must be based not on the true value of the possible choices but on the difference in value between them.

The results, published in the journal Nature Neuroscience, show that this decision-making strategy maximises the amount of reward received.

There are two types of decision-making:

First, there is perceptual decision-making, which is based on sensory information: Do I have time to cross the road before that car comes nearer?

Then there is value-based decision-making, when there is no good or bad decision as such but a choice needs to be made between several proposals: Do I want to eat apples or apricots?

When taking value-based decisions, choices are made very quickly if there is a large difference in value between the different proposals.

But when the propositions are similar, decision-making becomes very complex even though, in reality, none of the choices is worse than any other. Why is this?

The value of a choice lies in the difference

Satohiro Tajima, a researcher in the Department of Basic Neurosciences in UNIGE’s Faculty of Medicine, designed a simple mathematical model that demonstrates the following: the optimal strategy when faced with two propositions is to sum up the values associated with the memories you have of each choice, then calculate the difference between these two sums (do I have more positive memories linked to chocolate éclairs or to macaroons?).

The decision is made when this difference reaches a threshold value, fixed in advance, which determines the time taken in making the decision.

This model leads to rapid decision-making when the values of the two possibilities are very far apart.

But when two choices have almost the same value, we need more time, because we need to draw on more memories so that this difference reaches the decision threshold.
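The accumulate-and-compare strategy described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the researchers' model: it assumes that each "memory" is a noisy Gaussian sample around an option's true value, and the names `binary_decision` and `mean_decision_time`, as well as the threshold and noise settings, are hypothetical.

```python
import random


def binary_decision(mean_a, mean_b, threshold=5.0, noise=1.0, seed=0,
                    max_steps=100_000):
    """Accumulate noisy value samples for options A and B until the
    difference between the two running sums crosses +/- threshold.

    Returns the chosen option and the number of samples it took.
    """
    rng = random.Random(seed)
    total_a = 0.0
    total_b = 0.0
    for step in range(1, max_steps + 1):
        # Assumption: one recalled "memory" = one Gaussian sample per option.
        total_a += rng.gauss(mean_a, noise)
        total_b += rng.gauss(mean_b, noise)
        diff = total_a - total_b
        if diff >= threshold:
            return "A", step
        if diff <= -threshold:
            return "B", step
    return None, max_steps  # no decision within the step budget


def mean_decision_time(mean_a, mean_b, trials=200):
    """Average number of samples needed over several simulated decisions."""
    steps = [binary_decision(mean_a, mean_b, seed=s)[1] for s in range(trials)]
    return sum(steps) / len(steps)


if __name__ == "__main__":
    print("values far apart:", mean_decision_time(2.0, 0.0))
    print("values close    :", mean_decision_time(1.1, 1.0))
```

Under these assumptions, the simulated decision takes noticeably more samples when the two values are close than when they are far apart, which is the behaviour the model predicts.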

Is the same process at work when we have to choose between three or more possibilities?

The average of the values for each choice decides the winner

For each choice, we want to maximise the possible gain in the minimum amount of time. So, how do we proceed?

“The first step is exactly the same as when making a binary choice: we amass the memories for each choice so we can estimate their combined value,” explains Alexandre Pouget, a professor in the Department of Basic Neurosciences at UNIGE.

Then, using a mathematical model based on the theory of optimal stochastic control, instead of looking at the cumulative value associated with each choice independently, the decision rests on the difference between the cumulative value of each choice and the average value of the accumulated values over all the choices.

As in the earlier case, the decision is made when one of these differences reaches a pre-determined threshold value.

“The fact that the decision is based on the cumulative value minus the average of the values of all the possibilities explains why the choices interfere with each other, even when some differences are glaring,” continues Professor Pouget.

If the different possible choices have similar values, the average will be almost identical to the value of each choice, resulting in a very lengthy decision-making time.
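The same idea extends to three or more options. The sketch below is a simplified illustration of the rule described above (each cumulative value is compared with the average of all cumulative values, and a fixed threshold ends the decision). It is not the optimal policy derived in the study; the function name `multi_decision`, the Gaussian sampling and all parameter values are assumptions made for the example.

```python
import random


def multi_decision(means, threshold=5.0, noise=1.0, seed=0,
                   max_steps=100_000):
    """Accumulate noisy value samples for several options and decide when
    one option's running sum exceeds the average running sum by `threshold`.

    Returns the index of the chosen option and the number of samples used.
    """
    rng = random.Random(seed)
    totals = [0.0] * len(means)
    for step in range(1, max_steps + 1):
        for i, mean in enumerate(means):
            totals[i] += rng.gauss(mean, noise)
        average = sum(totals) / len(totals)
        # Each option is judged relative to the average of all options.
        relative = [total - average for total in totals]
        best = max(range(len(totals)), key=relative.__getitem__)
        if relative[best] >= threshold:
            return best, step
    return None, max_steps  # no decision within the step budget


if __name__ == "__main__":
    print("one clear winner:", multi_decision([3.0, 1.0, 1.0]))
    print("similar values  :", multi_decision([1.1, 1.0, 1.0]))
```

Because every option is measured against the shared average, adding or changing one option shifts the relative score of all the others, which is one way to picture the interference between choices that Pouget describes.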

“Making a simple choice can take 300 milliseconds but a complicated choice sometimes lasts a lifetime,” notes the Geneva-based researcher.

The UNIGE study shows that the brain does not make decisions according to the value of each opportunity but based on the difference between them.

“This highlights the importance of the feeling of having to maximise the possible gains that can be obtained,” says Professor Pouget.

The neuroscientists will now focus on how the brain revisits memory to call on the memories associated with every possible choice, and how it simulates information when faced with the unknown and when it cannot make a decision based on memories.

Everyday life features uncertain and changing situations associated with distinct reward contingencies.

In such environments, optimal adaptive behaviour requires detecting changes in external situations, which relies on making probabilistic inferences about the latent causes or hidden states generating the external contingencies that the agent experiences.

Previous studies show that humans make such inferences, i.e., they develop state beliefs to guide their behaviour in uncertain and changing environments [1–4].

More specifically, the prefrontal cortex (PFC) that subserves reward-based decision-making is involved in inferring state beliefs about how reward contingencies map onto choice options [5–8].

Optimal decision-making for driving behaviour then requires integrating these state beliefs and reward expectations through probabilistic marginalisation processes [9].

This integration is required to derive reward probabilities associated with choice options and to choose the option maximising the expected utility [10].

Consistently, PFC regions involved in inferring these state beliefs also exhibit activations associated with reward expectations [11–16].

However, human choices often differ systematically from optimal choices [17], raising the open issue of how the PFC combines these state beliefs and reward expectations to drive behaviour.

A common hypothesis is that these quantities are integrated as posited in the expected utility theory, but choice computations derive from distorted representations of reward probabilities, usually named subjective probabilities [17–21]. Yet the origin of subjective probability remains unclear.

As marginalisation processes are complex cross-product processes [9], the notion of subjective probability might then reflect that state beliefs and reward expectations are actually combined in a suboptimal way at variance with the expected utility theory [22].

Thus, an alternative plausible hypothesis is that state beliefs about reward contingencies are processed as an additional value component that contributes to choices independently of reward expectations rather than through marginalisation processes, i.e., state beliefs about reward contingencies act in decision-making as affective values that combine linearly with the appetitive value of reward expectations.
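The contrast between the two hypotheses can be made concrete with a small Python sketch. It is illustrative only: rewards are assumed to be rescaled to the range 0 to 1, the 80%/20% reward frequencies are borrowed from the task described later in the text, and the function names and the mixing `weight` are hypothetical rather than taken from the study.

```python
def expected_utility(belief, reward, q_high=0.8, q_low=0.2):
    """Expected utility theory: marginalise the reward probability over
    the two hidden states, then multiply by the reward.

    `belief` is the probability that this option is the frequently
    rewarded one; `reward` is the proposed reward.
    """
    p_reward = belief * q_high + (1.0 - belief) * q_low
    return p_reward * reward


def additive_utility(belief, reward, weight=0.7):
    """Alternative hypothesis: the belief acts as a separate value
    component that is combined linearly with the reward.

    `weight` is an arbitrary illustrative mixing coefficient.
    """
    return weight * belief + (1.0 - weight) * reward


if __name__ == "__main__":
    # Option 1: probably the frequently rewarded one, but a small offer.
    # Option 2: probably the rarely rewarded one, but a large offer.
    options = {"option 1": (0.9, 0.2), "option 2": (0.1, 1.0)}
    for name, (belief, reward) in options.items():
        print(name,
              round(expected_utility(belief, reward), 3),
              round(additive_utility(belief, reward), 3))
```

With these illustrative numbers the two rules disagree: marginalisation favours option 2, whereas the linear combination favours option 1, so the two hypotheses can in principle be told apart from choice behaviour.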

Here, we address this open issue using computational modelling and functional magnetic resonance imaging (fMRI). We confirm here that participants make decisions as if they marginalise reward expectations over state beliefs and compute choices based on distorted subjective probabilities.

Using a model falsification approach [23], however, we show that participants’ performance varies with these subjective probabilities in a way contradicting this theoretical construct. We then provide evidence that participants’ choices actually derive from the independent contribution of state beliefs regarding the most frequently rewarded option and reward expectations based on an efficient coding mechanism of context-dependent value normalisation [24–26].

We identify the PFC regions involved in this decision process, which linearly combines these state beliefs and reward expectations. At variance with the standard expected utility theory, this results in (1) the mutual dependence of option utilities and (2) the processing of state beliefs as affective values rather than probability measures in decision-making.


Behavioural protocol

Twenty-two participants were asked to make successive choices between two visually presented one-armed bandits (square vs. diamond bandit, Fig. 1a) (Methods).

In every trial, each bandit proposed a potential monetary reward varying pseudo-randomly from 2 to 10 €. One bandit led to rewards more frequently (frequencies: q_M = 80% vs. q_m = 20%).

Following participants’ choices, the chosen-bandit outcome was visually revealed, with zero indicating no reward. Otherwise, participants received approximately the proposed reward (±1 €).

Reward frequencies episodically reversed between the two bandits (unpredictably every 16–28 trials), so that bandits’ reward frequencies remained uncertain to participants. This uncertainty induces the formation of probabilistic state beliefs about the identity of the 80% and 20% rewarded bandit.
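To make the trial structure concrete, here is a minimal simulation sketch of the task as described above. Several details are assumptions made for illustration: offers are drawn as uniform integers from 2 to 10 (as in the neutral condition), the ±1 € jitter is drawn from -1, 0 or +1, and the simulated participant simply chooses at random. The function name `simulate_bandit_task` is hypothetical.

```python
import random


def simulate_bandit_task(n_trials=200, q_high=0.8, q_low=0.2, seed=0):
    """Simulate the two-bandit reversal task with a random-choice agent.

    Returns one dict per trial with the currently best bandit, the two
    offers, the choice and the revealed outcome (0 means no reward).
    """
    rng = random.Random(seed)
    best = rng.choice([0, 1])            # bandit rewarded 80% of the time
    next_reversal = rng.randint(16, 28)  # reversals every 16-28 trials
    trials = []
    for t in range(n_trials):
        if t == next_reversal:
            best = 1 - best
            next_reversal = t + rng.randint(16, 28)
        # Assumption: integer offers, independent of the bandit.
        offers = (rng.randint(2, 10), rng.randint(2, 10))
        choice = rng.choice([0, 1])      # placeholder policy: random choice
        q = q_high if choice == best else q_low
        if rng.random() < q:
            # Assumption: the paid reward is the offer jittered by -1, 0 or +1.
            outcome = offers[choice] + rng.choice([-1, 0, 1])
        else:
            outcome = 0
        trials.append({"best": best, "offers": offers,
                       "choice": choice, "outcome": outcome})
    return trials


if __name__ == "__main__":
    for trial in simulate_bandit_task(n_trials=5):
        print(trial)
```

Replacing the random `choice` line with a decision rule turns this into a test bed for comparing different choice strategies on the same sequence of reversals.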

In the neutral condition, proposed rewards were independent of bandits, so that beliefs could be inferred only from previous choice outcomes (Fig. 1b).

To properly dissociate probabilistic belief inferences from reinforcement learning (RL) processes, the protocol included two additional conditions (administered on separate days): in the congruent condition, proposed rewards were biased towards higher values for the more frequently rewarded bandit (and vice versa), whereas in the incongruent condition, proposed rewards were biased in the exact opposite direction. In both conditions, proposed rewards thus identically conveyed additional information about bandits’ reward frequencies, dissociable from reward values: beliefs could be inferred in every trial from both previous choice outcomes and proposed rewards.

Thus, the protocol properly dissociated belief inferences from RL processes over trials. Note also that, owing to the reversal and symmetrical structure of the protocol, the task required no exploration for maximising rewards: participants got the same information about bandits’ reward frequencies whichever option they chose in a given trial.

Fig. 1
Behavioural protocol. a Trial structure. A square and diamond one-armed bandit with the offered rewards (euros) were presented on the screen (maximal duration: 2.5 s) until participants chose one bandit by pressing one response button. The chosen bandit remained on display. A feedback centred on the screen was then presented to reveal the bandit outcome (duration 1 s). One bandit led to proposed rewards (±1 €) more frequently (blue arrows 80% vs. 20%) but this advantage reversed episodically. The next trial started with the presentation of both bandits again. Response-feedback onset asynchronies and intertrial intervals were uniformly and independently jittered (ranges: 0.1–4.1 s and 0.5–4.5 s, resp.). b Proposed rewards were biased in opposite directions in the congruent and incongruent condition (exponential biases: slope = ±0.13). c Proportions of choosing true best bandits (maximising reward frequencies × proposed rewards) following reversals. Mean proportions over participants (blue, ± s.e.m., N = 22) and for optimal model OPT (black line) are shown in the congruent, neutral and incongruent condition. Dashed line corresponds to the 80% reward frequency. **p < 0.01 (T-tests).

In every trial, the optimal performance model (named model OPT) forms probabilistic beliefs from previous trials about how reward frequencies map onto bandits, updates these beliefs according to proposed rewards and, finally, chooses the bandit maximising the (objective) expected utility by marginalising reward expectations over beliefs [27] (Methods).
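The kind of computation such a model performs can be sketched as follows. This is a simplified illustration, not the model OPT specified in the paper's Methods: it covers only the update from choice outcomes (as in the neutral condition), it approximates reversals with a constant per-trial `hazard` rate, which is an assumption, and the function names are hypothetical.

```python
def update_belief(belief, choice, rewarded, q_high=0.8, q_low=0.2,
                  hazard=1 / 22):
    """Bayesian update of the belief that bandit 0 is the frequently
    rewarded one, after choosing `choice` (0 or 1) and observing whether
    it was rewarded.

    `hazard` is an assumed constant probability of a reversal per trial.
    """
    def likelihood(q):
        # Probability of the observed outcome if the chosen bandit pays
        # with frequency q.
        return q if rewarded else 1.0 - q

    like_if_0_best = likelihood(q_high if choice == 0 else q_low)
    like_if_1_best = likelihood(q_low if choice == 0 else q_high)
    numerator = belief * like_if_0_best
    posterior = numerator / (numerator + (1.0 - belief) * like_if_1_best)
    # Allow for a possible reversal before the next trial.
    return posterior * (1.0 - hazard) + (1.0 - posterior) * hazard


def choose(belief, offers, q_high=0.8, q_low=0.2):
    """Pick the bandit with the higher expected utility, marginalising
    reward probabilities over the belief."""
    p0 = belief * q_high + (1.0 - belief) * q_low
    p1 = (1.0 - belief) * q_high + belief * q_low
    return 0 if p0 * offers[0] >= p1 * offers[1] else 1


if __name__ == "__main__":
    belief = 0.5
    for rewarded in (True, True, False, True):
        belief = update_belief(belief, choice=0, rewarded=rewarded)
        print(round(belief, 3), "-> choose", choose(belief, offers=(5, 5)))
```

In this sketch a rewarded choice raises the belief in the chosen bandit and an unrewarded one lowers it, while the hazard term keeps the belief from ever becoming fully certain, so the agent can recover after a reversal.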

After reversals, model OPT gradually acquires almost perfect beliefs and, regardless of condition, starts selecting the true best bandit (i.e., the one maximising reward frequencies × proposed rewards) almost systematically (Fig. 1c).

This optimal performance is reached similarly in the congruent and incongruent conditions, but is slower in the neutral condition.

As expected, participants performed suboptimally: after reversals, their performance gradually reached a plateau, selecting the true best bandits with a maximal frequency close to 80% (corresponding to probability matching). In contrast to optimal performance, this plateau further decreased monotonically from the congruent to the neutral and incongruent conditions (mean over trials from trial 10 onwards, paired T-tests: both Ts(21) > 2.97, ps < 0.01) (Fig. 1c).

More information: Satohiro Tajima et al. Optimal policy for multi-alternative decisions, Nature Neuroscience (2019). DOI: 10.1038/s41593-019-0453-9

Journal information: Nature Neuroscience
Provided by University of Geneva

