Indeed, model-based learners do not rely on model-based RPEs: the learning problem they face—tracking state transition probabilities and immediate rewards rather than cumulative future rewards—demands different training signals (Gläscher et al., 2010). This apparent mismatch encourages consideration of a hybrid of a different sort. We have so far examined theories in which model-based and model-free predictions compete directly to select actions (Daw et al., 2005). However, model-based and model-free RPEs could also usefully be integrated for training.
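
To make the contrast in training signals concrete, here is a minimal tabular sketch in Python. It is our illustration, not code from the cited work; the five-state setting, learning rate, and all variable names are assumptions:

```python
import numpy as np

# Illustrative tabular setting (an assumption, not from the paper).
n_states, n_actions, alpha, gamma = 5, 2, 0.1, 0.95

# Model-free learner: caches long-run action values and is trained by
# reward prediction errors (RPEs) on cumulative future reward.
Q = np.zeros((n_states, n_actions))

def model_free_update(s, a, r, s_next):
    rpe = r + gamma * Q[s_next].max() - Q[s, a]  # RPE: error in cumulative value
    Q[s, a] += alpha * rpe
    return rpe

# Model-based learner: tracks one-step transitions and immediate rewards;
# its errors are state prediction errors (SPEs) and immediate-reward errors,
# not RPEs (cf. Gläscher et al., 2010).
T = np.ones((n_states, n_actions, n_states)) / n_states  # transition model
R = np.zeros((n_states, n_actions))                      # immediate-reward model

def model_based_update(s, a, r, s_next):
    spe = 1.0 - T[s, a, s_next]       # SPE: surprise about the observed state
    T[s, a] *= (1 - alpha)            # renormalizing update toward the
    T[s, a, s_next] += alpha          # observed transition
    R[s, a] += alpha * (r - R[s, a])  # error in immediate, not cumulative, reward
    return spe
```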

For instance, consider the standard actor-critic account (Barto et al., 1983 and Barto, 1995). This uses RPEs derived from model-free predictions (the critic) to reinforce action selection policies (the actor). Errors in model-based predictions, if available, could serve the same purpose. A model-free actor trained, in part, by such a model-based critic would, in effect, cache (Daw et al., 2005) or memorize the recommendations of a model-based planner, and could subsequently execute them without additional planning.
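
One way such a model-based critic might look is sketched below. This is a hypothetical construction under stated assumptions (a tiny tabular world, a fixed-depth lookahead planner as the model-based evaluator), since the text itself leaves the algorithmic details open:

```python
import numpy as np

# Assumed tabular setting; all names and parameter values are illustrative.
n_states, n_actions, alpha, gamma = 5, 2, 0.1, 0.95
T = np.ones((n_states, n_actions, n_states)) / n_states  # learned transitions
R = np.zeros((n_states, n_actions))                      # learned immediate rewards
prefs = np.zeros((n_states, n_actions))                  # actor: action preferences
V = np.zeros(n_states)                                   # model-free critic values

def plan_value(s, depth=3):
    """Model-based evaluation of state s by finite-depth lookahead in (T, R)."""
    if depth == 0:
        return 0.0
    future = np.array([plan_value(s2, depth - 1) for s2 in range(n_states)])
    return max(R[s, a] + gamma * T[s, a] @ future for a in range(n_actions))

def critic_rpe(s, a, r, s_next, model_based=True):
    """The critic's RPE, computed either by planning or from cached values."""
    if model_based:
        return r + gamma * plan_value(s_next) - plan_value(s)
    rpe = r + gamma * V[s_next] - V[s]
    V[s] += alpha * rpe
    return rpe

def train_actor(s, a, r, s_next, model_based=True):
    """The actor caches whatever the (possibly model-based) critic endorses."""
    prefs[s, a] += alpha * critic_rpe(s, a, r, s_next, model_based)
```

On this arrangement the planner does the evaluation, but what gets stored in prefs is a habit-like policy that can later be executed without consulting the model.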

The computational literature on RL includes some related algorithms, such as prioritized sweeping (Moore and Atkeson, 1993), which caches the results of model-based evaluation (albeit without a model-free component), and Dyna (Johnson and Redish, 2005 and Sutton, 1990), which trains a model-free algorithm (though offline) using simulated experiences generated from a world model.
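
The Dyna idea in particular lends itself to a compact sketch. The following is an illustrative reconstruction (tabular Q-learning plus a learned world model; parameter values and names are assumptions), not code from the cited papers:

```python
import numpy as np

# Assumed tabular setting, as in the earlier sketches.
n_states, n_actions, alpha, gamma = 5, 2, 0.1, 0.95
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))                       # model-free values
T = np.ones((n_states, n_actions, n_states)) / n_states   # learned transitions
R = np.zeros((n_states, n_actions))                       # learned rewards
visited = []                                              # (s, a) pairs seen so far

def q_update(s, a, r, s_next):
    """Ordinary model-free TD update, used for real and simulated steps alike."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def real_step(s, a, r, s_next):
    """Learn from real experience and fold it into the world model."""
    q_update(s, a, r, s_next)
    T[s, a] *= (1 - alpha)
    T[s, a, s_next] += alpha
    R[s, a] += alpha * (r - R[s, a])
    visited.append((s, a))

def dyna_planning(n_sim=10):
    """Offline replay: train the model-free learner on model-simulated steps."""
    if not visited:
        return
    for _ in range(n_sim):
        s, a = visited[rng.integers(len(visited))]
        s_next = rng.choice(n_states, p=T[s, a])  # next state sampled from model
        q_update(s, a, R[s, a], s_next)
```

Prioritized sweeping differs mainly in which (s, a) pairs are replayed: rather than sampling uniformly from visited pairs, it preferentially replays those whose successor values have changed most.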

In neuroscience, various theories have been proposed in which a world model affects the input to the model-free system (Bertin et al., 2007, Daw et al., 2006a, Doya, 1999 and Doya et al., 2002). The architecture suggested here more closely resembles the “biased” learning hypothesized by Doll et al. (2009), according to which top-down information (there provided by experimenter instructions rather than a learned world model) modifies the target of model-free RL. Outside the domain of learning, striatal BOLD responses are indeed affected by values communicated by instruction rather than experience (Fitzgerald et al., 2010 and Tom et al., 2007) and also by emotional self-regulation (Delgado et al., 2008).

Further theoretical work is needed to characterize the different algorithms suggested by this general architecture. In general, however, by preserving the overall structure of parallel model-based and model-free systems—albeit systems that would exchange information at an earlier level—the proposal of a model-based critic would appear to remain consistent with the lesion data suggesting that the systems can function in isolation (Killcross and Coutureau, 2003, Yin et al., 2004 and Yin et al., 2005), and with behavioral data demonstrating that distinct decision systems may have different properties and can be differentially engaged in different circumstances (Doeller and Burgess, 2008, Frank et al., 2007 and Fu and Anderson, 2008).
