Google AI Blog: Self-Supervised Reversibility-Aware Reinforcement Learning


An approach commonly used to train agents for a range of applications from robotics to chip design is reinforcement learning (RL). While RL excels at discovering ways to solve tasks from scratch, it can struggle to train an agent to understand the reversibility of its actions, which can be crucial to ensure that agents behave in a safe manner within their environment. For instance, robots are generally costly and require maintenance, so one wants to avoid taking actions that would lead to broken parts. Estimating whether an action is reversible or not (or better, how easily it can be reversed) requires a working knowledge of the physics of the environment in which the agent operates. However, in the standard RL setting, agents do not possess a model of the environment sufficient to do this.

In “There Is No Turning Back: A Self-Supervised Approach to Reversibility-Aware Reinforcement Learning”, accepted at NeurIPS 2021, we present a novel and practical way of approximating the reversibility of agent actions in the context of RL. This approach, which we call Reversibility-Aware RL, adds a separate reversibility estimation component to the RL procedure that is self-supervised (i.e., it learns from unlabeled data collected by the agents). It can be trained either online (jointly with the RL agent) or offline (from a dataset of interactions). Its role is to guide the RL policy toward reversible behavior. This approach increases the performance of RL agents on several tasks, including the challenging Sokoban puzzle game.

Reversibility-Aware RL

The reversibility component added to the RL procedure is learned from interactions, and crucially, is a model that can be trained separately from the agent itself. The model training is self-supervised and does not require that the data be labeled with the reversibility of the actions. Instead, the model learns which types of actions tend to be reversible from the context provided by the training data alone. We call the theoretical foundation of this empirical reversibility precedence, a measure of the probability that an event A precedes another event B, knowing that A and B both happen. Precedence is a useful proxy for true reversibility because it can be learned from a dataset of interactions, even without rewards.

Imagine, for example, an experiment where a glass is dropped from table height and shatters when it hits the floor. In this case, the glass goes from position A (table height) to position B (floor), and regardless of the number of trials, A always precedes B, so when randomly sampling pairs of events, the probability of finding a pair in which A precedes B is 1. This would indicate an irreversible sequence. Suppose, instead, that a rubber ball was dropped in place of the glass. In this case, the ball would start at A, drop to B, and then (approximately) return to A. So, when sampling pairs of events, the probability of finding a pair in which A precedes B would only be 0.5 (the same as the probability that a random pair showed B preceding A), which would indicate a reversible sequence.

Reversibility estimation relies on knowledge of the dynamics of the world. A proxy for reversibility is precedence, which establishes which of two events comes first on average, given that both are observed.
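The glass-and-ball intuition can be checked numerically. The sketch below is illustrative only (the `precedence` helper is a name introduced here, not code from the paper): it counts, over all sampled pairs of occurrences in a toy trajectory, how often event A comes before event B.

```python
import itertools

def precedence(trajectory, a, b):
    """Empirical probability that event `a` precedes event `b`,
    estimated over all ordered pairs of occurrences of `a` and `b`."""
    pairs = [(x, y) for x, y in itertools.combinations(trajectory, 2)
             if {x, y} == {a, b}]
    if not pairs:
        return None
    return sum(1 for x, y in pairs if x == a and y == b) / len(pairs)

# Glass: falls from A (table) to B (floor) and never returns.
print(precedence(["A", "B"], "A", "B"))       # 1.0 -> irreversible
# Ball: falls from A to B, then bounces back to A.
print(precedence(["A", "B", "A"], "A", "B"))  # 0.5 -> reversible
```

A precedence of 1 marks the transition as irreversible; a value near 0.5 means neither order dominates, suggesting the transition can be undone.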

In practice, we sample pairs of events from a collection of interactions, shuffle them, and train a neural network to reconstruct the actual chronological order of the events. The network's performance is measured and refined by comparing its predictions against the ground truth derived from the timestamps of the actual data. Since events that are temporally distant tend to be either trivial or impossible to order, we sample events within a temporal window of fixed size. We then use the prediction probabilities of this estimator as a proxy for reversibility: if the neural network's confidence that event A happens before event B is higher than a chosen threshold, then we deem the transition from event A to B irreversible.

Precedence estimation consists of predicting the temporal order of randomly shuffled events.
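The training procedure can be sketched in a few lines. This is a toy stand-in, not the paper's implementation: a tiny logistic-regression classifier replaces the neural network, the helper names (`make_pairs`, `train_logreg`) are ours, and the "states" are 1-D positions that drift monotonically (so every pair is orderable).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pairs(trajectory, window=5, n_pairs=1000):
    """Sample event pairs (i, j) with 0 < j - i <= window, randomly
    swap half of them, and label 1 if the pair is in chronological order."""
    X, y = [], []
    T = len(trajectory)
    for _ in range(n_pairs):
        i = rng.integers(0, T - 1)
        j = rng.integers(i + 1, min(i + window, T - 1) + 1)
        a, b = trajectory[i], trajectory[j]
        if rng.random() < 0.5:
            X.append(np.concatenate([a, b])); y.append(1)   # kept in order
        else:
            X.append(np.concatenate([b, a])); y.append(0)   # shuffled
    return np.array(X), np.array(y)

def train_logreg(X, y, lr=0.1, steps=2000):
    """Tiny logistic-regression precedence estimator."""
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

# Toy trajectory: position increases monotonically (irreversible drift).
traj = [np.array([t / 100.0]) for t in range(100)]
X, y = make_pairs(traj)
w, b = train_logreg(X, y)
p = 1 / (1 + np.exp(-(X @ w + b)))
acc = ((p > 0.5) == y).mean()   # order is recoverable: accuracy near 1
```

The predicted probability `p` then plays the role of the reversibility proxy: pairs scored above a chosen threshold are treated as irreversible transitions.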

Integrating Reversibility into RL

We propose two concurrent ways of integrating reversibility into RL:

  1. Reversibility-Aware Exploration (RAE): This approach penalizes irreversible transitions via a modified reward function. When the agent picks an action that is deemed irreversible, it receives a reward corresponding to the environment's reward minus a positive, fixed penalty, which makes such actions less likely but does not exclude them.
  2. Reversibility-Aware Control (RAC): Here, all irreversible actions are filtered out, a process that serves as an intermediate layer between the policy and the environment. When the agent picks an action that is deemed irreversible, the action selection process is repeated until a reversible action is chosen.
The proposed RAE (left) and RAC (right) methods for reversibility-aware RL.
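The two schemes can be sketched as thin layers around the policy and reward. This is a minimal illustration under stated assumptions: `is_irreversible` stands in for the trained precedence estimator thresholded at β, and the function names are ours, not an API from the paper.

```python
import random

def is_irreversible(state, action):
    """Stand-in for the learned estimator: here, action 0 is irreversible."""
    return action == 0

def rae_reward(env_reward, state, action, penalty=1.0):
    """RAE: shaped reward = environment reward minus a fixed penalty for
    actions estimated to be irreversible (discourages, never forbids)."""
    return env_reward - penalty if is_irreversible(state, action) else env_reward

def rac_select(policy, state, actions, max_tries=100):
    """RAC: resample from the policy until a reversible action is drawn,
    filtering irreversible actions out entirely."""
    for _ in range(max_tries):
        a = policy(state, actions)
        if not is_irreversible(state, a):
            return a
    raise RuntimeError("no reversible action found")

random_policy = lambda state, actions: random.choice(actions)
a = rac_select(random_policy, state=None, actions=[0, 1, 2])  # never returns 0
```

Note the asymmetry this makes concrete: under RAE the agent can still pay the penalty and take action 0 when it is worth it, while under RAC action 0 is simply never emitted.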

An important distinction between RAE and RAC is that RAE only encourages reversible actions; it does not prohibit them, which means irreversible actions can still be performed when the benefits outweigh the costs (as in the Sokoban example below). As a result, RAC is better suited for safe RL, where irreversible side-effects induce risks that should be avoided entirely, and RAE is better suited for tasks where it is suspected that irreversible actions should be avoided most of the time.

To illustrate the distinction between RAE and RAC, we evaluate the capabilities of both proposed methods. A few example scenarios follow:

  • Avoiding (but not prohibiting) irreversible side-effects

    A general rule for safe RL is to minimize irreversible interactions when possible, as a principle of caution. To test such capabilities, we introduce a synthetic environment where an agent in an open field is tasked with reaching a goal. If the agent follows the established pathway, the environment remains unchanged, but if it departs from the pathway onto the grass, the path it takes turns brown. While this changes the environment, no penalty is issued for such behavior.

    In this scenario, a typical model-free agent, such as a Proximal Policy Optimization (PPO) agent, tends to follow the shortest path on average and spoils some of the grass, whereas a PPO+RAE agent avoids all irreversible side-effects.

    Top-left: The synthetic environment in which the agent (blue) is tasked with reaching a goal (pink). A pathway is shown in gray leading from the agent to the goal, but it does not follow the most direct route between the two. Top-right: An action sequence with irreversible side-effects of the agent's actions. When the agent departs from the path, it leaves a brown trail through the field. Bottom-left: The visitation heatmap for a PPO agent. Agents tend to follow a more direct path than the one shown in gray. Bottom-right: The visitation heatmap for a PPO+RAE agent. The irreversibility of going off-path encourages the agent to stay on the established gray path.
  • Safe interactions by prohibiting irreversibility

    We also tested against the classic Cartpole task, in which the agent controls a cart in order to balance a pole standing precariously upright on top of it. We set the maximum number of interactions to 50k steps, instead of the usual 200. On this task, irreversible actions tend to cause the pole to fall, so it is better to avoid such actions altogether.

    We show that combining RAC with any RL agent (even a random agent) never fails, provided that we select an appropriate threshold for the probability that an action is irreversible. Thus, RAC can guarantee safe, reversible interactions from the very first step in the environment.

    We show how the Cartpole performance of a random policy equipped with RAC evolves with different threshold values (β). Standard model-free agents (DQN, M-DQN) typically score less than 3000, compared to 50000 (the maximum score) for an agent governed by a random+RAC policy at a threshold value of β=0.4.
  • Avoiding deadlocks in Sokoban

    Sokoban is a puzzle game in which the player controls a warehouse keeper and has to push boxes onto target spaces while avoiding unrecoverable situations (e.g., when a box is in a corner or, in some cases, along a wall).

    An action sequence that completes a Sokoban level. Boxes (yellow squares with a red “x”) must be pushed by the agent onto targets (red outlines with a dot in the middle). Because the agent cannot pull the boxes, any box pushed against a wall can be difficult, if not impossible, to move away from the wall, i.e., it becomes “deadlocked”.

    For a standard RL model, early iterations of the agent typically act in a near-random fashion to explore the environment, and consequently get stuck very often. Such RL agents either fail to solve Sokoban puzzles, or are quite inefficient at it.

    Agents that explore randomly quickly engage themselves in deadlocks that prevent them from completing levels (for instance here, pushing the rightmost box against the wall cannot be reversed).

    We compared the performance in the Sokoban environment of IMPALA, a state-of-the-art model-free RL agent, to that of an IMPALA+RAE agent. We find that the agent with the combined IMPALA+RAE policy is deadlocked less frequently, resulting in superior scores.

    The scores of IMPALA and IMPALA+RAE on a set of 1000 Sokoban levels. A new level is sampled at the beginning of each episode. The best score is level-dependent and close to 10.

    On this task, detecting irreversible actions is difficult because it is a highly imbalanced learning problem: only ~1% of actions are indeed irreversible, and many other actions are difficult to flag as reversible, because they can only be reversed via a number of additional steps by the agent.

    Reversing an action is sometimes non-trivial. In the example shown here, a box has been pushed against the wall, but is still reversible. However, reversing the situation takes at least five separate moves comprising 17 distinct actions by the agent (each numbered move being the result of multiple actions from the agent).

    We estimate that roughly half of all Sokoban levels require at least one irreversible action to be completed (e.g., because at least one target destination is adjacent to a wall). Since IMPALA+RAE solves nearly all levels, it implies that RAE does not prevent the agent from taking irreversible actions when it is necessary to do so.


Conclusion

We present a method that enables RL agents to predict the reversibility of an action by learning to model the temporal order of randomly sampled trajectory events, which leads to better exploration and control. Our proposed method is self-supervised, meaning that it does not require any prior knowledge about the reversibility of actions, making it well suited to a variety of environments. In the future, we are interested in studying further how these ideas could be applied in larger-scale and safety-critical applications.


Acknowledgements

We would like to thank our paper co-authors Nathan Grinsztajn, Philippe Preux, Olivier Pietquin and Matthieu Geist. We would also like to thank Bobak Shahriari, Théophane Weber, Damien Vincent, Alexis Jacq, Robert Dadashi, Léonard Hussenot, Nino Vieillard, Lukasz Stafiniak, Nikola Momchev, Sabela Ramos and all those who provided helpful discussion and feedback on this work.

