Permutation-Invariant Neural Networks for Reinforcement Learning


“The brain is able to use information coming from the skin as if it were coming from the eyes. We don’t see with the eyes or hear with the ears, these are just the receptors, seeing and hearing in fact goes on in the brain.”
Paul Bach-y-Rita, quoted in Livewired

People have the remarkable ability to use one sensory modality (e.g., touch) to supply environmental information normally gathered by another sense (e.g., vision). This adaptive ability, called sensory substitution, is a phenomenon well known to neuroscience. Difficult adaptations (such as adjusting to seeing things upside-down, learning to ride a “backwards” bicycle, or learning to “see” by interpreting visual information emitted from a grid of electrodes placed on one’s tongue) can take weeks, months or even years to master, but people are eventually able to adjust to sensory substitutions.

In contrast, most neural networks are not able to adapt to sensory substitutions at all. For instance, most reinforcement learning (RL) agents require their inputs to be in a pre-specified format, or else they will fail. They expect fixed-size inputs and assume that each element of the input carries a precise meaning, such as the pixel intensity at a specified location, or state information, like position or velocity. In popular RL benchmark tasks (e.g., Ant or Cart-pole), an agent trained using current RL algorithms will fail if its sensory inputs are changed or if the agent is fed additional noisy inputs that are unrelated to the task at hand.

In “The Sensory Neuron as a Transformer: Permutation-Invariant Neural Networks for Reinforcement Learning”, a spotlight paper at NeurIPS 2021, we explore permutation invariant neural network agents, which require each of their sensory neurons (receptors that receive sensory inputs from the environment) to figure out the meaning and context of its input signal, rather than explicitly assuming a fixed meaning. Our experiments show that such agents are robust to observations that contain additional redundant or noisy information, and to observations that are corrupt and incomplete.

Permutation invariant reinforcement learning agents adapting to sensory substitutions. Left: The ordering of the ant’s 28 observations is randomly shuffled every 200 time-steps. Unlike the standard policy, our policy is not affected by the suddenly permuted inputs. Right: Cart-pole agent given many redundant noisy inputs (Interactive web-demo).

In addition to adapting to sensory substitutions in state-observation environments (like the ant and cart-pole examples), we show that these agents can also adapt to sensory substitutions in complex visual-observation environments (such as a CarRacing game that uses only pixel observations) and can still perform when the stream of input images is constantly being reshuffled:

We partition the visual input from CarRacing into a 2D grid of small patches and shuffle their ordering. Without any additional training, our agent still performs even when the original training background (left) is replaced with new images (right).


Our approach takes observations from the environment at each time-step and feeds each element of the observation into distinct but identical neural networks (called “sensory neurons”), each with no fixed relationship with one another. Each sensory neuron integrates, over time, information from only its particular sensory input channel. Because each sensory neuron receives only a small part of the full picture, the neurons need to self-organize through communication in order for a globally coherent behavior to emerge.

Illustration of observation segmentation. We segment each input into elements, which are then fed to independent sensory neurons. For non-vision tasks where the inputs are usually 1D vectors, each element is a scalar. For vision tasks, we crop each input image into non-overlapping patches.
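The segmentation step described in the caption above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `segment_vector` and `segment_image` are hypothetical, and the image height and width are assumed to be exact multiples of the patch size.

```python
import numpy as np


def segment_vector(obs):
    """Split a 1D observation into one scalar element per sensory neuron."""
    obs = np.asarray(obs, dtype=np.float32)
    return obs.reshape(-1, 1)  # shape: (num_elements, 1)


def segment_image(image, patch_size):
    """Crop an (H, W, C) image into non-overlapping square patches.

    Returns an array of shape (num_patches, patch_size, patch_size, C),
    with patches ordered row by row. H and W are assumed to be exact
    multiples of patch_size (an assumption of this sketch).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    rows, cols = h // patch_size, w // patch_size
    # Split both spatial axes into (block index, offset inside block).
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    # Bring the two block indices to the front: (rows, cols, p, p, c).
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size, patch_size, c)
```

Each row of the `segment_vector` output, or each patch from `segment_image`, would then be handed to its own sensory neuron.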

We encourage neurons to communicate with each other by training them to broadcast messages. While receiving information locally, each individual sensory neuron also continually broadcasts an output message at each time-step. These messages are consolidated and combined into an output vector, called the global latent code, using an attention mechanism similar to that applied in the Transformer architecture. A policy network then uses the global latent code to produce the action that the agent will use to interact with the environment. This action is also fed back into each sensory neuron in the next time-step, closing the communication loop.

Overview of the permutation-invariant RL method. We first feed each individual observation (o_t) into a particular sensory neuron (along with the agent’s previous action, a_{t-1}). Each neuron then produces and broadcasts a message independently, and an attention mechanism summarizes them into a global latent code (m_t) that is given to the agent’s downstream policy network (π) to produce the agent’s action a_t.
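To make the data flow in the overview concrete, here is a minimal NumPy sketch of one forward pass. It is not the released implementation. The class name, layer sizes, the plain tanh recurrent cell inside each sensory neuron, the softmax attention, the fixed random queries and the linear policy head are all assumptions chosen for brevity, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class PermutationInvariantAgent:
    """Hypothetical sketch: shared sensory neurons + attention + policy."""

    def __init__(self, act_dim, hidden_dim=8, msg_dim=16, num_queries=4, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = 1 + act_dim  # one observation element plus previous action
        # A single set of weights, shared by every sensory neuron.
        self.w_in = rng.normal(0, 0.5, (in_dim, hidden_dim))
        self.w_rec = rng.normal(0, 0.5, (hidden_dim, hidden_dim))
        self.w_key = rng.normal(0, 0.5, (hidden_dim, msg_dim))
        self.w_val = rng.normal(0, 0.5, (1, msg_dim))
        # Fixed queries: they do not depend on the ordering of the inputs.
        self.queries = rng.normal(0, 0.5, (num_queries, msg_dim))
        # Linear policy head acting on the flattened global latent code.
        self.w_pi = rng.normal(0, 0.5, (num_queries * msg_dim, act_dim))
        self.hidden_dim = hidden_dim
        self.act_dim = act_dim
        self.reset()

    def reset(self):
        self.hidden = None
        self.prev_action = np.zeros(self.act_dim)

    def act(self, obs):
        obs = np.asarray(obs, dtype=np.float64).reshape(-1, 1)  # (N, 1)
        n = obs.shape[0]
        if self.hidden is None or self.hidden.shape[0] != n:
            self.hidden = np.zeros((n, self.hidden_dim))
        # Every neuron sees its own element o_t[i] and the action a_{t-1}.
        prev = np.tile(self.prev_action, (n, 1))
        x = np.concatenate([obs, prev], axis=1)
        self.hidden = np.tanh(x @ self.w_in + self.hidden @ self.w_rec)
        keys = self.hidden @ self.w_key  # broadcast messages, (N, msg_dim)
        values = obs @ self.w_val        # (N, msg_dim)
        scores = self.queries @ keys.T / np.sqrt(keys.shape[1])  # (Q, N)
        latent = softmax(scores, axis=1) @ values  # global latent code m_t
        action = np.tanh(latent.reshape(-1) @ self.w_pi)  # policy output a_t
        self.prev_action = action  # fed back at the next time-step
        return action


agent = PermutationInvariantAgent(act_dim=1)
action = agent.act([0.1, -0.3, 0.0, 0.5, 1.2])
```

Because the number of rows in `keys` and `values` is summed out by the attention step, the same agent object accepts observation vectors of different lengths.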

Why is this system permutation invariant? Each sensory neuron is an identical neural network that is not confined to processing information from one particular sensory input. In fact, in our setup, the inputs to each sensory neuron are not defined. Instead, each neuron must figure out the meaning of its input signal by paying attention to the inputs received by the other sensory neurons, rather than explicitly assuming a fixed meaning. This encourages the agent to process the entire input as an unordered set, making the system permutation invariant to its input. Furthermore, in principle, the agent can use as many sensory neurons as required, thus enabling it to process observations of arbitrary length. Both of these properties help the agent adjust to sensory substitutions.
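The two properties in the paragraph above can be checked numerically on the attention step alone. The sketch below assumes softmax attention with a fixed set of queries (the function name and all sizes are illustrative). Reordering the rows of the keys and values together only reorders the terms of a weighted sum, so the pooled output is unchanged, and its shape depends on the number of queries, not on the number of inputs.

```python
import numpy as np


def attention_pool(queries, keys, values):
    """Softmax attention with fixed queries: (Q, d), (N, d), (N, d_v) -> (Q, d_v)."""
    scores = queries @ keys.T / np.sqrt(keys.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ values


rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))   # fixed, independent of input order
keys = rng.normal(size=(28, 8))     # one message per sensory neuron
values = rng.normal(size=(28, 8))

perm = rng.permutation(28)
original = attention_pool(queries, keys, values)
shuffled = attention_pool(queries, keys[perm], values[perm])
fewer = attention_pool(queries, keys[:20], values[:20])

print(np.allclose(original, shuffled))  # True: input order does not matter
print(original.shape == fewer.shape)    # True: output size is set by the queries
```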


We demonstrate the robustness and flexibility of this approach in simpler, state-observation environments, where the observations the agent receives as inputs are low-dimensional vectors holding information about the agent’s states, such as the position or velocity of its components. The agent in the popular Ant locomotion task has a total of 28 inputs with information that includes positions and velocities. We shuffle the order of the input vector several times during a trial and show that the agent is rapidly able to adapt and is still able to walk forward.
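A shuffling test of this kind can be set up with a small helper that re-draws a random permutation at a fixed interval. The class below is a hypothetical sketch and is not taken from the released code. The default period of 200 steps follows the figure caption earlier in this post, and the seed is arbitrary.

```python
import numpy as np


class PeriodicShuffler:
    """Reorders an observation vector, drawing a new permutation every `period` steps."""

    def __init__(self, obs_dim, period=200, seed=0):
        self.rng = np.random.default_rng(seed)
        self.obs_dim = obs_dim
        self.period = period
        self.step = 0
        self.perm = np.arange(obs_dim)

    def __call__(self, obs):
        # Draw a fresh ordering at step 0, 200, 400, ... (for period=200).
        if self.step % self.period == 0:
            self.perm = self.rng.permutation(self.obs_dim)
        self.step += 1
        return np.asarray(obs)[self.perm]


shuffler = PeriodicShuffler(obs_dim=28)
shuffled_obs = shuffler(np.arange(28.0))  # same values, new order
```

In an evaluation loop, every observation returned by the environment would pass through `shuffler` before reaching the agent.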

In cart-pole, the agent’s goal is to swing up a pole mounted at the center of the cart and balance it upright. Normally the agent sees only five inputs, but we modify the cart-pole environment to provide 15 shuffled input signals, 10 of which are pure noise, while the remaining five are the actual observations from the environment. The agent is still able to perform the task, demonstrating the system’s capacity to work with a large number of inputs and attend only to the channels it deems useful. Such flexibility may find useful applications in processing a large, unspecified number of signals, most of which are noise, from ill-defined systems.
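The modified observation can be sketched as follows. The class name is hypothetical, the noise is assumed to be standard Gaussian (the text above only says “pure noise”), and a single fixed permutation is assumed for the whole run.

```python
import numpy as np


class NoisyShuffledObservation:
    """Pads an observation with noise channels and applies a fixed shuffle."""

    def __init__(self, obs_dim=5, noise_dim=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.obs_dim = obs_dim
        self.noise_dim = noise_dim
        # One fixed ordering of the 15 channels (5 real + 10 noise).
        self.perm = self.rng.permutation(obs_dim + noise_dim)

    def __call__(self, obs):
        obs = np.asarray(obs, dtype=np.float64)
        if obs.shape != (self.obs_dim,):
            raise ValueError("unexpected observation shape")
        # Fresh noise at every step; assumed Gaussian for this sketch.
        noise = self.rng.standard_normal(self.noise_dim)
        return np.concatenate([obs, noise])[self.perm]


wrap = NoisyShuffledObservation()
noisy_obs = wrap([0.0, 0.1, -0.2, 0.3, 1.0])  # 15 values, 5 of them real
```

The agent is never told which of the 15 positions carry the real signals, so it has to work that out from the inputs themselves.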

We also apply this approach to high-dimensional vision-based environments where the observation is a stream of pixel images. Here, we investigate screen-shuffled versions of vision-based RL environments, where each observation frame is divided into a grid of patches, and like a puzzle, the agent must process the patches in a shuffled order to determine a course of action to take. To demonstrate our approach on vision-based tasks, we created a shuffled version of Atari Pong.

Shuffled Pong results. Left: A Pong agent trained to play using only 30% of the patches matches the performance of the Atari opponent. Right: Without extra training, when we give the agent more puzzle pieces, its performance increases.

Here the agent’s input is a variable-length list of patches, so unlike typical RL agents, the agent only gets to “see” a subset of patches from the screen. In the puzzle Pong experiment, we pass to the agent a random sample of patches across the screen, which are then fixed through the remainder of the game. We find that we can discard 70% of the patches (at these fixed-random locations) and still train the agent to perform well against the built-in Atari opponent. Interestingly, if we then reveal additional information to the agent (e.g., allowing it access to more image patches), its performance increases, even without additional training. When the agent receives all the patches, in shuffled order, it wins 100% of the time, achieving the same result as agents that are trained while seeing the entire screen.
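The patch subsampling can be sketched as below. The helper name, the 10×10 patch grid and the patch dimensions are illustrative assumptions. The only details taken from the text are that a random subset of patch locations (here 30%) is chosen once and then held fixed for the rest of the game.

```python
import numpy as np


def sample_patch_indices(num_patches, keep_fraction, rng):
    """Pick a random subset of patch locations, returned in random order."""
    num_keep = max(1, int(round(num_patches * keep_fraction)))
    return rng.choice(num_patches, size=num_keep, replace=False)


rng = np.random.default_rng(0)
num_patches = 100  # e.g. a 10x10 grid of patches (illustrative)

# Chosen once at the start of a game, then reused for every frame.
visible = sample_patch_indices(num_patches, keep_fraction=0.3, rng=rng)

# Stand-in for one frame already cut into patches: (num_patches, 6, 6, 3).
frame_patches = rng.random((num_patches, 6, 6, 3))
agent_input = frame_patches[visible]  # the 30 patches the agent gets to see
```

Raising `keep_fraction` at test time corresponds to handing the agent more puzzle pieces. The agent’s input simply becomes a longer list of patches.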

We find that imposing additional difficulty during training by using unordered observations has further benefits, such as improving generalization to unseen variations of the task, like when the background of the CarRacing training environment is replaced with a novel image.

Shuffled CarRacing results. The agent has learned to focus its attention (indicated by the highlighted patches) on the road boundaries. Left: Training environment. Right: Test environment with new background.


The permutation invariant neural network agents presented here can handle ill-defined, varying observation spaces. Our agents are robust to observations that contain redundant or noisy information, or observations that are corrupt and incomplete. We believe that permutation invariant systems open up numerous possibilities in reinforcement learning.

If you’re interested in learning more about this work, we invite you to read our interactive article (pdf version) or watch our video. We have also released code to reproduce our experiments.

