In the latest machine learning research, a group at CMU publishes a simple and efficient implementation of recurrent model-free reinforcement learning (RL) for future work to use as a baseline for POMDP algorithms

Decision-making algorithms often focus on simple problems where most of the information is already available, whereas most real-world situations involve noise and incomplete information. Complex algorithms have been developed to handle such problems. In theory, however, a simple strategy also exists that can be used in both basic and complex situations. The canonical formulation of reinforcement learning (RL) as a Markov decision process (MDP) often leaves out various uncertainties, even though real-world decision-making tasks can be complex and noisy. In contrast, partially observable MDPs (POMDPs) can capture uncertainty in the states, rewards and dynamics. These uncertainties appear in fields such as robotics, medicine, natural language processing and finance. Researchers from CMU and the University of Montreal have created a methodology that demonstrates the practical application of this simple strategy.

POMDPs can be used to model a variety of RL problems, including meta-RL, robust RL, generalization in RL, and temporal credit assignment. In theory, all types of POMDPs can be solved simply by augmenting model-free RL with memory-based architectures such as recurrent neural networks (RNNs). Previous research, however, has found that these recurrent model-free RL approaches frequently perform worse than more specialized algorithms created for particular types of POMDPs. The team's research revisits this claim. Solving POMDPs is challenging because the agent must learn inference and control at the same time. Inference aims to compute the posterior over the current state given the history. Control aims to run RL and planning algorithms on the inferred state space. Previous methods often decouple the two tasks using separate models. Deep learning offers a generic and simple baseline: combine an off-the-shelf RL algorithm with a recurrent neural network such as a GRU.
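To make the idea of pairing an RL policy with a recurrent network concrete, here is a minimal sketch in NumPy. It is not the team's implementation: the class names `GRUCell` and `RecurrentPolicy`, the choice to feed the previous action and reward alongside the observation, the omission of bias terms, and the random untrained weights are all assumptions made for illustration. The point is only that the hidden state summarizes the history, so the action distribution depends on the past rather than on the current observation alone.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell without biases (hypothetical helper, untrained weights)."""

    def __init__(self, input_dim, hidden_dim, rng):
        scale = 1.0 / np.sqrt(hidden_dim)
        shape = (hidden_dim, input_dim + hidden_dim)
        # One weight matrix per gate, acting on the concatenation [input, hidden].
        self.W_z = rng.uniform(-scale, scale, shape)  # update gate
        self.W_r = rng.uniform(-scale, scale, shape)  # reset gate
        self.W_h = rng.uniform(-scale, scale, shape)  # candidate state
        self.hidden_dim = hidden_dim

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.W_z @ xh)
        r = sigmoid(self.W_r @ xh)
        h_tilde = np.tanh(self.W_h @ np.concatenate([x, r * h]))
        # Blend the old state with the candidate state.
        return (1.0 - z) * h + z * h_tilde

class RecurrentPolicy:
    """Policy over discrete actions that conditions on a GRU summary of the history."""

    def __init__(self, obs_dim, num_actions, hidden_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.num_actions = num_actions
        # Input = observation + one-hot previous action + previous reward (an assumption).
        self.cell = GRUCell(obs_dim + num_actions + 1, hidden_dim, rng)
        self.W_pi = rng.normal(0.0, 0.1, (num_actions, hidden_dim))

    def initial_state(self):
        return np.zeros(self.cell.hidden_dim)

    def act(self, obs, prev_action, prev_reward, h):
        a_onehot = np.zeros(self.num_actions)
        if prev_action is not None:
            a_onehot[prev_action] = 1.0
        x = np.concatenate([obs, a_onehot, [prev_reward]])
        h = self.cell.step(x, h)            # implicit "inference" step
        logits = self.W_pi @ h              # "control" head on the summary
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return probs, h

policy = RecurrentPolicy(obs_dim=3, num_actions=2)
h = policy.initial_state()
obs = np.array([0.1, -0.2, 0.3])
probs, h = policy.act(obs, prev_action=None, prev_reward=0.0, h=h)
```

In a full agent, the GRU and policy weights would be trained jointly by the chosen RL algorithm's loss; the sketch only shows the forward pass.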

RNNs can process histories in POMDPs and learn, through backpropagation, to implicitly infer the state needed for control. At first glance, recurrent model-free RL seems to have several advantages. It is conceptually simple and easy to implement. RNNs have also been shown to be universal function approximators, allowing them to represent any memory-based policy. Because of this expressiveness and simplicity, there is extensive literature on recurrent model-free RL with various RL algorithms and RNN architectures. However, previous research has shown that it frequently fails in practice, with poor or unstable performance. The poor performance of this fundamental baseline has been the driving force behind earlier research proposing more advanced techniques. Some incorporate the assumptions of a particular POMDP subfield as an inductive bias, while others introduce model-based objectives that explicitly learn inference. Both produce good results in various applications, although model-based methods can have difficulties with deprecation, and specialized methods require more assumptions than recurrent model-free RL.
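The contrast between explicit and implicit inference can be written out in standard POMDP notation. The symbols here are assumptions for illustration and do not come from the article: $T$ is the transition model, $O$ the observation model, $h_t$ the history, $b_t$ the belief, and $z_t$ the RNN hidden state with parameters $\theta$ feeding a policy $\pi_\phi$.

```latex
% History and explicit belief (posterior over the current state)
h_t = (o_1, a_1, o_2, \dots, a_{t-1}, o_t), \qquad b_t(s) = p(s_t = s \mid h_t)

% Explicit inference: Bayes-filter update, which requires a model (T, O)
b_{t+1}(s') \;\propto\; O(o_{t+1} \mid s', a_t) \sum_{s} T(s' \mid s, a_t)\, b_t(s)

% Implicit inference: a learned recurrent summary of the same history
z_t = f_\theta(z_{t-1}, o_t, a_{t-1}), \qquad a_t \sim \pi_\phi(\cdot \mid z_t)
```

The recurrent update needs no transition or observation model; $f_\theta$ and $\pi_\phi$ are trained together from the RL objective alone.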

The team found that recurrent model-free RL is not fatally flawed; it only requires a different implementation. Two important changes are using separate RNNs in the actor and critic networks and tuning the RNN context length. The team also notes that using an off-policy RL algorithm can improve sample efficiency. With these changes, the team's recurrent model-free RL is competitive with prior techniques, especially on the tasks for which those prior techniques were designed. Unlike earlier techniques, which were primarily developed for specific POMDP settings, recurrent model-free RL applies to all types of POMDPs. On meta-RL tasks, the team's technique can frequently outperform existing methods, including prototypical methods and off-policy variBAD. In robust RL, a setting often handled by algorithms that explicitly maximize worst-case returns, the recurrent model-free technique outperforms all other approaches.
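The two implementation changes can be sketched as follows. This is an illustrative toy, not the released code: `RNNEncoder`, `RecurrentActorCritic`, the plain tanh RNN, the observation-only history, the discrete-action Q-value head, and the default `context_len=4` are all assumptions chosen for brevity. The sketch shows an actor and a critic that each own their RNN weights, and a context length that truncates how much history either network sees.

```python
import numpy as np

class RNNEncoder:
    """Plain tanh RNN mapping a sequence of inputs to a final hidden state."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.W_in = rng.normal(0.0, 0.3, (hidden_dim, input_dim))
        self.W_h = rng.normal(0.0, 0.3, (hidden_dim, hidden_dim))
        self.hidden_dim = hidden_dim

    def encode(self, inputs):
        h = np.zeros(self.hidden_dim)
        for x in inputs:
            h = np.tanh(self.W_in @ x + self.W_h @ h)
        return h

class RecurrentActorCritic:
    """Actor and critic with separate recurrent encoders (no shared RNN weights)."""

    def __init__(self, obs_dim, num_actions, hidden_dim=8, context_len=4, seed=0):
        rng = np.random.default_rng(seed)
        self.context_len = context_len
        # Change 1: each network gets its own RNN, so the critic's loss
        # does not push gradients through the actor's encoder and vice versa.
        self.actor_rnn = RNNEncoder(obs_dim, hidden_dim, rng)
        self.critic_rnn = RNNEncoder(obs_dim, hidden_dim, rng)
        self.actor_head = rng.normal(0.0, 0.3, (num_actions, hidden_dim))
        self.critic_head = rng.normal(0.0, 0.3, (num_actions, hidden_dim))

    def _context(self, history):
        # Change 2: only the most recent `context_len` steps are encoded.
        # This length is a hyperparameter to tune per task.
        return history[-self.context_len:]

    def action_probs(self, history):
        h = self.actor_rnn.encode(self._context(history))
        logits = self.actor_head @ h
        e = np.exp(logits - logits.max())
        return e / e.sum()

    def q_values(self, history):
        h = self.critic_rnn.encode(self._context(history))
        return self.critic_head @ h

rng = np.random.default_rng(1)
history = [rng.normal(size=3) for _ in range(10)]  # ten past observations
agent = RecurrentActorCritic(obs_dim=3, num_actions=2, context_len=4)
probs = agent.action_probs(history)
q = agent.q_values(history)
```

With an off-policy algorithm, the same truncation would be applied when sampling sub-sequences from a replay buffer, so training and acting see histories of comparable length.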

According to the researchers, the results can be read as supporting a core rationale of deep learning: superior results come from condensing a method into a single differentiable architecture that is optimized end-to-end for a single loss. This is intriguing because RL systems are often made up of interconnected components trained for different purposes. If end-to-end techniques have sufficiently expressive architectures, they could replace such multi-component systems in the near future. To promote reproducibility and help improve future POMDP algorithms, the team has also made its code publicly available. They presented their results at the ICML conference in 2022.

This article is written as a research summary by Marktechpost Staff based on the research paper 'Recurrent Model-Free RL Can Be a Strong Baseline for Many POMDPs'. All credit for this research goes to the researchers on this project.

Khushboo Gupta is an intern consultant at MarktechPost. She is currently pursuing her B.Tech from Indian Institute of Technology (IIT), Goa. She is passionate about the fields of machine learning, natural language processing and web development. She likes to learn more about the technical field by participating in several challenges.
