University of Worcester Worcester Research and Publications
 

Reward Shaping Based on Optimal-Policy-Free

Sang, J. ORCID: https://orcid.org/0000-0001-7537-6844, Wang, Y. ORCID: https://orcid.org/0000-0003-2219-067X, Ahmed Khan, Zaki ORCID: https://orcid.org/0000-0002-8776-2109 and Zhou, X. ORCID: https://orcid.org/0009-0007-5665-845X (2024) Reward Shaping Based on Optimal-Policy-Free. IEEE Transactions on Big Data. pp. 1-12. ISSN 2372-2096; eISSN 2332-7790

Full text not available from this repository.

Abstract

Existing research on potential-based reward shaping (PBRS) relies on the optimal policy of a Markov decision process (MDP), where the optimal policy is regarded as the ground truth. However, in some practical application scenarios there is an extrapolation error between the computed optimal policy and the real-world optimal policy, making the computed policy unreliable. To address this challenge, we design Reward Shaping Based on Optimal-Policy-Free, which removes the dependence on the optimal policy. We view reinforcement learning as probabilistic inference on a directed graph. Essentially, this inference propagates information from the rewarding states in the MDP and yields a function that serves as the potential function for PBRS. Our approach applies a contrastive learning technique to the directed graph Laplacian; this technique does not change the structure of the directed graph. The directed graph Laplacian is then used to approximate the true state transition matrix of the MDP, and the potential function for PBRS is learned through a message passing mechanism built on this Laplacian. Experiments on Atari, MuJoCo and MiniWorld show that our approach outperforms competitive algorithms.
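The abstract builds on the standard potential-based shaping rule, where a potential function Φ over states modifies the reward without changing the optimal policy. As a minimal sketch (not the paper's method: the paper learns Φ via contrastive learning and message passing on a directed graph Laplacian, while here `phi` is a hypothetical placeholder supplied by the caller), the shaping transform r' = r + γΦ(s') − Φ(s) looks like:

```python
def shaped_reward(r, s, s_next, phi, gamma=0.99, done=False):
    """Potential-based reward shaping (Ng et al., 1999):
    r' = r + gamma * phi(s') - phi(s).

    phi : callable mapping a state to a scalar potential.
          (In the paper, this is learned from a directed graph
          Laplacian; here it is a user-supplied placeholder.)
    At terminal transitions the next-state potential is zeroed so
    the shaping terms telescope to a constant along any episode.
    """
    next_potential = 0.0 if done else gamma * phi(s_next)
    return r + next_potential - phi(s)
```

Because the added terms telescope over a trajectory, any policy ordering under the shaped reward matches the ordering under the original reward, which is why the choice of Φ affects only learning speed, not the optimal policy.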

Item Type: Article
Uncontrolled Keywords: Directed graphs, Laplace equations, Hidden Markov models, Message passing, Trajectory, Perturbation methods, Symbols, Reinforcement learning, Probabilistic logic, Convolutional neural networks, Contrastive learning, directed graph convolutional neural network, Markov decision process, reinforcement learning, reward shaping
Divisions: College of Business, Psychology and Sport > Worcester Business School
Related URLs:
Copyright Info: Copyright © 2024, IEEE
Depositing User: Katherine Small
Date Deposited: 25 Nov 2024 19:34
Last Modified: 25 Nov 2024 19:34
URI: https://eprints.worc.ac.uk/id/eprint/14426
