Worcester Research and Publications

Reward Shaping Based on Optimal-Policy-Free

Sang, J. ORCID: https://orcid.org/0000-0001-7537-6844, Wang, Y. ORCID: https://orcid.org/0000-0003-2219-067X, Ahmed Khan, Zaki ORCID: https://orcid.org/0000-0002-8776-2109 and Zhou, X. ORCID: https://orcid.org/0009-0007-5665-845X (2024) Reward Shaping Based on Optimal-Policy-Free. IEEE Transactions on Big Data. pp. 1-12. ISSN 2372-2096; eISSN 2332-7790

Full text not available from this repository.

Official URL: https://ieeexplore.ieee.org/document/10740180

Abstract

Existing research on potential-based reward shaping (PBRS) relies on the optimal policy of a Markov decision process (MDP), treating that policy as the ground truth. In some practical scenarios, however, extrapolation error separates the computed optimal policy from the real-world optimal policy, which makes the computed policy unreliable. To address this challenge, we design Reward Shaping Based on Optimal-Policy-Free, which removes the dependence on the optimal policy. We view reinforcement learning as probabilistic inference on a directed graph. Essentially, this inference propagates information from the rewarding states in the MDP and yields a function that serves as the potential function for PBRS. Our approach applies a contrastive learning technique to the directed graph Laplacian; this technique does not change the structure of the directed graph. The directed graph Laplacian is then used to approximate the true state transition matrix of the MDP, and the potential function in PBRS is learned through a message-passing mechanism built on this Laplacian. Experiments on Atari, MuJoCo and MiniWorld show that our approach outperforms competing algorithms.
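
The shaping rule that the abstract builds on can be illustrated with a small sketch. This is not the authors' method: the learned directed graph Laplacian, the contrastive learning step and the graph convolutional message passing are all replaced here by a plain iterative propagation over a row-normalized adjacency matrix. The function names, the `decay` parameter and the four-state chain are hypothetical and chosen only for illustration. The one standard piece is the PBRS form itself, r' = r + gamma * phi(s') - phi(s).

```python
def propagate_potential(adjacency, rewards, decay=0.5, iterations=50):
    """Spread reward information backwards along directed edges.

    Simplified stand-in for message passing: phi <- r + decay * P @ phi,
    where P is the row-normalized adjacency matrix (an assumption, not
    the learned Laplacian described in the abstract).
    """
    n = len(adjacency)
    transition = []
    for row in adjacency:
        total = sum(row)
        # States without outgoing edges get an all-zero row.
        transition.append([w / total if total else 0.0 for w in row])
    phi = [0.0] * n
    for _ in range(iterations):
        phi = [
            rewards[s] + decay * sum(transition[s][t] * phi[t] for t in range(n))
            for s in range(n)
        ]
    return phi


def shaped_reward(reward, phi, s, s_next, gamma=0.9):
    """Potential-based reward shaping: r' = r + gamma * phi(s') - phi(s)."""
    return reward + gamma * phi[s_next] - phi[s]


# Hypothetical directed chain 0 -> 1 -> 2 -> 3; only state 3 is rewarding.
adjacency = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
rewards = [0.0, 0.0, 0.0, 1.0]

phi = propagate_potential(adjacency, rewards)
print(phi)  # [0.125, 0.25, 0.5, 1.0]

# A step toward the rewarding state earns a positive shaped reward,
# a step away from it earns a negative one.
print(shaped_reward(0.0, phi, s=0, s_next=1))
print(shaped_reward(0.0, phi, s=1, s_next=0))
```

In this toy setting the potential grows as a state gets closer to the rewarding state, so the shaping term gives a dense signal even where the environment reward is zero. No optimal policy is consulted at any point, which is the property the abstract emphasizes.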

Item Type:	Article
Uncontrolled Discrete Keywords:	Directed graphs, Laplace equations, Hidden Markov models, Message passing, Trajectory, Perturbation methods, Symbols, Reinforcement learning, Probabilistic logic, Convolutional neural networks, Contrastive learning, Directed graph convolutional neural network, Markov decision process, Reward shaping
Divisions:	College of Business, Psychology and Sport > Worcester Business School
Related URLs:	https://ieeexplore.ieee.org/xpl/RecentIs... Publisher
Copyright Info:	Copyright © 2024, IEEE
Depositing User:	Katherine Small
Date Deposited:	25 Nov 2024 19:34
Last Modified:	25 Nov 2024 19:34
URI:	https://eprints.worc.ac.uk/id/eprint/14426
