Sang, J. ORCID: https://orcid.org/0000-0001-7537-6844, Wang, Y. ORCID: https://orcid.org/0000-0003-2219-067X, Ahmed Khan, Zaki ORCID: https://orcid.org/0000-0002-8776-2109 and Zhou, X. ORCID: https://orcid.org/0009-0007-5665-845X (2024) Reward Shaping Based on Optimal-Policy-Free. IEEE Transactions on Big Data. pp. 1-12. ISSN 2372-2096; eISSN 2332-7790
Full text not available from this repository.
Abstract
Existing research on potential-based reward shaping (PBRS) relies on the optimal policy of a Markov decision process (MDP), which is regarded as the ground truth. However, in some practical application scenarios there is an extrapolation error between the computed optimal policy and the real-world optimal policy, and in that case the computed optimal policy is unreliable. To address this challenge, we design Reward Shaping based on Optimal-Policy-Free, which removes the dependence on the optimal policy. We view reinforcement learning as probabilistic inference on a directed graph. Essentially, this inference propagates information from the rewarding states in the MDP and yields a function that is used as the potential function for PBRS. Our approach applies a contrastive learning technique to the directed graph Laplacian; this technique does not change the structure of the directed graph. The directed graph Laplacian is then used to approximate the true state transition matrix of the MDP. The potential function in PBRS can be learned through a message passing mechanism built on this directed graph Laplacian. Experiments on Atari, MuJoCo and MiniWorld show that our approach outperforms competitive algorithms.
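The two ingredients named in the abstract, a potential function obtained by propagating information from rewarding states over a directed graph and the standard PBRS shaping term, can be illustrated with a minimal sketch. This is not the authors' method: the contrastive learning step and the directed graph Laplacian are not reproduced, and a plain discounted propagation over a hand-written transition table stands in for the learned message passing. The function names, the three-state chain, and the discount factor are all hypothetical.

```python
def propagate_potential(transitions, rewards, gamma=0.9, sweeps=50):
    """Propagate reward information backwards over a directed graph.

    transitions[s] maps each successor state to its transition probability.
    This simple discounted propagation is only a stand-in (assumption) for
    the learned message passing described in the abstract.
    """
    potential = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        potential = {
            s: rewards.get(s, 0.0)
            + gamma * sum(p * potential[t] for t, p in succ.items())
            for s, succ in transitions.items()
        }
    return potential


def shaped_reward(reward, potential, s, s_next, gamma=0.9):
    """Standard PBRS: r + gamma * Phi(s') - Phi(s)."""
    return reward + gamma * potential[s_next] - potential[s]


# Hypothetical 3-state chain 0 -> 1 -> 2; state 2 is terminal and rewarding.
transitions = {0: {1: 1.0}, 1: {2: 1.0}, 2: {}}
rewards = {2: 1.0}

phi = propagate_potential(transitions, rewards)
toward_goal = shaped_reward(0.0, phi, 0, 1)    # step toward the rewarding state
away_from_goal = shaped_reward(0.0, phi, 1, 0) # hypothetical step back
```

In this toy chain the potential grows as states get closer to the rewarding state, so a step away from it receives a negative shaped reward while a step toward it is not penalised.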
Item Type: | Article |
---|---|
Uncontrolled Discrete Keywords: | Directed graphs, Laplace equations, Hidden Markov models, Message passing, Trajectory, Perturbation methods, Symbols, Reinforcement learning, Probabilistic logic, Convolutional neural networks, Contrastive learning, directed graph convolutional neural network, markov decision process, reinforcement learning, reward shaping |
Divisions: | College of Business, Psychology and Sport > Worcester Business School |
Copyright Info: | Copyright © 2024, IEEE |
Depositing User: | Katherine Small |
Date Deposited: | 25 Nov 2024 19:34 |
Last Modified: | 25 Nov 2024 19:34 |
URI: | https://eprints.worc.ac.uk/id/eprint/14426 |