QPRL: Learning Optimal Policies with Quasi-Potential Functions for Asymmetric Traversal

Jumman Hossain, Nirmalya Roy
Department of Information Systems, University of Maryland, Baltimore County, USA

Abstract

Reinforcement learning (RL) in real-world tasks such as robotic navigation often encounters environments with asymmetric traversal costs, where actions such as climbing uphill versus moving downhill incur distinctly different penalties and some transitions may be irreversible.

While recent quasimetric RL methods relax symmetry assumptions, they typically do not explicitly account for path-dependent costs or provide rigorous safety guarantees.

We introduce Quasi-Potential Reinforcement Learning (QPRL), a novel framework that explicitly decomposes asymmetric traversal costs into a path-independent potential function \( \Phi \) and a path-dependent residual \( \Psi \). This decomposition allows efficient learning and stable policy optimization via a Lyapunov-based safety mechanism.

Theoretically, we prove that QPRL achieves convergence with improved sample complexity of \( \tilde{O}(\sqrt{T}) \), surpassing prior quasimetric RL bounds of \( \tilde{O}(T) \).

Empirically, QPRL attains state-of-the-art performance across a range of navigation and control tasks while substantially reducing irreversible constraint violations relative to baselines.

QPRL Framework

QPRL decomposes asymmetric traversal costs into a path-independent potential \( \Phi \) and a path-dependent residual \( \Psi \), and integrates Lyapunov safety constraints (yellow) for stable exploration.

Figure: QPRL framework diagram.
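To make the decomposition concrete, the sketch below models the traversal cost as the potential difference \( \Phi(s') - \Phi(s) \) plus a non-negative residual \( \Psi(s, s') \), and applies a simple Lyapunov-style acceptance test on the potential. This is an illustrative PyTorch sketch under assumed network shapes and names (PhiNet, PsiNet, lyapunov_safe), one natural reading of the decomposition rather than the authors' implementation.

```python
# Minimal sketch (not the released QPRL code) of the quasi-potential cost decomposition:
# c(s, s') is modeled as [Phi(s') - Phi(s)] + Psi(s, s'), with Psi kept non-negative so
# the path-dependent residual cannot "refund" cost. All architecture choices below are
# illustrative assumptions.

import torch
import torch.nn as nn


class PhiNet(nn.Module):
    """Path-independent potential Phi: S -> R."""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s).squeeze(-1)


class PsiNet(nn.Module):
    """Path-dependent residual Psi: S x S -> R>=0 (Softplus keeps it non-negative)."""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),
        )

    def forward(self, s: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, s_next], dim=-1)).squeeze(-1)


def predicted_cost(phi: PhiNet, psi: PsiNet, s, s_next):
    """Asymmetric traversal cost: potential difference plus non-negative residual."""
    return phi(s_next) - phi(s) + psi(s, s_next)


def lyapunov_safe(phi: PhiNet, s, s_next, slack: float = 0.0) -> torch.Tensor:
    """Lyapunov-style filter: accept a transition only if the potential does not
    increase by more than `slack` (a simple stand-in for the safety mechanism)."""
    return (phi(s_next) - phi(s)) <= slack


if __name__ == "__main__":
    state_dim = 4
    phi, psi = PhiNet(state_dim), PsiNet(state_dim)
    s, s_next = torch.randn(8, state_dim), torch.randn(8, state_dim)
    print("predicted cost:", predicted_cost(phi, psi, s, s_next))
    print("safe mask     :", lyapunov_safe(phi, s, s_next, slack=0.1))
```

Because \( \Psi \ge 0 \), any cost saved by traversing a transition in the "easy" direction is attributed to the potential term, while the residual only ever adds cost; this is what makes the asymmetry explicit in the learned model.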

Results

We compare QPRL against state-of-the-art baselines across a range of environments with asymmetric dynamics and cost structures. Our method consistently outperforms alternatives in success rate and return while incurring fewer constraint violations.

| Environment | Metric | QPRL (Ours) | QRL | Contrastive RL | DDPG+HER | SAC+HER |
|---|---|---|---|---|---|---|
| Asymmetric GridWorld | Success Rate (%) | 92.5 \( \pm \) 2.2 | 87.3 \( \pm \) 3.0 | 82.4 \( \pm \) 3.5 | 78.9 \( \pm \) 4.2 | 80.3 \( \pm \) 4.0 |
| MountainCar | Normalized Return | -95.6 \( \pm \) 4.1 | -108.4 \( \pm \) 6.7 | -118.3 \( \pm \) 8.1 | -125.5 \( \pm \) 7.6 | -121.2 \( \pm \) 7.0 |
| FetchPush | Success Rate (%) | 91.2 \( \pm \) 3.0 | 85.5 \( \pm \) 3.6 | 79.3 \( \pm \) 4.1 | 73.8 \( \pm \) 4.5 | 77.0 \( \pm \) 4.3 |
| LunarLander | Success Rate (%) | 88.9 \( \pm \) 3.4 | 81.4 \( \pm \) 4.0 | 76.7 \( \pm \) 4.5 | 72.5 \( \pm \) 5.0 | 74.2 \( \pm \) 4.8 |
| Maze2D | Success Rate (%) | 85.3 \( \pm \) 3.7 | 78.1 \( \pm \) 4.3 | 72.6 \( \pm \) 4.7 | 68.9 \( \pm \) 5.2 | 70.1 \( \pm \) 4.9 |

We further analyze the robustness of QPRL to asymmetric cost perturbations by comparing performance across symmetric and asymmetric variants of each environment. A smaller performance gap indicates greater robustness to asymmetry in the transition dynamics or reward structure.

| Environment | Method | Symmetric | Asymmetric | Gap |
|---|---|---|---|---|
| Asymmetric GridWorld | QPRL | 94.1 \( \pm \) 1.8 | 88.7 \( \pm \) 2.5 | 5.4 |
| Asymmetric GridWorld | QRL | 92.3 \( \pm \) 2.0 | 83.5 \( \pm \) 2.8 | 8.8 |
| Asymmetric GridWorld | SAC + HER | 90.2 \( \pm \) 2.3 | 81.0 \( \pm \) 3.2 | 9.2 |
| Asymmetric GridWorld | DDPG + HER | 89.8 \( \pm \) 2.5 | 80.5 \( \pm \) 3.5 | 9.3 |
| MountainCar | QPRL | -90.5 \( \pm \) 4.3 | -98.2 \( \pm \) 5.0 | 7.7 |
| MountainCar | QRL | -88.2 \( \pm \) 4.1 | -96.5 \( \pm \) 5.2 | 8.3 |
| MountainCar | SAC + HER | -87.0 \( \pm \) 4.0 | -95.8 \( \pm \) 5.3 | 8.8 |
| MountainCar | DDPG + HER | -86.5 \( \pm \) 4.2 | -94.5 \( \pm \) 5.1 | 8.0 |
| FetchPush | QPRL | 92.0 \( \pm \) 2.2 | 85.3 \( \pm \) 3.1 | 6.7 |
| FetchPush | QRL | 90.5 \( \pm \) 2.3 | 81.0 \( \pm \) 3.2 | 9.5 |
| FetchPush | SAC + HER | 89.8 \( \pm \) 2.5 | 79.8 \( \pm \) 3.5 | 10.0 |
| FetchPush | DDPG + HER | 88.5 \( \pm \) 2.4 | 78.5 \( \pm \) 3.4 | 10.0 |
| LunarLander | QPRL | 88.6 \( \pm \) 3.4 | 82.4 \( \pm \) 3.7 | 6.2 |
| LunarLander | QRL | 87.0 \( \pm \) 3.5 | 80.0 \( \pm \) 4.0 | 7.0 |
| LunarLander | SAC + HER | 85.5 \( \pm \) 3.8 | 77.5 \( \pm \) 4.2 | 8.0 |
| LunarLander | DDPG + HER | 84.0 \( \pm \) 3.6 | 76.0 \( \pm \) 4.1 | 8.0 |

Table: Performance on symmetric vs. asymmetric variants of each environment (mean \( \pm \) 1 s.d. over 5 seeds). Values are success rates (%) except for MountainCar, which reports normalized return. Gap is the absolute performance degradation from the symmetric to the asymmetric variant; lower is better, indicating robustness to asymmetric traversal costs.

Figure: QPRL performance learning curves.

Sample-efficiency and stability across tasks.
Success-rate learning curves for all five asymmetric environments.
The \( x \)-axis shows environment interactions (in millions of steps); the \( y \)-axis shows mean success rate.
Solid lines are the mean over 5 random seeds; shaded bands denote \( \pm 1 \) standard deviation.
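As a reproducibility aid, the short sketch below shows how curves of this form are typically aggregated: per-seed success-rate traces are averaged and the \( \pm 1 \) standard-deviation band is shaded around the mean. The data arrays, step counts, and curve shapes are placeholders, not the paper's evaluation data or plotting code.

```python
# Illustrative sketch (assumed data layout) of mean +/- 1 s.d. learning-curve aggregation.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
steps = np.linspace(0, 2e6, 200)  # environment interactions (placeholder horizon)

# Placeholder: 5 seeds x 200 evaluation points of success rate in [0, 1].
seed_curves = np.clip(
    1 - np.exp(-steps / 5e5)[None, :] + 0.05 * rng.standard_normal((5, 200)), 0, 1
)

mean = seed_curves.mean(axis=0)   # solid line: mean over seeds
std = seed_curves.std(axis=0)     # shaded band: +/- 1 standard deviation

plt.plot(steps / 1e6, mean, label="mean over 5 seeds")
plt.fill_between(steps / 1e6, mean - std, mean + std, alpha=0.3, label=r"$\pm 1$ s.d.")
plt.xlabel("Environment interactions (millions of steps)")
plt.ylabel("Mean success rate")
plt.legend()
plt.show()
```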

BibTeX


@InProceedings{hossain2025qprl,
  title     = {{QPRL}: Learning Optimal Policies with Quasi-Potential Functions for Asymmetric Traversal},
  author    = {Jumman Hossain and Nirmalya Roy},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  series    = {Proceedings of Machine Learning Research},
  publisher = {PMLR},
  year      = {2025}
}