Reinforcement learning (RL) in real-world tasks such as robotic navigation often encounters environments with asymmetric traversal costs, where, for example, climbing uphill and moving downhill incur markedly different penalties and some transitions may be irreversible.
While recent quasimetric RL methods relax symmetry assumptions, they typically do not explicitly account for path-dependent costs or provide rigorous safety guarantees.
We introduce Quasi-Potential Reinforcement Learning (QPRL), a novel framework that explicitly decomposes asymmetric traversal costs into a path-independent potential function \( \Phi \) and a path-dependent residual \( \Psi \). This decomposition allows efficient learning and stable policy optimization via a Lyapunov-based safety mechanism.
Theoretically, we prove that QPRL converges with a sample complexity of \( \tilde{O}(\sqrt{T}) \), improving on the \( \tilde{O}(T) \) bounds of prior quasimetric RL methods.
Empirically, QPRL attains state-of-the-art performance across a range of navigation and control tasks while reducing irreversible constraint violations by approximately 4× relative to baselines.
Figure: QPRL decomposes asymmetric costs into a path-independent potential \( \Phi \) and a path-dependent residual \( \Psi \), and integrates Lyapunov safety constraints (yellow) for stable exploration.
In QPRL, we maintain a learned encoder \(f_\phi\) that embeds states in a latent space, a latent transition model \(T_\psi\), and the two quasi-potential functions \(\Phi_\theta\) and \(\Psi_\theta\). Training proceeds as follows:
State and Transition Model:
We encode each state \(s_i\) as \(z_i = f_\phi(s_i)\) and predict the next latent \(\hat z'_i = T_\psi(z_i, a_i)\). We train \((f_\phi, T_\psi)\) by minimizing the latent prediction loss:
\[ \frac{1}{B} \sum_i \|\hat z'_i - f_\phi(s'_i)\|^2 \]
This yields a compact latent representation for planning.
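As a concrete illustration, here is a minimal PyTorch sketch of this step. The network sizes, latent dimension, learning rate, and batch tensors are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, LATENT_DIM = 8, 2, 16  # illustrative sizes

# Encoder f_phi and latent transition model T_psi (small MLPs for illustration).
f_phi = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, LATENT_DIM))
T_psi = nn.Sequential(nn.Linear(LATENT_DIM + ACTION_DIM, 64), nn.ReLU(), nn.Linear(64, LATENT_DIM))
opt = torch.optim.Adam([*f_phi.parameters(), *T_psi.parameters()], lr=3e-4)

def model_loss(s, a, s_next):
    """Latent prediction loss ||T_psi(f_phi(s), a) - f_phi(s')||^2, averaged over the batch."""
    z = f_phi(s)
    z_next_hat = T_psi(torch.cat([z, a], dim=-1))
    z_next = f_phi(s_next)  # a full implementation might detach this target
    return ((z_next_hat - z_next) ** 2).sum(dim=-1).mean()

# One gradient step on a random placeholder batch.
s, a, s_next = torch.randn(32, STATE_DIM), torch.randn(32, ACTION_DIM), torch.randn(32, STATE_DIM)
loss = model_loss(s, a, s_next)
opt.zero_grad(); loss.backward(); opt.step()
```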
Quasi-Potential Learning:
We fit \(\Phi_\theta, \Psi_\theta\) so that \(\Phi(g) - \Phi(s) + \Psi(s \to g) \approx c(s \to g)\). Concretely, for observed transitions \((s_i, a_i, s'_i, c_i)\) towards goal \(g_i\), we minimize the squared “inverse Bellman” loss:
\[ L_U = \frac{1}{B} \sum_i \left( \Phi_\theta(g_i) - \Phi_\theta(s_i) + \Psi_\theta(s_i \to g_i) - c_i \right)^2 \]
to enforce the decomposition. We also add a constraint loss to enforce the quasimetric property: \(\Psi(s \to s') \ge c(s \to s') - (\Phi(s') - \Phi(s))\). In practice this is implemented via a ReLU-based term in the loss. These updates ensure \(\Phi, \Psi\) capture both reversible and irreversible cost components.
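A minimal sketch of these two loss terms, under the same illustrative assumptions as above (PyTorch, small MLPs on latent inputs). The Softplus output head that keeps \(\Psi_\theta\) nonnegative is our assumption, not a detail stated above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 16  # illustrative

# Phi_theta: scalar potential of a latent; Psi_theta: nonnegative residual for a (state, goal) pair.
Phi_theta = nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
Psi_theta = nn.Sequential(nn.Linear(2 * LATENT_DIM, 64), nn.ReLU(), nn.Linear(64, 1), nn.Softplus())

def quasi_potential_loss(z_s, z_g, cost):
    """Inverse-Bellman fit L_U plus a ReLU penalty enforcing Psi >= c - (Phi(g) - Phi(s))."""
    phi_s = Phi_theta(z_s).squeeze(-1)
    phi_g = Phi_theta(z_g).squeeze(-1)
    psi_sg = Psi_theta(torch.cat([z_s, z_g], dim=-1)).squeeze(-1)
    d_hat = phi_g - phi_s + psi_sg                        # predicted cost of s -> g
    fit = F.mse_loss(d_hat, cost)                         # L_U
    violation = F.relu(cost - (phi_g - phi_s) - psi_sg)   # quasimetric constraint penalty
    return fit + violation.mean()

# Example call on placeholder latents and observed costs.
z_s, z_g, cost = torch.randn(32, LATENT_DIM), torch.randn(32, LATENT_DIM), torch.rand(32)
loss = quasi_potential_loss(z_s, z_g, cost)
```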
Lyapunov Safety Constraint:
QPRL enforces a Lyapunov-based safety constraint on \(\Phi\). Specifically, for any state \(s\) and safe action \(a\), the expected potential at the next state should not exceed the current potential by more than \(\epsilon\):
\[ \mathbb{E}_{s' \sim P(\cdot|s,a)}[\Phi_\theta(s')] \leq \Phi_\theta(s) + \epsilon \]
This is a Lyapunov condition: \(\Phi\) acts as a Lyapunov function that bounds unsafe exploration. We implement it via a Lagrangian penalty: if the predicted next latent \(z'\) violates the constraint, i.e. \(\Phi(z') > \Phi(s) + \epsilon\), we add the ReLU penalty:
\[ \max(0, \Phi(z') - \Phi(s) - \epsilon) \]
to the policy loss. A dynamic multiplier \(\lambda\) is adjusted to enforce this constraint adaptively during training.
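The safety layer can be sketched as follows. The threshold \(\epsilon\), the dual step size, and the class interface are illustrative choices rather than the paper's implementation; the dual update uses the signed constraint value so that \(\lambda\) can also decrease when the constraint is satisfied.

```python
import torch
import torch.nn.functional as F

class LyapunovConstraint:
    """Sketch of the Lagrangian safety layer: ReLU penalty plus a dual-ascent update of lambda."""

    def __init__(self, epsilon: float = 0.05, dual_lr: float = 1e-3):
        self.epsilon, self.dual_lr = epsilon, dual_lr   # illustrative hyperparameters
        self.lam = torch.tensor(1.0)                    # Lagrange multiplier, kept nonnegative

    def penalty(self, phi_s, phi_z_next):
        # max(0, Phi(z') - Phi(s) - epsilon), added per sample to the policy loss.
        return F.relu(phi_z_next - phi_s - self.epsilon)

    def dual_step(self, phi_s, phi_z_next):
        # Dual ascent on lambda with the signed constraint value, clipped at zero.
        g = (phi_z_next - phi_s - self.epsilon).mean()
        with torch.no_grad():
            self.lam = (self.lam + self.dual_lr * g).clamp(min=0.0)
        return self.lam
```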
Policy Update:
Finally, the policy \(\pi_\omega\) is updated to maximize long-term success under the quasi-potential costs, subject to the safety layer. We form a (cost-to-go) estimate \(\hat d_i = \Phi(g_i) - \Phi(s_i) + \Psi(s_i \to g_i)\) and minimize:
\[ \sum_i \left[ \hat d_i + \lambda\,\mathrm{ReLU}(\Phi(z'_i) - \Phi(s_i) - \epsilon) \right] \]
with respect to \(\omega\). This encourages low-cost paths while penalizing actions that would likely increase the Lyapunov potential beyond \(\epsilon\).
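Putting the pieces together, a sketch of this policy objective is shown below. The architecture of `pi_omega` and the argument names are placeholders; the other networks and the constraint object are assumed to come from the sketches above, and in this sketch the gradient with respect to \(\omega\) flows through the predicted next latent in the penalty term.

```python
import torch
import torch.nn as nn

LATENT_DIM, ACTION_DIM = 16, 2  # illustrative
pi_omega = nn.Sequential(nn.Linear(2 * LATENT_DIM, 64), nn.ReLU(), nn.Linear(64, ACTION_DIM), nn.Tanh())

def policy_loss(z_s, z_g, Phi_theta, Psi_theta, T_psi, constraint):
    """Quasi-potential cost-to-go plus the lambda-weighted Lyapunov penalty, averaged over the batch."""
    a = pi_omega(torch.cat([z_s, z_g], dim=-1))          # action proposed by the current policy
    z_next = T_psi(torch.cat([z_s, a], dim=-1))          # predicted next latent under that action
    phi_s = Phi_theta(z_s).squeeze(-1)
    phi_g = Phi_theta(z_g).squeeze(-1)
    d_hat = phi_g - phi_s + Psi_theta(torch.cat([z_s, z_g], dim=-1)).squeeze(-1)  # cost-to-go estimate
    penalty = constraint.penalty(phi_s, Phi_theta(z_next).squeeze(-1))
    # Policy gradients flow through z_next inside the penalty term.
    return (d_hat + constraint.lam * penalty).mean()
```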
We compare QPRL against state-of-the-art baselines across a range of environments with asymmetric dynamics and cost structures. Our method consistently outperforms alternatives in terms of success rate and return, while maintaining fewer constraint violations.
| Environment | Metric | QPRL (Ours) | QRL | Contrastive RL | DDPG+HER | SAC+HER |
|---|---|---|---|---|---|---|
| Asymmetric GridWorld | Success Rate (%) | 92.5 \( \pm \) 2.2 | 87.3 \( \pm \) 3.0 | 82.4 \( \pm \) 3.5 | 78.9 \( \pm \) 4.2 | 80.3 \( \pm \) 4.0 |
| MountainCar | Normalized Return | -95.6 \( \pm \) 4.1 | -108.4 \( \pm \) 6.7 | -118.3 \( \pm \) 8.1 | -125.5 \( \pm \) 7.6 | -121.2 \( \pm \) 7.0 |
| FetchPush | Success Rate (%) | 91.2 \( \pm \) 3.0 | 85.5 \( \pm \) 3.6 | 79.3 \( \pm \) 4.1 | 73.8 \( \pm \) 4.5 | 77.0 \( \pm \) 4.3 |
| LunarLander | Success Rate (%) | 88.9 \( \pm \) 3.4 | 81.4 \( \pm \) 4.0 | 76.7 \( \pm \) 4.5 | 72.5 \( \pm \) 5.0 | 74.2 \( \pm \) 4.8 |
| Maze2D | Success Rate (%) | 85.3 \( \pm \) 3.7 | 78.1 \( \pm \) 4.3 | 72.6 \( \pm \) 4.7 | 68.9 \( \pm \) 5.2 | 70.1 \( \pm \) 4.9 |
We further analyze the robustness of QPRL to asymmetric cost perturbations by comparing performance across symmetric and asymmetric variants of each environment. A smaller performance gap indicates greater robustness to asymmetry in the transition dynamics or reward structure.
| Environment | Method | Symmetric | Asymmetric | Gap |
|---|---|---|---|---|
| Asymmetric GridWorld | QPRL | 94.1 \( \pm \) 1.8 | 88.7 \( \pm \) 2.5 | 5.4 |
| | QRL | 92.3 \( \pm \) 2.0 | 83.5 \( \pm \) 2.8 | 8.8 |
| | SAC + HER | 90.2 \( \pm \) 2.3 | 81.0 \( \pm \) 3.2 | 9.2 |
| | DDPG + HER | 89.8 \( \pm \) 2.5 | 80.5 \( \pm \) 3.5 | 9.3 |
| MountainCar | QPRL | -90.5 \( \pm \) 4.3 | -98.2 \( \pm \) 5.0 | 7.7 |
| | QRL | -88.2 \( \pm \) 4.1 | -96.5 \( \pm \) 5.2 | 8.3 |
| | SAC + HER | -87.0 \( \pm \) 4.0 | -95.8 \( \pm \) 5.3 | 8.8 |
| | DDPG + HER | -86.5 \( \pm \) 4.2 | -94.5 \( \pm \) 5.1 | 8.0 |
| FetchPush | QPRL | 92.0 \( \pm \) 2.2 | 85.3 \( \pm \) 3.1 | 6.7 |
| | QRL | 90.5 \( \pm \) 2.3 | 81.0 \( \pm \) 3.2 | 9.5 |
| | SAC + HER | 89.8 \( \pm \) 2.5 | 79.8 \( \pm \) 3.5 | 10.0 |
| | DDPG + HER | 88.5 \( \pm \) 2.4 | 78.5 \( \pm \) 3.4 | 10.0 |
| LunarLander | QPRL | 88.6 \( \pm \) 3.4 | 82.4 \( \pm \) 3.7 | 6.2 |
| | QRL | 87.0 \( \pm \) 3.5 | 80.0 \( \pm \) 4.0 | 7.0 |
| | SAC + HER | 85.5 \( \pm \) 3.8 | 77.5 \( \pm \) 4.2 | 8.0 |
| | DDPG + HER | 84.0 \( \pm \) 3.6 | 76.0 \( \pm \) 4.1 | 8.0 |
Table: Performance on symmetric vs. asymmetric variants of each environment (mean \( \pm \) 1 s.d. over 5 seeds); values are success rates (%) except for MountainCar, which reports normalized return. Gap indicates the degradation from the symmetric to the asymmetric variant; lower is better, indicating robustness to asymmetric traversal costs.
Figure: Sample efficiency and stability across tasks. Success-rate learning curves for all five asymmetric environments; the \( x \)-axis shows environment interactions (millions of steps) and the \( y \)-axis shows mean success rate. Solid lines are means over 5 random seeds; shaded bands denote \( \pm 1 \) standard deviation.
@InProceedings{hossain2025qprl,
title = {{QPRL}: Learning Optimal Policies with Quasi-Potential Functions for Asymmetric Traversal},
author = {Jumman Hossain and Nirmalya Roy},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
series = {Proceedings of Machine Learning Research},
publisher = {PMLR},
year = {2025}
}