Reinforcement learning (RL) in real-world tasks such as robotic navigation often encounters environments with asymmetric traversal costs, where, for example, climbing uphill and moving downhill incur markedly different penalties and some transitions may be irreversible.
While recent quasimetric RL methods relax symmetry assumptions, they typically do not explicitly account for path-dependent costs or provide rigorous safety guarantees.
We introduce Quasi-Potential Reinforcement Learning (QPRL), a novel framework that explicitly decomposes asymmetric traversal costs into a path-independent potential function \( \Phi \) and a path-dependent residual \( \Psi \). This decomposition allows efficient learning and stable policy optimization via a Lyapunov-based safety mechanism.
Theoretically, we prove that QPRL converges with a sample complexity of \( \tilde{O}(\sqrt{T}) \), improving on the \( \tilde{O}(T) \) bounds of prior quasimetric RL methods.
Empirically, QPRL attains state-of-the-art performance across a range of navigation and control tasks while reducing irreversible constraint violations by approximately 4× relative to baselines.
Figure: QPRL decomposes asymmetric costs into a path-independent potential \( \Phi \) and a path-dependent residual \( \Psi \), and integrates Lyapunov safety constraints (yellow) for stable exploration.
In QPRL, we maintain a learned encoder \(f_\phi\) that embeds states in a latent space, a latent transition model \(T_\psi\), and the two quasi-potential functions \(\Phi_\theta\) and \(\Psi_\theta\). Training proceeds as follows:
State and Transition Model:
We encode each state \(s_i\) as \(z_i = f_\phi(s_i)\) and predict the next latent \(\hat z'_i = T_\psi(z_i, a_i)\). We train \((f_\phi, T_\psi)\) by minimizing the latent prediction loss:
\[ \frac{1}{B} \sum_i \|\hat z'_i - f_\phi(s'_i)\|^2 \]
This yields a compact latent representation for planning.
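As a concrete illustration, here is a minimal PyTorch sketch of this step. The network sizes, latent dimension, learning rate, and batch tensors are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, LATENT_DIM = 8, 2, 16  # illustrative sizes

# Encoder f_phi and latent transition model T_psi (small MLPs for illustration).
f_phi = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, LATENT_DIM))
T_psi = nn.Sequential(nn.Linear(LATENT_DIM + ACTION_DIM, 64), nn.ReLU(), nn.Linear(64, LATENT_DIM))
opt = torch.optim.Adam([*f_phi.parameters(), *T_psi.parameters()], lr=3e-4)

def model_loss(s, a, s_next):
    """Latent prediction loss ||T_psi(f_phi(s), a) - f_phi(s')||^2, averaged over the batch."""
    z = f_phi(s)
    z_next_hat = T_psi(torch.cat([z, a], dim=-1))
    z_next = f_phi(s_next)  # a full implementation might detach this target
    return ((z_next_hat - z_next) ** 2).sum(dim=-1).mean()

# One gradient step on a random placeholder batch.
s, a, s_next = torch.randn(32, STATE_DIM), torch.randn(32, ACTION_DIM), torch.randn(32, STATE_DIM)
loss = model_loss(s, a, s_next)
opt.zero_grad(); loss.backward(); opt.step()
```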
Quasi-Potential Learning:
We fit \(\Phi_\theta, \Psi_\theta\) so that \(\Phi(g) - \Phi(s) + \Psi(s \to g) \approx c(s \to g)\). Concretely, for observed transitions \((s_i, a_i, s'_i, c_i)\) towards goal \(g_i\), we minimize the squared “inverse Bellman” loss:
\[ L_U = \frac{1}{B} \sum_i \left( \Phi_\theta(g_i) - \Phi_\theta(s_i) + \Psi_\theta(s_i \to g_i) - c_i \right)^2 \]
to enforce the decomposition. We also add a constraint loss to enforce the quasimetric property: \(\Psi(s \to s') \ge c(s \to s') - (\Phi(s') - \Phi(s))\). In practice this is implemented via a ReLU-based term in the loss. These updates ensure \(\Phi, \Psi\) capture both reversible and irreversible cost components.
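A minimal sketch of these two loss terms, under the same illustrative assumptions as above (PyTorch, small MLPs on latent inputs). The Softplus output head that keeps \(\Psi_\theta\) nonnegative is our assumption, not a detail stated above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 16  # illustrative

# Phi_theta: scalar potential of a latent; Psi_theta: nonnegative residual for a (state, goal) pair.
Phi_theta = nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
Psi_theta = nn.Sequential(nn.Linear(2 * LATENT_DIM, 64), nn.ReLU(), nn.Linear(64, 1), nn.Softplus())

def quasi_potential_loss(z_s, z_g, cost):
    """Inverse-Bellman fit L_U plus a ReLU penalty enforcing Psi >= c - (Phi(g) - Phi(s))."""
    phi_s = Phi_theta(z_s).squeeze(-1)
    phi_g = Phi_theta(z_g).squeeze(-1)
    psi_sg = Psi_theta(torch.cat([z_s, z_g], dim=-1)).squeeze(-1)
    d_hat = phi_g - phi_s + psi_sg                        # predicted cost of s -> g
    fit = F.mse_loss(d_hat, cost)                         # L_U
    violation = F.relu(cost - (phi_g - phi_s) - psi_sg)   # quasimetric constraint penalty
    return fit + violation.mean()

# Example call on placeholder latents and observed costs.
z_s, z_g, cost = torch.randn(32, LATENT_DIM), torch.randn(32, LATENT_DIM), torch.rand(32)
loss = quasi_potential_loss(z_s, z_g, cost)
```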
Lyapunov Safety Constraint:
QPRL enforces a Lyapunov-based safety constraint on \(\Phi\). Specifically, for any state \(s\) and safe action \(a\), the expected potential at the next state should not exceed the current potential by more than \(\epsilon\):
\[ \mathbb{E}_{s' \sim P(\cdot|s,a)}[\Phi_\theta(s')] \leq \Phi_\theta(s) + \epsilon \]
This is a Lyapunov condition: \(\Phi\) acts as a Lyapunov function that bounds unsafe exploration. We implement it via a Lagrangian penalty: if the predicted next latent \(z'\) violates the constraint, i.e. \(\Phi(z') > \Phi(s) + \epsilon\), we add the ReLU penalty:
\[ \max(0, \Phi(z') - \Phi(s) - \epsilon) \]
to the policy loss. A dynamic multiplier \(\lambda\) is adjusted to enforce this constraint adaptively during training.
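The safety layer can be sketched as follows. The threshold \(\epsilon\), the dual step size, and the class interface are illustrative choices rather than the paper's implementation; the dual update uses the signed constraint value so that \(\lambda\) can also decrease when the constraint is satisfied.

```python
import torch
import torch.nn.functional as F

class LyapunovConstraint:
    """Sketch of the Lagrangian safety layer: ReLU penalty plus a dual-ascent update of lambda."""

    def __init__(self, epsilon: float = 0.05, dual_lr: float = 1e-3):
        self.epsilon, self.dual_lr = epsilon, dual_lr   # illustrative hyperparameters
        self.lam = torch.tensor(1.0)                    # Lagrange multiplier, kept nonnegative

    def penalty(self, phi_s, phi_z_next):
        # max(0, Phi(z') - Phi(s) - epsilon), added per sample to the policy loss.
        return F.relu(phi_z_next - phi_s - self.epsilon)

    def dual_step(self, phi_s, phi_z_next):
        # Dual ascent on lambda with the signed constraint value, clipped at zero.
        g = (phi_z_next - phi_s - self.epsilon).mean()
        with torch.no_grad():
            self.lam = (self.lam + self.dual_lr * g).clamp(min=0.0)
        return self.lam
```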
Policy Update:
Finally, the policy \(\pi_\omega\) is updated to maximize long-term success under the quasi-potential costs, subject to the safety layer. We form a (cost-to-go) estimate \(\hat d_i = \Phi(g_i) - \Phi(s_i) + \Psi(s_i \to g_i)\) and minimize:
\[ \sum_i \left[ \hat d_i + \lambda\,\mathrm{ReLU}(\Phi(z'_i) - \Phi(s_i) - \epsilon) \right] \]
with respect to \(\omega\). This encourages low-cost paths while penalizing actions that would likely increase the Lyapunov potential beyond \(\epsilon\).
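Putting the pieces together, a sketch of this policy objective is shown below. The architecture of `pi_omega` and the argument names are placeholders; the other networks and the constraint object are assumed to come from the sketches above, and in this sketch the gradient with respect to \(\omega\) flows through the predicted next latent in the penalty term.

```python
import torch
import torch.nn as nn

LATENT_DIM, ACTION_DIM = 16, 2  # illustrative
pi_omega = nn.Sequential(nn.Linear(2 * LATENT_DIM, 64), nn.ReLU(), nn.Linear(64, ACTION_DIM), nn.Tanh())

def policy_loss(z_s, z_g, Phi_theta, Psi_theta, T_psi, constraint):
    """Quasi-potential cost-to-go plus the lambda-weighted Lyapunov penalty, averaged over the batch."""
    a = pi_omega(torch.cat([z_s, z_g], dim=-1))          # action proposed by the current policy
    z_next = T_psi(torch.cat([z_s, a], dim=-1))          # predicted next latent under that action
    phi_s = Phi_theta(z_s).squeeze(-1)
    phi_g = Phi_theta(z_g).squeeze(-1)
    d_hat = phi_g - phi_s + Psi_theta(torch.cat([z_s, z_g], dim=-1)).squeeze(-1)  # cost-to-go estimate
    penalty = constraint.penalty(phi_s, Phi_theta(z_next).squeeze(-1))
    # Policy gradients flow through z_next inside the penalty term.
    return (d_hat + constraint.lam * penalty).mean()
```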
We compare QPRL against state-of-the-art baselines across a range of environments with asymmetric dynamics and cost structures. Our method consistently outperforms alternatives in terms of success rate and return, while maintaining fewer constraint violations.
| Environment | Metric | QPRL (Ours) | QRL | Contrastive RL | DDPG+HER | SAC+HER |
|---|---|---|---|---|---|---|
| Asymmetric GridWorld | Success Rate (%) | 92.5 \( \pm \) 2.2 | 87.3 \( \pm \) 3.0 | 82.4 \( \pm \) 3.5 | 78.9 \( \pm \) 4.2 | 80.3 \( \pm \) 4.0 |
| MountainCar | Normalized Return | -95.6 \( \pm \) 4.1 | -108.4 \( \pm \) 6.7 | -118.3 \( \pm \) 8.1 | -125.5 \( \pm \) 7.6 | -121.2 \( \pm \) 7.0 |
| FetchPush | Success Rate (%) | 91.2 \( \pm \) 3.0 | 85.5 \( \pm \) 3.6 | 79.3 \( \pm \) 4.1 | 73.8 \( \pm \) 4.5 | 77.0 \( \pm \) 4.3 |
| LunarLander | Success Rate (%) | 88.9 \( \pm \) 3.4 | 81.4 \( \pm \) 4.0 | 76.7 \( \pm \) 4.5 | 72.5 \( \pm \) 5.0 | 74.2 \( \pm \) 4.8 |
| Maze2D | Success Rate (%) | 85.3 \( \pm \) 3.7 | 78.1 \( \pm \) 4.3 | 72.6 \( \pm \) 4.7 | 68.9 \( \pm \) 5.2 | 70.1 \( \pm \) 4.9 |
We further analyze the robustness of QPRL to asymmetric cost perturbations by comparing performance across symmetric and asymmetric variants of each environment. A smaller performance gap indicates greater robustness to asymmetry in the transition dynamics or reward structure.
| Environment | Method | Symmetric | Asymmetric | Gap |
|---|---|---|---|---|
| Asymmetric GridWorld | QPRL | 94.1 \( \pm \) 1.8 | 88.7 \( \pm \) 2.5 | 5.4 |
| | QRL | 92.3 \( \pm \) 2.0 | 83.5 \( \pm \) 2.8 | 8.8 |
| | SAC + HER | 90.2 \( \pm \) 2.3 | 81.0 \( \pm \) 3.2 | 9.2 |
| | DDPG + HER | 89.8 \( \pm \) 2.5 | 80.5 \( \pm \) 3.5 | 9.3 |
| MountainCar | QPRL | -90.5 \( \pm \) 4.3 | -98.2 \( \pm \) 5.0 | 7.7 |
| | QRL | -88.2 \( \pm \) 4.1 | -96.5 \( \pm \) 5.2 | 8.3 |
| | SAC + HER | -87.0 \( \pm \) 4.0 | -95.8 \( \pm \) 5.3 | 8.8 |
| | DDPG + HER | -86.5 \( \pm \) 4.2 | -94.5 \( \pm \) 5.1 | 8.0 |
| FetchPush | QPRL | 92.0 \( \pm \) 2.2 | 85.3 \( \pm \) 3.1 | 6.7 |
| | QRL | 90.5 \( \pm \) 2.3 | 81.0 \( \pm \) 3.2 | 9.5 |
| | SAC + HER | 89.8 \( \pm \) 2.5 | 79.8 \( \pm \) 3.5 | 10.0 |
| | DDPG + HER | 88.5 \( \pm \) 2.4 | 78.5 \( \pm \) 3.4 | 10.0 |
| LunarLander | QPRL | 88.6 \( \pm \) 3.4 | 82.4 \( \pm \) 3.7 | 6.2 |
| | QRL | 87.0 \( \pm \) 3.5 | 80.0 \( \pm \) 4.0 | 7.0 |
| | SAC + HER | 85.5 \( \pm \) 3.8 | 77.5 \( \pm \) 4.2 | 8.0 |
| | DDPG + HER | 84.0 \( \pm \) 3.6 | 76.0 \( \pm \) 4.1 | 8.0 |
Table: Performance on symmetric vs. asymmetric variants of each environment (mean \( \pm \) 1 s.d. over 5 seeds); values are success rates (%) except for MountainCar, which reports normalized return. Gap indicates the degradation from the symmetric to the asymmetric variant; lower is better, indicating robustness to asymmetric traversal costs.
Figure: Sample efficiency and stability across tasks. Success-rate learning curves for all five asymmetric environments; the \( x \)-axis shows environment interactions (millions of steps) and the \( y \)-axis shows mean success rate. Solid lines are means over 5 random seeds; shaded bands denote \( \pm 1 \) standard deviation.
@InProceedings{hossain2025qprl,
title = {{QPRL}: Learning Optimal Policies with Quasi-Potential Functions for Asymmetric Traversal},
author = {Jumman Hossain and Nirmalya Roy},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
series = {Proceedings of Machine Learning Research},
publisher = {PMLR},
year = {2025}
}