FiRL learns locomotion policies with a direction-dependent
Finsler cost \(F(x,v)\) and a dynamic CVaR objective.
The framework captures uphill/downhill asymmetry, lateral slip, and rare high-cost failures
within a single risk-sensitive reinforcement learning objective.
FiRL builds a local cost geometry for locomotion and then trains an actor--critic policy
to minimize tail risk under that geometry. The key idea is that moving in different
directions can have different physical costs: climbing uphill is not the same as descending,
and lateral slip is not the same as forward motion.
Direction-dependent Finsler cost:
FiRL defines the one-step locomotion cost using a Finsler metric \(F(x,v)\), where
\(x\) is the robot state and \(v\) is the induced motion direction:
\[
F(x,v)
=
w_e F_{\mathrm{energy}}(x,v)
+
w_d F_{\mathrm{drift}}(x,v)
+
w_f F_{\mathrm{friction}}(x,v).
\]
Here \(w_e\), \(w_d\), and \(w_f\) weight the three terms. The energy term measures basic motion effort, the drift term penalizes uphill motion,
and the friction term penalizes lateral movement or slip. This lets the robot distinguish
between actions that may look similar in Euclidean space but have very different physical
consequences.
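The three-term cost can be illustrated with a minimal Python sketch. The specific term shapes below (Euclidean norm for energy, positive part of the uphill component for drift, absolute lateral component for friction), the default weights, and the names `finsler_cost`, `slope_grad`, and `heading` are illustrative assumptions, not FiRL's actual definitions. The sketch only shows how a weighted sum of this kind becomes direction-dependent.

```python
import math

def finsler_cost(slope_grad, heading, v, w_e=1.0, w_d=0.5, w_f=0.3):
    """Illustrative one-step cost F(x, v) for a 2-D motion direction v.

    slope_grad: terrain height gradient at x, pointing uphill (assumed part of x)
    heading:    unit vector along the robot's forward axis (assumed part of x)
    The term shapes and weights are assumptions made for this sketch.
    """
    vx, vy = v
    gx, gy = slope_grad
    hx, hy = heading
    energy = math.hypot(vx, vy)            # symmetric effort: same for v and -v
    drift = max(0.0, gx * vx + gy * vy)    # only the uphill component is penalized
    lateral = abs(-hy * vx + hx * vy)      # component perpendicular to the heading
    return w_e * energy + w_d * drift + w_f * lateral

# Hypothetical state: slope rising along +x, robot facing +x.
x_slope, x_heading = (0.5, 0.0), (1.0, 0.0)
uphill = finsler_cost(x_slope, x_heading, (1.0, 0.0))
downhill = finsler_cost(x_slope, x_heading, (-1.0, 0.0))
sideways = finsler_cost(x_slope, x_heading, (0.0, 1.0))
print(uphill, downhill, sideways)
```

In this toy setting, uphill motion costs more than the reversed downhill motion of the same length, and a sideways step costs more than a forward step on flat ground would. Each term scales linearly with positive multiples of \(v\), so the sum is positively homogeneous, as a Finsler-type cost requires.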
Quasi-metric path geometry:
Because \(F(x,v)\) can assign different costs to \(v\) and \(-v\), the induced path cost
is generally asymmetric. For two states \(x\) and \(y\), FiRL defines:
\[
d_F(x,y)
=
\inf_{\tau:x\to y}
\int_0^1 F(\tau(t),\dot{\tau}(t))\,dt.
\]
Here the infimum runs over paths \(\tau\) with \(\tau(0)=x\) and \(\tau(1)=y\). This path cost satisfies the triangle inequality through path concatenation, while still
allowing \(d_F(x,y) \neq d_F(y,x)\). In this sense, FiRL gives the policy an asymmetric
geometric bias for direction-aware locomotion.
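A discrete analogue makes the asymmetry concrete. The sketch below approximates \(d_F\) by a shortest-path search on a small height grid, using an assumed edge cost (unit step length plus a penalty on height gained). The grid, the weights, and the names `edge_cost` and `path_cost` are hypothetical and serve only to illustrate the quasi-metric property.

```python
import heapq

def edge_cost(h_from, h_to, w_e=1.0, w_d=2.0):
    # Assumed step cost: unit length plus a penalty on height gained (uphill only).
    return w_e * 1.0 + w_d * max(0.0, h_to - h_from)

def path_cost(heights, src, dst):
    """Dijkstra shortest-path cost on a 4-connected height grid."""
    rows, cols = len(heights), len(heights[0])
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == dst:
            return d
        if d > dist.get((r, c), float("inf")):
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + edge_cost(heights[r][c], heights[nr][nc])
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    heapq.heappush(heap, (nd, (nr, nc)))
    return float("inf")

# Hypothetical ramp: height increases from left to right.
heights = [[0.0, 1.0, 2.0],
           [0.0, 1.0, 2.0]]
bottom, mid, top = (0, 0), (0, 1), (0, 2)
up = path_cost(heights, bottom, top)
down = path_cost(heights, top, bottom)
via_mid = path_cost(heights, bottom, mid) + path_cost(heights, mid, top)
print(up, down, via_mid)
```

Climbing the ramp is more expensive than descending it, so the discrete path cost is asymmetric, while the cost from `bottom` to `top` never exceeds the cost of going through `mid`.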
Dynamic CVaR Bellman objective:
To reduce rare but costly failures, FiRL optimizes a dynamic CVaR objective rather than
only expected cost. The Bellman update is:
\[
\begin{aligned}
V_\alpha(x)
&=
\min_{u\in\mathcal{U}(x)}
\Bigl\{
c(x,u)
+
\gamma\,\mathrm{CVaR}_{\alpha}
\left[V_\alpha(X')\right]
\Bigr\}, \\
&\qquad X'\sim P(\cdot\mid x,u),
\end{aligned}
\]
where \(c(x,u)=F(x,v(x,u))\). This objective encourages the robot to avoid actions that
may lead to high-cost tail outcomes, such as slipping, falling, or using excessive energy.
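The backup can be sketched on a toy problem with a finite next-state distribution. The convention below is an assumption: \(\alpha\in(0,1]\) is treated as the tail mass, so \(\mathrm{CVaR}_\alpha\) averages the worst \(\alpha\)-fraction of outcomes and \(\alpha=1\) recovers the expectation. The two-action MDP, its costs, and the function names are hypothetical.

```python
def cvar(values, probs, alpha):
    """Mean of the worst (highest-cost) alpha-fraction of a discrete distribution."""
    remaining, total = alpha, 0.0
    for v, p in sorted(zip(values, probs), reverse=True):
        take = min(p, remaining)      # probability mass taken from this outcome
        total += take * v
        remaining -= take
        if remaining <= 1e-12:
            break
    return total / alpha

def cvar_value_iteration(P, terminal_values, alpha, gamma=0.9, sweeps=50):
    """Repeatedly apply V(x) = min_u { c(x,u) + gamma * CVaR_alpha[V(X')] }."""
    V = dict(terminal_values)
    V.update({s: 0.0 for s in P})
    policy = {}
    for _ in range(sweeps):
        for s, actions in P.items():
            best_u, best_q = None, float("inf")
            for u, (cost, transitions) in actions.items():
                next_vals = [V[s2] for s2, _ in transitions]
                next_probs = [p for _, p in transitions]
                q = cost + gamma * cvar(next_vals, next_probs, alpha)
                if q < best_q:
                    best_u, best_q = u, q
            V[s], policy[s] = best_q, best_u
    return V, policy

# Hypothetical MDP: a cheap direct move that occasionally ends in a fall,
# versus a costlier detour that always reaches the goal.
P = {"start": {"direct": (1.0, [("goal", 0.95), ("fall", 0.05)]),
               "detour": (2.0, [("goal", 1.0)])}}
terminal_values = {"goal": 0.0, "fall": 10.0}

V_neutral, pi_neutral = cvar_value_iteration(P, terminal_values, alpha=1.0)
V_averse, pi_averse = cvar_value_iteration(P, terminal_values, alpha=0.1)
print(pi_neutral["start"], pi_averse["start"])
```

With \(\alpha=1\) the backup is risk-neutral and prefers the direct move; with a small tail mass the rare fall dominates the CVaR term and the detour becomes the minimizer.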
Distributional critic and actor update:
FiRL uses a distributional critic \(Z_\phi(x)\) to estimate the distribution of future cumulative cost.
The critic outputs quantiles of this cost distribution, and the CVaR target is computed
from the worst-cost tail:
\[
y_i
=
c_i
+
\gamma(1-d_i)
\widehat{\mathrm{CVaR}}_\alpha
\left[Z_{\phi^-}(x'_i)\right].
\]
Here \(d_i\) indicates episode termination and \(\phi^-\) denotes the target-critic parameters.
The actor is then updated using advantages derived from this CVaR-based backup. This keeps
the optimization practical while preserving FiRL’s main structure: direction-dependent cost
plus tail-risk-aware learning.
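The target computation can be sketched as follows, assuming the critic emits \(N\) equally weighted cost quantiles and that \(\alpha\) is the tail mass. The estimator (averaging the worst \(\lceil \alpha N\rceil\) quantiles) and the names `cvar_from_quantiles` and `cvar_targets` are illustrative assumptions; the lists in `next_quantiles` stand in for the output of \(Z_{\phi^-}(x'_i)\).

```python
import math

def cvar_from_quantiles(quantiles, alpha):
    """Average the worst ceil(alpha * N) of N equally weighted cost quantiles."""
    k = max(1, math.ceil(alpha * len(quantiles)))
    worst = sorted(quantiles, reverse=True)[:k]   # highest costs first
    return sum(worst) / k

def cvar_targets(costs, dones, next_quantiles, alpha, gamma=0.99):
    """y_i = c_i + gamma * (1 - d_i) * CVaR_alpha estimate of the next-state quantiles."""
    return [c + gamma * (1.0 - d) * cvar_from_quantiles(q, alpha)
            for c, d, q in zip(costs, dones, next_quantiles)]

# Hypothetical mini-batch of two transitions; the second one is terminal.
costs = [1.0, 2.0]
dones = [0.0, 1.0]
next_quantiles = [[1.0, 2.0, 3.0, 10.0],
                  [4.0, 4.0, 4.0, 4.0]]
targets = cvar_targets(costs, dones, next_quantiles, alpha=0.5, gamma=0.9)
print(targets)
```

For the terminal transition the bootstrap term vanishes and the target equals the immediate cost. For the non-terminal one, only the two largest quantiles enter the backup, so a single heavy-tail quantile raises the target well above the mean-based value.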
Policy behavior:
On slopes and stairs, in wind, and in narrow-support settings, FiRL learns to avoid direct but risky
motions when safer alternatives exist. Instead of only maximizing average performance, it
learns cautious, energy-aware behaviors that reduce high-cost failures.