¹Carnegie Mellon University   ²Field AI

Abstract

Recent advances in video world modeling have enabled large-scale generative models to simulate embodied environments with high visual fidelity, providing strong priors for prediction, planning, and control. Yet, despite their realism, these models often lack geometric grounding, limiting their use in navigation tasks that require spatial coherence and long-horizon stability. We introduce Reinforcement Learning with World Grounding (RLWG), a self-supervised post-training framework that aligns pretrained world models with physically verifiable structure through geometric and perceptual rewards. Analogous to reinforcement learning with verifiable rewards (RLVR) in language models, RLWG can use multiple rewards that measure pose cycle-consistency, depth reprojection, and temporal coherence. We instantiate this framework with GrndCtrl, a reward-aligned adaptation method based on Group Relative Policy Optimization (GRPO), yielding world models that maintain stable trajectories, consistent geometry, and reliable rollouts for embodied navigation. Like post-training alignment in large language models, GrndCtrl leverages verifiable rewards to bridge generative pretraining and grounded behavior, achieving superior spatial coherence and navigation stability over supervised fine-tuning in outdoor environments.

Overview

Reinforcement Learning with World Grounding (RLWG) addresses geometric inconsistencies in pretrained video world models through self-supervised post-training with verifiable rewards. Instead of reconstruction losses, RLWG grounds models using geometric and perceptual rewards from frozen evaluators.

GrndCtrl instantiates RLWG using Group Relative Policy Optimization (GRPO), enabling physically consistent rollouts essential for reliable world generation.

Problem

Despite impressive generative fidelity, current video world models often capture the appearance of motion more than its structure. Their rollouts remain visually plausible but geometrically and temporally inconsistent: poses drift, depths wobble, and trajectories lose alignment over time.

These instabilities limit the use of current models for closed-loop tasks such as localization, mapping, and planning, where a physically consistent representation is essential.

Solution

RLWG refines pretrained world models using verifiable geometric and perceptual rewards derived from model rollouts. Each rollout is automatically scored using rewards that quantify spatial and temporal coherence, such as pose cycle-consistency, depth reprojection agreement, and action adherence.
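
To make these rewards concrete, the sketch below (a minimal NumPy illustration, not the authors' implementation) shows how two such verifiable scores could be computed for a single rollout: the translation/rotation error between poses recovered by a frozen evaluator and the commanded trajectory, and a depth temporal reprojection inlier ratio. The function names, input conventions (camera-to-world 4x4 poses, pinhole intrinsics K), and the 5% inlier threshold are illustrative assumptions.

import numpy as np

def pose_errors(est_poses, cmd_poses):
    """Translation / rotation error between estimated and commanded trajectories.

    est_poses, cmd_poses: (T, 4, 4) camera-to-world poses. The estimated poses
    would come from a frozen pose evaluator run on the generated rollout; the
    commanded poses come from the conditioning actions.
    """
    t_err = np.linalg.norm(est_poses[:, :3, 3] - cmd_poses[:, :3, 3], axis=-1).mean()
    # Rotation error as the geodesic angle of the relative rotation R_est R_cmd^T.
    rel = np.einsum("tij,tkj->tik", est_poses[:, :3, :3], cmd_poses[:, :3, :3])
    cos = np.clip((np.trace(rel, axis1=1, axis2=2) - 1.0) / 2.0, -1.0, 1.0)
    r_err = np.arccos(cos).mean()
    return t_err, r_err

def depth_reprojection_inliers(depth_t, depth_t1, pose_t, pose_t1, K, thresh=0.05):
    """Fraction of pixels whose depth at frame t, warped into frame t+1,
    agrees with the predicted depth at t+1 within a relative threshold."""
    H, W = depth_t.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Back-project pixels into camera t, transform to camera t+1, re-project.
    pts_t = (np.linalg.inv(K) @ pix.T) * depth_t.reshape(1, -1)
    rel = np.linalg.inv(pose_t1) @ pose_t                  # camera t -> camera t+1
    pts_t1 = rel[:3, :3] @ pts_t + rel[:3, 3:4]
    proj = K @ pts_t1
    z = proj[2]
    uv = np.round(proj[:2] / np.maximum(z, 1e-6)).astype(int)
    valid = (z > 0) & (uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H)
    sampled = depth_t1[uv[1, valid], uv[0, valid]]
    inlier = np.abs(sampled - z[valid]) / np.maximum(sampled, 1e-6) < thresh
    return inlier.mean() if valid.any() else 0.0

In RLWG, scores of this kind would be combined into the scalar reward used for post-training; the specific weighting and the full reward set belong to the method itself and are not reproduced here.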

GrndCtrl uses GRPO to optimize these verifiable rewards efficiently, preserving visual quality while progressively aligning the model's dynamics with measurable structure in the real world.
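
As a reference point, the sketch below is a generic PyTorch rendering of the standard GRPO objective: each rollout's scalar reward is standardized against the other rollouts sampled from the same conditioning context, then a clipped policy-gradient update is applied with an optional KL penalty toward the frozen pretrained model. It is not the exact GrndCtrl training code; tensor shapes, the clipping range, and the KL coefficient are assumptions.

import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages.

    rewards: (B, G) tensor of scalar rewards, B contexts with G rollouts each.
    Each reward is standardized against its own group of sibling rollouts.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_loss(logp_new, logp_old, advantages, clip_eps=0.2, kl_coef=0.01, logp_ref=None):
    """Clipped policy-gradient objective with an optional KL penalty to the
    frozen pretrained (reference) world model.

    logp_new / logp_old / logp_ref: per-rollout log-probabilities, shape (B, G).
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    loss = -torch.min(unclipped, clipped).mean()
    if logp_ref is not None:
        # k3 estimator of KL(new || ref), as commonly used with GRPO.
        log_ratio = logp_ref - logp_new
        loss = loss + kl_coef * (torch.exp(log_ratio) - log_ratio - 1).mean()
    return loss

Because advantages are computed within each group, no learned value function is required: a rollout is judged only relative to its siblings generated from the same conditioning context.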

Qualitative Comparison

Qualitative video comparisons of GrndCtrl against ground truth and baseline methods on two scenes each from CODa, SCAND, and CityWalk.

Method

Figure 1. Overview of GrndCtrl. RLWG refines pretrained world models using verifiable geometric and perceptual rewards. GrndCtrl instantiates RLWG using Group Relative Policy Optimization (GRPO) to optimize these rewards, enabling physically consistent rollouts.

Quantitative Results

                    Seen                            Counterfactual                  Unseen
Method              T↓      R↓     V↑     DTRI↑     T↓      R↓     V↑     DTRI↑     T↓      R↓     V↑     DTRI↑

CODa
Baseline            57.8    1.77   7.40   38.9      71.5    1.55   7.41   39.1      56.9    1.71   7.40   38.3
+T+R                46.4    1.44   7.32   38.4      50.5    1.53   7.34   38.7      54.3    1.75   7.36   39.3
+T+R+DTRI           65.7    1.74   7.43   37.0      57.7    1.86   7.42   36.8      42.6    1.74   7.40   37.1
+T+R+DTRI+V         39.9    1.27   7.35   37.5      40.7    1.42   7.34   37.4      31.0    1.53   7.37   38.0

SCAND
Baseline            186.3   3.76   7.16   23.6      315.9   4.24   7.13   21.4      117.0   4.02   6.99   18.4
+T+R                158.2   3.61   7.19   23.7      251.2   4.34   7.18   21.7      131.1   3.95   7.04   19.1
+T+R+DTRI           157.9   3.65   7.10   22.1      288.6   4.45   7.17   20.1      118.6   4.07   7.03   17.9
+T+R+DTRI+V         133.4   3.30   7.11   24.5      220.1   4.23   7.08   22.8      123.4   3.62   6.98   19.4

CityWalk
Baseline            11.7    3.13   7.96   46.9      13.1    3.27   7.94   47.4      20.8    4.47   7.90   44.5
+T+R                8.9     3.31   7.90   44.9      4.8     4.42   7.91   45.6      10.2    3.47   7.87   42.8
+T+R+DTRI           8.4     3.36   7.84   43.5      4.7     4.40   7.83   44.1      10.9    3.68   7.79   41.4
+T+R+DTRI+V         8.8     3.37   7.84   42.6      4.7     4.37   7.85   43.3      9.9     3.74   7.80   40.8

Table 1: Quantitative evaluation across three datasets (CODa, SCAND, CityWalk) and three regimes: Seen, Counterfactual, and Unseen. We compare the baseline against progressive reward combinations (T+R, T+R+DTRI, T+R+DTRI+V). GrndCtrl's reward combinations reduce translation and rotation errors relative to the baseline in most settings while largely preserving video quality. Metrics: T (Translation Error, m), R (Rotation Error, rad), V (Video Quality), DTRI (Depth Temporal Reprojection Inliers).

                      Seen                            Counterfactual                  Unseen
Method                T↓              R↓              T↓              R↓              T↓              R↓
Baseline              73.2 ± 243.7    2.38 ± 3.88     75.8 ± 253.9    2.38 ± 3.90     71.2 ± 251.2    2.88 ± 4.28
GrndCtrl T+R 100      72.0 ± 283.6    1.95 ± 3.38     75.9 ± 311.7    1.85 ± 3.22     58.4 ± 231.1    2.57 ± 4.08
GrndCtrl T+R 150      26.7 ± 101.5    1.54 ± 2.84     24.8 ± 99.1     1.53 ± 2.80     26.5 ± 104.3    2.08 ± 3.33
GrndCtrl T+R 200      18.4 ± 68.1     1.40 ± 2.49     16.8 ± 63.0     1.36 ± 2.57     16.3 ± 56.5     1.97 ± 3.11

Table 2: Reliability analysis showing error statistics (mean ± standard deviation) across multiple stochastic rollouts for different numbers of GRPO iterations (100, 150, 200). The baseline exhibits high variance, while GRPO training progressively reduces both mean errors and variance, yielding consistent rollouts. Metrics: T (Translation Error, m), R (Rotation Error, rad).
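
As a small illustration of this protocol (our own sketch, not the authors' evaluation code), reliability can be summarized by sampling several stochastic rollouts per conditioning context and reporting the mean and standard deviation of the per-rollout errors. The sampler, error evaluator, and the choice of eight rollouts per context below are assumed placeholders.

import numpy as np

def reliability_stats(sample_rollout, eval_errors, contexts, num_rollouts=8):
    """Mean ± std of translation / rotation errors over stochastic rollouts.

    sample_rollout(context) -> rollout            : assumed stochastic world-model sampler
    eval_errors(rollout)    -> (t_err, r_err)     : assumed per-rollout error evaluator
    """
    errs = np.array([
        eval_errors(sample_rollout(ctx))
        for ctx in contexts
        for _ in range(num_rollouts)
    ])  # shape: (len(contexts) * num_rollouts, 2)
    mean, std = errs.mean(axis=0), errs.std(axis=0)
    return {"T": (mean[0], std[0]), "R": (mean[1], std[1])}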

Citation

@misc{he2025grndctrlgroundingworldmodels,
      title={GrndCtrl: Grounding World Models via Self-Supervised Reward Alignment}, 
      author={Haoyang He and Jay Patrikar and Dong-Ki Kim and Max Smith and Daniel McGann 
              and Ali-akbar Agha-mohammadi and Shayegan Omidshafiei and Sebastian Scherer},
      year={2025},
      eprint={2512.01952},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.01952}, 
}