
State-Aware Unified Scene Graphs with Vision-Language Models for Embodied Task Planning

Yuanchen Ju*1   Yongyuan Liang*2   Yen-Jen Wang*1   Nandiraju Gireesh   Yuanliang Ju3   Seungjae Lee2   Qiao Gu3   Elvis Hsieh1   Furong Huang2   Koushil Sreenath1

1University of California, Berkeley   2University of Maryland, College Park   3University of Toronto

∗ Equal Contribution, † Equal Advising

arXiv
Paper
Code
Dataset
Benchmark
HuggingFace Model


Abstract

Teaser Image

Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To overcome these shortcomings, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. To address this, we construct MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, and design MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision–language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments show that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.



Preliminary Findings and Motivation Experiments

To ground our analysis, we perform two motivating experiments on MomaGraph-Bench before the full evaluations. These comparisons are designed to validate our motivation and design principles, and to reveal why our proposed model is essential for embodied task planning. In this section, we aim to answer the following questions.


Are VLMs Reliable for Direct Planning Without Scene Graphs?

To examine whether direct planning from visual inputs is reliable even for strong closed-source VLMs, we design controlled evaluations on real-world household tasks such as "Open the window" and "Obtain clean boiled water". In these scenarios, models must reason over functional relationships, spatial constraints, and multi-step dependencies (e.g., plug-in before activation, filtration before boiling). As shown in the figure below, despite their scale, closed-source VLMs such as GPT-5 produce incorrect or incomplete plans, omitting prerequisite steps or misidentifying interaction types. In contrast, our Graph-then-Plan approach, which first generates a task-specific scene graph and then performs planning, consistently produces correct and complete action sequences aligned with the ground-truth logic. This demonstrates that incorporating structured scene representations significantly improves planning accuracy and robustness beyond what direct planning achieves.


VLM Planning Comparison

Preliminary Findings 1

• In contrast to directly relying on vision-language models for task planning from raw scene images, our Graph-then-Plan strategy, which incorporates task-oriented scene graph generation as an intermediate structured representation prior to high-level planning, substantially improves both the accuracy and robustness of task planning.


Are Single-Relationship Graphs Adequate for Embodied Agents?

To ensure a fair comparison, we retrain our model with the same graph structure as our full approach but constrain the edge types to encode only a single kind of relation, either spatial or functional. This setup isolates the effect of relation types while keeping the graph topology consistent, directly testing whether single-relation representations are sufficient for task planning. To ensure this finding generalizes beyond one specific architecture, we run this comparison across different base models using the same dataset and experimental configuration. As demonstrated in the table below, both MomaGraph-R1 (trained from Qwen2.5-VL-7B) and LLaVA-OneVision consistently perform better with unified spatial-functional scene graphs than with single-relationship variants, supporting our hypothesis that integrated representations are essential for effective embodied task planning. Detailed training methodology is described in the following section.
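The single-relation variants used in this comparison can be derived from the same unified annotations by dropping one family of edge attributes. The sketch below illustrates such filtering; the field names follow the example record shown in the dataset section, and the exact preprocessing used for retraining may differ.

# Sketch of deriving single-relation ablation graphs from the unified annotations.
# Field names follow the example record in the dataset section; the exact
# preprocessing used for retraining may differ.

def to_single_relation(record: dict, keep: str) -> dict:
    """Keep only one relation family per edge: 'spatial' or 'functional'."""
    assert keep in {"spatial", "functional"}
    edges = []
    for e in record["edges"]:
        base = {"object1": e["object1"], "object2": e["object2"]}
        if keep == "spatial":
            base["spatial_relations"] = e["spatial_relations"]
            base["is_touching"] = e["is_touching"]
        else:
            base["functional_relationship"] = e["functional_relationship"]
        edges.append(base)
    return {**record, "edges": edges}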


Single vs Unified Relationship Comparison

Preliminary Findings 2

• Graph representations that rely solely on spatial relationships or solely on functional relationships are insufficient. For embodied agents, a unified representation that jointly models both spatial and functional relationships provides a more complete and effective foundation for perception and action.


Method

VLMs Learn Scene Graph Representations with Reinforcement Learning

We use reinforcement learning with a graph-based reward to teach VLMs to build more accurate and task-relevant scene graphs. The reward checks action correctness, edge accuracy, and node completeness, while also enforcing proper format and concise output. This feedback helps the model learn compact, structured scene graphs that better support planning in embodied tasks.
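As a concrete illustration, the sketch below shows how such a composite graph reward could be computed from a predicted graph and its annotation. The weights, matching rules, and field names are assumptions for illustration, not the exact reward used to train MomaGraph-R1; the fields follow the example record shown in the dataset section.

# Illustrative sketch of a composite graph-based reward. Weights and matching
# rules are assumptions; field names follow the dataset example record.

def graph_reward(pred: dict, gt: dict,
                 w_action=0.3, w_edge=0.4, w_node=0.2, w_format=0.1) -> float:
    """Score a predicted task-oriented scene graph against its annotation."""
    # Format term: the prediction must contain the required top-level keys.
    if not {"nodes", "edges", "action_type"}.issubset(pred):
        return 0.0  # malformed outputs get no partial credit in this sketch

    # Action correctness: exact match on the predicted interaction type.
    r_action = float(pred["action_type"] == gt["action_type"])

    # Node completeness: recall of ground-truth task-relevant objects, with a
    # small penalty on extra nodes to encourage compact, task-focused graphs.
    gt_nodes, pred_nodes = set(gt["nodes"]), set(pred["nodes"])
    r_node = len(gt_nodes & pred_nodes) / max(len(gt_nodes), 1)
    r_node -= 0.1 * len(pred_nodes - gt_nodes)

    # Edge accuracy: F1 over (subject, relation, object) triplets, covering
    # both functional and spatial relations.
    def triplets(graph):
        out = set()
        for e in graph["edges"]:
            out.add((e["object1"], e["functional_relationship"], e["object2"]))
            for rel in e.get("spatial_relations", []):
                out.add((e["object1"], rel, e["object2"]))
        return out

    gt_t, pred_t = triplets(gt), triplets(pred)
    tp = len(gt_t & pred_t)
    precision = tp / max(len(pred_t), 1)
    recall = tp / max(len(gt_t), 1)
    r_edge = 2 * precision * recall / max(precision + recall, 1e-8)

    return (w_action * r_action + w_edge * r_edge
            + w_node * max(r_node, 0.0) + w_format * 1.0)

In RL fine-tuning, a scalar of this form would serve as the sequence-level reward for the policy update on each generated scene graph.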


VLM Training Process

State-Aware Dynamic Scene Graph Update

The scene graph initially encodes task-relevant objects and uncertain, one-to-many functional hypotheses. After the agent executes an action and observes the resulting state change, the update function eliminates inconsistent hypotheses and reinforces confirmed correspondences. For example, only the knob that ignites the burner retains a control edge, while others are pruned, enabling the scene graph to evolve into a compact, state-aware dynamic representation.
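The pruning step can be summarized with the minimal sketch below. The data structures and the observation interface are simplifying assumptions for illustration; edge fields follow the annotation format shown in the dataset section.

# Minimal sketch of the state-aware update step. Data structures and the
# observation interface are simplifying assumptions for illustration.

def update_scene_graph(edges: list, executed_action: dict, observed_change: dict | None) -> list:
    """Prune functional hypotheses contradicted by an observed state change.

    edges:           candidate edges, e.g. several knobs each hypothesized to
                     "control" the same burner (one-to-many hypotheses)
    executed_action: {"object": "knob_2", "action_type": "turn"}
    observed_change: {"object": "burner_1", "state": "on"}, or None if nothing changed
    """
    actor = executed_action["object"]
    target = observed_change["object"] if observed_change else None
    updated = []
    for e in edges:
        is_control = e["functional_relationship"] == "control"
        if is_control and e["object1"] == actor:
            # Keep only the actor's hypothesis that matches the observed effect.
            if target is not None and e["object2"] == target:
                updated.append({**e, "verified": True})
            # Otherwise the hypothesis is inconsistent with the outcome: prune it.
        elif is_control and target is not None and e["object2"] == target:
            # Competing controllers of the confirmed target are pruned
            # (assuming a one-to-one control correspondence).
            continue
        else:
            # Edges untouched by this action are kept; their hypotheses are
            # resolved by later interactions.
            updated.append(e)
    return updated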


Dynamic Scene Graph Update Process


Dataset

We introduce MomaGraph-Scenes, the first dataset designed to provide a more comprehensive and task-relevant scene representation. Our dataset jointly encodes spatial relationships and functional relationships, explicitly representing interactive elements such as handles and buttons. It consists of approximately 1,050 task-oriented subgraphs and 6,278 multi-view RGB images, drawn from manually collected real-world data, re-annotated existing datasets, and simulated environments built with AI2-THOR. These samples span more than 350 diverse household scenes and encompass 93 distinct task instructions. Compared with prior datasets, our annotations are significantly more detailed, capturing interaction semantics at both the object and part levels. This broad coverage ensures rich variability in scene layouts, object configurations, and interaction types, supporting robust learning and evaluation of embodied reasoning.


Dataset Visualization



Unified Spatial-Functional Scene Graph

"task_instruction": "Turn on the ceiling light.", "nodes": ["ceiling light", "light switch"], "edges": [ { "functional_relationship": "control", "object1": "light switch", "object2": "ceiling light", "spatial_relations": ["left_of", "lower_than", "in_front_of"], "is_touching": false }], "action_type": "press", "function_type": "device_control"

MomaGraph Benchmark and Evaluation

We introduce MomaGraph-Bench, the first benchmark that jointly evaluates fine-grained scene understanding and task planning abilities across diverse levels of difficulty. Our design principle for MomaGraph-Bench is to evaluate whether advances in scene understanding provide tangible improvements in downstream task planning and reasoning. Our evaluation framework examines six essential reasoning capabilities across four tiers of difficulty: (1) Action Sequence Reasoning, (2) Spatial Reasoning, (3) Object Affordance Reasoning, (4) Precondition & Effect Reasoning, (5) Goal Decomposition, and (6) Visual Correspondence (with concrete examples shown in the figure below). MomaGraph-Bench is formulated as a multiple-choice VQA task comprising 294 diverse indoor scenes with 1,446 multi-view images and 352 task-oriented scene graphs, spanning 1,315 instances that range from simple object manipulation (Tier 1) to complex multi-step planning (Tier 4). MomaGraph-Bench offers the most comprehensive assessment of embodied agents' capacity to generalize across tasks and scenarios. To ensure that the evaluation truly reflects generalization rather than memorization, all scenarios are drawn from entirely unseen environments.


MomaGraph Benchmark Examples

Experimental Results

To rigorously assess embodied planning, we compare MomaGraph-R1 against other models across all task tiers in MomaGraph-Bench, including state-of-the-art closed-source models (Claude-4-Sonnet, GPT-5, Gemini-2.5-Pro) and leading open-source models (InstructBLIP, LLaVA-V1.5, DeepSeek-VL2, InternVL2.5, LLaVA-OneVision, Qwen2.5). We further examine whether Graph-then-Plan brings performance gains by evaluating each model under two controlled settings: (i) Direct Plan (w/o Graph): the model is directly evaluated on task planning in MomaGraph-Bench using multi-view observations and instructions; (ii) Graph-then-Plan (w/ Graph): the model first generates a task-oriented scene graph, capturing nodes, spatial and functional edges, and action types, and then performs task planning conditioned on the graph.
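The sketch below summarizes the two settings. Here query_vlm is a stand-in for whichever model API is being evaluated, and the prompt wording is illustrative rather than the exact benchmark prompts; for Graph-then-Plan, the two stages can also be collapsed into a single turn in which the model first emits the graph and then the answer.

# Sketch of the two evaluation settings. `query_vlm` stands in for the model
# under test; prompt wording is illustrative, not the exact benchmark prompts.

def direct_plan(query_vlm, images, instruction, choices):
    """Direct Plan (w/o Graph): answer the planning question from raw observations."""
    prompt = f"Task: {instruction}\nChoose the correct plan:\n" + "\n".join(choices)
    return query_vlm(images=images, prompt=prompt)

def graph_then_plan(query_vlm, images, instruction, choices):
    """Graph-then-Plan (w/ Graph): predict a task-oriented scene graph first,
    then answer the planning question conditioned on that graph."""
    graph_prompt = (
        f"Task: {instruction}\n"
        "List the task-relevant objects (nodes), their spatial and functional "
        "relationships (edges), and the required action type as JSON."
    )
    scene_graph = query_vlm(images=images, prompt=graph_prompt)

    plan_prompt = (
        f"Task: {instruction}\n"
        f"Task-oriented scene graph:\n{scene_graph}\n"
        "Using the graph above, choose the correct plan:\n" + "\n".join(choices)
    )
    return query_vlm(images=images, prompt=plan_prompt)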


MomaGraph Benchmark Results

Experiment 1: Result Analysis

The results yield several key insights:


(1) Effectiveness of Graph-then-Plan

Across all models, the w/ Graph setting consistently outperforms the w/o Graph baseline, demonstrating that explicitly structuring task-oriented scene graphs provides a tangible benefit for downstream planning. This validates our central hypothesis that disentangling scene representation from action generation improves reasoning reliability.

(2) Competitiveness of MomaGraph-R1

Our MomaGraph-R1 achieves performance on par with closed-source giants like Claude-4-Sonnet and GPT-5, while clearly surpassing all leading open-source VLMs. Notably, MomaGraph-R1 delivers a +11.4% relative improvement over its base model (Qwen2.5-VL-7B) under w/ Graph, highlighting the effectiveness of reinforcement learning with graph-based rewards.

(3) Scalability with Task Complexity

As task complexity increases from Tier 1 to Tier 4, the performance of most open-source baselines drops sharply, reflecting their limited ability to generalize to multi-step reasoning. In contrast, MomaGraph-R1 exhibits a much smaller degradation, preserving strong performance in Tier 3 and Tier 4. This indicates superior scalability to long-horizon planning scenarios, a crucial capability for embodied agents.

(4) General Trend Across Communities

Closed-source models still maintain the highest absolute performance, benefiting from larger-scale pretraining and proprietary data. However, the consistent gap reduction achieved by MomaGraph-R1 shows that reinforcement learning with graph-structured intermediate representations can substantially narrow the divide, offering a practical path toward competitive open-source systems.


Experiment 2: Benchmark Evaluation for Visual Correspondence

As the model learns scene representations from multi-view observations, it exhibits an emergent cross-view consistency: it can reason about the same point across different viewpoints. This capability is most evident in visual correspondence tasks. As shown in the table below, we compare model performance on visual correspondence tasks from the public BLINK benchmark and our MomaGraph-Bench.


Visual Correspondence Results

Result Analysis

Scene graph representations enhance performance universally by reducing VLM hallucinations in visual perception. By prompting models to first generate structured scene graphs (w/ Graph) and then answer questions in single-turn interactions, we force them to explicitly reason about spatial and functional relationships between objects before answering. We primarily evaluate perception on multi-view reasoning and visual correspondence tasks from BLINK, as well as multi-view correspondence in MomaGraph-Bench. Our MomaGraph-R1 achieves state-of-the-art performance among open-source VLMs, leading the best competing open-source models by 3.8% on BLINK and 4.8% on our correspondence benchmark. These results confirm that MomaGraph-R1 enables more nuanced and detailed perception of complex indoor scenes, effectively mitigating hallucinations and yielding more reliable multi-view scene perception.


Experiment 2 Conclusion

MomaGraph-R1 demonstrates superior cross-view consistency and visual correspondence capabilities, achieving 3.8% improvement on BLINK and 4.8% improvement on our correspondence benchmark. The structured scene graph representation effectively reduces VLM hallucinations and enables more reliable multi-view scene perception.


Experiment 3: Real Robot Demonstrations

To validate the effectiveness of our model in real-world settings, we deploy it on the RobotEra Q5, a bimanual humanoid platform with a mobile base. An Intel RealSense D455 camera is mounted for RGB-D perception. Importantly, all evaluation scenes are unseen, ensuring that performance reflects true generalization.


Real Robot Setup

We evaluate our model on four representative tasks: two contact interactions (opening a cabinet, opening a microwave) and two remote interactions (turning on the TV, turning off a light).


Task Demonstration



Quantitative Real-Robot Evaluation with Long-Horizon Task

To quantify performance beyond qualitative demonstrations, we conduct a systematic evaluation on a challenging long-horizon task that requires multi-step reasoning and sequential execution, across 10 independent trials.


Task Instruction

"I need better lighting. Turn on the light closest to the remote so I can find it and turn on the monitor to watch."

Long-Horizon Task Scene

Task Challenges

This task involves spatial reasoning (navigating to the light switch, locating the remote), functional understanding (the switch-light and remote-monitor relationships), state-dependent planning (lighting affects visibility), and manipulation across multiple objects.

Success Flow Analysis (N=10 trials)

Sankey diagram of trial outcomes: 70% overall success (7/10 trials), 3 failures.

Successful Demonstration

Citation

If you find this work useful for your research, please consider citing:

@article{momagraph2025,
  title={MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Models for Embodied Task Planning},
  author={[Author Names]},
  journal={[Conference/Journal Name]},
  year={2025}
}