
Embodied Agent Interface: A Single Line to Evaluate LLMs for Embodied Decision Making

Manling Li1,†, Shiyu Zhao1,†, Qineng Wang1,†, Kangrui Wang1,†, Yu Zhou1,†,
1Stanford University, 2Amazon, 3MIT
†Equal contribution

Embodied Agent Interface aims to tackle the following challenges in evaluating LLMs for building embodied decision-making agents: (1) Standardization of goal specifications. (2) Standardization of modules and interfaces. (3) Broad coverage of evaluation and fine-grained metrics.


Empirical Findings

  1. Goal Interpretation:
    • LLMs struggle to translate natural language instructions into grounded states.
    • Common errors include generating intermediate goals and omitting spatial relationship goals.
    • Gemini 1.5 Pro has the highest goal interpretation performance, while Claude-3 Opus excels in goal retrieval rate.
    • Proprietary LLMs make fewer grammar errors compared to open-source LLMs.
    Table 3: All goal evaluation results (%) for goal interpretation. Columns are grouped by goal type (State, Spatial, Action, Overall), each reporting Precision, Recall, and F1; within each metric, the paired values are VirtualHome (V) followed by BEHAVIOR (B), and a dash (-) means the goal type does not exist in that simulator. A minimal sketch of this set-based metric computation appears at the end of the Empirical Findings.
    Model Name State: Precision Recall F1 Spatial: Precision Recall F1 Action: Precision Recall F1 Overall: Precision Recall F1
    V B V B V B V B V B V B V B V B V B V B V B V B
    Claude-3 Haiku 21.8 22.8 58.9 93.5 31.8 36.7 24.2 64.5 50.8 64.6 32.8 64.6 12.2 - 95.7 - 21.6 - 18.0 41.5 63.2 71.2 28.0 52.5
    Claude-3 Sonnet 23.3 36.8 57.1 88.9 33.1 52.0 26.6 76.2 53.0 79.8 35.5 77.9 12.4 - 85.8 - 21.7 - 19.3 60.2 61.5 81.9 29.4 69.4
    Claude-3 Opus 27.0 72.6 66.9 93.5 38.5 81.7 22.6 75.2 46.8 79.2 30.5 77.1 14.5 - 92.6 - 25.1 - 20.7 72.2 65.0 82.5 31.4 77.0
    Cohere Command R 51.1 7.7 69.6 31.4 58.9 12.4 34.5 56.8 21.3 55.0 26.3 55.9 3.6 - 38.9 - 6.5 - 27.4 28.2 55.7 49.6 36.7 36.0
    Cohere Command R+ 20.9 23.3 52.0 79.1 29.8 36.0 17.9 66.7 15.2 61.5 16.4 64.0 10.4 - 82.6 - 18.5 - 14.9 42.0 44.5 65.5 22.4 51.2
    Gemini 1.0 Pro 25.3 27.4 57.9 81.1 34.9 41.0 17.0 75.2 20.6 70.4 18.6 72.7 9.9 - 68.7 - 17.2 - 16.2 51.0 45.2 72.8 23.8 60.0
    Gemini 1.5 Flash 23.6 55.8 57.9 94.1 33.5 70.1 19.8 76.6 21.1 76.7 20.5 76.7 13.5 - 90.1 - 23.5 - 18.2 69.7 50.8 80.7 26.8 74.8
    Gemini 1.5 Pro 45.4 94.0 49.1 92.8 47.2 93.4 40.0 74.4 9.7 76.7 15.6 75.6 26.8 - 80.9 - 40.3 - 35.2 78.8 41.1 80.4 37.9 79.6
    GPT-3.5-turbo 22.4 52.0 50.0 66.7 30.9 58.5 8.5 51.5 18.8 46.9 11.7 49.1 15.2 - 60.5 - 24.4 - 15.7 49.5 40.5 51.4 22.7 50.4
    GPT-4-turbo 28.6 70.4 58.5 86.9 38.4 77.8 24.7 77.5 32.9 76.4 28.2 76.9 19.0 - 82.1 - 30.9 - 24.0 75.6 53.8 78.8 33.2 77.2
    GPT-4o 29.0 67.1 60.0 94.8 39.1 78.6 31.5 81.1 43.6 78.5 36.6 79.8 20.5 - 85.8 - 33.1 - 26.4 76.5 59.1 82.2 36.5 79.2
    Llama3 8B 21.7 17.3 54.4 80.4 31.0 28.4 14.0 51.4 7.4 20.8 9.7 29.6 11.1 - 79.4 - 19.4 - 15.5 24.1 41.9 34.3 22.6 28.3
    Llama3 70B 23.9 69.5 61.2 95.4 34.3 80.4 22.6 70.0 37.5 73.3 28.2 71.6 11.2 - 88.8 - 19.8 - 17.5 64.7 58.0 78.3 26.9 70.9
    Mistral Large 23.6 63.5 59.1 92.2 32.8 75.2 23.7 75.1 40.3 76.2 29.8 75.6 11.2 - 84.0 - 19.7 - 17.5 69.6 57.1 79.8 26.8 74.3
    Mixtral 8x22B MoE 23.6 22.9 56.9 83.7 33.4 36.0 22.2 70.7 36.3 67.7 27.5 69.2 11.2 - 94.8 - 20.0 - 17.4 44.4 56.2 71.3 26.6 54.7

  2. Action Sequencing:
    • Reasoning ability needs improvement; trajectory feasibility errors are common.
    • GPT-4o has the highest goal and execution success rates.
    • SOTA LLMs make fewer grammar errors.
    • Common runtime errors include missing steps and wrong order.
    • LLMs perform better with state goals than relation goals; struggle with complex action goals.
    • Task complexity affects goal success rate.
    Table 4: Trajectory evaluation results (%) for action sequencing. Goal Evaluation reports the goal success rate (Goal SR); Trajectory Evaluation reports the execution success rate (Execution SR), grammar errors (↓: Parsing, Hallucination, Action-Arg Num), and runtime errors (↓: Wrong Order, Missing Step, Affordance, Additional Step). For each metric, the paired values are VirtualHome (V) followed by BEHAVIOR (B).
    Model Name Goal SR Execution SR Grammar Errors (↓): Parsing Hallucination Action-Arg Num Runtime Errors (↓): Wrong Order Missing Step Affordance Additional Step
    V B V B V B V B V B V B V B V B V B
    Claude-3 Haiku 43.6 26.0 51.5 32.0 0.0 0.0 4.9 6.0 0.3 0.0 0.0 7.0 42.0 54.0 1.3 1.0 1.6 1.0
    Claude-3 Sonnet 65.2 44.0 68.9 57.0 0.0 0.0 5.6 1.0 0.7 0.0 0.7 11.0 22.3 19.0 2.0 11.0 0.7 2.0
    Claude-3 Opus 65.9 51.0 64.9 59.0 0.0 0.0 14.1 0.0 0.0 0.0 1.3 3.0 19.0 35.0 0.7 3.0 1.3 2.0
    Gemini 1.0 Pro 33.1 27.0 36.7 32.0 0.7 7.0 9.2 3.0 10.5 6.0 0.3 13.0 40.7 35.0 2.0 4.0 4.6 4.0
    Gemini 1.5 Flash 61.0 40.0 65.9 52.0 0.0 0.0 2.0 0.0 0.3 0.0 0.3 5.0 30.8 42.0 0.7 1.0 1.3 2.0
    Gemini 1.5 Pro 75.1 42.0 82.0 54.0 0.3 0.0 1.6 0.0 0.3 0.0 0.0 6.0 14.8 39.0 1.0 1.0 0.7 2.0
    GPT-3.5-turbo 25.9 16.0 40.7 20.0 0.0 4.0 4.3 7.0 17.7 23.0 0.0 1.0 33.1 36.0 4.3 8.0 1.3 1.3
    GPT-4-turbo 60.7 38.0 64.6 45.0 0.0 0.0 1.6 0.0 1.6 0.0 0.0 7.0 32.1 47.0 0.0 1.0 0.7 0.0
    GPT-4o 70.2 47.0 71.8 53.0 0.0 0.0 1.3 1.0 0.7 0.0 0.0 9.0 25.3 36.0 1.0 1.0 0.3 0.0
    Cohere Command R 15.7 16.0 19.7 19.0 2.0 5.0 37.1 13.0 23.3 0.0 0.3 8.0 16.4 43.0 1.3 12.0 1.6 4.0
    Cohere Command R+ 54.8 27.0 61.0 35.0 0.0 0.0 5.9 1.0 2.6 15.0 0.0 10.0 29.5 39.0 1.0 0.0 3.9 15.0
    Mistral Large 78.4 33.0 83.9 50.0 0.0 0.0 3.3 0.0 0.3 0.0 0.0 8.0 12.5 35.0 0.0 6.0 2.0 7.0
    Mixtral 8x22B MoE 46.2 30.0 50.2 40.0 0.0 3.0 13.1 6.0 0.7 0.0 0.0 10.0 34.8 32.0 1.3 9.0 1.6 2.0
    Llama3 8B 22.3 10.0 22.3 16.0 0.0 0.0 43.6 15.0 5.6 9.0 0.0 6.0 28.5 44.0 0.0 9.0 0.0 5.0
    Llama3 70B 54.4 34.0 56.1 42.0 0.0 0.0 23.3 2.0 1.0 0.0 0.7 15.0 16.4 38.0 2.6 3.0 3.6 6.0

    Table 5: All goal success results (%) for action sequencing and subgoal decomposition, broken down by goal type. The first four column groups (State Goal, Relation Goal, Action Goal, Total) are for action sequencing and the last four for subgoal decomposition; the paired values are VirtualHome (V) followed by BEHAVIOR (B), and a dash (-) means the goal type does not exist in that simulator.
    Model Name Action Sequencing: State Goal Relation Goal Action Goal Total Subgoal Decomposition: State Goal Relation Goal Action Goal Total
    V B V B V B V B V B V B V B V B
    Claude-3 Haiku 59.4 27.0 42.8 38.7 87.8 - 61.4 35.5 89.4 26.0 82.2 34.8 71.6 - 83.1 32.4
    Claude-3 Sonnet 79.5 41.0 67.2 59.8 85.1 - 77.2 54.6 89.1 37.0 89.3 49.8 83.3 - 88.0 46.3
    Claude-3 Opus 63.7 45.0 71.1 53.0 77.0 - 69.1 50.8 92.4 43.0 88.6 41.6 83.3 - 89.1 42.0
    Gemini 1.0 Pro 52.5 28.0 32.8 32.0 77.0 - 52.6 30.9 84.4 26.0 61.5 31.1 72.8 - 73.5 29.7
    Gemini 1.5 Flash 79.5 34.0 63.3 50.0 88.5 - 76.9 45.6 93.5 44.0 88.3 36.0 92.0 - 91.3 38.2
    Gemini 1.5 Pro 81.7 41.0 76.7 43.2 89.2 - 82.0 42.6 91.2 31.0 72.5 37.1 89.5 - 83.9 35.4
    GPT-3.5-turbo 29.1 20.0 15.6 22.6 64.2 - 33.7 21.9 84.7 28.0 54.4 28.5 64.8 - 69.4 28.3
    GPT-4-turbo 74.8 39.0 72.2 39.5 89.9 - 77.7 39.3 93.5 45.0 84.2 46.1 90.7 - 89.5 45.8
    GPT-4o 83.1 49.0 71.1 45.5 89.9 - 81.2 46.5 92.1 50.0 84.2 53.2 93.2 - 89.4 52.3
    Cohere Command R 18.4 20.0 31.1 25.9 48.0 - 29.4 24.3 85.3 20.0 67.4 21.4 60.5 - 73.6 21.0
    Cohere Command R+ 70.1 28.0 57.2 32.0 85.8 - 70.1 30.9 89.4 34.0 66.8 29.6 75.9 - 78.3 30.8
    Mistral Large 81.7 38.5 78.3 41.2 91.9 - 83.2 40.4 92.9 33.0 71.5 35.6 90.1 - 84.4 34.9
    Mixtral 8x22B MoE 48.9 30.0 50.0 36.8 89.2 - 59.1 35.0 92.1 30.0 74.8 34.1 87.7 - 84.8 33.0
    Llama3 8B 26.6 16.0 20.6 23.7 32.4 - 26.2 21.6 68.8 21.0 54.7 23.6 50.0 - 59.8 22.9
    Llama3 70B 42.8 31.0 61.1 45.5 75.0 - 56.1 41.5 93.2 25.0 63.4 27.7 82.7 - 80.0 27.0

  3. Subgoal Decomposition:
    • Subgoal decomposition is not necessarily easier than action sequencing, even though it operates over an abstract action space.
    • GPT-4o and Gemini 1.5 Flash show superior performance.
    • SOTA models avoid grammar errors but can hallucinate actions and objects.
    • Common runtime errors: additional steps in VirtualHome, missing steps in BEHAVIOR.
    • LLMs show higher accuracy in action goals in VirtualHome; state and relation goals in BEHAVIOR are challenging.
    • Performance is lower in BEHAVIOR due to complex task representations.
    Table 6: All trajectory evaluation results (%) for subgoal decomposition. Goal Evaluation reports the goal success rate (Goal SR); Trajectory Evaluation reports the execution success rate (Execution SR), grammar errors (↓: Parsing, Hallucination, Action-Arg Num), and runtime errors (↓: Wrong Order, Missing Step, Affordance, Additional Step). For each metric, the paired values are VirtualHome (V) followed by BEHAVIOR (B).
    Model Name Goal SR Execution SR Grammar Errors (↓): Parsing Hallucination Action-Arg Num Runtime Errors (↓): Wrong Order Missing Step Affordance Additional Step
    V B V B V B V B V B V B V B V B V B
    Claude-3 Haiku 78.4 29.0 82.8 35.0 0.3 0.0 2.4 1.0 1.8 0.0 1.8 2.0 2.7 59.0 8.3 3.0 20.4 3.0
    Claude-3 Sonnet 83.1 38.0 86.4 43.0 0.0 2.0 1.8 0.0 0.0 2.0 0.6 3.0 2.7 51.0 8.6 1.0 33.7 3.0
    Claude-3 Opus 87.0 39.0 90.0 47.0 0.3 0.0 3.6 3.0 0.0 0.0 1.2 5.0 3.0 45.0 2.4 0.0 16.0 5.0
    Gemini 1.0 Pro 70.4 23.0 84.6 33.0 0.6 2.0 3.3 4.0 2.4 0.0 1.2 3.0 2.7 51.0 5.3 7.0 10.4 3.0
    Gemini 1.5 Flash 89.1 34.0 94.1 42.0 0.0 2.0 1.5 1.0 0.0 0.0 0.6 2.0 3.9 53.0 0.0 0.0 13.3 3.0
    Gemini 1.5 Pro 87.0 31.0 91.1 37.0 0.0 1.0 1.5 0.0 1.8 1.0 0.0 3.0 5.6 59.0 0.0 0.0 16.0 2.0
    GPT-3.5-turbo 69.2 24.0 81.4 36.0 1.5 2.0 0.0 3.0 0.6 0.0 1.5 3.0 11.8 52.0 3.3 4.0 20.4 3.0
    GPT-4-turbo 85.5 37.0 94.1 47.0 0.0 0.0 1.8 3.0 0.0 0.0 1.5 9.0 2.4 40.0 0.3 1.0 22.2 6.0
    GPT-4o 88.8 48.0 90.2 55.0 0.0 0.0 6.2 3.0 0.0 0.0 1.2 5.0 2.4 37.0 0.0 0.0 15.7 5.0
    Cohere Command R 71.3 15.0 79.6 25.0 2.1 22.0 3.9 11.0 0.9 0.0 1.5 0.0 6.2 38.0 5.9 4.0 14.5 4.0
    Cohere Command R+ 79.0 24.0 83.7 37.0 1.5 2.0 4.5 4.0 2.1 0.0 0.9 5.0 7.7 51.0 2.7 1.0 16.0 6.0
    Mistral Large 84.3 30.0 92.0 38.0 0.3 1.0 1.8 3.0 0.3 0.0 2.1 4.0 3.3 52.0 0.3 2.0 11.0 1.0
    Mixtral 8x22B MoE 80.5 27.0 90.2 33.0 0.3 0.0 2.4 4.0 0.0 0.0 3.0 2.0 3.9 59.0 0.3 2.0 11.2 0.0
    Llama3 8B 48.8 21.0 58.0 29.0 0.6 2.0 2.4 11.0 0.6 0.0 6.8 6.0 5.0 44.0 26.6 8.0 18.3 7.0
    Llama3 70B 78.4 20.0 87.3 30.0 0.0 1.0 2.4 5.0 0.9 1.0 2.4 8.0 5.3 51.0 1.8 4.0 20.4 4.0

  4. Transition Modeling:
    • Models excel in specific categories like object states and orientation.
    • Non-spatial relations consistently pose a challenge.
    • Planning effectiveness relies on consistency in predicted action space.
  5. Sensitivity Analysis:
    • Actions like "plug_in" and "walk_towards" show low success rates.
    • Complex interactions like "slice_carvingknife" and "place_inside" present challenges.
    • Training regimens may not fully capture real-world interaction diversity.
  6. Pipeline-Based vs. Modularized:
    • Both methods achieve similar trajectory executability rates.
    • Pipeline-based methods suffer from error accumulation.
    • SOTA LLMs avoid grammar errors; less advanced models do not.
    • All LLMs are prone to runtime errors, missing necessary steps.
    Table: Pipeline-based evaluation results (%) for (1) \(\mathcal{G}+\mathcal{Q}\) and (2) \(\mathcal{G}+\Phi\) in BEHAVIOR. \(\mathcal{G}\): Goal Interpretation. \(\mathcal{Q}\): Action Sequencing. \(\Phi\): Subgoal Decomposition. Goal Evaluation reports the goal success rate (Goal SR); Trajectory Evaluation reports the execution success rate (Execution SR), grammar errors (↓: Parsing, Hallucination, Action-Arg Num), and runtime errors (↓: Wrong Order, Missing Step, Affordance, Additional Step). For each metric, the paired values are modularized (M) followed by pipeline-based (P).
    Model Name Goal SR Execution SR Grammar Errors (↓): Parsing Hallucination Action-Arg Num Runtime Errors (↓): Wrong Order Missing Step Affordance Additional Step
    M P M P M P M P M P M P M P M P M P
    Goal Interpretation + Action Sequencing
    Claude-3 Haiku 26.0 21.0 32.0 29.0 0.0 0.0 6.0 6.0 0.0 0.0 7.0 6.0 54.0 52.0 1.0 7.0 1.0 17.0
    Claude-3 Sonnet 44.0 41.0 57.0 53.0 0.0 0.0 1.0 3.0 0.0 0.0 11.0 14.0 19.0 21.0 11.0 9.0 2.0 12.0
    Claude-3 Opus 51.0 46.0 59.0 54.0 0.0 1.0 0.0 1.0 0.0 0.0 3.0 6.0 35.0 35.0 3.0 3.0 2.0 4.0
    Gemini 1.0 Pro 27.0 26.0 32.0 35.0 7.0 5.0 3.0 3.0 6.0 6.0 13.0 14.0 35.0 38.0 4.0 2.0 4.0 11.0
    Gemini 1.5 Flash 40.0 35.0 52.0 49.0 0.0 0.0 0.0 2.0 0.0 0.0 5.0 10.0 42.0 41.0 1.0 0.0 2.0 7.0
    Gemini 1.5 Pro 42.0 37.0 54.0 55.0 0.0 1.0 0.0 1.0 0.0 0.0 6.0 7.0 39.0 35.0 1.0 1.0 2.0 0.0
    GPT-3.5-turbo 16.0 14.0 20.0 32.0 4.0 1.0 7.0 3.0 23.0 15.0 1.0 5.0 36.0 39.0 8.0 6.0 1.0 3.0
    GPT-4-turbo 38.0 32.0 45.0 47.0 0.0 1.0 0.0 1.0 0.0 0.0 7.0 9.0 47.0 41.0 1.0 1.0 0.0 0.0
    GPT-4o 47.0 42.0 53.0 55.0 0.0 0.0 1.0 3.0 0.0 0.0 9.0 6.0 36.0 35.0 1.0 1.0 0.0 4.0
    Cohere Command R 16.0 5.0 19.0 9.0 5.0 3.0 13.0 38.0 0.0 1.0 8.0 8.0 43.0 31.0 12.0 12.0 4.0 8.0
    Cohere Command R+ 27.0 15.0 35.0 29.0 0.0 0.0 1.0 8.0 15.0 14.0 10.0 30.0 39.0 31.0 0.0 2.0 15.0 22.0
    Mistral Large 33.0 31.0 50.0 38.0 0.0 0.0 0.0 3.0 0.0 0.0 8.0 14.0 35.0 37.0 6.0 8.0 7.0 5.0
    Mixtral 8x22B MoE 30.0 26.0 40.0 36.0 3.0 3.0 6.0 13.0 0.0 0.0 10.0 14.0 32.0 21.0 9.0 13.0 2.0 15.0
    Llama3 8B 10.0 0.0 16.0 5.0 0.0 2.0 15.0 25.0 9.0 6.0 6.0 11.0 44.0 34.0 9.0 17.0 5.0 14.0
    Llama3 70B 34.0 26.0 42.0 40.0 0.0 1.0 2.0 3.0 0.0 0.0 15.0 18.0 38.0 35.0 3.0 5.0 6.0 9.0
    Goal Interpretation + Subgoal Decomposition
    Claude-3 Haiku 29.0 21.0 35.0 40.0 0.0 0.0 1.0 5.0 0.0 0.0 2.0 2.0 59.0 46.0 3.0 7.0 3.0 16.0
    Claude-3 Sonnet 38.0 31.0 43.0 45.0 0.0 0.0 2.0 3.0 0.0 0.0 3.0 2.0 51.0 47.0 1.0 3.0 3.0 18.0
    Claude-3 Opus 39.0 35.0 47.0 45.0 0.0 0.0 3.0 8.0 0.0 0.0 5.0 4.0 45.0 42.0 0.0 1.0 5.0 7.0
    Gemini 1.0 Pro 23.0 14.0 33.0 30.0 2.0 0.0 4.0 10.0 0.0 1.0 3.0 1.0 51.0 45.0 7.0 13.0 3.0 17.0
    Gemini 1.5 Flash 34.0 32.0 42.0 44.0 2.0 1.0 1.0 3.0 0.0 0.0 2.0 2.0 53.0 48.0 0.0 2.0 3.0 7.0
    Gemini 1.5 Pro 31.0 26.0 37.0 38.0 0.0 1.0 1.0 3.0 0.0 0.0 3.0 2.0 59.0 56.0 0.0 0.0 2.0 1.0
    GPT-3.5-turbo 24.0 14.0 36.0 27.0 2.0 0.0 3.0 12.0 0.0 22.0 3.0 1.0 52.0 32.0 4.0 6.0 3.0 5.0
    GPT-4-turbo 37.0 37.0 47.0 49.0 0.0 0.0 3.0 4.0 0.0 0.0 9.0 8.0 40.0 37.0 1.0 2.0 6.0 6.0
    GPT-4o 48.0 38.0 55.0 52.0 0.0 0.0 3.0 4.0 0.0 0.0 5.0 6.0 37.0 35.0 0.0 3.0 5.0 9.0
    Cohere Command R 15.0 8.0 25.0 15.0 21.0 13.0 11.0 32.0 0.0 1.0 0.0 1.0 38.0 32.0 4.0 6.0 4.0 12.0
    Cohere Command R+ 24.0 17.0 37.0 31.0 2.0 6.0 4.0 10.0 0.0 2.0 5.0 7.0 51.0 40.0 1.0 4.0 6.0 14.0
    Mistral Large 30.0 22.0 38.0 29.0 1.0 1.0 3.0 12.0 0.0 1.0 4.0 5.0 52.0 50.0 2.0 2.0 1.0 5.0
    Mixtral 8x22B MoE 27.0 22.0 33.0 29.0 0.0 0.0 4.0 9.0 0.0 2.0 2.0 2.0 59.0 45.0 2.0 13.0 0.0 17.0
    Llama3 8B 21.0 3.0 29.0 14.0 2.0 7.0 11.0 29.0 0.0 2.0 6.0 3.0 44.0 30.0 8.0 15.0 7.0 7.0
    Llama3 70B 20.0 19.0 30.0 31.0 1.0 1.0 5.0 22.0 1.0 1.0 8.0 7.0 51.0 35.0 4.0 3.0 4.0 7.0

  7. Replanning and Feedback:
    • Replanning based on feedback significantly improves performance.
    • Replanning can result in over-generation of actions.
    Table: Replanning evaluation results (%) for action sequencing. Goal Evaluation reports the goal success rate (Goal SR); Trajectory Evaluation reports the execution success rate (Execution SR), grammar errors (↓: Parsing, Hallucination, Action-Arg Num), and runtime errors (↓: Wrong Order, Missing Step, Affordance, Additional Step).
    Model Name Goal SR Execution SR Grammar Errors (↓): Parsing Hallucination Action-Arg Num Runtime Errors (↓): Wrong Order Missing Step Affordance Additional Step
    GPT-4o 65.2 71.8 0.0 1.3 0.7 0.0 25.3 1.0 0.3
    GPT-4o w/ replanning 77.4 83.3 0.0 1.3 0.0 0.0 14.1 0.3 0.7
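Across the tables above, goal interpretation is scored with set-based precision, recall, and F1 over grounded goal conditions, while Goal SR and Execution SR measure whether the predicted trajectory satisfies the goal and executes without runtime errors. As a minimal, illustrative sketch of the set-based metric (not the benchmark's actual scorer), the computation looks like this:

```python
def goal_interpretation_prf(predicted, ground_truth):
    """Set-based precision/recall/F1 over grounded goal conditions.

    Both arguments are sets of grounded conditions, e.g.
    {"next_to(rag.0, sink.82)", "not_stained(fridge.97)"}.
    Illustrative sketch only; the benchmark's scorer may differ in details.
    """
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```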

Abstract

Problem: We aim to evaluate Large Language Models (LLMs) for embodied decision making. While a significant body of work has leveraged LLMs for decision making in embodied environments, we still lack a systematic understanding of their performance, because they are usually applied in different domains for different purposes and built on different inputs and outputs. Furthermore, existing evaluations tend to rely solely on a final success rate, making it difficult to pinpoint which abilities are missing in LLMs and where the problems lie, which, in turn, prevents embodied agents from leveraging LLMs effectively and selectively.

Method: To address these limitations, we propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks and input-output specifications of LLM-based modules. Specifically, it allows us to unify 1) a broad set of embodied decision-making tasks involving both state and temporally extended goals, 2) four commonly used LLM-based modules for decision making: goal interpretation, subgoal decomposition, action sequencing, and transition modeling, and 3) a collection of fine-grained metrics that break down evaluation into various types of errors, such as hallucination errors, affordance errors, and different kinds of planning errors.

Conclusion: Overall, our benchmark offers a comprehensive and systematic assessment of LLMs' performance on different subtasks, pinpointing the strengths and weaknesses of LLM-powered embodied AI systems and providing insights for the effective and selective use of LLMs in embodied decision making.

Figure 1: Embodied Agent Interface unifies a broad set of tasks involving both state and temporally extended goals and four LLM-based modules for decision making.

Embodied Agent Interface


In our Embodied Agent Interface, we propose a set of ability modules to evaluate LLMs for embodied decision making. The four ability modules are: Goal Interpretation, Subgoal Decomposition, Action Sequencing, and Transition Modeling. We provide a detailed description of each module below.

Ability Module 1: Goal Interpretation

Goal Interpretation aims to ground a natural language instruction in the environment's representation of objects, states, relations, and actions. For example, the task instruction "Use the rag to clean the trays, the bowl, and the refrigerator. When you are done, leave the rag next to the sink..." can be grounded to specific objects with IDs, such as fridge (ID: 97), tray (ID: 1), bowl (ID: 1), rag (ID: 0), and sink (ID: 82). Note that a simple natural language description can be grounded into a set of multiple goal conditions (object states and relations).
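For illustration, the grounded output of this module can be thought of as a small set of goal conditions over object IDs. The structure below is a hypothetical simplification (the predicate names follow the examples in the text, not the exact benchmark schema):

```python
# Hypothetical grounded interpretation of the cleaning instruction above.
grounded_goal = {
    "object_states": ["not_stained(tray.1)", "not_stained(bowl.1)", "not_stained(fridge.97)"],
    "relations": ["next_to(rag.0, sink.82)"],
}
```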

Ability Module 2: Subgoal Decomposition

Subgoal Decomposition generates a sequence of subgoals, where each subgoal is a set of object states and relations. For the cleaning task above, the important intermediate states are next_to(rag.0, sink.82), toggled_on(sink.82), soaked(rag.0), toggled_off(sink.82), open(fridge.97), not_stained(fridge.97). To achieve these state transitions, a high-level planner such as BFS can search for the action sequence that realizes them, yielding: RIGHT_GRASP(rag.0), RIGHT_PLACE_NEXTTO(sink.82), TOGGLE_ON(sink.82), SOAK(rag.0), TOGGLE_OFF(sink.82), OPEN(fridge.97), CLEAN(fridge.97). Note that multiple actions may be required to achieve a single state transition. For example, reaching the first subgoal next_to(rag.0, sink.82) requires two actions: RIGHT_GRASP(rag.0) and RIGHT_PLACE_NEXTTO(sink.82). See Figure 2 for the input and output formulation.
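A high-level planner of the kind mentioned above can be sketched as a breadth-first search over symbolic states. The following is a simplified, illustrative implementation using set-based preconditions and effects (not the benchmark's planner):

```python
from collections import deque

def bfs_plan(initial_state, subgoal, operators):
    """Breadth-first search for the shortest action sequence that makes
    every predicate in `subgoal` true.

    `initial_state` and `subgoal` are frozensets of grounded predicates;
    `operators` maps an action name to (preconditions, add_effects,
    delete_effects), each a set of grounded predicates. Illustrative only.
    """
    queue = deque([(initial_state, [])])
    visited = {initial_state}
    while queue:
        state, plan = queue.popleft()
        if subgoal <= state:
            return plan
        for name, (pre, add, delete) in operators.items():
            if pre <= state:
                successor = frozenset((state - delete) | add)
                if successor not in visited:
                    visited.add(successor)
                    queue.append((successor, plan + [name]))
    return None  # no action sequence achieves the subgoal
```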

Figure 2: The input and output formulation of four ability modules for Embodied Agent Interface.

Ability Module 3: Action Sequencing

Action Sequencing generates the sequence of actions required to achieve the state transitions identified in Subgoal Decomposition. For example, a successful execution of the action sequence RIGHT_GRASP(rag.0), RIGHT_PLACE_NEXTTO(sink.82), TOGGLE_ON(sink.82), SOAK(rag.0), TOGGLE_OFF(sink.82), OPEN(fridge.97), CLEAN(fridge.97) is shown in Figure 3.

Ability Module 4: Transition Modeling

Transition Modeling serves as the low-level controller that guides the simulator in performing state transitions from preconditions to post-effects. For example, in the cleaning task, the input is the operator name soak, the preconditions are three states: holding(?obj1), next_to(?sink, ?agent), and toggled_on(?sink), and the post-effect after executing SOAK is soaked(?obj1).
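For illustration, the SOAK operator described above can be written down as a precondition/effect structure together with a helper that applies it to a symbolic state. This is a hedged sketch using the predicate spellings from the example; the benchmark's actual transition models are full operator definitions:

```python
# Hypothetical encoding of the SOAK operator from the example above.
SOAK = {
    "name": "soak",
    "parameters": ["?obj1", "?sink", "?agent"],
    "preconditions": {"holding(?obj1)", "next_to(?sink, ?agent)", "toggled_on(?sink)"},
    "effects": {"soaked(?obj1)"},
}

def apply_operator(operator, bindings, state):
    """Return the successor state, or None if a precondition is unmet.

    `bindings` maps parameters to object IDs, e.g. {"?obj1": "rag.0"};
    `state` is a set of grounded predicate strings. Illustrative only.
    """
    def ground(predicate):
        for variable, obj in bindings.items():
            predicate = predicate.replace(variable, obj)
        return predicate

    preconditions = {ground(p) for p in operator["preconditions"]}
    if not preconditions <= state:
        return None  # unmet precondition, e.g. an affordance error at runtime
    return state | {ground(e) for e in operator["effects"]}
```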

Figure 3: An example of successful execution in Embodied Agent Interface.

Evaluation Setup


We evaluate the performance of LLMs for embodied decision making using the Embodied Agent Interface. Below is a detailed description of the evaluation setup.

Dataset Description

Focusing on complex long-horizon tasks, we select VirtualHome (V) and BEHAVIOR (B) as our evaluation simulators based on their task length and scene complexity. Table 1 shows our annotations. Apart from the goal and trajectory annotations, we introduce the Goal Action annotation to capture necessary actions that have no post-effects, such as the goal action touch in the task “pet the cat”. In the subset of VirtualHome tasks we work on, \(80.7\%\) of the task categories include instructions with more than \(10\) action steps, and \(33\%\) of the instructions require more than \(10\) steps.

We select BEHAVIOR as the second simulator due to its task complexity. BEHAVIOR BDDL goals may contain quantifiers, such as (forpairs (?jar ?apple) (inside ?apple ?jar)), which need to be translated into grounded goals containing only atomic propositions, e.g., (and (inside apple_1 jar_1) (inside apple_2 jar_2)). Different grounded goals can satisfy the same BDDL goal, such as (and (inside apple_2 jar_1) (inside apple_1 jar_2)); we call these goal options. In general, one BDDL goal corresponds to a number of goal options: the average number of grounded goals per task is \(6.7\), and each task has \(4,164.4\) goal options on average.
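To make goal options concrete, a forpairs quantifier over equally sized object sets can be expanded by enumerating one-to-one pairings. The sketch below is a simplification of full BDDL grounding semantics:

```python
from itertools import permutations

def forpairs_goal_options(jars, apples):
    """Enumerate grounded goal options for
    (forpairs (?jar ?apple) (inside ?apple ?jar)).

    Each option is one conjunction of atomic propositions satisfying the
    quantified goal. Simplified illustration of BDDL grounding.
    """
    options = []
    for paired_apples in permutations(apples):
        options.append([f"(inside {apple} {jar})"
                        for jar, apple in zip(jars, paired_apples)])
    return options

# Two jars and two apples yield the two goal options mentioned above.
print(forpairs_goal_options(["jar_1", "jar_2"], ["apple_1", "apple_2"]))
```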

Table 1: Simulator dataset statistics. New annotations collected in this paper are highlighted in color.
VirtualHome BEHAVIOR
#task name 26 100
#task instruction 338 100
#goal 801 673
   - #state 340 153
   - #relation 299 520
   - #action 162 -
#trajectory 338 100
   - #step 2960 1460
   - avg. step 8.76 14.6
#transition model 33 30
   - #precondition 99 84
   - #effect 57 51

Each instance in the dataset represents a task goal. Specifically, each task contains the following data:

  • Natural language task name
  • Natural language task instruction
  • Symbolic goal definition (including its LTL form)
  • Symbolic action trajectory
  • The transition models involved in the task

For tasks in the BEHAVIOR environment, the dataset also includes accompanying VR human demonstration videos that showcase the execution of the ground truth action trajectories.
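Putting these fields together, one task instance has roughly the following shape. This is a hypothetical illustration only; the authoritative schema is the JSON format linked below:

```python
# Hypothetical shape of a single task instance (not the exact schema).
task_instance = {
    "task_name": "clean the refrigerator",
    "task_instruction": "Use the rag to clean the trays, the bowl, and the refrigerator. ...",
    "symbolic_goal": ["not_stained(fridge.97)", "next_to(rag.0, sink.82)"],
    "symbolic_goal_ltl": "F (not_stained(fridge.97) and next_to(rag.0, sink.82))",
    "action_trajectory": ["RIGHT_GRASP(rag.0)", "RIGHT_PLACE_NEXTTO(sink.82)", "..."],
    "transition_models": ["soak", "toggle_on", "clean"],
}
```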

Figure 4: VirtualHome dataset structure example.
Figure 5: BEHAVIOR dataset structure example.

Please find our JSON data format in this link: Dataset JSON Format

LLM Implementations

We integrated our evaluation pipeline into the HELM code base for easy and reproducible LLM inference; users can set up their environment by following the setup instructions here. We standardized decoding parameters across all models, using temperature zero for \(\operatorname*{arg\,max}\) sampling (a minimal sketch follows Table 2). Evaluating all models on our benchmark required \(180\) runs. Detailed model information is provided in the table below.

Table 2: Model Cards for All Evaluated Large Language Models
Model Name Creator Complete Model ID Release Hosting
Claude-3 Haiku Anthropic claude-3-haiku-20240307 03/07/24 Anthropic
Claude-3 Sonnet Anthropic claude-3-sonnet-20240229 02/29/24 Anthropic
Claude-3 Opus Anthropic claude-3-opus-20240229 02/29/24 Anthropic
Cohere Command R Cohere command-r 03/11/24 Cohere
Cohere Command R+ Cohere command-r-plus 04/04/24 Cohere
Gemini 1.0 Pro Google gemini-pro 12/13/23 GCP Vertex
Gemini 1.5 Flash Google gemini-1.5-flash-preview-0514 05/14/24 GCP Vertex
Gemini 1.5 Pro Google gemini-1.5-pro-preview-0409 04/09/24 GCP Vertex
GPT-3.5-turbo OpenAI gpt-3.5-turbo-0125 01/25/24 OpenAI
GPT-4-turbo OpenAI gpt-4-turbo-2024-04-09 04/09/24 OpenAI
GPT-4o OpenAI gpt-4o-2024-05-13 05/13/24 OpenAI
Llama3 8B Instruct Meta meta-llama-3-8b-instruct 04/18/24 TogetherAI
Llama3 70B Instruct Meta meta-llama-3-70b-instruct 04/18/24 TogetherAI
Mistral Large MistralAI mistral-large-2402 02/26/24 MistralAI
Mixtral 8x22B MoE MistralAI mixtral-8x22b-instruct-v0.1 04/17/24 TogetherAI
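As a minimal sketch of the temperature-zero decoding mentioned above (the benchmark itself runs models through HELM rather than calling provider SDKs directly), a greedy-decoding request to one of the models in Table 2 looks like this:

```python
# Illustrative greedy-decoding request; assumes the OpenAI Python SDK and
# an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-2024-05-13",  # model ID from Table 2
    temperature=0,              # arg-max (greedy) sampling
    messages=[{"role": "user",
               "content": "Ground the goal of the task: pet the cat."}],
)
print(response.choices[0].message.content)
```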