Episodic Memory Verbalization using Hierarchical Representations of Life-Long Robot Experience




Abstract

Verbalization of robot experience, i.e., summarization of and question answering about a robot's past, is a crucial ability for improving human-robot interaction. Previous works applied rule-based systems or fine-tuned deep models to verbalize short (several-minute-long) streams of episodic data, limiting generalization and transferability. In our work, we apply large pretrained models to tackle this task with zero or few examples, and specifically focus on verbalizing life-long experiences. For this, we derive a tree-like data structure from episodic memory (EM), with lower levels representing raw perception and proprioception data, and higher levels abstracting events to natural language concepts. Given such a hierarchical representation built from the experience stream, we apply a large language model as an agent to interactively search the EM given a user's query, dynamically expanding (initially collapsed) tree nodes to find the relevant information. The approach keeps computational costs low even when scaling to months of robot experience data. We evaluate our method on simulated household robot data, human egocentric videos, and real-world robot recordings, demonstrating its flexibility and scalability.


Method

Our goal is to enable an artificial agent to verbalize and answer questions about its past. Given the continuous, multimodal stream of experiences of a robot agent, we build up a hierarchical and interpretable representation of EM. The lower levels of this hierarchy are predefined and span raw experiences, events, and planning-level goals. Higher levels are then constructed by recursively summarizing the representations of the previous level, thus building an EM that spans hours, days, weeks, or months.
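As a minimal sketch of this bottom-up construction, the Python snippet below builds higher levels by recursively summarizing groups of lower-level nodes. The names `EMNode`, `summarize`, and the grouping size are illustrative assumptions, not the paper's actual implementation; `summarize` stands in for whatever LLM summarization call the system uses.

```python
# Sketch: hierarchical episodic-memory tree built by recursive summarization.
# All names here are illustrative placeholders, not the paper's API.
from dataclasses import dataclass, field
from typing import List


@dataclass
class EMNode:
    """One node of the EM tree.

    Leaves hold low-level content (raw experiences, events, planning-level
    goals); inner nodes hold natural-language summaries of their children.
    """
    summary: str                                   # abstraction of this subtree
    children: List["EMNode"] = field(default_factory=list)


def summarize(texts: List[str]) -> str:
    """Placeholder for an LLM call that condenses several texts into one."""
    return " / ".join(texts)  # a real system would prompt an LLM here


def build_level(nodes: List[EMNode], group_size: int = 4) -> List[EMNode]:
    """Build the next-higher level by summarizing adjacent groups of nodes."""
    parents = []
    for i in range(0, len(nodes), group_size):
        group = nodes[i:i + group_size]
        parents.append(EMNode(summarize([n.summary for n in group]), group))
    return parents


def build_tree(leaves: List[EMNode]) -> EMNode:
    """Recursively summarize until a single root (e.g., months of data) remains."""
    level = leaves
    while len(level) > 1:
        level = build_level(level)
    return level[0]
```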

When a user later asks a question, our system needs to retrieve the relevant information from the EM to respond to the user's query. This is done by an LLM that interactively explores the history tree to gather relevant information. We model this process using an "LLM as agent" approach: the tree representation of the agent's history is initially in a collapsed state, i.e., only the top-level node's content is visible. Given the user's query and the history tree, an LLM is iteratively asked to invoke functions to gather the relevant information, or to respond with the answer once sufficient details have been collected. Specifically, the LLM can interactively expand certain nodes of the tree to search for relevant details, ask helper models to inspect low-level information (e.g., to perform visual question answering on images), or perform other computations in a Python console.
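The sketch below illustrates one way such a retrieval loop could look, reusing the `EMNode` tree from the previous sketch. `call_llm`, the dotted node ids, and the tool names are assumed placeholders rather than the system's real interface.

```python
# Sketch: "LLM as agent" retrieval loop over a collapsed EM tree.
# `call_llm` is a placeholder for a chat-completion call that returns a JSON
# tool invocation; tool names mirror the abilities described above.
import json


def render_collapsed(node: "EMNode", node_id: str = "0") -> str:
    """Show a node's summary plus the ids of its (still collapsed) children."""
    child_ids = [f"{node_id}.{i}" for i in range(len(node.children))]
    return f"[{node_id}] {node.summary} (children: {child_ids})"


def get_node(root: "EMNode", node_id: str) -> "EMNode":
    """Resolve a dotted id like '0.2.1' to a node in the tree."""
    node = root
    for idx in node_id.split(".")[1:]:
        node = node.children[int(idx)]
    return node


def answer_query(root: "EMNode", query: str, call_llm, max_steps: int = 10) -> str:
    """Iteratively let the LLM expand nodes until it can answer the query."""
    context = [render_collapsed(root)]            # initially only the root is visible
    for _ in range(max_steps):
        action = json.loads(call_llm(query, context))  # e.g. {"tool": "expand", ...}
        if action["tool"] == "expand":
            node = get_node(root, action["node_id"])
            context += [render_collapsed(c, f"{action['node_id']}.{i}")
                        for i, c in enumerate(node.children)]
        elif action["tool"] == "answer":
            return action["text"]
    return "No answer found within the step budget."
```

Because only the expanded nodes ever enter the LLM's context, the amount of text the model sees stays bounded regardless of how long the underlying experience stream is, which is what keeps computational costs low when scaling to months of data.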


Real-world demonstrations



Evaluation samples

Here, we show traces of the LLM dynamically exploring the history tree from our experiments in the paper. Select the dataset on the left and the sample on the right. For Ego4D and the real-world robot experiments, this is the full set of evaluation samples from the paper. For TEACh, this is a small selection showcasing some success and failure cases.



Citation

[arXiv version]

This website is based on Jon Barron's source code.