To enhance retrieval quality, we move beyond single-modal approaches and adopt multi-modal retrieval between motions and captions, jointly considering motion and textual similarities to select suitable references from a database. We introduce a novel fine-grained text-motion retrieval method that captures body-part-level motion features, enabling more precise alignment with textual descriptions. Unlike existing methods such as TMR and MotionPatches, which encode the full body into a single embedding with a shared motion encoder, our approach explicitly models inter-part relationships by dividing the body into distinct parts and encoding each with a separate lightweight encoder. This part-based design captures finer motion details aligned with specific action semantics, leading to improved retrieval accuracy.
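As a concrete illustration, the sketch below shows one possible form of such a part-based motion encoder used for cross-modal retrieval. The body-part split, per-part feature dimensions, GRU-based part encoders, and the placeholder text embedding are all illustrative assumptions, not the exact architecture described here.

```python
# Minimal sketch of part-based motion encoding for text-motion retrieval.
# Part slices, dimensions, and encoder sizes are illustrative assumptions.
import torch
import torch.nn as nn

# Hypothetical split of 60-dim per-frame joint features into five body parts.
PART_SLICES = {
    "torso":     slice(0, 12),
    "left_arm":  slice(12, 24),
    "right_arm": slice(24, 36),
    "left_leg":  slice(36, 48),
    "right_leg": slice(48, 60),
}

class PartMotionEncoder(nn.Module):
    """Encodes each body part with its own lightweight GRU, then fuses parts."""
    def __init__(self, part_dims, hidden=64, embed=256):
        super().__init__()
        self.encoders = nn.ModuleDict({
            name: nn.GRU(dim, hidden, batch_first=True)
            for name, dim in part_dims.items()
        })
        # Fusion layer models inter-part relationships on top of part features.
        self.fuse = nn.Sequential(
            nn.Linear(hidden * len(part_dims), embed),
            nn.ReLU(),
            nn.Linear(embed, embed),
        )

    def forward(self, motion):  # motion: (B, T, D) per-frame joint features
        part_feats = []
        for name, enc in self.encoders.items():
            _, h = enc(motion[:, :, PART_SLICES[name]])  # h: (1, B, hidden)
            part_feats.append(h.squeeze(0))
        return self.fuse(torch.cat(part_feats, dim=-1))  # (B, embed)

# Usage: embed a batch of motions and rank them against a text embedding
# (a text encoder sharing the same embedding space is assumed to exist).
encoder = PartMotionEncoder({n: s.stop - s.start for n, s in PART_SLICES.items()})
motions = torch.randn(8, 120, 60)            # 8 clips, 120 frames, 60-dim features
motion_emb = nn.functional.normalize(encoder(motions), dim=-1)
text_emb = nn.functional.normalize(torch.randn(1, 256), dim=-1)  # placeholder
scores = text_emb @ motion_emb.t()           # cosine similarity for retrieval
top_k = scores.topk(3, dim=-1).indices       # indices of retrieved references
```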

We compare text-to-text retrieval with text-to-motion retrieval and find that retrieving through the motion modality achieves better results than text-to-text retrieval. The retrieved samples are then included in the prompts to provide effective guidance during generation for both motion generation and motion captioning, as sketched below.
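The snippet below is a minimal sketch of how retrieved motion-caption pairs might be formatted as in-context examples in a prompt; the template, field names, and motion-token placeholders are hypothetical and do not reproduce the exact prompt used here.

```python
# Sketch of retrieval-augmented prompting for motion captioning.
# The prompt template and token placeholders are illustrative assumptions.
def build_captioning_prompt(retrieved, query_motion_tokens):
    """Insert retrieved motion-caption pairs as in-context examples."""
    lines = ["Describe the motion. Here are similar examples:"]
    for i, (motion_tokens, caption) in enumerate(retrieved, 1):
        lines.append(f"Example {i}:")
        lines.append(f"  Motion: {motion_tokens}")
        lines.append(f"  Caption: {caption}")
    lines.append(f"Now describe this motion: {query_motion_tokens}")
    return "\n".join(lines)

# Usage with hypothetical retrieved pairs (e.g., the top-k neighbors above).
examples = [
    ("<motion_tokens_1>", "a person waves with the right hand"),
    ("<motion_tokens_2>", "a person raises both arms overhead"),
]
print(build_captioning_prompt(examples, "<query_motion_tokens>"))
```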