To enhance retrieval quality, we move beyond single-modal approaches and adopt multi-modal retrieval between motions and captions, jointly considering motion and textual similarities to select suitable references from a database. We introduce a novel fine-grained text-motion retrieval method that captures body-part-level motion features, enabling more precise alignment with textual descriptions. Unlike existing methods such as TMR and MotionPatches, which encode the full body into a single embedding with a shared motion encoder, our approach explicitly models inter-part relationships by dividing the body into distinct parts and encoding each with a separate lightweight encoder. This part-based design captures finer motion details aligned with specific action semantics, leading to improved retrieval accuracy.
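As a concrete illustration, the sketch below shows one possible form of such a part-based motion encoder used for cross-modal retrieval. The body-part split, per-part feature dimensions, GRU-based part encoders, and the placeholder text embedding are all illustrative assumptions, not the exact architecture described here.

```python
# Minimal sketch of part-based motion encoding for text-motion retrieval.
# Part slices, dimensions, and encoder sizes are illustrative assumptions.
import torch
import torch.nn as nn

# Hypothetical split of 60-dim per-frame joint features into five body parts.
PART_SLICES = {
    "torso":     slice(0, 12),
    "left_arm":  slice(12, 24),
    "right_arm": slice(24, 36),
    "left_leg":  slice(36, 48),
    "right_leg": slice(48, 60),
}

class PartMotionEncoder(nn.Module):
    """Encodes each body part with its own lightweight GRU, then fuses parts."""
    def __init__(self, part_dims, hidden=64, embed=256):
        super().__init__()
        self.encoders = nn.ModuleDict({
            name: nn.GRU(dim, hidden, batch_first=True)
            for name, dim in part_dims.items()
        })
        # Fusion layer models inter-part relationships on top of part features.
        self.fuse = nn.Sequential(
            nn.Linear(hidden * len(part_dims), embed),
            nn.ReLU(),
            nn.Linear(embed, embed),
        )

    def forward(self, motion):  # motion: (B, T, D) per-frame joint features
        part_feats = []
        for name, enc in self.encoders.items():
            _, h = enc(motion[:, :, PART_SLICES[name]])  # h: (1, B, hidden)
            part_feats.append(h.squeeze(0))
        return self.fuse(torch.cat(part_feats, dim=-1))  # (B, embed)

# Usage: embed a batch of motions and rank them against a text embedding
# (a text encoder sharing the same embedding space is assumed to exist).
encoder = PartMotionEncoder({n: s.stop - s.start for n, s in PART_SLICES.items()})
motions = torch.randn(8, 120, 60)            # 8 clips, 120 frames, 60-dim features
motion_emb = nn.functional.normalize(encoder(motions), dim=-1)
text_emb = nn.functional.normalize(torch.randn(1, 256), dim=-1)  # placeholder
scores = text_emb @ motion_emb.t()           # cosine similarity for retrieval
top_k = scores.topk(3, dim=-1).indices       # indices of retrieved references
```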

We compare text-to-text retrieval with text-to-motion retrieval and find that retrieving through the motion modality achieves better results than text-to-text retrieval. The retrieved samples are then included in the prompts to provide effective guidance during generation for both motion generation and motion captioning, as sketched below.
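The snippet below is a minimal sketch of how retrieved motion-caption pairs might be formatted as in-context examples in a prompt; the template, field names, and motion-token placeholders are hypothetical and do not reproduce the exact prompt used here.

```python
# Sketch of retrieval-augmented prompting for motion captioning.
# The prompt template and token placeholders are illustrative assumptions.
def build_captioning_prompt(retrieved, query_motion_tokens):
    """Insert retrieved motion-caption pairs as in-context examples."""
    lines = ["Describe the motion. Here are similar examples:"]
    for i, (motion_tokens, caption) in enumerate(retrieved, 1):
        lines.append(f"Example {i}:")
        lines.append(f"  Motion: {motion_tokens}")
        lines.append(f"  Caption: {caption}")
    lines.append(f"Now describe this motion: {query_motion_tokens}")
    return "\n".join(lines)

# Usage with hypothetical retrieved pairs (e.g., the top-k neighbors above).
examples = [
    ("<motion_tokens_1>", "a person waves with the right hand"),
    ("<motion_tokens_2>", "a person raises both arms overhead"),
]
print(build_captioning_prompt(examples, "<query_motion_tokens>"))
```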