AV-AG: a system that uses sound to locate where to manipulate objects

Published on January 05, 2026 | Translated from Spanish
Diagram: an image of a cup next to the waveform of a sipping sound; arrows connect the audio to a segmentation mask that highlights the cup's handle in the image.


Computer vision research keeps exploring new ways to understand scenes. An innovative system called AV-AG proposes a different approach: using the sound of an action to locate and precisely delineate, in an image, the parts of an object that can be interacted with. Because the method does not depend on the object being fully visible, it helps resolve ambiguity and cases of partial visual occlusion. 🎯

The power of acoustic cues

Unlike systems that rely on text or video, audio provides direct and immediate semantic cues. To train and test this capability, the researchers built the first AV-AG dataset, which combines recordings of action sounds, the corresponding images, and pixel-level annotations marking the manipulable regions. A subset of objects not seen during training makes it possible to evaluate how well the system generalizes to new cases, a crucial point for its practical usefulness.

Key components of the dataset (a minimal record sketch follows below):
  • Sounds of specific actions (e.g., sipping, grasping, hitting).
  • Images of the objects associated with those actions.
  • Pixel-level annotations defining interaction zones.
  • A group of unseen objects to test generalization.

Sound can effectively guide the visual understanding of how we interact with objects.
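
To make the dataset structure tangible, here is a minimal sketch of how one record might be organized. The field names, array shapes, sample rate, and the seen/unseen flag are assumptions for illustration, not the published layout; IoU is shown only because it is the usual way pixel-level masks are compared.

```python
# A hypothetical sketch of one AV-AG-style record; layout is assumed.
from dataclasses import dataclass
import numpy as np


@dataclass
class AVAGRecord:
    image: np.ndarray   # H x W x 3 RGB image of the object
    audio: np.ndarray   # mono waveform of the action sound (e.g., sipping)
    mask: np.ndarray    # H x W binary map of the manipulable region
    action: str         # action label, e.g., "sip", "grasp", "hit"
    unseen: bool        # True if the object class is held out for generalization tests


def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection-over-union between a predicted and an annotated mask."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return float(inter) / float(union) if union > 0 else 1.0


# Toy usage with synthetic data, just to show the shapes involved.
record = AVAGRecord(
    image=np.zeros((256, 256, 3), dtype=np.uint8),
    audio=np.zeros(16000, dtype=np.float32),  # 1 s at an assumed 16 kHz rate
    mask=np.zeros((256, 256), dtype=bool),
    action="sip",
    unseen=False,
)
print(iou(record.mask, record.mask))  # 1.0 for identical masks
```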

AVAGFormer model architecture

The core of the system is the AVAGFormer model, which fuses auditory and visual information. A cross-modal mixer integrates the acoustic cues with the image features in a semantically coherent way, and a two-branch decoder then generates the final segmentation masks. This architecture outperforms previous methods on the task of locating audio-guided interaction regions.

AVAGFormer processing flow (sketched in code below):
  • Simultaneous input of an image and an audio signal.
  • Semantically conditioned cross-modal fusion.
  • Decoding in two branches to predict the precise mask.
  • Output of a pixel-level segmentation mask of the manipulable zone.
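
To make this flow concrete, here is a minimal PyTorch sketch. It is not the published AVAGFormer implementation: the layer sizes, the use of cross-attention as the cross-modal mixer, and the dot-product mask decoding (a pattern borrowed from query-based segmenters) are all assumptions for illustration.

```python
# Minimal sketch of the described flow: encode both modalities, fuse them,
# and decode with two branches whose dot product yields the mask.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AVAGFormerSketch(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        # Visual encoder: downsample the image into a grid of feature tokens.
        self.visual = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=4, stride=4), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=2, stride=2), nn.ReLU(),
        )
        # Audio encoder: 1-D convolutions over the waveform, pooled to one vector.
        self.audio = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=400, stride=160), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # Cross-modal mixer: image tokens attend to the audio embedding
        # (a residual connection keeps the visual content).
        self.mixer = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Two decoding branches: per-pixel embeddings and an audio-derived
        # mask embedding; their dot product gives the mask logits.
        self.pixel_branch = nn.Conv2d(dim, dim, kernel_size=1)
        self.query_branch = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, image: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        b = image.shape[0]
        feat = self.visual(image)                          # (B, D, H/8, W/8)
        h, w = feat.shape[2:]
        tokens = feat.flatten(2).transpose(1, 2)           # (B, N, D)
        aud = self.audio(audio.unsqueeze(1)).squeeze(-1)   # (B, D)
        aud_tok = aud.unsqueeze(1)                         # (B, 1, D)
        fused = tokens + self.mixer(tokens, aud_tok, aud_tok)[0]
        fused = fused.transpose(1, 2).reshape(b, -1, h, w)
        pixels = self.pixel_branch(fused)                  # per-pixel embeddings
        query = self.query_branch(aud)                     # audio mask embedding
        logits = torch.einsum("bdhw,bd->bhw", pixels, query).unsqueeze(1)
        logits = F.interpolate(logits, size=image.shape[2:], mode="bilinear",
                               align_corners=False)
        return torch.sigmoid(logits)                       # (B, 1, H, W) mask


# Toy forward pass, just to show the expected input and output shapes.
model = AVAGFormerSketch()
mask = model(torch.randn(1, 3, 256, 256), torch.randn(1, 16000))
print(mask.shape)  # torch.Size([1, 1, 256, 256])
```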

Direct applications in 3D graphics and simulation

For the foro3d.com community, this technology opens up concrete possibilities. It could help generate contact masks or manipulable zones for 3D models directly from audio cues, speeding up setup. In physics simulation, it could automatically identify realistic grip points. It also enriches animation and rigging pipelines with data about how objects are actually used, could power texturing tools that detect functional surfaces, and could inspire plugins that combine audio and vision to keep actions, sounds, and movements coherent in 3D scenes. So the next time a character grabs a cup correctly, the credit may go to a simple sipping sound. 🫖
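
As a closing illustration, here is one hypothetical way to carry a predicted 2D mask into a 3D workflow: project mesh vertices into the camera image and tag those that land inside the manipulable region as candidate grip points. Everything in this sketch (the pinhole camera model, the vertex layout, the grip_vertices helper) is an assumption for illustration, not part of AV-AG.

```python
# Hypothetical bridge from a 2-D affordance mask to per-vertex grip flags.
import numpy as np


def grip_vertices(vertices: np.ndarray, mask: np.ndarray,
                  focal: float = 500.0) -> np.ndarray:
    """Return a boolean flag per vertex: True if it projects into the mask.

    vertices: (N, 3) points in camera space (Z > 0, camera looking down +Z).
    mask:     (H, W) boolean segmentation of the manipulable region.
    """
    h, w = mask.shape
    cx, cy = w / 2.0, h / 2.0
    z = np.clip(vertices[:, 2], 1e-6, None)
    u = (focal * vertices[:, 0] / z + cx).astype(int)   # pixel column
    v = (focal * vertices[:, 1] / z + cy).astype(int)   # pixel row
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    flags = np.zeros(len(vertices), dtype=bool)
    flags[inside] = mask[v[inside], u[inside]]
    return flags


# Toy usage: two vertices in front of the camera and a mask covering the
# upper-left quarter of a 256 x 256 image.
mask = np.zeros((256, 256), dtype=bool)
mask[:128, :128] = True
verts = np.array([[-0.1, -0.1, 1.0], [0.1, 0.1, 1.0]])
print(grip_vertices(verts, mask))  # [ True False]
```

The resulting flags could then drive a vertex group, a contact marker, or a rig constraint in whatever 3D package you prefer.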