
Meta launches SAM Audio, the first unified multimodal model for audio separation

SAM Audio lets you isolate any sound using text, visual cues, or time markers

Meta AI
12/16/2025
Meta announced SAM Audio, a new AI model that introduces a unified, multimodal approach to audio separation in real-world scenarios. Inspired by the success of the Segment Anything Model (SAM) in computer vision, SAM Audio brings the same concept to sound, letting users isolate instruments, voices, or specific audio events in complex mixes through natural instructions such as text, visual clicks on videos, or temporal markers.

At the heart of the system is the Perception Encoder Audiovisual (PE-AV), an evolution of the open-source Perception Encoder released by Meta earlier this year. PE-AV acts as the model's perceptual mechanism, aligning visual and auditory information in time and enabling SAM Audio to accurately associate what is seen with what is heard. This audiovisual alignment is essential for separating visually anchored sound sources, such as people speaking or instruments on screen, and even for inferring events outside the field of view.

SAM Audio introduces three main interaction modes: text prompts, such as "dog barking"; visual prompts, given by clicking on objects or people in the video; and temporal extension prompts, an innovation that lets users mark the specific segments where the desired sound occurs. These modalities can be used alone or in combination, giving the user precise and intuitive control.

Technically, the model uses a generative architecture based on flow-matching diffusion transformers. It receives an audio mix and multimodal prompts, encodes everything into a shared representation, and generates the target and residual audio tracks. To enable large-scale training, Meta developed an advanced data pipeline combining realistic mixes, automatic generation of multimodal prompts, and pseudo-labeling, covering speech, music, and ambient sounds.

Alongside the main model, Meta also presented two first-of-their-kind resources for the audio ecosystem: SAM Audio-Bench, the first benchmark for audio separation under real-world multimodal conditions, and SAM Audio Judge, an automatic evaluation model that measures separation quality using criteria perceptually aligned with human hearing, without requiring reference tracks.

In evaluations, SAM Audio outperformed state-of-the-art models on multiple benchmarks and matched or exceeded domain-specific solutions. The system runs faster than real time (RTF ≈ 0.7, meaning a minute of audio is processed in roughly 42 seconds) and scales from 500 million to 3 billion parameters. It still struggles to separate extremely similar sources, such as a solo singer within a choir.

With applications ranging from audio cleanup and music creation to accessibility and smart hearing devices, SAM Audio represents a significant step toward AI that is more creative, inclusive, and perceptually aligned.
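Meta has not published a client library for SAM Audio, so the snippet below is only a conceptual sketch of how the three prompt modes described above could combine in code. Every name in it (the sam_audio module, SamAudio, TextPrompt, VisualPrompt, SpanPrompt, the checkpoint name) is hypothetical, invented purely to illustrate the interaction model.

```python
# Hypothetical sketch only: SAM Audio has no published Python API;
# every module, class, and checkpoint name below is invented.
from sam_audio import SamAudio, TextPrompt, VisualPrompt, SpanPrompt  # hypothetical

model = SamAudio.from_pretrained("sam-audio-3b")  # hypothetical checkpoint

# 1. Text prompt: describe the target sound in natural language.
result = model.separate("street.wav", prompts=[TextPrompt("dog barking")])

# 2. Visual prompt: a click on a video frame anchors the sound source.
result = model.separate(
    "concert.mp4",
    prompts=[VisualPrompt(frame=120, x=640, y=360)],  # the guitarist on screen
)

# 3. Temporal span prompt: mark when the target sound occurs.
result = model.separate(
    "podcast.wav",
    prompts=[SpanPrompt(start=12.5, end=14.0)],  # seconds
)

# Prompts can be combined; the model returns the isolated target
# plus the residual (everything else in the mix).
target, residual = result.target, result.residual
```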
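For readers curious about what "flow matching" means in practice, here is a minimal, generic sketch of the sampling loop such generative models use: a learned velocity field transports Gaussian noise toward the target representation over a fixed number of integration steps. This illustrates the general technique only, not Meta's implementation; the velocity_model callable and the step count are assumptions.

```python
import numpy as np

def sample_flow_matching(velocity_model, cond, shape, num_steps=32):
    """Generic Euler sampler for a flow-matching generative model.

    velocity_model(x, t, cond) -> estimated velocity dx/dt; a stand-in
    for the conditioned diffusion transformer, where `cond` would hold
    the encoded audio mix and the multimodal prompts.
    """
    x = np.random.randn(*shape)         # start from Gaussian noise at t = 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_model(x, t, cond)  # predicted velocity at time t
        x = x + v * dt                  # Euler step toward the data at t = 1
    return x                            # generated audio representation

# Toy usage with a dummy velocity field that pulls samples toward zero:
target = sample_flow_matching(lambda x, t, c: -x, cond=None, shape=(1, 16000))
```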
Source: Meta AI