AudioSAM: A Bioacoustics Foundation Model for Automatic Annotation

Abstract

Bioacoustic monitoring is essential for ecological research and wildlife conservation, enabling the tracking of species and ecosystem health at scale without physical intervention. Yet, automated analysis of bioacoustic data remains challenging due to complex soundscapes with overlapping species, environmental variability, and high background noise.

We introduce a self-supervised learning framework tailored to large-scale bioacoustic recordings. Our method adapts the Masked Autoencoder paradigm to animal vocalizations and is trained on over 10,000 hours of terrestrial and marine recordings spanning more than 500 species. We demonstrate significant improvements over traditional methods in species identification, vocalization detection, and acoustic event classification.

Our contributions are threefold: (1) a released 1B-parameter Vision Transformer encoder optimized for long bioacoustic sequences, (2) the largest bioacoustics dataset with standardized annotations, and (3) state-of-the-art results on bioacoustic tasks, including event detection and species classification.

Datasets and Data Augmentation

Figure: Overview of the unified bioacoustics dataset spanning diverse species and environments.

We have assembled the largest unified bioacoustics dataset to date: over 1 million annotated vocalizations from more than 500 species across 30+ genera. It consolidates 30+ existing bioacoustics datasets under a unified annotation framework, providing scale and diversity beyond what any single source dataset offers.

Dataset Composition

Scale: Over 1,000,000 annotated vocalizations
Diversity: 500+ species from 30+ genera
Coverage: Terrestrial and marine environments
Unified Framework: 30+ datasets consolidated under consistent annotations

Data Types

Short Audio Files

Individual files containing single vocalization events, ideal for training precise detection and classification models.

Long Audio Sequences

Extended recordings with multiple overlapping calls and complex soundscapes, enabling training on realistic field conditions.

Multi-species Recordings: Natural environments with multiple species vocalizing simultaneously
Controlled Recordings: High-quality single-species recordings for training

Data Augmentation Strategies

Stitching

We concatenate two or more audio segments to form longer sequences, thereby increasing temporal diversity and exposing the model to transitions across different species and acoustic contexts. This augmentation encourages the encoder to capture long-range dependencies in the recordings.
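A minimal sketch of stitching, assuming waveforms are plain lists of float samples at a shared sample rate. The function name `stitch`, the optional silent gap, and the returned segment boundaries are illustrative assumptions, not details from the paper.

```python
def stitch(segments, sample_rate=16000, gap_seconds=0.0):
    """Concatenate waveforms into one longer sequence.

    Returns the stitched waveform and the (start, end) sample range of
    each input segment, so annotations can be shifted accordingly.
    `gap_seconds` (hypothetical parameter) inserts silence between segments.
    """
    gap = [0.0] * int(round(gap_seconds * sample_rate))
    stitched, boundaries = [], []
    for i, segment in enumerate(segments):
        if i > 0:
            stitched.extend(gap)
        start = len(stitched)
        stitched.extend(segment)
        boundaries.append((start, len(stitched)))
    return stitched, boundaries


# Two toy "recordings" stitched with half a second of silence at 4 Hz.
call_a = [0.1, 0.2, 0.3]
call_b = [-0.5, -0.4]
wave, bounds = stitch([call_a, call_b], sample_rate=4, gap_seconds=0.5)
```

Returning the boundaries keeps the per-segment annotations usable after concatenation, which matters when the stitched sequence is later used for detection.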

Mixing

We mix audio waveforms from different recordings by superimposing them, simulating overlapping vocalizations and complex natural soundscapes. This strategy trains the model to disentangle multiple sources and improves robustness to polyphonic bioacoustic environments.
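One way to sketch mixing is shown below. The gain, offset, and peak normalization are assumptions made for illustration; the paper does not specify how the overlaid recordings are weighted or aligned.

```python
def mix(primary, secondary, secondary_gain=0.5, offset=0):
    """Superimpose `secondary` onto `primary`, starting at sample `offset`.

    The output keeps the length of `primary`; any part of `secondary`
    that runs past the end is dropped. If the sum exceeds full scale
    (|x| > 1.0), the result is peak-normalized to avoid clipping.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    mixed = list(primary)
    for i, sample in enumerate(secondary):
        j = offset + i
        if j >= len(mixed):
            break
        mixed[j] += secondary_gain * sample
    peak = max((abs(x) for x in mixed), default=0.0)
    if peak > 1.0:
        mixed = [x / peak for x in mixed]
    return mixed


# A short call overlaid on a longer recording, one sample in.
overlap = mix([0.5, 0.5, 0.5, 0.5], [1.0, 1.0], secondary_gain=0.5, offset=1)
# A sum that exceeds full scale is rescaled.
rescaled = mix([1.0], [1.0], secondary_gain=1.0)
```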

Noise Addition

We add synthetic or natural background noise to audio recordings, replicating common environmental disturbances such as wind, rain, or anthropogenic sounds. This augmentation improves the model's resilience to noisy conditions and enhances generalization to field recordings.
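A common way to control how much noise is added is to scale it to a target signal-to-noise ratio (SNR). The sketch below assumes that approach; the paper does not state the noise levels used, and `add_noise` and `snr_db` are hypothetical names.

```python
import math
import random


def add_noise(signal, noise, snr_db):
    """Add `noise` to `signal`, scaled so that the resulting SNR is `snr_db`."""
    if len(noise) < len(signal):
        raise ValueError("noise must be at least as long as signal")
    if not signal:
        return []
    noise = noise[: len(signal)]
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    if signal_power == 0.0 or noise_power == 0.0:
        return list(signal)  # SNR is undefined; leave the signal unchanged
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  required noise power
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(signal, noise)]


# Toy example: a sine "call" with Gaussian noise standing in for wind or rain.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
background = [rng.gauss(0.0, 1.0) for _ in range(100)]
noisy = add_noise(clean, background, snr_db=10.0)

# Measure the SNR that was actually achieved.
residual = [y - x for x, y in zip(clean, noisy)]
clean_power = sum(x * x for x in clean) / len(clean)
residual_power = sum(r * r for r in residual) / len(residual)
measured_snr = 10 * math.log10(clean_power / residual_power)
```

For natural noise, `background` would be a segment of a field recording rather than synthetic Gaussian samples.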

Amplitude Changes

We randomly scale the amplitude of audio signals to mimic variability in microphone sensitivity, distance from the source, and propagation effects. This encourages the model to learn representations that are invariant to loudness and recording conditions.
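Random amplitude scaling can be sketched as a gain drawn uniformly in decibels. The range of -12 dB to +6 dB is an arbitrary illustrative choice, not a value reported in the paper.

```python
import random


def random_gain(signal, min_db=-12.0, max_db=6.0, rng=random):
    """Scale `signal` by a gain drawn uniformly (in dB) from [min_db, max_db].

    Returns the scaled signal and the gain in dB that was applied.
    """
    gain_db = rng.uniform(min_db, max_db)
    gain = 10 ** (gain_db / 20)  # amplitude ratio corresponding to gain_db
    return [gain * x for x in signal], gain_db


scaled, applied_db = random_gain([0.1, -0.2, 0.4], rng=random.Random(0))
```

Sampling in decibels rather than in linear amplitude spreads the gains evenly on a perceptual loudness scale.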

These augmentation strategies enable our model to generalize across diverse acoustic environments and recording conditions, while the unified dataset provides unprecedented scale for training robust bioacoustic foundation models.

Model Architecture

Figure: Architecture of the AudioSAM foundation model for bioacoustic analysis.

Our AudioSAM model is a 1B-parameter Vision Transformer encoder specifically designed for processing long bioacoustic sequences. The architecture leverages self-supervised learning through masked autoencoding to learn robust representations of animal vocalizations across diverse species and acoustic environments.
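The masking step at the heart of masked autoencoding can be sketched as follows: a spectrogram is split into non-overlapping patches, a random subset is hidden, the encoder sees only the visible patches, and a decoder is trained to reconstruct the hidden ones. The patch size and the 0.75 mask ratio below are assumptions for illustration; the paper does not give AudioSAM's values, and `patchify` and `random_patch_mask` are hypothetical helper names.

```python
import random


def patchify(spec, patch_f, patch_t):
    """Split a 2D spectrogram (list of frequency rows) into flattened patches."""
    n_f, n_t = len(spec), len(spec[0])
    if n_f % patch_f or n_t % patch_t:
        raise ValueError("spectrogram shape must be divisible by the patch size")
    patches = []
    for f0 in range(0, n_f, patch_f):
        for t0 in range(0, n_t, patch_t):
            patches.append([spec[f][t]
                            for f in range(f0, f0 + patch_f)
                            for t in range(t0, t0 + patch_t)])
    return patches


def random_patch_mask(num_patches, mask_ratio=0.75, rng=random):
    """Return (visible, masked) patch indices for one training example."""
    num_keep = max(1, round(num_patches * (1 - mask_ratio)))
    order = list(range(num_patches))
    rng.shuffle(order)
    return sorted(order[:num_keep]), sorted(order[num_keep:])


# Toy 4 x 8 "spectrogram" whose entries encode their own position.
spec = [[f * 8 + t for t in range(8)] for f in range(4)]
patches = patchify(spec, patch_f=2, patch_t=2)
visible, masked = random_patch_mask(len(patches), rng=random.Random(0))
encoder_input = [patches[i] for i in visible]        # what the encoder sees
reconstruction_target = [patches[i] for i in masked]  # what the decoder predicts
```

Because the encoder processes only the visible patches, a high mask ratio also reduces the sequence length it has to handle, which is relevant for long recordings.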

Results

Zero-Shot Detection Performance

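We evaluate our model's zero-shot performance on vocalization detection across five bioacoustic datasets. It achieves the highest F1 score on four of them (DCASE, EnaBirds, Rainforest, and Gibbons); on Hiceas, BioLingual (0.429) and NatureLM-audio (0.336) score higher than our model (0.304).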

Model	DCASE	EnaBirds	Hiceas	Rainforest	Gibbons
LLM w/o audio	0.000	0.001	0.210	0.000	0.013
SALMONN	0.005	0.004	0.097	0.002	0.005
BioLingual	0.036	0.109	0.429	0.004	0.018
NatureLM-audio	0.058	0.314	0.336	0.025	0.005
Our Model	0.282	0.902	0.304	0.111	0.041

Table 1. Zero-shot F1 scores on vocalization detection datasets. The best metric is highlighted for each dataset.
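For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it from counts of true positives, false positives, and false negatives; how predicted detections are matched to annotations for each dataset is not described here, so the counts are taken as given and the example numbers are made up.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 correct detections, 2 spurious, 2 missed.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```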