AudioSAM: A Bioacoustics Foundation Model for Automatic Annotation

AudioSAM is a foundation model for automatic annotation of bioacoustic data, accompanied by a large collection of over 1,000,000 calls from over 500 species.

Abstract

Bioacoustic monitoring is essential for ecological research and wildlife conservation, enabling the tracking of species and ecosystem health at scale without physical intervention. Yet, automated analysis of bioacoustic data remains challenging due to complex soundscapes with overlapping species, environmental variability, and high background noise.

We introduce a self-supervised learning framework tailored to large-scale bioacoustic recordings. Our method adapts the Masked Autoencoder paradigm to animal vocalizations and is trained on over 10,000 hours of terrestrial and marine recordings spanning 500 species. We demonstrate significant improvements over traditional methods in species identification, vocalization detection, and acoustic event classification.

Our contributions are threefold: (1) the release of a 1B-parameter Vision Transformer encoder optimized for long bioacoustic sequences, (2) the largest bioacoustics dataset with standardized annotations, and (3) state-of-the-art results on bioacoustic tasks, including event detection and species classification.

Datasets and Data Augmentation

Figure: Overview of the unified bioacoustics dataset spanning diverse species and environments.

We have assembled the largest unified bioacoustics dataset to date, comprising over 1 million annotated vocalizations from more than 500 species across 30+ genera. Our dataset consolidates 30+ existing bioacoustics datasets under a unified annotation framework, enabling unprecedented scale and diversity in bioacoustic research.

Dataset Composition

  • Scale: Over 1,000,000 annotated vocalizations
  • Diversity: 500+ species from 30+ genera
  • Coverage: Terrestrial and marine environments
  • Unified Framework: 30+ datasets consolidated under consistent annotations

Data Types

Short Audio Files

Individual files containing single vocalization events, ideal for training precise detection and classification models.

Long Audio Sequences

Extended recordings with multiple overlapping calls and complex soundscapes, enabling training on realistic field conditions.

  • Multi-species Recordings: Natural environments with multiple species vocalizing simultaneously
  • Controlled Recordings: High-quality single-species recordings for training

Data Augmentation Strategies

Stitching

We concatenate two or more audio segments to form longer sequences, thereby increasing temporal diversity and exposing the model to transitions across different species and acoustic contexts. This augmentation encourages the encoder to capture long-range dependencies in the recordings.
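
The following is a minimal sketch of stitching, not the AudioSAM implementation. It assumes each segment is a mono waveform stored as a list of floats at a shared sample rate. The function name `stitch` and the returned start offsets are hypothetical additions for illustration.

```python
from typing import List, Sequence, Tuple


def stitch(segments: Sequence[Sequence[float]]) -> Tuple[List[float], List[int]]:
    """Concatenate segments; return the waveform and each segment's start offset."""
    if len(segments) < 2:
        raise ValueError("stitching needs at least two segments")
    stitched: List[float] = []
    offsets: List[int] = []
    for segment in segments:
        offsets.append(len(stitched))  # sample index where this segment begins
        stitched.extend(segment)
    return stitched, offsets


# Toy waveforms standing in for two short recordings (hypothetical values).
call_a = [0.1, 0.2, 0.3]
call_b = [-0.5, 0.5]
waveform, offsets = stitch([call_a, call_b])
```

Returning the offsets is one way to shift per-segment annotation timestamps into the coordinates of the stitched recording. Whether annotations are handled this way is an assumption.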

Mixing

We mix audio waveforms from different recordings by superimposing them, simulating overlapping vocalizations and complex natural soundscapes. This strategy trains the model to disentangle multiple sources and improves robustness to polyphonic bioacoustic environments.
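
Superimposing two recordings can be sketched as a sample-wise sum. This is an illustrative sketch only. It assumes mono waveforms as lists of floats in [-1, 1] at the same sample rate; the `gain_b` parameter and the peak rescaling step are assumptions that are not described above.

```python
from typing import List, Sequence


def mix(a: Sequence[float], b: Sequence[float], gain_b: float = 1.0) -> List[float]:
    """Sum two waveforms sample by sample, zero-padding the shorter one."""
    length = max(len(a), len(b))
    mixed = []
    for i in range(length):
        sample_a = a[i] if i < len(a) else 0.0
        sample_b = b[i] if i < len(b) else 0.0
        mixed.append(sample_a + gain_b * sample_b)
    # Rescale only when the sum would clip outside [-1, 1] (assumed convention).
    peak = max((abs(x) for x in mixed), default=0.0)
    if peak > 1.0:
        mixed = [x / peak for x in mixed]
    return mixed


# Toy example: the shorter waveform is padded with silence before summing.
overlapping = mix([0.5, 0.25], [0.25, 0.25, 0.5])
```

Varying `gain_b` would change how prominent the second source is in the mixture, which is one plausible way to simulate callers at different distances.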

Noise Addition

We add synthetic or natural background noise to audio recordings, replicating common environmental disturbances such as wind, rain, or anthropogenic sounds. This augmentation improves the model's resilience to noisy conditions and enhances generalization to field recordings.
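
One common way to control how much noise is added is to scale the noise to a target signal-to-noise ratio (SNR). The sketch below assumes this approach; the text above does not state how noise levels are chosen, so `snr_db` and the function name `add_noise` are hypothetical. The noise clip can be a natural recording (wind, rain) or synthetic, as in the usage example.

```python
import math
import random
from typing import List, Sequence


def add_noise(signal: Sequence[float], noise: Sequence[float], snr_db: float) -> List[float]:
    """Add `noise` to `signal`, scaled so the result has the requested SNR in dB."""
    if not signal or len(noise) < len(signal):
        raise ValueError("noise must be at least as long as a non-empty signal")
    noise = noise[: len(signal)]
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0.0:
        return list(signal)  # silent noise clip: nothing to add
    # Choose scale so that signal_power / (scale^2 * noise_power) = 10^(snr_db / 10).
    scale = math.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]


# Usage with synthetic Gaussian noise and a 440 Hz tone at an assumed 16 kHz rate.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
synthetic_noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = add_noise(clean, synthetic_noise, snr_db=10.0)
```

Sampling `snr_db` from a range per training example would expose the model to both mildly and heavily degraded recordings.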

Amplitude Changes

We randomly scale the amplitude of audio signals to mimic variability in microphone sensitivity, distance from the source, and propagation effects. This encourages the model to learn representations that are invariant to loudness and recording conditions.
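
Random amplitude scaling can be sketched as drawing a gain in decibels and applying it to every sample. The gain range of -12 dB to +6 dB below is an assumed placeholder, not a value reported for AudioSAM, and `random_gain` is a hypothetical name.

```python
import random
from typing import List, Sequence, Tuple


def random_gain(
    signal: Sequence[float],
    rng: random.Random,
    min_db: float = -12.0,  # assumed lower bound
    max_db: float = 6.0,    # assumed upper bound
) -> Tuple[List[float], float]:
    """Scale the waveform by a gain drawn uniformly in dB; return it with the gain."""
    gain_db = rng.uniform(min_db, max_db)
    gain = 10 ** (gain_db / 20)  # convert dB to a linear amplitude factor
    return [gain * x for x in signal], gain_db


scaled, applied_db = random_gain([0.5, -0.25], random.Random(0))
```

Drawing the gain in decibels rather than as a linear factor spreads the samples evenly on a perceptual loudness scale. Positive gains can push samples outside [-1, 1], so a pipeline that stores fixed-range audio would need to clip or renormalize afterwards.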

These augmentation strategies enable our model to generalize across diverse acoustic environments and recording conditions, while the unified dataset provides unprecedented scale for training robust bioacoustic foundation models.

Model Architecture

Figure: Architecture of the AudioSAM foundation model for bioacoustic analysis.

Our AudioSAM model is a 1B-parameter Vision Transformer encoder specifically designed for processing long bioacoustic sequences. The architecture leverages self-supervised learning through masked autoencoding to learn robust representations of animal vocalizations across diverse species and acoustic environments.
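
To illustrate the masking step of masked autoencoding, the sketch below randomly splits the patches of an input into a visible set, which the encoder would process, and a masked set, which the decoder would reconstruct. It assumes the input is a spectrogram divided into fixed-size patches. The spectrogram size, patch size, and 75% mask ratio are illustrative assumptions, not reported AudioSAM settings.

```python
import random
from typing import List, Tuple


def random_patch_mask(
    num_patches: int, mask_ratio: float, rng: random.Random
) -> Tuple[List[int], List[int]]:
    """Randomly split patch indices into (visible, masked) index lists."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)  # random permutation of patch indices
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Hypothetical input: 128 mel bins x 1024 frames, cut into 16 x 16 patches.
num_patches = (128 // 16) * (1024 // 16)  # 8 * 64 = 512 patches
visible, masked = random_patch_mask(num_patches, mask_ratio=0.75, rng=random.Random(0))
```

Because the encoder only sees the visible patches, a high mask ratio reduces the sequence length it processes during pretraining, which matters for long recordings.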

Results

Zero-Shot Detection Performance

We evaluate our model's zero-shot performance on vocalization detection across five bioacoustic datasets. Our model achieves the highest F1 score on four of them (DCASE, EnaBirds, Rainforest, and Gibbons), while BioLingual achieves the highest score on Hiceas.

Model            DCASE    EnaBirds   Hiceas   Rainforest   Gibbons
LLM w/o audio    0.000    0.001      0.210    0.000        0.013
SALMONN          0.005    0.004      0.097    0.002        0.005
BioLingual       0.036    0.109      0.429*   0.004        0.018
NatureLM-audio   0.058    0.314      0.336    0.025        0.005
Our Model        0.282*   0.902*     0.304    0.111*       0.041*

Table 1. Zero-shot F1 scores on vocalization detection datasets. The best score for each dataset is marked with an asterisk (*).
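
For reference, the F1 score is the harmonic mean of precision and recall and can be computed from true-positive, false-positive, and false-negative counts. The sketch below shows only that formula. How predicted detections are matched to annotated vocalizations is not specified here, and the counts in the example are hypothetical.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no counts."""
    denominator = 2 * true_pos + false_pos + false_neg
    if denominator == 0:
        return 0.0
    return 2 * true_pos / denominator


# Hypothetical counts: 8 correct detections, 2 spurious, 2 missed.
example_f1 = f1_score(true_pos=8, false_pos=2, false_neg=2)
```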