Nature Sounds AI

Nature Sounds: Identifying Sounds Around Us

By Jasmin Ibrišimbegović

Introduction

Nature Sounds Project

We live in a world filled with sound—birds chirping at dawn, rain pattering on windows, dogs barking in the distance. Yet identifying these sounds computationally is far more challenging than it appears. Unlike images, which freeze a moment in a fixed frame, sounds are temporal, variable in length, and often obscured by background noise.

Nature Sounds is my attempt to tackle this problem: an AI-powered system that classifies 50 environmental sounds in real-time, from frog croaking to car horns, using deep learning and accessible mobile technology.

The Challenge of Sound Recognition

Sound recognition faces unique challenges that make it fundamentally different from, and harder than, image recognition:

1. Temporal Variability

Unlike a photograph with fixed dimensions (e.g., 224×224 pixels), audio clips vary wildly in duration. A dog bark might last half a second; rain can continue for hours. Models must handle this temporal inconsistency.

2. Background Noise

Real-world audio is messy. A recording of a frog croaking near a busy road also captures car engines, wind, and human voices. The target sound is often buried in noise, something that is rarely an issue in image recognition, where the subject is usually clearly visible.

3. Overlapping Sounds

Multiple sounds can occur simultaneously. A recording might contain chirping birds, wind, and footsteps all at once. This "cocktail party" problem makes single-label classification difficult.

4. Sample Rate and Quality

Audio quality varies dramatically. Studio recordings at 48kHz are pristine; smartphone recordings might be 16kHz with compression artifacts. Models must generalize across these variations.

5. Feature Representation

Images are naturally represented as pixels. Audio requires transformation—raw waveforms are high-dimensional and hard to learn from. Spectrograms, MFCCs (Mel-Frequency Cepstral Coefficients), and other representations must be chosen carefully.

Despite these challenges, sound recognition opens incredible possibilities: wildlife monitoring, assistive technologies for the hearing impaired, smart home automation, and environmental sensing.

Project Vision

Nature Sounds aims to make sound recognition accessible. The system allows anyone with a smartphone to:

  • Record live audio from their environment
  • Upload existing audio files
  • Receive instant AI predictions with confidence scores
  • Identify 50 common environmental sounds

The ultimate goal: democratize acoustic intelligence and enable applications from bird identification apps to smart city noise monitoring.

Technology Stack

  • Model: Keras/TensorFlow CNN trained on ESC-50 dataset
  • Backend: FastAPI (Python) with uvicorn server
  • Frontend: Ionic 7 (Angular) Progressive Web App (mobile-first)
  • Platforms: Web, Android, iOS (cross-platform PWA)
  • Audio Processing: librosa for MFCC feature extraction

Architecture: End-to-End Pipeline

┌─────────────────┐     ┌──────────────┐     ┌─────────────────┐
│  Mobile App     │────▶│  FastAPI     │────▶│  Keras Model    │
│  (Ionic/Angular)│     │  Server      │     │  (CNN)          │
│                 │     │              │     │                 │
│  • Record Audio │     │  • Preprocess│     │  • 50 Classes   │
│  • Upload File  │     │  • Extract   │     │  • Softmax      │
│  • Display      │     │    MFCCs     │     │    Output       │
│    Results      │◀────│  • Predict   │◀────│                 │
└─────────────────┘     └──────────────┘     └─────────────────┘

Flow:

  1. User records/uploads audio via mobile app
  2. App sends audio file to FastAPI /predict endpoint
  3. Server extracts 40 MFCC features using librosa
  4. Features fed to CNN model (model.h5)
  5. Model outputs 50-class probability distribution
  6. Server returns top prediction with confidence
  7. App displays result (e.g., "Frog - 92.3% confidence")
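
The whole round trip can be exercised with a few lines of Python; a sketch assuming the server is running locally, the requests package is installed, and the sample file from the repository's samples folder is at hand:

import requests

# Step 2: send an audio file to the /predict endpoint as multipart/form-data
with open('samples/Frog-Croaking.wav', 'rb') as f:
    resp = requests.post('http://localhost:8000/predict', files={'file': f})

# Steps 6-7: read back the top prediction and its confidence
result = resp.json()['prediction']
print(f"{result['label']} - {result['confidence'] * 100:.1f}% confidence")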

Key Features

Try the Demo: Nature Sounds Live Demo (Coming Soon)

Mobile-First Design

The Ionic app works seamlessly across iOS, Android, and web browsers. Record audio with a single tap, see real-time feedback during recording, and get instant predictions.

50 Sound Categories

The model recognizes a diverse range of sounds:

  • Animals: dog, cat, pig, cow, frog, rooster, hen, sheep, crow, insects
  • Nature: rain, sea waves, crackling fire, crickets, chirping birds, thunderstorm, wind
  • Urban: car horn, siren, engine, train, airplane, helicopter, chainsaw
  • Household: clock tick, keyboard typing, mouse click, door knock, vacuum cleaner, can opening
  • Human: breathing, coughing, sneezing, snoring, crying, laughing, clapping, footsteps

Full category mapping available in esc50.csv.

Real-Time Processing

From audio capture to prediction typically takes under 2 seconds:

  • Recording: instant
  • Upload: ~200ms (local network)
  • Preprocessing: ~100ms
  • Model inference: ~50ms (CPU)
  • Response: ~50ms

Local Inference

No cloud dependencies. The entire system runs on your local machine—ensuring privacy, zero latency, and no API costs.

The CNN Architecture

Unlike image CNNs that operate on pixel grids, our model processes MFCC features—a compact audio representation.

Input: MFCC Feature Extraction

import librosa
import numpy as np

def extract_features(audio_path):
    # Load audio (preserve original sample rate)
    audio, sr = librosa.load(audio_path, sr=None)

    # Extract 40 MFCC coefficients
    mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40)

    # Average across time dimension: (40, T) → (40,)
    mfccs_processed = np.mean(mfccs.T, axis=0)

    # Reshape for CNN: (40,) → (1, 40, 1, 1)
    return mfccs_processed.reshape(1, 40, 1, 1)

Why MFCCs?

  • MFCCs capture timbral characteristics humans use to distinguish sounds
  • Time-averaging handles variable-length audio (5s, 10s, 30s clips all → 40 features)
  • Compact representation reduces model complexity
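
The time-averaging step is easy to see with a toy clip; a quick sketch in which a synthetic 5-second tone stands in for a real recording:

import numpy as np
import librosa

sr = 44100
clip = np.sin(2 * np.pi * 440 * np.arange(5 * sr) / sr)   # 5 s of a 440 Hz tone

mfccs = librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=40)
print(mfccs.shape)                       # (40, T): T grows with clip length
print(np.mean(mfccs.T, axis=0).shape)    # (40,): fixed size for any duration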

Network Architecture

Input: (40, 1, 1)
├─ Conv2D(32 filters, 3×3 kernel, ReLU)
├─ MaxPooling2D(2×1)
├─ Conv2D(64 filters, 3×3 kernel, ReLU)
├─ MaxPooling2D(2×1)
├─ Flatten()
├─ Dense(128, ReLU)
├─ Dropout(0.5)
└─ Dense(50, Softmax)  → 50 probabilities
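
A minimal Keras sketch of this stack (not the exact training notebook; 'same' padding is assumed so that the 3×3 kernels fit the narrow 40×1 input):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(40, 1, 1)),                        # 40 time-averaged MFCCs
    layers.Conv2D(32, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 1)),
    layers.Conv2D(64, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 1)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(50, activation='softmax'),                # one probability per class
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

The compile call mirrors the loss and optimizer listed under Training below.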

Training:

  • Dataset: ESC-50 (2000 labeled samples, 50 classes)
  • Loss: Categorical cross-entropy
  • Optimizer: Adam
  • Epochs: 50
  • Accuracy: ~85-90% (baseline for ESC-50)
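
With MFCC feature arrays X_train/X_test of shape (n, 40, 1, 1) and integer labels y_train/y_test (hypothetical names for the prepared ESC-50 data), training reduces to a single fit call, sketched here:

from tensorflow.keras.utils import to_categorical

history = model.fit(
    X_train, to_categorical(y_train, num_classes=50),
    validation_data=(X_test, to_categorical(y_test, num_classes=50)),
    epochs=50,        # as listed above
    batch_size=32)    # assumed; the write-up does not specify a batch size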

Why this architecture?

  • Two convolutional layers extract hierarchical features
  • Pooling reduces dimensionality while preserving patterns
  • Dropout prevents overfitting on small dataset
  • Shallow enough to run fast on CPU

API Design: FastAPI Server

The FastAPI backend handles audio uploads, preprocessing, and inference.

Key Endpoints

from fastapi import FastAPI, UploadFile, File
from tensorflow.keras.models import load_model
import io
import librosa
import numpy as np

app = FastAPI()
model = load_model('model.h5')

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    # Read uploaded audio
    audio_bytes = await file.read()

    # Decode the audio in memory and extract 40 time-averaged MFCC features
    audio, sr = librosa.load(io.BytesIO(audio_bytes), sr=None)
    mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40)
    features = np.mean(mfccs.T, axis=0).reshape(1, 40, 1, 1)

    # Predict
    prediction = model.predict(features)
    predicted_index = int(np.argmax(prediction))
    confidence = float(np.max(prediction))

    return {
        "prediction": {
            "label": get_label(predicted_index),
            "index": predicted_index,
            "confidence": confidence
        }
    }
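
The get_label helper is not shown above. One plausible implementation, assuming the server can read the ESC-50 metadata file with its target (class index) and category (label) columns:

import pandas as pd

# Build an index -> label lookup once at startup (the esc50.csv path is an assumption)
_meta = pd.read_csv('esc50.csv')
_index_to_label = dict(zip(_meta['target'], _meta['category']))

def get_label(index: int) -> str:
    return _index_to_label.get(index, 'unknown')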

Endpoints:

  • GET / — API info
  • GET /health — Health check
  • GET /model/info — Model metadata
  • POST /predict — Audio classification (accepts multipart/form-data)
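
The auxiliary GET endpoints are thin wrappers; a plausible sketch (the exact payloads in the repository may differ):

@app.get("/")
def root():
    return {"name": "Nature Sounds API", "endpoints": ["/health", "/model/info", "/predict"]}

@app.get("/health")
def health():
    return {"status": "ok"}

@app.get("/model/info")
def model_info():
    return {"input_shape": [40, 1, 1], "num_classes": 50, "dataset": "ESC-50"}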

Example Request:

curl -X POST http://localhost:8000/predict \
  -F "file=@frog_sound.wav"

Example Response:

{
  "id": "a3f8b2c1-4d5e-6f7g-8h9i-0j1k2l3m4n5o",
  "prediction": {
    "label": "frog",
    "index": 8,
    "confidence": 0.9234
  },
  "top_k": [
    {"rank": 1, "index": 8, "label": "frog", "confidence": 0.9234},
    {"rank": 2, "index": 14, "label": "chirping_birds", "confidence": 0.0432},
    {"rank": 3, "index": 29, "label": "rain", "confidence": 0.0187}
  ]
}
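
The top_k list can be derived from the same softmax output; a sketch, assuming probs = prediction[0] is the 50-element probability vector and get_label is the mapping helper shown earlier:

import numpy as np

def top_k(probs, k=3):
    # Indices of the k largest probabilities, best first
    order = np.argsort(probs)[::-1][:k]
    return [
        {"rank": rank, "index": int(i), "label": get_label(int(i)),
         "confidence": float(probs[i])}
        for rank, i in enumerate(order, start=1)
    ]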

Mobile App: User Experience

Built with Ionic/Angular, the app provides an intuitive interface for sound recording and classification.

Core Features

1. Audio Recording

  • Tap microphone icon to start recording
  • Visual timer shows elapsed time
  • Tap again to stop and automatically submit for prediction

2. File Upload

  • Select existing audio files from device storage
  • Supports .wav, .mp3, .m4a, .ogg
  • Preview audio before submission

3. Real-Time Feedback

  • Loading spinner during prediction
  • Results displayed with confidence percentage
  • Clean, card-based UI with sound label in Title Case

TypeScript Implementation

import { HttpClient } from '@angular/common/http';

export class HomePage {
  API_URL = 'http://localhost:8000/predict';
  audioData: Blob | null = null;
  audioChunks: Blob[] = [];
  mediaRecorder: MediaRecorder | null = null;
  label = '';
  confidence = '';
  isRecording = false;
  isLoading = false;

  constructor(private http: HttpClient) {}

  async startRecording() {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    this.audioChunks = [];
    this.mediaRecorder = new MediaRecorder(stream);

    this.mediaRecorder.ondataavailable = (event) => {
      this.audioChunks.push(event.data);
    };

    this.mediaRecorder.onstop = () => {
      this.audioData = new Blob(this.audioChunks, { type: 'audio/wav' });
      this.uploadAudio();
    };

    this.mediaRecorder.start();
    this.isRecording = true;
  }

  uploadAudio() {
    if (!this.audioData) { return; }
    this.isLoading = true;
    const formData = new FormData();
    formData.append('file', this.audioData, 'recording.wav');

    this.http.post(this.API_URL, formData).subscribe({
      next: (response: any) => {
        this.label = this.formatLabel(response.prediction.label);
        this.confidence = (response.prediction.confidence * 100).toFixed(1);
        this.isLoading = false;
      },
      error: (err) => {
        console.error('Prediction error:', err);
        this.isLoading = false;
      }
    });
  }

  // e.g. "chirping_birds" -> "Chirping Birds"
  formatLabel(label: string): string {
    return label
      .split('_')
      .map(w => w.charAt(0).toUpperCase() + w.slice(1))
      .join(' ');
  }
}

The ESC-50 Dataset: Training Ground

The model is trained on ESC-50, a carefully curated dataset for environmental sound classification.

Dataset Statistics:

  • 2000 audio files (40 examples per class)
  • 50 classes organized into 5 categories
  • 5-second clips at 44.1kHz
  • Balanced distribution (equal samples per class)

Example Categories:

  • Animals: dog, rooster, pig, cow, frog, cat, hen, insects, sheep, crow
  • Natural soundscapes: rain, sea waves, crackling fire, crickets, chirping birds, water drops, wind, pouring water, toilet flush, thunderstorm
  • Human non-speech: crying baby, sneezing, clapping, breathing, coughing, footsteps, laughing, brushing teeth, snoring, drinking/sipping
  • Interior/domestic: door wood knock, mouse click, keyboard typing, door wood creaks, can opening, washing machine, vacuum cleaner, clock alarm, clock tick, glass breaking
  • Exterior/urban: helicopter, chainsaw, siren, car horn, engine, train, church bells, airplane, fireworks, hand saw

Training Split:

  • 80% training (1600 samples)
  • 20% testing (400 samples)
  • 5-fold cross-validation used in research
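
One way to produce such a split, sketched here under the assumption that the standard ESC-50 metadata columns (filename, fold, category) are available and scikit-learn is installed:

import pandas as pd
from sklearn.model_selection import train_test_split

meta = pd.read_csv('esc50.csv')

# 80/20 split, stratified so every class keeps its balanced representation;
# the 'fold' column also carries the official 5-fold cross-validation assignment
train_meta, test_meta = train_test_split(
    meta, test_size=0.2, stratify=meta['category'], random_state=42)

print(len(train_meta), len(test_meta))   # 1600 400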

Reference: K. J. Piczak. ESC: Dataset for Environmental Sound Classification. Proceedings of the 23rd Annual ACM Conference on Multimedia, Brisbane, Australia, 2015.

Performance and Limitations

Model Performance

Accuracy: ~85-90% on test set (typical for ESC-50 baseline CNNs)

Best Performing Classes:

  • Clock tick (97%)
  • Rooster (93%)
  • Church bells (91%)

Challenging Classes:

  • Engine vs. Train (similar spectral patterns)
  • Crickets vs. Insects (overlapping features)
  • Breathing vs. Snoring (subtle differences)

Inference Speed:

  • Model loading: ~2s (one-time)
  • Feature extraction: ~100ms
  • Prediction: ~50ms
  • Total per-request: ~150ms
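
These numbers are straightforward to reproduce; a rough sketch using time.perf_counter, assuming the extract_features helper shown earlier and a local model.h5:

import time
from tensorflow.keras.models import load_model

t0 = time.perf_counter()
model = load_model('model.h5')                              # one-time model loading
print(f"load: {time.perf_counter() - t0:.2f}s")

t0 = time.perf_counter()
features = extract_features('samples/Frog-Croaking.wav')    # MFCC extraction
t1 = time.perf_counter()
prediction = model.predict(features)                        # CNN inference
t2 = time.perf_counter()
print(f"features: {(t1 - t0) * 1000:.0f}ms, inference: {(t2 - t1) * 1000:.0f}ms")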

Future Improvements

  • Multi-label classification: Detect multiple simultaneous sounds
  • Transfer learning: Use pre-trained audio embeddings (YAMNet, AudioSet)
  • Confidence calibration: More reliable uncertainty estimates
  • On-device inference: TensorFlow.js or TFLite for offline mode
  • Streaming support: Real-time classification from microphone input

Why This Matters

Sound recognition technology has transformative potential:

Environmental Monitoring

  • Wildlife conservation: Automated species detection in rainforests
  • Urban planning: Noise pollution mapping in cities
  • Climate research: Tracking ecosystem changes through soundscapes

Nature Sounds is a small step toward making these applications accessible to everyone.

Getting Started

Quick Start (3 Commands)

# 1. Start the API server
cd PredictionsAPI && python fastapi-app.py

# 2. Launch the mobile app (separate terminal)
cd NatureSoundsApp && ionic serve

# 3. Test with sample audio
curl -X POST http://localhost:8000/predict -F "file=@samples/Frog-Croaking.wav"

Prerequisites

API Server:

  • Python 3.11+
  • TensorFlow/Keras
  • librosa, FastAPI, uvicorn

Mobile App:

  • Node.js 18+
  • Ionic CLI (npm install -g @ionic/cli)

Model Training (Google Colab):

  • TensorFlow 2.x
  • ESC-50 dataset

Installation

# Clone repository
git clone https://github.com/jalle007/NatureSounds.git
cd NatureSounds

# Setup API
cd PredictionsAPI
python -m venv env
env\Scripts\activate  # Windows
source env/bin/activate  # macOS/Linux
pip install -r requirements.txt

# Setup app
cd ../NatureSoundsApp
npm install

Training Your Own Model

  1. Open NatureSoundsModel/NatureSoundsModel.ipynb in Google Colab
  2. Mount Google Drive and upload ESC-50 dataset
  3. Run all cells to train model (50 epochs, ~30 minutes)
  4. Download model.h5 and place in NatureSoundsModel/ directory
  5. Restart API server to load new model

GitHub Repository

https://github.com/jalle007/NatureSounds

Full source code, documentation, and sample audio files available.

Closing Thoughts

Sound recognition is a hard problem—harder than image recognition in many ways. Variable lengths, background noise, overlapping sources, and the need for careful feature engineering all conspire to make it challenging.

But Nature Sounds proves that with modern deep learning, accessible tools, and careful design, we can build practical systems that work. From identifying birds in your backyard to detecting fire alarms for the hearing impaired, the applications are endless.

This is just the beginning. I hope this project inspires others to explore acoustic AI, contribute improvements, and build applications that make the world more accessible and connected.

The sounds around us tell stories. Let's learn to listen.


Made with 🎵 for sound classification and environmental awareness
