Deep Learning · NLP

Speech Emotion Recognition

A deep learning system that recognises human emotions from raw audio, built on the wav2vec2 transformer architecture.


85–92%

Target Accuracy

960h

Pre-training Audio

<500ms

Inference Latency

0

Manual Features Needed

The Problem

Call centres and mental health applications need to detect emotional states from voice in real time. Traditional handcrafted audio features (MFCCs, pitch) plateau at ~70% accuracy and struggle with speaker variability.

The Solution

Fine-tuned Facebook's wav2vec2 transformer (pre-trained on 960 hours of unlabelled speech) on labelled emotion datasets. The model learns rich contextual audio representations directly from waveforms, bypassing fragile feature engineering. PyTorch Lightning manages training with gradient accumulation and mixed precision.
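A minimal sketch of the classification setup using the Hugging Face Transformers API. The four-class label set is a hypothetical example; the model is built from a fresh config here so the snippet runs offline, whereas a real run would load the 960h-pretrained `facebook/wav2vec2-base` checkpoint:

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2ForSequenceClassification

# Hypothetical label set; the source does not name the exact emotion classes.
EMOTIONS = ["angry", "happy", "neutral", "sad"]

# A real run would load the pre-trained backbone instead:
#   model = Wav2Vec2ForSequenceClassification.from_pretrained(
#       "facebook/wav2vec2-base", num_labels=len(EMOTIONS))
# Here the same architecture is built from a config so the sketch runs offline.
config = Wav2Vec2Config(num_labels=len(EMOTIONS))
model = Wav2Vec2ForSequenceClassification(config).eval()

# Raw 16 kHz waveform in, one logit per emotion out: no MFCCs, no pitch tracks.
waveform = torch.randn(1, 16000)  # batch of one, 1 s of dummy audio
with torch.no_grad():
    logits = model(waveform).logits

probs = torch.softmax(logits, dim=-1)
print(dict(zip(EMOTIONS, probs[0].tolist())))
```

Fine-tuning then amounts to training this sequence-classification head (and optionally the transformer layers) with cross-entropy on labelled emotion clips.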

Results & Metrics

  • Target accuracy range of 85–92% across emotion classes
  • End-to-end learning from raw audio, with no manual feature engineering
  • Robust to speaker variability via pre-trained representations
  • Supports real-time inference for live call monitoring
  • Deployable as a REST API with sub-second latency
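The sub-second latency claim can be sanity-checked with a simple timing harness. This sketch times a randomly initialised, base-sized wav2vec2 model on CPU (both assumptions), so the printed number is indicative only:

```python
import time
import torch
from transformers import Wav2Vec2Config, Wav2Vec2ForSequenceClassification

# Randomly initialised stand-in for the fine-tuned model
# (assumption: base-sized architecture, four emotion classes).
model = Wav2Vec2ForSequenceClassification(Wav2Vec2Config(num_labels=4)).eval()

waveform = torch.randn(1, 16000)  # 1 s of 16 kHz audio
with torch.no_grad():
    model(waveform)  # warm-up pass (first call pays one-off allocation costs)
    start = time.perf_counter()
    logits = model(waveform).logits
    latency_ms = (time.perf_counter() - start) * 1000

print(f"CPU latency for 1 s of audio: {latency_ms:.0f} ms")
```

In a REST deployment the same forward pass sits behind the endpoint, so per-request latency is this inference time plus serialisation and network overhead.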

Tech Stack

wav2vec2 · PyTorch · Transformers · Hugging Face · LibROSA · Python