Speech Emotion Recognition
A deep learning system that recognises human emotions from raw audio using the wav2vec2 transformer architecture.
- Target Accuracy: 85–92%
- Pre-training Audio: 960h
- Inference Latency: <500ms
- Manual Features Needed: 0
The Problem
Call centres and mental health applications need to detect emotional states from voice in real time. Traditional handcrafted audio features (MFCCs, pitch) plateau at ~70% accuracy and struggle with speaker variability.
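For contrast, here is a minimal sketch of the handcrafted-feature baseline described above; librosa and the exact feature set are illustrative assumptions, not details from the project.

```python
# Baseline sketch of the handcrafted-feature approach this project moves
# beyond; librosa and the specific features are illustrative assumptions.
import numpy as np
import librosa

def handcrafted_features(path: str) -> np.ndarray:
    # Load the clip resampled to 16 kHz mono.
    y, sr = librosa.load(path, sr=16000)
    # 13 MFCCs averaged over time, plus mean fundamental frequency (pitch).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)
    pitch = float(np.nanmean(f0)) if np.any(~np.isnan(f0)) else 0.0
    # A fixed 14-dimensional vector per clip.
    return np.concatenate([mfcc, [pitch]])
```

Collapsing each clip to a fixed vector like this discards temporal context, which is part of why such features plateau and generalise poorly across speakers.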
The Solution
Fine-tuned Facebook's wav2vec2 transformer (pre-trained on 960 hours of unlabelled speech) on labelled emotion datasets. The model learns rich contextual audio representations directly from waveforms, bypassing fragile feature engineering. PyTorch Lightning manages training with gradient accumulation and mixed precision.
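A condensed sketch of that training setup follows; the checkpoint name, label count, and hyperparameters are illustrative assumptions, not the project's exact configuration.

```python
# Fine-tuning sketch: model name, num_emotions, and hyperparameters below
# are illustrative assumptions, not the project's actual configuration.
import torch
import pytorch_lightning as pl
from transformers import Wav2Vec2ForSequenceClassification

class EmotionClassifier(pl.LightningModule):
    def __init__(self, num_emotions: int = 8, lr: float = 1e-5):
        super().__init__()
        # Load the wav2vec2 base model pre-trained on 960h of speech and
        # attach a freshly initialised classification head for the emotions.
        self.model = Wav2Vec2ForSequenceClassification.from_pretrained(
            "facebook/wav2vec2-base-960h", num_labels=num_emotions
        )
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # batch: raw 16 kHz waveforms plus integer emotion labels.
        out = self.model(input_values=batch["input_values"],
                         labels=batch["labels"])
        self.log("train_loss", out.loss)
        return out.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)

trainer = pl.Trainer(
    max_epochs=10,
    accumulate_grad_batches=4,   # effective batch = 4 x dataloader batch size
    precision="16-mixed",        # automatic mixed precision
)
# trainer.fit(EmotionClassifier(), train_loader)  # train_loader yields
# dicts of "input_values" and "labels" from the labelled emotion dataset.
```

Because gradient accumulation and mixed precision are Trainer flags, the LightningModule itself stays free of accumulation and AMP plumbing.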
Results & Metrics
- Target accuracy range of 85–92% across emotion classes
- End-to-end learning from raw audio: no manual feature engineering
- Robust to speaker variability via pre-trained representations
- Supports real-time inference for live call monitoring
- Deployable as a REST API with sub-second latency (see the serving sketch below)
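A minimal sketch of such an endpoint, assuming a FastAPI service and a hypothetical fine-tuned checkpoint directory `checkpoints/emotion-wav2vec2`; the project's actual serving stack is not specified in this summary.

```python
# REST inference sketch; FastAPI, the checkpoint path, and the label set
# are assumptions for illustration, not the project's actual deployment.
import io
import torch
import soundfile as sf
from fastapi import FastAPI, File, UploadFile
from transformers import (Wav2Vec2FeatureExtractor,
                          Wav2Vec2ForSequenceClassification)

MODEL_DIR = "checkpoints/emotion-wav2vec2"  # hypothetical fine-tuned model
model = Wav2Vec2ForSequenceClassification.from_pretrained(MODEL_DIR).eval()
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_DIR)
app = FastAPI()

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    # Decode the uploaded clip and collapse stereo to mono if needed.
    audio, sr = sf.read(io.BytesIO(await file.read()))
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    # Clients are assumed to send 16 kHz audio, matching the extractor's
    # expected rate; a production service would resample here.
    inputs = extractor(audio, sampling_rate=sr, return_tensors="pt")
    with torch.inference_mode():
        logits = model(inputs.input_values).logits
    label_id = int(logits.argmax(dim=-1))
    return {"emotion": model.config.id2label[label_id]}
```

Keeping the model loaded once at startup, rather than per request, is what makes the sub-second latency target plausible on modest hardware.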