NVIDIA Releases Granary Dataset and Models for Multilingual Speech AI
Executive Summary
NVIDIA has released Granary, a massive open-source dataset with approximately one million hours of audio for training speech AI. The company also launched two new models trained on this dataset: the high-accuracy Canary-1b-v2 and the high-throughput Parakeet-tdt-0.6b-v3. This initiative aims to accelerate the development of production-scale automatic speech recognition (ASR) and automatic speech translation (AST) for 25 European languages, including those with limited existing data.
Key Takeaways
* Granary Dataset: An open-source corpus of ~1 million audio hours, consisting of ~650,000 hours for speech recognition and ~350,000 hours for speech translation, covering 25 European languages.
* Data Processing Innovation: The dataset was created using the NVIDIA NeMo Speech Data Processor toolkit, which turned unlabeled audio into high-quality training data without human annotation.
* Canary-1b-v2 Model: A 1-billion-parameter model optimized for high-quality transcription and translation between English and the 25 supported languages. It reportedly delivers quality comparable to models three times its size while running inference up to 10x faster.
* Parakeet-tdt-0.6b-v3 Model: A streamlined 600-million-parameter model designed for real-time, high-throughput transcription. It automatically detects the input language and can process long audio segments in a single pass.
* Target Audience: AI developers building multilingual applications such as chatbots, customer service voice agents, and near-real-time translation services.
* Availability: The Granary dataset and both the Canary and Parakeet models are available now on Hugging Face; a loading sketch follows this list. The Canary-1b-v2 model is released under a permissive license.
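For developers who want to experiment with the release, the Python sketch below shows one plausible way to load the models through the NVIDIA NeMo toolkit and to stream a slice of the Granary corpus with the Hugging Face `datasets` library. The repository IDs, dataset configuration name, and audio file path are assumptions inferred from the release naming, not confirmed identifiers from the announcement, so they should be verified against the model and dataset cards on Hugging Face.

```python
# Minimal sketch, assuming the Hugging Face repo IDs "nvidia/canary-1b-v2",
# "nvidia/parakeet-tdt-0.6b-v3", and "nvidia/Granary" (unverified) and a
# local placeholder audio file. Requires:
#   pip install "nemo_toolkit[asr]" datasets
import nemo.collections.asr as nemo_asr
from datasets import load_dataset

# High-accuracy multilingual ASR/AST model.
canary = nemo_asr.models.ASRModel.from_pretrained("nvidia/canary-1b-v2")

# High-throughput transcription model with automatic language detection.
parakeet = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")

# Transcribe a local audio file (placeholder path). Depending on the NeMo
# version, the result is a list of strings or of hypothesis objects.
transcripts = parakeet.transcribe(["sample_audio.wav"])
print(transcripts[0])

# Stream a small sample of the Granary corpus for inspection; the config
# name and split are illustrative, not confirmed.
granary = load_dataset("nvidia/Granary", "bg", split="train", streaming=True)
print(next(iter(granary)))
```

Streaming mode is used here because a corpus of this scale is impractical to download in full just to inspect a few examples; a production pipeline would typically select specific language configurations and cache only what it needs.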
Strategic Importance
This release lowers the barrier for creating high-performance speech AI in underserved languages, positioning NVIDIA's NeMo ecosystem as a foundational tool for the global developer community and reinforcing its leadership in AI infrastructure.