Automatic Transcriptions
This repository serves as a central index for automatically transcribed resources in Manx, with a particular focus on transparency, traceability, and reproducibility. It is designed to bring together a wide range of transcriptions from diverse sources, while clearly distinguishing between real and synthetic data.
None of the original audio files will be uploaded here due to file size constraints and the potential for copyright infringement or restrictive licensing conditions. Where the original recordings are publicly available, a link to the source will be provided alongside the transcriptions, usually in a metadata.tsv file.
Purpose
The goal of this repository is to:
- Provide open access to automatically generated Manx transcriptions
- Track the origin and method of each transcription (e.g., model used, data type)
- Facilitate the reuse of these transcriptions for downstream NLP tasks
- Make a clear distinction between real-world audio and synthetic inputs, enabling responsible and informed research use
Model
All automatic transcriptions in this repository were generated using a DNN-HMM hybrid model trained on the Loayr dataset — a curated speech corpus of Manx containing diverse and domain-rich content. The hybrid model was developed using the Kaldi toolkit and fine-tuned specifically for Manx.
Repository Structure
The repository will be organized by:
- Source domain (e.g. podcasts, educational audio, folklore)
- Corpus (e.g. Manx Radio Broadcasts)
- Transcription format (e.g.
.txt
,.srt
)
Each corpus will include:
- Metadata indicating the original audio source
- Details about the transcription process
- Relevant statistics
Why include synthetic data?
Synthetic data can be useful for several tasks:
- 🗣 Text-to-Speech (TTS) training: Paired synthetic audio and text can augment low-resource datasets.
- 📝 Automatic subtitling: Transcripts aligned to speech can help subtitle previously uncaptioned Manx content.
- 🔄 Data augmentation: For bootstrapping more robust ASR models through multi-condition training.
📣 Contributions of new automatically transcribed resources — real or synthetic — are welcome. Please open an issue or pull request with your additions.