A Collection of Tools to Help Shape the Digital Future of the Manx Language

Automatic Transcriptions

This repository serves as a central index for automatically transcribed resources in Manx, with a particular focus on transparency, traceability, and reproducibility. It is designed to bring together a wide range of transcriptions from diverse sources, while clearly distinguishing between real and synthetic data.

The original audio files are not uploaded here, both because of file size constraints and because they may be subject to copyright or restrictive licensing conditions. Where the original recordings are publicly available, a link to the source is provided alongside the transcriptions, usually in a metadata.tsv file.
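As a rough illustration of how a metadata.tsv file might be consumed, the sketch below reads tab-separated rows and separates real from synthetic entries. The column names (utterance_id, source_url, data_type) and the sample rows are assumptions made for this example only; the actual files in this repository may use a different layout.

```python
import csv
import io

# Hypothetical metadata.tsv content. The column names and rows are
# illustrative assumptions, not the repository's confirmed schema.
SAMPLE_TSV = (
    "utterance_id\tsource_url\tdata_type\n"
    "utt001\thttps://example.org/recording1\treal\n"
    "utt002\t\tsynthetic\n"
)


def load_metadata(handle):
    """Read a tab-separated metadata file into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def split_by_type(rows):
    """Separate rows marked as real from rows marked as synthetic."""
    real = [r for r in rows if r["data_type"] == "real"]
    synthetic = [r for r in rows if r["data_type"] == "synthetic"]
    return real, synthetic


rows = load_metadata(io.StringIO(SAMPLE_TSV))
real, synthetic = split_by_type(rows)

for row in real:
    # Only real recordings are expected to carry a source link.
    print(row["utterance_id"], row["source_url"])
```

To read a file from disk instead of the in-memory sample, pass an open file handle, for example `open("metadata.tsv", newline="", encoding="utf-8")`, to `load_metadata`.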

Purpose

The goal of this repository is to:

Model

All automatic transcriptions in this repository were generated using a DNN-HMM hybrid model trained on the Loayr dataset, a curated Manx speech corpus covering diverse content across multiple domains. The hybrid model was developed using the Kaldi toolkit and fine-tuned specifically for Manx.

Repository Structure

The repository will be organized by:

Each corpus will include:

Why include synthetic data?

Synthetic data can be useful for several tasks:

📣 Contributions of new automatically transcribed resources, whether real or synthetic, are welcome. Please open an issue or pull request with your additions.