Speech Transcription Data

This page brings together information and links related to Manx speech transcription resources, with a particular emphasis on open evaluation sets and research-ready data. The goal is to support the development of speech technologies for Manx, including automatic speech recognition (ASR), text-to-speech (TTS), and machine translation (MT).

The original audio files are not hosted here, both because of file size constraints and because some recordings may be under copyright or restrictive licensing terms. Where the original recordings are publicly available, a link to the source is provided alongside the transcriptions, usually in a metadata.tsv file.
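As a rough illustration of how a metadata.tsv file might be consumed, the sketch below reads tab-separated rows and keeps those that link to a public recording. The column names (utterance_id, transcription, source_url) and the sample rows are assumptions made for this example only; check the header of each dataset's own metadata file for the actual fields.

```python
import csv
import io

# Hypothetical sample. The real column names and values may differ per dataset.
SAMPLE_TSV = (
    "utterance_id\ttranscription\tsource_url\n"
    "utt_0001\t<transcription text>\thttps://example.org/recording-1\n"
    "utt_0002\t<transcription text>\t\n"
)


def load_metadata(handle):
    """Read tab-separated metadata rows into a list of dictionaries."""
    # QUOTE_NONE treats quote characters in transcriptions as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def rows_with_source(rows, url_column="source_url"):
    """Keep only rows whose source link is present and non-empty."""
    return [row for row in rows if (row.get(url_column) or "").strip()]


rows = load_metadata(io.StringIO(SAMPLE_TSV))
linked = rows_with_source(rows)
print(f"{len(linked)} of {len(rows)} rows link to a public recording")
```

To read a real file, the same function can be given a handle from open(path, encoding="utf-8", newline="") in place of the in-memory sample.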

Why Build Manx Speech Datasets?

Manx is a low-resource language, making the development of reliable speech and language technologies difficult without carefully curated data. Publicly available transcription datasets help by:

🗣️ Supporting ASR development:
Train and evaluate systems that convert Manx speech into text
🔊 Improving TTS systems:
Model Manx pronunciation and prosody for natural-sounding synthesis
🌍 Advancing MT and S2T tasks:
Enable translation of Manx speech or text into English (and vice versa)
🧭 Benchmarking Spoken Language Identification (SpokenLID):
Evaluate whether systems can correctly identify Manx in multilingual or code-switched contexts

Featured Dataset: Loayr

Loayr is the first segmented speech corpus for Manx, offering manually validated and automatically segmented transcriptions across a range of domains. It supports robust ASR evaluation and is split into training, development, and test sets with consistent metadata and formatting.

For detailed information, data format, statistics, and experimental results, visit the Loayr repository.
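ASR evaluation on a test set is commonly reported as word error rate (WER). The sketch below shows one way to compute corpus-level WER from pairs of reference and hypothesis transcriptions. It is a minimal illustration, not the scoring procedure used in the Loayr experiments: tokenisation is plain whitespace splitting, no text normalisation is applied, and the placeholder tokens (w1, w2, ...) stand in for real transcriptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(pairs):
    """Corpus-level WER: total word edits divided by total reference words."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        ref_tokens = reference.split()
        hyp_tokens = hypothesis.split()
        errors += edit_distance(ref_tokens, hyp_tokens)
        words += len(ref_tokens)
    if words == 0:
        raise ValueError("no reference words to score against")
    return errors / words


# Placeholder (reference, hypothesis) pairs, not real Manx transcriptions.
pairs = [
    ("w1 w2 w3 w4", "w1 w2 w4"),  # one deleted word
    ("w5 w6", "w5 w7"),           # one substituted word
]
print(f"WER: {word_error_rate(pairs):.3f}")
```

Summing edits and reference words over the whole set before dividing weights longer utterances more heavily than averaging per-utterance rates would, which is the usual convention for reported WER.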

Contributing or Requesting Data

Due to size and licensing restrictions, this repository does not host the audio files directly. To request data access or to contribute new speech recordings, please contact:

📧 csjbartley1@sheffield.ac.uk

We welcome:

Transcribed Manx recordings
Cleaned or aligned text
Metadata corrections
Evaluation feedback from model developers

Licensing and Attribution

All data referenced in this project has been sourced from publicly available materials and is distributed in accordance with the original creators’ licensing terms. For detailed attribution, consult the metadata files in each dataset.