Got it! I will research open-source machine translation models that support fine-tuning, specifically for English to Neo-Aramaic (both Eastern and Western varieties). I'll also look for parallel corpus datasets that can be used for training. Expect recommendations on repositories that provide training pipelines and tools to make the process smoother for you.
I'll get back to you with my findings soon!
# Open-Source MT Models & Frameworks for English→Neo-Aramaic
Machine translation for **Neo-Aramaic** (modern Aramaic dialects) can leverage several open-source frameworks that support custom training or fine-tuning. The following frameworks and models are readily available for **English→Neo-Aramaic** translation:
- **Helsinki-NLP MarianMT (OPUS-MT)** – MarianMT is a C++-based NMT framework used by the OPUS-MT project. The University of Helsinki provides pre-trained models on OPUS data. Notably, they released a **multilingual Bible translator model** that covers 1,300+ languages (including **Assyrian Neo-Aramaic** with Syriac script) ([Helsinki-NLP/opus-mt-tc-bible-big-mul-mul · Hugging Face](https://huggingface.co/Helsinki-NLP/opus-mt-tc-bible-big-mul-mul#:~:text=,gor%20gos%20got%20grc%20grn)) ([Helsinki-NLP/opus-mt-tc-bible-big-mul-mul · Hugging Face](https://huggingface.co/Helsinki-NLP/opus-mt-tc-bible-big-mul-mul#:~:text=,bik%20bis%20bod%20bom_Latn%20bos_Cyrl)). This Marian transformer model can translate from English into Neo-Aramaic, though performance is limited by scarce training data. It can be fine-tuned further on additional Neo-Aramaic data using the Marian toolkit or via Hugging Face Transformers (Marian models are available on Hugging Face). For example, a community-trained Marian model for English→Assyrian Neo-Aramaic (`mt-empty/english-assyrian`) was built by fine-tuning Helsinki’s English→Arabic model on Aramaic Bible texts ([GitHub - mt-empty/assyrian-translation-model: Assyrian translation model using huggingface](https://github.com/mt-empty/assyrian-translation-model#:~:text=This%20is%20an%20English%20to,be%20read%20by%20inexperienced%20developers)) ([GitHub - mt-empty/assyrian-translation-model: Assyrian translation model using huggingface](https://github.com/mt-empty/assyrian-translation-model#:~:text=The%20dataset%20are%20sourced%20from%3A)). This model (hosted on Hugging Face) demonstrates the approach of starting with a related high-resource pair and **fine-tuning** on Neo-Aramaic data, achieving a BLEU score of ~33 after training ([GitHub - mt-empty/assyrian-translation-model: Assyrian translation model using huggingface](https://github.com/mt-empty/assyrian-translation-model#:~:text=Evaluation)). Marian’s training pipeline (OPUS-MT-train) and documentation can be used with minimal modifications, since it supports training on custom parallel corpora ([Helsinki-NLP/opus-mt-tc-bible-big-mul-mul · Hugging Face](https://huggingface.co/Helsinki-NLP/opus-mt-tc-bible-big-mul-mul#:~:text=This%20model%20is%20part%20of,train.%20Model%20Description)).
- **Fairseq (PyTorch)** – Facebook AI’s Fairseq library is a powerful seq2seq training framework. While Fairseq doesn’t have a built-in Neo-Aramaic model, it allows you to train your own transformer or leverage multilingual models. Notably, **Meta’s NLLB-200 (No Language Left Behind)** and **M2M-100** models were developed in Fairseq and cover many low-resource languages. **NLLB-200** is a massive many-to-many model covering 200 languages (reports indicate it includes **Assyrian Neo-Aramaic**) ([Models - Hugging Face](https://huggingface.co/models?language=ain#:~:text=Models%20,v1.%20Translation%20%E2%80%A2%20Updated)). You could use the released NLLB model (available on Hugging Face as `facebook/nllb-200-distilled-600M`) and fine-tune it on English–Neo-Aramaic data to specialize it. Similarly, **M2M-100** (a 100-language multilingual model) could be tried, though it may not include Aramaic by default. If starting from scratch, Fairseq provides examples for low-resource translation – you prepare parallel data, binarize it, and train a Transformer model via the CLI; Fairseq’s documentation and examples guide this process. (See also the Medium tutorial *“Working with low-resource languages in Fairseq”* for tips on training a translator between English and a low-resource language like Neo-Aramaic.)
- **OpenNMT (TensorFlow/PyTorch)** – OpenNMT is an open-source toolkit specifically designed for easy training of translation models. It has **ready-made training pipelines** that require minimal coding. You provide your English–Neo-Aramaic sentence pairs as input, tokenize them, and run the training command. OpenNMT-py’s [Quickstart guide](https://opennmt.net/OpenNMT-py/quickstart/) walks through an example (English–German) which can be adapted to Neo-Aramaic. The OpenNMT docs note that *“Neural Machine Translation (NMT) is the default task… It requires a corpus of bilingual sentences (for instance from OPUS) for a very large variety of language pairs”* ([Applications - OpenNMT](https://opennmt.net/OpenNMT/applications/#:~:text=Neural%20Machine%20Translation%20,of%20domains%20and%20language%20pairs)). This means you can plug in an English–Neo-Aramaic parallel corpus (see datasets below) and train a Transformer or RNN model. OpenNMT supports **transfer learning** as well – e.g. initialize with a model trained on a related high-resource language pair, then continue training on the low-resource pair. There is community discussion on using transfer learning in OpenNMT for low-resource cases (e.g. fine-tuning a Russian–English model on a smaller Russian–Abkhaz corpus) ([Are you interested in training Russian-Abkhazian parallel corpus?](https://forum.opennmt.net/t/are-you-interested-in-training-russian-abkhazian-parallel-corpus/3467#:~:text=Are%20you%20interested%20in%20training,translate%20more%20sentences%20for%20you)) ([Transfer learning · Issue #2149 · OpenNMT/OpenNMT-py - GitHub](https://github.com/OpenNMT/OpenNMT-py/issues/2149#:~:text=Transfer%20learning%20%C2%B7%20Issue%20,scale%20parallel%20corpus%20training)), which is analogous to using an English–Arabic model for English–Neo-Aramaic.
- **Hugging Face Transformers** – Many of the above models (MarianMT, NLLB, M2M100, etc.) are integrated into Hugging Face’s `transformers` library. This means you can load a pre-trained translation model and fine-tune it using the high-level Trainer API. For example, you could load Helsinki’s `opus-mt-tc-big-en-ar` (English–Arabic) model as a starting point and fine-tune on English–Neo-Aramaic data, similar to what the `mt-empty/english-assyrian` project did ([GitHub - mt-empty/assyrian-translation-model: Assyrian translation model using huggingface](https://github.com/mt-empty/assyrian-translation-model#:~:text=This%20is%20an%20English%20to,be%20read%20by%20inexperienced%20developers)). Hugging Face provides [guides and scripts](https://huggingface.co/docs/transformers/tasks/translation) for fine-tuning Seq2Seq models. Using a Transformers pipeline can be very convenient if you have your data in hand – you’d tokenize the parallel texts, then train for a few epochs on a GPU. The mentioned `mt-empty/assyrian-translation-model` repository on GitHub even provides a simple training script (`model.py`) using 🤗 Transformers to reproduce their English→Assyrian model ([GitHub - mt-empty/assyrian-translation-model: Assyrian translation model using huggingface](https://github.com/mt-empty/assyrian-translation-model#:~:text=To%20Train)). This approach is accessible and avoids having to compile Marian C++ or dive deep into Fairseq configs. (A minimal fine-tuning sketch follows this list.)
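To make this concrete, here is a minimal fine-tuning sketch of the Hugging Face approach described above. It assumes the `Helsinki-NLP/opus-mt-tc-big-en-ar` checkpoint as the starting point and line-aligned placeholder files `train.en` / `train.aii`; the hyperparameters and output path are illustrative, not tuned values.

```python
# Minimal sketch: fine-tune a pre-trained Marian English->Arabic model on an
# English->Neo-Aramaic parallel corpus with Hugging Face Transformers.
# File names, output paths, and hyperparameters are placeholders.
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)
from datasets import Dataset

base = "Helsinki-NLP/opus-mt-tc-big-en-ar"   # related high-resource pair used as the starting point
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)

# Line-aligned placeholder files: one English sentence per line in train.en,
# its Neo-Aramaic translation on the same line number in train.aii.
with open("train.en", encoding="utf-8") as f_en, open("train.aii", encoding="utf-8") as f_aii:
    pairs = [{"en": e.strip(), "aii": a.strip()} for e, a in zip(f_en, f_aii)]
dataset = Dataset.from_list(pairs)

def preprocess(batch):
    # Tokenize the English source and the Neo-Aramaic target. The Arabic-trained
    # subword vocabulary may segment Syriac script poorly (mostly unknown pieces),
    # in which case the tokenizer needs to be extended or retrained.
    model_inputs = tokenizer(batch["en"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["aii"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=["en", "aii"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="en-aii-marian",       # placeholder output path
        per_device_train_batch_size=16,
        learning_rate=2e-5,
        num_train_epochs=10,              # small corpora often need many epochs; watch for overfitting
        save_total_limit=2,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("en-aii-marian")
```

Because the base tokenizer was trained on Arabic, check early on how it segments Syriac-script text; if most tokens come out as unknowns, extend or retrain the subword vocabulary before committing to long training runs.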
**Summary:** For ease of use, starting with **MarianMT via Hugging Face** (or OpenNMT) is recommended – you get a working English→Neo-Aramaic model with existing architectures and only need to fine-tune with your data. For more advanced experimentation or potentially better multilingual transfer, consider **Fairseq** with a model like NLLB. All of these frameworks are open-source and well-documented, so you won’t need extensive code modifications – just prepare your data and follow the training recipes.
## Parallel Corpora & Datasets for English–Neo-Aramaic
Gathering a quality parallel corpus is the biggest challenge for Neo-Aramaic MT, since these languages are **extremely low-resource**. However, there are a few existing datasets and sources that can be utilized:
- **Bible Translations:** The Bible is one of the few texts translated into various Neo-Aramaic dialects. For **Eastern Neo-Aramaic** (e.g. Assyrian Neo-Aramaic, known as Suret or Lishana Aturaya), there exist translated New Testament and whole-Bible editions. These are a prime source of parallel data. In fact, the OPUS project’s *Bible Corpus* (based on Christodouloupoulos and Steedman’s collection) includes Assyrian Neo-Aramaic (ISO 639-3: `aii`) aligned with English ([[PDF] A Multilingual Dataset for Text Classification in 1500 Languages](https://arxiv.org/pdf/2305.08487#:~:text=Languages%20arxiv,08)) ([[PDF] GlotLID: Language Identification for Low-Resource ... - ACL Anthology](https://aclanthology.org/2023.findings-emnlp.410.pdf#:~:text=Anthology%20aclanthology,3e)). Researchers have noted that *“obviously the New Testament would be great”* as a parallel dataset for Assyrian ([Auto-Translator for Preserving a Semitic Language : r/LanguageTechnology](https://www.reddit.com/r/LanguageTechnology/comments/qyfyez/autotranslator_for_preserving_a_semitic_language/#:~:text=As%20for%20the%20data%20set%2C,so%20one%20could%20start%20there)). You can obtain such data through OPUS (e.g. by downloading the aligned verses in English and `aii` Syriac script). Similarly, for **Western Neo-Aramaic** (the dialect of Ma’loula, Syria, ISO `amw`), the New Testament has been translated by the **Aramaic Bible Translation (ABT)** organization. ABT spent over a decade translating the Bible into Maalouli Western Neo-Aramaic ([Western Neo-Aramaic - Wikipedia](https://en.wikipedia.org/wiki/Western_Neo-Aramaic#:~:text=Aramaic%20Bible%20Translation%20,the%20Book%20of%20Psalms%20and)). Portions like the Book of Psalms and the Gospels are available on their site Rinyo.org ([Western Neo-Aramaic - Wikipedia](https://en.wikipedia.org/wiki/Western_Neo-Aramaic#:~:text=Aramaic%20Bible%20Translation%20,the%20Book%20of%20Psalms%20and)). These biblical texts (with corresponding English verses from standard Bible translations) form a valuable parallel corpus. Even if the full text isn’t directly downloadable, one can align verses from published Western Neo-Aramaic scriptures with English. In summary, *religious texts provide a ready-made parallel corpus* for both Eastern and Western Neo-Aramaic – typically tens of thousands of verse pairs (e.g. ~10k+ verse pairs for the New Testament ([[PDF] GlotLID: Language Identification for Low-Resource ... - ACL Anthology](https://aclanthology.org/2023.findings-emnlp.410.pdf#:~:text=Anthology%20aclanthology,3e)) ([[PDF] arXiv:2310.16248v3 [cs.CL] 2 Jul 2024](https://arxiv.org/pdf/2310.16248#:~:text=a%20massively%20parallel%20Bible%20corpus,05))). A small verse-alignment sketch follows this list.
- **Tatoeba Community Corpus:** *Tatoeba* is an open, crowd-sourced collection of translated sentences. While Neo-Aramaic entries are sparse, it is a platform to **find or contribute parallel sentences**. As of now, Tatoeba has only a handful of Assyrian Neo-Aramaic sentences (a stats page shows just a single-digit count) ([Tatoeba: Sentences & Translations Stats](https://tatoeba.j-langtools.com/transtop/?by=lang#:~:text=Tatoeba%3A%20Sentences%20%26%20Translations%20Stats,50%2C%20%E3%81%82%E2%86%92a%2C%20G)). However, some in the community have used Tatoeba to build up data. One effort managed to gather an initial Assyrian–English sentence corpus via Tatoeba and recommended it for low-resource data collection ([Auto-Translator for Preserving a Semitic Language : r/LanguageTechnology](https://www.reddit.com/r/LanguageTechnology/comments/qyfyez/autotranslator_for_preserving_a_semitic_language/#:~:text=parallel%20corpus%20for%20a%20low,managed%20to%20build%20up%20some)). If you have access to fluent speakers, contributing translations to Tatoeba (or harvesting what’s there) could incrementally grow a corpus. Think of Tatoeba as a long-term crowdsourcing strategy – not much data to start with, but *accessible and openly licensed*.
- **OPUS Parallel Corpora:** Beyond the Bible, the OPUS repository contains other multilingual corpora that *might* include Neo-Aramaic. For example, the **JW300 corpus** (a collection of Jehovah’s Witness publications in 300 languages) is known to cover many low-resource languages. There are indications that Assyrian Neo-Aramaic is included in JW300 ([[PDF] Do Not Trust Licenses You See—Dataset Compliance Requires ...](https://lgresearch.ai/data/upload/LG_AI_Research_Data_compliance_arxiv_EST.pdf#:~:text=,sentences)), which could mean up to tens of thousands of sentence pairs (JW300 has ~100k sentences per language on average). You could check OPUS for a `JW300.en-aii.txt` or similar. Other possible OPUS corpora are **Tanzil** and **Bible-uedin** (which overlap with the Bible data discussed). It’s worth exploring OPUS’s web interface for language code “aii” (Assyrian) or “amw” (Western Neo-Aramaic) to see available aligned texts. **Note:** Even if direct OPUS parallel data for Western Neo-Aramaic is lacking (which is likely), you might find data for **Turoyo** (Surayt), a Central Neo-Aramaic dialect written in Syriac script, since some modern literature exists in Turoyo. OPUS’s *Books* or *TED* corpora won’t have Aramaic, and *Tanzil* (the Quran) has Classical Arabic–Classical Syriac translations, which are more of historical interest.
- **Academic & Literary Translations:** Outside of religious texts, there are a few literary translation and documentation projects:
  - *Folklore and Oral Literature:* A recent project, *“Neo-Aramaic and Kurdish Folklore from Northern Iraq”*, published a bilingual anthology of folktales ([Masoud Mohammadirad - Projects](https://masoudmohammadirad.com/projects.html#:~:text=The%20main%20objective%20is%20to,between%20Christian%20and%20Muslim%20communities)). This includes stories in **North-Eastern Neo-Aramaic (Assyrian)** alongside English (and Kurdish). Because it’s an academic publication (Open Book Publishers, 2022), the texts are likely aligned paragraph by paragraph, and possibly available under Creative Commons. This could be mined for parallel sentences. Folktales provide more colloquial language than the Bible, so they are a great complement to religious text data.
  - *Fiction Translations:* There have been efforts to translate popular literature into Neo-Aramaic dialects. For example, some **Turoyo (Central Neo-Aramaic)** translations of works of fiction exist ([Auto-Translator for Preserving a Semitic Language : r/LanguageTechnology](https://www.reddit.com/r/LanguageTechnology/comments/qyfyez/autotranslator_for_preserving_a_semitic_language/#:~:text=other%20languages%20,so%20one%20could%20start%20there)). If these can be obtained (e.g. from publishers or community groups), you could align them with the original English text. One known publisher is **Verlag Tintenfass**, which has released Syrian folk tales and possibly children’s books in Turoyo/Surayt ([Auto-Translator for Preserving a Semitic Language : r/LanguageTechnology](https://www.reddit.com/r/LanguageTechnology/comments/qyfyez/autotranslator_for_preserving_a_semitic_language/#:~:text=other%20languages%20,so%20one%20could%20start%20there)). Similarly, there might be **Assyrian translations of novels** (for instance, an Assyrian translation of *Alice in Wonderland* was published). While these are not large corpora, each book can contribute a few thousand sentence pairs.
  - *News or Web Text:* Neo-Aramaic does not have much presence in news media, but there are community websites and blogs (often with translations). For example, some Assyrian diaspora organizations post content in both English and Assyrian. If you find such bilingual articles, you can extract parallel sentences. The **Assyrian International News Agency (AINA)** or **Zinda Magazine** archives might have side-by-side translations in some cases. This requires manual scraping/alignment but can yield domain-diverse data.
  - *Dictionaries & Phrasebooks:* While not full sentences, resources like the **Assyrian–English online dictionaries** ([Auto-Translator for Preserving a Semitic Language : r/LanguageTechnology](https://www.reddit.com/r/LanguageTechnology/comments/qyfyez/autotranslator_for_preserving_a_semitic_language/#:~:text=Yep%2C%20that%27s%20the%20language%21%20There,online%20dictionaries%20that%20I%27ve%20found)) or Wiktionary can provide **glosses and short phrases**. These can be used to create synthetic simple sentences or at least ensure proper names and key terms are covered in your vocabulary. For example, Wikiversity has an “Aramaic Phrases” page with common expressions in English and Assyrian Neo-Aramaic ([Aramaic Language/Phrases - Wikiversity](https://en.wikiversity.org/wiki/Aramaic_Language/Phrases#:~:text=Aramaic%20Language%2FPhrases%20,female)). Such phrase pairs (hello, thank you, etc.) can augment your training data slightly and be useful for evaluation.
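As a concrete companion to the Bible Translations bullet above, here is a minimal verse-alignment sketch. The input format is an assumption – two hypothetical tab-separated files keyed by verse reference – so adapt the parsing to whatever export you actually obtain from OPUS or Rinyo.org.

```python
# Minimal sketch: build line-aligned training files from two verse-keyed Bible
# extracts. Assumes hypothetical tab-separated files "bible.en.tsv" and
# "bible.aii.tsv" with lines like "GEN_1_1<TAB>In the beginning ...".
def load_verses(path):
    # Map verse reference -> verse text, skipping empty lines.
    verses = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            ref, _, text = line.rstrip("\n").partition("\t")
            if text:
                verses[ref] = text
    return verses

en = load_verses("bible.en.tsv")
aii = load_verses("bible.aii.tsv")
shared = sorted(set(en) & set(aii))   # keep only verses present on both sides

with open("train.en", "w", encoding="utf-8") as f_en, \
     open("train.aii", "w", encoding="utf-8") as f_aii:
    for ref in shared:
        f_en.write(en[ref] + "\n")
        f_aii.write(aii[ref] + "\n")

print(f"Aligned {len(shared)} verse pairs")
```

Holding out a few hundred of these verse pairs (plus any non-biblical pairs you collect) as a test set will also give you something to evaluate against later.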
**If no direct Neo-Aramaic data is at hand:** Consider leveraging **closely related languages**. Neo-Aramaic dialects are part of the Semitic family, and they share some vocabulary with **Arabic** and **Hebrew**. You might not find parallel corpora for Neo-Aramaic itself, but you can use a **pivot-language strategy**. For instance, translate English→Arabic using a high-quality model, then have a human or rule-based system convert Arabic→Neo-Aramaic (since many Neo-Aramaic speakers also know Arabic). This is not ideal, but as a bootstrap it can create a pseudo-parallel corpus. Another approach is **back-translation**: gather monolingual Neo-Aramaic text (e.g. the *Neo-Aramaic Web Corpora* project has stories in Urmi and Turoyo) and translate them to English with an initial rough model or via human translators, then use those as training data.
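Here is a minimal back-translation sketch for the monolingual route just described, assuming you have already trained a rough reverse (Neo-Aramaic→English) model saved at the placeholder path `aii-en-marian` and collected monolingual sentences in a placeholder file `mono.aii`:

```python
# Minimal back-translation sketch: turn monolingual Neo-Aramaic text into
# synthetic English->Neo-Aramaic pairs. Model path and file names are placeholders.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("aii-en-marian")        # hypothetical reverse model
model = AutoModelForSeq2SeqLM.from_pretrained("aii-en-marian")

with open("mono.aii", encoding="utf-8") as f:
    aii_sentences = [line.strip() for line in f if line.strip()]

with open("synthetic.en", "w", encoding="utf-8") as out:
    for i in range(0, len(aii_sentences), 32):                    # translate in small batches
        batch = tokenizer(aii_sentences[i:i + 32], return_tensors="pt",
                          padding=True, truncation=True, max_length=128)
        generated = model.generate(**batch, max_length=128, num_beams=4)
        for sent in tokenizer.batch_decode(generated, skip_special_tokens=True):
            out.write(sent + "\n")

# "synthetic.en" + "mono.aii" then become extra (noisier) training pairs,
# usually mixed with the genuine parallel data rather than replacing it.
```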
In summary, you should **prioritize the Bible and any available religious or folk text translations** as your parallel data. Those give you a solid base. Then, supplement with whatever small parallel sources you can find (community contributions, academic projects). Even a few thousand sentence pairs can be sufficient to fine-tune a larger multilingual model, thanks to transfer learning.
## Fine-Tuning & Training Setup
Once you have chosen a model/framework and gathered data, the **fine-tuning process** will generally involve the following steps (with references to guides):
1. **Data Preparation:** Ensure your parallel corpus is in the correct format. This usually means one text file of English sentences and one text file of Neo-Aramaic sentences, line-aligned (each line in one is the translation of the same line in the other). Clean the data to remove any unwanted characters. Tokenization is important, especially for Syriac script – you may want to use SentencePiece or byte-pair encoding to handle rare Unicode characters. For example, the Assyrian MT project used SentencePiece for tokenization ([GitHub - mt-empty/assyrian-translation-model: Assyrian translation model using huggingface](https://github.com/mt-empty/assyrian-translation-model#:~:text=Tokenizer)). Many frameworks (Marian, OpenNMT) have built-in tokenizers or support subword models. (A SentencePiece sketch follows this list.)
2. **Leverage Existing Models (Transfer Learning):** Given the low-resource nature, it’s highly beneficial to start from a pre-trained model. You could take a **Marian English→Arabic** model or a **multilingual NMT** model and use it as the initialization for English→Neo-Aramaic. This transfers general linguistic knowledge. The Reddit discussion recommended *“transfer learning from an Arabic–English model”* for Assyrian MT ([Auto-Translator for Preserving a Semitic Language : r/LanguageTechnology](https://www.reddit.com/r/LanguageTechnology/comments/qyfyez/autotranslator_for_preserving_a_semitic_language/#:~:text=in%20the%20relevant%20language%20pair,are%20actually%20translations%20of%20some)), which is exactly what was done by mt-empty’s model (they started with Helsinki’s EN–AR Marian model) ([GitHub - mt-empty/assyrian-translation-model: Assyrian translation model using huggingface](https://github.com/mt-empty/assyrian-translation-model#:~:text=This%20is%20an%20English%20to,be%20read%20by%20inexperienced%20developers)). If you use Hugging Face, this is as simple as `AutoModelForSeq2SeqLM.from_pretrained("<model-name>")` and then providing your new training data to the Trainer (as in the fine-tuning sketch in the previous section). If using Marian C++ directly, you can provide the pre-trained model as a starting point (check Marian’s options for initializing training from an existing model).
3. **Training/Fine-Tuning Process:** Follow the framework’s training instructions:
   - **Marian/OPUS-MT:** If using the OPUS pipeline, you’d use their scripts to preprocess (train a SentencePiece model, encode data) and then run Marian training. OPUS-MT’s GitHub and OPUS-MT-train docs detail this process ([Helsinki-NLP/opus-mt-tc-bible-big-mul-mul · Hugging Face](https://huggingface.co/Helsinki-NLP/opus-mt-tc-bible-big-mul-mul#:~:text=This%20model%20is%20part%20of,train.%20Model%20Description)). Alternatively, using Hugging Face, you can fine-tune with `Seq2SeqTrainingArguments` and `Trainer`. The Hugging Face documentation has a [translation fine-tuning example](https://huggingface.co/docs/transformers/training#fine-tune-translation) which you can adapt to your dataset. The mt-empty GitHub repository provides a simple script (`model.py`) that loads data with 🤗 Datasets and trains the model for 50 epochs ([GitHub - mt-empty/assyrian-translation-model: Assyrian translation model using huggingface](https://github.com/mt-empty/assyrian-translation-model#:~:text=To%20Train)) – you can use that as a template.
   - **Fairseq:** You will first run `fairseq-preprocess` to binarize your aligned data (generating the vocabulary, etc.), then `fairseq-train` with appropriate flags (model architecture, embedding size, etc.). The Fairseq examples repository has a translation README ([fairseq/examples/translation/README.md at main - GitHub](https://github.com/facebookresearch/fairseq/blob/master/examples/translation/README.md#:~:text=fairseq%2Fexamples%2Ftranslation%2FREADME.md%20at%20main%20,well%20as%20training%20new%20models)) and the official docs cover training new models ([README.md - facebookresearch/fairseq - GitHub](https://github.com/facebookresearch/fairseq/blob/main/README.md#:~:text=README.md%20,trained%20models)). Since Neo-Aramaic is written in the Syriac script, be mindful to set character encoding properly (UTF-8) and consider SentencePiece for an open vocabulary. Fairseq also allows continuing training from a checkpoint (for fine-tuning a pre-trained model like NLLB – you’d need to ensure the tokenizer/vocab matches or is compatible).
   - **OpenNMT:** OpenNMT-py’s quickstart shows a 3-step process: tokenize, preprocess, then `onmt_train`. In the config, you specify the parallel corpus files and model type. The OpenNMT documentation explicitly uses OPUS data in its examples, which is convenient here ([Applications - OpenNMT](https://opennmt.net/OpenNMT/applications/#:~:text=Neural%20Machine%20Translation%20,of%20domains%20and%20language%20pairs)). After training, you get a checkpoint that can be used with `onmt_translate` for inference. OpenNMT also supports modern Transformer models and multi-language training if needed. For low-resource settings, you might experiment with techniques like copy mechanisms or synthetic data (the OpenNMT forums have discussions on these).
4. **Evaluation:** Use standard MT metrics to track progress. If you have a held-out test set (perhaps a portion of your parallel corpus, or another source such as translated verses not in training), compute BLEU or SacreBLEU. The Assyrian MT project reported using SacreBLEU to evaluate their model ([GitHub - mt-empty/assyrian-translation-model: Assyrian translation model using huggingface](https://github.com/mt-empty/assyrian-translation-model#:~:text=Evaluation)). BLEU scores for low-resource MT can be low, so also do qualitative evaluation. If possible, involve a fluent speaker to judge translation quality (especially for Western Neo-Aramaic, where automatic metrics are less reliable due to small test sets). (A short SacreBLEU sketch follows this list.)
5. **Iterate and Improve:** Given the likely small data size, you might iterate by adding more data (if found or translated later), adjusting hyperparameters, or employing **data augmentation**:
   - *Back-translation:* Take Neo-Aramaic text (from web corpora or books) and translate it *into English* using a preliminary model, then add those synthetic pairs to training (see the back-translation sketch in the previous section). This can boost performance by exposing the model to more Aramaic structure ([Auto-Translator for Preserving a Semitic Language : r/LanguageTechnology](https://www.reddit.com/r/LanguageTechnology/comments/qyfyez/autotranslator_for_preserving_a_semitic_language/#:~:text=NMT%20is%20definitely%20the%20sexiest,or%20not%20you%27re%20translating%20from)).
   - *Multilingual training:* You could train a single model on English→{Arabic, Hebrew, Aramaic} combined data (treating Aramaic as just another target language) to let it learn from the higher-resource languages and hopefully carry some improvement over to Aramaic. This is essentially what massive models like NLLB do. It’s more complex but can be tried with Fairseq or Transformers by tagging the target language.
   - *Fine-tune separately for dialects:* Eastern and Western Neo-Aramaic are quite different. If you need a model for each, consider training separate models. Alternatively, train a single model with a special token to indicate dialect (e.g., `<dialect:Western>` vs `<dialect:Eastern>` in the input) if using a multilingual approach. This way, the model can handle both but still distinguish their outputs. (A small tagging sketch follows this list.)
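For step 1, here is a minimal SentencePiece sketch that trains one shared subword model over both sides of the corpus; the file names follow the earlier sketches, and the vocabulary size is an arbitrary choice for a small corpus, not a recommendation.

```python
# Minimal sketch: train a joint SentencePiece subword model over both languages
# so rare Syriac-script characters get covered. File names are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.en,train.aii",     # both sides of the corpus share one vocabulary
    model_prefix="en_aii_spm",
    vocab_size=8000,                # kept small for a low-resource corpus
    character_coverage=1.0,         # make sure every Syriac codepoint is retained
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="en_aii_spm.model")
print(sp.encode("ܫܠܡܐ", out_type=str))   # subword pieces for a Syriac-script word
```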
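For step 4, here is a minimal SacreBLEU evaluation sketch, assuming a held-out, line-aligned test set (`test.en` / `test.aii`) and the fine-tuned checkpoint from the earlier sketch; all paths are placeholders.

```python
# Minimal sketch: score a held-out test set with SacreBLEU.
import sacrebleu
from transformers import pipeline

translator = pipeline("translation", model="en-aii-marian")   # placeholder checkpoint path

with open("test.en", encoding="utf-8") as f:
    sources = [line.strip() for line in f]
with open("test.aii", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# Translate the English sources, then compare against the Neo-Aramaic references.
hypotheses = [out["translation_text"] for out in translator(sources, max_length=128)]
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```

For very small test sets, character-level metrics such as chrF (also in SacreBLEU) are often more stable than BLEU, so it is worth reporting both.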
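And for the dialect-tag idea in the last bullet, here is a small sketch of prepending a marker to the English source so one model can serve both dialects; the tag strings are arbitrary choices made for illustration, not an established convention.

```python
# Minimal sketch: prepend a dialect marker to each English source sentence so a
# single model can be trained on (and asked for) either dialect. The ">>xxx<<"
# style mirrors the prefix tokens used by multilingual Marian/OPUS-MT models,
# but these particular tags are this sketch's own choice.
def tag_source(english_sentence: str, dialect: str) -> str:
    tag = {"eastern": ">>aii<<", "western": ">>amw<<"}[dialect]
    return f"{tag} {english_sentence}"

# During corpus building, tag each pair by its dialect; at inference time,
# prepend the same tag before tokenizing the input.
print(tag_source("Good morning.", "western"))   # ">>amw<< Good morning."
```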
## Additional Resources & References
- **Helsinki-NLP OPUS-MT** – Project releasing pre-trained MT models for many languages. See their model repository on Hugging Face ([Helsinki-NLP/opus-mt-tc-bible-big-mul-mul · Hugging Face](https://huggingface.co/Helsinki-NLP/opus-mt-tc-bible-big-mul-mul#:~:text=This%20model%20is%20part%20of,train.%20Model%20Description)) and the OPUS website. Many models (including the Bible-based ones) are under the Apache 2.0 license, meaning you can use and fine-tune them freely.
- **mt-empty English–Assyrian MT** – Community project demonstrating fine-tuning Marian for Neo-Aramaic. GitHub: *mt-empty/assyrian-translation-model* ([GitHub - mt-empty/assyrian-translation-model: Assyrian translation model using huggingface](https://github.com/mt-empty/assyrian-translation-model#:~:text=To%20Train)). It includes code, data references, and usage examples.
- **Aramaic Bible Translation (ABT)** – Non-profit translating the Bible into Aramaic dialects. Their sites (e.g., Rinyo.org and scriptureearth.org) provide Western Neo-Aramaic scripture ([Western Neo-Aramaic - Wikipedia](https://en.wikipedia.org/wiki/Western_Neo-Aramaic#:~:text=Aramaic%20Bible%20Translation%20,the%20Book%20of%20Psalms%20and)). These texts can serve as a dataset, and the organization also hosts audio recordings that could aid speech-to-text work if ever needed.
- **Cambridge NENA Database** – A rich **monolingual** resource with transcribed spoken narratives in many North-Eastern Neo-Aramaic dialects ([Masoud Mohammadirad - Projects](https://masoudmohammadirad.com/projects.html#:~:text=intertwined%20histories%20of%20the%20Aramaic,and%20Gorani%20dialects%20is%20required)). While it’s not parallel, it’s useful for understanding the language and could be a source of monolingual text for back-translation experiments.
- **OpenNMT Forums & Fairseq GitHub** – For technical questions on training, these communities can help. E.g., the OpenNMT forum has threads on low-resource translation and synthetic data ([Are you interested in training Russian-Abkhazian parallel corpus?](https://forum.opennmt.net/t/are-you-interested-in-training-russian-abkhazian-parallel-corpus/3467#:~:text=Are%20you%20interested%20in%20training,translate%20more%20sentences%20for%20you)) ([Transfer learning · Issue #2149 · OpenNMT/OpenNMT-py - GitHub](https://github.com/OpenNMT/OpenNMT-py/issues/2149#:~:text=Transfer%20learning%20%C2%B7%20Issue%20,scale%20parallel%20corpus%20training)).
By combining the right open-source model with available data and some ingenuity in data sourcing, you can set up an English→Neo-Aramaic translation system without extensive code development. The key is to **maximize transfer learning and existing resources**: start with a robust framework (Marian, Fairseq, OpenNMT) and plug in any parallel text you can find. With the above models, corpora, and guidelines, you should be able to fine-tune a translator for both Eastern and Western Neo-Aramaic that is as good as the data allows. Good luck with your Neo-Aramaic MT project!
**Sources:**
- OPUS Bible corpus and OPUS-MT model information ([Helsinki-NLP/opus-mt-tc-bible-big-mul-mul · Hugging Face](https://huggingface.co/Helsinki-NLP/opus-mt-tc-bible-big-mul-mul#:~:text=,gor%20gos%20got%20grc%20grn)) ([Helsinki-NLP/opus-mt-tc-bible-big-mul-mul · Hugging Face](https://huggingface.co/Helsinki-NLP/opus-mt-tc-bible-big-mul-mul#:~:text=,bik%20bis%20bod%20bom_Latn%20bos_Cyrl))
- mt-empty English–Assyrian model (Hugging Face & GitHub) ([GitHub - mt-empty/assyrian-translation-model: Assyrian translation model using huggingface](https://github.com/mt-empty/assyrian-translation-model#:~:text=This%20is%20an%20English%20to,be%20read%20by%20inexperienced%20developers)) ([GitHub - mt-empty/assyrian-translation-model: Assyrian translation model using huggingface](https://github.com/mt-empty/assyrian-translation-model#:~:text=The%20dataset%20are%20sourced%20from%3A))
- Reddit discussion on low-resource MT for Neo-Aramaic (transfer learning and data sources) ([Auto-Translator for Preserving a Semitic Language : r/LanguageTechnology](https://www.reddit.com/r/LanguageTechnology/comments/qyfyez/autotranslator_for_preserving_a_semitic_language/#:~:text=in%20the%20relevant%20language%20pair,into%20Turoyo%2FSurayt%2C%20so%20one%20could)) ([Auto-Translator for Preserving a Semitic Language : r/LanguageTechnology](https://www.reddit.com/r/LanguageTechnology/comments/qyfyez/autotranslator_for_preserving_a_semitic_language/#:~:text=As%20for%20the%20data%20set%2C,so%20one%20could%20start%20there))
- OpenNMT documentation (training process and data from OPUS) ([Applications - OpenNMT](https://opennmt.net/OpenNMT/applications/#:~:text=Neural%20Machine%20Translation%20,of%20domains%20and%20language%20pairs)) ([Applications - OpenNMT](https://opennmt.net/OpenNMT/applications/#:~:text=Training%20a%20NMT%20engine%20is,a%203%20steps%20process))
- Western Neo-Aramaic Bible translation (ABT project) ([Western Neo-Aramaic - Wikipedia](https://en.wikipedia.org/wiki/Western_Neo-Aramaic#:~:text=Aramaic%20Bible%20Translation%20,the%20Book%20of%20Psalms%20and))
- Masoud Mohammadirad’s project on Neo-Aramaic/Kurdish folktales ([Masoud Mohammadirad - Projects](https://masoudmohammadirad.com/projects.html#:~:text=The%20main%20objective%20is%20to,between%20Christian%20and%20Muslim%20communities)).