Commits (97)
1429e40
Add HuggingFace integration support and code refactoring
chaowang0524 Aug 22, 2025
b36b71a
Add ChangeLog.md and quickstart guide in README.me.
chaowang0524 Aug 23, 2025
0e72f39
Ready for test
chaowang0524 Aug 23, 2025
ffd4fc0
Provide the attribution link to original x-transformers library
chaowang0524 Aug 23, 2025
625da97
Update the tokenizer for "Ġ" symbol.
chaowang0524 Aug 23, 2025
59f6305
Updated the README.md
chaowang0524 Aug 23, 2025
c9a6f72
1. Added training configuration template.
chaowang0524 Aug 26, 2025
d141848
Updated readme and changelog file
chaowang0524 Aug 27, 2025
5569f4d
Updated the readme file
chaowang0524 Aug 27, 2025
d14a183
Convert mammoth model (converted from HF) back to HF model hub.
chaowang0524 Sep 5, 2025
2f35acc
Updated the readme.md and changelog.md
chaowang0524 Sep 5, 2025
0137b2b
Updated sentencepiece detokenization
chaowang0524 Sep 5, 2025
48fd5b6
Updated changelog.md
chaowang0524 Sep 5, 2025
25b1cec
created feat/helper branch
Sep 11, 2025
3e147d4
updated build-venv-mammoth-hf.sh
Sep 11, 2025
06f9267
fixed build-venv
Sep 11, 2025
24e0ce3
Updated README
Sep 11, 2025
8ec024a
Updated README
Sep 11, 2025
de54ba3
Updated README
Sep 11, 2025
d6aa13c
added lib
Sep 11, 2025
bc8172b
upgraded slurms code
Sep 13, 2025
8b31f7d
upgraded to relative
Sep 13, 2025
60509fb
deleted
Sep 13, 2025
c6ad2a0
finished upgrading
Sep 13, 2025
0e5c640
updated
Sep 13, 2025
9e5929e
updated
Sep 13, 2025
106dc63
README updates
Sep 13, 2025
8526752
README updates
Sep 13, 2025
74871ed
README updates
Sep 13, 2025
899e7d4
README updates
Sep 13, 2025
d946240
README updates
Sep 13, 2025
3808964
README updates
Sep 13, 2025
08006f3
README updates
Sep 13, 2025
b336b17
purged
Sep 13, 2025
bb69fc4
typofixes
Sep 13, 2025
bea5dae
typofixes
Sep 13, 2025
17b7a72
typo-fixes
Sep 13, 2025
a52a342
added mkjob.sh
Sep 14, 2025
625c1bf
updated mkjob.sh
Sep 14, 2025
d4459cf
added {mkjobdir,showjobs,mksbatch}.sh; updated READMEs
Sep 15, 2025
500f0ef
fixed code
Sep 15, 2025
2bc722a
fixed info
Sep 15, 2025
42420f6
updated
Sep 15, 2025
de3c3ba
updated
Sep 15, 2025
6ecb6f9
updated
Sep 15, 2025
52257b6
updated
Sep 15, 2025
0f3afb3
updated
Sep 15, 2025
314b310
updated
Sep 15, 2025
a8f3d56
updated
Sep 15, 2025
ac0a383
Add LUMI supercomputer support and enhance HF integration
chaowang0524 Sep 15, 2025
326e2c0
Updated the changelog.md
chaowang0524 Sep 15, 2025
5dfd919
Added get/datasets.sh
Sep 15, 2025
37cb651
Updated the readme and a quickstart for LUMI
chaowang0524 Sep 15, 2025
76f4727
Added get/datasets.sh
Sep 15, 2025
d1979ce
updated the lumi quickstart
chaowang0524 Sep 15, 2025
4c61df8
Added get/datasets.sh
Sep 15, 2025
a504d7c
Added get/datasets.sh
Sep 15, 2025
9673ce6
updated the path to requirements_lumi.txt
chaowang0524 Sep 15, 2025
074cae2
Updated README files
Sep 15, 2025
02b496e
Updated README files
Sep 15, 2025
b29fa75
Updated README files
Sep 15, 2025
bac68bb
Updated README files
Sep 15, 2025
6c54e09
Updated README files
Sep 15, 2025
4a38837
Updated README files
Sep 15, 2025
745bd1e
Updated README files
Sep 15, 2025
fe7bd21
Updated README files
Sep 15, 2025
6f74819
Updated README files
Sep 15, 2025
652c968
Updated README files
Sep 15, 2025
74df7f2
Updated README files
Sep 15, 2025
5f33016
fixed bugs in 4-module-loads.sh
Sep 15, 2025
705203d
fixed table
Sep 15, 2025
eb44627
various updates
Sep 16, 2025
c13aca8
Updated lumi quickstart
chaowang0524 Sep 16, 2025
75ecbe2
updated
Sep 16, 2025
59fcb33
added symlinks
Sep 16, 2025
a651ff5
added uncorpus/tatoeba to datasets
Sep 16, 2025
8b8122f
a lot of updates; still unstable
Sep 19, 2025
6d6de28
updated installation scripts
Sep 19, 2025
613e45f
various changes
Sep 19, 2025
4ec4ab6
fix single-branch fetchspec trap
Sep 19, 2025
56d0c84
Merge feat/hf_integration into feat/helper; keep upstream README, sav…
Sep 19, 2025
aad88c8
Keep upstream README; move helper notes to helper/README-helper.md
Sep 19, 2025
386edc0
fixed python wrapper
Sep 20, 2025
c3d6db8
clean
Sep 20, 2025
c5f32cc
fixed wrapper
Sep 20, 2025
2e6335d
Added cc.md
Sep 23, 2025
d58cb7a
cc.md
Sep 24, 2025
fb3ca1b
cc.md
Sep 24, 2025
63fbb51
cc.md
Sep 24, 2025
e0a2666
updated
Sep 24, 2025
9a19d4b
updated
Sep 24, 2025
3a06342
updated
Sep 24, 2025
588e2a8
updated
Sep 24, 2025
e6d4e97
removed python_container.md
Sep 25, 2025
92c2789
minimal packages
Sep 29, 2025
8ef8c1a
Created config_config2
Oct 1, 2025
49e34e3
Added functionalities and robustness
Oct 7, 2025
5 changes: 5 additions & 0 deletions .gitignore
@@ -117,3 +117,8 @@ sftp.config*.json
#vim
*.swp
*.swo
*.ipynb

# claude
CLAUDE.md
.claude/*
56 changes: 56 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,56 @@
# HuggingFace Model Integration Update

## Overview

This update introduces HuggingFace model integration capabilities to Mammoth, enabling seamless conversion and use of pre-trained HuggingFace BART models within the Mammoth translation framework.

For usage instructions, please refer to README.md.

## What Was Updated (15-Sep-2025)

**LUMI Supercomputer Support**

- Added LUMI environment setup script (`lumi/env_setup.sh`) for PyTorch virtual environment configuration
- Added SLURM batch scripts for training (`lumi/train.sh`) and translation (`lumi/translate.sh`) on LUMI
- Minimal requirements file (`requirements_lumi.txt`) for LUMI deployment with essential dependencies

**Improvements**

- Added early validation of the `save_model` path, with error handling and directory creation, so the destination directory is checked before training starts (see the sketch below)
- Improved error messages for `save_model` path validation
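
For illustration, the kind of early check described above looks roughly like this minimal sketch (the function name and error messages are illustrative, not taken from the actual Mammoth code):

```python
import os


def validate_save_model_path(save_model: str) -> None:
    """Fail fast if the checkpoint destination cannot be written to."""
    save_dir = os.path.dirname(os.path.abspath(save_model))
    try:
        # create missing parent directories up front
        os.makedirs(save_dir, exist_ok=True)
    except OSError as err:
        raise ValueError(f"Cannot create save_model directory '{save_dir}': {err}") from err
    if not os.access(save_dir, os.W_OK):
        raise ValueError(f"save_model directory '{save_dir}' is not writable")
```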

## What Was Updated (05-Sep-2025)

**HuggingFace Converter Enhancement (`hf2mammoth2hf.py`)**

- Convert Mammoth models (previously converted from HF) back to HuggingFace format
- Push converted models directly to HuggingFace model hub

**Enhanced Dependencies**

- Added `sentencepiece==0.2.1` for tokenization/detokenization support (see the sketch below)
- Cleaned up NVIDIA CUDA dependencies for broader compatibility
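
As a reference for how the added dependency is used, a minimal `sentencepiece` tokenize/detokenize round-trip looks like this (the model filename is a placeholder for whatever SentencePiece model you use):

```python
import sentencepiece as spm

# load a SentencePiece model (placeholder filename)
sp = spm.SentencePieceProcessor(model_file="spm.model")

ids = sp.encode("Hello world")  # tokenize to subword ids
text = sp.decode(ids)           # detokenize back to a plain string
print(text)
```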

## What Was Updated (27-Aug-2025)

**Training Configuration Updates**

- Added the `--valid_metrics` parameter to the training configuration to enable in-training validation with metrics
- Added a "BLEU" metric, backed by the `sacrebleu` library, as an option for `--valid_metrics`. By default this returns the corpus-level BLEU score against the reference (see the sketch below).

## What Was Updated (22-Aug-2025)

#### 1. HuggingFace Model Converter (`hf2mammoth.py`)

- **Complete BART to Mammoth conversion pipeline**
- Converts HuggingFace BART models to Mammoth-compatible format
- Three-stage conversion process:
- Stage 1: HuggingFace BART → X-Transformers
- Stage 2: X-Transformers → Mammoth model
- Stage 3: Save as Mammoth checkpoint
- Automatic vocabulary extraction from HF tokenizers for on-the-fly tokenization/detokenization

#### 2. X-Transformers library integration & update (`mammoth/x_transformers/`)

- Integrated the X-Transformers library into the Mammoth directory
- Updated the supported [X-Transformers](https://github.com/lucidrains/x-transformers) library version to 2.7.2
144 changes: 143 additions & 1 deletion README.md
@@ -1,14 +1,156 @@
# 🦣 MAMMOTH: Massively Multilingual Modular Open Translation @ Helsinki

This repository contains the code for 🦣 MAMMOTH, the modular translation toolkit from Helsinki-NLP.

This library is built on top of OpenNMT-py.
[OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py) is the [PyTorch](https://github.com/pytorch/pytorch) version of the [OpenNMT](https://opennmt.net) project, an open-source (MIT) neural machine translation framework. It is designed to be research-friendly, making it easy to try out new ideas in translation, summarization, morphology, and many other domains. Some companies have proven the code to be production-ready.

### Documentation

The original ONMT-py documentation is available [here](https://opennmt.net/OpenNMT-py/).
Our own modifications (currently-under-development) are documented [here](https://helsinki-nlp.github.io/mammoth/).

**Note:** We would greatly appreciate issue reports for this repository and its documentation.

### Acknowledgements

We thank the NVIDIA AI Technology Center Finland for their help with the multi-gpu/node implementation.

For the update history in this branch, see CHANGELOG.md.

**If you are working in the LUMI environment, please see the [quickstart guide here](lumi/LUMI_QUICKSTART.md).**

## Quick Start: HuggingFace Integration

This guide shows:

- how to quickly convert and use HuggingFace BART models with Mammoth
- how to run inference with a converted (or any other) Mammoth model
- how to continue training a Mammoth model from a converted checkpoint (or any checkpoint)
- how to push a Mammoth BART model to the HuggingFace Hub

### Setup

```bash
# Clone the repository (feat/hf_integration branch)
git clone -b feat/hf_integration https://github.com/Helsinki-NLP/mammoth.git

# cd into the directory
cd mammoth

# Install the dependencies
pip install -r requirements.txt
```

### HF2Mammoth Converter: convert a HuggingFace BART model to Mammoth

#### Basic Usage

Convert a HuggingFace BART model to Mammoth format:

```bash
# Basic conversion from Hugging Face model hub (language pair needs to be specified)
python hf2mammoth.py vgaraujov/bart-base-translation-en-es ./save/path/converted_model --src-lang en --tgt-lang es

# Local model conversion (language pair needs to be specified)
python hf2mammoth.py /path/to/local/bart/model ./save/path/converted_model --src-lang en --tgt-lang es
```

**Important Notes:**

- The "converted_model" parameter serves as the prefix for converted model component names and does not affect inference.
- While the BART architecture supports many tasks, this setup has been tuned and tested specifically for translation (using `BartForConditionalGeneration`). Other model heads may carry different weights and could produce unexpected results.
- The HF BART model used for debugging is [vgaraujov/bart-base-translation-en-es](https://huggingface.co/vgaraujov/bart-base-translation-en-es).

#### What the Converter Does

1. **Downloads/Loads** the HF BART model and tokenizer
2. **Creates** an x-transformers model with matching architecture
3. **Maps weights** from HF format to x-transformers format
4. **Builds** a Mammoth model structure with task configuration
5. **Transfers weights** from x-transformers to Mammoth
6. **Saves** the complete Mammoth checkpoint
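
To make the weight-mapping step concrete: converters of this kind essentially rename `state_dict` keys according to a mapping table. The sketch below shows the idea only; the target-side key names are invented for illustration and are not Mammoth's actual parameter names.

```python
import torch
from transformers import BartForConditionalGeneration

hf_model = BartForConditionalGeneration.from_pretrained("vgaraujov/bart-base-translation-en-es")
hf_state = hf_model.state_dict()

# hypothetical excerpt of an HF -> target-format key mapping
key_map = {
    "model.encoder.embed_tokens.weight": "encoder.token_emb.weight",
    "model.decoder.embed_tokens.weight": "decoder.token_emb.weight",
}

converted = {new: hf_state[old] for old, new in key_map.items() if old in hf_state}
torch.save(converted, "converted_subset.pt")  # a real Mammoth checkpoint holds many more tensors
```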

#### Output Files

After conversion, you'll find:

- `{save_path}/` - Main Mammoth model checkpoint
- `src_vocab_{src_lang}.txt` - Vocab file of source language
- `tgt_vocab_{tgt_lang}.txt` - Vocab file of target language
- `xt_model_keys.txt` - x-transformers model layer names (debug)
- `mammoth_model_keys.txt` - Mammoth model layer names (debug)

**Note:** Mammoth models are saved as components, which is normal behavior.
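
A quick way to confirm what was written is to list the output directory (assuming the checkpoint was saved to `./save/path`; adjust to your own run):

```python
from pathlib import Path

save_path = Path("./save/path")
for item in sorted(save_path.glob("*")):
    # expect several model components plus the vocab files and debug key lists
    print(item.name)
```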

#### Supported Models

✅ **Tested Models:** (batch size: 1)

- [vgaraujov/bart-base-translation-en-es](https://huggingface.co/vgaraujov/bart-base-translation-en-es)
- [NYTK/translation-bart-128-en-hu](https://huggingface.co/NYTK/translation-bart-128-en-hu)
- [ahazeemi/bart-base-wmt-en-fr-finetuned](https://huggingface.co/ahazeemi/bart-base-wmt-en-fr-finetuned)
- [NYTK/translation-bart-hu-en](https://huggingface.co/NYTK/translation-bart-hu-en) (Not very stable)

⚠️ **Requirements:**

- The model must use a BART-based architecture (BART has its own specific settings)

---

### Translation

#### Basic Usage

Use the provided `translation_config.yaml` as a template config.

```bash
# Run translation
cd mammoth
python translate.py -config translation_config.yaml
```

**Note:** The `task_id` must match the `task_id` specified in the `tasks` section.
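
A small sanity check before launching translation can catch a mismatched `task_id` early. The field names below are taken from the note above and may not match the full config schema, so treat this as a sketch:

```python
import yaml

with open("translation_config.yaml") as fh:
    cfg = yaml.safe_load(fh)

task_id = cfg.get("task_id")
tasks = cfg.get("tasks", {})
if task_id not in tasks:
    raise SystemExit(f"task_id '{task_id}' not found among tasks: {list(tasks)}")
print("task_id matches a defined task")
```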

### Training from the converted model (or any checkpoint)

#### Basic Usage

Use the provided `training_ft.yaml` as a template config.

```bash
# Run training
cd mammoth
python train.py -config training_ft.yaml
```

### Mammoth2HF Converter: convert a Mammoth model to a HuggingFace BART model

**Two Conversion Scenarios:**

1. **HF → Mammoth → HF:** When converting a Mammoth model that was originally converted from HuggingFace, the model uses the original HF tokenizer (the same tokenizer used during inference).

2. **Native Mammoth → HF:** When converting a model trained natively in the Mammoth framework to HuggingFace format, special considerations apply. HuggingFace BART natively uses a BPE tokenizer that requires both a vocab file and a merges file; if the Mammoth model has no merges file (typically because it was trained with a SentencePiece tokenizer), switch to `LlamaTokenizer` instead. That tokenizer requires the SentencePiece model created during the original vocabulary-building phase of training.
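
For scenario 2, wrapping the existing SentencePiece model in a `LlamaTokenizer` might look like the following sketch (the paths are placeholders for the `.model` file produced when the Mammoth vocabulary was built and for the conversion output directory):

```python
from transformers import LlamaTokenizer

# build a SentencePiece-backed tokenizer from the model used during training
tokenizer = LlamaTokenizer(vocab_file="/path/to/spm.model")

# store the tokenizer next to the converted weights
tokenizer.save_pretrained("/path/to/output/hf_model")
```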

#### Basic Usage

Scenario 1:

##### Convert Model Only

```bash
python hf2mammoth2hf.py \
--mammoth_model /path/to/mammoth/checkpoint \
--hf_model /path/to/output/hf_model \
--tokenizer /path/to/original/tokenizer
```

##### Convert and Push to HuggingFace Hub

```bash
python hf2mammoth2hf.py \
--mammoth_model /path/to/mammoth/checkpoint \
--hf_model your-username/model-name \
--tokenizer /path/to/original/tokenizer \
--push_to_hub
```
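
After either conversion, a quick sanity check is to load the exported directory back with `transformers` and generate a translation, assuming the output is a standard BART checkpoint (the paths below are placeholders):

```python
from transformers import AutoTokenizer, BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("/path/to/output/hf_model")
tokenizer = AutoTokenizer.from_pretrained("/path/to/output/hf_model")

inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```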
61 changes: 61 additions & 0 deletions helper/README.md
@@ -0,0 +1,61 @@
# 🦣 MAMMOTH: Massively Multilingual Modular Open Translation @ Helsinki
This repository contains the code for 🦣 MAMMOTH, the modular translation toolkit from Helsinki-NLP.

This library is built on top of OpenNMT-py.
[OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py) is the [PyTorch](https://github.com/pytorch/pytorch) version of the [OpenNMT](https://opennmt.net) project, an open-source (MIT) neural machine translation framework. It is designed to be research-friendly, making it easy to try out new ideas in translation, summarization, morphology, and many other domains. Some companies have proven the code to be production-ready.

### Documentation
The original ONMT-py documentation is available [here](https://opennmt.net/OpenNMT-py/).
Our own modifications (currently-under-development) are documented [here](https://helsinki-nlp.github.io/mammoth/).

**Note:** We would greatly appreciate issue reports for this repository and its documentation.

### Acknowledgements
We thank the NVIDIA AI Technology Center Finland for their help with the multi-gpu/node implementation.

### Quick Installation of MAMMOTH with Helper Features

The new **helper features** of MAMMOTH are meant for guarded, sanity-checked, and optimized deployment of MAMMOTH workflows on supercomputers, as well as for the cumulative, idempotent, backed-up, and replicable creation of new job configurations (directories, SLURM scripts, data files, and YAML files).

Follow the instructions on [this page](https://github.com/Helsinki-NLP/mammoth/tree/feat/helper/helper/bin/conf).
This will create the virtual environment and a directory tree containing the MAMMOTH software.

### Downloading Datasets

Follow the instructions on [this
page](https://github.com/Helsinki-NLP/mammoth/tree/feat/helper/helper/bin/gets)
for the easy downloading and management of commonly used datasets.

### Setting Up Experiments

Follow the instructions on [this
page](https://github.com/Helsinki-NLP/mammoth/tree/feat/helper/helper/bin/create)
for the easy creation of job directories and their sbatch files.
(We will also support the easy creation of `config.yaml` files and the easy
setup of input files for your jobs.)

### Running Training Jobs

We are currently testing scripts that facilitate the creation of the job-specific local environment (directory trees and configurations) for your training (and other) jobs.

The SLURM launch helpers are available but still under testing. Follow the instructions on [this page](https://github.com/Helsinki-NLP/mammoth/tree/feat/helper/helper/bin/slurm) to use the ready-made execution scripts and to create shorter and safer SLURM scripts with the helper feature.

### Translation, Evaluation and Efficient Inference

We are working on extending the domain of the helper features.

## Metadata of the Helper Features (Added on Top of MAMMOTH NLP)

Authors: Anssi Yli-Jyrä (c) 2025
License: CC-NC-BY
