Commits (97)
1429e40
Add HuggingFace integration support and code refactoring
chaowang0524 Aug 22, 2025
b36b71a
Add ChangeLog.md and quickstart guide in README.me.
chaowang0524 Aug 23, 2025
0e72f39
Ready for test
chaowang0524 Aug 23, 2025
ffd4fc0
Provide the attribution link to original x-transformers library
chaowang0524 Aug 23, 2025
625da97
Update the tokenizer for "Ġ" symbol.
chaowang0524 Aug 23, 2025
59f6305
Updated the README.md
chaowang0524 Aug 23, 2025
c9a6f72
1. Added training configuration template.
chaowang0524 Aug 26, 2025
d141848
Updated readme and changelog file
chaowang0524 Aug 27, 2025
5569f4d
Updated the readme file
chaowang0524 Aug 27, 2025
d14a183
Convert mammoth model (converted from HF) back to HF model hub.
chaowang0524 Sep 5, 2025
2f35acc
Updated the readme.md and changelog.md
chaowang0524 Sep 5, 2025
0137b2b
Updated sentencepiece detokenization
chaowang0524 Sep 5, 2025
48fd5b6
Updated changelog.md
chaowang0524 Sep 5, 2025
25b1cec
created feat/helper branch
Sep 11, 2025
3e147d4
updated build-venv-mammoth-hf.sh
Sep 11, 2025
06f9267
fixed build-venv
Sep 11, 2025
24e0ce3
Updated README
Sep 11, 2025
8ec024a
Updated README
Sep 11, 2025
de54ba3
Updated README
Sep 11, 2025
d6aa13c
added lib
Sep 11, 2025
bc8172b
upgraded slurms code
Sep 13, 2025
8b31f7d
upgraded to relative
Sep 13, 2025
60509fb
deleted
Sep 13, 2025
c6ad2a0
finished upgrading
Sep 13, 2025
0e5c640
updated
Sep 13, 2025
9e5929e
updated
Sep 13, 2025
106dc63
README updates
Sep 13, 2025
8526752
README updates
Sep 13, 2025
74871ed
README updates
Sep 13, 2025
899e7d4
README updates
Sep 13, 2025
d946240
README updates
Sep 13, 2025
3808964
README updates
Sep 13, 2025
08006f3
README updates
Sep 13, 2025
b336b17
purged
Sep 13, 2025
bb69fc4
typofixes
Sep 13, 2025
bea5dae
typofixes
Sep 13, 2025
17b7a72
typo-fixes
Sep 13, 2025
a52a342
added mkjob.sh
Sep 14, 2025
625c1bf
updated mkjob.sh
Sep 14, 2025
d4459cf
added {mkjobdir,showjobs,mksbatch}.sh; updated READMEs
Sep 15, 2025
500f0ef
fixed code
Sep 15, 2025
2bc722a
fixed info
Sep 15, 2025
42420f6
updated
Sep 15, 2025
de3c3ba
updated
Sep 15, 2025
6ecb6f9
updated
Sep 15, 2025
52257b6
updated
Sep 15, 2025
0f3afb3
updated
Sep 15, 2025
314b310
updated
Sep 15, 2025
a8f3d56
updated
Sep 15, 2025
ac0a383
Add LUMI supercomputer support and enhance HF integration
chaowang0524 Sep 15, 2025
326e2c0
Updated the changelog.md
chaowang0524 Sep 15, 2025
5dfd919
Added get/datasets.sh
Sep 15, 2025
37cb651
Updated the readme and a quickstart for LUMI
chaowang0524 Sep 15, 2025
76f4727
Added get/datasets.sh
Sep 15, 2025
d1979ce
updated the lumi quickstart
chaowang0524 Sep 15, 2025
4c61df8
Added get/datasets.sh
Sep 15, 2025
a504d7c
Added get/datasets.sh
Sep 15, 2025
9673ce6
updated the path to requirements_lumi.txt
chaowang0524 Sep 15, 2025
074cae2
Updated README files
Sep 15, 2025
02b496e
Updated README files
Sep 15, 2025
b29fa75
Updated README files
Sep 15, 2025
bac68bb
Updated README files
Sep 15, 2025
6c54e09
Updated README files
Sep 15, 2025
4a38837
Updated README files
Sep 15, 2025
745bd1e
Updated README files
Sep 15, 2025
fe7bd21
Updated README files
Sep 15, 2025
6f74819
Updated README files
Sep 15, 2025
652c968
Updated README files
Sep 15, 2025
74df7f2
Updated README files
Sep 15, 2025
5f33016
fixed bugs in 4-module-loads.sh
Sep 15, 2025
705203d
fixed table
Sep 15, 2025
eb44627
various updates
Sep 16, 2025
c13aca8
Updated lumi quickstart
chaowang0524 Sep 16, 2025
75ecbe2
updated
Sep 16, 2025
59fcb33
added symlinks
Sep 16, 2025
a651ff5
added uncorpus/tatoeba to datasets
Sep 16, 2025
8b8122f
a lot of updates; still unstable
Sep 19, 2025
6d6de28
updated installation scripts
Sep 19, 2025
613e45f
various changes
Sep 19, 2025
4ec4ab6
fix single-branch fetchspec trap
Sep 19, 2025
56d0c84
Merge feat/hf_integration into feat/helper; keep upstream README, sav…
Sep 19, 2025
aad88c8
Keep upstream README; move helper notes to helper/README-helper.md
Sep 19, 2025
386edc0
fixed python wrapper
Sep 20, 2025
c3d6db8
clean
Sep 20, 2025
c5f32cc
fixed wrapper
Sep 20, 2025
2e6335d
Added cc.md
Sep 23, 2025
d58cb7a
cc.md
Sep 24, 2025
fb3ca1b
cc.md
Sep 24, 2025
63fbb51
cc.md
Sep 24, 2025
e0a2666
updated
Sep 24, 2025
9a19d4b
updated
Sep 24, 2025
3a06342
updated
Sep 24, 2025
588e2a8
updated
Sep 24, 2025
e6d4e97
removed python_container.md
Sep 25, 2025
92c2789
minimal packages
Sep 29, 2025
8ef8c1a
Created config_config2
Oct 1, 2025
49e34e3
Added functionalities and robustness
Oct 7, 2025
5 changes: 5 additions & 0 deletions .gitignore
@@ -117,3 +117,8 @@ sftp.config*.json
#vim
*.swp
*.swo
*.ipynb

# claude
CLAUDE.md
.claude/*
56 changes: 56 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,56 @@
# HuggingFace Model Integration Update

## Overview

This update introduces HuggingFace model integration capabilities to Mammoth, enabling seamless conversion and use of pre-trained HuggingFace BART models within the Mammoth translation framework.

For usage instructions, please refer to README.md.

## What Was Updated (15-Sep-2025)

**LUMI Supercomputer Support**

- Added LUMI environment setup script (`lumi/env_setup.sh`) for PyTorch virtual environment configuration
- Added SLURM batch scripts for training (`lumi/train.sh`) and translation (`lumi/translate.sh`) on LUMI
- Minimal requirements file (`requirements_lumi.txt`) for LUMI deployment with essential dependencies

**Improvements**

- Added early validation of the `save_model` path, with error handling and directory creation, so the destination directory is checked before training starts (see the sketch below)
- Improved error messages for `save_model` path validation
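
For illustration, the kind of early check described above looks roughly like this minimal sketch (the function name and error messages are illustrative, not taken from the actual Mammoth code):

```python
import os


def validate_save_model_path(save_model: str) -> None:
    """Fail fast if the checkpoint destination cannot be written to."""
    save_dir = os.path.dirname(os.path.abspath(save_model))
    try:
        # create missing parent directories up front
        os.makedirs(save_dir, exist_ok=True)
    except OSError as err:
        raise ValueError(f"Cannot create save_model directory '{save_dir}': {err}") from err
    if not os.access(save_dir, os.W_OK):
        raise ValueError(f"save_model directory '{save_dir}' is not writable")
```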

## What Was Updated (05-Sep-2025)

**HuggingFace Converter Enhancement (`hf2mammoth2hf.py`)**

- Convert Mammoth models (previously converted from HF) back to HuggingFace format
- Push converted models directly to HuggingFace model hub

**Enhanced Dependencies**

- Added `sentencepiece==0.2.1` for tokenization/detokenization support (see the sketch below)
- Cleaned up NVIDIA CUDA dependencies for broader compatibility
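
As a reference for how the added dependency is used, a minimal `sentencepiece` tokenize/detokenize round-trip looks like this (the model filename is a placeholder for whatever SentencePiece model you use):

```python
import sentencepiece as spm

# load a SentencePiece model (placeholder filename)
sp = spm.SentencePieceProcessor(model_file="spm.model")

ids = sp.encode("Hello world")  # tokenize to subword ids
text = sp.decode(ids)           # detokenize back to a plain string
print(text)
```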

## What Was Updated (27-Aug-2025)

**Training Configuration Updates**

- Added the `--valid_metrics` parameter to the training configuration to enable in-training validation with metrics
- Added a "BLEU" metric, backed by the `sacrebleu` library, as an option for `--valid_metrics`. By default this returns the corpus-level BLEU score against the reference (see the sketch below).

## What Was Updated (22-Aug-2025)

#### 1. HuggingFace Model Converter (`hf2mammoth.py`)

- **Complete BART to Mammoth conversion pipeline**
- Converts HuggingFace BART models to Mammoth-compatible format
- Three-stage conversion process:
- Stage 1: HuggingFace BART → X-Transformers
- Stage 2: X-Transformers → Mammoth model
- Stage 3: Save as Mammoth checkpoint
- Automatic vocabulary extraction from HF tokenizers for on-the-fly tokenization/detokenization

#### 2. X-Transformers library integration & update (`mammoth/x_transformers/`)

- Integrated the X-Transformers library into the Mammoth directory
- Updated the supported [X-Transformers](https://github.com/lucidrains/x-transformers) library version to 2.7.2
144 changes: 143 additions & 1 deletion README.md
@@ -1,14 +1,156 @@
# 🦣 MAMMOTH: Massively Multilingual Modular Open Translation @ Helsinki

This repository contains the code for 🦣 MAMMOTH, the modular translation toolkit from Helsinki-NLP.

This library is built on top of OpenNMT-py.
[OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py) is the [PyTorch](https://github.com/pytorch/pytorch) version of the [OpenNMT](https://opennmt.net) project, an open-source (MIT) neural machine translation framework. It is designed to be research-friendly, making it easy to try out new ideas in translation, summarization, morphology, and many other domains. Some companies have proven the code to be production-ready.

### Documentation

The original ONMT-py documentation is available [here](https://opennmt.net/OpenNMT-py/).
Our own modifications (currently-under-development) are documented [here](https://helsinki-nlp.github.io/mammoth/).

**Note:** We would greatly appreciate issue reports for this repository and its documentation.

### Acknowledgements

We thank the NVIDIA AI Technology Center Finland for their help with the multi-gpu/node implementation.

For the update history in this branch, see CHANGELOG.md.

**If you are working in the LUMI environment, please see the [quickstart guide here](lumi/LUMI_QUICKSTART.md).**

## Quick Start: HuggingFace Integration

This guide shows:

- how to quickly convert and use HuggingFace BART models with Mammoth
- how to run inference with a converted (or any other) Mammoth model
- how to continue training a Mammoth model from a converted checkpoint (or any checkpoint)
- how to push a Mammoth BART model to the HuggingFace Hub

### Setup

```bash
# Clone the repository (feat/hf_integration branch)
git clone -b feat/hf_integration https://github.com/Helsinki-NLP/mammoth.git

# cd into the directory
cd mammoth

# Install the dependencies
pip install -r requirements.txt
```

### HF2Mammoth Converter: convert a HuggingFace BART model to Mammoth

#### Basic Usage

Convert a HuggingFace BART model to Mammoth format:

```bash
# Basic conversion from Hugging Face model hub (language pair needs to be specified)
python hf2mammoth.py vgaraujov/bart-base-translation-en-es ./save/path/converted_model --src-lang en --tgt-lang es

# Local model conversion (language pair needs to be specified)
python hf2mammoth.py /path/to/local/bart/model ./save/path/converted_model --src-lang en --tgt-lang es
```

**Important Notes:**

- The "converted_model" parameter serves as the prefix for converted model component names and does not affect inference.
- While the BART architecture supports many tasks, this setup has been tuned and tested specifically for translation (using `BartForConditionalGeneration`). Other model heads may carry different weights and could produce unexpected results.
- The HF BART model used for debugging is [vgaraujov/bart-base-translation-en-es](https://huggingface.co/vgaraujov/bart-base-translation-en-es).

#### What the Converter Does

1. **Downloads/Loads** the HF BART model and tokenizer
2. **Creates** an x-transformers model with matching architecture
3. **Maps weights** from HF format to x-transformers format
4. **Builds** a Mammoth model structure with task configuration
5. **Transfers weights** from x-transformers to Mammoth
6. **Saves** the complete Mammoth checkpoint
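
To make the weight-mapping step concrete: converters of this kind essentially rename `state_dict` keys according to a mapping table. The sketch below shows the idea only; the target-side key names are invented for illustration and are not Mammoth's actual parameter names.

```python
import torch
from transformers import BartForConditionalGeneration

hf_model = BartForConditionalGeneration.from_pretrained("vgaraujov/bart-base-translation-en-es")
hf_state = hf_model.state_dict()

# hypothetical excerpt of an HF -> target-format key mapping
key_map = {
    "model.encoder.embed_tokens.weight": "encoder.token_emb.weight",
    "model.decoder.embed_tokens.weight": "decoder.token_emb.weight",
}

converted = {new: hf_state[old] for old, new in key_map.items() if old in hf_state}
torch.save(converted, "converted_subset.pt")  # a real Mammoth checkpoint holds many more tensors
```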

#### Output Files

After conversion, you'll find:

- `{save_path}/` - Main Mammoth model checkpoint
- `src_vocab_{src_lang}.txt` - Vocab file of source language
- `tgt_vocab_{tgt_lang}.txt` - Vocab file of target language
- `xt_model_keys.txt` - x-transformers model layer names (debug)
- `mammoth_model_keys.txt` - Mammoth model layer names (debug)

**Note:** Mammoth models are saved as components, which is normal behavior.
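
A quick way to confirm what was written is to list the output directory (assuming the checkpoint was saved to `./save/path`; adjust to your own run):

```python
from pathlib import Path

save_path = Path("./save/path")
for item in sorted(save_path.glob("*")):
    # expect several model components plus the vocab files and debug key lists
    print(item.name)
```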

#### Supported Models

✅ **Tested Models:** (batch size: 1)

- [vgaraujov/bart-base-translation-en-es](https://huggingface.co/vgaraujov/bart-base-translation-en-es)
- [NYTK/translation-bart-128-en-hu](https://huggingface.co/NYTK/translation-bart-128-en-hu)
- [ahazeemi/bart-base-wmt-en-fr-finetuned](https://huggingface.co/ahazeemi/bart-base-wmt-en-fr-finetuned)
- [NYTK/translation-bart-hu-en](https://huggingface.co/NYTK/translation-bart-hu-en) (Not very stable)

⚠️ **Requirements:**

- The model must use a BART-based architecture (BART has its own specific settings)

---

### Translation

#### Basic Usage

Use the provided `translation_config.yaml` as a template config.

```bash
# Run translation
cd mammoth
python translate.py -config translation_config.yaml
```

**Note:** The `task_id` must match the `task_id` specified in the `tasks` section.
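
A small sanity check before launching translation can catch a mismatched `task_id` early. The field names below are taken from the note above and may not match the full config schema, so treat this as a sketch:

```python
import yaml

with open("translation_config.yaml") as fh:
    cfg = yaml.safe_load(fh)

task_id = cfg.get("task_id")
tasks = cfg.get("tasks", {})
if task_id not in tasks:
    raise SystemExit(f"task_id '{task_id}' not found among tasks: {list(tasks)}")
print("task_id matches a defined task")
```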

### Training from the converted model (or any checkpoint)

#### Basic Usage

Use the provided `training_ft.yaml` as a template config.

```bash
# Run training
cd mammoth
python train.py -config training_ft.yaml
```

### Mammoth2HF Converter: convert a Mammoth model to a HuggingFace BART model

**Two Conversion Scenarios:**

1. **HF → Mammoth → HF:** When converting a Mammoth model that was originally converted from HuggingFace, the model uses the original HF tokenizer (the same tokenizer used during inference).

2. **Native Mammoth → HF:** When converting a model trained natively in the Mammoth framework to HuggingFace format, special considerations apply. HuggingFace BART natively uses a BPE tokenizer that requires both a vocab file and a merges file; if the Mammoth model has no merges file (typically because it was trained with a SentencePiece tokenizer), switch to `LlamaTokenizer` instead. That tokenizer requires the SentencePiece model created during the original vocabulary-building phase of training.
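
For scenario 2, wrapping the existing SentencePiece model in a `LlamaTokenizer` might look like the following sketch (the paths are placeholders for the `.model` file produced when the Mammoth vocabulary was built and for the conversion output directory):

```python
from transformers import LlamaTokenizer

# build a SentencePiece-backed tokenizer from the model used during training
tokenizer = LlamaTokenizer(vocab_file="/path/to/spm.model")

# store the tokenizer next to the converted weights
tokenizer.save_pretrained("/path/to/output/hf_model")
```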

#### Basic Usage

Scenario 1:

##### Convert Model Only

```bash
python hf2mammoth2hf.py \
--mammoth_model /path/to/mammoth/checkpoint \
--hf_model /path/to/output/hf_model \
--tokenizer /path/to/original/tokenizer
```

##### Convert and Push to HuggingFace Hub

```bash
python hf2mammoth2hf.py \
--mammoth_model /path/to/mammoth/checkpoint \
--hf_model your-username/model-name \
--tokenizer /path/to/original/tokenizer \
--push_to_hub
```
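
After either conversion, a quick sanity check is to load the exported directory back with `transformers` and generate a translation, assuming the output is a standard BART checkpoint (the paths below are placeholders):

```python
from transformers import AutoTokenizer, BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("/path/to/output/hf_model")
tokenizer = AutoTokenizer.from_pretrained("/path/to/output/hf_model")

inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```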
61 changes: 61 additions & 0 deletions helper/README.md
@@ -0,0 +1,61 @@
# 🦣 MAMMOTH: Massively Multilingual Modular Open Translation @ Helsinki
This repository contains the code for 🦣 MAMMOTH, the modular translation toolkit from Helsinki-NLP.

This library is built on top of OpenNMT-py.
[OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py) is the [PyTorch](https://github.com/pytorch/pytorch) version of the [OpenNMT](https://opennmt.net) project, an open-source (MIT) neural machine translation framework. It is designed to be research-friendly, making it easy to try out new ideas in translation, summarization, morphology, and many other domains. Some companies have proven the code to be production-ready.

### Documentation
The original ONMT-py documentation is available [here](https://opennmt.net/OpenNMT-py/).
Our own modifications (currently-under-development) are documented [here](https://helsinki-nlp.github.io/mammoth/).

**Note:** We would greatly appreciate issue reports for this repository and its documentation.

### Acknowledgements
We thank the NVIDIA AI Technology Center Finland for their help with the multi-gpu/node implementation.

### Quick Installation of MAMMOTH with Helper Features

The new **helper features** of MAMMOTH are meant for guarded, sanity-checked, and optimized deployment of MAMMOTH workflows on supercomputers, as well as for the cumulative, idempotent, backed-up, and replicable creation of new job configurations (directories, SLURM scripts, data files, and YAML files).

Follow the instructions on [this page](https://github.com/Helsinki-NLP/mammoth/tree/feat/helper/helper/bin/conf).
This will create the virtual environment and a directory tree containing the MAMMOTH software.

### Downloading Datasets

Follow the instructions on [this
page](https://github.com/Helsinki-NLP/mammoth/tree/feat/helper/helper/bin/gets)
for the easy downloading and management of commonly used datasets.

### Setting Up Experiments

Follow the instructions on [this
page](https://github.com/Helsinki-NLP/mammoth/tree/feat/helper/helper/bin/create)
for the easy creation of job directories and their sbatch files.
(We will also support the easy creation of `config.yaml` files and the easy
setup of input files for your jobs.)

### Running Training Jobs

We are currently testing scripts that facilitate the creation of the job-specific local environment (directory trees and configurations) for your training (and other) jobs.

The SLURM launch helpers are available but still under testing. Follow the instructions on [this page](https://github.com/Helsinki-NLP/mammoth/tree/feat/helper/helper/bin/slurm) to use the ready-made execution scripts and to create shorter and safer SLURM scripts with the helper feature.

### Translation, Evaluation and Efficient Inference

We are working on extending the domain of the helper features.

## Metadata of the Helper Features (Added on Top of MAMMOTH NLP)

Authors: Anssi Yli-Jyrä (c) 2025
License: CC-NC-BY
