MariTerm to ItalWordNet Mapping: Complete Lexical Integration Pipeline
Modular Python toolkit for extracting, scoring, filtering, and integrating shared lemmas between MariTerm (maritime terminology) and ItalWordNet (Italian WordNet). Processes XML-encoded lexical resources through a seven-stage pipeline from candidate identification to finalized bidirectional plugin links, with optional analysis and reporting utilities.
Stage 1 candidates.py MariT.xml + IWN.xml → candidates.csv
Stage 2 score.py candidates.csv → breakdown.csv
Stage 3 filter.py breakdown.csv → MariT_filtered.xml, IWN_filtered.xml
Stage 4 update.py filtered XMLs → IWN_updates.xml
Stage 5 merge.py IWN_updates.xml → IWN_pre_merge.xml
Stage 6 analyze.py IWN_pre_merge.xml → console report
Stage 7 finalize.py IWN_post_merge.xml → IWN_final.xml, MariT_final.xml
Stage 2.5 (Optional): report.py - Match Classification Reports
Generates detailed breakdowns showing why each match was accepted or rejected. It does not affect pipeline output; it is purely an analysis and reporting tool.

python scripts/report.py

Input: results/breakdown.csv (from score.py)
Output:
- results/reports/accepted_matches.txt - Detailed breakdown of accepted pairs with full score components
- results/reports/rejected_matches.txt - Detailed breakdown of rejected pairs with rejection reasons
Use cases:
- Understanding matching algorithm behavior
- Identifying false positive/negative patterns
- Threshold tuning and validation
- Documentation for papers and reports
MT2IWN/
├── data/ XML input files (not in repo)
├── results/ Generated outputs (not in repo)
│ ├── candidates.csv
│ ├── breakdown.csv
│ ├── reports/ Match classification reports (optional)
│ │ ├── accepted_matches.txt
│ │ └── rejected_matches.txt
├── scripts/
│ ├── config.py Paths, Config, parse_xml, threshold constants
│ ├── candidates.py CLI — Stage 1
│ ├── score.py CLI — Stage 2
│ ├── report.py CLI — Optional Stage 2.5 (NEW)
│ ├── filter.py CLI — Stage 3
│ ├── update.py CLI — Stage 4
│ ├── merge.py CLI — Stage 5
│ ├── analyze.py CLI — Stage 6
│ ├── finalize.py CLI — Stage 7
│ ├── extraction/ Lemma extraction module
│ ├── similarity/ Normalization and scoring
│ ├── matching/ Word meaning matching
│ │ ├── matcher.py Core matching algorithm
│ │ ├── normalizer.py Gloss preprocessing
│ │ └── writer.py Output formatting
│ ├── filtering/ XML filtering and transcription
│ ├── updating/ IWN entry creation and update
│ ├── merging/ File merging and formatting
│ ├── analysis/ Post-hoc checks and reporting
│ │ ├── audit.py Post-merge validation
│ │ ├── identifier.py Update identification
│ │ └── report.py Match classification and formatting (NEW)
│ └── plugins/ Plugin link finalization
└── README.md
git clone https://github.com/CoPhi/mt2iwn.git
cd mt2iwn
pip install pandas scikit-learn numpy

Python 3.8+ required. No other external dependencies.
All threshold values and paths are defined in scripts/config.py:
| Constant | Default | Used By | Purpose |
|---|---|---|---|
| GLOSS_HIGH_THRESHOLD | 0.43 | score.py, filter.py, report.py | Gate A: Accept on high gloss similarity alone |
| GLOSS_LOW_THRESHOLD | 0.13 | score.py, filter.py, report.py | Gate B: Minimum gloss with relation support |
| REL_SUPPORT_THRESHOLD | 0.09 | score.py, filter.py, report.py | Gate B: Minimum relation score |
| REPORT_OUT_DIR | results/reports/ | report.py | Report output location |
Matching constraints:
- One-to-one mapping: Each MariTerm sense matches to at most one IWN sense, and vice versa
- Two-gate threshold logic: Matches pass via high gloss similarity (Gate A) OR moderate gloss + strong relation support (Gate B)
To change thresholds, edit these values in config.py - all relevant stages will use the updated values automatically.
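The two-gate logic described above can be sketched as follows. This is a minimal illustration, not the code in score.py or filter.py: the function name `classify_match` is hypothetical, the constants are copied here rather than imported from config.py, and the use of inclusive comparisons (`>=`) is an assumption.

```python
# Defaults copied from the threshold table; the real values live in scripts/config.py.
GLOSS_HIGH_THRESHOLD = 0.43
GLOSS_LOW_THRESHOLD = 0.13
REL_SUPPORT_THRESHOLD = 0.09


def classify_match(gloss_score: float, rel_score: float) -> str:
    """Classify one candidate pair as 'gate_a', 'gate_b', or 'rejected'.

    Hypothetical helper: inclusive comparisons are an assumption.
    """
    # Gate A: gloss similarity is high enough on its own.
    if gloss_score >= GLOSS_HIGH_THRESHOLD:
        return "gate_a"
    # Gate B: moderate gloss similarity backed by relation support.
    if gloss_score >= GLOSS_LOW_THRESHOLD and rel_score >= REL_SUPPORT_THRESHOLD:
        return "gate_b"
    return "rejected"


print(classify_match(0.50, 0.00))  # gate_a
print(classify_match(0.20, 0.10))  # gate_b
print(classify_match(0.20, 0.05))  # rejected
print(classify_match(0.10, 0.90))  # rejected
```

Note that a strong relation score never rescues a pair whose gloss similarity falls below the lower gloss threshold, as the last call shows.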
Place MariT_03_24.xml and IWN_03_24.xml in data/, then run each stage:
python scripts/candidates.py
python scripts/score.py
python scripts/filter.py
python scripts/update.py
python scripts/merge.py
python scripts/analyze.py
python scripts/finalize.py

# Generate detailed acceptance/rejection reports
python scripts/report.py
# Use custom thresholds
python scripts/report.py --gloss-high 0.45 --out-dir results/custom_reports/
# See all options
python scripts/report.py --help

All scripts use the default paths from scripts/config.py.
Run any script with --help to see all options.
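For unattended runs, the seven stage commands can be chained from a small driver script. The sketch below is not part of the toolkit: `stage_commands` and `run_pipeline` are hypothetical helpers, and it assumes the scripts are invoked from the repository root with their default paths.

```python
import subprocess
import sys

# Stage scripts in pipeline order, as listed in the stage overview.
STAGES = [
    "candidates.py",
    "score.py",
    "filter.py",
    "update.py",
    "merge.py",
    "analyze.py",
    "finalize.py",
]


def stage_commands(scripts_dir="scripts", python=sys.executable):
    """Build one command (as an argument list) per pipeline stage."""
    return [[python, f"{scripts_dir}/{name}"] for name in STAGES]


def run_pipeline(scripts_dir="scripts"):
    """Run every stage in order, stopping at the first failure."""
    for cmd in stage_commands(scripts_dir):
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later stages never run on incomplete input.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` from the repository root runs the stages in sequence; report.py is left out because it is optional and does not affect pipeline output.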
Each module has a README.md with full API documentation:
- scripts/extraction/README.md - Lemma extraction from XML
- scripts/similarity/README.md - Normalization and TF-IDF scoring
- scripts/matching/README.md - One-to-one sense matching
- scripts/filtering/README.md - XML filtering and transcription
- scripts/updating/README.md - IWN entry creation
- scripts/merging/README.md - File merging and formatting
- scripts/analysis/README.md - Post-hoc validation and reporting
- scripts/plugins/README.md - Plugin link finalization
- TF-IDF-based gloss similarity with weighted mean scoring
- Relation-aware scoring with configurable bonus/malus weights
- One-to-one constraint enforcement - no duplicate matches
- Two-gate threshold logic for balanced precision/recall
- Multi-stage validation: Scoring → filtering → updating → merging → analysis
- One-to-one mapping verification: Each sense matches at most once
- Manual validation support: Post-hoc inspection of merged XML
- Detailed reporting: Full score breakdowns for all candidates
- Enhanced XML resources with bidirectional plugin links
- CSV breakdowns with complete scoring details
- Text reports with human-readable match classifications
- Console summaries for quick pipeline monitoring
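To make the TF-IDF gloss similarity and the one-to-one constraint listed above more concrete, here is a minimal sketch built on scikit-learn. It is an illustration under stated assumptions, not the toolkit's implementation: the function names `gloss_similarity` and `one_to_one` are hypothetical, the greedy selection strategy is an assumption, the example glosses are invented, and the weighted mean and relation bonus/malus components are omitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def gloss_similarity(mariterm_glosses, iwn_glosses):
    """Return TF-IDF cosine similarities (rows: MariTerm, columns: IWN)."""
    vectorizer = TfidfVectorizer(lowercase=True)
    # Fit on both resources so they share one vocabulary and one set of IDF weights.
    vectorizer.fit(list(mariterm_glosses) + list(iwn_glosses))
    return cosine_similarity(
        vectorizer.transform(mariterm_glosses),
        vectorizer.transform(iwn_glosses),
    )


def one_to_one(sim, min_score=0.0):
    """Greedily keep the best-scoring pairs, using each row and column at most once."""
    ranked = sorted(
        ((float(sim[i, j]), i, j)
         for i in range(sim.shape[0])
         for j in range(sim.shape[1])),
        reverse=True,
    )
    used_rows, used_cols, links = set(), set(), []
    for score, i, j in ranked:
        if score <= min_score or i in used_rows or j in used_cols:
            continue
        used_rows.add(i)
        used_cols.add(j)
        links.append((i, j))
    return sorted(links)


# Invented glosses, for illustration only.
mariterm = ["parte anteriore della nave", "corda usata per ormeggiare la nave"]
iwn = ["la parte anteriore di una nave", "grossa corda per ormeggiare", "animale domestico"]

sim = gloss_similarity(mariterm, iwn)
links = one_to_one(sim)
print(sim.round(2))
print(links)
```

Each tuple in `links` is a (MariTerm index, IWN index) pair. The unrelated third IWN gloss shares no vocabulary with either MariTerm gloss, so it receives a similarity of zero and is left unmatched.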
If you use this toolkit in your research, please cite:
Galiero, L. & Boschetti, F. (2026). MT2IWN: MariTerm to ItalWordNet Integration Toolkit (Version 1.0.0) [Software]. Zenodo. https://doi.org/10.5281/zenodo.18788538
BibTeX:
@software{galiero2026mt2iwn,
author = {Galiero, Lucia and Boschetti, Federico},
title = {{MT2IWN}: {MariTerm} to {ItalWordNet} Integration Toolkit},
year = {2026},
publisher = {Zenodo},
version = {1.0.0},
doi = {10.5281/zenodo.18788538},
url = {https://github.com/CoPhi/mt2iwn}
}

Galiero, L., Boschetti, F., Del Gratta, R., Del Grosso, A. M., & Monachini, M. (2026). Reviving Legacy WordNet-like Resources: MariTerm and ItalWordNet Renewal through Mutual Expansion and Plug-in Links. Journal of Open Humanities Data.
MIT - See LICENSE file for details.
This toolkit was developed as part of a research project at CNR-ILC. For bug reports, feature requests, or contributions, please open an issue or pull request on GitHub.
- Initial release
- Seven-stage integration pipeline
- Optional reporting utilities
- Configurable threshold system
Last Updated: April 20th, 2026