This repository contains the code for BioNER, an LSTM-based model designed for biomedical named entity recognition (NER).
We provide the model trained on the following datasets:
| Dataset | Mirror (Siasky) | Mirror (Mega) |
|---|---|---|
| MedMentions full | Download Model | Download Model |
| MedMentions ST21pv | Download Model | Download Model |
| JNLPBA | Download Model | Download Model |
In addition, we provide the word embeddings trained with fastText on the PubMed Baseline 2021 for the following character n-gram ranges:
| n-gram range | Mirror (Siasky) | Mirror (Mega) | Mirror (Storj) |
|---|---|---|---|
| 3-4 | Download | Download | Download |
| 3-6 | Download | Download | Download |
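
If you want to inspect the embeddings outside of BioNER, they can be loaded with gensim. The sketch below is illustrative only: gensim is not a BioNER dependency, the file name is a placeholder for whichever archive you downloaded, and it assumes the download is a fastText binary model that retains the subword information.

```python
# Minimal sketch, not part of BioNER: inspect the downloaded embeddings with gensim.
# Assumption: the file is a fastText binary (.bin) model with subword n-grams;
# the file name below is a hypothetical placeholder.
from gensim.models.fasttext import load_facebook_vectors

vectors = load_facebook_vectors("pubmed2021_ngrams_3-4.bin")  # hypothetical file name

# fastText composes vectors from character n-grams (3-4 here), so even
# out-of-vocabulary biomedical tokens receive an embedding.
print(vectors["pseudohypoparathyroidism"][:5])
print(vectors.most_similar("carcinoma", topn=3))
```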
Install the dependencies:
```bash
pip install -r requirements.txt
```

As deterministic behaviour is enabled by default, you may need to set the debug environment variable `CUBLAS_WORKSPACE_CONFIG` to prevent `RuntimeError`s when using CUDA:

```bash
export CUBLAS_WORKSPACE_CONFIG=:4096:8
```
BioNER expects a dataset in the CoNLL-2003 format. We used the tool bconv for preprocessing the MedMentions dataset.
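
For reference, a minimal conversion from MedMentions to CoNLL could look like the sketch below. It assumes bconv's `load`/`dump` interface and the PubTator release of MedMentions, and uses placeholder file names; the exact preprocessing used for the released models may differ.

```python
# Minimal sketch, assuming bconv's load/dump interface and the PubTator release
# of MedMentions; file names are placeholders for your local copies.
import bconv

collection = bconv.load("corpus_pubtator.txt", fmt="pubtator")
with open("medmentions.conll", "w", encoding="utf8") as out:
    bconv.dump(collection, out, fmt="conll")
```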
You can either use the provided Makefile to train the BioNER model or execute `train_bioner.py` directly.

Makefile:
Don't forget to fill in the empty fields in the Makefile before the first start.

```bash
make train-bioner ngrams=3-4
```

You can annotate a CoNLL-2003 dataset in the following way:
```bash
python annotate_dataset.py \
    --embeddings \  # path to the word embeddings file
    --dataset \     # path to the CoNLL-2003 dataset
    --outputFile \  # path to the output file for storing the annotated dataset
    --model         # path to the trained BioNER model
```

Furthermore, you can add the flag `--enableExportCoNLL` to export an additional file to the same parent folder as the `outputFile`, which can be used for evaluation with the original conlleval.pl Perl script (source).
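
If you prefer a Python-based evaluation over the Perl script, the exported file can also be scored with a library such as seqeval. This is not part of BioNER; the sketch below assumes the conlleval-style layout of one `token gold predicted` line per token with blank lines between sentences, and the file name is a placeholder.

```python
# Minimal sketch (assumption: the exported file follows the conlleval.pl layout,
# i.e. the gold tag is the second-to-last column and the prediction is the last).
from seqeval.metrics import classification_report

def read_conlleval_file(path):
    gold, pred = [], []
    g, p = [], []
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if not line:          # blank line marks a sentence boundary
                if g:
                    gold.append(g)
                    pred.append(p)
                    g, p = [], []
                continue
            cols = line.split()
            g.append(cols[-2])    # gold tag
            p.append(cols[-1])    # predicted tag
    if g:
        gold.append(g)
        pred.append(p)
    return gold, pred

gold, pred = read_conlleval_file("annotated_dataset.conll")  # hypothetical file name
print(classification_report(gold, pred))
```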