[NLP] Enable import of models with missing vocabulary files #721

@davidkyle

Description

Eland needs access to a model's vocabulary file so that it can be uploaded to Elasticsearch along with the model definition. In some cases the vocab file is not included in the model repo on HuggingFace; one example is the Jina Reranker. The eland_import_hub_model script fails with this error when the file is missing:

Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.

This happens because AutoTokenizer.from_pretrained(...) is called with use_fast=False, which selects the slow, sentencepiece-based tokenizer that requires the vocab file to be present on disk (the fast tokenizer can load from tokenizer.json instead).
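For reference, the failure can be reproduced outside of eland with just the tokenizer call (a standalone illustration, not eland's exact code path):

from transformers import AutoTokenizer

# use_fast=False selects the slow XLM-RoBERTa tokenizer, which loads
# the sentencepiece vocab (sentencepiece.bpe.model) from disk; the Jina
# reranker repo does not ship that file, so this raises the OSError
# shown in the traceback below.
AutoTokenizer.from_pretrained(
    "jinaai/jina-reranker-v2-base-multilingual",
    use_fast=False,
)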

It should be possible to download the vocab from the base model; we should also investigate other ways to obtain the vocab file when it is not present in the model repo.
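One possible workaround sketch, assuming the missing vocab is the standard XLM-RoBERTa sentencepiece model (the base-model repo id and local directory name below are assumptions for illustration):

from huggingface_hub import hf_hub_download, snapshot_download

# Pull the reranker repo (which lacks the vocab file) into a local
# directory, then fetch the sentencepiece vocab from the base model
# and place it alongside the other files.
local_dir = "jina-reranker-v2-base-multilingual"
snapshot_download("jinaai/jina-reranker-v2-base-multilingual", local_dir=local_dir)
hf_hub_download(
    repo_id="FacebookAI/xlm-roberta-base",   # assumed base model
    filename="sentencepiece.bpe.model",
    local_dir=local_dir,
)

# The patched directory should then load with the same slow-tokenizer call:
# AutoTokenizer.from_pretrained(local_dir, use_fast=False)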

For example, running:

eland_import_hub_model --cloud-id labs:xxxxxx== --hub-model-id jinaai/jina-reranker-v2-base-multilingual --task-type text_similarity --es-api-key xxxx== --start --clear-previous

fails with this error:
2024-09-03 01:59:53,443 INFO : Establishing connection to Elasticsearch
2024-09-03 01:59:53,940 INFO : Connected to cluster named 'XXX' (version: 8.15.0)
2024-09-03 01:59:53,942 INFO : Loading HuggingFace transformer tokenizer and model 'jinaai/jina-reranker-v2-base-multilingual'
/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2256, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta.py", line 154, in __init__
    self.sp_model.Load(str(vocab_file))
  File "/usr/local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
OSError: Not found: "None": No such file or directory Error #2
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/bin/eland_import_hub_model", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/eland/cli/eland_import_hub_model.py", line 298, in main
    tm = TransformerModel(
  File "/usr/local/lib/python3.10/site-packages/eland/ml/pytorch/transformers.py", line 655, in __init__
    self._tokenizer = transformers.AutoTokenizer.from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 768, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2024, in from_pretrained
    return cls._from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2258, in _from_pretrained
    raise OSError(
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.


Labels: bug (Something isn't working), topic:NLP (Issue or PR about NLP model support and eland_import_hub_model)
