
How do I replace spare tokens? #31475

Open
@onaka-ga-pkpk

Description


System Info

I want to SFT Mistral-v0.3 with my own chat template.
So I followed this comment and replaced some of the spare [control_n] tokens with special tokens for the chat template.
However, the new tokens were appended to the vocabulary instead, and the vocabulary size increased.
Is there a way to replace existing tokens in the vocabulary in place?
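For context, this is a minimal sketch of the kind of in-place edit being attempted. The directory name, the exact [control_n] strings, and the assumption that the same strings also appear under model.vocab in tokenizer.json are all illustrative, not taken from the linked comment:

import json
from pathlib import Path

model_dir = Path("Mistral-7B-v0.3")  # assumed local checkpoint directory
replacements = {
    "[control_8]": "<|system|>",     # ids 10-13 in the files shown below
    "[control_9]": "<|user|>",
    "[control_10]": "<|assistant|>",
    "[control_11]": "<|eot|>",
}

tok_path = model_dir / "tokenizer.json"
tok = json.loads(tok_path.read_text())

# Rename the entries in added_tokens, keeping their original ids.
for entry in tok["added_tokens"]:
    if entry["content"] in replacements:
        entry["content"] = replacements[entry["content"]]

# If the same strings also live in the base vocabulary (model.vocab),
# they would need renaming there too so the new names map to the old ids.
vocab = tok["model"]["vocab"]
for old, new in replacements.items():
    if old in vocab:
        vocab[new] = vocab.pop(old)

tok_path.write_text(json.dumps(tok, indent=2, ensure_ascii=False))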

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

tokenizer.json

{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
    ...
    {
      "id": 10,
      "content": "<|system|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 11,
      "content": "<|user|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 12,
      "content": "<|assistant|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 13,
      "content": "<|eot|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    ...

tokenizer_config.json

{
  "add_bos_token": true,
  "add_eos_token": false,
  "add_prefix_space": true,
  "added_tokens_decoder": {
    ...
    "10": {
      "content": "<|system|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "11": {
      "content": "<|user|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "12": {
      "content": "<|assistant|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "13": {
      "content": "<|eot|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    ...
}

test code

from pprint import pprint
from transformers import AutoTokenizer

# model_dir points at the edited checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_dir)
pprint(tokenizer.added_tokens_decoder)

output

...
 768: AddedToken("[control_766]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 769: AddedToken("[control_767]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 770: AddedToken("[control_768]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 32768: AddedToken("<|system|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 32769: AddedToken("<|user|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 32770: AddedToken("<|assistant|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 32771: AddedToken("<|eot|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True)}
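The ids above 32767 show that the four tokens were appended rather than written over ids 10-13. A quick check makes the growth visible (the expected numbers assume Mistral-v0.3's base vocabulary of 32768 tokens, consistent with the output above):

print(len(tokenizer))                                 # 32772, not 32768
print(tokenizer.convert_tokens_to_ids("<|system|>"))  # 32768, not 10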

Expected behavior

[control_n] tokens can be replaced with arbitrary tokens, keeping their original ids and leaving the vocabulary size unchanged.
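In other words, after a successful replacement the following should hold (sketch, with the token names and the 32768 base vocabulary size assumed as above):

assert tokenizer.convert_tokens_to_ids("<|system|>") == 10
assert tokenizer.convert_tokens_to_ids("<|eot|>") == 13
assert len(tokenizer) == 32768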
