Fix model_input_names singleton issue causing shared state

Yashwant Bezawada · Yashwant Bezawada · commit 6ba1ffbe5d25 · 2025-11-05T20:16:41.000-06:00
Fixes huggingface#42024

The model_input_names attribute was defined as a class-level list, and when initializing tokenizer instances, they were all pointing to the same list object. This meant modifying model_input_names on one instance would affect all other instances.

The issue was in tokenization_utils_base.py line 1417:

```python
self.model_input_names = kwargs.pop("model_input_names", self.model_input_names)
```

When no model_input_names is passed in kwargs, it would use the class attribute directly (self.model_input_names), creating a reference to the shared list instead of creating a new list for the instance.

Fixed by wrapping it in list() to ensure each instance gets its own copy:

```python
self.model_input_names = list(kwargs.pop("model_input_names", self.model_input_names))
```

This is a standard pattern for handling mutable default values in Python.

diff --git a/src/transformers/tokenization_utils_base.py b/src/transformers/tokenization_utils_base.py
@@ -1414,7 +1414,7 @@ def __init__(self, **kwargs):
                 f"Truncation side should be selected between 'right' and 'left', current value: {self.truncation_side}"
             )
 
-        self.model_input_names = kwargs.pop("model_input_names", self.model_input_names)
+        self.model_input_names = list(kwargs.pop("model_input_names", self.model_input_names))
 
         # By default, cleaning tokenization spaces for both fast and slow tokenizers
         self.clean_up_tokenization_spaces = kwargs.pop("clean_up_tokenization_spaces", False)
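
The aliasing described in the commit message can be reproduced outside the library. The following is a minimal sketch, not the actual tokenizer code: `BuggyTokenizer` and `FixedTokenizer` are hypothetical stand-in classes, and the default list contents are only illustrative. Each class mirrors just the one assignment touched by this change.

```python
class BuggyTokenizer:
    # Hypothetical stand-in: a class-level mutable default.
    model_input_names = ["input_ids", "token_type_ids", "attention_mask"]

    def __init__(self, **kwargs):
        # Pre-fix pattern: with no kwarg, the fallback is the class list
        # itself, so the instance attribute aliases the shared object.
        self.model_input_names = kwargs.pop("model_input_names", self.model_input_names)


class FixedTokenizer:
    # Same hypothetical class-level default as above.
    model_input_names = ["input_ids", "token_type_ids", "attention_mask"]

    def __init__(self, **kwargs):
        # Post-fix pattern: list() makes a shallow copy per instance.
        self.model_input_names = list(kwargs.pop("model_input_names", self.model_input_names))


a, b = BuggyTokenizer(), BuggyTokenizer()
a.model_input_names.append("extra")
# b sees the mutation because both instances share the class list.
buggy_leaked = "extra" in b.model_input_names

c, d = FixedTokenizer(), FixedTokenizer()
c.model_input_names.append("extra")
# d is unaffected because c mutated its own copy.
fixed_leaked = "extra" in d.model_input_names

print(buggy_leaked, fixed_leaked)
```

Note that `list()` also copies a list passed explicitly by the caller, so later mutations of the caller's list no longer reach the instance. The copy is shallow, which is sufficient here because the elements are strings.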
