Conversation

@Steboss
Contributor

@Steboss Steboss commented Mar 7, 2025

No description provided.

olupton
olupton previously approved these changes Apr 1, 2025
Collaborator

@olupton olupton left a comment

I think we can get this merged; it's been open a while.

That said, I think the metric/perf extraction is going to need some more changes.

@Steboss
Contributor Author

Steboss commented Apr 8, 2025

@olupton
I see this error when running AXLearn:

[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] Traceback (most recent call last):
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]   File "/usr/local/bin/fuji-train-perf.py", line 9, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]     from axlearn.experiments.text.gpt import c4_trainer
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]   File "/opt/axlearn/axlearn/experiments/text/gpt/__init__.py", line 5, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]     from axlearn.experiments.text.gpt import (
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]   File "/opt/axlearn/axlearn/experiments/text/gpt/c4_trainer.py", line 44, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]     from axlearn.common.input_lm import lm_text_preprocessor
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]   File "/opt/axlearn/axlearn/common/input_lm.py", line 12, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]     import seqio
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]   File "/usr/local/lib/python3.12/dist-packages/seqio/__init__.py", line 18, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]     from seqio.dataset_providers import *
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]   File "/usr/local/lib/python3.12/dist-packages/seqio/dataset_providers.py", line 39, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]     from seqio import metrics as metrics_lib
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]   File "/usr/local/lib/python3.12/dist-packages/seqio/metrics.py", line 27, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]     from seqio import utils
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]   File "/usr/local/lib/python3.12/dist-packages/seqio/utils.py", line 29, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]     from seqio.vocabularies import Vocabulary
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]   File "/usr/local/lib/python3.12/dist-packages/seqio/vocabularies.py", line 26, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]     import tensorflow_text as tf_text
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]   File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/__init__.py", line 21, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]     from tensorflow_text.python import keras
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]   File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/keras/__init__.py", line 21, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]     from tensorflow_text.python.keras.layers import *
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]   File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/keras/layers/__init__.py", line 22, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]     from tensorflow_text.python.keras.layers.tokenization_layers import *
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]   File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/keras/layers/tokenization_layers.py", line 24, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]     from tensorflow_text.python.ops import unicode_script_tokenizer
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]   File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/ops/__init__.py", line 26, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]     from tensorflow_text.python.ops.boise_offset_converter import boise_tags_to_offsets
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]   File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/ops/boise_offset_converter.py", line 32, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]     gen_boise_offset_converter = load_library.load_op_library(resource_loader.get_path_to_datafile('_boise_offset_converter.so'))
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]   File "/usr/local/lib/python3.12/dist-packages/tensorflow/python/framework/load_library.py", line 54, in load_op_library
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]     lib_handle = py_tf.TF_LoadLibrary(library_filename)
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/python3.12/dist-packages/tensorflow_text/python/ops/_boise_offset_converter.so: undefined symbol: _ZN6tflite4shim23TfShapeInferenceContextC1EPN10tensorflow15shape_inference16InferenceContextE

You mentioned a tensorflow error in the most recent commit. Is this error related, and is there a specific version I should use for the moment? Thank you :)

@olupton
Collaborator

olupton commented Apr 9, 2025

Yes, it's related. #1383 added a pin to avoid this for MaxText, and it looks like axlearn needs it too. You can see it's picking up tensorflow==2.19.0 here https://github.com/NVIDIA/JAX-Toolbox/actions/runs/14330157713/job/40166118596#step:11:620 and that will be translated into tensorflow-cpu before finalisation. You can add a pin to make sure tensorflow-cpu is 2.18.1.

Collaborator

@olupton olupton left a comment

Generally LGTM; see the separate comment about tensorflow.

@Steboss Steboss requested a review from olupton April 10, 2025 21:31
@Steboss Steboss merged commit e14c8fc into main Apr 11, 2025
@Steboss Steboss deleted the sbosisio/axlearn_improvements branch April 11, 2025 08:40