Improve error handling, s3 mounting, distributed tests for axlearn
#1332
olupton
left a comment
I think we can get this merged, it's been open a while.
That said, I think the metric/perf extraction is going to need some more changes.
…-Toolbox into sbosisio/axlearn_improvements
@olupton
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] Traceback (most recent call last):
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] File "/usr/local/bin/fuji-train-perf.py", line 9, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] from axlearn.experiments.text.gpt import c4_trainer
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] File "/opt/axlearn/axlearn/experiments/text/gpt/__init__.py", line 5, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] from axlearn.experiments.text.gpt import (
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] File "/opt/axlearn/axlearn/experiments/text/gpt/c4_trainer.py", line 44, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] from axlearn.common.input_lm import lm_text_preprocessor
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] File "/opt/axlearn/axlearn/common/input_lm.py", line 12, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] import seqio
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] File "/usr/local/lib/python3.12/dist-packages/seqio/__init__.py", line 18, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] from seqio.dataset_providers import *
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] File "/usr/local/lib/python3.12/dist-packages/seqio/dataset_providers.py", line 39, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] from seqio import metrics as metrics_lib
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] File "/usr/local/lib/python3.12/dist-packages/seqio/metrics.py", line 27, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] from seqio import utils
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] File "/usr/local/lib/python3.12/dist-packages/seqio/utils.py", line 29, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] from seqio.vocabularies import Vocabulary
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] File "/usr/local/lib/python3.12/dist-packages/seqio/vocabularies.py", line 26, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] import tensorflow_text as tf_text
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/__init__.py", line 21, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] from tensorflow_text.python import keras
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/keras/__init__.py", line 21, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] from tensorflow_text.python.keras.layers import *
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/keras/layers/__init__.py", line 22, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] from tensorflow_text.python.keras.layers.tokenization_layers import *
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/keras/layers/tokenization_layers.py", line 24, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] from tensorflow_text.python.ops import unicode_script_tokenizer
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/ops/__init__.py", line 26, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] from tensorflow_text.python.ops.boise_offset_converter import boise_tags_to_offsets
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/ops/boise_offset_converter.py", line 32, in <module>
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] gen_boise_offset_converter = load_library.load_op_library(resource_loader.get_path_to_datafile('_boise_offset_converter.so'))
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] File "/usr/local/lib/python3.12/dist-packages/tensorflow/python/framework/load_library.py", line 54, in load_op_library
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] lib_handle = py_tf.TF_LoadLibrary(library_filename)
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[pod/axlearn-fuji-3b-14336074367-9xxfv/axlearn-fuji-model] tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/python3.12/dist-packages/tensorflow_text/python/ops/_boise_offset_converter.so: undefined symbol: _ZN6tflite4shim23TfShapeInferenceContextC1EPN10tensorflow15shape_inference16InferenceContextE
you were mentioning an error in
Yes, it's related. #1383 added a pin to avoid this for MaxText -- looks like axlearn needs it too. You can see it's picking up
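For anyone hitting the same `undefined symbol` failure: one way to spot the mismatch before the import blows up is to compare the installed `tensorflow` and `tensorflow-text` versions from package metadata, without importing either. This is only a rough sketch. It assumes `tensorflow-text` wheels are built against the TensorFlow release with the same major.minor version, and that TensorFlow is installed under the distribution name `tensorflow`. The helper names are hypothetical and not part of this repository or axlearn.

```python
import re
from importlib import metadata


def major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.17.0rc1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def versions_compatible(tf_version: str, text_version: str) -> bool:
    """Assumption: the two packages must share the same major.minor version."""
    return major_minor(tf_version) == major_minor(text_version)


def check_installed() -> str:
    """Compare installed versions using metadata only (no TensorFlow import)."""
    try:
        # Assumption: TensorFlow is installed as "tensorflow", not a variant wheel.
        tf_version = metadata.version("tensorflow")
        text_version = metadata.version("tensorflow-text")
    except metadata.PackageNotFoundError as exc:
        return f"not installed: {exc}"
    if versions_compatible(tf_version, text_version):
        return f"ok: tensorflow {tf_version}, tensorflow-text {text_version}"
    return f"mismatch: tensorflow {tf_version}, tensorflow-text {text_version}"
```

Running `print(check_installed())` inside the container would show which pair of versions got installed. A `mismatch` result points at the same class of problem as the traceback above, though a matching major.minor does not by itself guarantee the shared libraries are ABI-compatible.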
olupton
left a comment
Generally LGTM, see separate comment about tensorflow
…n the outputs - easy for the future
No description provided.