Improve error handling, s3 mounting, distributed tests for axlearn
#1332
Merged
80 commits
e5ee472
Improve error handling, s3 mounting, distributed tests for
Steboss 9de29cd
test mounted s3 bucket
Steboss 6c6cd34
Fix action
Steboss aa63ac6
fix the bash shell and remember to mount the volume
Steboss 0ae1b83
start working on the shell of the action
Steboss 8f65cd4
try to fix using posix-sh-compatible
Steboss 7e70be7
test on name of the volume and location
Steboss 2eec743
check tests can run
Steboss d809b67
Merge branch 'main' into sbosisio/axlearn_improvements
Steboss 27f985d
create script for running 70B model
Steboss fca227e
test on 5 tests to see if mount works now
Steboss 2f671cc
try to ls what's inside output folder
Steboss df34f12
try @aybchan pytest dist on axlearn
Steboss 551bf94
test only the tests
Steboss 89944f1
test again
Steboss 3f78165
try with simple bash to avoid bash conflicts for bad substitution
Steboss dacafd4
check if we can use the pipefail here
Steboss d36550c
do not use bash to run the test suite
Steboss 4673919
do not use bash to run the test suite
Steboss 7a0081b
add explicitly log dir
Steboss bc56ad8
error what
Steboss 705b8f1
test whats wrong
Steboss 144f217
try with shell bash
Steboss 3106e3b
retry
Steboss 5e0ee44
try simple test
Steboss 73eb0bd
try to modify test script
Steboss ea02323
reset test and maybe the bad subs is in the submission step
Steboss f1be512
try with another subs
Steboss 6b951e0
try to use a sh-like approach in the k8s action
Steboss 43c9382
change to posix shell type
Steboss 0cb48f4
do we need parallelism
Steboss 73edb7b
try to fix the mps
Steboss d08a53b
do we really need it
Steboss c30296a
regardless parallelism test
Steboss 4fd33d3
add echo
Steboss 275ed81
add some logs
Steboss 8e4f17e
try to modify the instructions for polling and set a 2 hours poll
Steboss a63fd82
check fail
Steboss 2f345c8
try to simplify the approach
Steboss d7c55c5
try a new build
Steboss 6d1ae2c
fix step
Steboss 935e72b
start creating also a precommit file
Steboss d790590
fix the pre-commit so it avoids running on rosetta
Steboss e95e090
Fix the workers and gpus needed
Steboss b4f9fef
Update .github/eks-workflow-files/axlearn/axlearn-job.yml
Steboss 5857445
@olupton comments fix
Steboss 0c9c515
Merge branch 'main' into sbosisio/axlearn_improvements
Steboss 2a774a0
try to extend the timeout for nccl
Steboss 05c77a8
Merge branch 'sbosisio/axlearn_improvements' of github.com:NVIDIA/JAX…
Steboss 4600cbb
@olupton comments fix
Steboss 8764fa5
update README file
Steboss 2521a4f
Merge branch 'main' into sbosisio/axlearn_improvements
Steboss ba6f277
@olupton comments
Steboss 535a018
fix readme
Steboss cec2cf0
retrieve metrics directly in script
Steboss 9b29514
fix to yaml
Steboss 72e05a8
fix script to retrieve output
Steboss 8f40ae5
fix fuji script for metrics retrieval
Steboss dabc2cb
fix to world_size
Steboss 2cba865
Merge branch 'main' into sbosisio/axlearn_improvements
Steboss 23cc9bc
Merge branch 'main' into sbosisio/axlearn_improvements
Steboss 4fc790d
Merge branch 'main' into sbosisio/axlearn_improvements
Steboss c47a318
test on personal branch to see if axlearn works
Steboss 8fa9182
Merge branch 'sbosisio/axlearn_improvements' of github.com:NVIDIA/JAX…
Steboss 9c2e71a
typo in path
Steboss e755827
test on tflops
Steboss aa654f9
add vocab for fuji3b
Steboss 851c6a8
Merge branch 'main' into sbosisio/axlearn_improvements
Steboss b803deb
compute tflops as an output metric
Steboss 7c783a5
Merge branch 'sbosisio/axlearn_improvements' of github.com:NVIDIA/JAX…
Steboss b8f0acd
add parameters for fuji-1B-v3-flash so we can have a smoke test
Steboss bcbc8c9
fix time array
Steboss 0d93cd6
fix error in model
Steboss 4fd722b
Merge branch 'main' into sbosisio/axlearn_improvements
Steboss 98b4afd
pin tensorflow version
Steboss 844fd2d
pin tensorflow and check finalize output
Steboss 64cd87c
modify the fuji script, so we can be exactly overlap on matext even i…
Steboss a40a8ac
Merge branch 'main' into sbosisio/axlearn_improvements
Steboss 49ff802
Fix script for running perf tests
Steboss 83b0445
Merge branch 'sbosisio/axlearn_improvements' of github.com:NVIDIA/JAX…
Steboss