Commit 409f079
[pjrt] ensure client destruction on process exit (#1999)
`torch_xla` doesn't call `PJRT_ClientDestroy`, so the devices are never
closed properly.
Recently, this started causing hangs on `n300` boards on subsequent
test runs.
This PR introduces a global singleton object that ensures the client
instance is destroyed on process shutdown. The singleton serves as a
fallback mechanism for frameworks that do not call
`PJRT_ClientDestroy`, as is the case with `torch_xla`.
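The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `FakeClient`, `GlobalClientHolder`, `destroyNow` and `g_destroy_calls` are hypothetical names, and the real client type and destroy entry point are assumed to look different.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical counter so the sketch can observe destruction.
int g_destroy_calls = 0;

// Hypothetical stand-in for the real client instance that owns the devices.
struct FakeClient {
  ~FakeClient() { ++g_destroy_calls; } // the real one would close devices here
};

// Process-wide holder. If the framework never asks for destruction, the
// function-local static below is destroyed during normal process exit and
// releases the client then.
class GlobalClientHolder {
public:
  static GlobalClientHolder &getInstance() {
    static GlobalClientHolder instance; // destroyed at normal process exit
    return instance;
  }

  // Takes ownership of the client when it is created.
  void set(std::unique_ptr<FakeClient> client) {
    assert(client && "client must not be null");
    m_client = std::move(client);
  }

  bool hasClient() const { return m_client != nullptr; }

  // Explicit path, for frameworks that do request destruction.
  // Returns false if there was nothing left to destroy.
  bool destroyNow() {
    if (!m_client) {
      return false;
    }
    m_client.reset();
    return true;
  }

  GlobalClientHolder(const GlobalClientHolder &) = delete;
  GlobalClientHolder &operator=(const GlobalClientHolder &) = delete;

private:
  GlobalClientHolder() = default;
  std::unique_ptr<FakeClient> m_client; // released on exit if still set
};
```

With this shape, an explicit destroy and the exit-time fallback cannot both fire: whichever runs first empties the holder, and the other finds nothing to release.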
Additionally, optimizer sub-meshes are now closed after each
compilation. Keeping them open was previously needed to avoid hangs, but
it now causes them. The mechanism for persisting the optimizer sub-mesh
stays in the code base so that we can experiment with it if needed.
These issues still need a deeper investigation to be fixed properly.
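One way to picture "close after each compilation, but keep the persisting path available" is a scope guard with a switch. The sketch below is an assumption-laden illustration: `ScopedSubmesh`, `compileOnce` and the `persist` flag are hypothetical and do not correspond to names in this change.

```cpp
#include <functional>
#include <utility>

// Hypothetical guard: runs the close callback when the compilation scope
// ends, unless persistence has been requested.
class ScopedSubmesh {
public:
  ScopedSubmesh(std::function<void()> close_fn, bool persist)
      : m_close(std::move(close_fn)), m_persist(persist) {}

  ~ScopedSubmesh() {
    if (!m_persist && m_close) {
      m_close(); // default behaviour: close after each compilation
    }
  }

  ScopedSubmesh(const ScopedSubmesh &) = delete;
  ScopedSubmesh &operator=(const ScopedSubmesh &) = delete;

private:
  std::function<void()> m_close;
  bool m_persist;
};

// Hypothetical compilation step; `close_calls` counts how often the
// sub-mesh was closed so the behaviour can be observed.
void compileOnce(bool persist, int &close_calls) {
  ScopedSubmesh submesh([&close_calls] { ++close_calls; }, persist);
  // ... the optimizer would run here, using the sub-mesh ...
}
```

Calling `compileOnce(false, n)` closes the sub-mesh once per call, while `compileOnce(true, n)` leaves it open, which mirrors keeping the persisting mechanism around for experiments.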
NOTE: this does not solve the case when the process terminates abruptly,
e.g. on `SIGSEGV` (segmentation fault). Ideally, that case would be
fixed on the `tt-metal` side.
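The limitation in the note follows from how C++ handles static objects: their destructors run on a normal `std::exit`, but not when the process is killed by a signal's default action. The POSIX-only sketch below illustrates this with a forked child; `CleanupProbe` and `cleanupRanInChild` are hypothetical names introduced for the example.

```cpp
#include <csignal>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

namespace {
int g_report_fd = -1; // pipe end the child uses to report cleanup

// Writes one marker byte when destroyed, standing in for device cleanup.
struct CleanupProbe {
  ~CleanupProbe() {
    if (g_report_fd >= 0) {
      const char marker = 'D';
      ssize_t ignored = write(g_report_fd, &marker, 1);
      (void)ignored;
    }
  }
};
} // namespace

// Forks a child that exits normally or dies from SIGSEGV, and returns
// true if the child's static destructor ran before it terminated.
bool cleanupRanInChild(bool abrupt) {
  int fds[2];
  if (pipe(fds) != 0) {
    return false;
  }
  pid_t pid = fork();
  if (pid < 0) {
    close(fds[0]);
    close(fds[1]);
    return false;
  }
  if (pid == 0) {
    close(fds[0]);
    g_report_fd = fds[1];
    static CleanupProbe probe; // destructor registered for normal exit
    if (abrupt) {
      std::signal(SIGSEGV, SIG_DFL);
      raise(SIGSEGV); // default action kills the process immediately
    }
    std::exit(0); // normal exit: static destructors run
  }
  close(fds[1]);
  char buf = 0;
  ssize_t n = read(fds[0], &buf, 1); // EOF (0) if the child never wrote
  close(fds[0]);
  int status = 0;
  waitpid(pid, &status, 0);
  return n == 1 && buf == 'D';
}
```

`cleanupRanInChild(false)` observes the marker, while `cleanupRanInChild(true)` does not, which is why a singleton-based fallback cannot cover crashes on its own.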
Closes #18241
parent 21df083, commit 409f079
File tree
3 files changed, +86 -10 lines changed
- pjrt_implementation
- inc/api
- src/api
- module_builder
Diff content was not captured. Per-file changes, in page order: +29 -0, +51 -9, +6 -1.