Description
Below is the error message from a run of main_lightning.py:
Failure # 1 (occurred at 2021-05-23_21-45-03)
Traceback (most recent call last):
File "C:\Users\addalin.conda\envs\lidar\lib\site-packages\ray\tune\trial_runner.py", line 880, in _process_trial_save
results = self.trial_executor.fetch_result(trial)
File "C:\Users\addalin.conda\envs\lidar\lib\site-packages\ray\tune\ray_trial_executor.py", line 686, in fetch_result
result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
File "C:\Users\addalin.conda\envs\lidar\lib\site-packages\ray_private\client_mode_hook.py", line 47, in wrapper
return func(*args, **kwargs)
File "C:\Users\addalin.conda\envs\lidar\lib\site-packages\ray\worker.py", line 1481, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OSError): ray::ImplicitFunc.save() (pid=22632, ip=132.68.58.209)
File "python\ray_raylet.pyx", line 505, in ray._raylet.execute_task
File "python\ray_raylet.pyx", line 449, in ray._raylet.execute_task.function_executor
File "C:\Users\addalin.conda\envs\lidar\lib\site-packages\ray_private\function_manager.py", line 556, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "C:\Users\addalin.conda\envs\lidar\lib\site-packages\ray\tune\function_runner.py", line 434, in save
checkpoint_path = TrainableUtil.process_checkpoint(
File "C:\Users\addalin.conda\envs\lidar\lib\site-packages\ray\tune\utils\trainable.py", line 46, in process_checkpoint
with open(checkpoint_path + ".tune_metadata", "wb") as f:
OSError: [Errno 22] Invalid argument: 'C:\Users\addalin\Dropbox\Lidar\lidar_learning\results\main_2021-05-23_19-35-00\main_5831d016_3_bsize=32,dfilter=None,dnorm=False,fc_size=[32],hsizes=[4, 4, 4, 4],lr=0.001,ltype=MAELoss,source=signal_p,use_bg=F_2021-05-23_21-28-18\checkpoint_epoch=3-step=703\.tune_metadata'
This is odd, since the failure occurred in the last epoch. The same error also shows up in other experiments.
Resuming the run with resume='ERRORED_ONLY' fixes it.
But why does it happen in the first place?
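One thing that may be worth checking (an assumption, not a confirmed cause): the checkpoint path in the OSError above appears to be longer than the classic Windows path limit of 260 characters, because the trial directory name encodes every hyperparameter. The sketch below checks a path against that limit and defines a shorter directory name. `exceeds_max_path` and `short_trial_dirname` are hypothetical helpers written for illustration. Passing the latter to `tune.run` through `trial_dirname_creator` assumes the installed Ray version supports that argument.

```python
from types import SimpleNamespace

# Classic Win32 path limit; it includes the terminating NUL character.
WINDOWS_MAX_PATH = 260


def exceeds_max_path(path: str, limit: int = WINDOWS_MAX_PATH) -> bool:
    """Return True if the path is too long for the classic Win32 limit."""
    # The limit counts the terminating NUL, so only limit - 1 characters fit.
    return len(path) >= limit


def short_trial_dirname(trial) -> str:
    """Hypothetical helper: name the trial directory by its trial ID only."""
    return f"trial_{trial.trial_id}"


if __name__ == "__main__":
    # Stand-in for a Ray Tune trial object; only trial_id is used here.
    fake_trial = SimpleNamespace(trial_id="5831d016")
    print(short_trial_dirname(fake_trial))
    # Assumed usage (not verified against this Ray version):
    # tune.run(..., trial_dirname_creator=short_trial_dirname)
```

If the length check returns True for the failing checkpoint path, a shorter results directory or shorter trial directory names may avoid the error. This would not by itself explain why resuming with 'ERRORED_ONLY' succeeds.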