## Description

Benchmark Failure: `eurac_pv_farm_detection`

- **Scenario ID:** eurac_pv_farm_detection
- **Backend System:** openeofed.dataspace.copernicus.eu
- **Failure Count:** 1
- **Timestamp:** 2025-05-15 06:58:40

**Links:**
- Workflow Run: https://github.com/ESA-APEx/apex_algorithms/actions/runs/15038544403
- Scenario Definition: https://github.com/ESA-APEx/apex_algorithms/blob/7ffaeda6f7dde0225e2e4cd02cc37b75223ec2c1/algorithm_catalog/eurac/eurac_pv_farm_detection/benchmark_scenarios/eurac_pv_farm_detection.json
- Artifacts: https://github.com/ESA-APEx/apex_algorithms/actions/runs/15038544403#artifacts
## Contact Information

Point of Contact:

| Name | Organization | Contact |
|---|---|---|
| Michele Claus | Eurac Research | Contact via EURAC (EURAC Website, GitHub) |
## Process Graph

```json
{
"pvfarm": {
"process_id": "eurac_pv_farm_detection",
"namespace": "https://raw.githubusercontent.com/ESA-APEx/apex_algorithms/refs/heads/main/algorithm_catalog/eurac/eurac_pv_farm_detection/openeo_udp/eurac_pv_farm_detection.json",
"arguments": {
"bbox": {
"east": 16.414,
"north": 48.008,
"south": 47.962,
"west": 16.342
},
"temporal_extent": [
"2023-05-01",
"2023-09-30"
]
},
"result": true
}
}
```
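For local reproduction, here is a minimal sketch using the openeo Python client that submits the same process graph as a batch job, mirroring what the benchmark test does in the traceback below. Interactive OIDC login is assumed here; the CI run authenticates via client credentials (see the captured log call at the end).

```python
# Minimal reproduction sketch (assumptions: interactive OIDC login is available
# and the openeo Python client is installed).
import openeo

connection = openeo.connect("openeofed.dataspace.copernicus.eu").authenticate_oidc()

# Process graph copied verbatim from the scenario definition above.
process_graph = {
    "pvfarm": {
        "process_id": "eurac_pv_farm_detection",
        "namespace": "https://raw.githubusercontent.com/ESA-APEx/apex_algorithms/refs/heads/main/algorithm_catalog/eurac/eurac_pv_farm_detection/openeo_udp/eurac_pv_farm_detection.json",
        "arguments": {
            "bbox": {"east": 16.414, "north": 48.008, "south": 47.962, "west": 16.342},
            "temporal_extent": ["2023-05-01", "2023-09-30"],
        },
        "result": True,
    }
}

job = connection.create_job(
    process_graph=process_graph,
    title="APEx benchmark eurac_pv_farm_detection",
)
job.start_and_wait()  # raises openeo.rest.JobFailedException on failure, as below
```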
## Error Logs

```text
scenario = BenchmarkScenario(id='eurac_pv_farm_detection', description='ML photovoltaic farm detection, developed by EURAC', back...al_extent': ['2023-05-01', '2023-09-30']}, 'result': True}}, job_options=None, reference_data={}, reference_options={})
connection_factory = <function connection_factory.<locals>.get_connection at 0x7f9a8ef691c0>
tmp_path = PosixPath('/home/runner/work/apex_algorithms/apex_algorithms/qa/benchmarks/tmp_path_root/test_run_benchmark_eurac_pv_fa0')
track_metric = <function track_metric.<locals>.append at 0x7f9a8ef694e0>
upload_assets_on_fail = <function upload_assets_on_fail.<locals>.collect at 0x7f9a8ef69260>
request = <FixtureRequest for <Function test_run_benchmark[eurac_pv_farm_detection]>>
@pytest.mark.parametrize(
"scenario",
[
# Use scenario id as parameterization id to give nicer test names.
pytest.param(uc, id=uc.id)
for uc in get_benchmark_scenarios()
],
)
def test_run_benchmark(
scenario: BenchmarkScenario,
connection_factory,
tmp_path: Path,
track_metric,
upload_assets_on_fail,
request
):
track_metric("scenario_id", scenario.id)
# Check if a backend override has been provided via cli options.
override_backend = request.config.getoption("--override-backend")
backend_filter = request.config.getoption("--backend-filter")
if backend_filter and not re.match(backend_filter, scenario.backend):
#TODO apply filter during scenario retrieval, but seems to be hard to retrieve cli param
pytest.skip(f"skipping scenario {scenario.id} because backend {scenario.backend} does not match filter {backend_filter!r}")
backend = scenario.backend
if override_backend:
_log.info(f"Overriding backend URL with {override_backend!r}")
backend = override_backend
connection: openeo.Connection = connection_factory(url=backend)
# TODO #14 scenario option to use synchronous instead of batch job mode?
job = connection.create_job(
process_graph=scenario.process_graph,
title=f"APEx benchmark {scenario.id}",
additional=scenario.job_options,
)
track_metric("job_id", job.job_id)
# TODO: monitor timing and progress
# TODO: abort excessively long batch jobs? https://github.com/Open-EO/openeo-python-client/issues/589
> job.start_and_wait()
tests/test_benchmarks.py:61:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <BatchJob job_id='cdse-j-250515065245452b8d981f749124b135'>
print = <built-in function print>, max_poll_interval = 60
connection_retry_interval = 30, soft_error_max = 10, show_error_logs = True
def start_and_wait(
self,
print=print,
max_poll_interval: float = DEFAULT_JOB_STATUS_POLL_INTERVAL_MAX,
connection_retry_interval: float = DEFAULT_JOB_STATUS_POLL_CONNECTION_RETRY_INTERVAL,
soft_error_max: int = DEFAULT_JOB_STATUS_POLL_SOFT_ERROR_MAX,
show_error_logs: bool = True,
) -> BatchJob:
"""
Start the batch job, poll its status and wait till it finishes (or fails)
:param print: print/logging function to show progress/status
:param max_poll_interval: maximum number of seconds to sleep between job status polls
:param connection_retry_interval: how long to wait when status poll failed due to connection issue
:param soft_error_max: maximum number of soft errors (e.g. temporary connection glitches) to allow
:param show_error_logs: whether to automatically print error logs when the batch job failed.
:return: Handle to the job created at the backend.
.. versionchanged:: 0.37.0
Added argument ``show_error_logs``.
"""
# TODO rename `connection_retry_interval` to something more generic?
start_time = time.time()
def elapsed() -> str:
return str(datetime.timedelta(seconds=time.time() - start_time)).rsplit(".")[0]
def print_status(msg: str):
print("{t} Job {i!r}: {m}".format(t=elapsed(), i=self.job_id, m=msg))
# TODO: make `max_poll_interval`, `connection_retry_interval` class constants or instance properties?
print_status("send 'start'")
self.start()
# TODO: also add `wait` method so you can track a job that already has started explicitly
# or just rename this method to `wait` and automatically do start if not started yet?
# Start with fast polling.
poll_interval = min(5, max_poll_interval)
status = None
_soft_error_count = 0
def soft_error(message: str):
"""Non breaking error (unless we had too much of them)"""
nonlocal _soft_error_count
_soft_error_count += 1
if _soft_error_count > soft_error_max:
raise OpenEoClientException("Excessive soft errors")
print_status(message)
time.sleep(connection_retry_interval)
while True:
# TODO: also allow a hard time limit on this infinite poll loop?
try:
job_info = self.describe()
except requests.ConnectionError as e:
soft_error("Connection error while polling job status: {e}".format(e=e))
continue
except OpenEoApiPlainError as e:
if e.http_status_code in [502, 503]:
soft_error("Service availability error while polling job status: {e}".format(e=e))
continue
else:
raise
status = job_info.get("status", "N/A")
progress = job_info.get("progress")
if isinstance(progress, int):
progress = f"{progress:d}%"
elif isinstance(progress, float):
progress = f"{progress:.1f}%"
else:
progress = "N/A"
print_status(f"{status} (progress {progress})")
if status not in ('submitted', 'created', 'queued', 'running'):
break
# Sleep for next poll (and adaptively make polling less frequent)
time.sleep(poll_interval)
poll_interval = min(1.25 * poll_interval, max_poll_interval)
if status != "finished":
# TODO: render logs jupyter-aware in a notebook context?
if show_error_logs:
print(f"Your batch job {self.job_id!r} failed. Error logs:")
print(self.logs(level=logging.ERROR))
print(
f"Full logs can be inspected in an openEO (web) editor or with `connection.job({self.job_id!r}).logs()`."
)
> raise JobFailedException(
f"Batch job {self.job_id!r} didn't finish successfully. Status: {status} (after {elapsed()}).",
job=self,
)
E openeo.rest.JobFailedException: Batch job 'cdse-j-250515065245452b8d981f749124b135' didn't finish successfully. Status: error (after 0:05:50).
/opt/hostedtoolcache/Python/3.12.10/x64/lib/python3.12/site-packages/openeo/rest/job.py:347: JobFailedException
----------------------------- Captured stdout call -----------------------------
0:00:00 Job 'cdse-j-250515065245452b8d981f749124b135': send 'start'
0:00:13 Job 'cdse-j-250515065245452b8d981f749124b135': created (progress 0%)
0:00:19 Job 'cdse-j-250515065245452b8d981f749124b135': created (progress 0%)
0:00:25 Job 'cdse-j-250515065245452b8d981f749124b135': queued (progress 0%)
0:00:33 Job 'cdse-j-250515065245452b8d981f749124b135': queued (progress 0%)
0:00:43 Job 'cdse-j-250515065245452b8d981f749124b135': queued (progress 0%)
0:00:56 Job 'cdse-j-250515065245452b8d981f749124b135': queued (progress 0%)
0:01:12 Job 'cdse-j-250515065245452b8d981f749124b135': queued (progress 0%)
0:01:32 Job 'cdse-j-250515065245452b8d981f749124b135': running (progress N/A)
0:01:56 Job 'cdse-j-250515065245452b8d981f749124b135': running (progress N/A)
0:02:26 Job 'cdse-j-250515065245452b8d981f749124b135': running (progress N/A)
0:03:04 Job 'cdse-j-250515065245452b8d981f749124b135': running (progress N/A)
0:03:51 Job 'cdse-j-250515065245452b8d981f749124b135': running (progress N/A)
0:04:49 Job 'cdse-j-250515065245452b8d981f749124b135': running (progress N/A)
0:05:50 Job 'cdse-j-250515065245452b8d981f749124b135': error (progress N/A)
Your batch job 'cdse-j-250515065245452b8d981f749124b135' failed. Error logs:
[{'id': '[1747292299366, 422133]', 'time': '2025-05-15T06:58:19.366Z', 'level': 'error', 'message': 'Task 1 in stage 52.0 failed 4 times; aborting job'}, {'id': '[1747292299373, 380320]', 'time': '2025-05-15T06:58:19.373Z', 'level': 'error', 'message': 'Stage error: Job aborted due to stage failure: Task 1 in stage 52.0 failed 4 times, most recent failure: Lost task 1.3 in stage 52.0 (TID 2538) (10.42.47.218 executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):\n File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1247, in main\n process()\n File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1239, in process\n serializer.dump_stream(out_iter, outfile)\n File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 146, in dump_stream\n for obj in iterator:\n File "/usr/local/spark/python/lib/pyspark.zip/pyspark/util.py", line 83, in wrapper\n return f(*args, **kwargs)\n File "/opt/openeo/lib/python3.8/site-packages/openeogeotrellis/utils.py", line 66, in memory_logging_wrapper\n return function(*args, **kwargs)\n File "/opt/openeo/lib/python3.8/site-packages/epsel.py", line 44, in wrapper\n return _FUNCTION_POINTERS[key](*args, **kwargs)\n File "/opt/openeo/lib/python3.8/site-packages/epsel.py", line 37, in first_time\n return f(*args, **kwargs)\n File "/opt/openeo/lib/python3.8/site-packages/openeogeotrellis/geopysparkdatacube.py", line 805, in tile_function\n result_data = run_udf_code(code=udf_code, data=data)\n File "/opt/openeo/lib/python3.8/site-packages/epsel.py", line 44, in wrapper\n return _FUNCTION_POINTERS[key](*args, **kwargs)\n File "/opt/openeo/lib/python3.8/site-packages/epsel.py", line 37, in first_time\n return f(*args, **kwargs)\n File "/opt/openeo/lib/python3.8/site-packages/openeogeotrellis/udf.py", line 67, in run_udf_code\n return openeo.udf.run_udf_code(code=code, data=data)\n File "/opt/openeo/lib/python3.8/site-packages/openeo/udf/run_code.py", line 195, in run_udf_code\n result_cube: xarray.DataArray = func(cube=data.get_datacube_list()[0].get_array(), context=data.user_context)\n File "<string>", line 107, in apply_datacube\n File "<string>", line 79, in apply_model\n File "<string>", line 22, in load_onnx_model\n File "onnx_deps/onnxruntime/capi/onnxruntime_inference_collection.py", line 419, in __init__\n self._create_inference_session(providers, provider_options, disabled_optimizers)\n File "onnx_deps/onnxruntime/capi/onnxruntime_inference_collection.py", line 452, in _create_inference_session\n sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)\nonnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from onnx_models/EURAC_pvfarm_rf_1_median_depth_15.onnx failed:system error number 20\n\n\tat org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:572)\n\tat org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:784)\n\tat org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:766)\n\tat org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)\n\tat org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)\n\tat scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)\n\tat scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)\n\tat scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)\n\tat 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)\n\tat org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)\n\tat org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:141)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\n\nDriver stacktrace:'}, {'id': '[1747292301836, 121997]', 'time': '2025-05-15T06:58:21.836Z', 'level': 'error', 'message': 'OpenEO batch job failed: UDF exception while evaluating processing graph. Please check your user defined functions. stacktrace:\n File "<string>", line 107, in apply_datacube\n File "<string>", line 79, in apply_model\n File "<string>", line 22, in load_onnx_model\n File "onnx_deps/onnxruntime/capi/onnxruntime_inference_collection.py", line 419, in __init__\n self._create_inference_session(providers, provider_options, disabled_optimizers)\n File "onnx_deps/onnxruntime/capi/onnxruntime_inference_collection.py", line 452, in _create_inference_session\n sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)\nonnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from onnx_models/EURAC_pvfarm_rf_1_median_depth_15.onnx failed:system error number 20'}]
Full logs can be inspected in an openEO (web) editor or with `connection.job('cdse-j-250515065245452b8d981f749124b135').logs()`.
------------------------------ Captured log call -------------------------------
INFO conftest:conftest.py:131 Connecting to 'openeofed.dataspace.copernicus.eu'
INFO openeo.config:config.py:193 Loaded openEO client config from sources: []
INFO conftest:conftest.py:144 Checking for auth_env_var='OPENEO_AUTH_CLIENT_CREDENTIALS_CDSEFED' to drive auth against url='openeofed.dataspace.copernicus.eu'.
INFO conftest:conftest.py:148 Extracted provider_id='CDSE' client_id='openeo-apex-benchmarks-service-account' from auth_env_var='OPENEO_AUTH_CLIENT_CREDENTIALS_CDSEFED'
INFO openeo.rest.connection:connection.py:232 Found OIDC providers: ['CDSE']
INFO openeo.rest.auth.oidc:oidc.py:404 Doing 'client_credentials' token request 'https://identity.dataspace.copernicus.eu/auth/realms/CDSE/protocol/openid-connect/token' with post data fields ['grant_type', 'client_id', 'client_secret', 'scope'] (client_id 'openeo-apex-benchmarks-service-account')
INFO openeo.rest.connection:connection.py:329 Obtained tokens: ['access_token', 'id_token']
- Generated track_metrics report: report/metrics.json, _ParquetS3StorageSettings(bucket='apex-benchmarks', key='metrics/v1/metrics.parquet') -
-------------------- `upload_assets` stats: {'uploaded': 0} --------------------
- tests/test_benchmarks.py::test_run_benchmark[eurac_pv_farm_detection]:
- Generated html report: file:///home/runner/work/apex_algorithms/apex_algorithms/qa/benchmarks/report/report.html -
```
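The decisive error in the logs above is the UDF's ONNX model load: `Load model from onnx_models/EURAC_pvfarm_rf_1_median_depth_15.onnx failed:system error number 20`. On Linux, errno 20 is `ENOTDIR` ("Not a directory"), which suggests a component of the model path is not a directory on the Spark executor, e.g. the `onnx_models` dependency archive was not unpacked where the UDF expects it. A defensive check along the following lines would turn the opaque onnxruntime failure into an actionable message. This is only a sketch: the function name and default path are taken from the stack trace, while the body is an assumption, not the actual UDF implementation.

```python
# Sketch of a defensive ONNX model loader for the UDF. Only the function name
# and model path come from the stack trace; the rest is illustrative.
from pathlib import Path

import onnxruntime


def load_onnx_model(
    model_path: str = "onnx_models/EURAC_pvfarm_rf_1_median_depth_15.onnx",
) -> onnxruntime.InferenceSession:
    """Load the ONNX random-forest model, failing early with a readable error."""
    path = Path(model_path)
    if not path.is_file():
        # errno 20 (ENOTDIR) typically means a parent component of the path is
        # not a directory, e.g. the archive was not extracted on the worker.
        raise FileNotFoundError(
            f"ONNX model not found at {path.resolve()} "
            f"(cwd={Path.cwd()}, parent_is_dir={path.parent.is_dir()})"
        )
    return onnxruntime.InferenceSession(str(path))
```

If such a check fires, the fix is most likely on the dependency-packaging side (e.g. the job options that ship the `onnx_models` archive to the workers) rather than in the model file itself; note that the scenario was submitted with `job_options=None`.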