I am running a Harness server using docker compose. Here are the steps to recreate the issue:
- Set up an engine with the following engine config:
{
  "engineId": "test_ur",
  "engineFactory": "com.actionml.engines.ur.UREngine",
  "sparkConf": {
    "master": "local",
    "spark.driver.memory": "3g",
    "spark.executor.memory": "1g",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer": "300m",
    "spark.es.index.auto.create": "true",
    "spark.es.nodes": "localhost",
    "es.nodes": "localhost",
    "spark.es.nodes.wan.only": "true",
    "es.nodes.wan.only": "true"
  },
  "algorithm": {
    "indicators": [
      {
        "name": "purchase"
      },
      {
        "name": "view"
      },
      {
        "name": "category-pref"
      }
    ],
    "num": 4
  }
}
- Add some data indicator events for testing.
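For reference, a single test event might look like the request below (this assumes Harness's standard events endpoint; the user and item IDs are made up, and the event name must match one of the configured indicators):
POST http://localhost:9090/engines/test_ur/events HTTP/1.1
Content-Type: application/json

{
  "event": "purchase",
  "entityType": "user",
  "entityId": "user-1",
  "targetEntityType": "item",
  "targetEntityId": "item-1",
  "eventTime": "2020-04-27T19:00:00.000Z"
}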
- Run a training job using:
POST http://localhost:9090/engines/test_ur/jobs HTTP/1.1
Content-Type: application/json
- You will get a response similar to the following:
{
  "description": {
    "jobId": "a6029311-ebb0-4120-90c9-fb40b1934264",
    "status": {
      "name": "queued"
    },
    "comment": "Spark job",
    "createdAt": "2020-04-27T20:09:51.488Z"
  },
  "comment": "Started train Job on Spark"
}
- After some time, make the following request:
GET http://localhost:9090/engines/test_ur HTTP/1.1
Content-Type: application/json
- The response will contain a jobStatuses section similar to the following:
"jobStatuses": [
{
"jobId": "a6029311-ebb0-4120-90c9-fb40b1934264",
"status": {
"name": "successful"
},
"comment": "Spark job",
"createdAt": "2020-04-27T20:09:51.488Z",
"completedAt": "2020-04-27T20:10:08.992Z"
}
]- Look at the last 500 lines in the harness log you will see the following messages:
harness | 20:10:08.973 INFO HttpMethodDirector - Retrying request
harness | 20:10:08.974 ERROR NetworkClient - Node [localhost:9200] failed (java.net.ConnectException: Connection refused (Connection refused)); no other nodes left - aborting...
harness | 20:10:08.981 ERROR URAlgorithm - Spark computation failed for engine test_ur with params {{"engineId":"test_ur","engineFactory":"com.actionml.engines.ur.UREngine","sparkConf":{"master":"local","spark.driver.memory":"3g","spark.executor.memory":"1g","spark.serializer":"org.apache.spark.serializer.KryoSerializer","spark.kryo.registrator":"org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator","spark.kryo.referenceTracking":"false","spark.kryoserializer.buffer":"300m","spark.es.index.auto.create":"true","spark.es.nodes":"localhost","es.nodes":"localhost","spark.es.nodes.wan.only":"true","es.nodes.wan.only":"true"},"algorithm":{"indicators":[{"name":"purchase"},{"name":"view"},{"name":"category-pref"}],"num":4}}}
harness | org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
harness | at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:340)
harness | at org.elasticsearch.spark.rdd.EsSpark$.doSaveToEs(EsSpark.scala:104)
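As an aside, the connection refusal is consistent with running under docker compose: inside the harness container, localhost refers to the container itself, not to the Elasticsearch container. A minimal sketch of the likely fix, assuming the Elasticsearch service in docker-compose.yml is named elasticsearch (substitute your actual service name):
"sparkConf": {
  ...
  "spark.es.nodes": "elasticsearch",
  "es.nodes": "elasticsearch",
  ...
}
Either way, the job-status handling below is the real concern, since the failure is never surfaced.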
- In the logs, just below the error message, you will also notice the following:
harness | 20:10:08.990 INFO JobManager$ - Job a6029311-ebb0-4120-90c9-fb40b1934264 marked as failed
harness | 20:10:08.992 INFO SparkContextSupport$ - Job a6029311-ebb0-4120-90c9-fb40b1934264 completed in 1588018208990 ms [engine test_ur]
harness | 20:10:08.995 INFO JobManager$ - Job a6029311-ebb0-4120-90c9-fb40b1934264 completed successfully
harness | 20:10:09.004 INFO AbstractConnector - Stopped Spark@587618d3{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
harness | 20:10:09.014 INFO SparkUI - Stopped Spark web UI at http://7b946919f4f5:4040
- We can see conflicting messages for the same job ID: JobManager first marks the job as failed, then logs that it completed successfully, and the GET response above reports it as "successful". (The reported duration of 1588018208990 ms also looks suspect; it appears to be the completion time as a Unix epoch timestamp rather than an elapsed time.)
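Given that the Spark computation failed, the expected behavior would be for the GET request above to report the failure, something like the fragment below (the exact status name is an assumption; the log only says the job was "marked as failed"):
"jobStatuses": [
  {
    "jobId": "a6029311-ebb0-4120-90c9-fb40b1934264",
    "status": {
      "name": "failed"
    },
    ...
  }
]
Instead the API reports "successful", so anything polling the job status has no way to tell that training never completed.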