Add a parameter that controls the number of StreamLoad tasks committed per partition #92#99
Add a parameter that controls the number of StreamLoad tasks committed per partition #92#99baishaoisde wants to merge 10 commits intoapache:masterfrom
Conversation
…区只能执行一次StreamLoad,即将一个分区的所有数据通过一个StreamLoad任务写入Doris。避免task任务失败重试时对同样的数据提交多次StreamLoad任务。
|
Thank you for your contribution, can you resolve the conflict? |
|
|
||
| boolean DORIS_SINK_TASK_USE_REPARTITION_DEFAULT = false; | ||
|
|
||
| //设置每个分区仅提交一个StreamLoad任务,以保证任务失败时task重试不会导致对同一批数据重复提交StreamLoad任务。 |
There was a problem hiding this comment.
Ok, I will change it to English.
| flush | ||
| } | ||
| }) | ||
| flush |
There was a problem hiding this comment.
What does this flush deal with?
There was a problem hiding this comment.
This flush is to allow each partition to commit only one StreamLoad task when the partitionTaskAtomicity parameter is true, but seems redundant because the code already has the "if (! rowArray.isEmpty) {flush} "operation.
|
If the data is processed according to the partition, when a single partition is particularly large, there may be problems in a streamload? |
Yes, for this problem, we need to prompt the user in the parameter description that "reparation needs to be used to ensure that the data volume per partition is in a reasonable range after this parameter is enabled". Do you think it is appropriate? |
* use writer to write data * resolve conflicts * unify jackson version * remove useless code
…ults to false. If false, StreamLoad can be performed multiple times per partition. If true, limit StreamLoad to one time per partition, that is, all the data of a partition is written to Doris via a StreamLoad task. Avoid submitting multiple StreamLoad tasks with the same data when the task fails to retry.
# Conflicts: # spark-doris-connector/src/main/scala/org/apache/doris/spark/sql/DorisSourceProvider.scala # spark-doris-connector/src/main/scala/org/apache/doris/spark/sql/DorisStreamLoadSink.scala
Add a parameter that controls the number of StreamLoad tasks committed per partition
Issue Number: close #92