This repository was archived by the owner on Mar 10, 2025. It is now read-only.

Configuration references

Khoa Dang edited this page Aug 29, 2017 · 16 revisions

Below is a description of the available configurations of the CosmosDB Spark Connector. Depending on the scenario, different configurations should be used to optimize performance and throughput.

Note that configuration keys are case-insensitive and, for now, configuration values are always strings.
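
As a minimal sketch of what this implies for calling code, the helper below lower-cases keys and renders every value as a string before the options are handed to the connector. The function name `normalize_config` is hypothetical and not part of the connector; rendering booleans as lowercase "true"/"false" is an assumption chosen to match the boolean strings used on this page.

```python
def normalize_config(options):
    """Return a copy of `options` with lower-cased keys and string values.

    Hypothetical helper, not part of the connector. Booleans are rendered
    as "true"/"false" (an assumption); everything else goes through str().
    """
    normalized = {}
    for key, value in options.items():
        if isinstance(value, bool):
            text = "true" if value else "false"
        else:
            text = str(value)
        normalized[key.lower()] = text
    return normalized


# Mixed-case keys and non-string values are accepted as input.
config = normalize_config(
    {"Upsert": True, "WritingBatchSize": 500, "query_pagesize": 1000}
)
print(config)
```

Because keys are case-insensitive, `Upsert` and `upsert` refer to the same setting; normalizing once up front avoids passing the same option twice under different spellings.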

Reading CosmosDB collection

Many of the configurations below are passed on to the Java SDK when fetching data from the CosmosDB collection.

  • query_maxretryattemptsonthrottledrequests: sets the maximum number of retries in the case where the request fails because the Azure CosmosDB database service has applied rate limiting on the client. If not specified, the default value is 9.
  • query_maxretrywaittimeinseconds: sets the maximum retry time in seconds. By default, it is 30 seconds.
  • query_maxdegreeofparallelism: sets the number of concurrent operations run client side during parallel query execution in the Azure DocumentDB database service. A positive property value limits the number of concurrent operations to the set value. If it is set to less than 0, the system automatically decides the number of concurrent operations to run. As the Connector maps each collection partition to an executor, this value has no effect on the reading operation.
  • query_maxbuffereditemcount: sets the maximum number of items that can be buffered client side during parallel query execution in the Azure DocumentDB database service. A positive property value limits the number of buffered items to the set value. If it is set to less than 0, the system automatically decides the number of items to buffer.
  • query_enablescan: sets the option to enable scans on the queries which couldn't be served as indexing was opted out on the requested paths in the Azure DocumentDB database service.
  • query_disableruperminuteusage: disables the use of Request Units (RUs)/minute capacity to serve the query when the regular provisioned RUs/second are exhausted.
  • query_emitverbosetraces: sets the option to allow queries to emit verbose traces for investigation.
  • query_pagesize: sets the size of the query result page for each query request. To optimize for throughput, use a large page size to reduce the number of round trips needed to fetch query results.
  • query_custom: sets the CosmosDB query to override the default query when fetching data from CosmosDB. Note that when this is provided, it is also used in place of any query with pushed-down predicates.
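
The read options above could be assembled as in the following sketch. The helper `build_read_config` and its parameter names are hypothetical; only the configuration keys come from this page. Options left unset are omitted so that the connector's defaults (9 retries, 30 seconds) apply. The commented-out Spark call assumes the connector's data source name and that connection settings, which this page does not cover, are supplied separately.

```python
def build_read_config(page_size=None, custom_query=None,
                      max_retries=None, max_retry_wait_seconds=None):
    """Build read options as strings; unset options fall back to defaults.

    Hypothetical helper; only the keys are taken from this page.
    """
    config = {}
    if page_size is not None:
        config["query_pagesize"] = str(page_size)
    if custom_query is not None:
        # Replaces the default query, including any pushed-down predicates.
        config["query_custom"] = custom_query
    if max_retries is not None:
        config["query_maxretryattemptsonthrottledrequests"] = str(max_retries)
    if max_retry_wait_seconds is not None:
        config["query_maxretrywaittimeinseconds"] = str(max_retry_wait_seconds)
    return config


# A throughput-oriented read: a large page size and a narrow custom query.
read_config = build_read_config(
    page_size=200000,
    custom_query="SELECT c.id, c.city FROM c WHERE c.city = 'Seattle'",
)
print(read_config)

# Assumed usage with Spark (connection settings are not covered on this page):
# df = (spark.read.format("com.microsoft.azure.cosmosdb.spark")
#       .options(**connection_settings, **read_config)
#       .load())
```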

Reading CosmosDB collection change feed

  • readchangefeed: indicates that the collection content is fetched from CosmosDB Change Feed. The default value is false.
  • changefeedqueryname: a custom string to identify the query. The connector keeps track of the collection continuation tokens for different change feed queries separately. If readchangefeed is true, this configuration is required and cannot be empty.
  • rollingchangefeed: a boolean value indicating whether the change feed should start from the last query. The default value is false, which means the changes will be counted from the first read of the collection.
  • changefeedusenexttoken: a boolean value to support processing failure scenarios. It is used to indicate that the current change feed batch has been handled gracefully and the RDD should use the next continuation tokens to get the subsequent batch of changes.
  • changefeedcheckpointlocation: a path to local file storage to persist continuation tokens in case of node failures. This configuration is optional.
  • changefeedstartfromthebeginning: sets whether the change feed should start from the beginning (true) or from the current point (false). The default value is false.
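
A change feed read could be configured along these lines. This is a sketch only: `build_change_feed_config` and its parameters are hypothetical names, and the check simply mirrors the rule above that changefeedqueryname must be non-empty when readchangefeed is true.

```python
def _bool_string(value):
    """Render a Python bool as the lowercase string form used on this page."""
    return "true" if value else "false"


def build_change_feed_config(query_name, start_from_beginning=False,
                             rolling=False, checkpoint_location=None):
    """Build change feed options as strings (hypothetical helper)."""
    if not query_name or not query_name.strip():
        raise ValueError(
            "changefeedqueryname is required and cannot be empty "
            "when readchangefeed is true"
        )
    config = {
        "readchangefeed": "true",
        "changefeedqueryname": query_name,
        "changefeedstartfromthebeginning": _bool_string(start_from_beginning),
        "rollingchangefeed": _bool_string(rolling),
    }
    if checkpoint_location is not None:
        # Optional: where continuation tokens are persisted.
        config["changefeedcheckpointlocation"] = checkpoint_location
    return config


# Example: read every change since the start of the collection, then
# continue from the last query on each subsequent read.
feed_config = build_change_feed_config(
    "orders-audit",                      # illustrative query name
    start_from_beginning=True,
    rolling=True,
    checkpoint_location="/tmp/orders-audit-checkpoint",  # illustrative path
)
print(feed_config)
```

Using a distinct changefeedqueryname per logical consumer keeps their continuation tokens separate, so one consumer advancing through the feed does not affect another.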

Writing to CosmosDB

  • WritingBatchSize: an integer string indicating the batch size to use when writing to a CosmosDB collection. The connector sends createDocument/upsertDocument requests asynchronously in batches. The larger the batch size, the more throughput can be achieved, as long as cluster resources are available. Conversely, specify a smaller batch size to limit the write rate and RU consumption. The default writing batch size is 500.
  • Upsert: a boolean value string indicating whether upsertDocument should be used instead of createDocument when writing to a CosmosDB collection.
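
The trade-off between batch size and RU consumption can be made explicit in code. The sketch below is illustrative: `build_write_config` is a hypothetical helper, and rejecting non-positive batch sizes is an assumption rather than documented connector behavior.

```python
DEFAULT_WRITING_BATCH_SIZE = 500  # default stated on this page


def build_write_config(batch_size=DEFAULT_WRITING_BATCH_SIZE, upsert=False):
    """Build write options as strings (hypothetical helper)."""
    if batch_size <= 0:
        # Assumption: a batch must contain at least one document.
        raise ValueError("batch_size must be a positive integer")
    return {
        "WritingBatchSize": str(batch_size),
        "Upsert": "true" if upsert else "false",
    }


# A rate-limited writer: smaller batches reduce RU consumption, and
# upsertDocument is used instead of createDocument.
write_config = build_write_config(batch_size=100, upsert=True)
print(write_config)
```

When the cluster and the collection's provisioned throughput have headroom, raising `batch_size` above the default is the knob to try first for higher write throughput.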