Configuration Options¶
Various configuration options are available for the MongoDB Spark Connector.
Specify Configuration¶
Via SparkConf¶
You can specify these options via SparkConf using the --conf setting or the $SPARK_HOME/conf/spark-defaults.conf file, and the MongoDB Spark Connector will use the settings in SparkConf as the defaults.
Important
When setting configurations via SparkConf, you must prefix the configuration options. Refer to the configuration sections for the specific prefix.
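As a sketch, the prefixed settings can be passed on the command line when launching a Spark shell; the URI, database, and collection names here are placeholders:

```shell
# Supply connector defaults via --conf; note the spark.mongodb.input./output. prefixes.
./bin/spark-shell \
  --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.myCollection" \
  --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection"
```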
Via ReadConfig and WriteConfig¶
Various methods in the MongoDB Connector API accept an optional ReadConfig or a WriteConfig object. ReadConfig and WriteConfig settings override any corresponding settings in SparkConf.
For examples, see Specify ReadConfig and Specify WriteConfig. For more details, refer to the source for these methods.
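As a minimal sketch, assuming a running Spark context (sc) with connector defaults already set in SparkConf, a ReadConfig can override those defaults for a single read; the collection name and read preference here are illustrative:

```scala
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig

// Start from the context's SparkConf defaults, then override the
// collection and read preference for this read only.
val readConfig = ReadConfig(
  Map("collection" -> "spark", "readPreference.name" -> "secondaryPreferred"),
  Some(ReadConfig(sc)))

val customRdd = MongoSpark.load(sc, readConfig)
```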
Via Options Map¶
In the Spark API, some methods (e.g. DataFrameReader and DataFrameWriter) accept options in the form of a Map[String, String]. You can convert custom ReadConfig or WriteConfig settings into a Map via the asOptions() method.
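For instance (a sketch assuming a SparkSession named spark and placeholder database and collection names), a ReadConfig can be converted to an options map and handed to a DataFrameReader:

```scala
import com.mongodb.spark.config.ReadConfig

// Build a ReadConfig explicitly, then convert it to a Map[String, String]
// suitable for DataFrameReader.options(...).
val readConfig = ReadConfig(Map(
  "uri"        -> "mongodb://127.0.0.1/",
  "database"   -> "test",
  "collection" -> "myCollection"))

val df = spark.read
  .format("com.mongodb.spark.sql")
  .options(readConfig.asOptions)
  .load()
```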
Via System Property¶
The connector provides a cache for MongoClients which can only be configured via a System Property. See Cache Configuration.
Input Configuration¶
The following options for reading from MongoDB are available:
Note
If setting these connector input configurations via SparkConf, prefix these settings with spark.mongodb.input.
| Property name | Description |
|---|---|
| uri | Required. The connection string of the form mongodb://host:port/. The other remaining input options may be appended to the uri setting. |
| database | Required. The database name from which to read data. |
| collection | Required. The collection name from which to read data. |
| localThreshold | The threshold (in milliseconds) for choosing a server from multiple MongoDB servers. Default: 15 ms |
| readPreference.name | The Read Preference to use. Default: Primary |
| readPreference.tagSets | The ReadPreference TagSets to use. |
| readConcern.level | The Read Concern level to use. |
| sampleSize | The sample size to use when inferring the schema. Default: 1000 |
| partitioner | The class name of the partitioner to use to partition the data. The connector provides the following partitioners: MongoDefaultPartitioner, MongoSamplePartitioner, MongoShardedPartitioner, MongoSplitVectorPartitioner, MongoPaginateByCountPartitioner, and MongoPaginateBySizePartitioner. In addition to the provided partitioners, you can also specify a custom partitioner implementation; for custom implementations of the MongoPartitioner trait, provide the full class name. To configure options for the various partitioners, see Partitioner Configuration. Default: MongoDefaultPartitioner |
Partitioner Configuration¶
MongoSamplePartitioner Configuration¶
Note
If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions.
| Property name | Description |
|---|---|
| partitionKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
| partitionSizeMB | The size (in MB) for each partition. Default: 64 MB |
| samplesPerPartition | The number of sample documents to take for each partition. Default: 10 |
MongoShardedPartitioner Configuration¶
Note
If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions.
| Property name | Description |
|---|---|
| shardkey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
MongoSplitVectorPartitioner Configuration¶
Note
If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions.
| Property name | Description |
|---|---|
| partitionKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
| partitionSizeMB | The size (in MB) for each partition. Default: 64 MB |
MongoPaginateByCountPartitioner Configuration¶
Note
If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions.
| Property name | Description |
|---|---|
| partitionKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
| numberOfPartitions | The number of partitions to create. Default: 64 |
MongoPaginateBySizePartitioner Configuration¶
Note
If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions.
| Property name | Description |
|---|---|
| partitionKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
| partitionSizeMB | The size (in MB) for each partition. Default: 64 MB |
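For instance, to select the MongoSamplePartitioner and tune it through SparkConf, the partitioner class name goes under the spark.mongodb.input. prefix and its options under spark.mongodb.input.partitionerOptions. (the values below are illustrative, not recommendations):

```properties
spark.mongodb.input.partitioner=MongoSamplePartitioner
spark.mongodb.input.partitionerOptions.partitionKey=_id
spark.mongodb.input.partitionerOptions.partitionSizeMB=128
```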
uri Configuration Setting¶
You can set all Input Configuration via the input uri setting.
For example, the following sets the input uri setting via SparkConf:
Note
If configuring the MongoDB Spark input settings via SparkConf, prefix the setting with spark.mongodb.input.
spark.mongodb.input.uri=mongodb://127.0.0.1/databaseName.collectionName?readPreference=primaryPreferred
The configuration corresponds to the following separate configuration settings:
spark.mongodb.input.uri=mongodb://127.0.0.1/
spark.mongodb.input.database=databaseName
spark.mongodb.input.collection=collectionName
spark.mongodb.input.readPreference.name=primaryPreferred
If you specify a setting both in the uri and in a separate configuration, the uri setting overrides the separate setting. For example, given the following configuration, the input database for the connection is foobar:
spark.mongodb.input.uri=mongodb://127.0.0.1/foobar
spark.mongodb.input.database=bar
Output Configuration¶
The following options for writing to MongoDB are available:
Note
If setting these connector output configurations via SparkConf, prefix these settings with spark.mongodb.output.
| Property name | Description |
|---|---|
| uri | Required. The connection string of the form mongodb://host:port/. The other remaining output options may be appended to the uri setting. |
| database | Required. The database name to write data to. |
| collection | Required. The collection name to write data to. |
| localThreshold | The threshold (in milliseconds) for choosing a server from multiple MongoDB servers. Default: 15 ms |
| writeConcern.w | The write concern w value. Default: w: 1 |
| writeConcern.journal | The write concern journal value. |
| writeConcern.wTimeoutMS | The write concern wTimeout value. |
uri Configuration Setting¶
You can set all Output Configuration via the output uri setting.
For example, the following sets the output uri setting via SparkConf:
Note
If configuring the output settings via SparkConf, prefix the setting with spark.mongodb.output.
spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection
The configuration corresponds to the following separate configuration settings:
spark.mongodb.output.uri=mongodb://127.0.0.1/
spark.mongodb.output.database=test
spark.mongodb.output.collection=myCollection
If you specify a setting both in the uri and in a separate configuration, the uri setting overrides the separate setting. For example, given the following configuration, the output database for the connection is foobar:
spark.mongodb.output.uri=mongodb://127.0.0.1/foobar
spark.mongodb.output.database=bar
Cache Configuration¶
The MongoConnector includes a cache for MongoClients, so workers can share the MongoClient across threads.
Important
Because the cache is set up before the Spark Configuration is available, it can only be configured via a System Property.
| System Property name | Description |
|---|---|
| spark.mongodb.keep_alive_ms | The length of time (in milliseconds) to keep a MongoClient available for sharing. Default: 5000 |
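Because it is a System Property rather than a Spark configuration, it must be passed to the JVM directly. One way to do this (a sketch; the timeout value and application jar name are placeholders) is via the JVM options of the driver and executors when submitting an application:

```shell
# Keep cached MongoClients alive for 10 seconds instead of the 5-second default.
./bin/spark-submit \
  --driver-java-options "-Dspark.mongodb.keep_alive_ms=10000" \
  --conf "spark.executor.extraJavaOptions=-Dspark.mongodb.keep_alive_ms=10000" \
  app.jar
```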