Configuration Options¶
Various configuration options are available for the MongoDB Spark Connector.
Specify Configuration¶
Via SparkConf¶
You can specify these options via SparkConf using the --conf setting or the $SPARK_HOME/conf/spark-defaults.conf file, and the MongoDB Spark Connector will use the settings in SparkConf as the defaults.
Important
When setting configurations via SparkConf, you must prefix the configuration options. Refer to the configuration sections for the specific prefix.
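As a sketch, the prefixed settings can be passed on the command line when launching a Spark shell; the URI, database, and collection names here are placeholders:

```shell
# Supply connector defaults via --conf; note the spark.mongodb.input./output. prefixes.
./bin/spark-shell \
  --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.myCollection" \
  --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection"
```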
Via ReadConfig and WriteConfig¶
Various methods in the MongoDB Connector API accept an optional ReadConfig or a WriteConfig object. ReadConfig and WriteConfig settings override any corresponding settings in SparkConf.
For examples, see Specify ReadConfig and Specify WriteConfig. For more details, refer to the source for these methods.
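As a minimal sketch, assuming a running Spark context (sc) with connector defaults already set in SparkConf, a ReadConfig can override those defaults for a single read; the collection name and read preference here are illustrative:

```scala
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig

// Start from the context's SparkConf defaults, then override the
// collection and read preference for this read only.
val readConfig = ReadConfig(
  Map("collection" -> "spark", "readPreference.name" -> "secondaryPreferred"),
  Some(ReadConfig(sc)))

val customRdd = MongoSpark.load(sc, readConfig)
```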
Via Options Map¶
In the Spark API, some methods (e.g. DataFrameReader and DataFrameWriter) accept options in the form of a Map[String, String]. You can convert custom ReadConfig or WriteConfig settings into a Map via the asOptions() method.
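For instance (a sketch assuming a SparkSession named spark and placeholder database and collection names), a ReadConfig can be converted to an options map and handed to a DataFrameReader:

```scala
import com.mongodb.spark.config.ReadConfig

// Build a ReadConfig explicitly, then convert it to a Map[String, String]
// suitable for DataFrameReader.options(...).
val readConfig = ReadConfig(Map(
  "uri"        -> "mongodb://127.0.0.1/",
  "database"   -> "test",
  "collection" -> "myCollection"))

val df = spark.read
  .format("com.mongodb.spark.sql")
  .options(readConfig.asOptions)
  .load()
```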
Via System Property¶
The connector provides a cache for MongoClients which can only be configured via a System Property. See Cache Configuration.
Input Configuration¶
The following options for reading from MongoDB are available:
Note
If setting these connector input configurations via SparkConf, prefix these settings with spark.mongodb.input.
| Property name | Description |
|---|---|
| uri | Required. The connection string of the form mongodb://host:port/. The other remaining input options may be appended to the uri setting. |
| database | Required. The database name from which to read data. |
| collection | Required. The collection name from which to read data. |
| localThreshold | The threshold (in milliseconds) for choosing a server from multiple MongoDB servers. Default: 15 ms |
| readPreference.name | The Read Preference to use. Default: Primary |
| readPreference.tagSets | The ReadPreference TagSets to use. |
| readConcern.level | The Read Concern level to use. |
| sampleSize | The sample size to use when inferring the schema. Default: 1000 |
| partitioner | The class name of the partitioner to use to partition the data. The connector provides the following partitioners: MongoDefaultPartitioner, MongoSamplePartitioner, MongoShardedPartitioner, MongoSplitVectorPartitioner, MongoPaginateByCountPartitioner, and MongoPaginateBySizePartitioner. In addition to the provided partitioners, you can also specify a custom partitioner implementation; for custom implementations of the MongoPartitioner trait, provide the full class name. To configure options for the various partitioners, see Partitioner Configuration. Default: MongoDefaultPartitioner |
Partitioner Configuration¶
MongoSamplePartitioner Configuration¶
Note
If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions.
| Property name | Description |
|---|---|
| partitionKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
| partitionSizeMB | The size (in MB) for each partition. Default: 64 MB |
| samplesPerPartition | The number of sample documents to take for each partition. Default: 10 |
MongoShardedPartitioner Configuration¶
Note
If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions.
| Property name | Description |
|---|---|
| shardkey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
MongoSplitVectorPartitioner Configuration¶
Note
If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions.
| Property name | Description |
|---|---|
| partitionKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
| partitionSizeMB | The size (in MB) for each partition. Default: 64 MB |
MongoPaginateByCountPartitioner Configuration¶
Note
If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions.
| Property name | Description |
|---|---|
| partitionKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
| numberOfPartitions | The number of partitions to create. Default: 64 |
MongoPaginateBySizePartitioner Configuration¶
Note
If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions.
| Property name | Description |
|---|---|
| partitionKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
| partitionSizeMB | The size (in MB) for each partition. Default: 64 MB |
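For instance, to select the MongoSamplePartitioner and tune it through SparkConf, the partitioner class name goes under the spark.mongodb.input. prefix and its options under spark.mongodb.input.partitionerOptions. (the values below are illustrative, not recommendations):

```properties
spark.mongodb.input.partitioner=MongoSamplePartitioner
spark.mongodb.input.partitionerOptions.partitionKey=_id
spark.mongodb.input.partitionerOptions.partitionSizeMB=128
```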
uri Configuration Setting¶
You can set all Input Configuration via the input uri setting.
For example, the following sets the input uri setting via SparkConf:
Note
If configuring the MongoDB Spark input settings via SparkConf, prefix the setting with spark.mongodb.input.
spark.mongodb.input.uri=mongodb://127.0.0.1/databaseName.collectionName?readPreference=primaryPreferred
The configuration corresponds to the following separate configuration settings:
spark.mongodb.input.uri=mongodb://127.0.0.1/
spark.mongodb.input.database=databaseName
spark.mongodb.input.collection=collectionName
spark.mongodb.input.readPreference.name=primaryPreferred
If you specify a setting both in the uri and in a separate configuration, the uri setting overrides the separate setting. For example, given the following configuration, the input database for the connection is foobar:
spark.mongodb.input.uri=mongodb://127.0.0.1/foobar
spark.mongodb.input.database=bar
Output Configuration¶
The following options for writing to MongoDB are available:
Note
If setting these connector output configurations via SparkConf, prefix these settings with spark.mongodb.output.
| Property name | Description |
|---|---|
| uri | Required. The connection string of the form mongodb://host:port/. The other remaining output options may be appended to the uri setting. |
| database | Required. The database name to write data to. |
| collection | Required. The collection name to write data to. |
| localThreshold | The threshold (in milliseconds) for choosing a server from multiple MongoDB servers. Default: 15 ms |
| writeConcern.w | The write concern w value. Default: w: 1 |
| writeConcern.journal | The write concern journal value. |
| writeConcern.wTimeoutMS | The write concern wTimeout value. |
uri Configuration Setting¶
You can set all Output Configuration via the output uri setting.
For example, the following sets the output uri setting via SparkConf:
Note
If configuring the output settings via SparkConf, prefix the setting with spark.mongodb.output.
spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection
The configuration corresponds to the following separate configuration settings:
spark.mongodb.output.uri=mongodb://127.0.0.1/
spark.mongodb.output.database=test
spark.mongodb.output.collection=myCollection
If you specify a setting both in the uri and in a separate configuration, the uri setting overrides the separate setting. For example, given the following configuration, the output database for the connection is foobar:
spark.mongodb.output.uri=mongodb://127.0.0.1/foobar
spark.mongodb.output.database=bar
Cache Configuration¶
The MongoConnector includes a cache for MongoClients, so workers can share the MongoClient across threads.
Important
Because the cache is set up before the Spark Configuration is available, it can only be configured via a System Property.
| System Property name | Description |
|---|---|
| spark.mongodb.keep_alive_ms | The length of time (in milliseconds) to keep a MongoClient available for sharing. Default: 5000 |
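Because it is a System Property rather than a Spark configuration, it must be passed to the JVM directly. One way to do this (a sketch; the timeout value and application jar name are placeholders) is via the JVM options of the driver and executors when submitting an application:

```shell
# Keep cached MongoClients alive for 10 seconds instead of the 5-second default.
./bin/spark-submit \
  --driver-java-options "-Dspark.mongodb.keep_alive_ms=10000" \
  --conf "spark.executor.extraJavaOptions=-Dspark.mongodb.keep_alive_ms=10000" \
  app.jar
```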