Navigation

Configuration Options

Various configuration options are available for the MongoDB Spark Connector.

Specify Configuration

Via SparkConf

You can specify these options via SparkConf using the --conf setting or the $SPARK_HOME/conf/spark-default.conf file, and MongoDB Spark Connector will use the settings in SparkConf as the defaults.

Important

When setting configurations via SparkConf, you must prefix the configuration options. Refer to the configuration sections for the specific prefix.

Via ReadConfig and WriteConfig

Various methods in the MongoDB Connector API accept an optional ReadConfig or a WriteConfig object. ReadConfig and WriteConfig settings override any corresponding settings in SparkConf.

For examples, see Using a ReadConfig and Using a WriteConfig. For more details, refer to the source for these methods.

Via Options Map

In the Spark API, some methods (e.g. DataFrameReader and DataFrameWriter) accept options in the form of a Map[String, String].

You can convert custom ReadConfig or WriteConfig settings into a Map via the asOptions() method.

Via System Property

The connector provides a cache for MongoClients which can only be configured via the System Property. See Cache Configuration.

Input Configuration

The following options for reading from MongoDB are available:

Note

If setting these connector input configurations via SparkConf, prefix these settings with spark.mongodb.input..

Property name Description
uri

Required. The connection string of the form mongodb://host:port/ where host can be a hostname, IP address, or UNIX domain socket. If :port is unspecified, the connection uses the default MongoDB port 27017.

The other remaining input options may be appended to the uri setting. See uri Configuration Setting.

database Required. The database name from which to read data.
collection Required. The collection name from which to read data.
localThreshold

The threshold (in milliseconds) for choosing a server from multiple MongoDB servers.

Default: 15 ms

readPreference.name

The Read Preference to use.

Default: Primary

readPreference.tagSets The ReadPreference TagSets to use.
readConcern.level The Read Concern level to use.
sampleSize

The sample size to use when inferring the schema.

Default: 1000

partitioner

The class name of the partitioner to use to partition the data. The connector provides the following partitioners:

  • MongoDefaultPartitioner
    Default. Wraps the MongoSamplePartitioner and provides help for users of older versions of MongoDB.
  • MongoSamplePartitioner
    Requires MongoDB 3.2. A general purpose partitioner for all deployments. Uses the average document size and random sampling of the collection to determine suitable partitions for the collection. For configuration settings for the MongoSamplePartitioner, see MongoSamplePartitioner Configuration.
  • MongoShardedPartitioner
    A partitioner for sharded clusters. Partitions the collection based on the data chunks. Requires read access to the config database. For configuration settings for the MongoShardedPartitioner, see MongoShardedPartitioner Configuration.
  • MongoSplitVectorPartitioner
    A partitioner for standalone or replica sets. Uses the splitVector command on the standalone or the primary to determine the partitions of the database. Requires privileges to run splitVector command. For configuration settings for the MongoSplitVectorPartitioner, see MongoSplitVectorPartitioner Configuration.
  • MongoPaginateByCountPartitioner
    A slow, general purpose partitioner for all deployments. Creates a specific number of partitions. Requires a query for every partition. For configuration settings for the MongoPaginateByCountPartitioner, see MongoPaginateByCountPartitioner Configuration.
  • MongoPaginateBySizePartitioner
    A slow, general purpose partitioner for all deployments. Creates partitions based on data size. Requires a query for every partition. For configuration settings for the MongoPaginateBySizePartitioner, see MongoPaginateBySizePartitioner Configuration.

In addition to the provided partitioners, you can also specify a custom partitioner implementation. For custom implementations of the MongoPartitioner trait, provide the full class name. If no package names are provided, then the default com.mongodb.spark.rdd.partitioner package is used.

To configure options for the various partitioner, see Partitioner Configuration.

Default: MongoDefaultPartitioner

registerSQLHelperFunctions

New in version 2.0.0

Register helper methods for unsupported MongoDB data types.

Default: false

Partitioner Configuration

MongoSamplePartitioner Configuration

Note

If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..

Property name Description
partitionKey

The field by which to split the collection data. The field should be indexed and contain unique values.

Default: _id

partitionSizeMB

The size (in MB) for each partition

Default: 64 MB

samplesPerPartition

The number of sample documents to take for each partition.

Default: 10

MongoShardedPartitioner Configuration

Note

If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..

Property name Description
shardkey

The field by which to split the collection data. The field should be indexed and contain unique values.

Default: _id

MongoSplitVectorPartitioner Configuration

Note

If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..

Property name Description
partitionKey

The field by which to split the collection data. The field should be indexed and contain unique values.

Default: _id

partitionSizeMB

The size (in MB) for each partition

Default: 64 MB

MongoPaginateByCountPartitioner Configuration

Note

If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..

Property name Description
partitionKey

The field by which to split the collection data. The field should be indexed and contain unique values.

Default: _id

numberOfPartitions

The number of partitions to create.

Default: 64

MongoPaginateBySizePartitioner Configuration

Note

If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..

Property name Description
partitionKey

The field by which to split the collection data. The field should be indexed and contain unique values.

Default: _id

partitionSizeMB

The size (in MB) for each partition

Default: 64 MB

uri Configuration Setting

You can set all Input Configuration via the input uri setting.

For example, consider the following example which sets the input uri setting via SparkConf:

Note

If configuring the MongoDB Spark input settings via SparkConf, prefix the setting with spark.mongodb.input..

spark.mongodb.input.uri=mongodb://127.0.0.1/databaseName.collectionName?readPreference=primaryPreferred

The configuration corresponds to the following separate configuration settings:

spark.mongodb.input.uri=mongodb://127.0.0.1/
spark.mongodb.input.database=databaseName
spark.mongodb.input.collection=collectionName
spark.mongodb.input.readPreference.name=primaryPreferred

If you specify a setting both in the uri and in a separate configuration, the uri setting overrides the separate setting. For example, given the following configuration, the input database for the connection is foobar:

spark.mongodb.input.uri=mongodb://127.0.0.1/foobar
spark.mongodb.input.database=bar

Output Configuration

The following options for writing to MongoDB are available:

Note

If setting these connector output configurations via SparkConf, prefix these settings with: spark.mongodb.output..

Property name Description
uri

Required. The connection string of the form mongodb://host:port/ where host can be a hostname, IP address, or UNIX domain socket. If :port is unspecified, the connection uses the default MongoDB port 27017.

Note

The other remaining options may be appended to the uri setting. See uri Configuration Setting.

database Required. The database name to write data.
collection Required. The collection name to write data to
localThreshold

The threshold (milliseconds) for choosing a server from multiple MongoDB servers.

Default: 15 ms

writeConcern.w

The write concern w value.

Default w: 1

writeConcern.journal The write concern journal value.
writeConcern.wTimeoutMS The write concern wTimeout value.

uri Configuration Setting

You can set all Output Configuration via the output uri.

For example, consider the following example which sets the input uri setting via SparkConf:

Note

If configuring the configuration output settings via SparkConf, prefix the setting with spark.mongodb.output..

spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection

The configuration corresponds to the following separate configuration settings:

spark.mongodb.output.uri=mongodb://127.0.0.1/
spark.mongodb.output.database=test
spark.mongodb.output.collection=myCollection

If you specify a setting both in the uri and in a separate configuration, the uri setting overrides the separate setting. For example, given the following configuration, the output database for the connection is foobar:

spark.mongodb.output.uri=mongodb://127.0.0.1/foobar
spark.mongodb.output.database=bar

Cache Configuration

The MongoConnector includes a cache for MongoClients, so workers can share the MongoClient across threads.

Important

As the cache is setup before the Spark Configuration is available, the cache can only be configured via a System Property.

System Property name Description
spark.mongodb.keep_alive_ms

The length of time to keep a MongoClient available for sharing.

Default: 5000