Configuration Options
Various configuration options are available for the MongoDB Spark Connector.
Specify Configuration

Via SparkConf

You can specify these options via SparkConf using the --conf setting or the
$SPARK_HOME/conf/spark-defaults.conf file, and the MongoDB Spark Connector
will use the settings in SparkConf as the defaults.

Important

When setting configurations via SparkConf, you must prefix the configuration
options. Refer to the configuration sections for the specific prefix.
Via ReadConfig and WriteConfig

Various methods in the MongoDB Connector API accept an optional ReadConfig or
WriteConfig object. ReadConfig and WriteConfig settings override any
corresponding settings in SparkConf.

For examples, see Using a ReadConfig and Using a WriteConfig. For more
details, refer to the source for these methods.
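As a sketch of the override pattern described above, the following assumes an active SparkContext named sc whose SparkConf already carries spark.mongodb.input./spark.mongodb.output. defaults; the collection names and write concern values are illustrative:

```scala
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.{ReadConfig, WriteConfig}

// Build a ReadConfig from the SparkConf defaults, overriding the
// collection and read preference for this one read.
val readConfig = ReadConfig(
  Map("collection" -> "spark", "readPreference.name" -> "secondaryPreferred"),
  Some(ReadConfig(sc)))
val customRdd = MongoSpark.load(sc, readConfig)

// Similarly, a WriteConfig can override the write concern for a single save.
val writeConfig = WriteConfig(
  Map("collection" -> "spark", "writeConcern.w" -> "majority"),
  Some(WriteConfig(sc)))
MongoSpark.save(customRdd, writeConfig)
```

Settings passed in the Map take precedence over the defaults captured from sc, mirroring how ReadConfig and WriteConfig override SparkConf.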
Via Options Map

In the Spark API, some methods (e.g. DataFrameReader and DataFrameWriter)
accept options in the form of a Map[String, String].

You can convert custom ReadConfig or WriteConfig settings into a Map via the
asOptions() method.
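For example, a ReadConfig can feed a DataFrameReader through asOptions(). This sketch assumes an existing SparkSession named spark; the database and collection names are illustrative:

```scala
import com.mongodb.spark.config.ReadConfig

// Build a ReadConfig explicitly, then hand its settings to the
// DataFrameReader as a plain Map[String, String].
val readConfig = ReadConfig(Map(
  "uri" -> "mongodb://127.0.0.1/",
  "database" -> "test",
  "collection" -> "myCollection"))

val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .options(readConfig.asOptions)
  .load()
```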
Via System Property

The connector provides a cache for MongoClients which can only be configured
via a system property. See Cache Configuration.
Input Configuration

The following options for reading from MongoDB are available:

Note

If setting these connector input configurations via SparkConf, prefix these
settings with spark.mongodb.input..

Property name | Description |
---|---|
uri | Required. The connection string of the form mongodb://host:port/. The other remaining input options may be appended to the uri setting. |
database | Required. The database name from which to read data. |
collection | Required. The collection name from which to read data. |
localThreshold | The threshold (in milliseconds) for choosing a server from multiple MongoDB servers. Default: 15 ms |
readPreference.name | The Read Preference to use. Default: Primary |
readPreference.tagSets | The ReadPreference TagSets to use. |
readConcern.level | The Read Concern level to use. |
sampleSize | The sample size to use when inferring the schema. Default: 1000 |
partitioner | The class name of the partitioner to use to partition the data. The connector provides the following partitioners: MongoDefaultPartitioner, MongoSamplePartitioner, MongoShardedPartitioner, MongoSplitVectorPartitioner, MongoPaginateByCountPartitioner, and MongoPaginateBySizePartitioner. In addition to the provided partitioners, you can also specify a custom partitioner implementation by providing its full class name. To configure options for the various partitioners, see Partitioner Configuration. Default: MongoDefaultPartitioner |
registerSQLHelperFunctions | New in version 2.0.0. Registers helper methods for unsupported MongoDB data types. Default: false |
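Putting the table above together, input settings can be supplied on a SparkConf with the spark.mongodb.input. prefix. The URI, read preference, and sample size below are illustrative values, not recommendations:

```scala
import org.apache.spark.SparkConf

// Input options carry the spark.mongodb.input. prefix when set on SparkConf.
val conf = new SparkConf()
  .setAppName("mongo-read-example")
  .set("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.myCollection")
  .set("spark.mongodb.input.readPreference.name", "secondaryPreferred")
  .set("spark.mongodb.input.sampleSize", "5000")
```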
Partitioner Configuration

MongoSamplePartitioner Configuration

Note

If setting these connector configurations via SparkConf, prefix these
configuration settings with spark.mongodb.input.partitionerOptions..

Property name | Description |
---|---|
partitionKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
partitionSizeMB | The size (in MB) for each partition. Default: 64 MB |
samplesPerPartition | The number of sample documents to take for each partition. Default: 10 |
MongoShardedPartitioner Configuration

Note

If setting these connector configurations via SparkConf, prefix these
configuration settings with spark.mongodb.input.partitionerOptions..

Property name | Description |
---|---|
shardkey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
MongoSplitVectorPartitioner Configuration

Note

If setting these connector configurations via SparkConf, prefix these
configuration settings with spark.mongodb.input.partitionerOptions..

Property name | Description |
---|---|
partitionKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
partitionSizeMB | The size (in MB) for each partition. Default: 64 MB |
MongoPaginateByCountPartitioner Configuration

Note

If setting these connector configurations via SparkConf, prefix these
configuration settings with spark.mongodb.input.partitionerOptions..

Property name | Description |
---|---|
partitionKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
numberOfPartitions | The number of partitions to create. Default: 64 |
MongoPaginateBySizePartitioner Configuration

Note

If setting these connector configurations via SparkConf, prefix these
configuration settings with spark.mongodb.input.partitionerOptions..

Property name | Description |
---|---|
partitionKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
partitionSizeMB | The size (in MB) for each partition. Default: 64 MB |
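Combining the partitioner setting with its options, the following sketch selects MongoSamplePartitioner on a SparkConf and tunes it; the partition key and size shown are illustrative:

```scala
import org.apache.spark.SparkConf

// The partitioner itself uses the spark.mongodb.input. prefix, while its
// options take the spark.mongodb.input.partitionerOptions. prefix.
val conf = new SparkConf()
  .set("spark.mongodb.input.partitioner", "MongoSamplePartitioner")
  .set("spark.mongodb.input.partitionerOptions.partitionKey", "_id")
  .set("spark.mongodb.input.partitionerOptions.partitionSizeMB", "32")
```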
uri Configuration Setting

You can set all Input Configuration via the input uri setting.

For example, the following sets the input uri setting via SparkConf:

Note

If configuring the MongoDB Spark input settings via SparkConf, prefix the
setting with spark.mongodb.input..
spark.mongodb.input.uri=mongodb://127.0.0.1/databaseName.collectionName?readPreference=primaryPreferred
The configuration corresponds to the following separate configuration settings:
spark.mongodb.input.uri=mongodb://127.0.0.1/
spark.mongodb.input.database=databaseName
spark.mongodb.input.collection=collectionName
spark.mongodb.input.readPreference.name=primaryPreferred
If you specify a setting both in the uri and in a separate configuration, the
uri setting overrides the separate setting. For example, given the following
configuration, the input database for the connection is foobar:
spark.mongodb.input.uri=mongodb://127.0.0.1/foobar
spark.mongodb.input.database=bar
Output Configuration

The following options for writing to MongoDB are available:

Note

If setting these connector output configurations via SparkConf, prefix these
settings with spark.mongodb.output..

Property name | Description |
---|---|
uri | Required. The connection string of the form mongodb://host:port/. The other remaining output options may be appended to the uri setting. |
database | Required. The database name to write data to. |
collection | Required. The collection name to write data to. |
localThreshold | The threshold (in milliseconds) for choosing a server from multiple MongoDB servers. Default: 15 ms |
writeConcern.w | The write concern w value. Default: w: 1 |
writeConcern.journal | The write concern journal value. |
writeConcern.wTimeoutMS | The write concern wTimeout value. |
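With the spark.mongodb.output. defaults in place, a DataFrame can be written through the standard DataFrameWriter API, with individual output options overridden per write. This sketch assumes an existing DataFrame named df; the write concern value is illustrative:

```scala
// Write df using the spark.mongodb.output.* defaults from the SparkConf;
// option() overrides a single output setting for this write only.
df.write
  .format("com.mongodb.spark.sql.DefaultSource")
  .mode("append")
  .option("writeConcern.w", "majority")
  .save()
```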
uri Configuration Setting

You can set all Output Configuration via the output uri setting.

For example, the following sets the output uri setting via SparkConf:

Note

If configuring the connector output settings via SparkConf, prefix the
setting with spark.mongodb.output..
spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection
The configuration corresponds to the following separate configuration settings:
spark.mongodb.output.uri=mongodb://127.0.0.1/
spark.mongodb.output.database=test
spark.mongodb.output.collection=myCollection
If you specify a setting both in the uri and in a separate configuration, the
uri setting overrides the separate setting. For example, given the following
configuration, the output database for the connection is foobar:
spark.mongodb.output.uri=mongodb://127.0.0.1/foobar
spark.mongodb.output.database=bar
Cache Configuration

The MongoConnector includes a cache for MongoClients, so workers can share a
MongoClient across threads.

Important

Because the cache is set up before the Spark configuration is available, it
can only be configured via a system property.

System Property name | Description |
---|---|
spark.mongodb.keep_alive_ms | The length of time (in milliseconds) to keep a MongoClient available for sharing. Default: 5000 |
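Since this is a JVM system property rather than a Spark setting, one way to supply it is as a -D option for both the driver and the executors when submitting the application; the value and flags below are an illustrative sketch, not a prescribed invocation:

```
spark-submit \
  --driver-java-options "-Dspark.mongodb.keep_alive_ms=10000" \
  --conf "spark.executor.extraJavaOptions=-Dspark.mongodb.keep_alive_ms=10000" \
  ...
```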