
    Configuration Options

    Various configuration options are available for the MongoDB Spark Connector.

    You can specify these options via SparkConf, using the --conf setting or the $SPARK_HOME/conf/spark-defaults.conf file, and the MongoDB Spark Connector uses the settings in SparkConf as the defaults.
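    For example, configuration values can be passed with --conf when submitting an application (the connection strings, names, and jar path below are illustrative placeholders):

    ```shell
    # Pass connector settings as Spark configuration at submit time;
    # the same keys could instead live in $SPARK_HOME/conf/spark-defaults.conf
    ./bin/spark-submit \
      --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred" \
      --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection" \
      your-application.jar
    ```
    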

    Important

    When setting configurations via SparkConf, you must prefix the configuration options. Refer to the configuration sections for the specific prefix.

    Various methods in the MongoDB Connector API accept an optional ReadConfig or a WriteConfig object. ReadConfig and WriteConfig settings override any corresponding settings in SparkConf.

    For examples, see Using a ReadConfig and Using a WriteConfig. For more details, refer to the source for these methods.
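    As a sketch of the override behavior (assuming the connector's Scala API, an active SparkContext sc with input settings already in SparkConf, and an illustrative collection name "characters"):

    ```scala
    import com.mongodb.spark.MongoSpark
    import com.mongodb.spark.config.ReadConfig

    // Build a ReadConfig from the SparkConf defaults, overriding the
    // collection and read preference; these values take precedence over
    // the corresponding spark.mongodb.input.* settings in SparkConf.
    val readConfig = ReadConfig(
      Map("collection" -> "characters", "readPreference.name" -> "secondaryPreferred"),
      Some(ReadConfig(sc)))

    // Load an RDD using the customized ReadConfig instead of the defaults
    val customRdd = MongoSpark.load(sc, readConfig)
    ```
    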

    In the Spark API, some methods (e.g. DataFrameReader and DataFrameWriter) accept options in the form of a Map[String, String].

    You can convert custom ReadConfig or WriteConfig settings into a Map via the asOptions() method.
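    A sketch of the conversion (assuming a SparkSession named spark and illustrative database and collection names):

    ```scala
    import com.mongodb.spark.config.ReadConfig

    // Build a ReadConfig explicitly, then hand it to the generic
    // DataFrameReader as a Map[String, String] via asOptions
    val readConfig = ReadConfig(Map(
      "uri" -> "mongodb://127.0.0.1/",
      "database" -> "test",
      "collection" -> "characters"))

    val df = spark.read
      .format("com.mongodb.spark.sql.DefaultSource")
      .options(readConfig.asOptions)
      .load()
    ```
    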

    The connector provides a cache for MongoClients, which can only be configured via a system property. See Cache Configuration.

    Input Configuration

    The following options for reading from MongoDB are available:

    Note

    If setting these connector input configurations via SparkConf, prefix these settings with spark.mongodb.input..

    uri
        Required. The connection string of the form mongodb://host:port/ where host can be a hostname, IP address, or UNIX domain socket. If :port is unspecified, the connection uses the default MongoDB port 27017.
        The other remaining input options may be appended to the uri setting. See uri Configuration Setting.

    database
        Required. The database name from which to read data.

    collection
        Required. The collection name from which to read data.

    batchSize
        Size of the internal batches used within the cursor.

    localThreshold
        The threshold (in milliseconds) for choosing a server from multiple MongoDB servers.
        Default: 15 ms

    readPreference.name
        The Read Preference to use.
        Default: Primary

    readPreference.tagSets
        The ReadPreference TagSets to use.
    MongoSamplePartitioner Configuration

    Note

    If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..

    partitionKey
        The field by which to split the collection data. The field should be indexed and contain unique values.
        Default: _id

    partitionSizeMB
        The size (in MB) for each partition.
        Default: 64 MB

    samplesPerPartition
        The number of sample documents to take for each partition.
        Default: 10

    MongoShardedPartitioner Configuration

    Note

    If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..

    shardkey
        The field by which to split the collection data. The field should be indexed and contain unique values.
        Default: _id

    MongoSplitVectorPartitioner Configuration

    Note

    If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..

    partitionKey
        The field by which to split the collection data. The field should be indexed and contain unique values.
        Default: _id

    partitionSizeMB
        The size (in MB) for each partition.
        Default: 64 MB

    MongoPaginateByCountPartitioner Configuration

    Note

    If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..

    partitionKey
        The field by which to split the collection data. The field should be indexed and contain unique values.
        Default: _id

    numberOfPartitions
        The number of partitions to create.
        Default: 64

    MongoPaginateBySizePartitioner Configuration

    Note

    If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..

    partitionKey
        The field by which to split the collection data. The field should be indexed and contain unique values.
        Default: _id

    partitionSizeMB
        The size (in MB) for each partition.
        Default: 64 MB

    uri Configuration Setting

    You can set all input configuration options via the input uri setting.

    For example, the following sets the input uri setting via SparkConf:

    Note

    If configuring the MongoDB Spark input settings via SparkConf, prefix the setting with spark.mongodb.input..

    spark.mongodb.input.uri=mongodb://127.0.0.1/databaseName.collectionName?readPreference=primaryPreferred

    The configuration corresponds to the following separate configuration settings:

    spark.mongodb.input.uri=mongodb://127.0.0.1/
    spark.mongodb.input.database=databaseName
    spark.mongodb.input.collection=collectionName
    spark.mongodb.input.readPreference.name=primaryPreferred

    If you specify a setting both in the uri and in a separate configuration, the uri setting overrides the separate setting. For example, given the following configuration, the input database for the connection is foobar:

    spark.mongodb.input.uri=mongodb://127.0.0.1/foobar
    spark.mongodb.input.database=bar

    Output Configuration

    The following options for writing to MongoDB are available:

    Note

    If setting these connector output configurations via SparkConf, prefix these settings with: spark.mongodb.output..

    uri
        Required. The connection string of the form mongodb://host:port/ where host can be a hostname, IP address, or UNIX domain socket. If :port is unspecified, the connection uses the default MongoDB port 27017.
        The other remaining output options may be appended to the uri setting. See uri Configuration Setting.

    database
        Required. The database name to write data to.

    collection
        Required. The collection name to write data to.

    extendedBsonTypes
        Enables extended BSON types when writing data to MongoDB.
        Default: true

    localThreshold
        The threshold (in milliseconds) for choosing a server from multiple MongoDB servers.
        Default: 15 ms

    replaceDocument
        Replace the whole document when saving Datasets that contain an _id field. If false, it only updates the fields in the document that match the fields in the Dataset.
        Default: true

    maxBatchSize
        The maximum batch size for bulk operations when saving data.
        Default: 512

    writeConcern.w
        The write concern w value.
        Default: w: 1

    writeConcern.journal
        The write concern journal value.

    writeConcern.wTimeoutMS
        The write concern wTimeout value.

    shardKey
        The field by which to split the collection data. The field should be indexed and contain unique values.
        Default: _id

    forceInsert
        Forces saves to use inserts, even if a Dataset contains an _id field.
        Default: false

    ordered
        Sets the bulk operations ordered property.
        Default: true
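    A sketch of overriding output settings per save (assuming the connector's Scala API, an active SparkContext sc with output settings already in SparkConf, and an illustrative RDD of Documents named documentRdd):

    ```scala
    import com.mongodb.spark.MongoSpark
    import com.mongodb.spark.config.WriteConfig

    // Build a WriteConfig from the SparkConf defaults, overriding the
    // collection and write concern for this save only
    val writeConfig = WriteConfig(
      Map("collection" -> "spark", "writeConcern.w" -> "majority"),
      Some(WriteConfig(sc)))

    // Save using the customized WriteConfig instead of the defaults
    MongoSpark.save(documentRdd, writeConfig)
    ```
    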

    You can set all output configuration options via the output uri setting.

    For example, the following sets the output uri setting via SparkConf:

    Note

    If configuring the MongoDB Spark output settings via SparkConf, prefix the setting with spark.mongodb.output..

    spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection

    The configuration corresponds to the following separate configuration settings:

    spark.mongodb.output.uri=mongodb://127.0.0.1/
    spark.mongodb.output.database=test
    spark.mongodb.output.collection=myCollection

    If you specify a setting both in the uri and in a separate configuration, the uri setting overrides the separate setting. For example, given the following configuration, the output database for the connection is foobar:

    spark.mongodb.output.uri=mongodb://127.0.0.1/foobar
    spark.mongodb.output.database=bar

    Cache Configuration

    The MongoConnector includes a cache for MongoClients, so workers can share the MongoClient across threads.

    Important

    Because the cache is set up before the Spark configuration is available, it can only be configured via a system property.

    mongodb.keep_alive_ms
        The length of time to keep a MongoClient available for sharing.
        Default: 5000
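    One way to pass the system property (the value and jar path below are illustrative) is through the JVM options of the driver and executors at submit time:

    ```shell
    # Set mongodb.keep_alive_ms as a JVM system property on both the
    # driver and the executors when submitting the application
    ./bin/spark-submit \
      --driver-java-options "-Dmongodb.keep_alive_ms=10000" \
      --conf "spark.executor.extraJavaOptions=-Dmongodb.keep_alive_ms=10000" \
      your-application.jar
    ```
    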
