FAQ

How can I achieve data locality?

For any MongoDB deployment, the Mongo Spark Connector sets the preferred location for an RDD to be where the data is:

For a non-sharded system, it sets the preferred location to be the hostname(s) of the standalone or the replica set.
For a sharded system, it sets the preferred location to be the hostname(s) of the shards.

To promote data locality:

Ensure there is a Spark Worker on one of the hosts for a non-sharded system, or one per shard for a sharded system.
Use a nearest read preference to read from the local mongod.
For a sharded cluster, run a mongos on the same nodes and use the localThreshold configuration to connect to the nearest mongos. To partition the data by shard, use the MongoShardedPartitioner configuration.
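The settings above could be collected as Spark properties along the following lines. This is a minimal sketch, assuming the connector's `spark.mongodb.input.*` property names; the URI, database, collection, threshold value, and shard key are hypothetical placeholders, and the `DataLocalityConfig` object is introduced only for illustration.

```scala
// A sketch of locality-related read settings, expressed as plain key/value pairs.
object DataLocalityConfig {
  val options: Map[String, String] = Map(
    // Hypothetical connection string: a mongos (or mongod) on the same host as the Spark Worker.
    "spark.mongodb.input.uri" -> "mongodb://localhost:27017/test.myCollection",
    // Read from the nearest member rather than always from the primary.
    "spark.mongodb.input.readPreference.name" -> "nearest",
    // Latency window in milliseconds for choosing among servers (assumed value).
    "spark.mongodb.input.localThreshold" -> "15",
    // Sharded clusters only: create partitions that follow the shard chunks.
    "spark.mongodb.input.partitioner" -> "MongoShardedPartitioner",
    // Assumed shard key; replace with the collection's actual shard key.
    "spark.mongodb.input.partitionerOptions.shardkey" -> "_id"
  )
}
```

These pairs can then be applied when building the Spark configuration, for example with `new SparkConf().setAll(DataLocalityConfig.options)`. For a non-sharded deployment, the two partitioner entries would be dropped.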

How do I interact with Spark Streams?

A Spark stream can be treated as a potentially infinite source of RDDs. Therefore, anything you can do with an RDD, you can also do with the results of a Spark stream.

For an example, see SparkStreams.scala.
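As an illustration of treating each batch of a stream as an ordinary RDD, the following sketch counts words arriving on a socket and saves every batch with `MongoSpark.save`. It is not the contents of SparkStreams.scala. The host, port, output URI, and the `StreamToMongoSketch` and `toDocument` names are assumptions made for this example, and it requires the Spark Streaming and connector dependencies on the classpath.

```scala
import com.mongodb.spark.MongoSpark
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.bson.Document

object StreamToMongoSketch {
  // Hypothetical helper: turn one (word, count) pair into a BSON document.
  def toDocument(word: String, count: Int): Document =
    new Document("word", word).append("count", Int.box(count))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("StreamToMongoSketch")
      // Assumed output collection for the saved batches.
      .set("spark.mongodb.output.uri", "mongodb://localhost:27017/test.wordCounts")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Assumed text source on localhost:9999; counts are per one-second batch.
    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Each batch is a plain RDD, so the usual RDD save path applies.
    counts.foreachRDD { rdd =>
      MongoSpark.save(rdd.map { case (word, count) => toDocument(word, count) })
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```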

How do I resolve the `Unrecognized pipeline stage name` error?

In MongoDB deployments with mixed versions of mongod, it is possible to get an Unrecognized pipeline stage name: '$sample' error. To avoid it, explicitly configure the partitioner to use and, when using DataFrames, define the schema explicitly.
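A DataFrame read with both an explicit partitioner and an explicit schema might look like the sketch below. The choice of `MongoSplitVectorPartitioner` is an assumption: it is understood to rely on the splitVector command rather than `$sample`, so confirm that it suits your deployment and privileges. The URI, the `name` and `age` fields, and the `ExplicitPartitionerExample` object are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object ExplicitPartitionerExample {
  // Hypothetical schema; supplying it means the connector does not have to infer one.
  val schema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true)
  ))

  // Assumed read options: a placeholder URI and an explicitly named partitioner.
  val readOptions: Map[String, String] = Map(
    "uri" -> "mongodb://localhost:27017/test.people",
    "partitioner" -> "MongoSplitVectorPartitioner"
  )

  // Load the collection as a DataFrame using the explicit schema and partitioner.
  def load(spark: SparkSession): DataFrame =
    spark.read
      .format("com.mongodb.spark.sql.DefaultSource")
      .schema(schema)
      .options(readOptions)
      .load()
}
```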
