How can I achieve data locality?
For any MongoDB deployment, the Mongo Spark Connector sets the preferred location for an RDD to be where the data is:
- For a non-sharded system, it sets the preferred location to be the hostname(s) of the standalone or the replica set.
- For a sharded system, it sets the preferred location to be the hostname(s) of the shards.
To promote data locality:
- Ensure there is a Spark Worker on one of the hosts for a non-sharded system, or one per shard for a sharded system.
- Use a nearest read preference to read from the local mongod.
- For a sharded cluster, run a mongos on the same nodes and use the localThreshold configuration to connect to the nearest mongos. To partition the data by shard, use the MongoShardedPartitioner configuration.
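The settings above could be combined roughly as follows. This is a minimal sketch, not a definitive setup: it assumes the 2.x-style spark.mongodb.input.* configuration keys, and the URI, database, collection, and threshold value are placeholders chosen for illustration.

```scala
import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession

object DataLocalitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataLocalitySketch")
      // Placeholder URI: assumes a mongos runs on the same host as the Spark Worker.
      .config("spark.mongodb.input.uri", "mongodb://localhost:27017/test.myCollection")
      // Prefer the member with the lowest network latency.
      .config("spark.mongodb.input.readPreference.name", "nearest")
      // Latency window in milliseconds for choosing among servers (illustrative value).
      .config("spark.mongodb.input.localThreshold", "15")
      // Create partitions that follow the shard chunks.
      .config("spark.mongodb.input.partitioner", "MongoShardedPartitioner")
      .getOrCreate()

    // Load the collection as an RDD using the configuration above.
    val rdd = MongoSpark.load(spark.sparkContext)
    println(s"Partitions: ${rdd.getNumPartitions}")

    spark.stop()
  }
}
```

The same keys can also be passed on the command line with --conf when submitting the job, which keeps host-specific values out of the application code.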
How do I interact with Spark Streams?
Spark streams can be considered as a potentially infinite source of RDDs. Therefore, anything you can do with an RDD, you can do with the results of a Spark Stream.
For an example, see SparkStreams.scala
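As a rough illustration of the idea, the sketch below counts words from a socket stream and appends each batch to MongoDB. It is not the contents of SparkStreams.scala: the WordCount case class, the socket host and port, and the output URI are assumptions made for this example, and the spark.mongodb.output.uri key assumes a 2.x-style connector.

```scala
import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical record type used only in this sketch.
case class WordCount(word: String, count: Int)

object StreamToMongoSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StreamToMongoSketch")
      // Placeholder output URI (database "test", collection "wordCounts").
      .config("spark.mongodb.output.uri", "mongodb://localhost/test.wordCounts")
      .getOrCreate()
    import spark.implicits._

    // One micro-batch per second; each batch is exposed as an ordinary RDD.
    val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Treat every batch like any other RDD: convert it and save it.
    counts.foreachRDD { rdd =>
      val df = rdd.map { case (word, count) => WordCount(word, count) }.toDF()
      MongoSpark.save(df.write.mode("append"))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```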
How do I resolve the Unrecognized pipeline stage name error?
In MongoDB deployments with mixed versions of mongod, it is possible to get an Unrecognized pipeline stage name: '$sample' error. To mitigate this situation, explicitly configure the partitioner to use and define the schema when using DataFrames.
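One way this might look is sketched below. It assumes a 2.x-style connector, where the default partitioner and schema inference both sample the collection. The choice of MongoSplitVectorPartitioner, the field names, and the URI are illustrative assumptions; any partitioner that does not rely on $sample should serve the same purpose.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object ExplicitPartitionerAndSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ExplicitPartitionerAndSchemaSketch")
      .getOrCreate()

    // Hypothetical schema: supplying it up front avoids sampling documents to infer one.
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)
    ))

    val df = spark.read
      .format("com.mongodb.spark.sql.DefaultSource")
      .schema(schema)
      // Placeholder URI (database "test", collection "characters").
      .option("uri", "mongodb://localhost/test.characters")
      // Assumed choice: a partitioner that splits by key ranges rather than sampling.
      .option("partitioner", "MongoSplitVectorPartitioner")
      .load()

    df.printSchema()
    spark.stop()
  }
}
```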