
    FAQ

    For any MongoDB deployment, the Mongo Spark Connector sets the preferred location for an RDD to be where the data is:

    • For a non-sharded system, it sets the preferred location to be the hostname(s) of the standalone or the replica set.
    • For a sharded system, it sets the preferred location to be the hostname(s) of the shards.

    To promote data locality (a configuration sketch follows this list):

    • Ensure there is a Spark Worker on one of the hosts for a non-sharded system, or one per shard for sharded systems.
    • Use the nearest read preference to read from the local mongod.
    • For a sharded cluster, run a mongos on the same nodes and use the localThreshold configuration option to connect to the nearest mongos. To partition the data by shard, use the MongoShardedPartitioner configuration.
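
    A minimal sketch of such a configuration, assuming connector 2.x and hypothetical hostnames (host1 through host3); readPreference.name, localThreshold, and partitioner are the relevant ReadConfig options:

        import com.mongodb.spark.MongoSpark
        import com.mongodb.spark.config.ReadConfig
        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder()
          .appName("data-locality")
          .config("spark.mongodb.input.uri",
            "mongodb://host1,host2,host3/test.coll") // hypothetical cluster URI
          .getOrCreate()

        val readConfig = ReadConfig(
          Map(
            "readPreference.name" -> "nearest",                // read from the local mongod
            "localThreshold"      -> "0",                      // prefer the nearest mongos (milliseconds)
            "partitioner"         -> "MongoShardedPartitioner" // partition the data by shard
          ),
          Some(ReadConfig(spark.sparkContext))
        )

        val rdd = MongoSpark.load(spark.sparkContext, readConfig)
        println(rdd.count())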

    Spark streams can be considered a potentially infinite source of RDDs. Therefore, anything you can do with an RDD, you can do with the results of a Spark Stream.

    For an example, see SparkStreams.scala.
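
    As a sketch of the idea (assuming connector 2.x, a hypothetical socket source on port 9999, and a hypothetical output collection), each micro-batch arrives as an ordinary RDD and can be saved with MongoSpark.save like any other RDD:

        import com.mongodb.spark.MongoSpark
        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}
        import org.bson.Document

        val conf = new SparkConf()
          .setAppName("spark-streams")
          .setMaster("local[*]") // hypothetical; use your cluster master
          .set("spark.mongodb.output.uri",
            "mongodb://localhost/test.wordcounts") // hypothetical collection

        val ssc = new StreamingContext(conf, Seconds(10))
        val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source

        // A word count over each 10-second batch.
        val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

        // Each batch is an ordinary RDD, so the usual RDD APIs apply,
        // including saving to MongoDB.
        wordCounts.foreachRDD { rdd =>
          MongoSpark.save(rdd.map { case (word, count) =>
            new Document("word", word).append("count", count)
          })
        }

        ssc.start()
        ssc.awaitTermination()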

    In MongoDB deployments with mixed versions of mongod, it is possible to get an Unrecognized pipeline stage name: '$sample' error, because the $sample aggregation stage requires MongoDB 3.2 or later. To mitigate this situation, explicitly configure a partitioner that does not use $sample, and define the Schema when using DataFrames.
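
    For instance, a sketch assuming connector 2.x and a hypothetical collection: MongoPaginateBySizePartitioner does not use $sample, and supplying an explicit schema avoids the sampling done by schema inference:

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.types._

        val spark = SparkSession.builder().appName("no-sample").getOrCreate()

        // Hypothetical schema for test.coll; adjust the fields to your documents.
        val schema = StructType(Seq(
          StructField("_id", StringType),
          StructField("name", StringType),
          StructField("age", IntegerType)
        ))

        val df = spark.read
          .format("com.mongodb.spark.sql.DefaultSource")
          .option("uri", "mongodb://localhost/test.coll") // hypothetical
          .option("partitioner", "MongoPaginateBySizePartitioner") // avoids $sample
          .schema(schema) // explicit schema: no sampling for inference
          .load()

        df.printSchema()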
