This version of the documentation is archived and no longer supported.

Read from MongoDB

You can create a Spark DataFrame to hold data from the MongoDB collection specified in the spark.mongodb.input.uri option of your SparkSession.
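
As a minimal sketch of how that option might be supplied, the helpers below assemble an input URI and pass it to the session builder. The host 127.0.0.1, the database name test, the application name, and both helper names are assumptions made for illustration. The sketch also assumes the connector package is already available to Spark, which it does not set up itself.

```python
def build_input_uri(database, collection, host="127.0.0.1"):
    """Return a URI of the form mongodb://<host>/<database>.<collection>."""
    return "mongodb://{}/{}.{}".format(host, database, collection)


def create_session(input_uri, app_name="myApp"):
    """Create (or reuse) a SparkSession whose default read source is input_uri.

    pyspark is imported lazily so this module can be loaded without Spark.
    """
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.mongodb.input.uri", input_uri)
        .getOrCreate()
    )


# Hypothetical usage; "test" is an assumed database name:
# spark = create_session(build_input_uri("test", "fruit"))
```

In the pyspark shell a session named spark already exists, so a helper like create_session is mainly relevant to standalone scripts.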

Consider a collection named fruit that contains the following documents:

{ "_id" : 1, "type" : "apple", "qty" : 5 }
{ "_id" : 2, "type" : "orange", "qty" : 10 }
{ "_id" : 3, "type" : "banana", "qty" : 15 }

Assign the collection to a DataFrame with spark.read from within the pyspark shell.

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

Spark samples the records to infer the schema of the collection.

df.printSchema()

The above operation produces the following shell output:

root
 |-- _id: double (nullable = true)
 |-- qty: double (nullable = true)
 |-- type: string (nullable = true)
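
To make the idea of sampling-based inference concrete, here is a simplified, pure-Python illustration. It is not the connector's implementation: the function name, the type mapping, and the rule that conflicting types fall back to string are all assumptions of this sketch. Numeric values are written as floats to mirror the double type reported in the output above.

```python
import random

# Assumed mapping from Python types to the names printSchema() displays.
TYPE_NAMES = {float: "double", int: "integer", str: "string", bool: "boolean"}


def infer_schema(documents, sample_size=1000, seed=0):
    """Infer {field: type_name} from a random sample of documents.

    Simplification: a field seen with conflicting types becomes "string".
    """
    rng = random.Random(seed)
    sample = rng.sample(documents, min(sample_size, len(documents)))
    schema = {}
    for doc in sample:
        for field, value in doc.items():
            name = TYPE_NAMES.get(type(value), "string")
            # Keep the first type seen; downgrade to string on a mismatch.
            if schema.setdefault(field, name) != name:
                schema[field] = "string"
    # Sort by field name, as in the shell output.
    return dict(sorted(schema.items()))


fruit = [
    {"_id": 1.0, "type": "apple", "qty": 5.0},
    {"_id": 2.0, "type": "orange", "qty": 10.0},
    {"_id": 3.0, "type": "banana", "qty": 15.0},
]

print("root")
for field, type_name in infer_schema(fruit).items():
    print(" |-- {}: {} (nullable = true)".format(field, type_name))
```

Because only a sample is inspected, a field that appears solely in unsampled documents would be missing from a schema inferred this way.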

If you need to read from a different MongoDB collection, use the .option method when reading data into a DataFrame.
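
As a rough sketch of that override, the hypothetical helper below builds a URI from a database and collection name and passes it through .option("uri", ...). The helper name, its parameters, the default host, and the placeholder names mydb and mycoll are assumptions, not part of the connector.

```python
MONGO_SOURCE = "com.mongodb.spark.sql.DefaultSource"


def read_collection(spark, database, collection, host="127.0.0.1"):
    """Load <database>.<collection> into a DataFrame.

    The "uri" option applies to this read only; the session's
    spark.mongodb.input.uri setting is left unchanged.
    """
    uri = "mongodb://{}/{}.{}".format(host, database, collection)
    return spark.read.format(MONGO_SOURCE).option("uri", uri).load()


# Hypothetical usage inside the pyspark shell, where `spark` already exists:
# df = read_collection(spark, "mydb", "mycoll")
# df.printSchema()
```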

To read from a collection called contacts in a database called people, specify people.contacts in the input URI option.

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri",
"mongodb://127.0.0.1/people.contacts").load()