Spark Connector Python API¶
Source Code
For the source code that contains the examples below, see introduction.py.
Prerequisites¶
- Basic working knowledge of MongoDB and Apache Spark. Refer to the MongoDB documentation and Spark documentation.
- Running MongoDB instance (version 2.6 or later).
- Spark 1.6.x.
- Scala 2.10.x if using the mongo-spark-connector_2.10 package
- Scala 2.11.x if using the mongo-spark-connector_2.11 package
pyspark Shell¶
This tutorial uses the pyspark shell. When starting the pyspark shell, you can specify:
- the --packages option to download the MongoDB Spark Connector package. The following packages are available:
  - mongo-spark-connector_2.10 for use with Scala 2.10.x
  - mongo-spark-connector_2.11 for use with Scala 2.11.x
- the --conf option to configure the MongoDB Spark Connector. These settings configure the SparkConf object.
Note
When specifying the Connector configuration via SparkConf, you must prefix the settings appropriately. For details and other available MongoDB Spark Connector options, see the Configuration Options.
For example, the following settings configure the input and output URIs (a sample invocation of the pyspark shell is sketched below):
- The spark.mongodb.input.uri specifies the MongoDB server address (127.0.0.1), the database to connect to (test), the collection (myCollection) from which to read data, and the read preference.
- The spark.mongodb.output.uri specifies the MongoDB server address (127.0.0.1), the database to connect to (test), and the collection (myCollection) to which to write data.
The examples in this tutorial will use this database and collection.
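A starting command along the following lines passes both URIs with --conf and downloads the connector with --packages. This is only a sketch: the connector version (1.1.0) and the readPreference value shown here are illustrative assumptions, not from the original; use the release that matches your Scala version.

./bin/pyspark --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred" \
              --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection" \
              --packages org.mongodb.spark:mongo-spark-connector_2.11:1.1.0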
In the pyspark shell, the SparkContext is available as sc, and the SQLContext is available as sqlContext.
MongoDB Python API Basics¶
The Python API works via DataFrames and uses the underlying Scala DataFrame API.
Write to MongoDB¶
To create a DataFrame, use the SQLContext.createDataFrame() method. To save the DataFrame to MongoDB, use the DataFrameWriter.save() method. For example:
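The following is a minimal sketch; the character names and ages are illustrative sample data, and sqlContext is the SQLContext provided by the pyspark shell:

# Build a DataFrame from illustrative sample data
characters = sqlContext.createDataFrame(
    [("Bilbo Baggins", 50), ("Gandalf", 1000), ("Frodo", 50)],
    ["name", "age"])

# Write to the collection specified by spark.mongodb.output.uri,
# replacing any existing contents
characters.write.format("com.mongodb.spark.sql.DefaultSource").mode("overwrite").save()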
Read from MongoDB¶
To load the collection into a DataFrame, use sqlContext.read with com.mongodb.spark.sql.DefaultSource as the format.
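A minimal sketch, assuming spark.mongodb.input.uri was set when the shell was started:

# Read from the collection specified by spark.mongodb.input.uri
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()

# Print the schema inferred from the collection's documents
df.printSchema()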
In the pyspark shell, the operation prints the schema that the connector infers from the documents in the collection.
SQL¶
SQL can be used to filter data. To use SQL, you need to register a temporary table first, and then you can run SQL queries over the data.
The following example registers a temporary table characters, then uses SQL to filter for characters with ages greater than or equal to 100:
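A sketch along these lines, assuming df holds the DataFrame loaded above:

# Register the DataFrame as a temporary table so it can be queried with SQL
df.registerTempTable("characters")

# Filter for characters at least 100 years old
centenarians = sqlContext.sql("SELECT name, age FROM characters WHERE age >= 100")
centenarians.show()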
In the pyspark shell, the operation prints the names and ages of the characters that match the filter.