Spark Connector R Guide
Source Code
For the source code that contains the examples below, see introduction.R.
Prerequisites
- Basic working knowledge of MongoDB and Apache Spark. Refer to the MongoDB documentation, Spark documentation, and this MongoDB white paper for more details.
- Running MongoDB instance (version 2.6 or later).
- Spark 2.0.x.
- Scala 2.11.x.
Getting Started
sparkR Shell
This tutorial uses the sparkR shell, but the code examples work just as well with self-contained R applications.
When starting the sparkR shell, you can specify:
- the --packages option to download the MongoDB Spark Connector package. The following package is available: mongo-spark-connector_2.11, for use with Scala 2.11.x.
- the --conf option to configure the MongoDB Spark Connector. These settings configure the SparkConf object.
Note
When specifying the Connector configuration via SparkConf, you must prefix the settings appropriately. For details and other available MongoDB Spark Connector options, see the Configuration Options.
For example, a typical invocation sets the following:
- The spark.mongodb.input.uri setting specifies the MongoDB server address (127.0.0.1), the database to connect to (test), the collection (myCollection) from which to read data, and the read preference.
- The spark.mongodb.output.uri setting specifies the MongoDB server address (127.0.0.1), the database to connect to (test), and the collection (myCollection) to which to write data. The connection uses port 27017 by default.
- The --packages option specifies the Spark Connector’s Maven coordinates, in the format groupId:artifactId:version.
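The settings described above can be sketched as a small launch script. This is an illustration under stated assumptions, not the guide's own command: the connector version (2.0.0), the read preference value (primaryPreferred), and the SPARK_HOME fallback are all assumed here and should be adjusted to match your installation.

```shell
#!/bin/sh
# Sketch: start the sparkR shell with the MongoDB Spark Connector configured.
# Assumptions (not taken from the guide): connector version 2.0.0, the
# primaryPreferred read preference, and SPARK_HOME defaulting to ".".

SPARK_HOME="${SPARK_HOME:-.}"
MONGO_HOST="127.0.0.1"          # no port given, so 27017 is used
MONGO_NS="test.myCollection"    # database.collection

# Input URI: where to read from, plus the read preference.
INPUT_URI="mongodb://${MONGO_HOST}/${MONGO_NS}?readPreference=primaryPreferred"
# Output URI: where to write to.
OUTPUT_URI="mongodb://${MONGO_HOST}/${MONGO_NS}"
# Maven coordinates in the form groupId:artifactId:version.
CONNECTOR="org.mongodb.spark:mongo-spark-connector_2.11:2.0.0"

if [ -x "${SPARK_HOME}/bin/sparkR" ]; then
  "${SPARK_HOME}/bin/sparkR" \
    --conf "spark.mongodb.input.uri=${INPUT_URI}" \
    --conf "spark.mongodb.output.uri=${OUTPUT_URI}" \
    --packages "${CONNECTOR}"
else
  echo "sparkR not found under ${SPARK_HOME}/bin; set SPARK_HOME first" >&2
fi
```

Keeping the URIs and coordinates in variables makes it easy to point the same script at a different database or collection; the quoting around each --conf value matters because the input URI contains a "?" character.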
Create a SparkSession Object
Note
When you start sparkR, you get a SparkSession object called spark by default. In a standalone R application, you need to create your SparkSession object explicitly, as shown below.
If you specified the spark.mongodb.input.uri and spark.mongodb.output.uri configuration options when you started sparkR, the default SparkSession object uses them.
If you’d rather create your own SparkSession object from within sparkR, you can use sparkR.session() and specify different configuration options.
You can use a SparkSession object to write data to MongoDB, read data from MongoDB, create DataFrames, and perform SQL operations.