Datasets and SQL¶
Datasets¶
The Dataset API provides the type safety and functional programming benefits of RDDs along with the relational model and performance optimizations of the DataFrame API. DataFrame no longer exists as a class in the Java API, so Dataset&lt;Row&gt; must be used to reference a DataFrame going forward.
The following app demonstrates how to create a Dataset with an implicit schema, create a Dataset with an explicit schema, and run SQL queries on the dataset.
Consider a collection named characters:
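The exact contents of the collection are not shown here; for illustration, assume documents of this shape (the names and ages are hypothetical):

```json
{ "name": "Bilbo Baggins", "age": 50 }
{ "name": "Gandalf", "age": 1000 }
{ "name": "Balin", "age": 178 }
```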
Implicitly Declare a Schema¶
To create a Dataset from MongoDB data, load the data via MongoSpark and call the JavaMongoRDD.toDF() method. Despite toDF() sounding like a DataFrame method, it is part of the Dataset API and returns a Dataset&lt;Row&gt;.
When data is read from MongoDB and stored in a Dataset&lt;Row&gt; without a schema-defining Java bean, the schema is inferred by sampling documents from the database. To explicitly declare a schema, see Explicitly Declare a Schema.
The following operation loads data from MongoDB then uses the Dataset API to create a Dataset and infer the schema:
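A minimal sketch of such a load, assuming a local MongoDB deployment at mongodb://127.0.0.1 with a test.characters collection (the URI and names are placeholders):

```java
import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.rdd.api.java.JavaMongoRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.bson.Document;

public final class ImplicitSchemaExample {
    public static void main(String[] args) {
        // Placeholder URIs: point these at your own deployment.
        SparkSession spark = SparkSession.builder()
                .appName("ImplicitSchemaExample")
                .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.characters")
                .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.characters")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Load the collection as an RDD of Documents, then convert it to a
        // Dataset<Row>; the schema is inferred by sampling documents.
        JavaMongoRDD<Document> rdd = MongoSpark.load(jsc);
        Dataset<Row> implicitDS = rdd.toDF();

        implicitDS.printSchema();
        implicitDS.show();

        jsc.close();
    }
}
```

Running this requires a Spark installation with the MongoDB Spark Connector on the classpath and a reachable MongoDB deployment.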
implicitDS.printSchema() outputs the following schema to the console:
implicitDS.show() outputs the following to the console:
Explicitly Declare a Schema¶
By default, reading from MongoDB in a SparkSession infers the schema by sampling documents from the collection. You can also use a Java bean to define the schema explicitly, thus removing the extra queries needed for sampling.
Note
If you provide a Java bean for the schema, MongoDB returns only the declared fields. This helps minimize the data sent across the wire.
The following statement creates a Character Java bean and then uses it to define the schema for the DataFrame:
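The bean itself is ordinary Java: a serializable class with a no-argument constructor and getter/setter pairs for each field. A sketch, assuming the collection's documents carry name and age fields:

```java
import java.io.Serializable;

// A Java bean describing the schema of the characters collection.
// The field names and types must match the document fields to be read.
class Character implements Serializable {
    private String name;
    private Integer age;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public Integer getAge() { return age; }
    public void setAge(Integer age) { this.age = age; }
}
```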
The bean is passed to the toDS( Class&lt;T&gt; beanClass ) method to define the schema for the Dataset:
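A sketch of that conversion, reusing the same placeholder URI and the Character bean from above:

```java
import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.rdd.api.java.JavaMongoRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;
import org.bson.Document;

public final class ExplicitSchemaExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ExplicitSchemaExample")
                .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.characters")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Passing the bean class fixes the schema up front, so no sampling
        // queries are issued and only the declared fields are returned.
        JavaMongoRDD<Document> rdd = MongoSpark.load(jsc);
        Dataset<Character> explicitDS = rdd.toDS(Character.class);

        explicitDS.printSchema();
        explicitDS.show();

        jsc.close();
    }
}
```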
explicitDS.printSchema() outputs the following:
explicitDS.show() outputs the following:
SQL¶
Before running SQL queries on your dataset, you must register a temporary view for the dataset.
The following operation registers a characters table and then queries it to find all characters that are 100 or older:
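A sketch of that query, again assuming the placeholder test.characters collection with name and age fields:

```java
import com.mongodb.spark.MongoSpark;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public final class SqlExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SqlExample")
                .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.characters")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        Dataset<Row> implicitDS = MongoSpark.load(jsc).toDF();

        // Register a temporary view so the dataset can be queried with SQL.
        implicitDS.createOrReplaceTempView("characters");
        Dataset<Row> centenarians =
                spark.sql("SELECT name, age FROM characters WHERE age >= 100");
        centenarians.show();

        jsc.close();
    }
}
```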
centenarians.show() outputs the following:
Save DataFrames to MongoDB¶
The MongoDB Spark Connector provides the ability to persist DataFrames to a collection in MongoDB.
The following operation saves centenarians into the hundredClub collection in MongoDB:
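A sketch of the save, assuming the same placeholder deployment; MongoSpark.save accepts a DataFrameWriter, and the collection option overrides the collection named in spark.mongodb.output.uri:

```java
import com.mongodb.spark.MongoSpark;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public final class SaveExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SaveExample")
                .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.characters")
                .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.hundredClub")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        Dataset<Row> implicitDS = MongoSpark.load(jsc).toDF();
        implicitDS.createOrReplaceTempView("characters");
        Dataset<Row> centenarians =
                spark.sql("SELECT name, age FROM characters WHERE age >= 100");

        // Persist the query result to the hundredClub collection,
        // replacing any existing contents.
        MongoSpark.save(
                centenarians.write()
                        .option("collection", "hundredClub")
                        .mode("overwrite"));

        jsc.close();
    }
}
```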