Fix This Page

Filters and SQL

On this page



When using filters with DataFrames or the R API, the underlying Mongo Connector code constructs an aggregation pipeline to filter the data in MongoDB before sending it to Spark.

Use filter() to read a subset of data from your MongoDB collection.

Consider a collection named fruit that contains the following documents:

{ "_id" : 1, "type" : "apple", "qty" : 5 }
{ "_id" : 2, "type" : "orange", "qty" : 10 }
{ "_id" : 3, "type" : "banana", "qty" : 15 }

First, set up a dataframe to connect with your default MongoDB data source:

df <- read.df("", source = "com.mongodb.spark.sql.DefaultSource")


The empty argument (“”) refers to a file to use as a data source. In this case our data source is a MongoDB collection, so the data source argument is empty.

The following operation filters the data and includes records where the qty field is greater than or equal to 10:

head(filter(df, df$qty >= 10))

The operation prints the following output:

_id qty   type
1   2  10 orange
2   3  15 banana


Before running SQL queries on your dataset, you must register a temporary view for the dataset.

The following example registers a temporary table called temp, then uses SQL to query for records in which the type field contains the letter e:

createOrReplaceTempView(df, "temp")
some_fruit <- sql("SELECT type, qty FROM temp WHERE type LIKE '%e%'")

In the sparkR shell, the operation prints the following output:

type qty
1  apple   5
2 orange  10
←   Aggregation FAQ  →