Docs Menu

Filters and SQL

On this page

  • Filters
  • SQL
Note

When using filters with DataFrames or the Python API, the underlying Mongo Connector code constructs an aggregation pipeline to filter the data in MongoDB before sending it to Spark.

Use filter() to read a subset of data from your MongoDB collection.

Consider a collection named fruit that contains the following documents:

{ "_id" : 1, "type" : "apple", "qty" : 5 }
{ "_id" : 2, "type" : "orange", "qty" : 10 }
{ "_id" : 3, "type" : "banana", "qty" : 15 }

First, set up a dataframe to connect with your default MongoDB data source:

df = spark.read.format("mongo").load()

The following example includes only records in which the qty field is greater than or equal to 10.

df.filter(df['qty'] >= 10).show()

The operation prints the following output:

+---+----+------+
|_id| qty| type|
+---+----+------+
|2.0|10.0|orange|
|3.0|15.0|banana|
+---+----+------+

Before you can run SQL queries against your DataFrame, you need to register a temporary table.

The following example registers a temporary table called temp, then uses SQL to query for records in which the type field contains the letter e:

df.createOrReplaceTempView("temp")
some_fruit = spark.sql("SELECT type, qty FROM temp WHERE type LIKE '%e%'")
some_fruit.show()

In the pyspark shell, the operation prints the following output:

+------+----+
| type| qty|
+------+----+
| apple| 5.0|
|orange|10.0|
+------+----+

On this page

Give Feedback
© 2021 MongoDB, Inc.

About

  • Careers
  • Legal Notices
  • Privacy Notices
  • Security Information
  • Trust Center
© 2021 MongoDB, Inc.