Getting Started with Hadoop

MongoDB and Hadoop are a powerful combination and can be used together to deliver complex analytics and data processing for data stored in MongoDB. The following guide shows how you can start working with the MongoDB Connector for Hadoop. Once you become familiar with the connector, you can use it to pull your MongoDB data into Hadoop Map-Reduce jobs, process the data and return results back to a MongoDB collection.



In order to use the following guide, you should already have Hadoop up and running. This can range from a deployed multi-node cluster to a single-node, pseudo-distributed Hadoop installation running locally. As long as you are able to run any of the examples on your Hadoop installation, you should be all set. The Hadoop connector supports all Apache Hadoop versions 2.X and up, including distributions based on these versions such as CDH4, CDH5, and HDP 2.0 and up.


Install and run the latest version of MongoDB. In addition, the MongoDB commands should be in your system search path (i.e. $PATH).

If your mongod requires authorization [1], the user account used by the Hadoop connector must be able to run the splitVector command on the input database. You can either give the user the clusterManager role, or create a custom role for this:

  role: "hadoopSplitVector",
  privileges: [{
    resource: {
      db: "myDatabase",
      collection: "myCollection"
    actions: ["splitVector"]
  user: "hadoopUser",
  pwd: "secret",
  roles: ["hadoopSplitVector", "readWrite"]

Note that the splitVector command cannot be run through a mongos, so the Hadoop connector will automatically connect to the primary shard in order to run the command. If your input collection is unsharded and the connector reads through a mongos, make sure that your MongoDB Hadoop user is created on the primary shard as well.
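If you are not sure which shard is the primary for your input database, you can look it up in the config database through mongos before creating the user there. The following is a minimal sketch using the MongoDB Java driver; the connection string and the database name myDatabase are placeholders rather than anything defined by the connector:

import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;
import org.bson.Document;

// Sketch: ask mongos which shard is the primary for the input database, so the
// Hadoop user can be created directly on that shard. Host and database name
// below are hypothetical placeholders.
public class FindPrimaryShard {
    public static void main(String[] args) {
        MongoClient mongos = new MongoClient(new MongoClientURI("mongodb://localhost:27017"));
        try {
            // config.databases holds one entry per database; "primary" is the shard name.
            Document dbEntry = mongos.getDatabase("config")
                    .getCollection("databases")
                    .find(new Document("_id", "myDatabase"))
                    .first();
            String shardName = dbEntry.getString("primary");

            // config.shards maps the shard name to its host string.
            Document shard = mongos.getDatabase("config")
                    .getCollection("shards")
                    .find(new Document("_id", shardName))
                    .first();
            System.out.println("Create the Hadoop user on: " + shard.getString("host"));
        } finally {
            mongos.close();
        }
    }
}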

[1] mongod requires clients to authenticate as a user when you specify the security.authorization or security.keyFile settings, or the corresponding --auth or --keyFile command line options.


In addition to Hadoop, you should also have Git, Python, and Java 7 or later installed.


The MongoDB Connector for Hadoop ships with a few examples of how to use the connector in your own setup. In this guide, we’ll focus on the Treasury Yield example. This example can be run via gradle with the following command:

./gradlew jar testJar historicalYield

You should see a lot of output scroll past. To understand what's going on, let's break down the steps. The first thing this task does is import examples/treasury_yield/src/main/resources/yield_historical_in.json into MongoDB. You can view that data as shown below:

$ mongo mongo_hadoop
MongoDB shell version: 3.0.4
connecting to: mongo_hadoop
> show collections
yield_historical.in
> db.yield_historical.in.find()
{ "_id" : ISODate("1990-01-02T00:00:00Z"), "dayOfWeek" : "TUESDAY", "bc3Year" : 7.9, "bc5Year" : 7.87, "bc10Year" : 7.94, "bc20Year" : null, "bc1Month" : null, "bc2Year" : 7.87, "bc3Month" : 7.83, "bc30Year" : 8, "bc1Year" : 7.81, "bc7Year" : 7.98, "bc6Month" : 7.89 }
{ "_id" : ISODate("1990-01-03T00:00:00Z"), "dayOfWeek" : "WEDNESDAY", "bc3Year" : 7.96, "bc5Year" : 7.92, "bc10Year" : 7.99, "bc20Year" : null, "bc1Month" : null, "bc2Year" : 7.94, "bc3Month" : 7.89, "bc30Year" : 8.04, "bc1Year" : 7.85, "bc7Year" : 8.04, "bc6Month" : 7.94 }
Type "it" for more

When you run the examples, Gradle will download the appropriate Hadoop distribution bundle, extract it, and copy over the appropriate dependencies for you. Subsequent runs will not re-download those files, so it's safe to switch between versions of Hadoop. Gradle manages all of this for you, so you need not worry about the details. It is recommended to run ./gradlew clean when changing versions of Hadoop to ensure everything is built against the correct versions of the libraries.

Where to go from here

Read the full documentation on the MongoDB Connector for Hadoop here. To modify configuration options, you can put additional lines in the historicalYield task to set properties passed to Hadoop at runtime. Read the full comments of the script to see details on using these options to read from and write to BSON files as well as MongoDB collections.
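As a rough illustration of the kinds of properties involved, the sketch below configures a job against the collections used by the Treasury Yield example. It is not the shipped example verbatim: the mapper and reducer classes are omitted, and the URIs assume a local, unauthenticated mongod (add credentials to the URI, e.g. mongodb://hadoopUser:secret@..., if authorization is enabled).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

// Sketch of a MapReduce job wired to MongoDB through the connector.
// Collection names follow the Treasury Yield example; everything else is illustrative.
public class TreasuryYieldSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Read input documents from MongoDB and write results back to MongoDB.
        conf.set("mongo.input.uri",
                 "mongodb://localhost:27017/mongo_hadoop.yield_historical.in");
        conf.set("mongo.output.uri",
                 "mongodb://localhost:27017/mongo_hadoop.yield_historical.out");
        // To read static BSON dumps instead, the connector also provides
        // com.mongodb.hadoop.BSONFileInputFormat / BSONFileOutputFormat.

        Job job = Job.getInstance(conf, "treasury-yield-sketch");
        job.setJarByClass(TreasuryYieldSketch.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        // job.setMapperClass(...);   // supply your own Mapper
        // job.setReducerClass(...);  // and Reducer, plus output key/value classes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}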