
$merge

On this page

  • Permissions Required
  • Syntax
  • Fields
  • Considerations
  • Behavior
  • Recommendations for Resolving Duplicates
  • Example

$merge writes the results of the aggregation pipeline to a specified collection. For more information, see $merge pipeline stage. In Atlas Data Lake, $merge can:

  • Write data from any of the supported data stores.
  • Write to the same or different Atlas cluster, database, or collection within the same Atlas project.

To allow writes to an Atlas cluster, Atlas Data Lake provides an alternate syntax for the required into field. See Syntax on this page for more information. Atlas Data Lake supports all other fields as described in the MongoDB server documentation for $merge.

Permissions Required

To use $merge to write to a collection on the Atlas cluster, you must be a database user with privileges that allow writes to that collection.

Syntax

{
  "$merge": {
    "into": {
      "atlas": {
        "projectId": "<atlas-project-ID>",
        "clusterName": "<atlas-cluster-name>",
        "db": "<atlas-database-name>",
        "coll": "<atlas-collection-name>"
      }
    },
    "on": "<identifier field>" | [ "<identifier field1>", ... ],
    "let": { <variables> },
    "whenMatched": "replace|keepExisting|merge|fail|pipeline",
    "whenNotMatched": "insert|discard|fail"
  }
}

Fields

This section only describes the alternate syntax that Atlas Data Lake provides for the into field. To learn more about the other fields, on, let, whenMatched, and whenNotMatched, see the MongoDB server documentation for $merge.

Field         Type     Necessity   Description

atlas         object   Required    Location to write the documents from the
                                   aggregation pipeline.

clusterName   string   Required    Name of the Atlas cluster.

coll          string   Required    Name of the collection on the Atlas cluster.

db            string   Required    Name of the database on the Atlas cluster
                                   that contains the collection.

projectId     string   Optional    Unique identifier of the project that
                                   contains the Atlas cluster. If omitted,
                                   defaults to the ID of the project that
                                   contains your Data Lake.
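
For example, the following sketch spells out every into field, including the optional projectId. The project ID and the cluster, database, and collection names below are hypothetical placeholders:

{
  "$merge": {
    "into": {
      "atlas": {
        "projectId": "0123456789abcdef01234567",
        "clusterName": "reportingCluster",
        "db": "reports",
        "coll": "dailyTotals"
      }
    },
    "on": "_id",
    "whenMatched": "replace",
    "whenNotMatched": "insert"
  }
}

Omitting projectId, as in the Example at the end of this page, is equivalent, because the target cluster must live in the same Atlas project as your Data Lake.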

Considerations

When writing documents from your archive or your data stores to your Atlas cluster, your documents might have duplicate _id fields. This section describes how Atlas Data Lake resolves duplicates and includes recommendations for resolving duplicates in your aggregation pipeline.

Behavior

To resolve duplicates, Atlas Data Lake:

  1. Writes documents to an Atlas collection X in the order it receives the documents until it encounters a duplicate.
  2. Writes the document with the duplicate _id field and all subsequent documents to a new Atlas collection Y.
  3. Runs the specified $merge stage to merge collection Y into collection X.
  4. Writes the resulting documents into the target collection on the specified Atlas cluster.
Note

Atlas Data Lake only resolves duplicate values in the _id field. It doesn't resolve duplicate values in other uniquely indexed fields.
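
For illustration, consider a target collection with a unique index on a field other than _id. A minimal mongosh sketch, with hypothetical collection and field names:

// On the target Atlas cluster: a unique index on "email" (not _id).
db.clustercoll.createIndex( { "email": 1 }, { "unique": true } )

// Documents with distinct _id values but the same "email" value pass
// Atlas Data Lake's _id deduplication; writing the second such document
// violates the unique index, and the $merge stage fails with a duplicate
// key error.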

Recommendations for Resolving Duplicates

To resolve duplicate _id fields, you can:

  1. Include a $sort stage in your pipeline to specify the order in which Atlas Data Lake must process the resulting documents.
  2. Choose the values for the whenMatched and whenNotMatched options of the $merge stage carefully, based on the order of documents flowing into the $merge stage.

    Example

    The following examples show how Atlas Data Lake resolves duplicates during the $merge stage when the whenMatched option is set to keepExisting or replace. These examples use the following documents:

    {
      "_id" : 1,
      "state" : "FL"
    },
    {
      "_id" : 1,
      "state" : "NJ"
    },
    {
      "_id" : 2,
      "state" : "TX"
    }

    Atlas Data Lake writes the first document to collection X, then writes the duplicate { "_id" : 1, "state" : "NJ" } and the subsequent { "_id" : 2, "state" : "TX" } to collection Y. When merging Y into X, whenMatched: keepExisting keeps { "_id" : 1, "state" : "FL" }, while whenMatched: replace produces { "_id" : 1, "state" : "NJ" }. In either case, { "_id" : 2, "state" : "TX" } has no match in collection X and is inserted under the default whenNotMatched: insert.
  3. Avoid using the whenNotMatched: discard option.

    Example

    This example shows how Atlas Data Lake resolves duplicates when the whenNotMatched option is set to discard, using the following documents:

    {
      "_id" : 1,
      "state" : "AZ"
    },
    {
      "_id" : 1,
      "state" : "CA"
    },
    {
      "_id" : 2,
      "state" : "NJ"
    },
    {
      "_id" : 3,
      "state" : "NY"
    },
    {
      "_id" : 4,
      "state" : "TX"
    }

    Suppose you run the following pipeline on the documents listed above:

    db.archivecoll.aggregate([
      {
        "$sort": {
          "_id": 1,
          "state": 1
        }
      },
      {
        "$merge": {
          "into": {
            "atlas": {
              "clusterName": "clustername",
              "db": "clusterdb",
              "coll": "clustercoll"
            }
          },
          "on": "_id",
          "whenMatched": "replace",
          "whenNotMatched": "discard"
        }
      }
    ])

    Atlas Data Lake writes the following data to two collections named X and Y. Collection X receives documents in order until the duplicate _id is encountered:

    { "_id" : 1, "state" : "AZ" }

    Collection Y receives the duplicate document and all subsequent documents:

    { "_id" : 1, "state" : "CA" }
    { "_id" : 2, "state" : "NJ" }
    { "_id" : 3, "state" : "NY" }
    { "_id" : 4, "state" : "TX" }

    Atlas Data Lake merges documents from collection Y into collection X. Because the pipeline specifies whenMatched: replace, Atlas Data Lake replaces the document with _id: 1 in collection X with the document with _id: 1 from collection Y. Because the pipeline specifies whenNotMatched: discard, Atlas Data Lake discards the documents in collection Y that do not match a document in collection X. Therefore, the result of the pipeline with duplicates contains only the following document:

    {
      "_id" : 1,
      "state" : "CA"
    }

    Atlas Data Lake then merges this document into the target collection on the specified Atlas cluster. Note that the documents with _id: 2, _id: 3, and _id: 4 are discarded even though only _id: 1 was duplicated, which is why you should avoid the whenNotMatched: discard option.
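
Putting these recommendations together, the following sketch (reusing the hypothetical collection and cluster names from the example above) sorts the documents before $merge and uses insert rather than discard so that non-duplicate documents are preserved:

db.archivecoll.aggregate([
  // Recommendation 1: sort so documents flow into $merge in a known order.
  {
    "$sort": {
      "_id": 1,
      "state": 1
    }
  },
  {
    "$merge": {
      "into": {
        "atlas": {
          "clusterName": "clustername",
          "db": "clusterdb",
          "coll": "clustercoll"
        }
      },
      "on": "_id",
      // Recommendation 2: with an ascending sort, "replace" keeps the last
      // document seen for each duplicated _id value.
      "whenMatched": "replace",
      // Recommendation 3: "insert" preserves documents that have no match
      // instead of discarding them.
      "whenNotMatched": "insert"
    }
  }
])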

Example

The following example $merge syntax writes the results to a sampleDB.mySampleData collection on the Atlas cluster named myTestCluster. The example doesn't specify a project ID; the $merge stage uses the ID of the project that contains your Data Lake.
{
  "$merge": {
    "into": {
      "atlas": {
        "clusterName": "myTestCluster",
        "db": "sampleDB",
        "coll": "mySampleData"
      }
    }
  }
}
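
To run the stage, append it to an aggregation pipeline on a collection in your Data Lake. A minimal mongosh sketch, assuming a hypothetical Data Lake collection named myDataLakeCollection:

db.myDataLakeCollection.aggregate([
  {
    "$merge": {
      "into": {
        "atlas": {
          "clusterName": "myTestCluster",
          "db": "sampleDB",
          "coll": "mySampleData"
        }
      }
    }
  }
])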