Docs Menu

SQL Schema Format

On this page

Note
Beta

The support for SQL format queries is available as a Beta feature. The feature and the corresponding documentation may change at any time during the Beta stage.

To query data using SQL, Atlas Data Lake needs to be aware of the schema for that data. Data Lake automatically generates a JSON schema for all new collections and views. To learn more about auto-generated schemas, see Querying with SQL.

By default, Data Lake samples data from a single document in your collection or view to generate a JSON schema. If your collection or view contains polymorphic data, you can provide a larger sampling size to Data Lake when manually generating the schema.

Data Lake maps JSON schemas to relational schemas. MongoDB's flexible schema model allows a given field to contain data of multiple types, while relational databases restrict columns to a single data type. The following sections describe the fields supported in the JSON schema, the BSON types that are supported in a relational schema, and how Data Lake resolves conflicts for polymorphic fields when mapped to relational schema.

The schema for a collection is a document with two fields: jsonSchema and version.

"schema" : {
"version" : NumberLong(1),
"jsonSchema" : {}
}

The version field represents the version of the schema format used by the document and the value is always 1. The jsonSchema field is a document that describes the schema of the namespace.

Data Lake supports the following JSON schema fields:

  • bsonType
  • items
  • properties

To learn more about these fields, see JSON Schema Keywords.

Data Lake only supports the following BSON types when mapping JSON schema to relational schema:

  • array
  • binData
  • bool
  • date
  • decimal
  • double
  • int
  • long
  • null
  • object
  • objectId
  • string

Other types are ignored in the relational schema. Fields with composite types, such as objects and arrays, are handled specially.

Object fields are flattened such that each nested field maps to its own column in the relational schema. For example, consider the following eg collection:

{ _id: 0, a: true, b: "bar", c: { x: 1.0, y: 2.0 } }
{ _id: 1, a: true, b: "baz", c: { x: 10.0, y: 20.0 } }

For the above collection, Data Lake generates the following JSON schema:

{
bsonType: "object",
properties: {
_id: { bsonType: "int" },
a: { bsonType: "bool" },
b: { bsonType: "string" },
c: {
bsonType: "object",
properties: {
x: { bsonType: "double" },
y: { bsonType: "double" }
}
}
}
}

In the relational schema, the nested fields c.x and c.y are flattened into their own columns. The following table is a representation of the above schema:

_id
a
b
c.x
c.y
0
true
bar
1.0
2.0
1
true
baz
10.0
20.0

Array fields are mapped to their own tables. For example, consider the following eg collection:

{ _id: 0, a: ["k", "l", "m"] }
{ _id: 1, a: ["x", "y", "z"] }

For the above collection, Data Lake generates the following JSON schema:

{
bsonType: "object",
properties: {
_id: { bsonType: "int" },
a: {
bsonType: "array",
items: {
bsonType: "string"
}
}
}
}

In the relational schema, the array field a is mapped to its own table. The array tables include a column suffixed with _idx to track the index of the row's value in the array. The following tables are representations of the above schema:

Table: eg

id
0
1

Table: eg_a

_id
a_idx
a
0
0
k
0
1
l
0
2
m
1
0
x
1
1
y
1
2
z

MongoDB's flexible schema model allows a given field to contain data of multiple types, while relational databases restrict columns to a single data type. When Data Lake maps the JSON schema to relational schema, type conflicts can occur if a field is polymorphic. There are two main categories of type conversion conflicts that might occur when there are multiple data types:

  • Conflicts between scalar types
  • Conflicts involving composite types like documents and arrays

When a scalar-scalar confict occurs, Data Lake uses the following lattice to determine the least upper bound of the two different data types:

scalar-scalar type lattice
Example

If a given field contains:

  • int, double, and string types, the least upper bound is string.
  • int64 and double types, the least upper bound is decimal128.

When a scalar-scalar conflict occurs, Data Lake takes the least upper bound type and uses that as the single type in the relational schema.

Composite conflicts are type conversion conflicts involving a document or an array. The following sections describe how Data Lake resolves both types of composite conflicts.

When a conflict occurs involving a document, Data Lake displays the fields of the document type as separate columns using dot notation. For example, consider a collection named conflict that contains the following documents:

{ _id: 0, a: "foo" } // "a" is scalar (string)
{ _id: 1, a: { a: "bar", b: "baz" } } // "a" is composite (document)

Data Lake detects a scalar-composite conflict and generates the schema. The following table is a representation of that schema:

Table: conflict

_id
a
a.a
a.b
0
"foo"
NULL
NULL
1
NULL
"bar"
"baz"

When a conflict occurs involving a document, Data Lake inserts a column in the table and lists the different sampled types in the column.

When a conflict occurs involving an array, Data Lake creates a new table for the conflict field with a column for the foreign key, array index, and value. Data Lake unwinds the array into multiple rows in a new table and fills in the columns accordingly. For example, the conflict collection contains the following documents:

{ _id: 0, a: "foo" } // "a" is scalar (string)
{ _id: 1, a: ["bar", "baz"] } // "a" is composite (array)

The following two tables are representations of the relational schema:

Table: conflict

_id
0
1

Table: conflict_a

_id
a_idx
a
0
NULL
"foo"
1
0
"bar"
1
1
"baz"
Give Feedback
MongoDB logo
© 2021 MongoDB, Inc.

About

  • Careers
  • Legal Notices
  • Privacy Notices
  • Security Information
  • Trust Center
© 2021 MongoDB, Inc.