This chapter introduces a detailed explanation of MongoDB statistical analysis, mainly achieved through the Aggregation Pipeline. It is similar to the "group by" statement in SQL. In MongoDB shell, statistical analysis is implemented using the db.collection.aggregate()
function.
Prerequisite Tutorial
General Steps
- Use
$match
to filter the target data - Use
$group
to group and calculate the data - Use
$sort
to sort the results (optional)
Test Data
The data in the orders
collection is as follows
{ _id: 1, cust_id: "abc1", ord_date: ISODate("2012-11-02T17:04:11.102Z"), status: "A", amount: 50 }
{ _id: 2, cust_id: "xyz1", ord_date: ISODate("2013-10-01T17:04:11.102Z"), status: "A", amount: 100 }
{ _id: 3, cust_id: "xyz1", ord_date: ISODate("2013-10-12T17:04:11.102Z"), status: "D", amount: 25 }
{ _id: 4, cust_id: "xyz1", ord_date: ISODate("2013-10-11T17:04:11.102Z"), status: "D", amount: 125 }
{ _id: 5, cust_id: "abc1", ord_date: ISODate("2013-11-12T17:04:11.102Z"), status: "A", amount: 25 }
aggregate Function
db.collection.aggregate(pipeline)
Explanation:
- The
pipeline
takes an array parameter, where each element represents a processing stage.
Example
db.orders.aggregate([
{ $match: { status: "A" } }, // First stage
{ $group: { _id: "$cust_id", total: { $sum: "$amount" } } }, // Second stage
{ $sort: { total: -1 } } // Third stage
])
Equivalent SQL
select sum(amount) as total from orders
where status="A"
group by cust_id
order by total desc
$match Stage
Format:
{ $match: { <query> } }
Explanation:
-
<query>
MongoDB query conditions
Used to set query conditions. If $match
is ignored, it implies querying all data.
Tip: If you are not familiar with MongoDB query syntax, please refer to the previous chapters.
$group Stage
Similar to the group by
clause in SQL, it is used to group the data and then perform a series of statistical calculations on the grouped data.
Basic Usage of $group
Syntax:
{
$group:
{
_id: <expression>, // Grouping condition, for example: group by which field
<field1>: { <accumulator1> : <expression1> }, // Aggregation operation, you can add N aggregation operations
...
}
}
Explanation:
-
- Name of the custom statistical indicator, can be N in total -
- Aggregation function, similar to SQL's sum, avg and other aggregation functions, the difference is that MongoDB's aggregation functions are named with $ as prefix, for example: $sum, $avg -
- Parameter of the aggregation function, usually the field value to be counted, referencing document fields using the format "$field name"
Example:
db.orders.aggregate([
{
$group: {
_id: "$cust_id",
total: { $sum: "$amount" }, // Add the first calculated indicator total, using $sum summation operator
amount_avg: {$avg: "$amount"} // Add the second calculated indicator avg, using $avg average calculation operator
}
}
])
Output:
{ "_id" : "abc1", "total" : 75, "amount_avg" : 37.5 }
{ "_id" : "xyz1", "total" : 250, "amount_avg" : 83.33333333333333 }
Equivalent SQL:
select
sum(amount) as total,
avg(amount) as amount_avg
from orders
group by cust_id
$group Aggregation Functions
Commonly used aggregation functions for $group are as follows:
Operator | Description | Example |
---|---|---|
$avg | Calculate average | {$avg: "$amount"} |
$sum | Summation | {$sum: "$amount"} |
$max | Maximum value | {$max: "$amount"} |
$min | Minimum value | {$min: "$amount"} |
$first | Return data after grouping, the content of the first document | {$first: "$amount"} |
$last | Return data after grouping, the content of the last document | {$last: "$amount"} |
$push | Return data after grouping | { $push: { ord_date: "$ord_date", amount: "$amount" } |
$addToSet | Return data after grouping, different from $push as it removes duplicates | { $addToSet: "$amount" } |
Example of $push
db.orders.aggregate(
[
{
$group:
{
_id: "$cust_id",
all: { $push: { ord_date: "$ord_date", amount: "$amount" } } // Values of ord_date and amount fields
}
}
]
)
Output
{ "_id" : "abc1", "all" : [ { "ord_date" : "2021-04-18 00:00:00", "amount" : 50 }, { "ord_date" : "2021-04-21 00:00:00", "amount" : 25 } ] }
{ "_id" : "xyz1", "all" : [ { "ord_date" : "2021-04-18 00:00:00", "amount" : 100 }, { "ord_date" : "2021-04-20 00:00:00", "amount" : 25 }, { "ord_date" : "2021-04-21 00:00:00", "amount" : 125 } ] }
Example of $addToSet
db.orders.aggregate(
[
{
$group:
{
_id: "$cust_id",
all_amount: { $addToSet: "$amount" } // Returns all distinct amount values
}
}
]
)
Output
{ "_id" : "abc1", "all_amount" : [ 25, 50 ] }
{ "_id" : "xyz1", "all_amount" : [ 100, 25, 125 ] }
$sort:
The $sort stage is typically placed at the end to sort the aggregated data.
Format:
{ $sort: { <field1>: <sort order>, <field2>: <sort order> ... } }
Explanation:
-
, - The names of the fields to be sorted, supports multiple fields. -
- The direction of the sorting, -1 for descending, 1 for ascending.
Example:
db.orders.aggregate([
{ $match: { status: "A" } },
{ $group: { _id: "$cust_id", total: { $sum: "$amount" } } },
{ $sort: { total: -1 } }
])
Aggregate Pagination
We can implement pagination using the $limit and $skip operators.
Example:
db.orders.aggregate([
{ $match: { status: "A" } },
{ $group: { _id: "$cust_id", total: { $sum: "$amount" } } },
{ $sort: { total: -1 } },
{ $limit: 5 }, // Limits the number of returned records, similar to page size in pagination.
{ $skip: 1 } // Skips a certain number of records, similar to the offset in SQL.
])