This chapter introduces a detailed explanation of MongoDB statistical analysis, mainly achieved through the Aggregation Pipeline. It is similar to the "group by" statement in SQL. In MongoDB shell, statistical analysis is implemented using the db.collection.aggregate() function.

Prerequisite Tutorial

MongoDB Aggregation Pipeline

General Steps

  • Use $match to filter the target data
  • Use $group to group and calculate the data
  • Use $sort to sort the results (optional)

Test Data

The data in the orders collection is as follows

{ _id: 1, cust_id: "abc1", ord_date: ISODate("2012-11-02T17:04:11.102Z"), status: "A", amount: 50 }
{ _id: 2, cust_id: "xyz1", ord_date: ISODate("2013-10-01T17:04:11.102Z"), status: "A", amount: 100 }
{ _id: 3, cust_id: "xyz1", ord_date: ISODate("2013-10-12T17:04:11.102Z"), status: "D", amount: 25 }
{ _id: 4, cust_id: "xyz1", ord_date: ISODate("2013-10-11T17:04:11.102Z"), status: "D", amount: 125 }
{ _id: 5, cust_id: "abc1", ord_date: ISODate("2013-11-12T17:04:11.102Z"), status: "A", amount: 25 }

aggregate Function

db.collection.aggregate(pipeline) Explanation:

  • The pipeline takes an array parameter, where each element represents a processing stage.

Example

db.orders.aggregate([
                     { $match: { status: "A" } },  // First stage
                     { $group: { _id: "$cust_id", total: { $sum: "$amount" } } },   // Second stage
                     { $sort: { total: -1 } }  // Third stage
                   ])

Equivalent SQL

select sum(amount) as total from orders 
        where status="A" 
        group by cust_id 
        order by total desc

$match Stage

Format:

{ $match: { <query> } }

Explanation:

  • <query> MongoDB query conditions

Used to set query conditions. If $match is ignored, it implies querying all data.

Tip: If you are not familiar with MongoDB query syntax, please refer to the previous chapters.

$group Stage

Similar to the group by clause in SQL, it is used to group the data and then perform a series of statistical calculations on the grouped data.

Basic Usage of $group

Syntax:

{
  $group:
    {
      _id: <expression>, // Grouping condition, for example: group by which field
      <field1>: { <accumulator1> : <expression1> },  // Aggregation operation, you can add N aggregation operations
      ...
    }
 }

Explanation:

  • - Name of the custom statistical indicator, can be N in total
  • - Aggregation function, similar to SQL's sum, avg and other aggregation functions, the difference is that MongoDB's aggregation functions are named with $ as prefix, for example: $sum, $avg
  • - Parameter of the aggregation function, usually the field value to be counted, referencing document fields using the format "$field name"

Example:

db.orders.aggregate([
                     {
					 	$group: { 
							_id: "$cust_id",
							total: { $sum: "$amount" }, // Add the first calculated indicator total, using $sum summation operator
							amount_avg: {$avg: "$amount"}  // Add the second calculated indicator avg, using $avg average calculation operator
						} 
					}
               ])

Output:

{ "_id" : "abc1", "total" : 75, "amount_avg" : 37.5 }
{ "_id" : "xyz1", "total" : 250, "amount_avg" : 83.33333333333333 }

Equivalent SQL:

select 
	sum(amount) as  total,
	avg(amount) as amount_avg
from orders 
group by cust_id

$group Aggregation Functions

Commonly used aggregation functions for $group are as follows:

Operator Description Example
$avg Calculate average {$avg: "$amount"}
$sum Summation {$sum: "$amount"}
$max Maximum value {$max: "$amount"}
$min Minimum value {$min: "$amount"}
$first Return data after grouping, the content of the first document {$first: "$amount"}
$last Return data after grouping, the content of the last document {$last: "$amount"}
$push Return data after grouping { $push: { ord_date: "$ord_date", amount: "$amount" }
$addToSet Return data after grouping, different from $push as it removes duplicates { $addToSet: "$amount" }

Example of $push

db.orders.aggregate(
   [
     {
       $group:
         {
           _id: "$cust_id",
           all: { $push: { ord_date: "$ord_date", amount: "$amount" } } // Values of ord_date and amount fields
         }
     }
   ]
)

Output

{ "_id" : "abc1", "all" : [ { "ord_date" : "2021-04-18 00:00:00", "amount" : 50 }, { "ord_date" : "2021-04-21 00:00:00", "amount" : 25 } ] }
{ "_id" : "xyz1", "all" : [ { "ord_date" : "2021-04-18 00:00:00", "amount" : 100 }, { "ord_date" : "2021-04-20 00:00:00", "amount" : 25 }, { "ord_date" : "2021-04-21 00:00:00", "amount" : 125 } ] }

Example of $addToSet

db.orders.aggregate(
   [
     {
       $group:
         {
           _id: "$cust_id",
           all_amount: { $addToSet: "$amount" } // Returns all distinct amount values
         }
     }
   ]
)

Output

{ "_id" : "abc1", "all_amount" : [ 25, 50 ] }
{ "_id" : "xyz1", "all_amount" : [ 100, 25, 125 ] }

$sort:

The $sort stage is typically placed at the end to sort the aggregated data.

Format:

{ $sort: { <field1>: <sort order>, <field2>: <sort order> ... } }

Explanation:

  • , - The names of the fields to be sorted, supports multiple fields.
  • - The direction of the sorting, -1 for descending, 1 for ascending.

Example:

db.orders.aggregate([
                     { $match: { status: "A" } },
                     { $group: { _id: "$cust_id", total: { $sum: "$amount" } } },
                     { $sort: { total: -1 } }
                   ])

Aggregate Pagination

We can implement pagination using the $limit and $skip operators.

Example:

db.orders.aggregate([
                     { $match: { status: "A" } },
                     { $group: { _id: "$cust_id", total: { $sum: "$amount" } } },
                     { $sort: { total: -1 } },
					 { $limit: 5 }, // Limits the number of returned records, similar to page size in pagination.
					 { $skip: 1 } // Skips a certain number of records, similar to the offset in SQL.
                   ])