- Aggregation >
- Aggregation Examples >
- Aggregation with the Zip Code Data Set
Aggregation with the Zip Code Data Set¶
The examples in this document use the zipcode collection. This collection is available at: media.mongodb.org/zips.json. Use mongoimport to load this data set into your mongod instance.
Data Model¶
Each document in the zipcode collection has the following form:
{
"_id": "10280",
"city": "NEW YORK",
"state": "NY",
"pop": 5574,
"loc": [
-74.016323,
40.710537
]
}
The _id field holds the zip code as a string.
The city field holds the city name. A city can have more than one zip code associated with it as different sections of the city can each have a different zip code.
The state field holds the two letter state abbreviation.
The pop field holds the population.
The loc field holds the location as a latitude longitude pair.
All of the following examples use the aggregate() helper in the mongo shell. aggregate() provides a wrapper around the aggregate database command. See the documentation for your driver for a more idiomatic interface for data aggregation operations.
Return States with Populations above 10 Million¶
To return all states with a population greater than 10 million, use the following aggregation operation:
db.zipcodes.aggregate( { $group :
{ _id : "$state",
totalPop : { $sum : "$pop" } } },
{ $match : {totalPop : { $gte : 10*1000*1000 } } } )
Aggregations operations using the aggregate() helper process all documents in the zipcodes collection. aggregate() connects a number of pipeline operators, which define the aggregation process.
In this example, the pipeline passes all documents in the zipcodes collection through the following steps:
the $group operator collects all documents and creates documents for each state.
These new per-state documents have one field in addition to the _id field: totalPop which is a generated field using the $sum operation to calculate the total value of all pop fields in the source documents.
After the $group operation the documents in the pipeline resemble the following:
{ "_id" : "AK", "totalPop" : 550043 }
the $match operation filters these documents so that the only documents that remain are those where the value of totalPop is greater than or equal to 10 million.
The $match operation does not alter the documents, which have the same format as the documents output by $group.
The equivalent SQL for this operation is:
SELECT state, SUM(pop) AS totalPop
FROM zipcodes
GROUP BY state
HAVING totalPop >= (10*1000*1000)
Return Average City Population by State¶
To return the average populations for cities in each state, use the following aggregation operation:
db.zipcodes.aggregate( [
{ $group : { _id : { state : "$state", city : "$city" }, pop : { $sum : "$pop" } } },
{ $group : { _id : "$_id.state", avgCityPop : { $avg : "$pop" } } }
] )
Aggregations operations using the aggregate() helper process all documents in the zipcodes collection. aggregate() connects a number of pipeline operators that define the aggregation process.
In this example, the pipeline passes all documents in the zipcodes collection through the following steps:
the $group operator collects all documents and creates new documents for every combination of the city and state fields in the source document. A city can have more than one zip code associated with it as different sections of the city can each have a different zip code.
After this stage in the pipeline, the documents resemble the following:
{ "_id" : { "state" : "CO", "city" : "EDGEWATER" }, "pop" : 13154 }
the second $group operator collects documents by the state field and use the $avg expression to compute a value for the avgCityPop field.
The final output of this aggregation operation is:
{
"_id" : "MN",
"avgCityPop" : 5335
},
Return Largest and Smallest Cities by State¶
To return the smallest and largest cities by population for each state, use the following aggregation operation:
db.zipcodes.aggregate( { $group:
{ _id: { state: "$state", city: "$city" },
pop: { $sum: "$pop" } } },
{ $sort: { pop: 1 } },
{ $group:
{ _id : "$_id.state",
biggestCity: { $last: "$_id.city" },
biggestPop: { $last: "$pop" },
smallestCity: { $first: "$_id.city" },
smallestPop: { $first: "$pop" } } },
// the following $project is optional, and
// modifies the output format.
{ $project:
{ _id: 0,
state: "$_id",
biggestCity: { name: "$biggestCity", pop: "$biggestPop" },
smallestCity: { name: "$smallestCity", pop: "$smallestPop" } } } )
Aggregation operations using the aggregate() helper process all documents in the zipcodes collection. aggregate() combines a number of pipeline operators that define the aggregation process.
All documents from the zipcodes collection pass into the pipeline, which consists of the following steps:
the $group operator collects all documents and creates new documents for every combination of the city and state fields in the source documents.
By specifying the value of _id as a sub-document that contains both fields, the operation preserves the state field for use later in the pipeline. The documents produced by this stage of the pipeline have a second field, pop, which uses the $sum operator to provide the total of the pop fields in the source document.
At this stage in the pipeline, the documents resemble the following:
{ "_id" : { "state" : "CO", "city" : "EDGEWATER" }, "pop" : 13154 }
$sort operator orders the documents in the pipeline based on the value of the pop field from largest to smallest. This operation does not alter the documents.
the second $group operator collects the documents in the pipeline by the state field, which is a field inside the nested _id document.
Within each per-state document this $group operator specifies four fields: Using the $last expression, the $group operator creates the biggestcity and biggestpop fields that store the city with the largest population and that population. Using the $first expression, the $group operator creates the smallestcity and smallestpop fields that store the city with the smallest population and that population.
The documents, at this stage in the pipeline resemble the following:
{ "_id" : "WA", "biggestCity" : "SEATTLE", "biggestPop" : 520096, "smallestCity" : "BENGE", "smallestPop" : 2 }
The final operation is $project, which renames the _id field to state and moves the biggestCity, biggestPop, smallestCity, and smallestPop into biggestCity and smallestCity sub-documents.
The output of this aggregation operation is:
{
"state" : "RI",
"biggestCity" : {
"name" : "CRANSTON",
"pop" : 176404
},
"smallestCity" : {
"name" : "CLAYVILLE",
"pop" : 45
}
}