`summarise()`: Summarizing Data

Learn about `summarise()`: Summarizing Data as part of R Programming for Statistical Analysis and Data Science

Summarizing Data with dplyr's summarise()

In data analysis, a crucial step is to condense large datasets into meaningful summaries. The `summarise()` function from the `dplyr` package in R is a powerful tool for this purpose. It allows you to compute summary statistics for your data, such as means, medians, counts, and more, often grouped by specific categories.

The Core Functionality of summarise()

The `summarise()` function takes a data frame and one or more expressions that define how to compute summary statistics. These expressions typically involve aggregation functions like `mean()`, `median()`, `sd()`, `min()`, `max()`, `sum()`, `n()` (for count), `n_distinct()` (for count of unique values), etc. You can create new columns in the output data frame to store these summary statistics.

`summarise()` collapses rows into a single summary row.

Think of summarise() as an aggregation tool. It takes many rows of data and boils them down to a single row (or a few rows if used with group_by()) that represents a summary of the original data.

When you apply summarise() to a data frame without any prior grouping, it collapses the entire data frame into a single row. Each column in this output row holds one summary statistic, computed from whichever column of the original data frame its expression refers to. For example, you might calculate the average age of all individuals in a dataset or the total number of observations.
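As a minimal sketch of the ungrouped case, the snippet below computes an average age and a row count. The `people` data frame, its values, and the output column names `avg_age` and `n_obs` are made up for illustration and are not part of the original material.

```r
library(dplyr)

# Hypothetical data: four individuals and their ages.
people <- tibble(
  name = c("Ana", "Ben", "Chloe", "Dev"),
  age  = c(23, 35, 41, 29)
)

# No group_by() beforehand, so the whole data frame collapses to one row.
overall <- people %>%
  summarise(
    avg_age = mean(age),  # average age across all rows
    n_obs   = n()         # total number of observations
  )

print(overall)
```

The result has exactly one row and two columns, one per named expression inside summarise().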

What is the primary purpose of the summarise() function in dplyr?

To collapse rows of a data frame into a single row (or grouped rows) by computing summary statistics.

Using summarise() with group_by()

The real power of `summarise()` is unleashed when combined with `group_by()`. This combination allows you to calculate summary statistics for different subgroups within your data. For instance, you can find the average salary for each department or the number of customers in each city.

When you use `group_by()` before `summarise()`, the aggregation functions operate independently on each group. The output will have one row for each group, containing the calculated summary statistics for that specific group.

Consider a dataset of car models with columns for 'manufacturer', 'model', 'cylinders', and 'mpg' (miles per gallon). If we want to find the average MPG for each manufacturer, we would first group by 'manufacturer' and then summarise by calculating the mean of 'mpg'. The output would show each manufacturer and their corresponding average MPG.
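The car scenario above might look like the following sketch. The `cars_tbl` data frame and all of its values are invented here purely to have something to run. Only the column names come from the description.

```r
library(dplyr)

# Hypothetical car data using the column names from the text.
cars_tbl <- tibble(
  manufacturer = c("audi", "audi", "ford", "ford", "ford", "honda"),
  model        = c("a4", "a6", "focus", "fiesta", "mustang", "civic"),
  cylinders    = c(4, 6, 4, 4, 8, 4),
  mpg          = c(30, 24, 32, 34, 18, 36)
)

# Group first, then summarise: mean(mpg) is computed once per manufacturer.
mpg_by_maker <- cars_tbl %>%
  group_by(manufacturer) %>%
  summarise(avg_mpg = mean(mpg))

print(mpg_by_maker)
```

With these made-up values the result has three rows, one per manufacturer, each holding that manufacturer's average MPG.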

| Function | Purpose | Output Structure |
| --- | --- | --- |
| `summarise()` (alone) | Compute summary statistics for the entire dataset. | A single row with summary statistics. |
| `group_by()` + `summarise()` | Compute summary statistics for each group within the dataset. | One row per group, with summary statistics for each group. |

Common Aggregation Functions

Here are some commonly used aggregation functions within `summarise()`:

  • `mean(x)`: Calculates the arithmetic mean of a vector `x`.
  • `median(x)`: Calculates the median of a vector `x`.
  • `sd(x)`: Calculates the standard deviation of a vector `x`.
  • `min(x)`: Finds the minimum value in a vector `x`.
  • `max(x)`: Finds the maximum value in a vector `x`.
  • `sum(x)`: Calculates the sum of a vector `x`.
  • `n()`: Counts the number of observations in the current group.
  • `n_distinct(x)`: Counts the number of unique values in a vector `x`.
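To see the functions listed above side by side, here is a small sketch that applies all of them in one summarise() call. The `scores` data frame, its values, and the output column names are hypothetical, chosen only so the results are easy to check by hand.

```r
library(dplyr)

# Hypothetical data: scores for two teams.
scores <- tibble(
  team  = c("a", "a", "a", "b", "b"),
  score = c(10, 20, 20, 5, 15)
)

stats <- scores %>%
  group_by(team) %>%
  summarise(
    mean_score   = mean(score),       # arithmetic mean
    median_score = median(score),     # middle value
    sd_score     = sd(score),         # sample standard deviation
    min_score    = min(score),        # smallest value
    max_score    = max(score),        # largest value
    total_score  = sum(score),        # sum of all values
    n_rows       = n(),               # number of rows in the group
    n_unique     = n_distinct(score)  # number of distinct values
  )

print(stats)
```

Note that `n()` takes no arguments because it counts rows in the current group, whereas `n_distinct(score)` looks at the values of one column. Team "a" has three rows but only two distinct scores.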

Remember to handle missing values (NA) appropriately. Many aggregation functions have an na.rm = TRUE argument to exclude NA values from calculations.
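The effect of missing values can be sketched as follows. The `readings` data frame and the column names are hypothetical. The point is only to contrast the default behaviour with `na.rm = TRUE`.

```r
library(dplyr)

# Hypothetical data with one missing value.
readings <- tibble(value = c(4, NA, 8, 6))

na_demo <- readings %>%
  summarise(
    mean_default = mean(value),                # NA: a single NA propagates
    mean_na_rm   = mean(value, na.rm = TRUE),  # mean of the non-missing values
    n_missing    = sum(is.na(value)),          # how many values are missing
    n_rows       = n()                         # n() counts rows, NA or not
  )

print(na_demo)
```

Reporting the number of missing values alongside the `na.rm = TRUE` result is one way to keep dropped observations visible in the summary.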

Practical Examples

Let's illustrate with a common scenario, using R's built-in `iris` dataset, which contains measurements for three iris species.

To find the average sepal length, the maximum petal width, and the number of observations for each iris species:

```r
library(dplyr)

iris_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    avg_sepal_length = mean(Sepal.Length),
    max_petal_width = max(Petal.Width),
    count = n()
  )

print(iris_summary)
```

This code will output a table showing each `Species`, the `avg_sepal_length` for that species, the `max_petal_width` observed in that species, and the `count` of observations for each species.

What does n() count within a summarise() operation after group_by()?

It counts the number of rows (observations) in the current group.
