This tutorial continues the discussion of the dplyr package. You’ll learn how to arrange, filter, and group rows in R.
The previous lesson was on column operations. This time, the focus will be on row operations in dplyr.
We’ll cover the basics: sorting and filtering a dataset, then aggregating and summarizing records. For a sense of what to expect from this lesson, think of a pivot table in MS Excel.
Open a new R script in RStudio.
Similar to the column operations lesson, this demonstration will use the Lahman dataset package. If it isn’t installed yet, install it from CRAN by running install.packages("Lahman").
To bring the Lahman package into R, run library(Lahman). To enable the dplyr package, run library(tidyverse). Also, remember that a best practice for naming conventions in R is to use lowercase letters, so assign Teams to teams.
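The setup described above could look like the following minimal sketch. It assumes both packages are already installed. The install.packages() line is commented out and only needs to be run once.

```r
# install.packages(c("Lahman", "tidyverse"))  # one-time install from CRAN

library(Lahman)     # provides the Teams data frame
library(tidyverse)  # loads dplyr along with the other tidyverse packages

# Lowercase copy of Teams, following the naming convention above
teams <- Teams

# Quick look at the size of the data frame (rows, columns)
dim(teams)
```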
Basic Functions For Row Operations
1. Arrange Rows In R
The first row operation in dplyr is arrange(). This function allows you to reorder rows. It takes the data frame df as its first argument, followed by the fields to sort by.
For example, let’s sort by teamID. Run arrange(teams, teamID).
If you want the rows arranged in descending order, wrap the field in the desc() function.
As an example, if you want to sort by year in descending order, run arrange(teams, desc(yearID)).
When you do this, you’re not assigning the output back to teams. You’re just seeing the result in the Console.
It’s also possible to sort by multiple criteria. For example, if you want to sort by teamID and then by yearID in descending order, you only need to run arrange(teams, teamID, desc(yearID)).
When you’re sorting rows, you’re not changing the data. The data is just being moved around. Nothing is being added or removed.
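The three sorts discussed in this section can be sketched together as follows. The object names by_team, by_year_desc, and by_team_year are hypothetical and are used here only so the results can be inspected. The tutorial itself prints the results straight to the Console.

```r
library(Lahman)
library(tidyverse)

teams <- Teams

# Ascending sort by team
by_team <- arrange(teams, teamID)

# Descending sort by year
by_year_desc <- arrange(teams, desc(yearID))

# Team first, then most recent year first within each team
by_team_year <- arrange(teams, teamID, desc(yearID))

# Sorting only reorders rows, so the dimensions are unchanged
identical(dim(by_team_year), dim(teams))

# Peek at the two sort columns
head(select(by_team_year, teamID, yearID))
```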
2. Filter Rows In R
The filter() function keeps the rows that meet the criteria you specify and drops the rest. Its basic form is filter(df, condition).
As an example, let’s get all the data where the yearID is greater than or equal to 2000. Follow the filter function’s format and fill in the needed information: filter(teams, yearID >= 2000). Don’t forget to assign the result to a new object. In this case, it’s assigned to modern.
To check if the rows were indeed filtered, you can use the dim() function. It gives the number of rows and columns in the data frame.
If you run dim(teams), you’ll see that the data frame has 2,955 rows and 48 columns. The exact count depends on the version of Lahman you have installed.
If you run the dim function on modern, you’ll see that the number of rows has been reduced to 630 while the number of columns remains the same.
The number of rows dropped because every record from before 2000 was filtered out.
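A minimal sketch of the filter-and-check steps above. The row counts you see may differ from the ones quoted in this lesson if your Lahman version is newer or older.

```r
library(Lahman)
library(tidyverse)

teams <- Teams

# Keep only the records from 2000 onwards
modern <- filter(teams, yearID >= 2000)

# Compare dimensions before and after filtering
dim(teams)
dim(modern)

# The earliest year left in modern should be 2000 or later
min(modern$yearID)
```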
Filter Rows By Multiple Fields
It’s also possible to filter rows by multiple fields in R. You’ll need to use the AND and OR operators.
For example, let’s filter teams by area. In this case, a new object ohio is created. The filter criteria are that teamID should be Cleveland AND Cincinnati at the same time.
You need to use the double equal sign (==) to check equality. A single equal sign means assignment or, inside a function call, argument naming, so it doesn’t test anything. Use the ampersand (&) to represent AND.
To check, use the dim function. You’ll see that the number of rows is 0.
This means that no row has a teamID that is both Cleveland and Cincinnati at once, since each row holds a single team.
Next, let’s try Cleveland OR Cincinnati. The OR operator is represented by the vertical bar ( | ). So, all you need to do is replace the ampersand with the vertical bar and then run it. Afterwards, run the dim function again.
You’ll see that there are 251 rows rather than zero.
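The AND and OR filters can be compared side by side in a sketch like the one below. It assumes the teamID codes "CLE" for Cleveland and "CIN" for Cincinnati, which this lesson doesn’t spell out. The name both is hypothetical and is used only to keep the AND result separate from ohio.

```r
library(Lahman)
library(tidyverse)

teams <- Teams

# AND: a single row can never have two different teamID values,
# so this returns zero rows
both <- filter(teams, teamID == "CLE" & teamID == "CIN")
dim(both)

# OR: rows where teamID matches either code (codes are assumed)
ohio <- filter(teams, teamID == "CLE" | teamID == "CIN")
dim(ohio)
```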
Now what if you forget to use a double equal sign and instead use just one? Here’s what happens:
The Console in RStudio will show a very helpful error message from dplyr reminding you to use the double equal sign.
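If you’d like to reproduce the mistake without halting a script, one option is to wrap the call in tryCatch(), as in this sketch. The "CLE" code and the name result are assumptions for illustration, and the exact wording of the message depends on your dplyr version.

```r
library(Lahman)
library(tidyverse)

teams <- Teams

# A single = inside filter() is read as a named argument, not a comparison,
# so dplyr raises an error. tryCatch() captures it instead of stopping.
result <- tryCatch(
  filter(teams, teamID = "CLE"),
  error = function(e) e
)

# TRUE if the call failed as expected
inherits(result, "error")

# Print the message dplyr produced
cat(conditionMessage(result), "\n")
```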
3. Group By And Summarise Rows In R
The group_by() function allows you to aggregate records by selected columns and then, based on that aggregation, summarise another column.
The group_by() function follows this pattern: group_by(df, field).
As an example, let’s group by teamID and assign it to a new object. In this case, the new object is called teams_ID. Then, print it.
In the Console, you’ll notice that the first line says it’s a tibble.
A tibble is a tidyverse improvement over the basic data frame. It’s a feature in the package that augments and improves what’s available out of the box.
The second line is Groups. So, the data is now grouped by the teamID column.
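A minimal sketch of the grouping step. The group_vars() call is an extra check that isn’t part of the original walkthrough. It simply reports which columns the data is grouped by.

```r
library(Lahman)
library(tidyverse)

teams <- Teams

# Group the records by team. The data itself is unchanged,
# but dplyr now remembers the grouping.
teams_ID <- group_by(teams, teamID)

# Printing shows the tibble header followed by the Groups line
teams_ID

# Confirm the grouping column
group_vars(teams_ID)
```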
And with that, you can now use the summarise() function on those groups.
Note: the function can be spelled with an s or a z. dplyr accepts both summarise() and summarize(), so use whichever matches your preference for British or American English.
For example, let’s summarise teams_ID and get some basic summary statistics. Let’s look for the mean, minimum, and maximum of wins (the W column) for each team. If the code spans several lines, remember to highlight all of it before clicking Run.
You can then see in the Console that a summary of each team’s statistics is displayed. This is very similar to a pivot table where you’re aggregating and summarizing data.
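The summary step might look like the sketch below. The output column names mean_w, min_w, and max_w, and the object name team_summary, are hypothetical choices. Only the W column and the teams_ID grouping come from the lesson.

```r
library(Lahman)
library(tidyverse)

teams <- Teams
teams_ID <- group_by(teams, teamID)

# One row per team, with summary statistics of wins (W)
team_summary <- summarise(
  teams_ID,
  mean_w = mean(W),
  min_w  = min(W),
  max_w  = max(W)
)

team_summary
```

Each row of the result corresponds to one teamID, which is what makes it comparable to a pivot table with teams as the row labels.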
To recap, two kinds of operations in dplyr have been discussed. A previous tutorial focused on column operations. Meanwhile, this lesson showed you how to perform row operations using the dplyr package in RStudio. Specifically, you learned how to arrange, filter, and group rows in R.
The next thing to learn is how to combine these two kinds of operations. Using all the functions you’ve learned so far will greatly assist you in writing code in R. However, an even more helpful technique is the pipeline, which lets everything flow together. So, make sure to review the next tutorials as well.