Chapter 7 Cleaning data

Cleaning a data set, including its fields and the shape of the data, is a fundamental skill for getting data into an analyzable format. Real-world data is messy, and will not typically be analyzable immediately after downloading.

For this chapter we will still use the company database, along with some more packages. The new packages are lubridate and tidyr.

library(dplyr) # For manipulating data
library(tidyr) # For making data long and wide

library(lubridate) # For date handling

library(DBI) # For database connections

dir.create("data", showWarnings = FALSE)

download.file(url = "https://github.com/jonlinca/auditanalytics/raw/master/data/rauditanalytics.sqlite",
             destfile = "data/rauditanalytics.sqlite", mode = "wb")

con <- dbConnect(RSQLite::SQLite(), "data/rauditanalytics.sqlite")

gl <- tbl(con, 'gl') %>%
  collect()

vendors <- tbl(con, 'vendors') %>%
  collect()

industry <- dbReadTable(con, 'industry') %>%
  collect()

dbDisconnect(con)

7.1 Basic math

We covered some of the basic operators in R in the Completeness chapter; here we’ll go through some more calculations and transformations.

x <- c(1, 2, 3)
y <- c(4, 5, 6)

The basic mathematical symbols operate element-wise: each position in one vector is calculated against the same position in the other vector. In this case, the first number in x is added to the first number in y resulting in the first element, the second number in x is added to the second number in y resulting in the second element, and so on.

y + x
## [1] 5 7 9

This distinction is important - vectorization allows R to be fast, but it changes how the user needs to understand objects.
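Vectorization also explains what happens when the two objects have different lengths: R recycles the shorter vector to match the longer one. A minimal sketch of the most common case, a single number applied against a whole vector:

```r
x <- c(1, 2, 3)

# The length-1 vector 10 is recycled across every element of x
x + 10
## [1] 11 12 13
```

Recycling a vector whose length does not evenly divide the longer one also works, but produces a warning, which is usually a sign of a mistake.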

sum() is fairly straightforward - it will add all the elements given together:

sum(x)
## [1] 6

If you specify multiple vectors, it will merely combine them all together - sum(x, y) is equivalent to sum(c(1, 2, 3, 4, 5, 6)):

sum(x, y)
## [1] 21

We mentioned NA values in the prior chapter - ‘Not Available’, or missing, values. It is important to understand their implication in your calculations. For example, when an NA value exists in a vector, you will get NA back by default:

z <- c(1, 2, NA)
sum(z)
## [1] NA

While sum(), and several other functions like mean(), min() and max(), generally behave in predictable manners, they all have an argument of na.rm = FALSE by default (enter ?sum to see this argument in the help file). The default of FALSE brings attention to the missing values in the vector. If you want to perform the calculation anyway, you either need to substitute the NA for a value (imputation), or simply ignore the missing values. Both decisions require professional judgment. To substitute NA for a value (for example, 10):

is.na(z) # Shows which positions in the vector have an NA value
## [1] FALSE FALSE  TRUE
z[is.na(z)] <- 10 # Where z is NA, assign 10
sum(z)
## [1] 13

Or alternatively, set the argument na.rm = TRUE to throw away these values completely:

z <- c(1, 2, NA)
sum(z, na.rm = TRUE)
## [1] 3
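The same na.rm argument applies to the other summary functions mentioned above; a quick sketch:

```r
z <- c(1, 2, NA)

mean(z) # NA propagates, just like sum()
## [1] NA
mean(z, na.rm = TRUE) # (1 + 2) / 2
## [1] 1.5
min(z, na.rm = TRUE); max(z, na.rm = TRUE)
## [1] 1
## [1] 2
```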

7.2 Dates and Times

Unlike numbers, date and times are inherently more complex. They require many additional considerations to properly wrangle:

  • What format is the date and time in?
  • Are the timezones as expected? This also applies for dates.
  • Do you want the month the transaction took place in? The year? The day-of-week?
  • Do you care about daylight savings time?

Starting with a basic example, let’s understand the basic characteristics of dates. When you define a new date, the lubridate::as_date() function (or base as.Date()) converts character strings to dates.

startString <- '2020-07-30' # ISO 8601 standards, follow them

# base::as.Date(startString) # Same thing as as_date, for now...
start <- lubridate::as_date(startString)

class(start)
## [1] "Date"

You can convert any character string to a date, as long as you specify the mask, or format. A mask specifies where the day, month, and year exist (and, for times, the hours, minutes, and seconds). If you check the help file for ?strptime, you will see several different formats for specifying dates and times. For example, %Y is a four-digit year while %y is a two-digit year. Here are a few examples with the same date (some are more unreasonable than others, but all can be captured):

lubridate::as_date('2020-07-30', format = "%Y-%m-%d")
lubridate::as_date('30/07/2020', format = "%d/%m/%Y") # Not ISO 8601
lubridate::as_date('July 30, 20', format = "%B %d, %y") # Really not ISO 8601
lubridate::as_date('Jul3020', format = "%b%d%y") # Definitely not ISO 8601

The same concept also applies to datetimes, where you can convert a given time to a ‘POSIXct’ or ‘POSIXlt’ format (the differences between ‘POSIXct’ and ‘POSIXlt’ are found in ?DateTimeClasses, under details):

starttime_base <- as.POSIXct('2020-07-30 12:00:00')
starttime_lub <- lubridate::as_datetime('2020-07-30 12:00:00')

class(starttime_base); class(starttime_lub)
## [1] "POSIXct" "POSIXt"
## [1] "POSIXct" "POSIXt"

And format masks apply as well to date time, just in case you get something unconventional:

lubridate::as_datetime('2020-07-30 12:00:00 PM -0600', 
                       format = "%Y-%m-%d %I:%M:%S %p %z")
## [1] "2020-07-30 18:00:00 UTC"

7.2.1 Understanding Default Time Zones

Did you consider the time zone that the times were created in? This is system-specific and base POSIXct and lubridate as_datetime behave differently, so you have to be careful!

print(starttime_base); attr(starttime_base,"tzone")
## [1] "2020-07-30 12:00:00 MDT"
## [1] ""

Notice that base R creates the time in your local machine’s current timezone, although that timezone is not extractable from the object (and, you will eventually find, hard to change).

print(starttime_lub); tz(starttime_lub)
## [1] "2020-07-30 12:00:00 UTC"
## [1] "UTC"

The lubridate version will assign the date-time automatically to UTC, unless you specify it. It is for this reason you will generally want to avoid using base R to convert to date times, and stick with lubridate for better predictability.

Having an automatic conversion to UTC isn’t bad per se - having a universal ground truth in UTC is good, as it removes the daylight saving component and is an easy way to standardize multiple data sets together. Unfortunately, many systems will not tell you what time zone their data is in, so you will have to figure it out and ensure R is aligned accordingly.
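When aligning R with a system's time zone, base R can tell you what zone your machine is using and what zone names R will accept (the output of Sys.timezone() will vary by machine, so none is shown here):

```r
Sys.timezone() # The time zone your R session is currently using

# OlsonNames() lists every time zone name R accepts; grep to narrow it down
grep('Canada', OlsonNames(), value = TRUE)
```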

7.2.2 Override or display timezones

Hopefully you have been convinced to stick with lubridate for your date transformations. Now, let’s get some confidence with these timezone transformations.

The first method is force_tz(), which forces the current time to become that same clock time in a new timezone. When we assigned July 30th to the date, we would have assumed this date was in our current timezone (generally, the system time zone). To apply that assumption, specify the date, along with the time zone you want the date or time to be harmonized to:

starttime_lub # Starts as UTC
## [1] "2020-07-30 12:00:00 UTC"
force_tz(starttime_lub, tz = 'Canada/Mountain') # Force the saved time to be Mountain
## [1] "2020-07-30 12:00:00 MDT"

Alternatively, you may wish to only convert the time so it is displayed in the correct timezone, but leave the base truth as is. with_tz() will allow you to change the timezone that is displayed, while honoring the original instant in the data.

starttime_lub
## [1] "2020-07-30 12:00:00 UTC"
with_tz(starttime_lub, tz = 'Canada/Mountain') # Display the current time as Mountain
## [1] "2020-07-30 06:00:00 MDT"

7.2.3 Daylight Saving Implications

There are also timezone implications when it comes to daylight saving time. This is especially important when making audit decisions based on duration, or expected local norms for time-of-day.

In several locales across the world, daylight saving time alters the local clock for part of the year, but has no impact on UTC. For example, your staff may start their day at 8:00 AM locally. When daylight saving kicks in, the local time stays at 8:00 AM while the UTC time shifts from 15:00 to 14:00.

pre <- as_datetime('2020-03-07 8:00:00', tz = 'Canada/Mountain') # The day before DST starts
post <- as_datetime('2020-03-08 8:00:00', tz = 'Canada/Mountain') # The day DST started

dst(pre); with_tz(pre, tz = 'UTC') # Was DST active before? ; What is the UTC time before? 
## [1] FALSE
## [1] "2020-03-07 15:00:00 UTC"
dst(post); with_tz(post, tz = 'UTC') # Was DST active after? ; What is the UTC time after?
## [1] TRUE
## [1] "2020-03-08 14:00:00 UTC"

7.2.4 Differences between time

Another common calculation is the time difference between two points in time. This seemingly simple action has several interpretations behind it. Do you want relative dates? Do you want exact elapsed time?

People are not always consistent in how they expect time to behave. Sometimes the passage of time is a monotone progression of instants that should be as mathematically reliable as the number line. On other occasions time must follow complex conventions and rules so that the clock times we see reflect what we expect to observe in terms of daylight, season, and congruence with the atomic clock. (Vitalie Spinu 2016)

Adding time is easy - you simply need to decide if you’re adding a period (i.e. the human construct of days) or if you are following physical time (i.e. seconds). The main difference is choosing the lubridate functions that honor periods versus those that honor durations.

pre <- as_datetime('2020-03-07 8:00:00', tz = 'Canada/Mountain') # The day before DST starts
post <- as_datetime('2020-03-08 8:00:00', tz = 'Canada/Mountain') # The day DST started, skipped ahead 1 hour

# Periods: Human construct of days
pre + days(1) ; pre + hours(24) # Equal to post, the human construct
## [1] "2020-03-08 08:00:00 MDT"
## [1] "2020-03-08 08:00:00 MDT"
# Duration: Physical construct of time
pre + ddays(1) ; pre + dhours(24) # NOT to post, but actually 1 hour after
## [1] "2020-03-08 09:00:00 MDT"
## [1] "2020-03-08 09:00:00 MDT"

Calculating intervals and periods is a bit different. If we go back up to our example of pre- and post-daylight saving time, we could expect the answer to be either 1 day or 23 hours. Again, depending on your use case, it is up to you.

span <- interval(pre, post)

# Period: To find the number of days, in human constructs
span / days(1) # Period calculation by days. 
## [1] 1
# Duration - To find the hours between, in physical constructs
as.duration(span)
## [1] "82800s (~23 hours)"

7.2.5 Derived date information

A common task for auditors is to find out the month a transaction took place in. With lubridate, it’s quite straightforward - to extract the month, for example, you don’t need any messy substrings (nor for extracting other useful components, for that matter):

thedate <- as_datetime('2020-07-01 12:00:00', tz = 'Canada/Mountain') 

month(thedate)
## [1] 7
year(thedate)
## [1] 2020
quarter(thedate) # Canada Day starts on the first day of the third quarter. Does your company Financial Year not start on January 1? Just change the fiscal_start argument.
## [1] 3
wday(thedate, label = TRUE)
## [1] Wed
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
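Related to extracting components, lubridate’s floor_date() rounds a datetime down to a chosen unit, which is handy when you want to group transactions into monthly buckets rather than just label them. A small sketch:

```r
library(lubridate)

thedate <- as_datetime('2020-07-15 12:34:56', tz = 'Canada/Mountain')

# Round down to the first instant of the month; 'week', 'quarter',
# and 'year' are also accepted units
floor_date(thedate, unit = 'month')
```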

7.2.6 Cleaning numerical dates

With dates, sometimes they may be represented as a number - it is important to understand your source system’s dates before you convert them.

Some systems may calculate a date from an ‘origin’ using the number. While our concept of time may start with 0 AD (as the origin), several computer systems and applications do not behave in that manner. Excel, for example, counts from 1900-01-01, whereas Unix operating systems have their origin at 1970-01-01. If converting a date from a number, you need to research and then specify the origin, otherwise you will have challenges in creating correct dates.
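To see the effect of an origin directly, convert the same number against two different origins (base as.Date() shown here for a minimal illustration; lubridate’s as_date() takes the same origin argument, as used below):

```r
# Day zero lands on whatever origin you declare
as.Date(0, origin = '1970-01-01') # Unix-style origin
## [1] "1970-01-01"
as.Date(0, origin = '1900-01-01') # Excel-style origin
## [1] "1900-01-01"
```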

In our vendors table, we have a numeric date, but also the same date in character form for reference:

vendors %>%
  select(date_added, date_added_char)
## # A tibble: 9 x 2
##   date_added date_added_char
##        <dbl> <chr>          
## 1      16631 2015-07-15     
## 2      16644 2015-07-28     
## 3      16581 2015-05-26     
## 4      16578 2015-05-23     
## 5      16620 2015-07-04     
## 6      16652 2015-08-05     
## 7      16578 2015-05-23     
## 8      16620 2015-07-04     
## 9      16652 2015-08-05

By specifying a few different origins, we can see what the origin should have been in this case:

vendors %>%
  select(date_added, date_added_char) %>%
  mutate(date_added_unix = as_date(date_added, origin = '1970-01-01'),
         date_added_excel = as_date(date_added, origin = '1900-01-01'))
## # A tibble: 9 x 4
##   date_added date_added_char date_added_unix date_added_excel
##        <dbl> <chr>           <date>          <date>          
## 1      16631 2015-07-15      2015-07-15      1945-07-15      
## 2      16644 2015-07-28      2015-07-28      1945-07-28      
## 3      16581 2015-05-26      2015-05-26      1945-05-26      
## 4      16578 2015-05-23      2015-05-23      1945-05-23      
## 5      16620 2015-07-04      2015-07-04      1945-07-04      
## 6      16652 2015-08-05      2015-08-05      1945-08-05      
## 7      16578 2015-05-23      2015-05-23      1945-05-23      
## 8      16620 2015-07-04      2015-07-04      1945-07-04      
## 9      16652 2015-08-05      2015-08-05      1945-08-05

Dates can also appear in a completely different format. Julian (also known as mainframe or ordinal) dates are an example, where the year is specified and the day is counted relative to January 1. So day 212 is either July 31 in a regular year or July 30 in a leap year.

lubridate::as_date('19212', format = "%y%j")
## [1] "2019-07-31"
lubridate::as_date('20212', format = "%y%j")
## [1] "2020-07-30"

JDE E1’s dates also add an extra digit in front, to symbolize the century. While there is no format code in R to translate 0 to 19xx and 1 to 20xx, practically speaking we are safe assuming the century, as we are at least 20 years past that point (as of this writing). Amending the mask to skip the leading 1 can be done in the format:

lubridate::as_date('119212', format = "1%y%j")
## [1] "2019-07-31"
lubridate::as_date('120212', format = "1%y%j")
## [1] "2020-07-30"

7.3 Separate

Occasionally, distinct data will be combined in the same cell. This can be due to data entry or an unusual method of storing information. There will generally be a separator within the field (similar to a delimiter), but the data is still contained within one field.

For example, if we look at our vendor table and name column, we can see the company names appear to also have a Contact person.

vendors %>%
  select(type, name)
## # A tibble: 9 x 2
##   type  name                                          
##   <chr> <chr>                                         
## 1 E     al-Bangura, Habeeba                           
## 2 E     Akina, Duncan                                 
## 3 E     el-Kazmi, Muneefa                             
## 4 V     el-Shafi Company Ltd - Contact: Rivera, Evan  
## 5 V     Bray Company Ltd - Contact: el-Jaffer, Shaheer
## 6 V     Chrisman Company Ltd - Contact: Burch, Kryslyn
## 7 V     el-Shafi Company Ltd - Contact: Rivera, Evan  
## 8 V     Bray Company Ltd - Contact: el-Jaffer, Shaheer
## 9 V     Chrisman Company Ltd - Contact: Burch, Kryslyn

While the name is accurate, the contact name shouldn’t be in there. We can isolate the contact name by using separate(), allowing us to split a column by defining a separator and also new column names. In this case, the ‘-’ (dash) character could work:

vendors %>%
  select(type, name) %>%
  separate(name, c('name', 'company_contact'), sep = '-')
## Warning: Expected 2 pieces. Additional pieces discarded in 4 rows [4, 5, 7, 8].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [2].
## # A tibble: 9 x 3
##   type  name                    company_contact           
##   <chr> <chr>                   <chr>                     
## 1 E     "al"                    "Bangura, Habeeba"        
## 2 E     "Akina, Duncan"          <NA>                     
## 3 E     "el"                    "Kazmi, Muneefa"          
## 4 V     "el"                    "Shafi Company Ltd "      
## 5 V     "Bray Company Ltd "     " Contact: el"            
## 6 V     "Chrisman Company Ltd " " Contact: Burch, Kryslyn"
## 7 V     "el"                    "Shafi Company Ltd "      
## 8 V     "Bray Company Ltd "     " Contact: el"            
## 9 V     "Chrisman Company Ltd " " Contact: Burch, Kryslyn"

Sometimes a character isn’t specific enough. We can actually customize the separator even further by changing its argument to be more specific. In this case, perhaps we want both the dash and the word “Contact:”:

vendors %>%
  select(id, type, name) %>%
  separate(name, c('name', 'company_contact'), sep = '- Contact:')
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3 rows [1, 2, 3].
## # A tibble: 9 x 4
##      id type  name                    company_contact      
##   <int> <chr> <chr>                   <chr>                
## 1   793 E     "al-Bangura, Habeeba"    <NA>                
## 2   666 E     "Akina, Duncan"          <NA>                
## 3    77 E     "el-Kazmi, Muneefa"      <NA>                
## 4  2726 V     "el-Shafi Company Ltd " " Rivera, Evan"      
## 5  2612 V     "Bray Company Ltd "     " el-Jaffer, Shaheer"
## 6  2211 V     "Chrisman Company Ltd " " Burch, Kryslyn"    
## 7  2726 V     "el-Shafi Company Ltd " " Rivera, Evan"      
## 8  2612 V     "Bray Company Ltd "     " el-Jaffer, Shaheer"
## 9  2211 V     "Chrisman Company Ltd " " Burch, Kryslyn"

What is nice about this functionality is that if the separator isn’t found, the missing values are filled with NA. The warning is good to see (as the employees don’t have a company contact), and can be safely ignored in this case.
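If you would rather not see those warnings at all, separate() has extra and fill arguments that state your intent explicitly. A sketch with a toy data frame (the values mirror two rows of the vendors table):

```r
library(tidyr)

df <- data.frame(name = c('Akina, Duncan', # No separator at all
                          'Bray Company Ltd - Contact: el-Jaffer, Shaheer')) # Several dashes

# extra = 'merge' keeps everything after the first dash in one piece;
# fill = 'right' quietly fills the missing piece with NA
df %>%
  separate(name, c('name', 'company_contact'), sep = '-', extra = 'merge', fill = 'right')
```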

7.4 Separate_rows

Similarly, grouped data can also be combined in the same cell, typically representing a one-to-many (or many-to-one) relationship. This can be due to data entry or unusual output formats. Consider our industry table, which is an industry-to-company mapping table, telling us which companies work in which industry:

industry
##          id            industry
## 1      2726         Health Care
## 2 2612,2211 Commerical Supplies

As of now, there would be no way to join the industry, and specifically the value Commercial Supplies, into our vendor table. The unique key is the id number of the company, which gets joined to the vendor table.

When we run into a situation where a value may belong to the same key multiple times (multiple IDs are associated with one industry), we can use separate_rows() to break up that row into several rows. For this function, we need to specify the columns that need to be separated - in this case, id needs to be broken apart.

clean_industry <- industry %>%
  separate_rows(id, convert = TRUE) # Convert will try to guess the data type for each field, based on the content that was separated

print(clean_industry)
## # A tibble: 3 x 2
##      id industry           
##   <int> <chr>              
## 1  2726 Health Care        
## 2  2612 Commerical Supplies
## 3  2211 Commerical Supplies

separate_rows() will automatically expand the column that needs to be separated, while replicating the content for that same value into subsequent rows.

7.5 Tidy data

The three key concepts in ‘tidy data’ (G. G. Hadley Wickham 2017) are that:

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

While you can get away with performing basic analysis and checks on untidy data, having the ability to move data between formats increases competency for cleaning, flexibility for plotting, and legibility for outputting tables in a readable format.

Let’s start by cleaning the vendor table:

clean_vendors <- vendors %>%
  separate(name, c('name', 'company_contact'), sep = '- Contact: ') %>% # From the separate() section
  select(-date_added_char) %>% # Remove one of the date fields 
  mutate(date_added = as_date(date_added, origin = '1970-01-01')) %>% # From the cleaning dates section
  filter(type == 'V')
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3 rows [1, 2, 3].
print(clean_vendors)
## # A tibble: 6 x 10
##      id type  name  company_contact date_added summary `2015` `2016` `2017`
##   <int> <chr> <chr> <chr>           <date>     <chr>    <dbl>  <dbl>  <dbl>
## 1  2726 V     "el-… Rivera, Evan    2015-05-23 spend   65326. 8.34e4 1.50e5
## 2  2612 V     "Bra… el-Jaffer, Sha… 2015-07-04 spend   88270. 1.49e5 2.54e5
## 3  2211 V     "Chr… Burch, Kryslyn  2015-08-05 spend   51847. 8.93e4 1.42e5
## 4  2726 V     "el-… Rivera, Evan    2015-05-23 credits   509. 8.60e2 9.09e2
## 5  2612 V     "Bra… el-Jaffer, Sha… 2015-07-04 credits   777. 1.09e3 1.85e3
## 6  2211 V     "Chr… Burch, Kryslyn  2015-08-05 credits   221. 3.69e2 6.57e2
## # … with 1 more variable: `2018` <dbl>

This vendor data set is a combination of both long and wide, and needs to be standardized to fit the tidy data model. Specifically:

  • Each variable must have its own column - The vendor’s spend and credits for each year are separated over multiple columns. That is, the values of a variable (the year) are in the column names. In the end output, every combination of vendor-year should be a row.
  • Each observation must have its own row - Each vendor has both a spend and a credits value, within the summary column. Even though these form one observation per vendor-year, they are spread over two rows.
  • Each value must have its own cell - Thankfully, this seems to be valid (although if not, you can clean it with separate() and separate_rows()).

Why would you want to subscribe to this tidy data philosophy? When you have a tidy data set, you are more empowered to explore your data (especially with tidyverse commands), and you have the power to reshape the data as necessary to fit the visual output you are imagining.

7.6 Pivots, wide and longer

Effectively pivoting your data into more suitable columns and rows requires you to consider the following:

  • What columns contain the values you want to aggregate?
  • What do you want the new name of these columns to be? Consider both the name and aggregated value.

In the clean_vendors data, we see each year represented as columns, and those columns represent dollar amounts.

clean_vendors
## # A tibble: 6 x 10
##      id type  name  company_contact date_added summary `2015` `2016` `2017`
##   <int> <chr> <chr> <chr>           <date>     <chr>    <dbl>  <dbl>  <dbl>
## 1  2726 V     "el-… Rivera, Evan    2015-05-23 spend   65326. 8.34e4 1.50e5
## 2  2612 V     "Bra… el-Jaffer, Sha… 2015-07-04 spend   88270. 1.49e5 2.54e5
## 3  2211 V     "Chr… Burch, Kryslyn  2015-08-05 spend   51847. 8.93e4 1.42e5
## 4  2726 V     "el-… Rivera, Evan    2015-05-23 credits   509. 8.60e2 9.09e2
## 5  2612 V     "Bra… el-Jaffer, Sha… 2015-07-04 credits   777. 1.09e3 1.85e3
## 6  2211 V     "Chr… Burch, Kryslyn  2015-08-05 credits   221. 3.69e2 6.57e2
## # … with 1 more variable: `2018` <dbl>

We can pivot this data by specifying the years, then gathering the amounts. In this case, we want to take the columns and turn them into rows, making the data longer. pivot_longer() will allow us to make this transformation:

clean_vendors %>%
  pivot_longer(cols = c(`2015`, `2016`, `2017`, `2018`), # The columns we want to pivot, in this case, the years. 
               names_to = 'year', # We want this collected column to be called the year
               values_to = 'amount') # the values of the aggregation
## # A tibble: 24 x 8
##       id type  name            company_contact   date_added summary year  amount
##    <int> <chr> <chr>           <chr>             <date>     <chr>   <chr>  <dbl>
##  1  2726 V     "el-Shafi Comp… Rivera, Evan      2015-05-23 spend   2015  6.53e4
##  2  2726 V     "el-Shafi Comp… Rivera, Evan      2015-05-23 spend   2016  8.34e4
##  3  2726 V     "el-Shafi Comp… Rivera, Evan      2015-05-23 spend   2017  1.50e5
##  4  2726 V     "el-Shafi Comp… Rivera, Evan      2015-05-23 spend   2018  5.56e4
##  5  2612 V     "Bray Company … el-Jaffer, Shahe… 2015-07-04 spend   2015  8.83e4
##  6  2612 V     "Bray Company … el-Jaffer, Shahe… 2015-07-04 spend   2016  1.49e5
##  7  2612 V     "Bray Company … el-Jaffer, Shahe… 2015-07-04 spend   2017  2.54e5
##  8  2612 V     "Bray Company … el-Jaffer, Shahe… 2015-07-04 spend   2018  1.79e5
##  9  2211 V     "Chrisman Comp… Burch, Kryslyn    2015-08-05 spend   2015  5.18e4
## 10  2211 V     "Chrisman Comp… Burch, Kryslyn    2015-08-05 spend   2016  8.93e4
## # … with 14 more rows

This data is more readable - each year is on its own row. However, we still have the issue where the same company has two rows for each year: within the summary field, both a spend and a credits value exist for each vendor:

clean_vendors %>%
    pivot_longer(cols = c(`2015`, `2016`, `2017`, `2018`), # The columns we want to pivot, in this case, the years. 
               names_to = 'year', # We want this collected column to be called the year
               values_to = 'amount')  %>% # the values of the aggregation 
  arrange(id) %>%
  head(10)
## # A tibble: 10 x 8
##       id type  name            company_contact   date_added summary year  amount
##    <int> <chr> <chr>           <chr>             <date>     <chr>   <chr>  <dbl>
##  1  2211 V     "Chrisman Comp… Burch, Kryslyn    2015-08-05 spend   2015  5.18e4
##  2  2211 V     "Chrisman Comp… Burch, Kryslyn    2015-08-05 spend   2016  8.93e4
##  3  2211 V     "Chrisman Comp… Burch, Kryslyn    2015-08-05 spend   2017  1.42e5
##  4  2211 V     "Chrisman Comp… Burch, Kryslyn    2015-08-05 spend   2018  7.68e4
##  5  2211 V     "Chrisman Comp… Burch, Kryslyn    2015-08-05 credits 2015  2.21e2
##  6  2211 V     "Chrisman Comp… Burch, Kryslyn    2015-08-05 credits 2016  3.69e2
##  7  2211 V     "Chrisman Comp… Burch, Kryslyn    2015-08-05 credits 2017  6.57e2
##  8  2211 V     "Chrisman Comp… Burch, Kryslyn    2015-08-05 credits 2018  4.58e2
##  9  2612 V     "Bray Company … el-Jaffer, Shahe… 2015-07-04 spend   2015  8.83e4
## 10  2612 V     "Bray Company … el-Jaffer, Shahe… 2015-07-04 spend   2016  1.49e5

Since spend and credits belong to a single vendor-year pairing, it may be worthwhile to put both summary classifications into their own columns. In this case, we want to make the table wider by adding columns, so let’s use pivot_wider(). This function mirrors pivot_longer() in its approach:

  • What column contains the values that will become the new column names?
  • Where are the values located? These will be moved under the new columns.

Building from our last example, let’s make this data a bit wider:

clean_vendors %>%
  pivot_longer(cols = c(`2015`, `2016`, `2017`, `2018`), # The columns we want to pivot, in this case, the years. 
               names_to = 'year', # We want this collected column to be called the year
               values_to = 'amount')  %>% # the values of the aggregation 
  pivot_wider(names_from = 'summary',
              values_from = 'amount')
## # A tibble: 12 x 8
##       id type  name            company_contact   date_added year   spend credits
##    <int> <chr> <chr>           <chr>             <date>     <chr>  <dbl>   <dbl>
##  1  2726 V     "el-Shafi Comp… Rivera, Evan      2015-05-23 2015  6.53e4    509.
##  2  2726 V     "el-Shafi Comp… Rivera, Evan      2015-05-23 2016  8.34e4    860.
##  3  2726 V     "el-Shafi Comp… Rivera, Evan      2015-05-23 2017  1.50e5    909.
##  4  2726 V     "el-Shafi Comp… Rivera, Evan      2015-05-23 2018  5.56e4    348.
##  5  2612 V     "Bray Company … el-Jaffer, Shahe… 2015-07-04 2015  8.83e4    777.
##  6  2612 V     "Bray Company … el-Jaffer, Shahe… 2015-07-04 2016  1.49e5   1094.
##  7  2612 V     "Bray Company … el-Jaffer, Shahe… 2015-07-04 2017  2.54e5   1854.
##  8  2612 V     "Bray Company … el-Jaffer, Shahe… 2015-07-04 2018  1.79e5   1395.
##  9  2211 V     "Chrisman Comp… Burch, Kryslyn    2015-08-05 2015  5.18e4    221.
## 10  2211 V     "Chrisman Comp… Burch, Kryslyn    2015-08-05 2016  8.93e4    369.
## 11  2211 V     "Chrisman Comp… Burch, Kryslyn    2015-08-05 2017  1.42e5    657.
## 12  2211 V     "Chrisman Comp… Burch, Kryslyn    2015-08-05 2018  7.68e4    458.

Each company now has its spend and credits in separate columns, but is now digestible by year.

7.7 Joining data

Often our data will come from different data sets or different tables. Having the ability to combine data from different systems is a powerful way to add more analyzable information to your dataset. In the Completeness chapter, we joined accounts from both the GL and TB to calculate whether there were any differences.

The Relational Data chapter in R for Data Science goes into depth how the different join operations work.
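As a brief sketch of what such a join looks like with dplyr, here is left_join() applied to toy stand-ins for the vendors table and the clean_industry mapping built in the separate_rows() section (the data frames below are illustrative, not the full tables):

```r
library(dplyr)

# Minimal stand-ins for the vendors and cleaned industry tables
vendors_small <- data.frame(id = c(2726, 2612, 2211),
                            name = c('el-Shafi Company Ltd',
                                     'Bray Company Ltd',
                                     'Chrisman Company Ltd'))

clean_industry <- data.frame(id = c(2726, 2612, 2211),
                             industry = c('Health Care',
                                          'Commerical Supplies',
                                          'Commerical Supplies'))

# left_join() keeps every row of vendors_small, attaching industry
# wherever id matches; unmatched vendors would get NA for industry
vendors_small %>%
  left_join(clean_industry, by = 'id')
```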