Chapter 7 Cleaning data
Cleaning a data set - including its fields and the shape of the data - is a fundamental skill for getting data into an analyzable format. Real-world data is messy, and will typically not be analyzable immediately after downloading.
For this section we will still use the company database, along with some more packages. New packages in this chapter include lubridate and tidyr.
library(dplyr) # For manipulating data
library(tidyr) # For making data long and wide
library(lubridate) # For date handling
library(DBI) # For database connections
dir.create("data", showWarnings = FALSE)
download.file(url = "https://github.com/jonlinca/auditanalytics/raw/master/data/rauditanalytics.sqlite",
destfile = "data/rauditanalytics.sqlite", mode = "wb")
con <- dbConnect(RSQLite::SQLite(), "data/rauditanalytics.sqlite")

gl <- tbl(con, 'gl') %>%
  collect()

vendors <- tbl(con, 'vendors') %>%
  collect()

industry <- dbReadTable(con, 'industry') %>%
  collect()
dbDisconnect(con)
7.1 Basic math
We covered some of the basic operators in R in the Completeness chapter, so we’ll cover some more calculations and transformations.
x <- c(1, 2, 3)
y <- c(4, 5, 6)
The basic mathematical operators combine two vectors element-by-element, matching positions between them. In this case, the first number in x is added to the first number in y to give the first result, the second number in x is added to the second number in y to give the second result, and so on.
x + y
## [1] 5 7 9
This distinction is important - vectorization allows R to be fast, but it changes how the user needs to understand objects.
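A related behaviour worth knowing is recycling: if the two vectors have different lengths, R silently repeats the shorter one (and warns only when the longer length is not a multiple of the shorter). A quick illustrative sketch:

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20)

a + b # b is recycled to c(10, 20, 10, 20)

## [1] 11 22 13 24
```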
sum() is fairly straightforward - it will add all the elements given together:
sum(x)
## [1] 6
If you specify multiple vectors, it will merely combine them all together, equivalent to sum(c(1, 2, 3, 4, 5, 6)):
sum(x, y)
## [1] 21
We mentioned in the prior chapter that NA values are 'Not Available' or missing values. It is important to understand their implications for your calculations. For example, when an NA value exists in a vector, you will get an NA back by default:
z <- c(1, 2, NA)

sum(z)
## [1] NA
While sum(), and several other functions like mean(), min() and max(), all generally behave in predictable manners, they all have an argument of na.rm = FALSE by default (enter ?sum to see this argument in the help file). The reason FALSE is the default is that it brings attention to the missing values in the vector. If you want to perform the calculation anyways, you either need to substitute the NA out for a value (imputation), or simply ignore it. Both of these decisions require professional judgment. To substitute NA for a value (for example, 10):
is.na(z) # Shows which positions in the vector have an NA value

## [1] FALSE FALSE TRUE

z[is.na(z)] <- 10 # Where the z value is NA, assign 10

sum(z)

## [1] 13
Or alternatively, set the argument na.rm = TRUE to throw away these values completely:
z <- c(1, 2, NA)

sum(z, na.rm = TRUE)
## [1] 3
7.2 Dates and Times
Unlike numbers, dates and times are inherently more complex. They require many additional considerations to wrangle properly:
- What format is the date and time in?
- Are the timezones as expected? This also applies for dates.
- Do you want the month the transaction took place in? The year? The day-of-week?
- Do you care about daylight savings time?
Starting with a basic example, let's understand the basic characteristics of dates. When you define a new date, the lubridate::as_date() function (or base as.Date()) converts character strings to dates.
startString <- '2020-07-30' # ISO 8601 standards, follow them

# base::as.Date(startString) # Same thing as as_date, for now...
start <- lubridate::as_date(startString)

class(start)
## [1] "Date"
You can convert any date to a date, as long as you specify the mask or format. A mask specifies where the day, month and year exist (as well as, for times, the hours, minutes and seconds). If you check the help file for ?strptime, you will see several different formats for specifying both dates and times. For example, %Y is a four digit year while %y is a two digit year. Here are a few examples with the same date (some are more unreasonable than others, but all can be captured):
lubridate::as_date('2020-07-30', format = "%Y-%m-%d")
lubridate::as_date('30/07/2020', format = "%d/%m/%Y") # Not ISO 8601
lubridate::as_date('July 30, 20', format = "%B %d, %y") # Really not ISO 8601
lubridate::as_date('Jul3020', format = "%b%d%y") # Definitely not ISO 8601
The same concept also applies to datetimes, where you can convert a given time to a ‘POSIXct’ or ‘POSIXlt’ format (the differences between ‘POSIXct’ and ‘POSIXlt’ are found in ?DateTimeClasses
, under details):
starttime_base <- as.POSIXct('2020-07-30 12:00:00')
starttime_lub <- lubridate::as_datetime('2020-07-30 12:00:00')

class(starttime_base); class(starttime_lub)
## [1] "POSIXct" "POSIXt"
## [1] "POSIXct" "POSIXt"
And format masks apply to datetimes as well, just in case you get something unconventional:
lubridate::as_datetime('2020-07-30 12:00:00 PM -0600',
                       format = "%Y-%m-%d %I:%M:%S %p %z")
## [1] "2020-07-30 18:00:00 UTC"
7.2.1 Understanding Default Time Zones
Did you consider the time zone the times were created in? This is system-specific, and base POSIXct and lubridate's as_datetime() behave differently, so you have to be careful!
print(starttime_base); attr(starttime_base,"tzone")
## [1] "2020-07-30 12:00:00 MDT"
## [1] ""
Notice that base creates the time in your local machine's timezone, although the timezone itself isn't extractable from the object (and, you will eventually find, is hard to change).
print(starttime_lub); tz(starttime_lub)
## [1] "2020-07-30 12:00:00 UTC"
## [1] "UTC"
The lubridate version will assign the date-time automatically to UTC, unless you specify it. It is for this reason you will generally want to avoid using base R to convert to date times, and stick with lubridate for better predictability.
Having an automatic conversion to UTC isn't bad per se - having a universal ground truth in UTC is good, as it removes the daylight saving component and is an easy way to standardize multiple data sets together. Unfortunately, many systems will not tell you what time zone their data is in, so you will have to figure it out and ensure R is aligned accordingly.
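If you already know the timezone of the source system, you can declare it at parse time instead of repairing it afterwards - as_datetime() accepts a tz argument. A small sketch (the Canada/Mountain timezone here is just an assumed example):

```r
library(lubridate)

# Declare the source timezone while parsing, rather than defaulting to UTC
t <- as_datetime('2020-07-30 12:00:00', tz = 'Canada/Mountain')

tz(t)

## [1] "Canada/Mountain"
```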
7.2.2 Override or display timezones
Hopefully you have been convinced to stick with lubridate for your date transformations. Now, let's get some confidence in these timezone transformations.
The first method is force_tz(), which forces the current clock time to become that same hour in a new timezone. When we assigned July 30th to the date, we would have assumed this date was in the timezone we live in (generally, your system's time zone). To apply this, specify the date, but also the time zone you want the date or time to be harmonized in:
starttime_lub # Starts as UTC
## [1] "2020-07-30 12:00:00 UTC"
force_tz(starttime_lub, tz = 'Canada/Mountain') # Force the saved time to be Mountain
## [1] "2020-07-30 12:00:00 MDT"
Alternatively, you may wish to only convert the time so it's displayed in the correct timezone, but leave the base truth as-is. with_tz() will allow you to change the timezone that is displayed, while honoring the original instant in the data.
starttime_lub
## [1] "2020-07-30 12:00:00 UTC"
with_tz(starttime_lub, tz = 'Canada/Mountain') # Display the current time as Mountain
## [1] "2020-07-30 06:00:00 MDT"
7.2.3 Daylight Saving Implications
There are also timezone implications when it comes to daylight saving time. This is especially important when making audit decisions based on duration, or expected local norms for time-of-day.
In several locales across the world, daylight saving time will alter the clock for half a year locally, but has no impact on UTC. For example, your staff may start their day at 8:00 AM locally. When daylight saving kicks in, the local time stays at 8:00 AM and the UTC time shifts from 15:00 to 14:00.
pre <- as_datetime('2020-03-07 8:00:00', tz = 'Canada/Mountain') # The day before DST starts
post <- as_datetime('2020-03-08 8:00:00', tz = 'Canada/Mountain') # The day DST started
dst(pre); with_tz(pre, tz = 'UTC') # Was DST active before? ; What is the UTC time before?
## [1] FALSE
## [1] "2020-03-07 15:00:00 UTC"
dst(post); with_tz(post, tz = 'UTC') # Was DST active after? ; What is the UTC time after?
## [1] TRUE
## [1] "2020-03-08 14:00:00 UTC"
7.2.4 Differences between time
Another common calculation is to calculate the time differences between two points of time. This seemingly simple action has several interpretations behind it. Do you want relative dates? Do you want exact elapsed time?
People are not always consistent in how they expect time to behave. Sometimes the passage of time is a monotone progression of instants that should be as mathematically reliable as the number line. On other occasions time must follow complex conventions and rules so that the clock times we see reflect what we expect to observe in terms of daylight, season, and congruence with the atomic clock. (Vitalie Spinu 2016)
Adding time is easy - you simply need to decide if you're adding a period (i.e. the human construct of days) or following physical time (i.e. elapsed seconds). The main difference is choosing the lubridate functions that honor periods versus durations.
pre <- as_datetime('2020-03-07 8:00:00', tz = 'Canada/Mountain') # The day before DST starts
post <- as_datetime('2020-03-08 8:00:00', tz = 'Canada/Mountain') # The day DST started, skipped ahead 1 hour

# Periods: Human construct of days
pre + days(1) ; pre + hours(24) # Equal to post, the human construct

## [1] "2020-03-08 08:00:00 MDT"
## [1] "2020-03-08 08:00:00 MDT"

# Duration: Physical construct of time
pre + ddays(1) ; pre + dhours(24) # NOT equal to post - exactly 24 physical hours after pre

## [1] "2020-03-08 09:00:00 MDT"
## [1] "2020-03-08 09:00:00 MDT"
Calculating intervals and periods is a bit different. If we go back up to our example of pre-and-post daylight saving time, we can either expect this answer to be 1 day or 23 hours. Again, depending on your use case, it is up to you.1
span <- interval(pre, post)

# Period: To find the number of days, in human constructs
span / days(1) # Period calculation by days
## [1] 1
# Duration - To find the hours between, in physical constructs
as.duration(span)
## [1] "82800s (~23 hours)"
7.2.5 Derived date information
A common task for auditors is to find the month a transaction took place in. With lubridate, it's quite straightforward - to extract the month (or other useful components, for that matter), you don't need any messy substrings:
thedate <- as_datetime('2020-07-01 12:00:00', tz = 'Canada/Mountain')
month(thedate)
## [1] 7
year(thedate)
## [1] 2020
quarter(thedate) # Canada Day is the first day of the third quarter. If your company's fiscal year doesn't start on January 1, just change the fiscal_start argument
## [1] 3
wday(thedate, label = TRUE)
## [1] Wed
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
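These accessors combine naturally with filtering. As a sketch on made-up dates (not from the company database), flagging weekend transactions might look like:

```r
library(lubridate)

dates <- as_date(c('2020-07-01', '2020-07-04', '2020-07-05')) # Wed, Sat, Sun

# With the default week start, wday() returns 1 for Sunday and 7 for Saturday
dates[wday(dates) %in% c(1, 7)]

## [1] "2020-07-04" "2020-07-05"
```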
7.2.6 Cleaning numerical dates
With dates, sometimes they may be represented as a number - it is important to understand your source system's date encoding before you convert it.
Some systems represent a date as the number of days elapsed from an 'origin'. While our concept of time may start at 0 AD (as the origin), several computer systems and applications do not behave in that manner. Excel, for example, counts from 1900-01-01, whereas Unix operating systems have their origin at 1970-01-01. If converting a number to a date, you need to research and then specify the origin, otherwise your dates will be silently shifted.
In our vendors table, we have a numeric date, but also a date in character form for reference:
vendors %>%
  select(date_added, date_added_char)
## # A tibble: 9 x 2
## date_added date_added_char
## <dbl> <chr>
## 1 16631 2015-07-15
## 2 16644 2015-07-28
## 3 16581 2015-05-26
## 4 16578 2015-05-23
## 5 16620 2015-07-04
## 6 16652 2015-08-05
## 7 16578 2015-05-23
## 8 16620 2015-07-04
## 9 16652 2015-08-05
By specifying a few different origins, we can see what the origin should have been in this case:
vendors %>%
  select(date_added, date_added_char) %>%
  mutate(date_added_unix = as_date(date_added, origin = '1970-01-01'),
         date_added_excel = as_date(date_added, origin = '1900-01-01'))
## # A tibble: 9 x 4
## date_added date_added_char date_added_unix date_added_excel
## <dbl> <chr> <date> <date>
## 1 16631 2015-07-15 2015-07-15 1945-07-15
## 2 16644 2015-07-28 2015-07-28 1945-07-28
## 3 16581 2015-05-26 2015-05-26 1945-05-26
## 4 16578 2015-05-23 2015-05-23 1945-05-23
## 5 16620 2015-07-04 2015-07-04 1945-07-04
## 6 16652 2015-08-05 2015-08-05 1945-08-05
## 7 16578 2015-05-23 2015-05-23 1945-05-23
## 8 16620 2015-07-04 2015-07-04 1945-07-04
## 9 16652 2015-08-05 2015-08-05 1945-08-05
Dates can also appear in a completely different format. Julian (also known as mainframe or ordinal) dates are an example, where the year is specified but the day is counted relative to January 1. So day 212 is either July 30 or July 31, depending on whether it is a leap year.
lubridate::as_date('19212', format = "%y%j")
## [1] "2019-07-31"
lubridate::as_date('20212', format = "%y%j")
## [1] "2020-07-30"
JDE E1's dates also add an extra digit in front, to symbolize the century. While there is no format code to translate 0 to 19xx and 1 to 20xx, practically speaking we are safe assuming the century, as we are at least 20 years past that point (as of this writing). Amending the mask to treat the leading 1 as a literal does the trick:
lubridate::as_date('119212', format = "1%y%j")
## [1] "2019-07-31"
lubridate::as_date('120212', format = "1%y%j")
## [1] "2020-07-30"
7.3 Separate
Occasionally, distinct data will be combined in the same cell. This can be due to data entry or an unusual method of storing information. There will generally be a separator within the field (similar to a delimiter), but the data is still contained within one field.
For example, if we look at the name column of our vendors table, we can see the company names appear to also contain a contact person.
vendors %>%
  select(type, name)
## # A tibble: 9 x 2
## type name
## <chr> <chr>
## 1 E al-Bangura, Habeeba
## 2 E Akina, Duncan
## 3 E el-Kazmi, Muneefa
## 4 V el-Shafi Company Ltd - Contact: Rivera, Evan
## 5 V Bray Company Ltd - Contact: el-Jaffer, Shaheer
## 6 V Chrisman Company Ltd - Contact: Burch, Kryslyn
## 7 V el-Shafi Company Ltd - Contact: Rivera, Evan
## 8 V Bray Company Ltd - Contact: el-Jaffer, Shaheer
## 9 V Chrisman Company Ltd - Contact: Burch, Kryslyn
While the name is accurate, the contact name shouldn't be in there. We can isolate the contact name by using separate(), which splits a column once we define a separator and new column names. In this case, the '-' (dash) character could work:
vendors %>%
  select(type, name) %>%
  separate(name, c('name', 'company_contact'), sep = '-')
## Warning: Expected 2 pieces. Additional pieces discarded in 4 rows [4, 5, 7, 8].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [2].
## # A tibble: 9 x 3
## type name company_contact
## <chr> <chr> <chr>
## 1 E "al" "Bangura, Habeeba"
## 2 E "Akina, Duncan" <NA>
## 3 E "el" "Kazmi, Muneefa"
## 4 V "el" "Shafi Company Ltd "
## 5 V "Bray Company Ltd " " Contact: el"
## 6 V "Chrisman Company Ltd " " Contact: Burch, Kryslyn"
## 7 V "el" "Shafi Company Ltd "
## 8 V "Bray Company Ltd " " Contact: el"
## 9 V "Chrisman Company Ltd " " Contact: Burch, Kryslyn"
Sometimes a single character isn't specific enough. We can customize the separator further by making its argument more specific. In this case, perhaps we want both the dash and the word "Contact:":
vendors %>%
  select(id, type, name) %>%
  separate(name, c('name', 'company_contact'), sep = '- Contact:')
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3 rows [1, 2, 3].
## # A tibble: 9 x 4
## id type name company_contact
## <int> <chr> <chr> <chr>
## 1 793 E "al-Bangura, Habeeba" <NA>
## 2 666 E "Akina, Duncan" <NA>
## 3 77 E "el-Kazmi, Muneefa" <NA>
## 4 2726 V "el-Shafi Company Ltd " " Rivera, Evan"
## 5 2612 V "Bray Company Ltd " " el-Jaffer, Shaheer"
## 6 2211 V "Chrisman Company Ltd " " Burch, Kryslyn"
## 7 2726 V "el-Shafi Company Ltd " " Rivera, Evan"
## 8 2612 V "Bray Company Ltd " " el-Jaffer, Shaheer"
## 9 2211 V "Chrisman Company Ltd " " Burch, Kryslyn"
What is nice about this functionality is that if the separator isn't found, the missing values are filled with NA. The warning is good to see (the employees don't have a company contact), and can be safely ignored in this case.
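If you would rather state that intent explicitly and avoid the warnings, separate() has fill and extra arguments. A sketch on a couple of made-up rows:

```r
library(tidyr)

df <- data.frame(name = c('Akina, Duncan',
                          'el-Shafi Company Ltd - Contact: Rivera, Evan'))

df %>%
  separate(name, c('name', 'company_contact'), sep = '- Contact:',
           fill = 'right', # Missing pieces become NA on the right, without a warning
           extra = 'merge') # Any extra pieces are merged into the last column
```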
7.4 Separate_rows
Similarly, grouped data can also be combined in the same cell, typically representing a one-to-many (or many-to-one) relationship. This can be due to data entry or unusual output formats. Consider our industry table, which is an industry-to-company mapping, telling us which companies work in which industry:
industry
## id industry
## 1 2726 Health Care
## 2 2612,2211 Commerical Supplies
As it stands, there is no way to join the industry - specifically the value Commerical Supplies - into our vendors table. The unique key is the company's id number, which gets joined to the vendors table.
When we run into a situation where multiple keys share one value (multiple IDs are associated with one industry), we can use separate_rows() to break up that row into several rows. For this function, we need to specify the column that needs to be separated - in this case, id needs to be broken apart.
clean_industry <- industry %>%
  separate_rows(id, convert = TRUE) # convert = TRUE will try to guess the data type of each field, based on the separated content
print(clean_industry)
## # A tibble: 3 x 2
## id industry
## <int> <chr>
## 1 2726 Health Care
## 2 2612 Commerical Supplies
## 3 2211 Commerical Supplies
separate_rows() will automatically expand the column that needs to be separated, while replicating the content of the other columns into the new rows.
7.5 Tidy data
The three key concepts in 'tidy data' (Wickham and Grolemund 2017) are that:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
While you can get away with performing basic analysis and checks on data in any shape, the ability to move data between formats increases competency for cleaning, flexibility for plotting, and legibility for outputting tables in a readable format.
Let's start by cleaning the vendors table:
clean_vendors <- vendors %>%
  separate(name, c('name', 'company_contact'), sep = '- Contact: ') %>% # From the separate() section
  select(-date_added_char) %>% # Remove one of the date fields
  mutate(date_added = as_date(date_added, origin = '1970-01-01')) %>% # From the cleaning dates section
  filter(type == 'V')
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3 rows [1, 2, 3].
print(clean_vendors)
## # A tibble: 6 x 10
## id type name company_contact date_added summary `2015` `2016` `2017`
## <int> <chr> <chr> <chr> <date> <chr> <dbl> <dbl> <dbl>
## 1 2726 V "el-… Rivera, Evan 2015-05-23 spend 65326. 8.34e4 1.50e5
## 2 2612 V "Bra… el-Jaffer, Sha… 2015-07-04 spend 88270. 1.49e5 2.54e5
## 3 2211 V "Chr… Burch, Kryslyn 2015-08-05 spend 51847. 8.93e4 1.42e5
## 4 2726 V "el-… Rivera, Evan 2015-05-23 credits 509. 8.60e2 9.09e2
## 5 2612 V "Bra… el-Jaffer, Sha… 2015-07-04 credits 777. 1.09e3 1.85e3
## 6 2211 V "Chr… Burch, Kryslyn 2015-08-05 credits 221. 3.69e2 6.57e2
## # … with 1 more variable: `2018` <dbl>
This vendor data set is a combination of both long and wide, and needs to be standardized to fit the tidy data model. Specifically:
- Each variable must have its own column - the vendor's spend and credits for each year are spread over multiple columns. That is, the values of a variable (the year) are in the column names. In the end output, every combination of vendor-year should be a row.
- Each observation must have its own row - each vendor has both a spend and a credits value, within the summary column. Even though these form one observation for each vendor-year, they are separated over two rows.
- Each value must have its own cell - thankfully, this seems to be valid (although if not, you can clean it with separate() and separate_rows()).
Why would you want to subscribe to this tidy data philosophy? When you have a tidy data set, you are more empowered to explore your data (especially with tidyverse commands), and you also have the power to reshape it as necessary to fit the visual output you are imagining.
7.6 Pivots, wide and longer
Effectively pivoting your data into more suitable columns and rows requires you to consider the following:
- Which columns contain the values you want to aggregate?
- What do you want the new columns to be called? Consider both the name and the aggregated value.
In the clean_vendors data, we see each year represented as a column, and those columns contain dollar amounts.
clean_vendors
## # A tibble: 6 x 10
## id type name company_contact date_added summary `2015` `2016` `2017`
## <int> <chr> <chr> <chr> <date> <chr> <dbl> <dbl> <dbl>
## 1 2726 V "el-… Rivera, Evan 2015-05-23 spend 65326. 8.34e4 1.50e5
## 2 2612 V "Bra… el-Jaffer, Sha… 2015-07-04 spend 88270. 1.49e5 2.54e5
## 3 2211 V "Chr… Burch, Kryslyn 2015-08-05 spend 51847. 8.93e4 1.42e5
## 4 2726 V "el-… Rivera, Evan 2015-05-23 credits 509. 8.60e2 9.09e2
## 5 2612 V "Bra… el-Jaffer, Sha… 2015-07-04 credits 777. 1.09e3 1.85e3
## 6 2211 V "Chr… Burch, Kryslyn 2015-08-05 credits 221. 3.69e2 6.57e2
## # … with 1 more variable: `2018` <dbl>
We can pivot this data by specifying the years, and then gathering the amounts. In this case, we want to take the columns and turn them into rows, making the data longer. pivot_longer() will allow us to make this transformation:
clean_vendors %>%
  pivot_longer(cols = c(`2015`, `2016`, `2017`, `2018`), # The columns we want to pivot - in this case, the years
               names_to = 'year', # We want this collected column to be called year
               values_to = 'amount') # The values of the aggregation
## # A tibble: 24 x 8
## id type name company_contact date_added summary year amount
## <int> <chr> <chr> <chr> <date> <chr> <chr> <dbl>
## 1 2726 V "el-Shafi Comp… Rivera, Evan 2015-05-23 spend 2015 6.53e4
## 2 2726 V "el-Shafi Comp… Rivera, Evan 2015-05-23 spend 2016 8.34e4
## 3 2726 V "el-Shafi Comp… Rivera, Evan 2015-05-23 spend 2017 1.50e5
## 4 2726 V "el-Shafi Comp… Rivera, Evan 2015-05-23 spend 2018 5.56e4
## 5 2612 V "Bray Company … el-Jaffer, Shahe… 2015-07-04 spend 2015 8.83e4
## 6 2612 V "Bray Company … el-Jaffer, Shahe… 2015-07-04 spend 2016 1.49e5
## 7 2612 V "Bray Company … el-Jaffer, Shahe… 2015-07-04 spend 2017 2.54e5
## 8 2612 V "Bray Company … el-Jaffer, Shahe… 2015-07-04 spend 2018 1.79e5
## 9 2211 V "Chrisman Comp… Burch, Kryslyn 2015-08-05 spend 2015 5.18e4
## 10 2211 V "Chrisman Comp… Burch, Kryslyn 2015-08-05 spend 2016 8.93e4
## # … with 14 more rows
This data is more readable - each year is on its own row. However, we still have the issue where the same company has two rows for each year - for the summary field, both a spend and a credits value exist for a vendor:
clean_vendors %>%
  pivot_longer(cols = c(`2015`, `2016`, `2017`, `2018`),
               names_to = 'year',
               values_to = 'amount') %>%
  arrange(id) %>%
  head(10)
## # A tibble: 10 x 8
## id type name company_contact date_added summary year amount
## <int> <chr> <chr> <chr> <date> <chr> <chr> <dbl>
## 1 2211 V "Chrisman Comp… Burch, Kryslyn 2015-08-05 spend 2015 5.18e4
## 2 2211 V "Chrisman Comp… Burch, Kryslyn 2015-08-05 spend 2016 8.93e4
## 3 2211 V "Chrisman Comp… Burch, Kryslyn 2015-08-05 spend 2017 1.42e5
## 4 2211 V "Chrisman Comp… Burch, Kryslyn 2015-08-05 spend 2018 7.68e4
## 5 2211 V "Chrisman Comp… Burch, Kryslyn 2015-08-05 credits 2015 2.21e2
## 6 2211 V "Chrisman Comp… Burch, Kryslyn 2015-08-05 credits 2016 3.69e2
## 7 2211 V "Chrisman Comp… Burch, Kryslyn 2015-08-05 credits 2017 6.57e2
## 8 2211 V "Chrisman Comp… Burch, Kryslyn 2015-08-05 credits 2018 4.58e2
## 9 2612 V "Bray Company … el-Jaffer, Shahe… 2015-07-04 spend 2015 8.83e4
## 10 2612 V "Bray Company … el-Jaffer, Shahe… 2015-07-04 spend 2016 1.49e5
Since spend and credits belong to a single vendor-year pairing, it may be worthwhile to put both summary classifications into their own columns. In this case, we want to make the table wider by adding columns, so let's use pivot_wider(). This function takes a similar approach to pivot_longer():
- Which column contains the values you want to spread out? These values become the new column names.
- Where are the values located? These will get moved under the new columns.
Building from our last example, let's make this data a bit wider:
clean_vendors %>%
  pivot_longer(cols = c(`2015`, `2016`, `2017`, `2018`),
               names_to = 'year',
               values_to = 'amount') %>%
  pivot_wider(names_from = 'summary', # The values of summary become new column names
              values_from = 'amount') # The amounts move under those new columns
## # A tibble: 12 x 8
## id type name company_contact date_added year spend credits
## <int> <chr> <chr> <chr> <date> <chr> <dbl> <dbl>
## 1 2726 V "el-Shafi Comp… Rivera, Evan 2015-05-23 2015 6.53e4 509.
## 2 2726 V "el-Shafi Comp… Rivera, Evan 2015-05-23 2016 8.34e4 860.
## 3 2726 V "el-Shafi Comp… Rivera, Evan 2015-05-23 2017 1.50e5 909.
## 4 2726 V "el-Shafi Comp… Rivera, Evan 2015-05-23 2018 5.56e4 348.
## 5 2612 V "Bray Company … el-Jaffer, Shahe… 2015-07-04 2015 8.83e4 777.
## 6 2612 V "Bray Company … el-Jaffer, Shahe… 2015-07-04 2016 1.49e5 1094.
## 7 2612 V "Bray Company … el-Jaffer, Shahe… 2015-07-04 2017 2.54e5 1854.
## 8 2612 V "Bray Company … el-Jaffer, Shahe… 2015-07-04 2018 1.79e5 1395.
## 9 2211 V "Chrisman Comp… Burch, Kryslyn 2015-08-05 2015 5.18e4 221.
## 10 2211 V "Chrisman Comp… Burch, Kryslyn 2015-08-05 2016 8.93e4 369.
## 11 2211 V "Chrisman Comp… Burch, Kryslyn 2015-08-05 2017 1.42e5 657.
## 12 2211 V "Chrisman Comp… Burch, Kryslyn 2015-08-05 2018 7.68e4 458.
Each company now has its spend and credits in separate columns, and the data is digestible by year.
7.7 Joining data
Often our data will come from different data sets or different tables. Having the ability to combine data from different systems is a powerful way to add more analyzable information to your dataset. In the Completeness chapter, we joined accounts from both the GL and TB to calculate whether there were any differences.
The Relational Data chapter in R for Data Science goes into depth how the different join operations work.
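As a reminder of the mechanics, a left join keeps every row of the left table and matches in columns by a key - which is exactly how a cleaned mapping table like clean_industry would be attached to the vendors by id. A sketch on made-up stand-in data:

```r
library(dplyr)

# Hypothetical stand-ins for the cleaned vendor and industry tables
vendors_demo <- data.frame(id = c(2726, 2612, 2211),
                           name = c('el-Shafi Company Ltd',
                                    'Bray Company Ltd',
                                    'Chrisman Company Ltd'))

industry_demo <- data.frame(id = c(2726, 2612, 2211),
                            industry = c('Health Care',
                                         'Commerical Supplies',
                                         'Commerical Supplies'))

# Every vendor row is kept; industry is brought in by matching on id
vendors_demo %>%
  left_join(industry_demo, by = 'id')
```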
Division is the intended mechanism - see https://github.com/tidyverse/lubridate/issues/105 and https://stackoverflow.com/questions/8765621/length-of-lubridate-interval↩︎