Chapter 6 Data Completeness
The first thing you should do when you get a new piece of data, before you do any analysis, is to validate the data for obvious data errors and to perform completeness testing. The intent of Completeness testing is to evaluate whether you have received a full set of data - data anomalies can be explored further after the preliminary testing (for example, malformed dates, NA information, etc.).
Ideally, completeness on a data set should be performed by comparing to another data set - for example, a detailed ledger can roll up into account balances. Other ways of performing completeness is to compare data against a third party, independently managed source of information.
For this section we will use the accounting database and validate our Journal Entry file prior to doing further work. We covered the importing of data in the prior chapter, so we’ll do that again along with some packages:
library(dplyr) # For manipulating data
library(tidyr) # For making data long and wide
library(DBI) # For database connections
dir.create("data", showWarnings = FALSE)
download.file(url = "https://github.com/jonlinca/auditanalytics/raw/master/data/rauditanalytics.sqlite",
destfile = "data/rauditanalytics.sqlite", mode = "wb")
<- dbConnect(RSQLite::SQLite(), "data/rauditanalytics.sqlite")
con
# Creates a table reference and collects the table, saving it as a gl object
<- tbl(con, 'gl') %>%
gl collect()
6.1 Exploration of General Ledger data
General Ledger (GL) data is where all the transactions against the accounting system are stored. Sub-ledgers may exist for specific systems (Accounts Payable, Accounts Receivable, Inventory) in a more detailed form, and these systems will also post to the General Ledger.
For the data set in our example, we’ve requested and received a full year’s worth of transactions.
A rapid preview of all your columns can be done quickly with summary()
. Not only does it list all the fields available in the table, but it also gives quick statistics on numeric columns as well (whether it makes sense is up to you):
summary(gl)
## je_num amount gl_date gl_date_char
## Min. : 1.0 Min. :-96355 Min. :17897 Length:2000
## 1st Qu.: 250.8 1st Qu.: -6846 1st Qu.:17986 Class :character
## Median : 500.5 Median : 0 Median :18080 Mode :character
## Mean : 500.5 Mean : 0 Mean :18078
## 3rd Qu.: 750.2 3rd Qu.: 6846 3rd Qu.:18171
## Max. :1000.0 Max. : 96355 Max. :18261
##
## vendor_id account invoice_date description
## Min. : 77 Length:2000 Min. :17887 Length:2000
## 1st Qu.:2211 Class :character 1st Qu.:17976 Class :character
## Median :2612 Mode :character Median :18072 Mode :character
## Mean :2464 Mean :18069
## 3rd Qu.:2726 3rd Qu.:18162
## Max. :2726 Max. :18254
## NA's :1000 NA's :44
## paid_date
## Min. :17900
## 1st Qu.:18003
## Median :18097
## Mean :18096
## 3rd Qu.:18188
## Max. :18291
## NA's :44
A high level summary scan is useful for us:
je_num
is a number value, although it has no meaning as a number, as it is a reference.vendor_id
is a number as well, but it contains NA’s. NA’s are R’s way of indicating that data does not exist.amount
is both positive and negative - indicating this column has indicators of both credits and debits- The dates (
gl_date
,invoice_date
andpaid_date
) are numeric whilegl_date_char
is a character. We will talk about converting these dates to something interpretable in the next chapter.
It may be useful to look at a few data samples more closely to understand patterns. head()
is useful in seeing the first few columns
head(gl)
## # A tibble: 6 x 9
## je_num amount gl_date gl_date_char vendor_id account invoice_date description
## <int> <dbl> <dbl> <chr> <int> <chr> <dbl> <chr>
## 1 1 22667. 18232 2019-12-02 2211 exp_ma… 18225 Packaging …
## 2 1 -22667. 18232 2019-12-02 NA liab_a… 18225 Packaging …
## 3 2 1397. 18228 2019-11-28 2612 exp_ma… 18218 Paper
## 4 2 -1397. 18228 2019-11-28 NA liab_a… 18218 Paper
## 5 3 2319. 17995 2019-04-09 2612 exp_ma… 17985 Paper
## 6 3 -2319. 17995 2019-04-09 NA liab_a… 17985 Paper
## # … with 1 more variable: paid_date <dbl>
We get a more detailed understanding of the fields:
je_num
indicates a set of lines within a journal entry.vendor_id
is usually associated to an expense account.amount
is both positive and negative for the same journal entry - this means theoretically, it should balance to zero.- The dates,
gl_date
specifically, are numeric whilegl_date_char
appears as a date (but is still a character).
6.2 Examination of potential errors
6.2.1 NA values
At this stage, you could explore obvious potential issues - in this case, NA values have surfaced themselves early through the summary()
command, so we should explore it a bit.
NAs are “Not available” or Missing Values. As an auditor, its important for to understand why NA values exist in your data set. Reasons I have heard in my career include:
Data was not recorded - A field may be blank because it was intentionally or accidentally omitted. A “void date” is quite commonly NA in a GL database as most entries have not been voided. Or perhaps a journal entry is NA because it has not yet been approved. There may be business rules that indicate why a row’s value may be NA.
Data was not recorded at the time - A data source is always evolving, and new columns may be introduced as new features are rolled out or data structure changes. For example, a relatively new requirement indicating companies must identify government companies within their databases, and only applicable for new companies in the database. Vendors entered prior to this change may be left as NA.
Inappropriate coercion - the column type was converted from one to another and a loss of value occured. For example, converting the letter ‘a’ using
as.numeric()
will give the following:
as.numeric('a')
## Warning: NAs introduced by coercion
## [1] NA
Its not that this value never existed. The letter ‘a’ did exist in the original format, but it doesn’t have a numerical representation within R. As it did not have a valid value when converted to a numeric type, it shows as NA.
In our data set, the vendor_id
has NA values. We can inspect these by isolating them to determine the nature of the pattern:
# Base R equivalent:
# gl[is.na(gl$vendor_id), ]
# Tidyverse / dplyr
%>%
gl filter(is.na(vendor_id))
## # A tibble: 1,000 x 9
## je_num amount gl_date gl_date_char vendor_id account invoice_date
## <int> <dbl> <dbl> <chr> <int> <chr> <dbl>
## 1 1 -22667. 18232 2019-12-02 NA liab_a… 18225
## 2 2 -1397. 18228 2019-11-28 NA liab_a… 18218
## 3 3 -2319. 17995 2019-04-09 NA liab_a… 17985
## 4 4 -20037. 18199 2019-10-30 NA liab_a… 18189
## 5 5 -43140. 18051 2019-06-04 NA liab_a… 18041
## 6 6 -13603. 17927 2019-01-31 NA liab_a… 17920
## 7 7 -7001. 18190 2019-10-21 NA liab_a… 18183
## 8 8 -24753. 18010 2019-04-24 NA liab_a… 18003
## 9 9 -2295. 18087 2019-07-10 NA liab_a… 18080
## 10 10 -34263. 18060 2019-06-13 NA liab_a… 18053
## # … with 990 more rows, and 2 more variables: description <chr>,
## # paid_date <dbl>
This indicates that several values are NA. This enables us to ask the proper questions - specifically, we should seek to understand and corroborate with the business if there is a certain pattern associated to these NAs. dplyr’s group_by()
and summarize()
are useful for identifying these patterns further:
%>%
gl filter(is.na(vendor_id)) %>%
group_by(account) %>%
summarize(n = n(),
.groups = 'drop') # Needed to suppress the ungrouping object message
## # A tibble: 2 x 2
## account n
## * <chr> <int>
## 1 liab_accountspayable_2000 990
## 2 liab_creditcardpayable_2100 10
6.2.2 Journal Entries balance to zero
A quick sanity check for the analysis of GL accounts is to do a quick summarization.
In this case, you will want to group_by()
and summarize()
again - in this case, by the je_num
will test whether all journal entries will net to zero.
%>%
gl group_by(je_num) %>%
summarize(amount = sum(amount),
.groups = 'drop') %>%
filter(amount != 0)
## # A tibble: 0 x 2
## # … with 2 variables: je_num <int>, amount <dbl>
6.3 Transforming of Trial Balance data
The Trial Balance (TB) is intended to track and record higher level movements of the General Ledger. It does so by maintaining an accurate balance of debits and credits made to the accounts.
Lets look at our TB:
<- tbl(con, 'tb') %>%
tb collect()
head(tb)
## # A tibble: 5 x 13
## account activity_2019_01 activity_2019_02 activity_2019_03 activity_2019_04
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 exp_co… 28785. 9942. 9715. 9931.
## 2 exp_ma… 1021508. 834766. 746621. 636011.
## 3 exp_me… 1.55 0 0 281.
## 4 liab_a… -1050295. -844708. -756336. -646057.
## 5 liab_c… 0 0 0 -166.
## # … with 8 more variables: activity_2019_05 <dbl>, activity_2019_06 <dbl>,
## # activity_2019_07 <dbl>, activity_2019_08 <dbl>, activity_2019_09 <dbl>,
## # activity_2019_10 <dbl>, activity_2019_11 <dbl>, activity_2019_12 <dbl>
The TB provided has the net change or net activity level by month. In your day-to-day work, you may also receive a TB that has an Opening and Closing balance - to obtain the net change for the audit period, simply deduct the Close from the Open to calculate the Change for the year, zeroing out the ‘Open’ component for income statement accounts.
In our case, we would like to perform completeness testing by account for the entire year, which means we compare the total activity of the account in the TB to the GL. As each column is its own month, we can approach this in several ways - each of the below methods demonstrates how to solve the problem.
6.3.1 Sum by absolute references
In traditional “Excel-esque” form, you would add up each column for each row - simply taking the values of each column and adding them together.
%>%
tb mutate(tb_activity = activity_2019_01 + activity_2019_02 + activity_2019_03 + activity_2019_04 + activity_2019_05 + activity_2019_06 + activity_2019_07 + activity_2019_08 + activity_2019_09 + activity_2019_10 + activity_2019_11 + activity_2019_12) %>%
select(account, tb_activity)
## # A tibble: 5 x 2
## account tb_activity
## <chr> <dbl>
## 1 exp_consulting_6500 248427.
## 2 exp_materials_6000 9984507.
## 3 exp_meals_7000 1706.
## 4 liab_accountspayable_2000 -10227945.
## 5 liab_creditcardpayable_2100 -6695.
While the above works, there are some risks to this code:
- the code is difficult to read, as it is a long string of column names,
- is prone to errors as you have to type out each column, and
- this approach could only be used once, this year, as when the next audit year rolls around, you would have to manually change the references.
6.3.2 Sum by numeric position
To make this code more reusable, we could make some changes. We notice that the activity columns are in the second column through the thirteenth position, so we can assume that the ‘position’ of the columns will never change - second column will always be January, and the thirteenth will be December.
names(tb) # This tells us the position number of each column names
## [1] "account" "activity_2019_01" "activity_2019_02" "activity_2019_03"
## [5] "activity_2019_04" "activity_2019_05" "activity_2019_06" "activity_2019_07"
## [9] "activity_2019_08" "activity_2019_09" "activity_2019_10" "activity_2019_11"
## [13] "activity_2019_12"
names(tb)[2] # Returns January
## [1] "activity_2019_01"
names(tb)[13] # Returns December
## [1] "activity_2019_12"
Therefore, we can reference a range of column numbers in our script:
%>%
tb select(2:13) # Selects just the numeric columns that we assume, by column number
## # A tibble: 5 x 12
## activity_2019_01 activity_2019_02 activity_2019_03 activity_2019_04
## <dbl> <dbl> <dbl> <dbl>
## 1 28785. 9942. 9715. 9931.
## 2 1021508. 834766. 746621. 636011.
## 3 1.55 0 0 281.
## 4 -1050295. -844708. -756336. -646057.
## 5 0 0 0 -166.
## # … with 8 more variables: activity_2019_05 <dbl>, activity_2019_06 <dbl>,
## # activity_2019_07 <dbl>, activity_2019_08 <dbl>, activity_2019_09 <dbl>,
## # activity_2019_10 <dbl>, activity_2019_11 <dbl>, activity_2019_12 <dbl>
And now we can summarize by row
<- tb %>%
tb_activity select(2:13) %>%
rowSums() # And sums up the row
print(tb_activity) # This is saved as a numeric vector
## [1] 248426.92 9984507.34 1705.97 -10227945.12 -6695.11
The newly calculated row summarization will return the total activity as a numeric vector; each item in the vector represents which row it belonged to. As this is a vector of numbers, we can a vector of the same TB account names and join them together.
print(tb$account) # Vector of account names as a character
## [1] "exp_consulting_6500" "exp_materials_6000"
## [3] "exp_meals_7000" "liab_accountspayable_2000"
## [5] "liab_creditcardpayable_2100"
With these two vectors, we can create a new data frame with the account name and TB activity we calculated.
data.frame(account = tb$account, tb_activity = tb_activity) # We create a new dataframe - one from the character vector in the original trial balance file, the other from the created tb_activity
## account tb_activity
## 1 exp_consulting_6500 248426.92
## 2 exp_materials_6000 9984507.34
## 3 exp_meals_7000 1705.97
## 4 liab_accountspayable_2000 -10227945.12
## 5 liab_creditcardpayable_2100 -6695.11
6.3.3 Sum by named references
In addition of referencing by position number, we can also reference by column name. We want to sum up all columns that start with “activity_.” The selecting by position and selecting by variable name are similar, so we’ll also introduce the ‘dot’ in this select statement.
%>%
tb mutate(tb_activity = rowSums(
select(., contains('activity_')) # Only keep the column names with the word 'activity_'
%>%
)) select(account, tb_activity)
## # A tibble: 5 x 2
## account tb_activity
## <chr> <dbl>
## 1 exp_consulting_6500 248427.
## 2 exp_materials_6000 9984507.
## 3 exp_meals_7000 1706.
## 4 liab_accountspayable_2000 -10227945.
## 5 liab_creditcardpayable_2100 -6695.
What occurs here is that the dot will take the preceding command (technically known as the ‘left hand side’ or LHS) and feed it directly into the function. So in this case, the command can be narrated as:
Using the TB table (this becomes our LHS), create a column named “tb_activity”…
Calculate “tb_activity” by identifying all column names containing the word “activity_” from the TB table (referenced by the dot). Using these columns, add them together with rowSums.
By introducing and referencing our columns by names, we’ve introduced a more specific and robust way to aggregate our information by account.
6.3.4 Pivot then summarize
While the prior methods focused on summing up multiple columns, you could also approach this problem as if it was a wide data set that needed to become long. The ability to pivot data longer and wider is incredibly useful - not only for cleaning, but also for reshaping data into other formats for plotting and preparing for databases.
If we look at our original TB data again, we notice there is:
- one unique identifier (the account name),
- multiple values for each month (example,
activity_2019_01
represents the period with a value January 2019), - and the dollar value itself for each month.
When we deconstruct our data, it becomes much easier to delve into the tidyr package and the functions pivot_longer()
and pivot_wider()
. We will discuss the details of pivoting in cleaning data.
head(tb) %>%
::datatable(options = list(scrollX = TRUE)) # Notice how this data looks wide DT
<- tb %>%
tb_long pivot_longer(cols = starts_with("activity_"), # We want to aggregate the values in these columns
names_to = "period", # What we want to call this new column
values_to = "activity") # And the values we want to take from it
tb_long
## # A tibble: 60 x 3
## account period activity
## <chr> <chr> <dbl>
## 1 exp_consulting_6500 activity_2019_01 28785.
## 2 exp_consulting_6500 activity_2019_02 9942.
## 3 exp_consulting_6500 activity_2019_03 9715.
## 4 exp_consulting_6500 activity_2019_04 9931.
## 5 exp_consulting_6500 activity_2019_05 28973.
## 6 exp_consulting_6500 activity_2019_06 37677.
## 7 exp_consulting_6500 activity_2019_07 19295.
## 8 exp_consulting_6500 activity_2019_08 9104.
## 9 exp_consulting_6500 activity_2019_09 28698.
## 10 exp_consulting_6500 activity_2019_10 28753.
## # … with 50 more rows
This data is now represented longer - there is now one unique value (activity) for each account and month. From here, we can now summarize()
:
%>%
tb_long group_by(account) %>%
summarize(tb_activity = sum(activity))
## # A tibble: 5 x 2
## account tb_activity
## * <chr> <dbl>
## 1 exp_consulting_6500 248427.
## 2 exp_materials_6000 9984507.
## 3 exp_meals_7000 1706.
## 4 liab_accountspayable_2000 -10227945.
## 5 liab_creditcardpayable_2100 -6695.
While any of these approaches will work to calculate the trial balance of activity, there are sustainability advantages where the columns are not verbatim and explicitly mentioned - what matters is how you believe your dataset can change over time, or how easily readable you can communicate your work to a new individual.
6.4 Asserting Completeness
The practical goal before we do any further testing is to ensure we’re not wasting time with a data set that is missing information. This completeness test will help validate that we received data for twelve months of GL activity. In Journal Entry testing, this means we will compare the summarized Trial Balance file against the General Ledger entries to obtain reasonableness that our data set we received is complete.
To compare both the GL and TB, we will want to aggregate the data in both datasets before joining them together. First, we will aggregate the GL:
<- gl %>%
gl_summarized group_by(account) %>%
summarize(gl_total = sum(amount),
.groups = 'drop') # Needed to suppress the ungrouping object message
gl_summarized
## # A tibble: 5 x 2
## account gl_total
## * <chr> <dbl>
## 1 exp_consulting_6500 248427.
## 2 exp_materials_6000 9984507.
## 3 exp_meals_7000 1706.
## 4 liab_accountspayable_2000 -10227945.
## 5 liab_creditcardpayable_2100 -6695.
And also aggregate the TB:
<- tb %>%
tb_summarized mutate(tb_activity = rowSums(
select(., contains('activity_'))
%>%
)) select(account, tb_activity)
tb_summarized
## # A tibble: 5 x 2
## account tb_activity
## <chr> <dbl>
## 1 exp_consulting_6500 248427.
## 2 exp_materials_6000 9984507.
## 3 exp_meals_7000 1706.
## 4 liab_accountspayable_2000 -10227945.
## 5 liab_creditcardpayable_2100 -6695.
6.4.1 Joining the GL to the TB
To perform a proper test of completeness, we should join both tables together. dplyr and the *_join()
family of functions can be used to join tables, and also used as diagnosis tools to help debug information as well.
Our GL and TB summarized datasets could be joined by the account
column, prior to performing the calculation to identify differences:
%>%
gl_summarized full_join(tb_summarized, by = 'account')
## # A tibble: 5 x 3
## account gl_total tb_activity
## <chr> <dbl> <dbl>
## 1 exp_consulting_6500 248427. 248427.
## 2 exp_materials_6000 9984507. 9984507.
## 3 exp_meals_7000 1706. 1706.
## 4 liab_accountspayable_2000 -10227945. -10227945.
## 5 liab_creditcardpayable_2100 -6695. -6695.
Once you have joined the columns together, a simple difference calculation will let you know what the differences are (if any):
%>%
gl_summarized full_join(tb_summarized, by = 'account') %>%
mutate(tb_diff = gl_total - tb_activity)
## # A tibble: 5 x 4
## account gl_total tb_activity tb_diff
## <chr> <dbl> <dbl> <dbl>
## 1 exp_consulting_6500 248427. 248427. 0
## 2 exp_materials_6000 9984507. 9984507. 0
## 3 exp_meals_7000 1706. 1706. 0
## 4 liab_accountspayable_2000 -10227945. -10227945. 0
## 5 liab_creditcardpayable_2100 -6695. -6695. 0
And if you like clean working papers, simply filter to identify where the reconciliation did not work out:
%>%
gl_summarized full_join(tb_summarized, by = 'account') %>%
mutate(tb_diff = gl_total - tb_activity) %>%
::filter(tb_diff != 0) dplyr
## # A tibble: 0 x 4
## # … with 4 variables: account <chr>, gl_total <dbl>, tb_activity <dbl>,
## # tb_diff <dbl>
6.5 Cautionary notes for Completeness
While performing completeness, you are proving you have a complete set of data. Exercise caution and ensure you are aware of the following:
A completeness check is only as good as the data provided. In the above case, if November was completely excluded in both GL and TB, you would not be able to detect it at this stage if you did not look through the TB data to see that all twelve months were included.
Using
*_join()
functions will generally return all combinations of matches. This is a strong motivation to ensure you have summarized rows by the join columns, prior to joining. Once you have summarized your data, the*_join()
functions are far more predictable.full_join()
will indicate if there are any accounts missing from either table. In our example, if thetb_summarized
data had missing accounts that did exist in thegl_summarized
table, the resultingtb_activity
column values would show up as NA.anti_join()
will show what columns are included on the left, but missing on the right. In our example, if thetb_summarized
data had missing accounts that existed in thegl_summarized
table, only this account would show up in the results.
You may want to consider testing for invalid values or missing dates up front. While the primary goal is to ensure the datasets received are appropriate, you may want to consider validating this information earlier. The next chapter will show you how to further manipulate and test these columns.