Chapter 6 Data Completeness

The first thing you should do when you receive a new piece of data, before you do any analysis, is to validate it for obvious errors and to perform completeness testing. The intent of completeness testing is to evaluate whether you have received a full set of data - data anomalies (for example, malformed dates or NA values) can be explored further after this preliminary testing.

Ideally, completeness testing on a data set should be performed by comparing it against another data set - for example, a detailed ledger should roll up into its account balances. Another way of performing completeness testing is to compare the data against a third-party, independently managed source of information.

For this section we will use the accounting database and validate our Journal Entry file prior to doing further work. We covered importing data in the prior chapter, so we'll repeat that here, along with loading a few packages:

library(dplyr) # For manipulating data
library(tidyr) # For making data long and wide

library(DBI) # For database connections

dir.create("data", showWarnings = FALSE)

download.file(url = "https://github.com/jonlinca/auditanalytics/raw/master/data/rauditanalytics.sqlite",
             destfile = "data/rauditanalytics.sqlite", mode = "wb")

con <- dbConnect(RSQLite::SQLite(), "data/rauditanalytics.sqlite")

# Creates a table reference and collects the table, saving it as a gl object
gl <- tbl(con, 'gl') %>%
  collect()

6.1 Exploration of General Ledger data

General Ledger (GL) data is where all the transactions against the accounting system are stored. Sub-ledgers may exist for specific systems (Accounts Payable, Accounts Receivable, Inventory) in a more detailed form, and these systems will also post to the General Ledger.

For the data set in our example, we’ve requested and received a full year’s worth of transactions.

A rapid preview of all your columns can be done with summary(). Not only does it list all the fields available in the table, but it also gives quick statistics on numeric columns (whether those statistics make sense is up to you):

summary(gl)
##      je_num           amount          gl_date      gl_date_char      
##  Min.   :   1.0   Min.   :-96355   Min.   :17897   Length:2000       
##  1st Qu.: 250.8   1st Qu.: -6846   1st Qu.:17986   Class :character  
##  Median : 500.5   Median :     0   Median :18080   Mode  :character  
##  Mean   : 500.5   Mean   :     0   Mean   :18078                     
##  3rd Qu.: 750.2   3rd Qu.:  6846   3rd Qu.:18171                     
##  Max.   :1000.0   Max.   : 96355   Max.   :18261                     
##                                                                      
##    vendor_id      account           invoice_date   description       
##  Min.   :  77   Length:2000        Min.   :17887   Length:2000       
##  1st Qu.:2211   Class :character   1st Qu.:17976   Class :character  
##  Median :2612   Mode  :character   Median :18072   Mode  :character  
##  Mean   :2464                      Mean   :18069                     
##  3rd Qu.:2726                      3rd Qu.:18162                     
##  Max.   :2726                      Max.   :18254                     
##  NA's   :1000                      NA's   :44                        
##    paid_date    
##  Min.   :17900  
##  1st Qu.:18003  
##  Median :18097  
##  Mean   :18096  
##  3rd Qu.:18188  
##  Max.   :18291  
##  NA's   :44

A high-level summary scan tells us a few useful things:

  • je_num is a numeric value, although it has no meaning as a number, as it is a reference.
  • vendor_id is a number as well, but it contains NAs. NA is R’s way of indicating that data does not exist.
  • amount is both positive and negative - indicating this column contains both credits and debits.
  • The dates (gl_date, invoice_date and paid_date) are numeric while gl_date_char is a character. We will talk about converting these dates to something interpretable in the next chapter; a quick peek follows below.
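
If you're wondering why those dates appear as plain numbers: R stores Date values as the number of days since 1970-01-01, so the values shown by summary() can be interpreted directly. A minimal peek (the full conversion is covered in the next chapter):

as.Date(17897, origin = "1970-01-01") # The minimum gl_date from summary()
## [1] "2019-01-01"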

It may be useful to look at a few data samples more closely to understand patterns. head() is useful for seeing the first few rows:

head(gl)
## # A tibble: 6 x 9
##   je_num  amount gl_date gl_date_char vendor_id account invoice_date description
##    <int>   <dbl>   <dbl> <chr>            <int> <chr>          <dbl> <chr>      
## 1      1  22667.   18232 2019-12-02        2211 exp_ma…        18225 Packaging …
## 2      1 -22667.   18232 2019-12-02          NA liab_a…        18225 Packaging …
## 3      2   1397.   18228 2019-11-28        2612 exp_ma…        18218 Paper      
## 4      2  -1397.   18228 2019-11-28          NA liab_a…        18218 Paper      
## 5      3   2319.   17995 2019-04-09        2612 exp_ma…        17985 Paper      
## 6      3  -2319.   17995 2019-04-09          NA liab_a…        17985 Paper      
## # … with 1 more variable: paid_date <dbl>

We get a more detailed understanding of the fields:

  • je_num indicates a set of lines within a journal entry.
  • vendor_id is usually associated with an expense account.
  • amount is both positive and negative within the same journal entry - this means that, theoretically, each entry should balance to zero.
  • The dates, gl_date specifically, are numeric while gl_date_char appears as a date (but is still a character).

6.2 Examination of potential errors

6.2.1 NA values

At this stage, you could explore obvious potential issues - in this case, NA values surfaced early through the summary() command, so we should explore them a bit.

NAs are “Not Available” or missing values. As an auditor, it’s important to understand why NA values exist in your data set. Reasons I have heard in my career include:

  • Data was not recorded - A field may be blank because it was intentionally or accidentally omitted. A “void date” is quite commonly NA in a GL database as most entries have not been voided. Or perhaps a journal entry is NA because it has not yet been approved. There may be business rules that indicate why a row’s value may be NA.

  • Data was not recorded at the time - A data source is always evolving, and new columns may be introduced as new features are rolled out or the data structure changes. For example, a relatively new requirement may dictate that companies identify government entities within their vendor databases, applicable only to vendors added after the change - vendors entered prior to this change may be left as NA.

  • Inappropriate coercion - the column type was converted from one type to another and a loss of value occurred. For example, converting the letter ‘a’ using as.numeric() will give the following:

as.numeric('a')
## Warning: NAs introduced by coercion
## [1] NA

It’s not that this value never existed. The letter ‘a’ did exist in the original format, but it doesn’t have a numerical representation within R. As it had no valid value when converted to a numeric type, it shows as NA.
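
A useful side effect is that you can flag exactly which values were lost in a conversion by comparing NAs before and after the coercion. A minimal sketch, using a made-up vector:

x <- c('1', '2', 'a') # 'a' has no numeric representation
x_num <- suppressWarnings(as.numeric(x))
x[is.na(x_num) & !is.na(x)] # The original values lost in coercion
## [1] "a"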

In our data set, the vendor_id has NA values. We can inspect these by isolating them to determine the nature of the pattern:

# Base R equivalent: 
#  gl[is.na(gl$vendor_id), ]

# Tidyverse / dplyr
gl %>%
  filter(is.na(vendor_id))
## # A tibble: 1,000 x 9
##    je_num  amount gl_date gl_date_char vendor_id account invoice_date
##     <int>   <dbl>   <dbl> <chr>            <int> <chr>          <dbl>
##  1      1 -22667.   18232 2019-12-02          NA liab_a…        18225
##  2      2  -1397.   18228 2019-11-28          NA liab_a…        18218
##  3      3  -2319.   17995 2019-04-09          NA liab_a…        17985
##  4      4 -20037.   18199 2019-10-30          NA liab_a…        18189
##  5      5 -43140.   18051 2019-06-04          NA liab_a…        18041
##  6      6 -13603.   17927 2019-01-31          NA liab_a…        17920
##  7      7  -7001.   18190 2019-10-21          NA liab_a…        18183
##  8      8 -24753.   18010 2019-04-24          NA liab_a…        18003
##  9      9  -2295.   18087 2019-07-10          NA liab_a…        18080
## 10     10 -34263.   18060 2019-06-13          NA liab_a…        18053
## # … with 990 more rows, and 2 more variables: description <chr>,
## #   paid_date <dbl>

This shows that a substantial number of values - 1,000 rows - are NA. This enables us to ask the proper questions; specifically, we should seek to understand, and corroborate with the business, whether there is a pattern associated with these NAs. dplyr’s group_by() and summarize() are useful for identifying these patterns further:

gl %>%
  filter(is.na(vendor_id)) %>%
  group_by(account) %>%
  summarize(n = n(),
           .groups = 'drop') # Needed to suppress the ungrouping object message
## # A tibble: 2 x 2
##   account                         n
## * <chr>                       <int>
## 1 liab_accountspayable_2000     990
## 2 liab_creditcardpayable_2100    10
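
Rather than relying on summary() alone, you can also scan every column for NAs in one pass. A minimal sketch, assuming dplyr 1.0.0 or later for across():

gl %>%
  summarize(across(everything(), ~ sum(is.na(.)))) # Count NAs in each column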

6.2.2 Journal Entries balance to zero

A quick sanity check for the analysis of GL accounts is a simple summarization.

Here you will want to group_by() and summarize() again - this time by je_num, to test whether every journal entry nets to zero.

gl %>%
  group_by(je_num) %>%
  summarize(amount = sum(amount),
             .groups = 'drop') %>%
  filter(amount != 0)
## # A tibble: 0 x 2
## # … with 2 variables: je_num <int>, amount <dbl>
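
If you prefer a script that halts loudly when entries do not balance, the same check can be expressed as an assertion. A minimal sketch using base R's stopifnot():

unbalanced <- gl %>%
  group_by(je_num) %>%
  summarize(amount = sum(amount), .groups = 'drop') %>%
  filter(amount != 0)

stopifnot(nrow(unbalanced) == 0) # Errors if any journal entry does not net to zero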

6.3 Transforming Trial Balance data

The Trial Balance (TB) is intended to track and record higher level movements of the General Ledger. It does so by maintaining an accurate balance of debits and credits made to the accounts.

Let’s look at our TB:

tb <- tbl(con, 'tb') %>%
  collect()

head(tb)
## # A tibble: 5 x 13
##   account activity_2019_01 activity_2019_02 activity_2019_03 activity_2019_04
##   <chr>              <dbl>            <dbl>            <dbl>            <dbl>
## 1 exp_co…         28785.              9942.            9715.            9931.
## 2 exp_ma…       1021508.            834766.          746621.          636011.
## 3 exp_me…             1.55               0                0              281.
## 4 liab_a…      -1050295.           -844708.         -756336.         -646057.
## 5 liab_c…             0                  0                0             -166.
## # … with 8 more variables: activity_2019_05 <dbl>, activity_2019_06 <dbl>,
## #   activity_2019_07 <dbl>, activity_2019_08 <dbl>, activity_2019_09 <dbl>,
## #   activity_2019_10 <dbl>, activity_2019_11 <dbl>, activity_2019_12 <dbl>

The TB provided has the net change, or net activity, by month. In your day-to-day work, you may also receive a TB that has an Opening and Closing balance - to obtain the net change for the audit period, simply deduct the Opening balance from the Closing balance to calculate the change for the year, zeroing out the ‘Open’ component for income statement accounts. A short sketch of this calculation follows.
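
A minimal sketch of that Open-to-Close calculation, using a hypothetical TB with open_balance and close_balance columns (the column names and figures here are illustrative only, not from our data set):

# Hypothetical opening/closing balances, for illustration only
tb_balances <- dplyr::tibble(account = c('exp_materials_6000', 'liab_accountspayable_2000'),
                             open_balance = c(0, -1500000),
                             close_balance = c(9984507, -11727945))

tb_balances %>%
  mutate(change = close_balance - open_balance) %>% # Deduct Open from Close
  select(account, change)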

In our case, we would like to perform completeness testing by account for the entire year, which means we compare the total activity of each account in the TB to the GL. As each column is its own month, we can approach this in several ways - each of the methods below demonstrates a different way to solve the same problem.

6.3.1 Sum by absolute references

In traditional “Excel-esque” form, you would add up each column for each row - simply taking the values of each column and adding them together.

tb %>%
  mutate(tb_activity = activity_2019_01 + activity_2019_02 + activity_2019_03 + activity_2019_04 + activity_2019_05 + activity_2019_06 + activity_2019_07 + activity_2019_08 + activity_2019_09 + activity_2019_10 + activity_2019_11 + activity_2019_12) %>%
  select(account, tb_activity)
## # A tibble: 5 x 2
##   account                     tb_activity
##   <chr>                             <dbl>
## 1 exp_consulting_6500             248427.
## 2 exp_materials_6000             9984507.
## 3 exp_meals_7000                    1706.
## 4 liab_accountspayable_2000    -10227945.
## 5 liab_creditcardpayable_2100      -6695.

While the above works, there are some risks to this code:

  • it is difficult to read, as it is a long string of column names,
  • it is prone to errors, as you have to type out each column name, and
  • it can only be used once, this year - when the next audit year rolls around, you would have to manually change the references.

6.3.2 Sum by numeric position

To make this code more reusable, we can make some changes. We notice that the activity columns occupy the second through thirteenth positions, so we can assume the ‘position’ of the columns will never change - the second column will always be January, and the thirteenth will always be December.

names(tb) # This tells us the position of each column name
##  [1] "account"          "activity_2019_01" "activity_2019_02" "activity_2019_03"
##  [5] "activity_2019_04" "activity_2019_05" "activity_2019_06" "activity_2019_07"
##  [9] "activity_2019_08" "activity_2019_09" "activity_2019_10" "activity_2019_11"
## [13] "activity_2019_12"
names(tb)[2] # Returns January
## [1] "activity_2019_01"
names(tb)[13] # Returns December
## [1] "activity_2019_12"

Therefore, we can reference a range of column numbers in our script:

tb %>%
  select(2:13) # Selects just the numeric columns that we assume, by column number
## # A tibble: 5 x 12
##   activity_2019_01 activity_2019_02 activity_2019_03 activity_2019_04
##              <dbl>            <dbl>            <dbl>            <dbl>
## 1         28785.              9942.            9715.            9931.
## 2       1021508.            834766.          746621.          636011.
## 3             1.55               0                0              281.
## 4      -1050295.           -844708.         -756336.         -646057.
## 5             0                  0                0             -166.
## # … with 8 more variables: activity_2019_05 <dbl>, activity_2019_06 <dbl>,
## #   activity_2019_07 <dbl>, activity_2019_08 <dbl>, activity_2019_09 <dbl>,
## #   activity_2019_10 <dbl>, activity_2019_11 <dbl>, activity_2019_12 <dbl>

And now we can sum across each row:

tb_activity <- tb %>%
  select(2:13) %>%
  rowSums() # And sums up the row

print(tb_activity) # This is saved as a numeric vector
## [1]    248426.92   9984507.34      1705.97 -10227945.12     -6695.11

The newly calculated row summarization returns the total activity as a numeric vector; each item in the vector corresponds to the row it came from. As this is a vector of numbers, we can take a vector of the matching TB account names and join the two together.

print(tb$account) # Vector of account names as a character
## [1] "exp_consulting_6500"         "exp_materials_6000"         
## [3] "exp_meals_7000"              "liab_accountspayable_2000"  
## [5] "liab_creditcardpayable_2100"

With these two vectors, we can create a new data frame with the account name and TB activity we calculated.

data.frame(account = tb$account, tb_activity = tb_activity) # We create a new dataframe - one from the character vector in the original trial balance file, the other from the created tb_activity
##                       account  tb_activity
## 1         exp_consulting_6500    248426.92
## 2          exp_materials_6000   9984507.34
## 3              exp_meals_7000      1705.97
## 4   liab_accountspayable_2000 -10227945.12
## 5 liab_creditcardpayable_2100     -6695.11

6.3.3 Sum by named references

In addition to referencing by position number, we can also reference by column name. We want to sum up all columns whose names contain “activity_”. Selecting by position and selecting by variable name are similar, so we’ll also introduce the ‘dot’ in this select statement.

tb %>%
  mutate(tb_activity = rowSums(
    select(., contains('activity_')) # Only keep the column names with the word 'activity_'
    )) %>%
  select(account, tb_activity)
## # A tibble: 5 x 2
##   account                     tb_activity
##   <chr>                             <dbl>
## 1 exp_consulting_6500             248427.
## 2 exp_materials_6000             9984507.
## 3 exp_meals_7000                    1706.
## 4 liab_accountspayable_2000    -10227945.
## 5 liab_creditcardpayable_2100      -6695.

What occurs here is that the dot takes the result of the preceding pipe (technically known as the ‘left-hand side’ or LHS) and feeds it directly into the function. So in this case, the command can be narrated as:

Using the TB table (this becomes our LHS), create a column named “tb_activity”…

Calculate “tb_activity” by identifying all column names containing the word “activity_” from the TB table (referenced by the dot). Using these columns, add them together with rowSums.

By referencing our columns by name, we’ve introduced a more specific and robust way to aggregate our information by account. (An alternative using newer dplyr verbs is sketched below.)
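
As an aside, newer versions of dplyr (1.0.0 and later) offer rowwise() and c_across(), which achieve the same sum without the dot - a minimal alternative sketch, not the approach used in the rest of this chapter:

tb %>%
  rowwise() %>% # Treat each row as its own group
  mutate(tb_activity = sum(c_across(contains('activity_')))) %>%
  ungroup() %>%
  select(account, tb_activity)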

6.3.4 Pivot then summarize

While the prior methods focused on summing up multiple columns, you could also approach this problem as if it was a wide data set that needed to become long. The ability to pivot data longer and wider is incredibly useful - not only for cleaning, but also for reshaping data into other formats for plotting and preparing for databases.

If we look at our original TB data again, we notice there is:

  • one unique identifier (the account name),
  • multiple values for each month (for example, activity_2019_01 represents the activity for January 2019),
  • and the dollar value itself for each month.

When we deconstruct our data this way, it becomes much easier to apply the tidyr package and the functions pivot_longer() and pivot_wider(). We will discuss the details of pivoting in the chapter on cleaning data.

head(tb) # Notice how this data looks wide

tb_long <- tb %>%
  pivot_longer(cols = starts_with("activity_"), # The columns we want to pivot into rows
               names_to = "period", # What we want to call this new column
               values_to = "activity") # And the values we want to take from it

tb_long
## # A tibble: 60 x 3
##    account             period           activity
##    <chr>               <chr>               <dbl>
##  1 exp_consulting_6500 activity_2019_01   28785.
##  2 exp_consulting_6500 activity_2019_02    9942.
##  3 exp_consulting_6500 activity_2019_03    9715.
##  4 exp_consulting_6500 activity_2019_04    9931.
##  5 exp_consulting_6500 activity_2019_05   28973.
##  6 exp_consulting_6500 activity_2019_06   37677.
##  7 exp_consulting_6500 activity_2019_07   19295.
##  8 exp_consulting_6500 activity_2019_08    9104.
##  9 exp_consulting_6500 activity_2019_09   28698.
## 10 exp_consulting_6500 activity_2019_10   28753.
## # … with 50 more rows

This data is now represented in a longer format - there is now one value (activity) for each combination of account and month. From here, we can now summarize():

tb_long %>%
  group_by(account) %>%
  summarize(tb_activity = sum(activity))
## # A tibble: 5 x 2
##   account                     tb_activity
## * <chr>                             <dbl>
## 1 exp_consulting_6500             248427.
## 2 exp_materials_6000             9984507.
## 3 exp_meals_7000                    1706.
## 4 liab_accountspayable_2000    -10227945.
## 5 liab_creditcardpayable_2100      -6695.
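
As an aside, pivot_wider() is the inverse operation - a quick sketch that reshapes tb_long back into the original wide layout:

tb_long %>%
  pivot_wider(names_from = period, # New column names come from the period values
              values_from = activity) # Cell values come from the activity column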

While any of these approaches will calculate the total trial balance activity, there are maintainability advantages to the methods that avoid naming each column verbatim - what matters is how you expect your dataset to change over time, and how clearly your work communicates itself to a new individual.

6.4 Asserting Completeness

The practical goal before we do any further testing is to ensure we’re not wasting time on a data set that is missing information. This completeness test will help validate that we received data for twelve months of GL activity. In Journal Entry testing, this means we will compare the summarized Trial Balance file against the General Ledger entries to gain reasonable assurance that the data set we received is complete.

To compare both the GL and TB, we will want to aggregate the data in both datasets before joining them together. First, we will aggregate the GL:

gl_summarized <- gl %>%
  group_by(account) %>%
  summarize(gl_total = sum(amount),
           .groups = 'drop') # Needed to suppress the ungrouping object message

gl_summarized
## # A tibble: 5 x 2
##   account                       gl_total
## * <chr>                            <dbl>
## 1 exp_consulting_6500            248427.
## 2 exp_materials_6000            9984507.
## 3 exp_meals_7000                   1706.
## 4 liab_accountspayable_2000   -10227945.
## 5 liab_creditcardpayable_2100     -6695.

And also aggregate the TB:

tb_summarized <- tb %>%
  mutate(tb_activity = rowSums(
    select(., contains('activity_'))
    )) %>%
  select(account, tb_activity)

tb_summarized
## # A tibble: 5 x 2
##   account                     tb_activity
##   <chr>                             <dbl>
## 1 exp_consulting_6500             248427.
## 2 exp_materials_6000             9984507.
## 3 exp_meals_7000                    1706.
## 4 liab_accountspayable_2000    -10227945.
## 5 liab_creditcardpayable_2100      -6695.

6.4.1 Joining the GL to the TB

To perform a proper test of completeness, we should join both tables together. dplyr and its *_join() family of functions can be used to join tables, and they also serve as diagnostic tools to help debug information.

Our GL and TB summarized datasets could be joined by the account column, prior to performing the calculation to identify differences:

gl_summarized %>%
  full_join(tb_summarized, by = 'account')
## # A tibble: 5 x 3
##   account                       gl_total tb_activity
##   <chr>                            <dbl>       <dbl>
## 1 exp_consulting_6500            248427.     248427.
## 2 exp_materials_6000            9984507.    9984507.
## 3 exp_meals_7000                   1706.       1706.
## 4 liab_accountspayable_2000   -10227945.  -10227945.
## 5 liab_creditcardpayable_2100     -6695.      -6695.

Once you have joined the columns together, a simple difference calculation will let you know what the differences are (if any):

gl_summarized %>%
  full_join(tb_summarized, by = 'account') %>%
  mutate(tb_diff = gl_total - tb_activity)
## # A tibble: 5 x 4
##   account                       gl_total tb_activity tb_diff
##   <chr>                            <dbl>       <dbl>   <dbl>
## 1 exp_consulting_6500            248427.     248427.       0
## 2 exp_materials_6000            9984507.    9984507.       0
## 3 exp_meals_7000                   1706.       1706.       0
## 4 liab_accountspayable_2000   -10227945.  -10227945.       0
## 5 liab_creditcardpayable_2100     -6695.      -6695.       0

And if you like clean working papers, simply filter to identify where the reconciliation did not work out:

gl_summarized %>%
  full_join(tb_summarized, by = 'account') %>%
  mutate(tb_diff = gl_total - tb_activity) %>%
  dplyr::filter(tb_diff != 0)
## # A tibble: 0 x 4
## # … with 4 variables: account <chr>, gl_total <dbl>, tb_activity <dbl>,
## #   tb_diff <dbl>
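
One caution worth noting here: because the amounts are stored as doubles, floating-point rounding can occasionally produce a reconciliation difference of a fraction of a cent where none truly exists. A more forgiving sketch applies a small tolerance instead of testing for exact zero:

gl_summarized %>%
  full_join(tb_summarized, by = 'account') %>%
  mutate(tb_diff = gl_total - tb_activity) %>%
  filter(abs(tb_diff) > 0.01) # Tolerate sub-cent floating point noise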

6.5 Cautionary notes for Completeness

While performing completeness testing, you are trying to prove you have a complete set of data. Exercise caution and ensure you are aware of the following:

  • A completeness check is only as good as the data provided. In the above case, if November was completely excluded from both the GL and the TB, you would not be able to detect it at this stage unless you looked through the TB data to confirm that all twelve months were included.

  • Using *_join() functions will generally return all combinations of matching rows. This is a strong motivation to ensure you have summarized rows by the join columns prior to joining. Once you have summarized your data, the *_join() functions are far more predictable.

  • full_join() will indicate if there are any accounts missing from either table. In our example, if the tb_summarized data was missing accounts that existed in the gl_summarized table, the resulting tb_activity column values would show up as NA.

  • anti_join() will show which rows are included on the left but missing on the right. In our example, if the tb_summarized data was missing an account that existed in the gl_summarized table, only that account would show up in the results - see the sketch below.
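
A minimal sketch of that anti_join() diagnostic, simulating a missing account by dropping an arbitrary one from the summarized TB:

tb_missing <- tb_summarized %>%
  filter(account != 'exp_meals_7000') # Simulate an account missing from the TB

gl_summarized %>%
  anti_join(tb_missing, by = 'account') # Expect one row: exp_meals_7000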

You may want to consider testing for invalid values or missing dates up front - while the primary goal at this stage is to ensure the datasets received are appropriate, validating that information earlier can save rework. The next chapter will show you how to further manipulate and test these columns.
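
For instance, returning to the twelve-month caution above, a quick sketch that counts the distinct months present in the GL (the date handling is covered properly in the next chapter):

gl %>%
  mutate(month = format(as.Date(gl_date, origin = "1970-01-01"), "%Y-%m")) %>%
  distinct(month) %>%
  nrow() # Expect 12 if every month of activity is present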