Chapter 5 Import data
In this chapter, we will download some datasets and import them. You will need the following packages to follow along.
library(dplyr) # For manipulating data
library(readr) # For reading flat files
library(DBI) # For database connections
# For API connections
library(httr)
library(jsonlite)
5.1 Delimited files
The most common method of obtaining data is via flat files, usually in the form of comma-separated values (CSV) files. While delimited files are the most convenient option when direct data connections are otherwise unobtainable, they are not well suited for long-term sustainability and automation.
The base package, installed with all instances of R, provides read.table(), a convenient built-in function for importing CSV files. The readr package includes a similar function called read_delim(), which is faster and allows for easy altering of the column specification, which directs the data type each column is imported as (for example, overriding an employee's identification number as a character versus a numeric).
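As a point of comparison, here is a minimal sketch of the base approach, using a hypothetical file path; read.csv() is a thin wrapper around read.table() with comma-delimited defaults:
base_vendors <- read.csv("path/to/file.csv", stringsAsFactors = FALSE) # Hypothetical path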
Before importing the file, let's download the file from the repository that contains the Vendor Master. This dataset contains the vendor system ID number, the date it was added, and other traits.
dir.create("data", showWarnings = FALSE) # Creates a directory in your project
download.file(url = "https://github.com/jonlinca/auditanalytics/raw/master/data/vendor_master.csv",
destfile = "data/vendor_master.csv", mode = "wb") # Downloads this csv file into the data folder
When importing delimited files, there will always be a few aspects to consider. The delimiter is the character that separates each field - the most common delimiters are commas (,), pipes (|) and tabs. Using an uncommon character was a typical workaround when exporting data from legacy systems, as commas within text fields were incorrectly parsed as extra columns. If possible, qualifiers should be used to enclose text within a field, typically quotes or double quotes. These indicate to the system that everything within the quotes belongs to a single field.
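For illustration, here is a minimal sketch using a small inline string of hypothetical data: the pipe is the delimiter, and double-quote qualifiers protect the comma inside each name. (readr treats a string containing newlines as literal data, the same trick used later in this chapter.)
# Hypothetical pipe-delimited data; the quotes protect the embedded commas
txt <- 'id|name|city\n1|"Smith, Jane"|Calgary\n2|"Lee, Ron"|Melbourne'
read_delim(txt, delim = "|", quote = '"')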
In this case, if you view the file (on the Files panel on the right, go to the data folder, click on vendor_master.csv, and select View File), you will see the data is separated by commas.
raw_vendors <- read_delim('data/vendor_master.csv', delim = ",")
## Parsed with column specification:
## cols(
## id = col_double(),
## type = col_character(),
## name = col_character(),
## date_added = col_date(format = ""),
## summary = col_character(),
## `2015` = col_double(),
## `2016` = col_double(),
## `2017` = col_double(),
## `2018` = col_double()
## )
The message indicates that default column types were assigned to each field imported. If all the fields imported as expected, this message can be ignored.
However, ID numbers, while presented as numbers, don't have a real value in any calculations. As a result, you can specify a column specification via the col_types argument, copying and pasting the framework from the message and changing the fields as need be:
cols <- cols(
  id = col_character(), # Changed from col_double()
  type = col_character(),
  name = col_character(),
  date_added = col_date(format = ""),
  summary = col_character(),
  `2015` = col_double(),
  `2016` = col_double(),
  `2017` = col_double(),
  `2018` = col_double()
)
raw_vendors <- read_delim('data/vendor_master.csv', delim = ",", col_types = cols)

glimpse(raw_vendors)
## Rows: 9
## Columns: 9
## $ id <chr> "793", "666", "77", "2726", "2612", "2211", "2726", "2612"…
## $ type <chr> "E", "E", "E", "V", "V", "V", "V", "V", "V"
## $ name <chr> "al-Bangura, Habeeba", "Akina, Duncan", "el-Kazmi, Muneefa…
## $ date_added <date> 2015-07-15, 2015-07-28, 2015-05-26, 2015-05-23, 2015-07-0…
## $ summary <chr> NA, NA, NA, "spend", "spend", "spend", "credits", "credits…
## $ `2015` <dbl> NA, NA, NA, 65325.5287, 88270.0937, 51847.2728, 509.4308, …
## $ `2016` <dbl> NA, NA, NA, 83413.7426, 148569.7780, 89260.3035, 860.3997,…
## $ `2017` <dbl> NA, NA, NA, 150261.9517, 254178.2672, 142074.7898, 908.947…
## $ `2018` <dbl> NA, NA, NA, 55569.8474, 179056.0944, 76761.0587, 347.9129,…
You may have noticed that while our normal columns are specified as-is, we had to wrap the years in backticks (`). While a column name can technically be a number, asking R to reference a bare number is interpreted as a value, not a field name.
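For example, here is a short sketch of referencing one of those year columns with dplyr's select(), assuming the raw_vendors import from above:
# Backticks tell R that 2015 is a column name, not the value 2015
raw_vendors %>%
  select(name, `2015`)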
5.1.1 Handling problematic delimited files
While well-exported delimited files can be useful, they often contain hidden surprises. Consider this rather innocuous CSV file from Active Directory (the controller for Windows authentication), with username and manager fields:
# Active directory file, with just a username and manager field.
download.file(url = "https://github.com/jonlinca/auditanalytics/raw/master/data/active_directory.csv",
destfile = "data/active_directory.csv", mode = "wb") # Downloads this csv file into the data folder
raw_ad <- read_delim('data/active_directory.csv', delim = ";")
## Parsed with column specification:
## cols(
## uid = col_character(),
## manager = col_character()
## )
## Warning: 2 parsing failures.
## row col expected actual file
## 2 -- 2 columns 1 columns 'data/active_directory.csv'
## 4 -- 2 columns 1 columns 'data/active_directory.csv'
These warnings indicate that columns were expected (as dictated by the first line of column headers) but missing in one or more lines. You can inspect the CSV file in the RStudio interface by clicking on the file in the navigation pane to the right and selecting 'View File.' You will notice that the location for both accounts is on a new line, but it belongs to the prior record. The raw characters can be confirmed within R by reading the file directly as-is (i.e. raw):
ad_char <- readChar('data/active_directory.csv', file.info('data/active_directory.csv')$size)

print(ad_char)
## [1] "uid;manager\r\njonlin; jason durrand\nlocation: calgary\r\nronaldlee; reid turra\nlocation: melbourne"
These special characters are hidden within the seemingly innocuous delimited file, and are typical of the systems information is extracted from, especially Windows. \r represents a carriage return, and \n represents a line feed. Together, \r\n represents a new line and new record, while a lone \n can appear in a file when a new line is made by pressing Shift+Enter.
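As a quick sketch of how to see the difference: print() shows the escape sequences literally, while cat() renders them, revealing where the breaks actually fall:
cat(ad_char) # Renders \r\n and \n as actual line breaks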
In this common yet inconvenient case, these can be substituted out with regular expressions. Regular expressions are a standard, cryptic yet powerful way to match text in strings. We will cover specific use cases of these in Searching Text.
In this case, the below regular expression only replaces \n when there is no \r preceding it (a negative lookbehind). The gsub() function will try to match the regular expression in its first argument, substituting the replacement value in its second argument.
gsub("(?<!\\r)\\n"," ", ad_char, perl = TRUE)
## [1] "uid;manager\r\njonlin; jason durrand location: calgary\r\nronaldlee; reid turra location: melbourne"
The gsub() output shows that the manager's name and location no longer have a \n between them. As a result, the file can now be imported cleanly.
ad_raw <- read_delim(gsub("(?<!\\r)\\n"," ", ad_char, perl = TRUE), delim = ";")

print(ad_raw)
## # A tibble: 2 x 2
## uid manager
## <chr> <chr>
## 1 jonlin " jason durrand location: calgary"
## 2 ronaldlee " reid turra location: melbourne"
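Notice the leading spaces that remain in the manager field. As one possible refinement, sketched here, read_delim() has a trim_ws argument that strips this padding on import:
# Same import, but trimming leading/trailing whitespace within each field
ad_raw <- read_delim(gsub("(?<!\\r)\\n"," ", ad_char, perl = TRUE),
                     delim = ";", trim_ws = TRUE)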
5.2 Databases
It is likely that the company you are auditing stores its data in a database. While having skills in SQL is recommended, having R skills means you are able to perform basic queries on databases. There are many different database brands and vendors in the world, and thus many subtleties in how SQL works for each vendor, but they all operate on the same core standard.
The Open Database Connectivity (ODBC) standard allows different vendors to write drivers - the technical back-end methods - to connect to their databases. Generally, you will need a driver that matches the vendor and version of the database you're using.
The Database Interface (DBI) is the interaction between R and the driver. Practically speaking, it enables R to send queries to the database via the driver that is defined in the ODBC.
The most common way to connect to a database on your network is to install the vendor drivers and then create a Data Source Name (DSN). To properly create this DSN, you'll need the name of your database, as well as read-only credentials. Alternatively, you may specify the server name, database schema and credentials explicitly, which offers some advantages from a portability perspective: your teammates will not need to create DSNs, and only need to install the drivers themselves.
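As a sketch of what an explicit, DSN-less connection might look like using the odbc package - every value below is a placeholder for your own environment, not a real server:
# Hypothetical explicit connection; the driver, server, database and
# credentials are all placeholders for your own environment
con_example <- DBI::dbConnect(odbc::odbc(),
                              Driver   = "SQL Server",    # Installed vendor driver
                              Server   = "corp-db01",     # Hypothetical server
                              Database = "erp_prod",      # Hypothetical database
                              UID      = "readonly_user", # Read-only credentials
                              PWD      = rstudioapi::askForPassword("Database password"))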
For this example, we will use an SQLite database. Unlike commercial databases, which you have to install, configure and hire database administrators to manage, an SQLite database is a small, self-contained file. SQLite files can be kept in your normal documents folders, and are perfect for lightweight applications (including training!).
Again, let's start by downloading the file, or in this case, a database:
dir.create("data", showWarnings = FALSE)
download.file(url = "https://github.com/jonlinca/auditanalytics/raw/master/data/rauditanalytics.sqlite",
destfile = "data/rauditanalytics.sqlite", mode = "wb")
One thing that is different about connecting to databases is that you need to set up a database connection within R. This will usually consist of a driver (in this case, RSQLite::SQLite()) and a DSN or a file location (data/rauditanalytics.sqlite). The help file lists other fields you may need as well: user, password, host, etc. At your company, you will also want that information, along with ports, schema names, and whether or not it is a trusted connection (authenticating automatically with your own Active Directory credentials). If you haven't already, request a read-only account at your company that can supply this information.
We’re going to establish a connection with the database, creating the con
connection object:
con <- dbConnect(RSQLite::SQLite(), "data/rauditanalytics.sqlite")
You can confirm the connection works correctly by listing the tables in the database. One important thing to remember is that you’ll be passing this connection object each time as you perform your commands. For example, if you want to see what tables exist in the database and schema, you will still need to tell the command which database you want to connect to.
dbListTables(con)
## [1] "gl" "industry" "tb" "vendors"
To get data out of a database, you’ll need to communicate to it with SQL (Structured Query Language). Here is an example of how to select all the records in a table:
dbGetQuery(con, 'select * from gl')
## je_num amount gl_date gl_date_char vendor_id account
## 1 1 22667.46 18232 2019-12-02 2211 exp_materials_6000
## invoice_date description paid_date
## 1 18225 Packaging and boxes 18262
## [ reached 'max' / getOption("max.print") -- omitted 1999 rows ]
While knowing SQL is advantageous (and eventually critical), sometimes switching between languages is a hassle, especially when performing basic tasks within a database. Using the dplyr package, you can generate several of the same SQL queries using R syntax. All you need to do is create a reference to the table in the connection object - in this case, the con connection object contains the gl table:
db_gl <- tbl(con, 'gl')

db_gl
## # Source: table<gl> [?? x 9]
## # Database: sqlite 3.30.1
## # [/Users/jon/Documents/R/auditanalytics/data/rauditanalytics.sqlite]
## je_num amount gl_date gl_date_char vendor_id account invoice_date
## <int> <dbl> <dbl> <chr> <int> <chr> <dbl>
## 1 1 22667. 18232 2019-12-02 2211 exp_ma… 18225
## 2 1 -22667. 18232 2019-12-02 NA liab_a… 18225
## 3 2 1397. 18228 2019-11-28 2612 exp_ma… 18218
## 4 2 -1397. 18228 2019-11-28 NA liab_a… 18218
## 5 3 2319. 17995 2019-04-09 2612 exp_ma… 17985
## 6 3 -2319. 17995 2019-04-09 NA liab_a… 17985
## 7 4 20037. 18199 2019-10-30 2612 exp_ma… 18189
## 8 4 -20037. 18199 2019-10-30 NA liab_a… 18189
## 9 5 43140. 18051 2019-06-04 2612 exp_ma… 18041
## 10 5 -43140. 18051 2019-06-04 NA liab_a… 18041
## # … with more rows, and 2 more variables: description <chr>, paid_date <dbl>
Not only is it pointing to the same table, but it's also performing the same query, even though it was composed using R and dplyr:
db_gl %>%
  show_query()
## <SQL>
## SELECT *
## FROM `gl`
There are a few differences, though, between these approaches:
- dbGetQuery() returns a data.frame that is immediately usable, while the table connection created via tbl() gives a preview of the query results that haven't been formally downloaded from the database. To use the data within R with tbl(), a further collect() is needed. Instead of performing a data download every time, it is advantageous to preview the results before collecting them.
- dbGetQuery() requires the entire SQL query to be pre-written, and can not be further leveraged within the database. However, the tbl object can still be further manipulated. This is especially useful when creating new fields or performing joins in the same database.
Let's say that you wanted to filter for amounts greater than $75,000 in the database. In the SQL method, you would need to type a whole new SQL query:
dbGetQuery(con, 'select * from gl where amount > 75000')
## je_num amount gl_date gl_date_char vendor_id account
## 1 140 96354.61 18249 2019-12-19 2211 exp_materials_6000
## invoice_date description paid_date
## 1 18242 Packaging and boxes 18279
## [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
In the dplyr method, you only need to build on the connection object already established. The queries, and the results, are identical:
db_gl_threshold <- db_gl %>%
  dplyr::filter(amount > 75000)

db_gl_threshold
## # Source: lazy query [?? x 9]
## # Database: sqlite 3.30.1
## # [/Users/jon/Documents/R/auditanalytics/data/rauditanalytics.sqlite]
## je_num amount gl_date gl_date_char vendor_id account invoice_date description
## <int> <dbl> <dbl> <chr> <int> <chr> <dbl> <chr>
## 1 140 96355. 18249 2019-12-19 2211 exp_ma… 18242 Packaging …
## 2 693 88524. 18121 2019-08-13 2726 exp_ma… 18114 Quality co…
## # … with 1 more variable: paid_date <dbl>
show_query(db_gl_threshold)
## <SQL>
## SELECT *
## FROM `gl`
## WHERE (`amount` > 75000.0)
And when you’re happy with this, simply collect()
to save the query results as a data frame. To be nice to your database administrators, you should also disconnect from the database to free up a connection.
gl_threshold <- db_gl_threshold %>%
  collect()
dbDisconnect(con)
We will go through an advanced application of setting up database queries in Audit Package Creation, which builds upon leveraging dplyr to create the pieces that enable powerful queries.
5.3 APIs
As more applications are hosted in the cloud, it is an important skill to obtain information from them without resorting to manually triggered reports. Data can be accessed from these systems via an Application Programming Interface (API), typically exposed via a method called Representational State Transfer (REST). An API allows one application to talk to another. Some examples of APIs in action are web sites with search functions, loading a list of comments other users have published, or determining who is friends with whom. REST advises how the endpoint is structured, and also suggests how the data is returned.
While APIs may appear to be daunting, retrieving data from them is not difficult at all. A user needs to know what they want, then:
- match what is desired to the 'endpoint,' a pre-defined URL or path where the data resides,
- send a valid request to GET the data, and
- receive the response.
Here is an example. Let's say you want to see what repositories I have available to the public on GitHub.
- You look up the GitHub API documentation, and discover you can list what repositories a user has.
- You formulate your request, and the request URL becomes https://api.github.com/users/jonlinca/repos
- Then you send the above URL using the httr package and the GET command.
library(httr)
library(jsonlite)
response <- GET('https://api.github.com/users/jonlinca/repos') # Endpoint

response
## Response [https://api.github.com/users/jonlinca/repos]
## Date: 2021-02-13 03:55
## Status: 200
## Content-Type: application/json; charset=utf-8
## Size: 33.9 kB
## [
## {
## "id": 129155075,
## "node_id": "MDEwOlJlcG9zaXRvcnkxMjkxNTUwNzU=",
## "name": "ACLandR_DeployingModels",
## "full_name": "jonlinca/ACLandR_DeployingModels",
## "private": false,
## "owner": {
## "login": "jonlinca",
## "id": 29468012,
## ...
The structure of the data from an API looks different from a data frame. Typically there is an overall response status (indicating success or the nature of the failure), the type of content (in this case, 'application/json'), and the data itself. The entire download is usually stored as a list with multiple references:
names(response)
## [1] "url" "status_code" "headers" "all_headers" "cookies"
## [6] "content" "date" "times" "request" "handle"
And the data you want will be in the response's $content:
head(response$content)
## [1] 5b 0a 20 20 7b 0a
This raw data is generally uninterpretable because it is stored in a different format - in this case, the Content-Type above indicates it is a JSON file. It's quite easy to convert data in this format into something more understandable with the httr content() function and the jsonlite fromJSON() and toJSON() functions:
content <- httr::content(response)

cleancontent <- jsonlite::fromJSON(jsonlite::toJSON(content), flatten = TRUE)
# Print it out
cleancontent[, c('id', 'name')]
## id name
## 1 129155075 ACLandR_DeployingModels
## 2 164948874 ACLandR_KMeans
## 3 269707027 auditanalytics
## 4 183726578 calgaryr
## 5 277917653 galvanizer
## 6 298054213 onepass
Some data sources may be difficult to obtain data from, or perhaps you're not yet at the technical skill level to develop your own data connectivity for APIs. One alternative is a third party tool - for example, CData supports an interface that allows you to interact with programs like Office 365 (including emails and file audit logs) directly. Office 365 also offers APIs that expose SharePoint and Outlook, although they take more time to understand.