Chapter 3 Architecture
The architecture and tools you select should support and amplify your team, and not be a burden to maintain. Maintenance of databases and applications can also be delegated to application support teams, so you can focus on implementing analytics and delivering data products.
Other things to consider: You should have, at a bare minimum, direct access to read-only internal databases or data warehouses. A generic service user account has its merits here when it comes to automation - tying an eventual automation to a user with password expiry every 90 days will make updating passwords feel like quarterly financial reporting, and also increases the risk that the entire analytics program is dependent on a single person to manage credentials.
Any software you consider should be able to talk to your company’s internal and external applications. Be wary of software that locks you in or makes access to data cumbersome, as it limits your ability to integrate with your company’s tools as you gain sophistication.
3.1 R and RStudio
R is a programming language with an emphasis on statistics. It is considered free software, free in the perspective that you can run, distribute, and change it as you like.
What makes R so great is the large suite of packages that are available to use to help analyze information. This is distributed through CRAN, which is the central repository for R packages that meet minimum quality standards. What doesn’t exist in base R will likely have been developed by someone else: there are hundreds of packages that support data connectivity (DBI, xlsx), day-to-day manipulation and analysis (dplyr, ggplot2), and even auditing (jfa, MUS).
As R is a language, you may want to consider an application to code in, similar to how you may write memos in Microsoft Word. RStudio Desktop is an integrated development environment (IDE) that has an open source version and is free to use. There are also free versions of RStudio Server, as well as commercially supported versions of its Desktop, Server, and two unique products that we will go into more later - Package Manager and Connect.
With both R and RStudio installed, you can perform the minimum requirements of your audit data analysis career: you can download data, wrangle it, visualize it, and save completed analyses. The potential is limitless though, as it enables all the other technologies to operate with it - machine learning, automation, dashboarding, and code sharing.
3.2 Code repositories
Here is a typical logistical challenge faced in even the smallest of audit teams. Person A will write the first version of a script to download data (version 1). Person B in the team may want to use it to download data from elsewhere, so they will get an emailed script file from person B, modify it, and start to use it. Person A may make improvements to their own file (now version 2), but Person B will not benefit from the improvements. Knowledge are immediately siloed, and changes become more difficult to share between auditors.
One of the most effective ways to solve this problem is to leverage a code repository (also known as version control). While version control has several software engineering advantages, the most notable advantages for audit teams are:
- Centralizing domain knowledge and sharing code,
- Tracking and tracing changes to code throughout time (including who, what was changed, and when), and
- Ability to test with an isolated code set before rolling out to production.
Code versioning technologies resonate closely with IT General Controls and even the COBIT framework.
git is a type of version control. Another free, open source software, and the basic usage of the tool is accessible.
The basics of git are:
- Pull code from the remote server to the local computer.
- Write and save code on your local computer.
- Commit code on your local computer.
- Push code from your local computer to the remote server.
While this is a superficial example of how to use git, it is enough to get the most basic of audit teams started. Trying to master git may take you many months of hands-on experience and collaboration. If a git technical issue results in infinite frustration, just remember that you can always make a new folder and start fresh.
Several different server technologies support git - GitHub, Azure Repos within Azure DevOps, gitlab are all willing to host your code. If you’re using these to hold proprietary or sensitive code, it would be wise to get your company’s support and pay for the ability to have a private repository that only your team can see.
If you’re learning, GitHub is a fantastic free source to set up shop at. Within your own company, see what the IT and developers are using, as its far more convenient to jump onto that. Like the R tutorial, Happy Git and GitHub for the useR by Jenny Bryan et. al. is a practical guide to installing R, setting up keys and the core fundamentals.
As you write more code and templates, you will eventually want to share these new techniques with others on your team. Packages put your best practices together, including templates, functions to solve common problems, and templates for common workflows. In short: packages contain tasks to help you do your job and save you time.
Packages go beyond the tangible and provide several qualitative benefits as well. They standardize your team’s workflow, create consistency and readability standards, and get your team to speak a single language. It creates a cohesive foundation where anyone on the team can contribute, and a library for those who wish to learn more.
How do you share packages? You certainly don’t not need to force your team to copy-and-paste code or even a templates from a central folder! The most elegant way to distribute your code is via hosting packages on a code or package repository. These repositories (hosted internally or externally) enable your team to download best practices with a simple line of code. You can also compile your packages and leave them on a network drive.
The best part? You can give your package a creative name that represents the culture of your internal audit team or organization. Just don’t call it auditR!
3.3 Data products
All auditors face this problem at one point or another. An audit finding doesn’t matter until someone takes accountability for it. And if an audit finding is too vague, or does not have the buy-in from the right stakeholders, it is as good as unsolicited advice.
The data analysis you deliver are vulnerable to the same expectations. Not only do they need to be objective and accurate, but they also need to be accessible, relevant, and delivered at the right time. If there are too many barriers for entry or usage, someone will revert back to their old processes and methods. Here is a sobering thought: when was the last time you made a beautiful visualization for someone, only to be asked for the detailed Excel worksheet afterwards? Or instead of an automated report that users can self-serve, someone asked you to just forward them an email instead?
Data products are deliverable that enhance your customer’s ability to gather, process and analyze information. They facilitate decision making and are integrated into their processes. They should be designed thoughtfully and simply, masking the complexity underneath. In short, they should make your co-auditors feel like rockstars. The reports you write and code you develop should strive for the goal of being accessible by your team, on-time and on-demand.
3.3.1 Galvanize Highbond
Galvanize, the parent company of ACL Analytics and one of the most popular companies in the audit software industry, invested significant efforts into their cloud-based working paper solution, Highbond.
Highbond Results is unrivaled for audit exception and remediation workflow. The idea with an audit exception workflow is that audit testing will identify an actionable transaction or outcome. This may be an exception within a process, a control requiring execution, or even a request for additional information or clarification. Once a process has been designed, Highbond Results will allow you to focus on the users who should action the workflow and the rules for setting up triggers.
Highbond Results also provides capabilities to do no-code based visualizations hosted on the web. Once set up, they offer a stable method of delivering storyboards and visualizations.
While Galvanize and its products (most notably, Projects and Results) are distinct products not directly in the R universe, it does integrate with R through its Application Programming Interface (API). An API allows you to interact with a cloud service via code, and Highbond’s API enables you to upload findings and results directly into a Results set. This enables you to handle the data on R, and then upload the analyzed results online so you can create the workflows on the website. The API also enables access to its Projects data that your audit team may already use to document audits, offering advantages to audit teams that design their workpaper environments effectively.
Galvanize cloud-based tools are fully hosted, meaning audit teams pay for high availability and security maintained by a professional team. Galvanize supports the security, design and coding for hosting a tool online, allowing you to focus on designing workflows for your internal customers.
3.3.2 RStudio Connect
RStudio offers commercially supported server software, including the RStudio Connect solution that allows R deliverable to be hosted and accessible via a web browser. Features include:
- R Markdown notebooks, which are fully self-contained analytics. These notebooks perform full analyses from start to finish, including downloading data, wrangling, analysis and data visualization, and can be scheduled on a regular basis.
- These notebooks can also act as Extract, Transform, and Load (ETL) processes. These notebooks have the advantage of being automatically scheduled, with rules that can notify stakeholders if need be. The loading component can be any destination - most popular is the audit data mart, or into other web applications via API connectivity (example: Galvanize Highbond Results).
- The hosting of Shiny apps, which are interactive web applications, offer a way to analyze and present information in an intelligent, slick manner. The analysis performed in R can be factored into a Shiny app, which can be hooked directly into your data.
For audit teams with expertise in programming, RStudio Connect offers some of the best capabilities for publishing visualizations and analysis to your teams and internal stakeholders. With its git capability, it can also receive updates from a code repository, integrating tightly with a team’s best practices of a code repository.
RStudio server software will require talent and cost to stand-up and maintain, and should be considered in an environment where automation and internal hosting of data products will bring advantages to an internal audit team.
For those socially conscious teams, you can check off the box by going with RStudio, which is a Certified B Corporation. A company that has has the Certified B Corporation is obligated to the mission of the company, and not to its shareholders. This means that they are to balance both profit and purpose. RStudio has long contributed to free and open-source data science with its packages that are widely adopted within the R ecosystem, and is a significant driver to creating free and accessible education for individuals all over the world.
3.3.3 RStudio Package Manager
As your team develops more code and functions, there becomes a greater need to distribute these best practices easily. RStudio Package Manager offers the capability to distribute code packages to your coworkers and even the broader organization. By integrating with your code repository, it can bundle new functionality added by your team and distribute it. It offers versioning of packages, for those audit environments where reproducability is paramount.
An alternative is also miniCRAN, a reduced feature set yet free and open source version, or even simply hosting packages on a network drive.
3.4 Data sources
To create and host a data product that is always up-to-date, it needs to be directly connected to data sources and can grab data on-demand to refresh itself. Any manual steps in this process, whether manually triggering a report on a website, or waiting for an email with a data file attachment, means the data has an intolerable source of error - the person who is tasked with manually copying the file to the folder and manually refreshing the data product.
A modern audit team will need direct, programmatic access to the data being audited. By programmatic, we mean that code can be used to access data, versus going through a front-end graphical user interface or website. This can be achieved whether internal or external data sources are used.
3.4.1 Internal databases
At its core, databases hold transactional information that runs the business. Databases generally can fit two separate use cases: Online Transaction Processing (OLTP), for high speed transaction writing, and Online Analytic Processing (OLAP), for analyzing to support business decisions. Typically OLTP databases act as a source of truth, and send updates to the OLAP database.
The language of choice to access internal databases is Structured Query Language (SQL). As a defacto standard for accessing databases since 1970’s, all relational databases still leverage SQL. While each brand may have subtle nuances in the way SQL works, this essential language will allow your audit team to access a majority of sources within the company. Non-relational databases and NoSQL are becoming more mainstream, as well as graph databases, so your audit team will need to tool up as necessary.
The preferred approach is to get data from an internal data warehouse. A data warehouse mirrors the data activity from an application’s database - the application processes data and acts as an OLTP, and it will send any processed data to the data warehouse which is an OLAP. That way, critical business functionality can continue on the application with its dedicated computing resources, and intense queries can be directed to the data warehouse with its own set of computing resources.
Audit teams historically had a bad reputation for ‘bringing down the database,’ meaning that an attempt to download data crashed the database and made it unaccessible to the business. While this historically used to be a common catastrophe, nowadays computing power is so accessible and cheap that the fear is generally unwarranted. An auditor should still take precautions, and you should avoid querying any production database that supports customers or staff directly unless absolutely necessary. If needed, some other strategies are useful depending on the circumstances:
- Test queries in a test application database, before trying them in production,
- Filter the data with WHERE clauses,
- Perform joins in the database, instead of downloading multiple tables individually and joining them on the desktop,
- Download data in chunks or segments, split by day, week or month, or
- Schedule queries during off-business hours.
3.4.2 External sources
As more applications move to the cloud, using SQL to access data becomes more difficult. While some of this data can be brought into an internal data warehouse or used as an integration, more often than not transactional data is left online within the tool.
Certain online cloud software providers will make an Application Programming Interface (API) available for customers These APIs open a window to the cloud application, where subsets of data can be downloaded from the system. Each vendor may provide API documentation, and then can be accessed via packages like httr and digested with jsonlite.
By having the knowledge to access APIs from cloud software, audit teams gain significant autonomy in being able to download the data they need directly from a vendor. It does, however, add a layer of difficulty in obtaining data:
- Each API behaves differently - authentication, calls, and the data return all may vary between systems.
- APIs endpoints, the part that allows a tool to query it, tend to be highly specific-use cases, and may provide a limited scope of data at one time.
- Data may not be in the format that you want, or even exist - you are at the mercy what the API supplies.
- If an API does allow for a larger data download, it may be limited by pagination, where multiple results are spread out over multiple pages.
- APIs may be ‘rate-limited,’ which means it may restrict the number of queries, whether in parallel or in sequence.
Tips for audit teams needing to rely on cloud providers:
- API access should be part of the requirements before signing an agreement with a provider.
- Ask for, and generate, an API key for usage, and treat it like a password.
- If APIs are out of the skill of your team, consider asking your IT department to schedule a data pull into your audit data mart.
3.5 Audit data mart
While internal databases are accessible, generally they hold the entirety of the company’s data, which is far too much information. Only a fraction of the data is considered important and significant to the audit team.
An audit data mart is a key piece of infrastructure that will sit between your data products and data source. This data mart should contain highly specific, refined and cleaned data, which improves the speed and responsiveness of data products and data-on-demand. The audit data mart can also be secured to your audit team members, so confidentiality is not compromised.
To take full advantage of an audit data mart, an automated ETL process should connect directly to internal databases, and perform transformations to get to a clean end product. ETLs can be created and scheduled within RStudio Connect.