Chapter 4 Tools of the trade

Before you start engaging in analysis development work within your agency, you’ll need to make sure you have the right tools to get the job done.

There are two software tools every analyst should utilize: a script-based statistical software platform and a version control system.

4.1 Data analysis software

The most commonly used data analysis software is Microsoft Excel. While nearly every analyst cuts their teeth crafting spreadsheets and pivot tables in Excel, it’s not an ideal platform for professional analysis developers. Excel’s cell-based structure obscures the formulas that drive analysis work, making it hard for others (and often yourself) to understand what’s actually going on “under the hood.”

Scripting-based statistical software packages are the cure for the common spreadsheet. They make the entire analysis process transparent by explicitly stating what actions will happen and the order in which they will be executed. This helps to not only clarify and codify your analytic approach, it also makes it much easier to catch bugs in your analyses.

4.1.1 The case for R

There are many software packages that are well-suited for data analysis, but I use and heartily recommend using R as your main platform for data analysis. R is free (a very important feature to analysis developers working in the public/non-profit sectors), open-source, and has a vibrant community of developers and practitioners working to improve the language.

One of the best features of the R language is the comprehensive and diverse collection of packages available on CRAN (the Comprehensive R Archive Network). Packages are the way users in the R community share the tools they create - they contain some combination of R code, documentation, and data. Some of the most popular packages allow users to easily perform statistical operations, create plots, and manipulate their data.

R users are also able to take advantage of one of the best integrated development environments (IDE’s), RStudio. This freely available software allows users to interact with an R console, write scripts, view plots, and more through a graphical user interface (GUI). RStudio also has built-in functionality to work with some of the most popular version control systems, including Git and GitHub (more on that later).

Perhaps the greatest advantage R has over other data analysis software packages is its community. Package development and support is a big part of this, but the R community is also very active in supporting the growth and development of new R users. There are many free ways to get started learning R, from courses on DataCamp and Coursera to online books. If you’re looking for a place to get started, I recommend R for Data Science.

4.2 Version control

At some point, we’ve all worked on a project where we’ve saved multiple versions of a file. “Save as…” is a great tool for simple projects when you’re going from “v1” to “v2” to “final_draft,” but it’s very easy to push on the edges of this approach’s usefulness.

If you want to go back to an earlier iteration of your work, it’s hard to know which version you’re looking for or what changed from “v3” to “v4.” If you’re working with multiple people, emailing documents back and forth, it’s amazing how quickly you can lose track of which “v3FINAL” document is the version you want to be looking at right now - is it the one Jane sent yesterday or is the one Johnny sent today?

Version control systems help to alleviate these problems. Instead of requiring you to save multiple versions of one file, version control software allows you to have one file, but track all of the changes made since its inception.

The most commonly used version control system is Git, which can be paired with the GitHub service to keep a copy of your documents and the changes you’ve made to them in the cloud. As mentioned above, RStudio integrates well with these services. To learn more about getting started with Git, GitHub, and RStudio, check out Happy Git and GitHub for the useR