Analyzing 200,000 records may not seem like a big task. But when those records are security incidents with potentially hundreds of attributes each -- types of bad actors, assets affected, category of organization and more -- it starts getting a little complex for a spreadsheet. So Verizon's annual security report, which was initially done in Excel, is now generated "soup to nuts" in R.
In fact, the Verizon Data Breach Report is something of "a love letter to R," Bob Rudis, managing principal and senior data scientist at Verizon Enterprise Solutions, told the EARL (Effective Applications of the R Language) Boston conference earlier today.
R is "a lot of fun to work with," he said.
One of the main issues in deciding to move from a spreadsheet to R was the complexity of the data format. Verizon researchers receive incident data from contributing organizations as nested JSON, which means numerous categories also have subcategories. Importing and analyzing all that with Excel was problematic.
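To see why nested records sit awkwardly in a spreadsheet, consider a minimal sketch of flattening one such record into a single row. It is written in Python purely for illustration and is not Verizon's R pipeline. The field names and values are hypothetical and do not reproduce the real incident schema.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Collapse nested dicts into one flat dict with dotted column names."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into subcategories, carrying the parent name along.
            flat.update(flatten(value, full_key, sep))
        else:
            # Scalars and lists are kept as-is under the dotted name.
            flat[full_key] = value
    return flat


# Hypothetical incident record; all names and values are invented.
raw = """
{
  "incident_id": "example-001",
  "actor": {"external": {"variety": ["Organized crime"], "motive": ["Financial"]}},
  "asset": {"variety": ["Web application"]},
  "victim": {"industry": "Finance"}
}
"""

incident = json.loads(raw)
row = flatten(incident)

for column in sorted(row):
    print(column, "=", row[column])
```

Each level of nesting becomes part of a column name, so a record with hundreds of attributes and subcategories quickly turns into a very wide and sparse table. That is the kind of structure that is hard to manage by hand in a spreadsheet.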
There were other advantages in using R, Rudis said. Because R's ggplot2 package can produce sophisticated publication-quality graphics, the company saved an estimated $15,000 to $20,000 by no longer needing an external graphic design firm. The only change made to the R-created graphics prior to release was swapping in new fonts. "R [stinks] at fonts," Rudis said.
However, R has great tools for modeling, clustering and other statistical analysis that Verizon wants to do beyond counting, such as examining what attackers are likely to do depending on the type of organization. Even within financial services, he pointed out, top threats are considerably different for, say, banks compared with insurance companies.
The report team also used R to create interactive visualizations such as one that explores which industries have similar threat profiles.
The security data is recorded in an open-source format called VERIS, the Vocabulary for Event Recording and Incident Sharing. For those who would like to analyze publicly reported breach data, there is a VERIS Community Database as well as an R package called verisr that makes it easier to work with that data. Rudis and Jay Jacobs also authored a book, Data-Driven Security, which details how to use the VERIS schema and R to record and analyze security incidents.
There is considerably more data analyzed in the Verizon report than is available in the public database, including incidents sent in by agencies such as the U.S. Secret Service and FBI, Rudis said.
Among the lessons he said he learned about working with R to analyze this data:
- Use R Markdown to blend explanatory text with analysis and graphics. R Markdown "makes it super-amazingly awesomely easy to document, iterate, modify and share analyses," Rudis said.
- "Boil everything into packages," even internal analysis code you're not planning to share externally. This makes it easier to document functions and let others check your results.
- Version control such as Git is "vital to survive everything."
Other tools used in the project include GitLab for internal collaborative development and Slack for team communication; Rudis wrote an R package called slackr to make it easy to send analysis from R directly into Slack.
Also used: SurveyGizmo; Room.co for secure video chats (Google Hangouts was a non-starter because Google records those sessions, he said); GPG Suite for encrypting communications; and RStudio for working in R.
Rudis's slide presentation for the EARL Boston conference is available on SlideShare.