Automating data analysis with R

Digital analytics manager of Open Universities lays out the benefits of using the open source programming language, as well as what to watch out for

Employing the open source R programming language for statistical analysis can reduce the pain of dealing with large and complex data sets. R is becoming more popular because it automates the analysis process and saves a substantial amount of time, according to Open Universities’ digital analytics manager, Johann de Boer.

Speaking at the Google Analytics User Conference in Sydney, de Boer demonstrated how he could complete a full analysis in minutes using R: fetching and cleaning up data, merging data sets, processing the data and visualising the results.

“To do that manually, it would have taken maybe a week to be able to fetch all the data and clean it up and so on. With R you can do all of that a lot quicker, and because it’s reusable it means you can use parts of the analysis in other analyses you do in future so it doesn’t go to waste,” he said.

“Because it’s all written out in a script, if you make a mistake along the way it’s very easy for you to identify where you made that mistake. You can go back into your script and you can fix the mistake and then re-run the script and have all of the results updated.”
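The workflow de Boer describes (fetch, clean, merge, process, visualise) can be sketched in a few lines of base R. This is only an illustration, not the script he demonstrated: the two data frames, their column names and their values are invented stand-ins for data that would normally be fetched from an analytics service or a public data source.

```r
# Hypothetical sample data standing in for fetched data sets.
# All names and values here are made up for illustration.
visits <- data.frame(
  date   = as.Date(c("2014-03-01", "2014-03-02", "2014-03-03", "2014-03-04")),
  visits = c(120, NA, 95, 140)
)
weather <- data.frame(
  date    = as.Date(c("2014-03-01", "2014-03-02", "2014-03-03", "2014-03-04")),
  rain_mm = c(0, 12.5, 3.2, 0)
)

# Clean: drop rows where the visit count is missing.
visits_clean <- visits[!is.na(visits$visits), ]

# Merge: join the two data sets on their shared date column.
combined <- merge(visits_clean, weather, by = "date")

# Process: flag rainy days, then average visits for rainy and dry days.
combined$rainy <- combined$rain_mm > 0
summary_tbl <- aggregate(visits ~ rainy, data = combined, FUN = mean)

# Visualise: a simple bar chart of mean visits by weather.
barplot(
  summary_tbl$visits,
  names.arg = ifelse(summary_tbl$rainy, "Rain", "Dry"),
  ylab = "Mean visits"
)
```

Because every step lives in the script, correcting a mistake (for example, changing how missing values are handled) only requires editing that one line and re-running the file to refresh every downstream result.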

Another benefit of using the language is that it can eliminate the need to go and manually fetch data from different sources, de Boer said.

“If I wanted to get data from the Bureau of Meteorology website, some population statistics data or health data and I want to bring all of it into one place, there’s packages available that makes that easier.

“Even though people are doing different types of data analysis, they all benefit from each other’s work in the open source world because one person who is working in medical science field can solve a data analysis problem that can be used by someone in the Web analysis field.

“It’s all about numbers, really. You want to be able to analyse those numbers or in the case of analysing unstructured data there are similar problems you need to solve.”

However, R is not a ‘one size fits all’ solution for every kind of data analysis, de Boer said. It may not save time for people who mostly do one-off analyses or who are unlikely to reuse a given analysis process.

Also, it can take a lot of time to get familiar with the language, even for a data analyst, he said.

“There does need to be that investment [in time for training] up front which is a big investment to make personally. Give it maybe three to six months to get to a point where you can start to use it or start to apply it to your work.

“[However], a major benefit of using R as open source over the other languages that are proprietary and commercial like SPSS and SAS and so on, which are also very powerful languages for data analysis, the R community is easy for people to get access to and to learn quickly.”
