How to extract custom data from Google Analytics
- 25 November, 2013 11:36
Google Analytics is a useful tool for measuring website usage -- everything from simple page views to the kind of complex ad campaign tracking marketers might need. However, I find the user interface to be, well, less than ideal. The good news is that Google Analytics provides a robust API that enables you to tap into your data programmatically, meaning you can conveniently pull and package data in ways that might not be as easy to do on the Web.
You don't need to know R to follow along with the steps here. In fact, after extracting data, you can save it to a CSV file to use in Excel, if you prefer.
Step one: Get R
First, if it's not on your system already, download and install R from the R Project for Statistical Computing website. When you run the R application, you'll see a console window where you can type in text commands. And, of course, make sure you've got a Google Analytics account and some data to work with.
The R console window is where you can type in commands.
There are several R packages available that have functions specifically designed for Google Analytics, including ganalytics, RGoogleAnalytics and rga ("R Google Analytics"). I'll be using rga for this tutorial, but any of them would work.
Like ganalytics, rga resides on GitHub. To easily install any of the Google Analytics packages from GitHub, first install and load the R package devtools by typing the following commands into the R console window:
Then install and load rga from package author Bror Skardhamar's account:
install_github("rga", "skardhamar") library(rga)
(You only have to run the first three commands once per machine, but you need to load library(rga) each time you open R.)
Step two: Allow rga to access your Google Analytics account
On a Mac, authentication is as easy: Create an instance of the Google Analytics API authentication object by typing the following in your R console window:
That will open a browser window that asks you to give rga permission to access your Google data. When you accept, you'll be given a code to cut and paste back into your R console window where it says, "Please enter code here."
In Windows, I find that adding a line of code before opening an rga instance helps with any authentication errors:
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))) rga.open(instance="ga")
Next, you need to find the profile ID for your Google account, which is not found in the tracking code that you add to a website to allow Google Analytics to monitor your site. Instead, on your Google Analytics Admin page, go to View Settings and you'll see the ID under "View ID."
You'll find your profile ID for your Google account by going to View Settings on your Google Analytics Admin page.
Or, run the command
in your R terminal window to get a list of all available profiles in your account; the profile ID will be listed in the first column.
Whichever way you find it, save that value in a variable so you don't have to keep typing it. You can use a command like:
id <- "1234567"
(Replace the number with your actual ID, and make sure to put it between quote marks.) This stores your profile ID as the variable "id."
Step 3: Extract data
Now we're ready to start pulling some data using the ga instance we just created. The getData method will actually extract data from your Google Analytics account that you can then store in another new R variable. If you want to see all available methods for your ga object, run:
You can query the Google API for metrics and dimensions. Metrics are things like page views, visits and organic searches; dimensions include information like traffic sources and visitor type. (See Google's Dimensions & Metrics Reference for full details.)
In addition, you can focus your query by criteria like visits from search, visits with conversions (assuming you've set that up in Google Analytics beforehand) and even visits just from tablets, by including segments in a query. Finally, you can also create your own filters to narrow your results.
Google's Query Explorer helps you figure out what data is available and how to structure a query.
Google has created a Query Explorer for the Google Analytics API. It's a great resource to help you figure out what data is available and how to structure a query. If you're new to the Google Analytics API, play around with Query Explorer for a bit to see what data you can extract and the variables you need to pull the data you want. Further information on the terms to use for various queries is available in the API documentation.
Once you decide on what you'd like to include in your query, here's the syntax for using R to get the data:
myresults <- ga$getData(id, start.date="", end.date="", metrics = "", dimensions = "", sort = "", filters = "", segment = "", start = 1, max = 1000)
You fill in information for your specific query between the various quotation marks, of course. Note that dates are in the format yyyy-mm-dd, such as "2013-10-30."
Here's a specific example: Say I want to see the top ten referrers for visits to my site in September. My start date is September 1 and my end date is September 30. My metric is visits -- called "ga:visits" by the API -- and my dimension is their sources -- called "ga:source."
I'll further refine the query to get just my top 10 referrers:
myresults <- ga$getData(id, start.date="2013-09-01", end.date="2013-09-30", metrics = "ga:visits", dimensions = "ga:source", sort = "-ga:visits", start = 1, max = 10)
Here's a breakdown of that query:
- ga$getData is using the getData method of my ga Google Analytics API-accessing object.
- The first argument, id, is the profile number for my account, which I already stored in a variable called id.
- Next are the start and end dates for my query, followed by the metric I want ( "ga:visits")).
- Since I want to know the visits by source, I specify the dimension as "ga.source".
- I only want the top 10 referrers, so I need to sort the ga:visits results in descending order. I do that on the next line by putting a minus sign in front of ga:visits when setting the sort criteria.
- Finally I specifically ask to start at the first result with a maximum of 10 to return at most 10 listings.
The results are stored in the variable myresults. Type
at the R prompt in your R terminal window to see what data has been returned.
The results from a query searching for a site's top 10 referrers.
If I wanted to see the overall number of visits without breaking it down by source, I wouldn't include the dimensions, sort, start or max in the query. Instead, I'd just use a simple:
myresults <- ga$getData(id, start.date="2013-09-01", end.date="2013-09-30", metrics = "ga:visits")
That returns a listing of number of visits per day. I can have it return results by different time periods by adding the time dimension of my choice -- for example by week:
myresultsPVsByWeek <- ga$getData(id, start.date="2013-09-01", end.date="2013-09-30", metrics = "ga:visits", dimensions = "ga:week")
Or, I can get page views for the entire year by month:
myresultsPVsByMonth <- ga$getData(id, start.date="2013-01-01", end.date="2013-12-31", metrics = "ga:pageviews", dimensions = "ga:month")
You can seek more than one metric at a time:
myresultsPVsVisits <- ga$getData(id, start.date="2013-01-01", end.date="2013-12-31", metrics = "ga:visits, ga:pageviews", dimensions = "ga:month")
(For those who know R and are used to combining items using R's concatenate c() function, you don't use that when combining items within a ga$getData query.)
Want to just see visits that came from, say, Google News each month this year? Add a filter, such as
myresultsGNvisits <- ga$getData(id, start.date = "2013-01-01", end.date = "2013-12-31", metrics = "ga:visits", filters = "ga:source=~news.google.com", dimensions = "ga:month")
I used =~ rather than == because the latter would set the filter to only those referrals that exactly equal news.google.com. By using the =~ operator instead, it uses more powerful regular expression searching, which in this case would match anything containing news.google.com. (Regular expressions allow much more robust pattern searching.)
As before, for each of these queries, type
(or the appropriate results variable) at the prompt in your R window to see what's returned.
The query has been refined to show the visits that came from Google News each month for a year.
Step 4: Manipulate your data
Now that you've got your data, what can you do with it?
If you're not an R enthusiast, the easiest thing is to save the results to a CSV file. R's write.csv() function first lists what you want to save and then the file name. To save the myresults variable to a file called data.csv, type:
write.csv(myresults, file="data.csv", row.names=FALSE)
The optional row.names=FALSE eliminates an extra column with the row numbers, just to keep the file uncluttered. The resulting file looks something like this (but hopefully with many more visits):
You can then use that data in the spreadsheet or graphing program of your choice.
You can also analyze your data right within R, of course, without exporting to a spreadsheet. Let me first pull some real data -- visits and page views -- from a personal site I set up years ago that I no longer tend to but that still gets occasional visitors:
mydata <- ga$getData(id, start.date="2013-01-01", end.date="2013-12-31", metrics = "ga:visits, ga:pageviews", dimensions = "ga:month")
Data on monthly visits to a site.
You can use R's str() function to find out how the mydata object is structured.
This shows how the mydata object is structured.
Like the other results above, it's an R data frame with character strings as the month number and numbers for the data. That makes it easy to run simple analyses and generate basic graphs within R, such as
barplot(mydata$visits, main="Visits by month", xlab="Month", names.arg=mydata$month, las=1, col=rainbow(9))
You can generate basic graphs within R, such as this one, which shows the number of visits to a site for each month.
The R barplot() command above uses the number of visits for the graph's y axis values (you can refer to a specific column in a data frame with the syntax dataframename$columnname) and names.arg as names on the x axis. The command main specifies the graph title, xlab is the x-axis label and col=rainbow(9) tells R to choose nine colors from its rainbow palette to color the bars. The nonintuitive command las=1 tells R to set both the x- and y-axis labels horizontally (0 makes them parallel to the axis, 2 perpendicular to the axis, and 3 vertical).
For more on creating graphs from R data frames, see our article "Beginner's guide to R: Painless data visualization."
Google Analytics is a powerful tool, but the Web interface is not always easy to navigate. If you'd like more customizable tools to extract data -- and easier automation of data requests -- consider using a programmatic approach with the Google Analytics API. And if you don't already have a favorite language for API work, R is a good choice.
Sharon Machlis is online managing editor at Computerworld. Her e-mail address is firstname.lastname@example.org. You can follow her on Twitter @sharon000, on Facebook, on Google+ or by subscribing to her RSS feeds:articles | blogs. See more by Sharon Machlis on Computerworld.com.
Read more about business intelligence/analytics in Computerworld's Business Intelligence/Analytics Topic Center.