When you have a busy site, large data can become a problem. NEXT can help with that.
The Problem with Large Data
When your site generates large amounts of data, one of two things happens:
- your reports get sampled data, or
- they take a long time to refresh because the software is constantly re-fetching historical data.
Avoid both these problems with NEXT.
What is Data Sampling?
Sampling in Google Analytics (or in any analytics software) refers to the practice of selecting a subset of data from your traffic and reporting on the trends detected in that sample set.
When does Google Analytics use Data Sampling?
Google Analytics applies sampling when a query exceeds either of these limits:
- 500,000 sessions (where the data is not already stored), or
- 1,000,000 maximum unique dimension combinations.
For a discussion on data sampling in Google Analytics, see: https://developers.google.com/analytics/resources/concepts/gaConceptsSampling
What do the Google Analytics Limits mean?
- If you build queries with many dimensions, the combination of all the possible dimension values might cause you to hit the 1,000,000 point.
- If you query a range of dates, and you are asking for data by the day, then the sum of all the unique combinations for each day may cause you to reach the limit.
- Certain dimensions, such as pagePath and keyword, are prone to testing the limits.
- Some dimensions have enough distinct values that they become multipliers.
For example: asking for pagePath by keyword for each day in the past year. If you have 100 pagePaths, 100 keywords throughout the year, and 365 days in a year, then 100 x 100 x 365 gives you 3,650,000 unique combinations to report. On most sites, these numbers are much higher. NEXT Analytics lets you divide large queries into smaller ones, and then join the results afterward.
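To see why splitting helps, the arithmetic above can be checked with a short Python sketch (the counts are the illustrative ones from the example, not real figures from any account):

```python
# Illustrative numbers from the example above (not real account data).
page_paths = 100
keywords = 100
days = 365

# One big query over the whole year multiplies all three together.
yearly_combinations = page_paths * keywords * days
print(yearly_combinations)   # 3650000: well past the 1,000,000 limit

# One query per day only multiplies the dimensions for that day.
daily_combinations = page_paths * keywords
print(daily_combinations)    # 10000: far below the limit
```

Splitting by day trades one over-limit query for 365 queries that each stay comfortably under the combination cap.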
The Solution to Large Data
A typical small-data dashboard performs three steps internally:
- Step 1: Query the server. For large data, break this into a separate process that’s independent of the other steps.
- Step 2: Process the data. For large data, read the data from local files; don’t assume that Step 1 occurs dynamically.
- Step 3: Put the data in a worksheet for the Excel author to present.
Break the single large query into smaller ones, then concatenate the results. Put Step 1 in one workbook, and put Step 2 and Step 3 in a different workbook. Then break Step 1 into smaller parts, until there is no more data sampling. For example, you could query one day at a time.
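NEXT performs the equivalent concatenation for you when it imports your data files; the split-then-join idea can be sketched in plain Python (all file names and values here are made up for illustration):

```python
import csv
import glob
import os
import tempfile

# Hypothetical setup: three small daily extracts, as Step 1 might produce.
workdir = tempfile.mkdtemp()
for day in ("2013-09-28", "2013-09-29", "2013-09-30"):
    with open(os.path.join(workdir, f"ytd_{day}.csv"), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["date", "pagePath", "sessions"])  # same layout in every file
        writer.writerow([day, "/home", 123])

# Step 2: read every small file back and concatenate the rows.
rows = []
for path in sorted(glob.glob(os.path.join(workdir, "ytd_*.csv"))):
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # layout must stay constant across files
        rows.extend(reader)

print(len(rows))  # one data row per daily file
```

The join only works because every file shares the same column layout, which is why the article stresses finalizing the report format before you start accumulating files.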
How to convert a small data dashboard into a large data dashboard
This article will only describe what steps to take in general. If you are NEXT Script-savvy, you can perform these steps yourself. If not, don’t despair. At least now you know it’s possible and you can contact our support team to discuss one-on-one training and/or an implementation plan to see you through your first instance.
- Create multiple queries, each one small enough that you don’t get data sampling.
- a daily query, and/or
- segmented by dimension values e.g. countries; and/or
- segmented by type of content e.g. parameters on pagePath.
- Put these smaller queries into their own workbook, and refresh them as frequently as you need. Each query can be made to generate a unique data file that persists on your computer. The format of the query must remain constant throughout history, so only do this once you’ve finalized the layout of the report you’re building.
- In the (separate) dashboard workbook, remove the rows from the _actions worksheets that contain the query script lines. These are the ones that start with “Getdata”.
- Modify the remaining ImportDataFiles script command to read in multiple unique file names, as described in the next section.
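As a hedged illustration (the query parameters are omitted and both file names are made up), the _actions script might change from something like:

```
Getdata,...
ImportDataFiles,gadata.csv
```

to just:

```
ImportDataFiles,ytd*.csv
```

The queries now live in their own workbook, and the dashboard simply reads whatever data files they have already produced.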
Change ImportDataFiles so that it loads multiple data files at once
ImportDataFiles is often used to load a single CSV file into memory, but it can load more than one file at a time.
Being able to load multiple files at once is a powerful enabler for large-data handling.
ImportDataFiles can accept:
- multiple file names
- a zip file (full of similarly structured CSV files)
- a directory name, in which case it loads every file in that directory (requires v5.5 or later)
- wildcards, e.g. gaxyz*.csv (requires v5.5 or later)
This creates a few options to avoid data sampling.
- Create multiple small queries, and give each one a name with a common prefix such as “ytd”. The script line can then read: ImportDataFiles,ytd*.csv.
- Create multiple small queries, and put the results in a zip file. ImportDataFiles can open the zip file and will automatically load the matching CSV files within.
- Create multiple small queries, and move the data files to a subfolder dedicated to that one report. Then specify the folder name to the ImportDataFiles command.
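All three options come down to expanding one argument into a list of files; a rough Python sketch of that resolution step (the names are hypothetical, and the real command handles details this sketch ignores):

```python
import glob
import os
import tempfile
import zipfile

def resolve_data_files(source):
    """Roughly mimic how a loader could expand one argument into CSV paths."""
    if os.path.isdir(source):                    # a directory: every file in it
        return sorted(os.path.join(source, n) for n in os.listdir(source))
    if source.lower().endswith(".zip"):          # a zip full of similar CSV files
        with zipfile.ZipFile(source) as zf:
            return sorted(n for n in zf.namelist() if n.lower().endswith(".csv"))
    return sorted(glob.glob(source))             # a wildcard such as ytd*.csv

# Hypothetical demo: two daily extracts sharing the "ytd" prefix.
workdir = tempfile.mkdtemp()
for name in ("ytd_2013-09-29.csv", "ytd_2013-09-30.csv"):
    open(os.path.join(workdir, name), "w").close()

by_wildcard = resolve_data_files(os.path.join(workdir, "ytd*.csv"))
by_folder = resolve_data_files(workdir)
print(len(by_wildcard), len(by_folder))  # 2 2
```

Whichever form you pick, the key is that new daily files match the same prefix, zip, or folder, so the dashboard picks them up without any script changes.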
Since your Step 1 workbooks will generate one or more new data files each day for each report, we advise you to:
- generate a data file that has a unique name that identifies it both for the report and for the calendar date; and
- move it to a sub-directory, below the nextanalytics\data directory.
That way, you can rely on ImportDataFiles’ ability to read all the files from a folder at once. We’ve created a few new script commands that will make it easier to implement this workflow.
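That naming-and-filing step can be sketched in Python (the report name, date, and directory layout are all hypothetical; in NEXT itself you would use the script commands described below):

```python
import datetime
import os
import shutil
import tempfile

# Hypothetical layout standing in for the nextanalytics\data directory.
data_dir = tempfile.mkdtemp()
report_dir = os.path.join(data_dir, "ytd_report")
os.makedirs(report_dir, exist_ok=True)

# Step 1 just produced today's extract with a generic name...
raw_path = os.path.join(data_dir, "gadata.csv")
open(raw_path, "w").close()

# ...so rename it to identify both the report and the calendar date,
# then file it under the report's own subfolder.
today = datetime.date(2013, 9, 30)  # fixed date for a reproducible example
unique_name = f"ytd_{today.isoformat()}.csv"
final_path = shutil.move(raw_path, os.path.join(report_dir, unique_name))
print(os.path.basename(final_path))  # ytd_2013-09-30.csv
```

Because each day's file lands in the report's own subfolder with a unique name, pointing ImportDataFiles at that folder always picks up the full history.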
This is our recommended workflow for each of your reports that handles large data. This technique helps you avoid data sampling, and avoid re-downloading historical data every time you refresh a dashboard.
Script Commands That Make Production of Large Data Reporting Easier
After running a query, you can append a script line that renames and moves a file to a folder.
- If you specify a wildcard in the second parameter, it will iterate on the matching files.
- If you don’t supply a dest_dir, it uses the data directory.
- If you don’t supply a new file name, it keeps the name of the file being iterated upon.
- dest_dir is a descendant of the data folder.
This helps you take advantage of the fact that ImportDataFiles will read all the files in a folder when you supply it the folder name.
Select Files using the [PERIOD] parameter on ImportDataFiles
As of v5.5, ImportDataFiles accepts a [PERIOD] parameter. When a [PERIOD] is present, ImportDataFiles loads only the files whose Last Write Time falls within the range of dates given by the [PERIOD] parameter.
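Conceptually, [PERIOD] acts like a filter on each file's last write time. A hedged Python sketch of that idea (the dates, file names, and noon-timestamp trick are illustrative assumptions, not how the command is implemented):

```python
import datetime
import os
import tempfile

def files_in_period(folder, start, end):
    """Keep only files whose last write time falls inside [start, end]."""
    selected = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        written = datetime.date.fromtimestamp(os.path.getmtime(path))
        if start <= written <= end:
            selected.append(name)
    return selected

# Hypothetical demo: two files, datestamped a month apart via os.utime.
folder = tempfile.mkdtemp()
for name, day in (("aug.csv", datetime.date(2013, 8, 31)),
                  ("sep.csv", datetime.date(2013, 9, 30))):
    path = os.path.join(folder, name)
    open(path, "w").close()
    stamp = datetime.datetime.combine(day, datetime.time(12)).timestamp()
    os.utime(path, (stamp, stamp))

print(files_in_period(folder, datetime.date(2013, 9, 1), datetime.date(2013, 9, 30)))
# ['sep.csv']
```

Filtering on write time is what makes the next section useful: if a file's datestamp reflects the dates its data covers, a [PERIOD] query naturally selects the right files.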
Run queries that go back in history, and alter their datestamp to identify the dates covered by the query
Touchfiles works in conjunction with the ImportDataFiles [PERIOD] parameter, which selectively loads files based on their last write time. Touchfiles datestamps a file according to a date that you specify.
The date is provided in yyyy-MM-dd format; e.g., September 30th, 2013 is 2013-09-30.
Each matching file will be given the date you specify, in this example 2013-09-30.
If [folder] has a value, then the program operates on files in that folder (starting at the nextanalytics\data directory).
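The behavior described above can be approximated in Python (the folder, wildcard, and date are hypothetical, and this is a sketch of the described behavior, not the command's implementation):

```python
import datetime
import glob
import os
import tempfile

def touch_files(folder, pattern, date_text):
    """Set the write time of every matching file to the yyyy-MM-dd date given."""
    day = datetime.datetime.strptime(date_text, "%Y-%m-%d")
    stamp = day.replace(hour=12).timestamp()  # noon, to stay on the right date
    matched = sorted(glob.glob(os.path.join(folder, pattern)))
    for path in matched:
        os.utime(path, (stamp, stamp))
    return matched

# Hypothetical demo inside a throwaway folder.
folder = tempfile.mkdtemp()
open(os.path.join(folder, "ytd_part1.csv"), "w").close()
open(os.path.join(folder, "ytd_part2.csv"), "w").close()

touched = touch_files(folder, "ytd*.csv", "2013-09-30")
written = datetime.date.fromtimestamp(os.path.getmtime(touched[0]))
print(written)  # 2013-09-30
```

Once historical extracts carry datestamps matching the dates they cover, an ImportDataFiles call with [PERIOD] loads exactly the slice of history a report needs.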