Getting started with Singer and Docker

Using Docker containers to automate the singer.io Google Analytics tap with the PostgreSQL target.



I’ve been doing a lot of work with data recently and wanted a way to join some of our platform data to our traffic data from Google Analytics.

After trying a few methods to get data out of Google Analytics and into our PostgreSQL database without much success, I discovered singer.io, a “Simple, Composable, Open Source ETL” tool. So I started playing around with the Google Analytics tap and the PostgreSQL target (https://www.singer.io/tap/google-analytics/postgresql/). After a lot of digging through various documentation and source code, I was able to get it working!

Once I had it running, it was time to automate it. To do this I set it up as a Docker container so it can run in the cloud (in our case AWS) on a schedule and pull in data from Google Analytics daily.

Here is the Dockerfile that puts it all together.

# Base image: Python 3.7 on Debian Stretch
FROM python:3.7.6-stretch

# Install the PostgreSQL target from PyPI
RUN pip install singer-target-postgres

# Install the Google Analytics tap from a local copy of its source
COPY ./tap-google-analytics-master/ ./external/tap-google-analytics-master/
WORKDIR /external/tap-google-analytics-master
RUN pip install -e .

# Copy in the tap and target configuration files
WORKDIR /
COPY ./singer/tap_google-analytics_config.json ./tap_google-analytics_config.json
COPY ./singer/client_secrets.json ./client_secrets.json
COPY ./singer/tap_google-analytics_reports.json ./tap_google-analytics_reports.json
COPY ./singer/target_postgres_config.json ./target_postgres_config.json

# Pipe the tap's output into the target, appending the emitted state to state.json
CMD tap-google-analytics -c ./tap_google-analytics_config.json | target-postgres -c ./target_postgres_config.json >> state.json
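With the Dockerfile in place, building and running the image looks something like the sketch below (the image tag is just an example). One thing to watch: since the target config further down uses localhost as the database host, the container needs network access to the database, e.g. --network host on Linux.

docker build -t singer-ga-sync .
docker run --rm --network host singer-ga-sync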

A few things to call out here. Installing the PostgreSQL target was straightforward, but for the Google Analytics tap I used an unofficial library and built it from the source at https://gitlab.com/meltano/tap-google-analytics. After that we have the various config files.

tap_google-analytics_config.json This is the config for the Google Analytics tap. The key_file_location property is the path to the JSON file containing the service account credentials, and the reports property points to a JSON file that defines the reports to pull from Google Analytics. The start_date and end_date properties define the date range to pull report data for.

{
  "key_file_location": "client_secrets.json",
  "view_id": "111111111",
  "reports": "tap_google-analytics_reports.json",
  "start_date": "2020-06-01T00:00:00Z",
  "end_date": "2020-06-15T00:00:00Z"
}

client_secrets.json This file contains the service account credentials. To get these, follow the instructions at https://gitlab.com/meltano/tap-google-analytics#creating-service-account-credentials.

tap_google-analytics_reports.json This file defines the report data being pulled from Google Analytics. A full list of the available dimensions and metrics can be found at https://ga-dev-tools.appspot.com/dimensions-metrics-explorer/. Here is an example of a report for pageviews.

[{
    "name": "ga_pageviews",
    "dimensions": [
        "ga:date",
        "ga:pagePath",
        "ga:source",
        "ga:medium",
        "ga:sourceMedium"
    ],
    "metrics": [
        "ga:pageviews"
    ]
}]

target_postgres_config.json This file is the config for the PostgreSQL target; it is mostly straightforward connection config.

{
  "postgres_host": "localhost",
  "postgres_port": 5432,
  "postgres_database": "ga_data",
  "postgres_username": "db_user",
  "postgres_password": "db_treasure",
  "postgres_schema": "public",
  "after_run_sql": ""
}
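As a quick sanity check after a run, the loaded rows can be queried directly. The table name below is my assumption, based on target-postgres creating one table per stream, named after the report's name property ("ga_pageviews" above):

psql -h localhost -p 5432 -U db_user -d ga_data -c "SELECT COUNT(*) FROM public.ga_pageviews;"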

Now that everything is set up to run in Docker, all that is left to truly automate this is updating the start_date and end_date properties in the Google Analytics config file dynamically. This can be done with a bit of bash.
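Here is a minimal sketch of what that could look like as an entrypoint script; the script name is hypothetical, and it assumes GNU date (present in the Debian-based image) and that start_date/end_date each sit on their own line in the config, as in the example above.

#!/bin/bash
# entrypoint.sh - rewrite the date range, then run the pipeline.

# Pull yesterday's data: start at yesterday midnight UTC, end at today midnight UTC
START_DATE=$(date -u -d "yesterday" +"%Y-%m-%dT00:00:00Z")
END_DATE=$(date -u +"%Y-%m-%dT00:00:00Z")

# Update the date properties in the tap config in place
sed -i "s|\"start_date\": \"[^\"]*\"|\"start_date\": \"$START_DATE\"|" ./tap_google-analytics_config.json
sed -i "s|\"end_date\": \"[^\"]*\"|\"end_date\": \"$END_DATE\"|" ./tap_google-analytics_config.json

# Same pipeline as the Dockerfile's CMD
tap-google-analytics -c ./tap_google-analytics_config.json | target-postgres -c ./target_postgres_config.json >> state.json

The Dockerfile would then COPY this script in and replace the inline pipeline with something like CMD ["bash", "./entrypoint.sh"], so each scheduled run pulls the previous day's data.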