BigQuery integration

Google Cloud Platform (GCP) BigQuery is a serverless, columnar data warehouse that lets you analyze data without managing the underlying infrastructure. It also lets you visualize your data with an integrated tool called Data Studio. You can now integrate DataStream with BigQuery to find meaningful insights, use familiar SQL, and take advantage of a pay-as-you-go model.

Note: You can integrate raw logs and aggregated metrics streams with BigQuery. This example shows how to integrate a raw logs stream that pushes data to BigQuery.

How to

  1. Get started with DataStream.
    In DataStream, configure a raw logs stream and select your data sets. For example, you can select Request Header Data to choose the headers that you want to receive when calling the API. You may want to receive headers such as Authorization, Range, Accept-Encoding, and many others.

    You can also choose a sample rate. Unless you have a specific reason not to, select 100% so you capture all the traffic that hits your site. For more details, see Add a stream.

  2. Set up an API client for the DataStream API.
    To integrate with BigQuery, you need an API client with at least read-only access to the DataStream Pull API. You can create an API client in Control Center. For details, see https://developer.akamai.com/api/getting-started.
  3. Set up a GCP account.
    Open a new project in Google Cloud Platform and create the following resources:
    Cloud Storage
    Set up two buckets: one to store the logs, and the other to store the cloud function script. For details, see https://cloud.google.com/storage/docs/creating-buckets.
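
    For example, using the gsutil tool and the bucket names that appear later in this guide:
      gsutil mb gs://akamai-datastream            # bucket for the DataStream logs
      gsutil mb gs://akamai-script-cloudfunction  # bucket for the cloud function script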


    Compute Engine
    Set up one compute workload to call the DataStream Pull API and copy the responses to Cloud Storage.
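
    For example, a minimal instance (the name, zone, and machine type here are illustrative):
      gcloud compute instances create datastream-puller --zone=us-central1-a --machine-type=n1-standard-1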


    BigQuery
    Create a BigQuery dataset for your logs. You'll add a table to it later.
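
    For example, this creates the dataset used by the table-creation command in step 4:
      bq mk --dataset akamai-206503:datastream_logs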


    Once you are done, go to APIs & Services in your Google Cloud Platform account and enable the following APIs: Cloud Functions, BigQuery, and Cloud Storage.
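
    You can also enable them from the shell. A sketch, assuming the standard service IDs for these products:
      gcloud services enable cloudfunctions.googleapis.com bigquery.googleapis.com storage.googleapis.com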

  4. Integrate DataStream with BigQuery.
    Compute Engine setup

    SSH into the compute engine instance that you previously set up. Then, install the Cloud SDK (gcloud). For details, see https://cloud.google.com/sdk/install.

    Install the Akamai API client tooling and paste the previously created API client credentials into the .edgerc file. For more details, see https://developer.akamai.com/api/getting-started.
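
    The pull command in step 5 uses HTTPie with the Akamai EdgeGrid plugin, so one possible setup (an assumption, not the only option) is:
      pip install httpie-edgegrid

    Then paste your credentials into ~/.edgerc, under a section name that matches the -a flag used later (all values below are placeholders):
      [datastream-pull-api]
      client_secret = your-client-secret
      host = akab-xxxxxxxxxxxxxxxx.luna.akamaiapis.net
      access_token = akab-your-access-token
      client_token = akab-your-client-token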

    Next, grant the compute engine access to GCP resources such as Cloud Storage, BigQuery, and Cloud Functions. For details, see https://cloud.google.com/iam/docs/granting-changing-revoking-access.
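
    For example, to let the instance's service account write to BigQuery (the service account email and role below are placeholders; grant whatever roles your setup needs):
      gcloud projects add-iam-policy-binding akamai-206503 --member="serviceAccount:your-sa@akamai-206503.iam.gserviceaccount.com" --role="roles/bigquery.dataEditor"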

    BigQuery table setup

    First, you need to get the DataStream schema. You’ll find it here: https://developer.akamai.com/api/web_performance/datastream/v1-api.zip.

    Next, prepare a BigQuery schema that matches the DataStream schema.


    Note: The schema has a lot of nested records.
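
    As an illustration only (the real field names come from the DataStream schema you downloaded), a BigQuery JSON schema with a nested record looks like this:
      [
        {"name": "streamId", "type": "STRING", "mode": "NULLABLE"},
        {"name": "reqHdr", "type": "RECORD", "mode": "NULLABLE", "fields": [
          {"name": "range", "type": "STRING", "mode": "NULLABLE"},
          {"name": "acceptEncoding", "type": "STRING", "mode": "NULLABLE"}
        ]}
      ]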
    Then, use the prepared schema to create a table in BigQuery. This command uses the schema file called schema.json to create a table called edgescapedemo:
    bq mk --table akamai-206503:datastream_logs.edgescapedemo ./schema.json
    Cloud Function setup

    You also need to write a cloud function. Cloud Functions is GCP's serverless compute product. For details, see https://cloud.google.com/functions/.

    A cloud function runs in response to triggers. Here, we use the google.storage.object.finalize trigger, which fires as soon as an object is uploaded to Cloud Storage. For details, see https://cloud.google.com/functions/docs/calling/storage.
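
    Here is a minimal sketch of what the function's entry point might look like in Node.js. The entry point name jsonLoad and the dataset and table names come from this guide; the sketch assumes the uploaded file is newline-delimited JSON:
      // index.js
      const { BigQuery } = require('@google-cloud/bigquery');
      const { Storage } = require('@google-cloud/storage');

      const bigquery = new BigQuery();
      const storage = new Storage();

      // Background function fired by the google.storage.object.finalize trigger.
      // `data` describes the Cloud Storage object that was just uploaded.
      exports.jsonLoad = (data, context) => {
        const file = storage.bucket(data.bucket).file(data.name);
        console.log(`Loading gs://${data.bucket}/${data.name} into BigQuery`);

        // Load the file straight from Cloud Storage into the logs table.
        return bigquery
          .dataset('datastream_logs')
          .table('edgescapedemo')
          .load(file, { sourceFormat: 'NEWLINE_DELIMITED_JSON' });
      };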

    Once you’ve prepared the cloud function, you can deploy it with this command:
    gcloud beta functions deploy datastream-cloud-function --trigger-resource=akamai-datastream --trigger-event google.storage.object.finalize --source=. --stage-bucket=gs://akamai-script-cloudfunction --entry-point=jsonLoad
    
  5. Call the DataStream API.
    Now that all the pieces are in place, you can script your API calls and push the DataStream JSON response file to Cloud Storage. Once the file is uploaded, the finalize trigger activates the cloud function, which loads your data into the BigQuery table.
    Here is the flow:
    • Make an API call to the DataStream API from the compute engine. This can run as a cron job; a wrapper-script sketch follows the command:
      http --auth-type edgegrid -a datastream-pull-api: ":/datastream-pull-api/v1/streams/851/raw-logs?start=2018-10-30T06:30:00Z&end=2019-10-23T06:40:00Z&page=0&size=100"
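
      For example, a minimal cron wrapper (the script path and schedule are illustrative; the stream ID and time window come from the call above):
        #!/bin/sh
        # pull-datastream.sh -- pull raw logs and save them for the upload step below.
        http --auth-type edgegrid -a datastream-pull-api: ":/datastream-pull-api/v1/streams/851/raw-logs?start=2018-10-30T06:30:00Z&end=2019-10-23T06:40:00Z&page=0&size=100" > output.json

      A crontab entry such as */10 * * * * /home/user/pull-datastream.sh would run it every ten minutes.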
    • Push the output to the bucket for DataStream logs:
      gsutil cp output.json gs://akamai-datastream
      
      As soon as the file is in the bucket, it activates the cloud function. You can check the cloud function logs to verify that it completed successfully. Return the logs with this command:
      gcloud beta functions logs read datastream-cloud-function 
      
    • Open the BigQuery interface and query the table to explore your DataStream logs.
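
      For example, a quick sanity check from the command line (SELECT * is just for a first look; substitute your own columns):
        bq query --use_legacy_sql=false 'SELECT * FROM `akamai-206503.datastream_logs.edgescapedemo` LIMIT 10'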