BigQuery integration

Google Cloud Platform (GCP) BigQuery is a fully managed, columnar data warehouse that lets you analyze data without managing the underlying infrastructure. It also lets you visualize your data with an integrated tool, Data Studio. You can integrate DataStream with BigQuery to find meaningful insights, use familiar SQL, and take advantage of a pay-as-you-go model.

Note: You can integrate raw logs and aggregated metrics streams with BigQuery. This example shows how to integrate a raw logs stream that pushes data to BigQuery. See Key concepts and terms.

How to

  1. Get started with DataStream.
    In DataStream, configure a raw logs stream and select your data sets. For example, you can select Request Header Data to choose the headers that you want to receive when calling the API. You may want to receive headers such as Authorization, Range, Accept-Encoding, and many others.

    You can also choose a sample rate. For details, see Add a data stream.

  2. Set up an API client for the DataStream Pull API.
    To integrate with BigQuery, you need an API client with at least read-only access to the DataStream Pull API. You can create an API client in Control Center. For details, see Get Started with APIs.
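    The API client credentials are stored in an .edgerc file on the machine that makes the calls. Here is a minimal sketch with placeholder values (not real credentials), using a section name that matches the one used in the API call later in this guide:

```ini
; ~/.edgerc -- placeholder values only
[datastream-pull-api]
client_secret = xxxxxxxxxxxxxxxx
host = akab-xxxxxxxxxxxxxxxx.luna.akamaiapis.net
access_token = akab-xxxxxxxxxxxxxxxx
client_token = akab-xxxxxxxxxxxxxxxx
```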
  3. Set up a GCP account.
    Open a new project in Google Cloud Platform and create the following resources:
    Cloud Storage
    Set up two buckets: one to store the logs, and one to store the cloud function script. For details, see Creating buckets in Google Cloud Storage.

    Compute Engine
    Set up a compute workload to call the DataStream Pull API and copy the responses to cloud storage.

    BigQuery Database
    Create a BigQuery database for your logs. You'll add a table later.

    Once you are done, go to APIs & Services in your Google Cloud Platform account and enable the following APIs: Cloud Functions, BigQuery, and Cloud Storage.

  4. Integrate DataStream with BigQuery.
    Compute Engine setup

    SSH into the compute engine that you previously set up. Then, install the Google Cloud SDK. For details, see Install Google Cloud SDK.

    Install the Akamai API clients. Copy the previously created credentials and paste them into the .edgerc file. For details, see Get started with APIs.

    Next, grant the compute engine access to the GCP resources, such as storage, BigQuery, and the cloud function. For details, see Granting, changing, and revoking access to resources in Google Cloud Storage.

    BigQuery table setup

    First, you need to get the DataStream Pull API's schema. See DataStream Pull API schema.

    Next, prepare a BigQuery schema that matches the DataStream schema field for field.

    Note: The schema contains many nested records.
    Then, use the prepared schema to create a table in BigQuery. This command uses the schema file schema.json to create a table called edgescapedemo:
    bq mk --table akamai-206503:datastream_logs.edgescapedemo ./schema.json
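    To illustrate the shape BigQuery expects in schema.json, here is a minimal Python sketch that writes such a file with one nested RECORD field. The field names (cliIP, message, reqHost, status) are hypothetical stand-ins; your actual file must mirror the DataStream Pull API schema:

```python
import json

# A BigQuery schema file is a JSON array of column definitions; nested
# records use "type": "RECORD" with their own "fields" array.
# The field names below are illustrative only.
schema = [
    {"name": "cliIP", "type": "STRING", "mode": "NULLABLE"},
    {
        "name": "message",  # a nested record, like those in the DataStream schema
        "type": "RECORD",
        "mode": "NULLABLE",
        "fields": [
            {"name": "reqHost", "type": "STRING", "mode": "NULLABLE"},
            {"name": "status", "type": "INTEGER", "mode": "NULLABLE"},
        ],
    },
]

with open("schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```

    You would then pass this file to bq mk --table as shown above.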
    Cloud Function setup

    You also need to write a cloud function. Cloud Functions is Google's serverless compute product. For details, see Cloud Functions in Google Cloud.

    A cloud function acts on triggers. Here, we use a Cloud Storage trigger: the function fires as soon as a file is uploaded to the bucket. For details, see Storage triggers in Google Cloud.

    Once you’ve prepared the cloud function, you can deploy it with this command:
    gcloud beta functions deploy datastream-cloud-function --trigger-resource=akamai-datastream --trigger-event=google.storage.object.finalize --source=. --stage-bucket=gs://akamai-script-cloudfunction --entry-point=jsonLoad
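    The function body depends on your runtime, but the core work is reshaping the Pull API response into newline-delimited JSON, the format BigQuery load jobs accept. Here is a stdlib-only Python sketch of that step, assuming a hypothetical top-level "data" array in the response; the real function would then submit a load job through the BigQuery client library:

```python
import json

def to_ndjson(raw_response: str) -> str:
    """Convert a DataStream-style JSON response into newline-delimited JSON.

    Assumes the response carries its records in a top-level "data" array --
    a hypothetical field name used here for illustration.
    """
    records = json.loads(raw_response)["data"]
    return "\n".join(json.dumps(r) for r in records)

# Two fake records standing in for DataStream log lines.
sample = json.dumps({"data": [{"cliIP": "192.0.2.1"}, {"cliIP": "192.0.2.2"}]})
print(to_ndjson(sample))
```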
  5. Call the DataStream API.
    Now that all the pieces are in place, you can start your API call script and push the DataStream JSON response file to cloud storage. Once the file is uploaded, the finalize trigger activates the cloud function, which stores your data in the BigQuery table.
    Here is the flow:
    • Make an API call to the DataStream API from the compute engine. This can be a cron job:
      http --auth-type edgegrid -a datastream-pull-api: ":/datastream-pull-api/v1/streams/851/raw-logs?start=2018-10-30T06:30:00Z&end=2019-10-23T06:40:00Z&page=0&size=100"
    • Push the output to the bucket for DataStream logs:
      gsutil cp output.json gs://akamai-datastream
      As soon as the file is in the bucket, it activates the cloud function. Check the cloud function logs to verify that it completed successfully. You can retrieve the logs with this command:
      gcloud beta functions logs read datastream-cloud-function 
    • Open the BigQuery interface and query the table. You'll see something similar to this:
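    To automate the first two bullets as a cron job, you can build the same commands programmatically and run them with subprocess. A sketch, reusing the stream ID, time window, and bucket name from the examples above (the helper names are hypothetical):

```python
# Build the shell commands shown in the flow above so a cron-driven
# script can execute them. Values mirror the examples in this guide.

def build_pull_command(stream_id, start, end, page=0, size=100):
    url = (f":/datastream-pull-api/v1/streams/{stream_id}/raw-logs"
           f"?start={start}&end={end}&page={page}&size={size}")
    # httpie with the EdgeGrid auth plugin, using the .edgerc section name
    return ["http", "--auth-type", "edgegrid", "-a", "datastream-pull-api:", url]

def build_copy_command(local_file, bucket):
    # gsutil upload into the DataStream logs bucket
    return ["gsutil", "cp", local_file, f"gs://{bucket}"]

pull = build_pull_command(851, "2018-10-30T06:30:00Z", "2019-10-23T06:40:00Z")
copy = build_copy_command("output.json", "akamai-datastream")
print(" ".join(pull))
print(" ".join(copy))
```

    A cron entry would then invoke this script on the schedule you want, e.g. every 10 minutes with a rolling time window.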