Step 5: Creating Data Transformations

The secret to a successful data migration is really no secret at all: if you have “clean” data (defined as data that can be imported as-is into the Akamai user profile store) then the data migration process should go pretty smoothly. For example, suppose you have a custom attribute (newsletterSubscriber) that’s been configured with the Boolean datatype; as a Boolean, it can only accept the case-insensitive values true or false. If your datafile looks like the following, you shouldn’t have any problems, at least not with the newsletterSubscriber attribute:

givenName,familyName,newsletterSubscriber
Bob,Jones,True

If the user profile store is expecting true/false values and you provide true/false values, then you should be in pretty good shape. Suppose that your legacy system gives you the option of exporting a Boolean file as: 1) the value True or False; or, 2) the value 1 or 0. Trust us: you’ll make your data migration life much easier if you export Boolean values as True or False.

But what if you don’t have much say in how your legacy system exports data? In that case, you could end up with a datafile that uses the values 1 and 0 instead of the values true and false:

givenName,familyName,newsletterSubscriber
Bob,Jones,1
Karim,Nafir,1
Maria,Fuentes,0
Toni,Ng,0
August,Springer,1
Terrance,O’Reilly,1
Barry,Alvarez,0

Is that a problem? Yes, it is. In fact, the preceding datafile can’t be imported into the user profile store, period. Instead, the migration process is going to crash and burn when it tries to copy the values 1 and 0 to the Boolean attribute newsletterSubscriber:

batch,line,error
1,2,/newsletterSubscriber must be a boolean value
1,3,/newsletterSubscriber must be a boolean value
1,4,/newsletterSubscriber must be a boolean value

Once again, that leaves you with three choices:

  • Take newsletterSubscriber out of the datafile, and make your users resubscribe.
  • Write a script or a macro or something that can change all the values in the datafile. This approach will work, although, if you have millions of records, that’s a lot of data that must be processed.
  • Use a dataload data transformation.

A data transformation is exactly what the name implies: it transforms the data in your datafile. In this case, that means that, each time it encounters the newsletterSubscriber attribute, it’s going to convert the value 1 to true and the value 0 to false. Furthermore, these transformations happen on the fly, meaning that you don’t have to write a script that can read, process, and save millions of records. In fact, the datafile itself will not be changed at all. Instead, dataload.py will read the first record and then, when it encounters the newsletterSubscriber attribute, hand that value off to the data transformation. The data transformation will convert the 1 to True (or the 0 to False) and hand the converted data back to dataload.py. The script then writes the record, and the converted data to the user profile store (and not to the CSV file).

Data transformations are carried out using a second Python script: transformations.py. In transformations.py, individual data transformations are constructed as Python functions. For example:

def transform_password(value):
       if not value:
       return None
   try:
       formated_json = json.loads(value)
   except ValueError:
       return value
   return formated_json

Transformations.py comes with several predefined functions, including:

transform_password

This function turns the password definition into an object that specifies the type of hashing algorithm used. See Transforming Passwords for more information.

transform_date

Transforms dates entered using the standard US date format – e.g., 12/19/1989 – to the format used with user profiles: 1989-12-19. By default, this transformation is used with the birthday attribute:

reader.add_transformation("birthday", transform_date)

This has at least two implications for you. First, if you use the birthday attribute then, by default, this transformation is applied. That’s fine as long as your datafile uses birthdays like 12/19/1989. But what if your datafile doesn’t use that format; for example, what if it already uses the prescribed 1989-12-19 date format? In that case, your data migration will fail with an error message similar to the following:

File "/Users/greg/test/transformations.py", line 38, in transform_date
raise ValueError("Could not parse date: {}".format(value))
ValueError: Could not parse date: 1989-12-19

If you’re using the 1989-12-19 format in your datafile you can fix this problem by commenting out the call to the transformations.py (at least for the birthday attribute):

# reader.add_transformation("birthday", transform_date)

If you’re using a completely different format (such as the European 19/12/1989 format) then you’ll need to modify transform_date or create a new date transformation function. In many cases, this will be easy: you’ll only have to modify the input formats. For example (and without explaining the details of the Python code), this modified function transforms dates that use the Day/Month/Year format:

if not value:
       return
    input_formats = (
        "%d/%m/%Y",
    )
    for try_format in input_formats:
       try:
           parsed_time =  time.strptime(value.upper(),
                try_format)
           return time.strftime("%Y-%m-%d %H:%M:%S",
            parsed_time)
       except ValueError:
           pass
   raise ValueError("Could not parse date: {}".format(value))

transform_plural

This function takes a JSON-formatted string and converts the value into a Python object that can be used to populate a plural attribute. See Transforming Plurals for more information.

transform_boolean

Transforms a field into a Boolean value (either True or False). By default, transform_boolean first converts a field to all lowercase letters (for example, it converts TRUE to true). The function then looks for true, t, or 1 and, if found, converts the value to True. If it finds false, f, or 0, it converts the value to False.

But what if your legacy field is a yes/no field? Is there any way to modify this function to transforms yesses and nos?

You bet there is:

def transform_boolean(value):
    """
    Transform boolean values that are blank into NULL so that they are not
    imported as empty strings.
    """
    if value.lower in ("true", "t", "1", "yes"):
        return True
    elif value.lower in ("false", "f", "0", "no"):
        return False
    else:
        return None

In the preceding function, we added yes to the list of values that should be converted to True and no to the list of values that should be converted to False. And what if, instead of a yes/no field, your legacy system used an alligator/crocodile field. Well, that would be weird, but you could easily transform an alligator/crocodile field to its Boolean equivalent:

def transform_boolean(value):
    """
    Transform boolean values that are blank into NULL so that they are not
    imported as empty strings.
    """
    if value.lower in ("true", "t", "1", "alligator"):
        return True
    elif value.lower in ("false", "f", "0", "crocodile"):
        return False
    else:
        return None