Data Transformations vs. Data Validations

The dataload.py script (with an assist from transformations.py) can convert data from one format to another; for example, it can convert a Yes to a True, or convert the date 7/12/1967 to 1967-07-12. As we’ve seen, it’s generally easy to carry out data transformations as part of the data migration process. What isn’t so easy (at least not without some effort on your part) is carrying out data validations.
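To make the two conversions mentioned above concrete, here is a minimal sketch in Python. The function names (yes_no_to_bool and us_date_to_iso) are hypothetical and are not taken from transformations.py; the sketch also assumes that source dates are written in month/day/year order.

```python
from datetime import datetime


def yes_no_to_bool(value):
    """Convert 'Yes'/'No' (any case) to True/False. Hypothetical helper."""
    normalized = value.strip().lower()
    if normalized == "yes":
        return True
    if normalized == "no":
        return False
    # Anything else is rejected rather than silently guessed at.
    raise ValueError(f"Expected Yes or No, got {value!r}")


def us_date_to_iso(value):
    """Convert a month/day/year date such as 7/12/1967 to 1967-07-12."""
    parsed = datetime.strptime(value.strip(), "%m/%d/%Y")
    return parsed.strftime("%Y-%m-%d")
```

Note that both helpers only change the shape of a value that is already present; neither one checks whether a required value is missing, which is the distinction this section draws.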

Let’s explain what that means. If you’ve registered with an Akamai-powered website or if you’ve created a user profile in the Console, then you know what data validation is all about. For example, suppose you try to register a new account on an Akamai Identity Cloud website, and you forget to enter a display name. Instead of creating the account, the registration process stops and tells you that something is missing:

That’s an example of data validation: before even trying to create a new account, the registration form checks that everything is in order (e.g., that all the required fields have been filled out). If anything is out of order, the process stops dead in its tracks.

This is also an example of what the data migration process doesn’t do: dataload.py won’t verify that you’ve entered values for all the required fields, and it won’t verify that the data that you did enter is correctly formatted (for example, the script doesn’t ensure that an email address looks similar to karim.nafir@mail.com). We just saw that you can’t register on an Akamai-powered website without including a display name; on top of that, the display name you do enter must be unique. If it isn’t, you’ll be prevented from creating an account:

However, after doing a trial data migration, we ended up with several users who don’t have display names:

We also have three recently-migrated users who have the same display name:

How could those things happen? As we noted earlier, the data migration process doesn’t validate the data: to a very large extent, it simply copies over whatever data you give it. If you don’t list a display name for a user, then that user won’t have a display name. And if you list the same display name for 25 users, well, then 25 users will share that display name.
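If you want to catch missing or repeated display names before you migrate, one option is to scan the source file yourself. The following is a minimal sketch, not part of dataload.py: it assumes your data is a CSV file with a header row and a column named displayName, and the function name find_display_name_problems is hypothetical.

```python
import csv
import io
from collections import defaultdict


def find_display_name_problems(csv_file, column="displayName"):
    """Return (lines with no display name, {display name: lines that share it})."""
    missing = []
    seen = defaultdict(list)
    # Line 1 is the header row, so the first data row is line 2.
    for line_number, row in enumerate(csv.DictReader(csv_file), start=2):
        name = (row.get(column) or "").strip()
        if not name:
            missing.append(line_number)
        else:
            seen[name].append(line_number)
    duplicates = {name: lines for name, lines in seen.items() if len(lines) > 1}
    return missing, duplicates


# Small in-memory sample; in practice you would pass an open file instead.
sample = io.StringIO(
    "email,displayName\n"
    "a@example.com,Karim\n"
    "b@example.com,\n"
    "c@example.com,Karim\n"
)
missing, duplicates = find_display_name_problems(sample)
print("Missing display name on lines:", missing)
print("Shared display names:", duplicates)
```

This comparison is case-sensitive and only looks at the file you give it; it knows nothing about display names that already exist in the user profile store.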

Here’s something else you need to know. It’s possible to import a million users who all have the same display name (we don’t recommend it, but you can do it). You can also import users who don’t have an email address:

However, if you try to import a user who has an email address that’s already in the system, that user account will not be copied over to the user profile store. Instead, the record will be skipped, and you’ll see an entry like this in the fail.csv log:

batch,line,error
1,2,Attempted to update a duplicate value
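When a migration skips many records, it can help to tally the log by error message. Here is a minimal sketch that assumes the log uses the batch,line,error header shown above; the function name summarize_failures is hypothetical.

```python
import csv
import io
from collections import Counter


def summarize_failures(fail_log):
    """Count how many skipped records were logged for each error message."""
    return Counter(row["error"] for row in csv.DictReader(fail_log))


# In-memory copy of the sample log; in practice, pass open("fail.csv", newline="").
sample_log = io.StringIO(
    "batch,line,error\n"
    "1,2,Attempted to update a duplicate value\n"
)
summary = summarize_failures(sample_log)
for message, count in summary.most_common():
    print(count, message)
```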

So what’s the deal here? Didn’t dataload.py do a data validation in this particular case?

Believe it or not, no, it didn’t: dataload.py never does data validations. Instead, the underlying user profile schema performed a data validation (and, as a result, would not allow the record to be written to the profile store). If you look at the schema (or at least at the schema we used for our data migration test), you’ll see that the email attribute is not required, but that it is globally unique:

In other words, and as far as the schema is concerned, you don’t have to have an email address, but, if you have one, that address has to be unique. As for displayName, the schema doesn’t flag that attribute as either required or unique:

Because of that, the records that we import don’t have to have a display name and, if they do have a display name, that display name doesn’t have to be unique. Keep that in mind when you prepare your data for migration.

If you need data validations, you’ll either have to write custom code that can perform those validations or work with your Akamai representative to see if those validations can be placed on your schema.
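As one illustration of the custom-code route, the sketch below checks a list of records before they are handed to the migration script. Everything here is an assumption made for the example: the function name validate_records, the choice of email and displayName as required fields, and the deliberately loose email pattern. It is not how dataload.py or the schema performs any check.

```python
import re

# Loose shape check only (something@something.something); not a full email validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_records(records, required=("email", "displayName")):
    """Return a list of (record number, problem) tuples; an empty list means no problems."""
    errors = []
    seen_emails = set()
    for index, record in enumerate(records, start=1):
        # Required-field check: the value must be present and non-blank.
        for field in required:
            if not (record.get(field) or "").strip():
                errors.append((index, f"missing required field: {field}"))
        # Format and uniqueness checks apply only when an email is present.
        email = (record.get("email") or "").strip().lower()
        if email:
            if not EMAIL_PATTERN.match(email):
                errors.append((index, "malformed email address"))
            elif email in seen_emails:
                errors.append((index, "duplicate email address"))
            else:
                seen_emails.add(email)
    return errors


records = [
    {"email": "karim.nafir@mail.com", "displayName": "Karim"},
    {"email": "karim.nafir@mail.com", "displayName": "Karim N."},
    {"email": "not-an-email", "displayName": ""},
]
for record_number, problem in validate_records(records):
    print(f"record {record_number}: {problem}")
```

A check like this only sees duplicates within the data you pass in. An email address that already exists in the user profile store would still be caught by the schema at migration time and reported in fail.csv.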