How to anonymize your Database with Masquerade on Hypernode

Given the current legislation concerning GDPR in Europe and the focus on privacy in our society as a whole, working with personally identifiable information in a thoughtful manner is becoming increasingly important. In this guest article, Peter Jaap Blaakmeer from our partner Elgentos shows you how to anonymize your Magento 2 database using the command-line tool Masquerade.

Using Masquerade, you are able to take a MySQL database and change all fields that contain personally identifiable information to a randomly generated equivalent for that type of field.

The benefit of having an anonymized database is that you’ll be able to share it freely with a developer (external or not) without having to worry about privacy issues and client data leaking. Of course, there is also the option of creating a database dump that does not contain any customer data at all, but a developer often needs certain data to be able to reproduce a bug.

What does Masquerade anonymize?

The following groups of data are anonymized by default using the Magento 2 ruleset:

  • Admin
  • Customer
  • Newsletter
  • Rating
  • Review
  • Rma
  • Sales
    • Order
    • Quote
    • Invoice
    • Creditmemo
    • Shipment
    • Archive

Here’s an example of the fields in the `customer_entity` table that will be replaced with random data:

  • Email
  • Prefix
  • Firstname
  • Middlename
  • Lastname
  • Suffix

The Masquerade CLI tool contains an up-to-date definition of all columns that may contain personally identifiable data (or at least tries to).

How does Masquerade work?

When you start Masquerade, you need to point it towards a database. Note that the update procedure will take place on this exact database, not a replica. So, be sure not to point it towards your production database!
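Because the update happens in place, it can help to put a small guard in front of every Masquerade call. The following is a minimal sketch, not part of Masquerade itself: the function name `assert_not_production` and the `PRODUCTION_DB` variable are hypothetical, and the default production database name `magento` is an assumption you should adjust to your setup.

```shell
# Hypothetical guard: refuse to continue when the target database
# is the production database. PRODUCTION_DB is an assumed name.
PRODUCTION_DB="${PRODUCTION_DB:-magento}"

assert_not_production() {
    target_db="$1"
    if [ -z "$target_db" ]; then
        echo "No target database given" >&2
        return 1
    fi
    if [ "$target_db" = "$PRODUCTION_DB" ]; then
        echo "Refusing to anonymize production database '$target_db'" >&2
        return 1
    fi
    return 0
}

# Example: only continue when the guard passes.
assert_not_production anonymized && echo "safe to run Masquerade on 'anonymized'"
```

The guard only compares names, so it will not catch a replica or alias that points at production data under a different name.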

Masquerade will then work its way through the definition file, updating all information in the stated columns with randomly generated data, which is supplied by the Faker PHP package.

How long this process takes depends on the size of your database and the size of the host machine. In our experience, it takes about half an hour per gigabyte of customer data (on a Magento Professional Large Hypernode).
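To get a rough idea of the run time up front, the rule of thumb can be turned into a small calculation. This is only a sketch: it assumes the half-hour-per-gigabyte figure scales linearly, which may not hold for your data or hardware, and the function name `estimate_minutes` is made up for this example.

```shell
# Rough estimate only: assumes about 30 minutes per gigabyte (1024 MB)
# of customer data and linear scaling.
estimate_minutes() {
    size_mb="$1"
    # 30 minutes per 1024 MB, rounded up to the next whole minute
    echo $(( (size_mb * 30 + 1023) / 1024 ))
}

estimate_minutes 2048   # prints 60
```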

How do I install Masquerade on my Hypernode?

Log in to your Hypernode and create the directory `~/bin` if it doesn’t exist already. Then download the latest Phar release of Masquerade and make it executable:

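mkdir -p ~/bin
# MASQUERADE_PHAR_URL: set this to the download URL of the latest masquerade.phar release
curl -L -o ~/bin/masquerade.phar "$MASQUERADE_PHAR_URL"
chmod +x ~/bin/masquerade.phar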

Now run `masquerade.phar` to see whether it starts. You should be greeted by its logo and the help information.

How do I configure and run Masquerade?

The easiest way to run Masquerade is to pass the configuration options inline when running the `run` command. These are the available options:

      --driver[=DRIVER]      Database driver [mysql]
      --host[=HOST]          Database host [localhost]
      --prefix[=PREFIX]      Database prefix [empty]
      --locale[=LOCALE]      Locale for Faker data [en_US]
      --group[=GROUP]        Which groups to run masquerade on [all]

Faker supports an extensive list of locales. You can check whether your locale is supported in the Faker repository.

First, we need a duplicate of the production database. We are going to copy your production database into a database called `anonymized`. We’ll create a dump of the production database, create the new database and import the dump into it.

magerun2 --root-dir=magento2 db:dump ~/production_dump.sql
mysql -e "CREATE DATABASE IF NOT EXISTS anonymized"
mysql anonymized < ~/production_dump.sql

Now let’s start the anonymization process!

masquerade.phar run --platform=magento2 --database=anonymized --username=app --password=yourpassword --locale=yourlocale

Here’s a quick command to check whether it actually anonymized the data (this assumes your production database is called `magento`). The output should show you the differences in the first 5 customers’ names.

diff <(mysql magento -e "SELECT firstname,lastname FROM customer_entity LIMIT 5") <(mysql anonymized -e "SELECT firstname,lastname FROM customer_entity LIMIT 5")
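For a check that covers more than five rows, you could export the names from both databases to files and count how many rows are still identical. The helper below is a sketch under assumptions: the function name `count_unchanged_rows` and the file paths are hypothetical, and a randomly generated name can occasionally coincide with a real one, so a small non-zero count is not necessarily a failure.

```shell
# Count the lines in the anonymized export that also appear, character for
# character, in the original export.
# $1 = export from the original database, $2 = export from the anonymized one
count_unchanged_rows() {
    # -F fixed strings, -x whole-line match, -c count matching lines;
    # grep exits non-zero when nothing matches, so ignore that status
    grep -Fxc -f "$1" "$2" || true
}

# Example usage (mysql -N omits the column-name header line):
#   mysql -N magento    -e "SELECT firstname,lastname FROM customer_entity" > /tmp/original.txt
#   mysql -N anonymized -e "SELECT firstname,lastname FROM customer_entity" > /tmp/anonymized.txt
#   count_unchanged_rows /tmp/original.txt /tmp/anonymized.txt
```

A result of 0 means no first name and last name pair survived unchanged.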

When you have verified it works, you can add these commands to the crontab to run them every night.
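The commands above can be combined into one script that cron calls. The following is a sketch rather than a ready-made tool: the variable names (`TARGET_DB`, `DUMP_FILE`, `DB_USER`, `DB_PASSWORD`, `LOCALE`, `DRY_RUN`) and the function names are assumptions made for this example. As a safety measure it only prints the commands unless `DRY_RUN=0` is set; note that the printed Masquerade command includes the password argument.

```shell
#!/usr/bin/env bash
# Sketch of a nightly refresh built from the commands in this article.
# All variable names below are illustrative, not part of Masquerade.
TARGET_DB="${TARGET_DB:-anonymized}"
DUMP_FILE="${DUMP_FILE:-$HOME/production_dump.sql}"
DB_USER="${DB_USER:-app}"
DB_PASSWORD="${DB_PASSWORD:-yourpassword}"
LOCALE="${LOCALE:-en_US}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

refresh_anonymized_db() {
    # 1. Dump production and make sure the target database exists.
    run magerun2 --root-dir=magento2 db:dump "$DUMP_FILE" || return 1
    run mysql -e "CREATE DATABASE IF NOT EXISTS $TARGET_DB" || return 1
    # 2. Import the dump (handled separately because of the input redirect).
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ mysql $TARGET_DB < $DUMP_FILE"
    else
        mysql "$TARGET_DB" < "$DUMP_FILE" || return 1
    fi
    # 3. Anonymize the copy, never the production database.
    run masquerade.phar run --platform=magento2 --database="$TARGET_DB" \
        --username="$DB_USER" --password="$DB_PASSWORD" --locale="$LOCALE" || return 1
}

refresh_anonymized_db
```

Assuming the script is saved as the hypothetical `~/bin/refresh-anonymized.sh`, a crontab entry such as `0 3 * * * DRY_RUN=0 "$HOME/bin/refresh-anonymized.sh"` would run it at 03:00 every night. Cron often runs with a reduced `PATH`, so you may need absolute paths to `magerun2` and `masquerade.phar`.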

Running Masquerade in a CI/CD environment

At our company Elgentos, we use a slightly different and more advanced approach. In addition to the backups Hypernode offers, we also create our own off-site backups in Amazon S3 buckets so our developers can retrieve these easily using the command-line tool `aws-cli`.

To be 100% sure we never accidentally anonymize the production database, we run Masquerade in a CI/CD pipeline in our GitLab instance by creating a separate anonymize project with a scheduled pipeline. You can read more about this setup in the wiki ‘How to run Masquerade nightly with Gitlab CI/CD’.

Other platforms

Masquerade offers a Magento 2 ruleset out of the box, but isn’t limited to the Magento framework. In fact, we use it to anonymize Laravel databases as well. You can define your own rulesets easily using YAML. You can also create your own Faker providers and formatters. More information can be found in the Masquerade readme on Customization.