Information Technology Dark Side

Struggles of a Self-Taught Coder

Information Technology Dark Side header image 2

Obfuscating Private Data in a Database Dump File

October 3rd, 2008 · 1 Comment

Here’s a common problem in software development: you’re production application has sensitive/personal/private/legally protected data in it that probably isn’t wise to share with everybody and their dog. But, you need to debug a problem from prod, test a database upgrade, verify a new feed works properly, or some other task that would really be a lot easier if you could use production data to do it. Unfortunately, putting that data in your test environment would require everyone who has access to it to hand over their first born child in escrow against them stealing it, selling it, or abusing it in some other way.

Now, it’s worth noting that the problem I’m trying to solve is NOT the problem of hackers getting into your production database. I’m just trying to take a production database, create a database dump file, make that file safe for general use, and then load it in my test database. Simple right?

Yeah, actually it turns out it is. I wrote a simple Ruby script called Obfu that goes through the dump file, finds the fields I’m worried about, and changes them to random (but human readable) values.

Why not just encrypt the sensitive data? Well, sure, that would work, if you want to read X87rkd834k for a first name, I guess. I’m not sure what an encrypted email address, phone number, or street address looks like, but I have doubts about whether it would format properly.

So, here’s how Obfu works: First, you grab the source code from TriSano.org with this link: Obfu

Next, change the method get_obfu_config to include the tables and columns you want obfu’ed. What get_obfu_config is simultaneously complex and simple: it’s complex because it uses an array of hases of arrays of hashes, but it’s simple once you get over that.

Here’s an example from the code that might help you understand how that array/hash/array/hash mess works:
[
{#COPY telephones (id, location_id, country_code, area_code, phone_number, extension, created_at, updated_at, email_address) FROM stdin;
:table_name => ‘telephones’, :fields => [
{:field_loc => 4, :type => ‘value’, :value => ‘555’}, # Area code
{:field_loc => 5, :type => ‘num’, :digits => 7}, # Phone number
{:field_loc => 6, :type => ‘num’, :digits => 3}, # Extension
{:field_loc => 9, :type => ’email’} # email
]
},
{#COPY people (id, entity_id, birth_gender_id, ethnicity_id, primary_language_id, first_name, middle_name, last_name, birth_date, date_of_death, created_at, updated_at, food_handler_id, age_type_id, approximate_age_no_birthday, first_name_soundex, last_name_soundex, vector, live, next_ver, previous_ver, disposition_id) FROM stdin;
:table_name => ‘people’, :fields => [
{:field_loc => 6, :type => ‘text’, :word_count => 1}, # first_name
{:field_loc => 7, :type => ‘text’, :word_count => 1}, # middle_name
{:field_loc => 8, :type => ‘text’, :word_count => 1}, # last_name
{:field_loc => 9, :type => ‘date’, :keep_year => true} # birth_date
]
}
]

Each item in the big array is a table, and it really only has two values: table name and fields. The fields value is an array of the columns you want obfuscated. Each column has a hash of values that tell obfu what to do to the column. You’ll notice I don’t have a field name for the column name. That’s a little lazy on my part and I’d love you to fix it (the TriSano OSS license requires you contribute improvements back, if I understand it correctly). I use :field_loc instead, which is simply the position of the field in the COPY statement in the dump file.

Every column has two values field_loc and type. All the other values depend on the type you specify. Here’s how it breaks down:

type values
num :digits
value :value (I use this to always set area codes to '555')
text :word_count
email none
date :keep_current_year: true or false (useful for birthdates)

Obfu uses lorum ipsum dolor text to randomly generate words for text and email fields. Numbers are all randomly generated, and so are dates, but with some limits. You won’t get day values greater than 28, for example, and you might want to consider the fact that randomly generating certain dates will break validations.

Use Obfu all you like. Tell your friends about it if turns out to be helpful for you. I’m hoping to externalize the configuration and turn it into a gem next time I get some free time (unless you do it first, in which case I’ll just thank you and use yours!).

Obfu takes two arguments: input file and output file, in that order. I haven’t added any error handling, etc., so use at your own risk.

Obfu has only been tested against Postgresql database dump files. You might need to change Obfu up to make it work with other databases.

If you enjoyed this post, make sure you subscribe to my RSS feed!
Stumble it!

Tags: Development · Uncategorized

1 response so far ↓

  • 1 David York // Oct 10, 2008 at 1:43 pm

    So very cool, we used to do the same thing in COBOL. We picked on Disney characters, even addresses if would could find one.

Leave a Comment