I always hate having to generate test data… even with tools like Red Gate’s SQL Data Generator (which I use liberally when I need to do such tasks), there’s always a bunch of screwing around that needs to be done to get everything as tied up as neatly as possible, and in most cases you still can’t get actual, representative data.

Despite Troy Hunt’s less-than-satisfactory ode to Red Gate’s SQL Generator (sorry Troy, I really enjoy reading your stuff but I can only guess that you must have been paid well for this one), using a data generator – even a really good one like RG’s – still doesn’t get us all the way there.

In every database I’ve had to generate any sort of high-volume test data for, I’ve never been able to get it quite right – sure, I can match it to schema requirements, that’s fine, but that doesn’t mean it resembles real-world data or what the application expects to work with. I’d be hard pressed to come up with a definitive list of all the issues I’ve had with generated test data (every time, the list is as different as the projects themselves), but every time, there are issues.

Recently we’ve been working on a sales/forecast analysis SaaS project for a client and have a whole new set of issues with generated data. To put the app requirements in scope, this will be a multi-tenant analysis application, so something as simple as customer sales and gross profit (GP) isn’t really all that easy to generate good data for (you’d think it would be, right? Pick a number range, pick a GP percentage range and let Red Gate do its thing?).

The first issue is that we could have customers that work in thousands (let’s say a good set is between $1000 and $8000 per month), while other customers could be working in millions. RG (and most others) allows you to set a high/low limit for a numeric column, no problem, but we can’t very well have a customer with sales at $1234.00 for January and $12,345,678 for February (ok, maybe they’d like to, but let’s be real!). So what do we do, set up an RG project for each customer FK and choose a specific range for each? E.g., for CustomerID 456, generate values in the 4k–8k range, and for CustomerID 789 generate values between 1m and 2m? Not really… RG is good at actually generating, but I don’t want to go through the project setup umpteen times to get a handful of customer sales dollars in there (besides which, I prefer my test data not to be tied so closely to Red Gate: I’d much rather dump a script out afterward and store that in source control for test data recreation than have everyone use RG and carry the RG Generator project around all the time).

The second issue – and this is one that pops up very often on projects – is with data distribution. This tends to be a recurring issue with dates in general, but specifically on this project, the application needs to be able to analyze distribution and trends in sales and forecast data. Sure, having relatively random data in there is good for seeing how the application logic responds to such things, but in the real world we do tend to see trends in data, and we tend to see them often. How do we deal with that? Another good example is with birthdates and families or dependents. Generating a random set of dates in a range is simple, but how do you set that up so that spouse birthdates are generally close to their counterpart’s while child birthdates are considerably more recent?

To be fair, yes, I do realize that Red Gate’s tool (among others, perhaps) allows for some custom expressions based on other columns, which is helpful, but still doesn’t quite get the job done (and again, this heavily ties the test datasets directly to Red Gate’s project, which I much prefer not to do – and again, yes, I know the sqlgen file can be put into source control, but I don’t want to have to carry around Red Gate licensing for every place this might need to be built).

Anyway, yes: data generation – even with best-in-class tools like RG’s data generator – still sucks. Generating accurate test data is still a major project and can be a significant cost due to the hours involved, unless your test data quality requirements happen to be very low.

So there are two main issues I need to deal with; two main goals to achieve. First, generate medium- to high-quality test data, and second, do so in a way that can be easily recreated (preferably checked into VCS and run as part of a build procedure… after all, sometimes we do need to develop against representative data, and adding rows in by hand is always a painful proposition).

The common approach, and certainly the one I started out with, is to more or less blindly open your generator tool and start setting up your configurations and then go ahead and dump boatloads of data directly into their target tables. This is generally adequate for low quality (e.g., not real world representative data) in mass amounts for stress testing, but even with careful planning you’re apt to run into some issues, and never mind the fact that it offers nothing in the way of dealing with any of the aforementioned problems.

The better approach, I’ve come to find, is to think not in terms of populating the final target tables with generated data directly, but instead to create an entirely separate set of tables for holding random values, which can be later applied to the final tables with much more integrity. By dumping the generated data into test tables, we can apply various metadata to those tables that further describes what the test table’s set represents.

Let’s take the case of birthdates and dependents. Rather than generating data directly to a Members or Persons or Patients table, let’s instead dump it into a birthdates table. As this will be a recurring theme, let’s go ahead and create a dedicated schema for these tables:
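Something along these lines would do (the schema, table, and column names here are just an illustrative sketch, not a prescribed layout):

```sql
-- A dedicated schema keeps generated reference sets cleanly separated
-- from the real application tables
CREATE SCHEMA TestData;
GO

-- Categorized birthdates; the Generation column is the metadata that
-- describes what each row's value represents
CREATE TABLE TestData.Birthdates
(
    BirthdateID int IDENTITY(1, 1) PRIMARY KEY,
    BirthDate   date        NOT NULL,
    Generation  varchar(20) NOT NULL  -- e.g. 'Boomer', 'GenX', 'Millennial', 'Child'
);
```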

Now we can generate sets of birthdates that are categorized, say, by generation. As an example, you might use RedGate to generate four sets of test data that’ll resemble something like this:
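For illustration, the generated sets might look roughly like this (the dates and category names are invented):

```sql
-- Illustrative only – in practice the generator fills a couple hundred
-- rows per category, each ranged appropriately for that generation
INSERT INTO TestData.Birthdates (BirthDate, Generation)
VALUES ('1952-03-14', 'Boomer'),
       ('1971-08-02', 'GenX'),
       ('1989-11-27', 'Millennial'),
       ('2009-05-19', 'Child');
```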

Of course that’s not script you’d actually run, but it serves as an example well enough. Go ahead and generate a couple hundred random values into the table for each category, where the dates are ranged according to category. Then go to SSMS and generate scripts for the schema and data… save those to a file, then check that file into git, and there you have a repeatable, scriptable set of birthdates that actually resemble real birthdates roughly based on someone’s generation. Or, better yet, save that as part of your master toolset, as this export is absolutely a core artifact that can be reused on many projects.

Let’s do something similar with customer sales and gross profit percentages:
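Again as a sketch (the names and precisions are assumptions, adjust to taste):

```sql
-- Categorized sales figures; VolumeCategory lets us later pull values
-- from a range that's realistic for a given customer
CREATE TABLE TestData.SalesAmounts
(
    SalesAmountID  int IDENTITY(1, 1) PRIMARY KEY,
    SalesAmount    decimal(18, 2) NOT NULL,
    GrossProfitPct decimal(5, 2)  NOT NULL,
    VolumeCategory varchar(10)    NOT NULL  -- 'Low', 'Medium', 'High'
);
```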

Again, use your generation tool to come up with separate sets for each volume range. Maybe Low will be $1000-$4000 where High will be measured in millions.

Ok, so that seems like a good idea, we can create categorized data sets from our data generator into test tables, but how do we correlate that to the actual target tables?

We’ll need to build the target tables’ test data in stages. For example, first we might use the “out of the box” RG generator to come up with a table of Member names, and maybe we’ll just pre-fill the DateOfBirth field with an arbitrary date. Let’s assume that we have set up some sort of correlation for ParentID to indicate whether the person’s parent is listed (which would allow us to differentiate between a child/adolescent and adult). Once you have a sufficient dataset generated there, we can use a CROSS APPLY query to select random values from the test data rows, like so:
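A sketch of what that update might look like (dbo.Members, its column names, and the Generation mapping are assumptions for illustration):

```sql
UPDATE t
SET t.DateOfBirth = x.BirthDate
FROM dbo.Members AS t
CROSS APPLY
(
    SELECT TOP (1) b.BirthDate
    FROM TestData.Birthdates AS b
    WHERE b.Generation = CASE WHEN t.ParentID IS NOT NULL
                              THEN 'Child' ELSE 'GenX' END
      AND t.MemberID = t.MemberID  -- correlates the subquery to the outer row
    ORDER BY NEWID()               -- random pick, re-evaluated per row
) AS x;
```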

Note the t.MemberID = t.MemberID in the cross-applied subquery there… without this correlation, SQL Server could evaluate the subquery once, select a single random value from the test set, and apply that one value to all target rows; with it, each target row gets its own selection. The ORDER BY NEWID() is a simple means of sorting randomly, which works well in conjunction with SELECT TOP 1.

This is an effective means to populate specific data values with randomly generated data within a constrained range, and can be used in multiple places:
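For instance, the same pattern can fill each customer’s sales rows from the category-appropriate set (dbo.CustomerSales and the per-customer VolumeCategory tag are assumed here):

```sql
UPDATE t
SET t.SalesAmount    = x.SalesAmount,
    t.GrossProfitPct = x.GrossProfitPct
FROM dbo.CustomerSales AS t
CROSS APPLY
(
    SELECT TOP (1) s.SalesAmount, s.GrossProfitPct
    FROM TestData.SalesAmounts AS s
    WHERE s.VolumeCategory = t.VolumeCategory  -- each customer is pre-tagged
      AND t.CustomerID = t.CustomerID          -- per-row re-evaluation
    ORDER BY NEWID()
) AS x;
```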

This helps a lot – certainly much better than out of the box generation from the various tools available and is relatively easy to implement.

The above cases and examples don’t touch on more advanced scenarios such as sales distributions and/or data trends*, but by taking the same approach – generating categorized test datasets and then applying them to the target tables – we can come up with a much higher quality set of data with relatively little extra time involved. If we keep in mind that any test table can have any attributes defined as we like, and that we can play around with ordering and value selection in our application queries, repeatably generating medium- to high-quality test data remains feasibly within reach!

Cheers,
-jack

*For trends, you’ll find that some slightly more sophisticated test sets will need to be generated, but even so – there is often a genuine requirement for such data, and generally speaking we don’t tend to walk away from those cases feeling like we’ve wasted time. For generating those more sophisticated sets, consider also that you can use this same cross-apply technique to fill more advanced test sets from more basic ones!
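As a rough sketch of that idea – drawing each customer’s base amount from the basic categorized set while layering a simple trend on top (the table names, the Months helper table, and the 2% monthly ramp are all invented for illustration):

```sql
-- Build a trending monthly sales series per customer by cross-applying
-- the basic categorized amounts set
INSERT INTO TestData.MonthlySales (CustomerID, SalesMonth, SalesAmount)
SELECT c.CustomerID,
       m.SalesMonth,
       x.SalesAmount * POWER(1.02e0, m.MonthNumber)  -- gentle upward trend
FROM dbo.Customers AS c
CROSS JOIN TestData.Months AS m      -- one row per month in the series
CROSS APPLY
(
    SELECT TOP (1) s.SalesAmount
    FROM TestData.SalesAmounts AS s
    WHERE s.VolumeCategory = c.VolumeCategory
      AND c.CustomerID = c.CustomerID  -- per-row re-evaluation
    ORDER BY NEWID()
) AS x;
```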