Yesterday’s post by Seth Petry-Johnson on Data-Driven Tests got me pondering another aspect of test data: baseline datasets. I love baseline datasets, but you need to think carefully about when and how to use them.
Why Use Baseline Sets?
Seth’s spot on in his assertion that tests should be responsible for setting up their own test data. Most of the time.
In some cases, however, it makes terrific sense to have a pre-built set of data carefully constructed to cover specific scenarios. Here are a few scenarios I’ve run into in the past where a baseline dataset has really helped us out:
- Manual testing. Baseline datasets can be a great help in a couple of areas of manual testing.
  - Setup. Yeah, I really want to spend three days setting up data by hand so I can run through effective exploratory testing, or through our manual testing guides. NOT! A baseline dataset gets me rolling right away.
  - Visual UI validation. A baseline dataset with large sets of data is also a great help for validating that your UI handles lots of data well. Paging of grids? Long threads of conversations? Thousands of users in your site’s admin section? Look to your pre-built data to help here!
- Load testing. You can’t be serious about your load testing if you’re doing it against an empty or nearly empty database. You’ve got to have a realistic set of data in order to understand how your data access layer/module/strategery works, and how your business and UX layers deal with things.
- Automated testing. Yes, even your regular tests can be helped out greatly with a baseline dataset. Sure, your tests need to handle their prerequisites, but you can also look to have supporting data pre-configured—users in place, content created, etc.
- Testing BI, reporting, or data analysis systems. Systems that process large amounts of data need, well, large amounts of data to test with. It’s insane to think your test harness/suite/fixtures can create this sort of infrastructure each time you need to run your tests. Look to a baseline dataset!
- Ease of test setup and teardown. Yes, tests should clean up after themselves, but this can be extremely hard to do and quite brittle. You know what the easiest way for your test suite to handle that clean up is? DROP DATABASE is a thing of beauty in many scenarios.
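Here’s a rough sketch of that drop-and-restore idea, assuming the baseline is a single SQLite file so that "dropping the database" is just deleting a file. The `restore_baseline` helper, the file names, and the `users` table are all made up for illustration; a server database would use its own backup and restore tooling instead.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def build_baseline(path: Path) -> None:
    """Create a tiny stand-in baseline file (hypothetical schema and rows)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, role TEXT)")
    conn.executemany(
        "INSERT INTO users (name, role) VALUES (?, ?)",
        [("alice", "admin"), ("bob", "author"), ("carol", "reader")],
    )
    conn.commit()
    conn.close()


def restore_baseline(baseline: Path, working: Path) -> None:
    """Throw away the working database and replace it with a fresh baseline copy."""
    working.unlink(missing_ok=True)  # file-based stand-in for DROP DATABASE
    shutil.copyfile(baseline, working)


def count_users(db: Path) -> int:
    conn = sqlite3.connect(db)
    try:
        return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    finally:
        conn.close()


def run_demo():
    """Dirty the working copy, restore it, and report the row counts."""
    with tempfile.TemporaryDirectory() as tmp:
        baseline = Path(tmp) / "baseline.db"
        working = Path(tmp) / "working.db"
        build_baseline(baseline)
        restore_baseline(baseline, working)

        # A "test" leaves a mess behind and does not clean up after itself.
        conn = sqlite3.connect(working)
        conn.execute("DELETE FROM users")
        conn.commit()
        conn.close()
        dirty = count_users(working)

        # No per-test cleanup logic: just restore the baseline again.
        restore_baseline(baseline, working)
        clean = count_users(working)
        return dirty, clean
```

Calling `restore_baseline` from a suite-level or per-test setup hook means no test has to know what the previous one left behind.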
Considerations for Baseline Datasets
Be careful, very, very careful, when moving to implement a baseline dataset. While baseline datasets are extremely useful, you need to make sure you construct them to meet specific scenarios, and you need to make sure you understand how these datasets will impact all aspects of your testing.
Here are some questions I’ve asked myself and my teams when we’ve dealt with standing up baseline datasets in the past.
- What scenarios are we testing for? Make sure you know the test cases you’re trying to cover. Don’t just go creating scads of content in your database. Create sets of data to solve specific test needs; otherwise you risk side effects. (See below.)
- How much data do you need? What’s realistic for your scenarios? Do you honestly need 250GB of blog post content and comments? Are you trying to test blogs.msdn.microsoft.com? No? Then reel that desire for insane amounts of data back in a bit, please. Ensure you’ve got enough to meet your needs, but not too much more.
- What “shape” of data do you need? I’m sure the hard core data folks have a fancy term for this, but I’ve always used “shape” of data to describe patterns in how the data is laid out. “Shape” to me is a number of properties around how your data is constructed: how many users; users in what roles; types of content created by those users; the time period over which that content is created or your users interact with the system; etc., etc. Shape of your data becomes crucially important when you’re working with BI, reporting, or analysis testing.
- If shape is important, how long will your data be good for? You spent a lot of time this year building up a carefully crafted baseline dataset for your neat analysis tool. Will that dataset work as expected come 1 January of next year? Now all of a sudden you’re looking at a “Previous Year” window, not the “Current Year.” Ooops.
- Do we need real data or will dummy data be good enough? Working with geographic systems? Actuarial or financial systems? You’ll likely need something close to realism. Maybe you can get away with generated data, maybe not. Ask the question and get the answer.
You’ll notice all these considerations are extremely specific to the project you’re working on. There’s little or no chance you’ll be able to re-use anything about a baseline dataset from one project to another. Sorry.
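One way to soften the "Previous Year" problem is to shift the dataset’s dates forward by whole years when you load it, so rows built for this year’s reports still fall in the current year later on. This is only a sketch under my own assumptions: the `created` field, the `built_in_year` parameter, and both helper names are hypothetical, and whole-year shifting means some rows may land later in the year than today’s date.

```python
from datetime import date


def shift_years(d: date, years: int) -> date:
    """Move a date by whole years, clamping 29 Feb to 28 Feb when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:  # 29 Feb has no counterpart in a non-leap target year
        return d.replace(year=d.year + years, day=28)


def rebase_rows(rows, built_in_year, today=None):
    """Return copies of rows with 'created' moved so the dataset's year is this year."""
    today = today or date.today()
    years = today.year - built_in_year
    return [{**row, "created": shift_years(row["created"], years)} for row in rows]


# Hypothetical rows from a dataset that was hand-built during 2023.
baseline_rows = [
    {"id": 1, "created": date(2023, 3, 15)},   # meant to be "Current Year"
    {"id": 2, "created": date(2022, 11, 2)},   # meant to be "Previous Year"
]
rebased = rebase_rows(baseline_rows, built_in_year=2023, today=date(2025, 1, 1))
```

If your reports use rolling windows ("last 30 days") rather than calendar years, shifting by a number of days is probably the better fit. Either way, decide it up front rather than discovering it in January.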
Constructing Baseline Datasets
So you’ve carefully looked at the considerations above, and all the others unique to your project. You’ve got an idea of what you need. Now how do you build it?
- Sanitize and reshape live data. This is generally most applicable to load testing where you just need masses of data that’s close in shape to real data. Why not go get real data from one of your large customers and use that? You’ll likely need some form of non-disclosure agreement in place, and you’ll likely have to scrub out personal information, but that’s easily done with some SQL-fu.
- Use a data generation tool. Visual Studio’s Data Dude (or whatever the official marketing-speak name for it is) will look at a database’s schema and help you create data in amazing detail. Red Gate has their SQL Data Generator tool, and there are any number of other tools as well. Spend a little time researching and finding out what will work for your team, environment, and project.
- Use your automated tests. If you have a suitable suite of automated tests, use those! Ensure you’re not deleting any created data or dropping databases after each run, then simply set up a series of runs and have your tests create everything you need. Handy, that.
- Build some custom SQL. Extremely useful if you have critically sensitive shapes required for your data. You’ll have to hand-build your data to ensure you’re meeting your very specific needs.
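To give a flavor of the scrubbing step for live data, here’s a minimal sketch using SQLite from Python. The `customers` table, its columns, and the replacement values are assumptions for illustration only. What actually counts as personal information, and whether free-text fields need scrubbing too, depends on your data and your agreements.

```python
import sqlite3


def scrub_customers(conn: sqlite3.Connection) -> None:
    """Overwrite personal fields while leaving shape-bearing columns untouched."""
    conn.execute(
        """
        UPDATE customers
        SET name  = 'Customer ' || id,
            email = 'customer' || id || '@example.com',
            phone = NULL
        """
    )
    conn.commit()


def demo():
    """Build a tiny fake 'live' table, scrub it, and return the resulting rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers ("
        "id INTEGER PRIMARY KEY, name TEXT, email TEXT, phone TEXT, order_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO customers (name, email, phone, order_count) VALUES (?, ?, ?, ?)",
        [
            ("Ann Real", "ann@corp.test", "555-0100", 12),
            ("Raj Actual", "raj@corp.test", "555-0101", 3),
        ],
    )
    scrub_customers(conn)
    rows = conn.execute(
        "SELECT name, email, phone, order_count FROM customers ORDER BY id"
    ).fetchall()
    conn.close()
    return rows
```

The names and emails are replaced, but the row counts and `order_count` values survive, and that shape is what load testing cares about.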
Once you’ve got your data constructed, zip it up and get it in your source control system. Yes, store this huge binary file in your source control. It’s a critical project asset, and will likely be tied to specific releases of your software. Treat this asset with the same respect you’re treating other project artifacts.
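A small sketch of the packaging step, assuming the baseline is one file on disk. The `package_baseline` helper and the version-stamped archive name are my own inventions, and recording a checksum is an optional extra rather than anything required.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def package_baseline(data_file: Path, archive: Path) -> str:
    """Zip the baseline and return the archive's SHA-256 for later verification."""
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(data_file, arcname=data_file.name)
    return hashlib.sha256(archive.read_bytes()).hexdigest()


def demo():
    """Package a pretend database file and report what ended up in the archive."""
    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "baseline.db"
        data_file.write_bytes(b"pretend this is a database" * 100)
        archive = Path(tmp) / "baseline-v1.zip"  # hypothetical release-stamped name
        digest = package_baseline(data_file, archive)
        with zipfile.ZipFile(archive) as zf:
            names = zf.namelist()
        return names, digest
```

Committing the checksum alongside the archive gives your test setup a cheap way to confirm it restored the dataset that matches the release under test.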
By all means, keep your tests granular, self-reliant, and have them handle their own setup/teardown where feasible.
That said, there are situations where a baseline dataset makes extreme sense. Keep an eye open for those situations and make use of baseline datasets to help save your sanity and deliver better software.