FrazzledDad: 31 Days of Testing—Day 17: Rules for Effective Data-Driven Tests

Updated: Index to all posts in this series is here!

Today’s post comes from guest author Seth Petry-Johnson who’s posted it at his blog. Seth contacted me when I started this series and expressed interest in writing about keeping database tests in good shape. That topic is music to my ears, so I took him up on his offer.

Seth will be following up his post with more at his site, so I encourage you to keep an eye on his blog too!

Tests that hit the database are slow, fragile, and difficult to automate. All the cool kids are using mocking and stubbing and in-memory databases to keep their unit tests isolated and fast, and that's awesome. I do that too (though my "cool kid" status is debatable).

However, there are times when talking to a real database is necessary. Maybe you're testing actual data access logic, or maybe you're writing some high end integration/acceptance tests, or maybe you're just working in an architecture that doesn't let you mock/stub/inject your way to isolated bliss. If any of that sounds familiar, then this post is for you!

Below is a list of strategies and suggestions for effective data testing that I've collected from years of experience testing large, "enterprisey", data-driven applications. Data tests will never be painless, but following these rules makes it suck less.

Rules for good data tests

Tests should create their own scenario data; never assume it already exists. Magic row IDs kill kittens!
Make liberal use of data helper and scenario setup classes.
Don't use your data access layer to test your data access layer.
Tests should make no permanent changes to the database - leave no data behind!

Rule 1: Tests should create their own data

One of the worst things you can do in a data test is to assume that some record (a customer, an order, etc) exists that fulfills your scenario requirements. This is a cardinal sin for many reasons:

It's extremely fragile; databases change over time, and tests that rely on pre-existing data often break (causing false-negative test failures).
It obscures the test's purpose. A test's setup communicates the data context in which our assertions are valid. If you omit that setup logic, you make it hard for other programmers to understand the scenario that you are testing.
It's not maintainable; other programmers won't know what makes customer ID 5 appropriate for one test and customer ID 7 appropriate for another. Once a test like this breaks, it tends to stay broken or get deleted.

In other words: relying on pre-existing data means your tests will break often, are painful to maintain when they do break, and don't clearly justify why another program should spend time fixing them.

The solution is simple: each test should create each and every test record it will rely on. If that sounds like a lot of work, it can be.... but keep reading to see how to keep it manageable.

Rule 2: Liberal use of data helper and scenario setup classes

Setting up the supporting data for a test sucks. It's time consuming and generally results in a lot of duplication, which then reduces test maintainability and readability. Test code is real code and should be kept DRY like anything else!

I've found it useful to create two different types of helper classes to attack this problem:

Data helpers are utility classes that expose methods for quickly creating entities in the system. These classes:
- Are generally static, for convenience.
- Exist in the test project, not the main data access project.
- Create a single object (or object graph), such as a Customer or an Order with its OrderItem children.
- Create data with meaningful defaults, but allow the important fields to be explicitly specified where needed. (Optional parameters in .NET 4 FTW!)
Scenario objects (aka "fixtures") represent specific data scenarios that might apply to multiple tests, such as the scenario in which a Customer has placed an Order and one of the items is backordered. These classes:
- Exist in the test project.
- Have public properties that identify key data in the scenario (e.g. the Customer ID, Order ID, and backordered Item ID).
- Are instantiated by a test, at which time the scenario data is created.

In short, data helpers are low-level utilities for creating a specific data record in a specific state, while scenario classes represent larger contexts consisting of multiple entities. I have found that while the time needed to create these objects is not trivial, it quickly pays off as new tests are easier and easier to write.

Rule 3: Don't use your DAL to test your DAL

Tests for DAL code generally set up some data in the database, invoke the DAL, and then verify that the database was properly modified. I've generally found it difficult to use the primary DAL to quickly and concisely verify those assertions.

In some cases, the primary DAL may not expose a suitable API for doing record-level verification. For example, when there are significant differences between the logical schema exposed through the domain layer and the physical schema of the database, it may be impossible (or at least difficult) to write low-level data assertions.

In other cases, especially early in development, using the DAL to test the DAL creates dependency issues. For instance, many tests involve a sequence of events like "get entity by ID, save changes to it, then verify it was changed". If both the GetById() and Save() methods are currently under development then your test will give you inconclusive results until both methods are implemented.

In all of these cases I've found it valuable to verify data assertions using a LINQ to SQL data context. This provides a convenient, object-based representation of the data schema that is perfectly suited for verifying row-level operations were performed properly. This data context lives in the test project and is automatically regenerated (using SQLMetal.exe) whenever the schema changes, so it's a close-to-zero-effort solution.

You could also use a micro ORM like Massive, or anything else that makes it quick and easy to interact directly with the database.

Rule 4: Tests make no permanent changes to the database

Tests should be run often, and if you follow Rule #1 your tests create a lot of new data when they run. If you don't clean up after them, your test database will quickly grow in size. Also, if you point your test suite at the same database you use to run your app, you'll quickly get tired of seeing that test data accumulate in your views.

The easiest way to prevent this is to wrap each test in a database transaction, and then rollback that transaction at the end of the test. This performs the desired cleanup and also isolates tests running in parallel from interfering with each other's data.

There are a few different ways to approach this. Depending on your needs, check out this or this.

Conclusion

None of these techniques are particularly clever or game changing, but when used together they can significantly improve your data tests:

When tests create their own scenario data, you don't need to run them against a particular known state. This reduces maintenance costs significantly.
Investing in data helpers and scenario classes makes it easy to add new tests. The easier it is to write tests, the more likely that developers will actually do it.
"Close to the metal" abstractions like LINQ to SQL make it easy to write row- and field-level assertions against the database.
Adding some "auto rollback" behavior to your data tests keeps your database trim and tidy, no matter how many times you run your test suite.

Happy data testing!

2 comments:

Yossu said...: Thanks for another great post! This one especially hit a nerve with me, as I've been struggling for some time trying to work out how to test the actual data access code.

It's fine to say create your own data, but this isn't always so easy. What if your code needs to cover scenarios like when the query doesn't return any data, or when it does, and you need to write tests to cover both cases? Are you suggesting having a totally empty database, and populating it as needed before each test?

Apart from the problems of not being able to run up the solution (often the best kind of test is to try the app out and see if it works!) as there won't be any data, it will also slow the tests down a lot if you have to create any data needed before each test, and then delete it again afterwards.

One of my main projects currently uses a snapshot of the live database during development, so we can develop against realistic (if a little out of date) data. According to what you've written, we wouldn't be able to do this.

Not trying to argue, but I really can't see how to do this in practice. I'd love to read more, as it's a problem I've been trying to answer for a while. Can't find any good resources for testing the data access code.

Thanks again; 12/21/2011 7:50 AM
Jim Holmes said...: @Yossu: yes, you can use a set of pre-built data in some situations. (See Day 18: Baseline Datasets which discusses this specifically!)

That said, there are many scenarios where you just have to pay the tax and write tests which create a fair amount of data specific to that test. Integration- or functional-level tests hitting the database will always be slow. It's just the nature of the beast.

You mention not being able to run up the application for manual testing or development. That's a horse of a completely different color from what your test automation suites should be running against. Don't conflate the two situations. Yes, by all means, look to a default load of data for manual exploratory testing as well as your regular development work. That's a great help.; 12/21/2011 11:53 PM

FrazzledDad

Tuesday, December 20, 2011

31 Days of Testing—Day 17: Rules for Effective Data-Driven Tests