Saturday, February 7, 2009

And not a drop to drink - Leak and Soak Tests

Owning an old house has some advantages. You can enjoy its period architectural details and perhaps find some hidden treasures. When we did some house renovations, we found newspapers from 1923 in one wall of the house. It was quite a surprise to be able to read contemporary sports coverage of Babe Ruth[1]! Owning an old house also means that you are always performing maintenance. I remember late one cold January night when I woke up to the sound of water dripping. It wasn't the bathroom; it was a vintage cast-iron radiator dripping water through the ceiling. Nothing had changed in the water pressure, but the washers in the radiator had simply been worn down by prolonged use. Another time, a massive rainstorm resulted in a leak in the roof. The roof had held up for years against less severe rainstorms, and against equally severe but shorter downpours, but the extended heavy rain simply soaked through it.

In the first case, the problem wasn't that the level of water traffic had increased beyond what the radiator "system" could manage; it was the accumulated damage of a relatively low level of traffic over time. In the second case, the problem was that the roof "system" was overwhelmed by prolonged exposure to a high level of traffic.

OK - what do all these expensive home repairs have to do with software testing? Just this: when you go about designing stress tests for your product or application, you should include "leak" and "soak" tests* in your plan.

When some people approach stress test planning, they treat the tests as a blunt instrument with which to assault the system under test. This scorched-earth approach to testing can expose system failures, but those failures can be difficult to diagnose and reproduce. A combination of disciplined leak and soak tests can help you better identify the root causes of stress-related system failures.

How do leak and soak tests differ? The differences can be thought of in the terms of the radiator and roof failures that I mentioned above.

In a leak test, you run the system under a manageable and traceable level of traffic for an extended period, to determine whether the accumulated weight of usage causes a system failure. The classic case is when you are looking for a memory leak. The traffic load should not be extreme.

It may be that each system action results in a very small memory leak, or in the permanent allocation of some other system resource. If you run this test once, the leak may occur, but in the context of a system with several GB of memory, and a process that is using several MB of memory, you might not notice that the memory or system resource is not freed up when the test completes. But if you repeat the test, observe the process under test with a tool such as JBoss Profiler[2], and observe the system with utilities such as top, sar, or vmstat, then you may be able to spot a trend of system resources being used and not released. A great way to begin a leak testing regimen is to observe the memory and system resource use of the software under test in an idle state. Just start up the server or application product under test and then leave it alone for an extended period. You may find that its use of system resources increases over time, just through its self-maintenance, even when it is not actively processing user or client requests.
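To make the "repeat the test and look for a trend" idea concrete, here is a minimal sketch in Java. Everything named in it is an assumption for illustration only: the LeakTrend class, the leakyAction stand-in (which leaks on purpose), and the round and call counts. A real leak test would drive the actual system under test and watch it from outside with a profiler and top, sar, or vmstat; the in-process heap figures used here are only a rough proxy.

```java
import java.util.ArrayList;
import java.util.List;

public class LeakTrend {
    // Hypothetical stand-in for the operation under test: it deliberately
    // retains 1 KB per call so that the sketch has a leak to find.
    static final List<byte[]> RETAINED = new ArrayList<>();

    static void leakyAction() {
        RETAINED.add(new byte[1024]);
    }

    // Approximate heap currently in use by this JVM.
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Least-squares slope of the samples against their index,
    // i.e. the average growth per sampling round.
    static double slope(long[] samples) {
        int n = samples.length;
        double meanX = (n - 1) / 2.0;
        double meanY = 0;
        for (long s : samples) {
            meanY += s;
        }
        meanY /= n;
        double num = 0;
        double den = 0;
        for (int i = 0; i < n; i++) {
            num += (i - meanX) * (samples[i] - meanY);
            den += (i - meanX) * (i - meanX);
        }
        return den == 0 ? 0 : num / den;
    }

    // Run the action at a modest, single-threaded rate and record
    // heap use after each round.
    static long[] sample(Runnable action, int rounds, int callsPerRound) {
        long[] samples = new long[rounds];
        for (int r = 0; r < rounds; r++) {
            for (int c = 0; c < callsPerRound; c++) {
                action.run();
            }
            System.gc(); // only a hint to the JVM, so readings stay noisy
            samples[r] = usedHeap();
        }
        return samples;
    }

    public static void main(String[] args) {
        long[] samples = sample(LeakTrend::leakyAction, 20, 1000);
        System.out.printf("approx. growth per round: %.0f bytes%n", slope(samples));
    }
}
```

A slope that stays clearly positive across many rounds is the trend to investigate. Passing an action that does nothing gives a rough picture of the idle-state baseline described above.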

So, in a leak test, the key variable in the test equation is time.

In contrast, a soak test hits the system under test with a significant test load, and maintains that load over an extended period of time. A leak test may involve running traffic through the system under test on only one thread, but a soak test may involve stressing the system to its maximum number of concurrent threads. This is where you may start to see inter-thread contention issues or problems with database connection pools being exhausted. If you can establish a reliable baseline of system operation with a leak test that runs over an extended period of time, then you can move on to more aggressive tests such as soak tests. If, however, a leak test exposes system failures, then you would probably encounter the same types of failures with soak tests, and the level of traffic used in a soak test might make these failures harder to diagnose.
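As a rough illustration of sustained concurrent load against a limited resource, here is a minimal Java sketch. It is not a real load-testing harness: the SoakSketch class and its soak method are hypothetical, a Semaphore stands in for a database connection pool, a 1 ms sleep stands in for real work, and the thread counts and durations are arbitrary assumptions. A real soak test would run for hours, not seconds.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class SoakSketch {
    // Returns {completed requests, requests that could not get a pooled resource}.
    static long[] soak(int threads, int poolSize, long durationMillis) {
        Semaphore pool = new Semaphore(poolSize); // stand-in for a connection pool
        AtomicLong completed = new AtomicLong();
        AtomicLong starved = new AtomicLong();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(durationMillis);

        ExecutorService workers = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            workers.submit(() -> {
                try {
                    // Every worker keeps up the load until the deadline passes.
                    while (System.nanoTime() - deadline < 0) {
                        if (pool.tryAcquire(5, TimeUnit.MILLISECONDS)) {
                            try {
                                Thread.sleep(1); // simulated work while holding the resource
                                completed.incrementAndGet();
                            } finally {
                                pool.release();
                            }
                        } else {
                            starved.incrementAndGet(); // pool exhausted for this attempt
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        workers.shutdown();
        try {
            workers.awaitTermination(durationMillis + 5_000, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new long[] {completed.get(), starved.get()};
    }

    public static void main(String[] args) {
        // Far more threads than pooled resources, held for the whole run.
        long[] result = soak(32, 4, 2_000);
        System.out.println("completed=" + result[0] + " starved=" + result[1]);
    }
}
```

With many more threads than pooled resources, the starved count is where pool exhaustion would be expected to show up. Running the same method with one thread approximates the gentler load of a leak test.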

So, in a soak test, the key variables in the test equation are both time and a sustained high load level.

To sum it up, if a leak test is passive-aggressive in nature, a soak test is just plain aggressive.

* Thanks, Jarek!

References:

[1] Curse of the Bambino - http://en.wikipedia.org/wiki/Curse_of_the_Bambino
[2] JBoss Profiler - http://www.jboss.org/community/docs/DOC-10728

1 comment:

Michael Kelly said...

What a fantastic post. That's probably the most accessible description I've seen of the two approaches.