Thursday 6 December 2012

Managing Performance Troubleshooting Projects



Software performance troubleshooting can be defined and interpreted in several ways - for the sake of this discussion, it refers to dealing with performance-related problems that surface after a system goes into production. A related term is performance tuning, but I consider that to mean improving upon existing performance metrics rather than fixing a live problem. Performance troubleshooting projects tend to be guerrilla assignments where one has to act quickly and with precision. My experience is that managing such projects requires a slightly different mindset compared to a normal performance testing project.

First things first: why do systems have performance-related issues once they are in production? There can be many causes: unexpectedly high surges of traffic, not enough thought given to non-functional requirements at the system architecture level, poorly specified non-functional requirements, incorrect assumptions about how the test environment scales to production, dependency on an external system, and so on. A logical question to ask is, weren't all these things accounted for during the normal performance testing phase before going to production? That's a fair point, but it needs to be understood that performance testing itself is performed under extremely tight timelines. Much like functional testing, it is impossible to test every conceivable scenario in performance testing either. Furthermore, no matter how much one might wish otherwise, the testing and production environments are very seldom identical - this alone can be a source of many issues. The fact is that production systems, with their live data and real-time traffic, are so complex that simulating them with 100% accuracy in the test environment is almost impossible - hence the difficulty in predicting production issues.
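To make the point about scaling assumptions concrete, here is a toy Python sketch (all numbers hypothetical) based on Amdahl's law: even a small fraction of work that serialises on a shared resource - a single database, a lock, an external service - means production hardware delivers far less than a linear extrapolation from the test environment would suggest.

```python
# Toy illustration with hypothetical numbers: why "N times the hardware
# gives N times the throughput" is a risky test-to-production extrapolation.
# Amdahl's law: if a fraction `serial` of each request cannot be
# parallelised, the achievable speedup on n_nodes is capped.

def amdahl_speedup(n_nodes: int, serial: float) -> float:
    """Maximum speedup on n_nodes when `serial` fraction of work is serial."""
    return 1.0 / (serial + (1.0 - serial) / n_nodes)

test_throughput = 500     # requests/sec measured on a 1-node test rig
prod_nodes = 8            # production has 8x the hardware
serial_fraction = 0.10    # 10% of each request serialises on a shared DB

naive = test_throughput * prod_nodes
capped = test_throughput * amdahl_speedup(prod_nodes, serial_fraction)

print(f"Naive linear estimate  : {naive:.0f} req/s")
print(f"Amdahl-limited estimate: {capped:.0f} req/s")
# Naive linear estimate  : 4000 req/s
# Amdahl-limited estimate: 2353 req/s - barely 59% of the naive figure
```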

In these cases, prevention is obviously better than cure - but as I have tried to argue above, some things cannot be prevented. When performance testing for such projects has been completed (to whatever extent possible), it is critically important that potential production issues are highlighted as risks in the final document. This will go a long way towards understanding the situation when it's time to perform troubleshooting tests. The documented risks should be the starting point of the troubleshooting exercise. The tests should be few in number and planned very carefully so that the problem is identified as quickly as possible. They should be run in the production environment with live data. This in itself is a very tricky proposition - what should the load on the environment be? If it's too high, the system slows down and affects live users in adverse ways, potentially losing business; if it's too low, reproducing the actual problem might be difficult. Can certain components be switched off so as to isolate the problem - and if so, for how long? If additional data flows are needed, where do they come from? Then there are legal/contractual issues to consider, such as possible testing of any external systems (if allowed, then who does it?). All these are judgement calls that need to be made accurately and quickly (often in real time), both by the performance test specialist in charge and by the project team.
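On the question of how much load to apply, the safest starting point is a probe capped deliberately well below normal traffic. The minimal Python sketch below illustrates the idea (the URL, rate and duration are hypothetical, and a real exercise would use a proper load testing tool): it holds a fixed, gentle request rate and records latency percentiles, so the specialist can observe the live system without meaningfully adding to its burden. The rate can then be ramped up in small, pre-agreed increments, with an abort threshold agreed with the project team in advance.

```python
# A minimal sketch, not a production tool: a cautious, low-rate probe
# against a live endpoint. URL, rate and duration below are hypothetical;
# the point is that the load is capped so live users are not harmed.

import statistics
import time
import urllib.request

TARGET_URL = "https://example.com/health"   # hypothetical endpoint
REQUESTS_PER_SEC = 2                        # deliberately gentle load
DURATION_SEC = 60

latencies = []
interval = 1.0 / REQUESTS_PER_SEC
end_time = time.monotonic() + DURATION_SEC

while time.monotonic() < end_time:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(TARGET_URL, timeout=10) as resp:
            resp.read()
        latencies.append(time.monotonic() - start)
    except Exception as exc:
        print(f"request failed: {exc}")  # failures matter as much as latency
    # Sleep off the remainder of the interval to hold the rate steady.
    time.sleep(max(0.0, interval - (time.monotonic() - start)))

if latencies:
    latencies.sort()
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    print(f"samples={len(latencies)} "
          f"median={statistics.median(latencies) * 1000:.1f}ms "
          f"p95={p95 * 1000:.1f}ms")
```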

Performance troubleshooting is best performed by experienced specialists who have done it before. The time frame is obviously very short - the business loses money every minute the problem persists. To avoid clutter, there is usually just one person handling both the hands-on work and the performance test management duties. The project team treats these projects with a great deal of urgency - tasks to support the performance test specialist are expedited, and there is a general willingness on their part to let the specialist take the lead and guide the effort forward. This places added responsibility on the specialist to achieve the desired goal within the given set of constraints.

Performance troubleshooting is an expensive exercise since it happens very late in the product cycle. Unfortunately, systems that have had production issues once are at risk of being afflicted with them again - particularly if the remedies to the first issue involve changes to shared code repositories. It is therefore very important that performance test managers, when they do have the opportunity, insist that pre-production performance testing be as thorough as possible. If the exhibited performance is not satisfactory, or if enough testing has not been performed (for whatever reason), the clear recommendation should be not to risk going live - or at the very least, the risks of doing so need to be communicated in writing and fully understood.