I ♥ New Relic

Or how I learned to stop worrying and love the New Relic monitoring service

You know that feeling… you’ve just pushed the new service into production… it’s live… you have users hitting the thing… now is when the problems show up. 
 
What were the problems, you say? Well, for one, the New Relic monitoring I had put on the service was reporting a score of less than 0.7 at random times… it would then recover… but for an application where I was happy with the design (I don’t want to claim the design is perfect or even good… but I wasn’t moaning about it) that shouldn’t be the case.
 
And stranger still, I had ramped up load testing on the app prior to launch and never seen such issues.
 
And then I discovered this great little button in New Relic
[Screenshot: New Relic Profile. Caption: The very useful button]


Because you have the New Relic agent attached to the application, you now have the ability to run a profiler on the production instance… without stopping production!

I cannot stress enough how handy this is. The problem we had did not show up when load testing[1], where attaching a traditional profiler would have been easy. And we have real users on the service, so we cannot afford the downtime to stop the application and attach a profiler… so click, click… wait 5 minutes and analyse the results.
 

What did I find? Well, one path through the code was creating and then destroying an AsyncHttpClient instance on every pass (something that is designed to be a shareable, long-lived quasi-singleton) and burning up 77% of the CPU time doing the destroy. Two minutes later I had replaced the new instance with a reference to the shared long-lived quasi-singleton that all the other code paths were using, and redeployed… the result:
[Graph: AsyncHttpClient instance testing. Caption: The version with the bug fix was deployed some time around 04:20 (not my time zone).]

 

Notice how the CPU on both instances went down to less than 5% and the response time stabilised at something nice and low… 
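
For anyone who wants to see the shape of the bug and the fix, here is a minimal sketch. It is not the production code: `FakeClient` is a hypothetical stand-in for an expensive, shareable client such as AsyncHttpClient, and the method names are made up for illustration. All it does is count how often a client is created and closed on each path.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class SharedClientSketch {

    // Hypothetical stand-in for an expensive, shareable client.
    // It does no I/O; it only counts constructions and closes.
    static final class FakeClient implements AutoCloseable {
        static final AtomicInteger created = new AtomicInteger();
        static final AtomicInteger closed = new AtomicInteger();

        FakeClient() {
            created.incrementAndGet();
        }

        String get(String url) {
            return "response for " + url;
        }

        @Override
        public void close() {
            closed.incrementAndGet();
        }
    }

    // One long-lived instance that every code path can share.
    private static final FakeClient SHARED = new FakeClient();

    // The buggy shape: build a client, use it once, tear it down again.
    static String fetchWithNewClient(String url) {
        try (FakeClient client = new FakeClient()) {
            return client.get(url);
        }
    }

    // The fixed shape: reuse the shared instance, so nothing is
    // created or closed per call.
    static String fetchWithSharedClient(String url) {
        return SHARED.get(url);
    }

    public static void main(String[] args) {
        int createdBefore = FakeClient.created.get();
        for (int i = 0; i < 1000; i++) {
            fetchWithNewClient("http://example.invalid/" + i);
        }
        System.out.println("per-call clients created: "
                + (FakeClient.created.get() - createdBefore));

        createdBefore = FakeClient.created.get();
        for (int i = 0; i < 1000; i++) {
            fetchWithSharedClient("http://example.invalid/" + i);
        }
        System.out.println("shared-path clients created: "
                + (FakeClient.created.get() - createdBefore));
    }
}
```

With a real client the create/close pair is where the cost hides, which is why the per-call path only hurts once enough traffic goes down it.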
 
Thank you New Relic, you helped me find a bug that only surfaced with production load.

Notes:

  1. One of the issues with load testing is that you have to define your scenarios, and while you can add some elements of randomness to each scenario, you cannot reproduce the true randomness that occurs with real users. In our case I had not put sufficient randomness along the code path[2] that had the bug, so my caching layer was hot on those values and was masking the issue. Real users are not so kind to your code!
  2. How do you load test the code path where new users sign up via a specific third-party OAuth service? You’d need to create a large number of fake accounts on that service and feed them into the load tester… or you cheat and do what I did and just use 2-3 and wipe their records out of the DB every 10-15 seconds… yep… just begging for a bug disguised by a hot cache!
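
To put a rough number on the hot-cache effect from these notes, here is a toy model. It is only an illustration under stated assumptions: the "cache" is a set that remembers every key it has seen and never evicts, and `hitRatio` and the `user-N` keys are hypothetical names, not anything from the real service or load tester.

```java
import java.util.HashSet;
import java.util.Set;

final class HotCacheSketch {

    // Fraction of lookups answered by a toy cache that remembers
    // every key it has ever seen (no eviction, no expiry).
    static double hitRatio(int requests, int distinctUsers) {
        Set<String> cache = new HashSet<>();
        int hits = 0;
        for (int i = 0; i < requests; i++) {
            // Cycle through a fixed pool of test users.
            String user = "user-" + (i % distinctUsers);
            // add() returns false when the key is already cached.
            if (!cache.add(user)) {
                hits++;
            }
        }
        return (double) hits / requests;
    }

    public static void main(String[] args) {
        // A load test that recycles 3 accounts: almost every lookup is a hit.
        System.out.println("3 users:   " + hitRatio(300, 3));
        // Every request from a different user: the cache never helps.
        System.out.println("300 users: " + hitRatio(300, 300));
    }
}
```

In this model, recycling three accounts over 300 requests means only the first three lookups miss, so the slow path behind the cache is barely exercised.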
—Stephen Connolly
CloudBees
www.cloudbees.com
Stephen Connolly has nearly 20 years' experience in software development. He is involved in a number of open source projects, including Jenkins. Stephen was one of the first non-Sun committers to the Jenkins project and developed the weather icons. Stephen lives in Dublin, Ireland, where the weather icons are particularly useful. Follow Stephen on Twitter and on his blog.