Wednesday, January 19, 2011

Unexplained spike in web traffic

Question

I am suspicious of an unexplained ~2000% spike in traffic and a massive slowdown that lasted about 10 minutes. I'm not sure whether it was an attempted DoS attack, a dictionary login attack, or something else. Regardless, what actions should I take to monitor my server (which logs should I look at, what tools should I use, etc.) to make sure nothing nefarious happened? What steps should I take during future slowdowns like this one? Is there a standard way to have the server alert me during such a traffic surge?

All the gory details:

One of my clients reported an unresponsive website (Ruby on Rails served via Apache, Mongrel, and mongrel_cluster on a CentOS 5 box) around 1:00 today.

I was in full troubleshooting mode when I got the email at 1:15. SSH and page loads were indeed exceptionally slow, but ping output looked fine (78 ms), and a traceroute from my workstation in Denver showed slow times at a particular hop midway between Dallas and the server in Phoenix (1611.978 ms vs. 195.539 ms). Five minutes later the website was responsive and traceroute was routing through San Jose to Phoenix instead. I couldn't find anything obviously wrong on my end; the system load looked quite reasonable (0.05 0.07 0.09), so I assumed it was just a networking problem somewhere. Just to be safe, I rebooted the machine anyway.

Several hours later, I logged into Google Analytics to see how things looked for the day. I had a huge spike in hits: this site usually averages 6 visits/hour, but at 1:00 I got 130 (roughly a 2000% increase)! Nearly all of these hits came from 101 different hosts spread across the world. Every visitor spent 0 seconds on the site, every visit was direct (so it's not as though the page got slashdotted), and every visit was a bounce.

Ever since about 1:30, things have been running smoothly and I'm back to the average 6 visits per hour.

Disclaimer:

I am a code developer (not a sysadmin) who must maintain web servers for machines that run the code that I write.

  • It's unclear what you were pinging/tracing, and from where. But if that was a hop in the middle of a traceroute's output, then the jump from ~195 ms to ~1600 ms probably means network congestion. If this correlates with your event and with the change in routing path, it is possible that part of your provider's network was attacked, your server included.

    There is no single solution to your problem. There are many tools and approaches, like Scout, Keynote, New Relic, Nagios, etc. It all depends. Whatever you decide to do, just don't forget one thing: if you monitor a server from that same server and it becomes unavailable, you lose any means of notifying yourself that it is down :)

    From monomyth
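
As a first pass at "which logs should I look at," the raw Apache access log can confirm or refute the Analytics spike. A minimal sketch, assuming a combined-format log at `/var/log/httpd/access_log` (a common CentOS default; the path and the example timestamps are placeholders to adjust for your vhost layout):

```shell
#!/bin/sh
# Hypothetical log location -- adjust per your Apache/vhost configuration.
LOG=/var/log/httpd/access_log

# Requests bucketed per minute: field 4 looks like "[19/Jan/2011:13:00:05",
# so taking 17 characters after the bracket drops the seconds.
awk '{ print substr($4, 2, 17) }' "$LOG" | sort | uniq -c | sort -rn | head

# Top client IPs during the suspicious window (13:00-13:09 as an example).
grep '19/Jan/2011:13:0' "$LOG" | awk '{ print $1 }' | sort | uniq -c | sort -rn | head
```

If the 101 hosts show up here with rapid-fire, zero-referrer request patterns, that points more toward a botnet scan or low-grade DoS than organic traffic. Note too that Google Analytics only records visitors whose browsers execute its JavaScript, so the access log is the more complete record of what actually hit the server.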
