December 20, 2013

I Could Spit Nails

Got a customer that went live with Phase 2 of their implementation of our software.  As it happened, they did this Dec 1, which was the week after I got back from the Texas trip.  I was the on-call support dude for the week, and lo and behold, they had an after-hours outage on Wednesday the 4th.

Been fighting their system ever since.  Maxing out on CPU and RAM, errors in the indexing, page load failures... A heap of things.  They gave me some numbers from their Analytics software that tell them how many users are in at a given time, and by those numbers the servers *should* handle that load.

But they weren't.

Crash self-study course in Load Balancer configuration so I could check those.  Seemed OK.

Crash self-study course in Apache Tomcat webserver to troubleshoot that... Seemed OK.

Ran a Load Test (last week Thursday at 10:30p-1:00a) with a script they wrote to simulate 1500 users an hour hitting their system.  Crashed.  Broke.  The whole thing went Tango Uniform.

Made some changes.  Tweak some values.  Add this config.  Remove that config.  Change a few more things.  Cuss.  Drink coffee.  Smoke cigarettes.  Cuss some more.  Change more configs.

Ran another test this week Wednesday, same time.  I'm working 12+ hours a day on this system.   Have been since 04 December.

Their CEO gets involved.  He's raising hell with our CEO, who (thankfully) doesn't raise hell, but simply says "What do you need to solve this?"

Grab a Senior Software Engineer, pull him into a remote session to investigate.

We dig around.

We see traffic coming in.

The system crashes.  

Another of our Senior guys says "Open this obscure seven-directory-deep file and search for this.  It tells you how many users are hitting the system at any time."

The number comes back... 15,000.


They lied to us about their volume, and we built an environment to meet the volume they mentioned (1500).

Wonder of wonders that when you increase your traffic 10-fold, your under-powered servers can't handle it.

I'm so frustrated at their dishonesty, I could spit nails.

1 comment:

Old NFO said...

Ouch... at least you finally 'found' the truth... Now the question is, will they pay to upgrade to actually support their load???