Log Book - December 14, 2017
8:45AM - “It was the best of times, it was the worst of times.”
A new day begins in the office, located at the heart of Budapest. The reassuring beep of the door buzzer tells me that I’ve clocked in. Today, I’ve decided to come into the office to do some pair programming with a colleague of mine. We needed to solve a nasty AWS issue where DNS wasn’t assigned to our servers during one of the CI test runs.
It was a tricky trouble, but I was adamant that we could crack it. I met up with Balázs, my friend and colleague, and we chatted about the last hackathon meeting and its discussions surrounding Route53. At that time, we found out that AWS throttled our requests and our test wasn’t ready to deal with it. Nothing that a little sleep / retry couldn’t fix.
But this time, the bug was an elusive one. We’ve been at it for about a week now without actually being able to find its root cause.
Never mind. Today was the day. I felt it in my bones.
9:15AM - The hunt begins
We sat down in front of the 24” 4K DELL, which is by far my favorite monitor. Not going to lie, those terminal fonts never looked more crisp than in 4K. But it serves another purpose: It’s easy to divide the terminal into a two panel parallel working station. It’s my turn to type, and Balázs’s turn to drive.
I grabbed the keyboard and fired up my IDE. We looked at the code for the hundredth time, trying to figure out where the problem may lie.
Yesterday, I read on a forum that the newest docker version may have trouble with networking. Since our tests are running in a container, it could be possible that the DNS hasn’t been assigned and handled properly. We looked at the Dockerfile, again.
“Maybe we’ve missed something,” Balazs pointed out.
I know this part of the code intimately because we’ve opened it up a couple of times while hunting for this bug. With a solemn nod we plow through each line, again and again, making sure we’ve correctly understood each step and its side effects.
The day goes by slowly. I was about to call for a break when I noticed something odd in the php-fpm initialization script. The sites are dynamic and socket activated, and I notice that the startup sequence is an older code. With a swift edit, we changed some of the behavior for that section, and eagerly fired up a new test run. We hoped for the best.
Sadly, another failure. The PR was still marked red. A no go. Bah! I had had enough. So, I called for a rest.
11:30AM - Water cooler chat
The pairing session lasted for quite some time and I felt drained. I went out into the kitchen area where our resident Barista, Feri was brewing some excellent coffee. I took a cuppa and nestled myself in comfort. I played darts for a little bit so I could distract myself, allowing my subconscious to mule over this relentless puzzle.
I started to get hungry, so I asked Balázs whether he’s up for some lunch. Since our office is right next to Nyugati Pályaudvar, food choices are abundant. Next to us is the WestEnd City Centre where there are plenty of nice things to eat such as fish, chicken, beef, salads, pizzas, gyros, Thai, Chinese, Italian and Greek. Almost all the nations are represented in that plaza.
We decided to settle for a nice Italian pizza at Bellozzo’s. As we ate, we discussed the day’s advent of code challenge, and whether either of us had managed to solve it. And if not, what would be our next approach? Do we have a clear understanding of the problem?
By the time we got back to the office, we were rejuvenated. Winter was slowly setting in, and the cool air on my face somewhat refreshed my thoughts.
12:45PM - Once more into the fray
And, it was back to business again. Balázs, with hands resting on the keyboard, suddenly lights up, remembering that a while back he reviewed a PR where the team changed Jenkins’ credential handler. He quickly dived on to Github, and after some searching, finds what we hoped would be our “egg of Columbus”. Turns out, it was just another dead-end.
After that, we tried various code patches. We fired off test run, after test run, in our AWS dev environment. In a haze, we frantically commented out code, trying to find the root cause for this bug.
2:30PM - Dead-ends all over the place
By now, we had half the handlers and classes which dealt with DNS handling and networking, commented out or with added debugging information to, or otherwise marked. The bug persisted.
As an AWS partner, we have access to premium support so we fired off an AWS ticket with as much logging and information as we could muster. While we waited for a reply, we started periodically reverting to older commits. Grafana and our system metrics told us that the problem had manifested itself, exactly one and half months ago.
Of course we tried reverting all the relevant commits- but alas, no cigar! It had to be from some outside change. It just had to!
3:36PM - AWS Support
“I’m sorry, but it’s not us.”, said Rick from AWS support.
“Come on Rick! Give us something!” I pleaded, “We’ve even reverted a week back before it manifested itself, but it still has failed. It has to be something from your end. I’m running out of ideas here man.”
“I know, I know. “ Rick said empathically, “I’m looking at everything we have here, but I don’t see any errors other than the ones you guys have already described to me. I’m sorry.”
“Fine, thanks for checking anyways. Could you at least please forward us all the logs you can get from these time periods.”
”Sure,” said Rick, “But it’s a lot!”
“It’s fine,” I said reassuringly. “Eventually we will find something.”
And with that, I ended the call. Shortly after, a Zip file arrived to us with a bunch of logs to look at. Hurray!
4:47PM - Logging is key
Gah, I can’t take it anymore! There HAS to be something, I thought. I stood up, stretched my back, and paced furiously up and down the office. At that point, we had been looking at logs for more than an hour. We can see that every time a request tries to get in through the socket, it can’t be reached.
A small victory though: by finding out that it wasn’t a DNS problem, we knew that we had been looking in all the wrong places.
5:25PM - Hope
After a brief stand-up with our colleagues in Boston, Massachusetts, we sat down again. Out of nowhere, Balazs leaped up, yelling, “version!” I looked at him befuddled. Huh?
Balázs frantically opened PR after PR when suddenly, he stops at a peculiar one. It’s seemingly unrelated to our troubles, but now that the logs told us that the error wasn’t actually what first bubbled up to us, we learnt not to act based on appearances. Turns out, that a bad merge pulled in a version of a package that we didn't support yet and caused our troubles.
5:45PM - Fingers crossed
Once we found the culprit, the error made perfect sense. With fingers crossed, we fired off a test run for the millionth time… and... greenboard baby! The PR flag for the test run flashed green! One out of three safeguards were now cleared. We only needed two reviews from other people, and then it was good to go!
Exhausted but bursting with happiness after finally catching the bug. We packed our things and headed out for the day.
T’was a good run.