Debugging in Linux!

What’s Debugging in DevOps?

And why is it challenging?

  • It’s part of your application lifecycle that doesn’t get covered in sprints, Agile, Scrum, etc.
  • Something went Wrong. Something is not working. Something is slow.
  • No self written code. Everything that we’re debugging, is actually part of some libraries, some kernel code or some developer written code.
    But we don’t have luxury to do print statements.
  • Figure it all out when the house is on fire. You’ve 5 minutes!!

The scope of this talk is only about observability tools. Benchmarking, Tuning, Static config check tools are for some other day.

Usual Methods

  • Street light.
    I know few things. I’ll try to look only in those areas.
    Staring at code until bug reveals itself.
  • Tune randomly.
    Increase CPU, IOPS, Memory.. restart the service. It works!
  • Blame someone else.
    When you’ve lot of connected components, you just go from app layer, database layer, network is bad, it’s raining at our data center in US.
  • Looking at dashboards.
    This is something I follow reluctantly. If I’ve 100 different metrics on a server, which all dashboards do I look, before concluding anything?

Any of these methods might work sometimes, but don’t make it a practice.

Usual things

  1. Hey, my application is running slow. Why?
  2. Hey, my machines are at 100% cpu usage.. can you add more CPU cores?
  3. OMG! I’ve an out-of-memory error. Add more RAM!!
  4. It was working fine in staging, was working good under load testing.
  5. We were able to handle higher loads than this, why is this happening now?

Latency Numbers Every Developer Should Know



  • Lets you look at what’s your program is doing, without knowing anything about source.
  • It prints out every system call made by your process.
    Do not run on production.

Case Study:
Missing library. A python app called Sentry, a beautiful open source real time error tracker. Now, we’ve installed it in virtualenv and something is wrong. I did strace -e trace=open,read <pid> and foudn out that one of the libraries is being loaded from somewhere other than my env folder.

  • Other usual cases I use strace for:
    • Finding out where are my logs. strace -e write
    • Which configuration file is it using. strace -e open
    • Which network components is it connecting to?


Strace linux standard uptime and find out where is it getting data from!

Strace over uptime

Fun stuff to do: cat /proc/uptime, cat /proc/loadavg


It prints out how much network and disk every second.


Case Study:
Mysql is intermittently slow. Sniffing network packets, and monitoring for a while, can tell you what’s happening. In one scenario, recv has jumped to 455MB and mysql went slow. So, it’s a query! Now, we’ve a lead.. We can inspect further.


top is the first guy we look at.

Problem: Shortlived processes aren’t covered in top/htop..

atop got your back, though. It prints out short lived processes too.
Other cpu tools: mpstat, iostat

Fun exercise: Strace all these tools!

However, these tools can only go till process level. What if I have a tool which tells me which function in my program is using most of the CPU?


perf is an amazing tool! Does a lot! It’s a sampling profiler.

Flame Graphs are interesting way to look at perf reports. Because perf reports can be long!


opensnoop traces open() syscalls, showing the file name and returned file descriptor.

opensnoop log

This is part of an army called perf-tools

Isn’t this same as strace -e open?? Yes. However, strace freezes every system call, prints it to output, unfreezes it. While opensnoop does it without freezing system calls.

Fun Stuff to do:How do we know this? strace strace, strace opensnoop Reading source code should help you better.


Quickly find out if firewall is the issue. telnet <host-ip> <port>
Till date, I found it useful only for this, in debugging.


Print network connections, routing tables, interface statistics, etc. I use it mostly to find out how many connections from each host are there currently on my server.
It’s simple to use! netstat -tunapl and grep desired stuff.

Something like this is helpful one-liner.
netstat -antu | grep :80 | grep -v LISTEN | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn


Tells what files are opened by what processes.
Classic case of disk space
df -h shows 100% disk usage. du -sh / tells other tale. Is linux lying?

Not actually. lsof +L1 tells that you’ve few files , but your filesystem has a pointer to it. So, it’s part of filesystem space.

How to reclaim? Restart process, to release deleted files.

In general, avoid deleting a file currently held by process. If you want to reclaim space, and keep the file intact echo > filename

Widely, we’ve these tools doing either of sampling profilers, trace system calls.


Basic tools:
- top, htop, atop, ps, mpstat, free, vmstat, iostat, netstat


Somewhat unpopular and difficult to articulate.
- strace, dstat, tcpdump, nicstat, pidstat, swapon, lsof, sar


Too specific and requires some knowledge about linux kernel.
- ltrace (trace library calls of the function), ss (socket inspection tool), ethtool, blktrace (traces of i/o traffic on block devices), slabtop, tiptop, pcstat and many more!

Fun and Important Stuff: Which one is used by your monitoring system? Nagios, Sensu, Newrelic, Datadog, etc.. somehow use the same underlying functions.

A lot more under /proc

We’ve seen a lot of tools above. If you strace them all, you’ll find they’re opening /proc/ at one point or the other time. All the treasures are hidden here!
/proc/ is a directory in linux where all of your processes stay.
It has some folders under the hood. What I’d be most interested, in general, are limits and status. Sometimes, filedescriptors!

Few more cases:

git is slow!

We’ve our hosted gitlab server and it’s painfully slow for a few days.
Updated gitlab version a few days earlier, but is it really the reason?
Increased verbosity over git.


Found out ssh is slow. Turned on strace on sshd, on server during off work hours. Found out some obscure library call on systemd taking too long. Googled it around, and boom! There’s a linux patch yet to be merged. Lived with workaround for a couple of days, applied patch and everyone is happy!

Finding out this, would’ve been practically impossible with our basic tools like top, iotop, etc.

Postgresql is slow after about 2 million operations.

An obscure postresql server that’s running painfully slow, after about 2 million operations. Usual suspects like CPU, RAM are calm. We restart the system (not just service!), and it works fine for 2 million more operations. Now, as it is production, we can’t run strace. Did some random unrelated tunings and expected it to work, it obviously doesn’t.
Finally did perf top and bunch of other things to find out something is awry with L1 cache.

Randomly DNS fails sometimes.

Yet to be debugged to the root level. I’m poor in networking! I’m sure it’s not difficult. I might have to do some tcpdump or so.

Curious case of Amazon CPU credits, I/O Burst credits.

We run out of them, quickly. And while we can have alerts to do some actions when they go out of credits, I was more keen on how the usage has been. To scale right in production, is a myth. We got to do everything related to profiling before hand, understand what resources my program is using, etc. and decide on what instance type, capacity to allocate.

Some of these might not be applicable for docker.
I tried to run strace inside a container for the first time, and it said permission denied. I’m gonna figure out why and update here.

Reference blogs to read more:
Brendann Greg
Julia Evans
htop explained by Peteris