Posts Tagged ‘fastboot’

Introducing ‘timechart’

Monday, September 7th, 2009

Finding out why your Linux computer performs the way it does has been a hard task. Sure, there is Oprofile, and even ‘perf’ in recent kernels. There is LatencyTOP to find out where latencies happen.

But all of these tools are rather limited when the software stack that has the performance issue is more complex than a single program. The tool that comes closest to being useful is `bootchart‘, but that has a rather limited resolution.

To solve this, I have been working on a new tool, called Timechart, based on ‘perf’, that has the objective to show on a system level what is going on, at various levels of detail. In fact. one of the design ideas behind timechart is that the output should be “infinitely zoomable”; that is, if you want to know more details about something, you should be able to zoom in to get these details.
The rest of this blog post describes some aspects of timechart, using real life examples and screenshots. However, it is really hard to show the power of timechart on such a static page, to get a real feeling of what timechart can show you really ought to try it out yourself.

Timechart basics

The output of the timechart program looks like this:

highlevel

At the very top, a very short reminder of what each colors means is given.

The second part of the output is the per-cpu information. In the case of this trace, taken on my Penryn based dual core laptop, there are two horizontal bars, one for each core.

The third part of the output is the per-process information, the grey bars in the lower 2/3rd of the image. Each significant process has its own bar.

In the example above, bash was very busy (as indicated by the blue color), and thus also kept one of the CPUs busy at all times (also in blue in the CPU part of the output). Around the 4 second mark, the firefox program had some activity (although it was never really quiet) and the X server later on became active as well.

In the current zoom level it is a bit hard to see, but the yellow pieces in the process bars indicate processes that are waiting for the scheduler to give them a timeslice, while red areas would have indicated that the respective processes are waiting for disk IO.

This view gives us a high level idea of what is going on, but it does not look very detailed. The design philosophy behind timechart is that the output is “infinitely zoomable”. By using the zoom function of your SVG reader, you can zoom into an area of the output that you are interested in. For example, lets look at the Xorg activity to find out what is going on:

zoom

Here you can see Xorg starting out sleeping (gray), but at some point activity happens (as indicated by the little green lines; these green lines show communication). Once the communication happens (X11 draw requests), you can see that Xorg alternatingly is running code (blue) or waiting for the CPU scheduler to give it a timeslice (yellow). If you look careful you see a few “2″s in this zoom level already; these denote the cpu number Xorg is executing on.

At this zoom level, there are still activities that are so detailed that they don’t make much sense yet… so of course, we can zoom in even further.

Rather than showing the output of a futher zoom, I welcome you to take your own look around using this example output of timechart.

Scheduler delays

By now, everyone who has followed Linux kernel development in the last few weeks has seen the controversy surrounding the CPU schedule, with the BFS-versus-mainline debate.

The timechart tool visualizes the delays caused by the scheduler using yellow areas in the per process time display. For example, in the image below, you can see that after some scheduler delay, the process runs for a short time, as indicated by the blue area. After a little while, the process is waiting for (disk) IO, likely because it’s reading a file that’s not yet in the page cache. Once the IO is complete (as indicated by the little red dots), the process now ends up waiting for the CPU scheduler (yellow) to execute the program (blue). In the trace below, the program then runs briefly before hitting an IO delay again; in fact there are several “run, wait for IO, wait for scheduler” sequences in the trace below. As it happens, this trace is from the “make” program during a kernel compile.

scheddelay

Communication cost

One of the “hidden killers” of performance is the cost due to communication latencies. The timechart tool tries to visualize these using thin green lines between processes that communicate.

For example, in the picture below a “cc1″ process (top) is communicating with an “as” process (bottom), as part of a kernel compile. The first three green lines show that “cc1″ is communicating (sending data in this case) to the “as” process, which, after each communication needs to wait for the process scheduler (yellow area) before it gets to act on the data. The second three lines show the inverse communication; where “as1″ is waking up “cc1″ (that was waiting for a short time, as indicated by the short grey area the first and third time, the second time the wait was so short that the grey area is invisible). And again, the yellow areas indicate that “cc1″ first has to wait for a time slice before it gets to actually execute (blue).

wakeup

Other real world examples where I’ve used timechart to find communication latencies is measuring why starting the “hal” program is so slow; it turns out that there are seemingly hundreds of dbus round trips involved in that between various components, and timechart shows this beautifully. (I’ve left the diagram out of this blog post to keep the size reasonable).

Power Management tuning

timechart can also be used as part of the tuning of some of Linux’ power management. Controlling the CPU frequency is perhaps an obscure area of the kernel where only a handful of people are actively working… but as it happens I care about this area and timechart is a very useful tool here. With timechart, we can see when frequency changes happen, what lead up to that change and then what the impact of said change is. For example take the portion of a timechart below:

pstates1

You can see here that the “claws-mail” program started running on the CPU for some time, when the CPU frequency (indicated with the yellow thick line) was about 1/3rd of the maximum frequency. Pretty soon after that, the cpu frequency gets increased by the kernel to the maximum, and when “claws-mail” is done, the frequency is lowered in two steps. As you can see in the graph, the middle step of this frequency reduction was totally unneeded. This highlights an area of optimization… expect patches for this on the Linux Kernel Mailing list soon.

As with most things in timechart, zooming in will give you additional information:

pstates2

In the zoomed in graph above you can see that the actual frequency is displayed inside the yellow lines, and you can also see in more detail how much cpu utilization there actually was that lead to the decrease from 2.0 Ghz to 1.6 Ghz.

Inkscape

I recommend the use of the inkscape program to view the output of timechart. I know there are many SVG readers, but I have found that several of them aren’t capable of coping with the high level of detail that timechart provides.

It almost seems that some SVG capable programs have an O(N^2) algorithm, and with timechart, the N gets very high. Others don’t allow you to zoom in far enough to get all the detail that timechart can provide.

That doesn’t mean I’m entirely happy with inkscape; it looks like version 0.46 has at least some performance problems, and also I’d like to have the maximum zoomlevel increased so that it’s possible to zoom in even further.

Naming contest

Several people I talked to don’t like the name “timechart”. Unfortunately, nobody has come up with a better name for the tool yet.

If anyone has an idea for a better name, please let me know