
a thought to help with debugging ns



 
A long sad story, followed by a hopefully helpful suggestion...

I have been running some fairly large simulations recently.
Something like 500-1000 FTP sessions, which yield many more TCP
flows.  The simulations take ~20 minutes to run and produce many,
many megabytes of trace files.  I started running across "bugs":
connections that would start but never finish, or connections that
were supposed to start but never sent any packets.  A bunch of
things jumped out after I distilled the trace file down to something
I could look at.  Some of these bugs were obviously in my distilling
code.  But, others turned out to be ns' fault (which I have posted
to the list).

I am sure it is not news to a lot of you that debugging a 20 minute
simulation with thousands of connections is not a fun task.  My first
thought was to try to come up with a small simulation of one TCP
connection that would run in 2 seconds, which I could then debug
with gdb or print statements or whatever.  No such luck.  And, I
tried hard.  After it was clear that I was going to have to debug
the 20 minute simulation, I went home for the day, had a beer, and
decided to tackle it the next morning.

Half asleep the next morning I thought gdb with watchpoints was my
answer.  Obviously I had not taken in enough caffeine yet.
Watchpoints take forever on a small program, let alone a 20 minute
simulation.  I decided to keep thinking...

So, I thought I'd just stick some print statements around (the first
debugging technique I learned and still by far the most robust).
Obviously that was not a well-thought-out idea either, as I then got
output dumps that were bigger than the trace files (many megabytes).
Sifting through them was going to be painful at best.

Next I remembered that I knew when the event was supposed to happen
(or did happen -- depending on the particular bug I was chasing).
So, I added a little if statement in front of my debugging print
statements to only print stuff around the time when the error was
happening.  That still left lots of garbage about connections that
were working perfectly fine.

Finally, it hit me...  What I needed was a variable in the TCP class
that I could turn on if I wanted the debugging.  And, so that is
what I did...

So, I added a variable to the protected section of the TcpAgent
class...

    int tcp_debug_;	/* enable debugging output for this agent */

And, in the ns-default.tcl file I added a default value...

    Agent/TCP set tcp_debug_ false

Finally, in TcpAgent::TcpAgent() I added...

    bind_bool ("tcp_debug_", &tcp_debug_);

Then, all debugging type statements take a form like...

    if (tcp_debug_)
        printf ("slowdown()\n");

Now the debugging information can be turned on and off (assuming
you're using callback routines, which I am) in the tcl code.  Very
nice.  This generally helped point out the errors rather rapidly.
Well, OK, it still took 20 minutes for each simulation run, but I
had to make very few debugging runs after I introduced this
machinery, since the debugging output was easy to read and only for
the busted connection.  
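
(Turning it on for the one suspect connection is just a one-line
instance variable set on that agent in the simulation script,
something like "$tcp set tcp_debug_ true", assuming $tcp is the
agent's handle; everything else stays quiet because of the false
default in ns-default.tcl.)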

I suggested the above to Sally, who thought it was a nice idea.
However, while the above took only about 2 minutes to implement,
what I am suggesting is that ns include these debugging variables
all over the place (I have subsequently added a sink_debug_ to the
TCP sink class).  
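
For the sink, it is the same three pieces over again.  Roughly
(where the print ends up is just an illustration here):

    /* tcp-sink.h -- in the protected section of class TcpSink */
    int sink_debug_;	/* enable debugging output for this sink */

    /* tcp-sink.cc -- in the TcpSink constructor */
    bind_bool ("sink_debug_", &sink_debug_);

    /* tcp-sink.cc -- e.g., at the top of TcpSink::recv() */
    if (sink_debug_)
        printf ("%.6f TcpSink::recv()\n",
                Scheduler::instance().clock());

plus the matching "set sink_debug_ false" default for Agent/TCPSink
in ns-default.tcl.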

For what it's worth, I encourage the VINT team to add tons of such
variables.  Maybe not all in one pass, but as you work on various
classes keep this in the back of your mind.  I am guessing that
would be a big help in debugging rare problems that only seem to
crop up in large simulations.

allman


---
http://roland.grc.nasa.gov/~mallman/