[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

CalendarScheduler and "backwards in time"



I, too, have encountered the "backwards in time" problem with ns's
CalendarScheduler. The CalendarScheduler is indeed not reliable for
long-running simulations with a great many events that are scheduled very
closely.

First: my code never schedules events with negative delays; this sanity
check was the obvious first move.

I believe I've identified the underlying cause and have a crude fix. I'm
working on a much better fix, that I'll be able to post code for in the next
day or so.

The problem, as one might expect, is with numerical stability. The Calendar
Queue dequeue algorithm repeatedly adds a bucket's width (width_ in ns) to
the bucket upper boundary (buckettop_ in ns), as it scans across the "days"
of the calendar. Because buckettop_ and width_ are floating-point values,
there is opportunity for rounding error in buckettop_. Specifically, when the
width_ is very small, and buckettop_ is large relative to width_, there can
be insufficient precision to capture the small increment to buckettop_. This
loss of precision is magnified over the millions of repetitions of this FP
addition.

The CalendarScheduler code's author attempts to avoid the problem in two ways:

	- When it computes width_ (bucket width), it won't choose a width_
	  that's less than the current clock_ value * some small constant
	  (MIN_WIDTH, 1.0e-6 in the ns implementation).  This doesn't let the
	  width_ be too small relative to the current clock_, and thus should
	  avoid loss of precision in the repeated additions mentioned above.

	- It uses a buckettop_ that is actually (.5 * width_) larger than the
	  true value, in case there *is* rounding error in the repeated
	  additions to buckettop_.

Note that the former will worsen the insert performance of the queue--when
invoked, it effectively forces more events into a bucket than the sampling
suggested.

The latter, however, is not sufficient alone for all simulation run lengths
and event patterns--the underflow can accumulate more than .5 * width_'s
worth without control of the choice of width_.

I can observe the underflow in gdb. After 300 seconds of a simulation with
very many densely spaced events, I observe that the minimum width_ is always
used, and that there are events in the current bucket with timestamps
*less* than the bottom of the current bucket (this occurs after "wrapping
around the year" when there is sufficient underflow--you'll leave a bucket
"too early").

If I had time, I might look at the IEEE FP representation and see how these
values look with regard to the representation's precision. But this isn't
a 1st-year grad course in numerical computation. :-)

If I change MIN_WIDTH to 1.0e-5, the error vanishes. But this worsens enqueue
performance significantly.

I'm preparing a better fix now, and will post code for it shortly. The
better fix is to keep an unsigned long int count of "bucket hops", and
periodically set buckettop_ using direct multiplication of width_ * hops_.
Done occasionally (i.e., every hundred thousand or more bucket hops), this
will pull buckettop_ back to its true value, and keep the underflow of the
repeated FP additions from growing without bound.

-Brad, [email protected]