Theseus FAQ

 

It seems like Theseus is slow. Am I doing something wrong?

Theseus is normally very fast. When Theseus is not running as fast as you might expect, check the following:
  • Is your plan I/O-bound? The purpose of the Theseus architecture is to extract as much parallelism as possible from I/O-bound situations, especially when operators like Xwrapper are retrieving lots of data from multiple remote sources. If you're attempting to use Theseus to do something entirely CPU-bound, your plan will almost always run slower than a run-of-the-mill Java program (unless you have more than one processor).
  • Is your /system/logger/level above 0? If so, check the size of your log file. Heavy logging (which is what Theseus does at levels above 0) will slow down performance.
  • Are you using enough threads? The /system/pool/worker_profiles setting allows you to choose the size of your thread pool during execution. The default is 10, but try raising it to 50 and see if that helps.
  • Are your remote sources very slow? Usually, for I/O-bound plans, execution time is proportional to the sum of the latencies of the sources that must be queried in sequence. Check the sources along your plan paths and see if any of them are very slow.
If you have checked all of these and execution still seems slower than expected, file a bug report with the development team.
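
To get a rough feel for the last check in the list, the sketch below adds up source latencies along each path of a plan. It is a back-of-the-envelope model written for this FAQ, not a Theseus tool: the class name, method names, and latency figures are all hypothetical, and it assumes that independent paths overlap completely, so the slowest path bounds the whole plan.

```java
import java.util.List;

// Back-of-the-envelope model, not part of Theseus.
// Sources on one path are queried in sequence, so their latencies add up.
class PlanLatencyEstimate {

    // Sum of the latencies (ms) of sources that must be queried one after another.
    static long pathLatencyMs(List<Long> sequentialLatenciesMs) {
        long total = 0;
        for (long latency : sequentialLatenciesMs) {
            total += latency;
        }
        return total;
    }

    // Assumption: independent paths run fully in parallel,
    // so the plan takes about as long as its slowest path.
    static long planLatencyMs(List<List<Long>> paths) {
        long slowest = 0;
        for (List<Long> path : paths) {
            slowest = Math.max(slowest, pathLatencyMs(path));
        }
        return slowest;
    }

    public static void main(String[] args) {
        // Hypothetical latencies in milliseconds.
        List<List<Long>> paths = List.of(
                List.of(200L, 300L),         // path A: two sources in sequence
                List.of(150L, 1200L, 50L));  // path B: contains one very slow source
        System.out.println("Estimated plan latency (ms): " + planLatencyMs(paths));
    }
}
```

With these made-up numbers, path B dominates (150 + 1200 + 50 = 1400 ms), and almost all of that comes from a single source. That is the kind of source to look for along your own plan paths.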
 

If I use the TheseusBox API, can I call execute() more than once?  

Yes. Once you have created a TheseusBox, you can load a plan (via loadPlan()) and call execute() for that plan as many times as you want. This is the suggested approach if you want to enable your plan to be accessed via a servlet or any other type of server-style program.
 

The manual shows an example of operators like SELECT that take strings as inputs. Can I use streams instead of strings?

Yes. In fact, every operator input is a stream. As the note at the front of the manual says, a string literal is converted to a stream with one tuple and one attribute (the attribute is named "dummy", but this name often does not matter).

Thus, it is possible to have an input file consisting of:

RELATION data: val char
10 
20
RELATION criteria: name_does_not_matter char
val < 15
and a plan:
PLAN p1
{
  INPUT: stream data, stream criteria
  OUTPUT: stream output

  BODY
  {
    select(data, criteria : output)
  }
}
 

How complex can SELECT criteria be?

Complex enough to handle most boolean logical expressions that you would like to write. For example, you could change the criteria in the input file above to:
RELATION data: val char
10
20
RELATION criteria: name_does_not_matter char
(val > 2 or (val > 10 and val < 15)) and val < 18
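
To see which tuples such an expression keeps, it can help to translate the criteria by hand into ordinary code. The sketch below does that in Java for the two data values above. It is only an illustration: it assumes the comparison on val is numeric (the relation declares val as char, so check the SELECT manual page for the actual comparison rules), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: the criteria string from the input file,
// hand-translated into a Java predicate. Assumes numeric comparison on val.
class CriteriaCheck {

    // (val > 2 or (val > 10 and val < 15)) and val < 18
    static boolean matches(int val) {
        return (val > 2 || (val > 10 && val < 15)) && val < 18;
    }

    // Keeps only the values that satisfy the criteria.
    static List<Integer> select(List<Integer> data) {
        List<Integer> output = new ArrayList<>();
        for (int val : data) {
            if (matches(val)) {
                output.add(val);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(select(List.of(10, 20)));
    }
}
```

Under the numeric assumption, 10 passes (it is greater than 2 and less than 18) and 20 is rejected by the final val < 18 test.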
 

What do I do if I want SELECT or JOIN criteria to be dynamically generated? For example, suppose I want to find the MIN of some value in one relation R1 and then use that minimum value as the basis for a selection that I do on another relation R2.  

You have two choices when specifying the SELECT or JOIN criteria. One choice is to hardcode it (i.e., specify it as a literal). The other is to make it a variable. Note from the SELECT manual page that if you do make it a variable, the criteria needs to be the only value in a single-tuple relation. So, what you would do is:
  1. Use the FORMAT operator to construct the condition string dynamically.
  2. Use PROJECT, if necessary, to ensure that the criteria is the only attribute of the relation output by FORMAT.
  3. Feed that variable to SELECT (or JOIN).
For example, if you want to compute the MIN of R1 and then use it to filter R2, you would first compute the MIN before step 1 (above), using an AGGREGATE function, and then use that MIN value to build the SELECT/JOIN condition in FORMAT.
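
The data flow in that example (aggregate, then format, then select) can be sketched in plain Java as follows. This models only the order of the steps; none of the methods are Theseus operators or part of any Theseus API. The condition "val > MIN", the sample values, and the toy criteria parser are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java model of the data flow only; none of these methods are Theseus operators.
class DynamicCriteriaSketch {

    // Stands in for the aggregate step: MIN over R1.
    static int minOf(List<Integer> r1) {
        return Collections.min(r1);
    }

    // Stands in for FORMAT: builds the criteria text around the computed value.
    static String formatCriteria(int minValue) {
        return "val > " + minValue;
    }

    // Stands in for SELECT with variable criteria.
    // Toy parser: only understands "val > N" and "val < N".
    static List<Integer> select(List<Integer> r2, String criteria) {
        String[] parts = criteria.trim().split("\\s+");
        if (parts.length != 3 || !parts[0].equals("val")) {
            throw new IllegalArgumentException("unsupported criteria: " + criteria);
        }
        int bound = Integer.parseInt(parts[2]);
        List<Integer> output = new ArrayList<>();
        for (int val : r2) {
            boolean keep;
            if (parts[1].equals(">")) {
                keep = val > bound;
            } else if (parts[1].equals("<")) {
                keep = val < bound;
            } else {
                throw new IllegalArgumentException("unsupported operator: " + parts[1]);
            }
            if (keep) {
                output.add(val);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Integer> r1 = List.of(7, 3, 9);    // hypothetical R1 values
        List<Integer> r2 = List.of(1, 3, 5, 8); // hypothetical R2 values
        String criteria = formatCriteria(minOf(r1));
        System.out.println(criteria + " -> " + select(r2, criteria));
    }
}
```

The point to notice is that the criteria string does not exist until the aggregate has run, which is why it has to reach SELECT as a variable rather than as a literal.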
 

How do we express IF/THEN/ELSE in Theseus (or, how do we express a termination condition for a recursive plan)?

Recursion in dataflow is the preferred method of looping because it requires fewer operators and less synchronization.

To express IF/THEN/ELSE in your plans, you generally need to do the following:

  • Apply some filtering condition to a stream (e.g., "city = Chicago")
  • Test if the stream is NULL or not. This allows you to route dataflow conditionally.
To use this in your plans, consider:
  • Using the SELECT operator to filter on the condition you want to test (e.g., the average rating of a set of restaurants).
  • Using the NULL operator to test whether the stream is NULL or not. Make sure that you understand the semantics of the NULL operator. It has 3 inputs (stream to be tested, stream to route when true, stream to route when false) and 2 outputs (new name for "true" stream, new name for "false" stream).
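
The filter-then-test pattern can be modeled in plain Java as shown below. This is a sketch of the routing idea, not the NULL operator itself. It assumes that an empty stream counts as NULL, that "true" means the tested stream is NULL, and that only one of the two outputs carries data; confirm each of these against the NULL manual page. All class and method names are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java model of conditional routing; not the Theseus NULL operator.
class NullRoutingSketch {

    // The two outputs of the test: one stream per branch.
    static final class Routed {
        final List<String> whenTrue;
        final List<String> whenFalse;

        Routed(List<String> whenTrue, List<String> whenFalse) {
            this.whenTrue = whenTrue;
            this.whenFalse = whenFalse;
        }
    }

    // Step 1: filter the stream, like a SELECT with the criteria "city = Chicago".
    static List<String> selectCity(List<String> cities, String wanted) {
        return cities.stream().filter(wanted::equals).collect(Collectors.toList());
    }

    // Step 2: three inputs, two outputs.
    // Assumption: an empty tested stream is NULL, which routes the "true" input;
    // otherwise the "false" input is routed. The other output stays empty.
    static Routed routeOnNull(List<String> tested, List<String> ifTrue, List<String> ifFalse) {
        if (tested.isEmpty()) {
            return new Routed(ifTrue, List.of());
        }
        return new Routed(List.of(), ifFalse);
    }

    public static void main(String[] args) {
        List<String> filtered = selectCity(List.of("Boston", "Denver"), "Chicago");
        Routed routed = routeOnNull(filtered, List.of("no match"), List.of("match"));
        System.out.println("true branch: " + routed.whenTrue);
        System.out.println("false branch: " + routed.whenFalse);
    }
}
```

For a recursive plan, the same idea gives you the termination condition: one branch feeds the recursive call and the other carries the final result.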