Getting Started

Contents

  1. Overview of Theseus
  2. Using Theseus
  3. Configuration
  4. Running the examples

1. Overview of Theseus

Theseus is a plan language and execution system for high-performance information gathering, processing, and monitoring. At its core is a streaming dataflow architecture that allows many plan operations to execute in parallel; it is a particularly effective way to combat what are otherwise significant network latencies.

Theseus is a unique combination of network query engine and general agent execution system. Existing network query engines (like Niagara, Telegraph, and Tukwila) are good at quickly processing network data, but lack the expressivity required for more complicated information gathering tasks. Existing agent execution systems support concurrent execution and are more general and more expressive, but lack the ability to efficiently route information between plan operators. Sharing much in common with both, Theseus allows expressive information gathering plans to be executed at network query engine speeds - and beyond!

Expressive plan language

Theseus provides a rich, expressive plan language for information gathering. In contrast to the non-accessible intermediate query plans constructed by network query engines, Theseus plans:

Dataflow plan execution

The Theseus plan executor obeys a dataflow model of execution. This means that operators in the plan can be executed as soon as any of their inputs become available. This automatically maximizes the degree of horizontal parallelism at runtime. In contrast, plans executed under the more common von Neumann model of execution rely on a separate instruction counter to determine which instruction should be executed next. Thus, they are inherently serial.
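The idea of firing operators when their inputs are ready can be illustrated with a small Python model. This is only a sketch, not Theseus code: the function dataflow_waves and the operator names are hypothetical, and for simplicity an operator here waits until all of its input variables exist, which ignores streaming. Operators that land in the same wave have no dependency on each other and could run in parallel.

```python
def dataflow_waves(operators, initial):
    """Group operators into waves of mutually independent work.

    operators: dict mapping operator name -> (input variables, output variables)
    initial:   variable names that are available before execution starts
    """
    available = set(initial)
    pending = dict(operators)
    waves = []
    while pending:
        # An operator is ready once every variable it reads is available.
        ready = sorted(
            name for name, (inputs, _) in pending.items()
            if all(var in available for var in inputs)
        )
        if not ready:
            raise ValueError("no runnable operator: missing input or cycle")
        waves.append(ready)
        for name in ready:
            _, outputs = pending.pop(name)
            available.update(outputs)
    return waves


# Hypothetical plan: two independent selections feeding one union.
plan = {
    "select_grad": (("grad-students",), ("cs-grads",)),
    "select_undergrad": (("undergrad-students",), ("cs-undergrads",)),
    "union": (("cs-grads", "cs-undergrads"), ("cs-students",)),
}
waves = dataflow_waves(plan, ["grad-students", "undergrad-students"])
print(waves)  # [['select_grad', 'select_undergrad'], ['union']]
```

No instruction counter appears anywhere in the model: execution order falls out of data availability alone.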

Data streaming

In addition to dataflow-style execution, Theseus also supports the streaming (or pipelining) of data between operators. The advantage of this is that it allows data being processed by one operator to be communicated to and processed by a downstream operator as soon as possible. Furthermore, there is no requirement for a producing operator to wait for a consumer to be available - communication occurs asynchronously via queues. This increases the degree of vertical parallelism at runtime. In contrast, non-streaming architectures force each operator to process all of its inputs before transmitting any output to downstream operators. Under this scheme, communication between operators along a common flow is typically synchronous.
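Asynchronous, queue-based streaming between operators can be sketched with Python threads. This is an illustration of the general technique under assumed names (DONE, producer, select_cs, run_pipeline are all hypothetical), not a description of how the Theseus executor is implemented. Each row is passed downstream as soon as it is produced, and the producer never blocks waiting for its consumer because the queues are unbounded.

```python
import queue
import threading

DONE = object()  # end-of-stream marker (a convention assumed for this sketch)


def producer(rows, out_q):
    """Emit rows one at a time without waiting for the consumer."""
    for row in rows:
        out_q.put(row)
    out_q.put(DONE)


def select_cs(in_q, out_q):
    """Forward each Computer Science row downstream as soon as it arrives."""
    while True:
        row = in_q.get()
        if row is DONE:
            out_q.put(DONE)
            return
        if row[1] == "Computer Science":
            out_q.put(row)


def run_pipeline(rows):
    q1, q2 = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=producer, args=(rows, q1)),
        threading.Thread(target=select_cs, args=(q1, q2)),
    ]
    for t in threads:
        t.start()
    results = []
    while True:
        row = q2.get()
        if row is DONE:
            break
        results.append(row)
    for t in threads:
        t.join()
    return results


streamed = run_pipeline([
    ("Smith", "History"),
    ("Thomas", "Computer Science"),
    ("Hill", "Computer Science"),
])
print(streamed)
```

A non-streaming version would instead build the complete list in the producer and hand it over in one step, so the selection could not start until the producer had finished.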

More details

To read more about streaming dataflow, see the papers on our Theseus website.

2. Using Theseus

To use Theseus, you write a plan, provide an input file, and request execution of that plan with the input file. To request execution, you can use any Theseus client or call Theseus from its Java API.

To use the interactive command line client, trcli (Theseus relational client), you would specify:

trcli [plan name, without the .plan extension] [data file]
For example, suppose you have written a plan called get-cs-students.plan that combines a set of graduate students with a set of undergraduate students and then outputs those who are in the Computer Science department.

Suppose the plan looks like this:

PLAN get-cs-students
{
  INPUT: stream grad-students, undergrad-students
  OUTPUT: stream cs-students

  BODY
  {
    union (grad-students, undergrad-students : all-students)
    select (all-students, "department = 'Computer Science'" : cs-students)
  }
}
Next, suppose that we have a data file called get-cs-students.data that looks like this:
RELATION grad-students: name char, department
Smith|History
Thomas|Computer Science
Jones|Cinema
Hill|Computer Science
RELATION undergrad-students: name char, department
Jackson|Math
Adams|Computer Science
Park|Business
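To make the layout of the data file concrete, here is a minimal Python reader for it. The format rules are inferred from the single example above and are assumptions: a line beginning with RELATION names a relation and lists its attributes (an optional type word after the attribute name is ignored), and each following line is one row with fields separated by "|". The function parse_data is hypothetical and is not part of Theseus.

```python
def parse_data(text):
    """Parse the example data layout into {relation: {"attrs": [...], "rows": [...]}}."""
    relations = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("RELATION "):
            name, _, attr_spec = line[len("RELATION "):].partition(":")
            # Keep only the attribute name; drop an optional type such as "char".
            attrs = [a.split()[0] for a in attr_spec.split(",") if a.strip()]
            current = {"attrs": attrs, "rows": []}
            relations[name.strip()] = current
        elif current is not None:
            current["rows"].append(tuple(f.strip() for f in line.split("|")))
        else:
            raise ValueError("row found before any RELATION header: " + line)
    return relations


sample = """RELATION grad-students: name char, department
Smith|History
Thomas|Computer Science
RELATION undergrad-students: name char, department
Adams|Computer Science
"""
relations = parse_data(sample)
print(relations["grad-students"]["attrs"])       # ['name', 'department']
print(relations["undergrad-students"]["rows"])   # [('Adams', 'Computer Science')]
```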
You can use trcli to execute the plan on the data by typing:
trcli get-cs-students get-cs-students.data
which produces
----------------------------------------------
RELATION: g_cs-students
   attrs: name, department
----------------------------------------------
Hill, Computer Science
Thomas, Computer Science
Adams, Computer Science
----------------------------------------------
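The logic of the get-cs-students plan can be mirrored in a few lines of Python to show what the two operators compute. This is a sketch under assumptions: union and select here are ordinary Python functions written for illustration, the union keeps duplicates, and the row order is not meant to match what trcli prints.

```python
def union(*relations):
    """Concatenate relations row by row (duplicates are kept in this sketch)."""
    combined = []
    for relation in relations:
        combined.extend(relation)
    return combined


def select(relation, predicate):
    """Keep only the rows for which the predicate holds."""
    return [row for row in relation if predicate(row)]


grad_students = [
    {"name": "Smith", "department": "History"},
    {"name": "Thomas", "department": "Computer Science"},
    {"name": "Hill", "department": "Computer Science"},
]
undergrad_students = [
    {"name": "Jackson", "department": "Math"},
    {"name": "Adams", "department": "Computer Science"},
    {"name": "Park", "department": "Business"},
]

all_students = union(grad_students, undergrad_students)
cs_students = select(all_students, lambda r: r["department"] == "Computer Science")
for row in cs_students:
    print(row["name"] + ", " + row["department"])
```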
If the basename (the part before the ".") of the plan file and the data file is the same, you can omit the name of the data file and Theseus will attempt to locate it based on the plan name. For example, in the case above, you can also run the plan with the desired data via:
trcli get-cs-students 
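The naming convention for the optional data file argument can be written down as a small Python helper. This only models the rule stated above; how Theseus actually searches for the file is not documented here, and resolve_files is a hypothetical name.

```python
def resolve_files(args):
    """Map trcli-style arguments to (plan file, data file).

    args: [plan name without the .plan extension] or [plan name, data file]
    """
    if not 1 <= len(args) <= 2:
        raise ValueError("expected: <plan name> [data file]")
    plan_name = args[0]
    # With no explicit data file, fall back to the plan's basename plus ".data".
    data_file = args[1] if len(args) == 2 else plan_name + ".data"
    return plan_name + ".plan", data_file


print(resolve_files(["get-cs-students"]))
print(resolve_files(["get-cs-students", "other.data"]))
```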

3. Configuration

To configure your system to run Theseus, you only need to edit one file:
trcli.bat
When editing this file, you will need to set THESEUS_DIR to your installation directory and disable the GOTO :NO_INIT line. If you look at the file trcli.bat (at the top level of the Theseus installation), you will see the following:
REM ------------------------------------------------------------------
REM Make sure that you do both step #1 and #2.
REM ------------------------------------------------------------------

REM
REM (1) Correct and uncomment the line below
REM set THESEUS_DIR=c:\users\barish\research\theseus\system\src\3.5.1
REM

REM
REM (2) Then, delete (or comment out) the next line below
GOTO :NO_INIT
You need to change this to something like
REM ------------------------------------------------------------------
REM Make sure that you do both step #1 and #2.
REM ------------------------------------------------------------------

REM
REM (1) Correct and uncomment the line below
set THESEUS_DIR=c:\theseus\theseus351

REM
REM (2) Then, delete (or comment out) the next line below
REM GOTO :NO_INIT
Now you should be able to use trcli.bat from any DOS (COMMAND.COM) window.

To ensure that the client is ready, open a DOS window and type:

trcli
This should generate the following usage message:
::: trcli - Theseus 'relational' client
USAGE: java theseus.tools.tcli.TRCli <.plan file> 
Successful return of this message means that you're all set to run the client.

One final task is to edit the etc/Theseus.properties file to make sure any paths are correct. For example, the line:

operators/quip/path = c:\\theseus\\theseus310\\bin\\quip.exe
should be modified (on the right side) to correctly reflect the path where quip.exe can be found. This is important if you want to be able to use XQuery in any of the plans that you will be writing.
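A quick way to check that path before running any plans is sketched below in Python. It assumes the file follows the usual Java .properties conventions (key = value lines, "#" or "!" comments, doubled backslashes standing for a single backslash), which the example line suggests but this guide does not state. The function parse_properties is hypothetical and handles only this simple subset of the format.

```python
import os


def parse_properties(text):
    """Parse simple 'key = value' lines; doubled backslashes become single ones."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key = value line
        props[key.strip()] = value.strip().replace("\\\\", "\\")
    return props


sample = r"operators/quip/path = c:\\theseus\\theseus310\\bin\\quip.exe"
props = parse_properties(sample)
quip = props["operators/quip/path"]
print(quip, "found" if os.path.isfile(quip) else "NOT found")
```

To check a real installation, read the contents of etc/Theseus.properties into the text argument instead of using the sample string.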

4. Running the examples

NOTE: It is assumed that you are running Theseus on a Windows 2000 machine with Java 1.4 (JRE) installed. This distribution may work out of the box with other versions of Windows and other versions of Java, but this has not been tested. In principle, Theseus should run on any platform to which a Java 1.4 virtual machine (JVM) has been ported, although you will need to adapt the client startup scripts.

To run the examples in this release, first make sure you have completed the editing task in step 3 above. Next, from the top level of your Theseus installation (let's call this $TOP), type:

cd $TOP/plans/examples

To run an example plan:

cd basic
..\..\trcli gen1 gen1.data
You should be able to do this with the other plans in this directory as well. Notice that each plan has a data file with the same name. Also notice that there is a subplan, sub1.plan, that makes a call to sub1a.plan. Take a look at the file sub1.plan to see this. Finally, notice that the plans func1.plan and func2.plan rely on user-defined functions found in Functions.java, which is located in the same directory. You should be able to compile Functions.java without any linking to, or knowledge of, Theseus.