Preface
Many researchers of discourse agree that coherent texts have
internal structure and that this structure is conveniently characterized
by discourse/rhetorical relations, i.e., relations that reflect semantic
and functional judgments about the text spans they connect. Yet, despite
significant progress in understanding the linguistic phenomena above the
sentence boundary, the discourse parsing of free, unrestricted text remains
an elusive goal. To date, most researchers have assumed that in order to
derive the discourse structure of texts, one needs full semantics. In this
book, I explore an alternative approach to discourse processing that need
not be grounded in a full semantic account of sentence processing.
Instead of focusing on the semantics of discourse relations
and on the relationship between the semantics of discourse and that of
the individual sentences and clauses, I provide a completely specified
axiomatization of the most widely accepted mathematical properties of discourse
structures, which are amenable to straightforward formalization. The axiomatization
is strong enough to significantly reduce the space of discourse interpretations.
It is also strong enough to enable one to derive well-formed discourse
structures for unrestricted texts with surprisingly good results, although
the rhetorical relations that hold between textual units and spans cannot
themselves be determined unambiguously.
One can derive the discourse structure of texts despite their
inherent rhetorical ambiguity because
the axiomatization proposed here enables an explicit enumeration of all
valid interpretations. In the same way a syntactic theory enables all valid
syntactic interpretations of a sentence to be derived, the axiomatization
proposed in this book enables all valid discourse interpretations of a
text to be derived. But in the same way a syntactic theory may produce
interpretations that are incorrect from a semantic perspective, this axiomatization
may produce interpretations that are incorrect when additional discourse-specific
phenomena, such as focus, cohesion, and intentions, are factored in.
Since the formalism and algorithms described in this book
can be applied to any text, the strengths and weaknesses of the approach
and the generality of the principles it is based on can be immediately
and properly evaluated. The evaluations carried out are both intrinsic
and extrinsic:
- For the intrinsic evaluation, I assess how closely the discourse
structures derived automatically for a set of texts match the discourse
structures constructed by humans.
- For the extrinsic evaluation, I estimate the utility of automatically
derived discourse structures for producing summaries of texts. To this end,
I first show by means of a psycholinguistic experiment that discourse structures
can be used effectively to determine which portions of texts humans
perceive as being important. I then use the lessons learned from the experiment
to implement a discourse-based summarization algorithm, which
identifies important clauses and sentences in text at levels of performance
that exceed those of current commercial systems and are close to those
of humans.
Automatically deriving the discourse structure of text is
a difficult problem. This book does not solve it. Importantly, though,
the book shows how one can estimate quantitatively the validity of the
theoretical assumptions that it relies upon and the success of the discourse
parsing and discourse-based summarization algorithms that it proposes.
That is, the book allows one to make not only qualitative statements, such
as "discourse processing is hard", but also quantitative ones, such as
the following:
- "By using cue phrases and cohesion, I can implement a discourse
parser that is 30% below human performance."
- "If I determine the elementary units of discourse correctly
and use knowledge about cue phrases, cohesion, part of speech tags, and
WordNet lexical relations, I can build discourse structures whose hierarchical
scaffold is as good as the scaffold of the structures built by humans."
- "If I use machine learning techniques, I can train a discourse-based
summarizer to identify important units in short scientific articles as
well as humans do. Using the same techniques, I can train a discourse-based
summarizer to identify important units in newspaper articles at levels
of performance that are 10% below the level of humans."
Such quantitative estimates of the effects of the hypotheses
and choices one makes in developing theories and algorithms are crucial
for furthering progress in the field.
Being able to derive the structure of text automatically
can have a significant impact on solving a variety of problems in syntactic
processing, natural language generation, machine translation, summarization,
question answering, and information retrieval. Some of these problems may
be addressed using only the theory and algorithms presented in this book.
Some of them may need more elaborate theories and algorithms. I hope this
book will provide a starting point to those who want to address these problems
and inspire those who believe that automatic discourse processing is feasible.