Abstract
Document clustering is a popular tool for automatically organizing
a large collection of texts. Clustering
algorithms are usually applied to documents represented as vectors
in a high dimensional term space. We
investigate the use of Latent Semantic Analysis (LSA) to create a new
vector space that is the optimal
representation of the document collection. Documents are projected
onto a small subspace of this vector
space and clustered. We compare the performance of clustering algorithms
when applied to documents
represented in the full term space and in a reduced-dimension subspace
of the LSA-generated vector space.
We report significant improvements in cluster quality for LSA subspaces
with optimal dimensionality. We
discuss the procedure for determining the right number of dimensions
for the subspace. Moreover, when
this number is small, the total running time of the clustering algorithm
is comparable to that of clustering in
the full term space.
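
The pipeline summarized above (build an LSA space, project the documents onto a small subspace, then cluster) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy term-document matrix `term_doc`, the helper names `lsa_project`, `farthest_first_centers`, and `kmeans`, the choice of k-means as the clustering algorithm, the deterministic farthest-first initialization, and the subspace dimension `k=2`. LSA is computed here as a truncated singular value decomposition with NumPy.

```python
import numpy as np

def lsa_project(term_doc, k):
    """Return document coordinates in the top-k LSA subspace.

    term_doc has one row per term and one column per document.
    """
    # Thin SVD: term_doc = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Each document becomes a k-dimensional row vector (singular value times
    # the matching right singular vector entry).
    return (s[:k, None] * Vt[:k]).T

def farthest_first_centers(points, n_clusters):
    """Deterministic initialization (an illustrative choice): start from the
    first point, then repeatedly add the point farthest from all chosen centers."""
    centers = [points[0]]
    for _ in range(n_clusters - 1):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    return np.array(centers)

def kmeans(points, n_clusters, n_iter=100):
    """Plain Lloyd-style k-means; returns one cluster label per point."""
    centers = farthest_first_centers(points, n_clusters)
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, n_clusters)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its members (keep it if empty)
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Hypothetical toy collection: 6 terms (rows) x 6 documents (columns).
# Documents 0-2 use only terms 0-2; documents 3-5 use only terms 3-5.
term_doc = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 3, 1, 1],
    [0, 0, 0, 1, 3, 1],
    [0, 0, 0, 1, 1, 3],
], dtype=float)

docs_lsa = lsa_project(term_doc, k=2)   # documents in the 2-dimensional LSA subspace
labels = kmeans(docs_lsa, n_clusters=2)
print(labels)
```

The same `kmeans` function can be run on `term_doc.T` to cluster in the full term space, which gives a small-scale version of the comparison the abstract describes. On real collections the subspace dimension would need to be chosen with care, as the abstract notes.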