Document Clustering in Reduced Dimension Vector Space

Kristina Lerman
USC Information Sciences Institute
4676 Admiralty Way
Marina del Rey, CA 90292
January 1999


Abstract
Document clustering is a popular tool for automatically organizing a large collection of texts. Clustering
algorithms are usually applied to documents represented as vectors in a high dimensional term space. We
investigate the use of Latent Semantic Analysis (LSA) to create a new vector space that is the optimal
representation of the document collection. Documents are projected onto a small subspace of this vector
space and clustered. We compare the performance of clustering algorithms when applied to documents
represented in the full term space and in a reduced-dimension subspace of the LSA-generated vector space.
We report significant improvements in cluster quality for LSA subspaces with optimal dimensionality. We
discuss the procedure for determining the right number of dimensions for the subspace. Moreover, when
this number is small, the total running time of clustering in the subspace is comparable to that of
clustering in the full term space.
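
The pipeline described above (build the LSA space, project documents onto a small subspace, cluster
the projections) can be illustrated with a minimal sketch. Everything specific in it is an assumption
made for illustration and is not taken from the paper: the toy term-by-document matrix, the choice
k = 2, the use of k-means with a farthest-first initialization as the clustering algorithm, and the
function names lsa_project, farthest_first_centers and kmeans. LSA is computed here with a plain
singular value decomposition from NumPy.

```python
import numpy as np


def lsa_project(term_doc, k):
    """Return one row per document: its coordinates in the top-k LSA subspace."""
    # Thin SVD of the term-by-document matrix: term_doc = U @ diag(s) @ Vt.
    _, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the k largest singular values; document j maps to s[:k] * vt[:k, j].
    return (s[:k, None] * vt[:k]).T


def farthest_first_centers(points, n_clusters):
    """Deterministic initialization: start at points[0], then repeatedly
    add the point farthest from all centers chosen so far."""
    centers = [points[0]]
    min_dist = np.linalg.norm(points - points[0], axis=1)
    while len(centers) < n_clusters:
        idx = int(np.argmax(min_dist))
        centers.append(points[idx])
        min_dist = np.minimum(min_dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(centers)


def kmeans(points, n_clusters, n_iter=100):
    """Plain k-means (an assumed stand-in for the clustering algorithm)."""
    centers = farthest_first_centers(points, n_clusters)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Distance from every point to every center, then nearest-center labels.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(n_clusters):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its previous center
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Hypothetical toy data: rows are terms, columns are documents.
# Documents 0-2 mostly use the first three terms, documents 3-5 the last three.
term_doc = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0, 1],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 1, 1, 1, 2],
], dtype=float)

points = lsa_project(term_doc, k=2)   # 6 documents, 2 LSA dimensions each
labels = kmeans(points, n_clusters=2)
print(labels)
```

In this sketch the clustering step works on 2 coordinates per document instead of 6 term weights. On
a real collection the term space has thousands of dimensions, so the saving per distance computation
is much larger, while the decomposition itself adds a one-time cost.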


