Publications

Using representation learning and web text to identify competitor networks

Abstract

Understanding the competitive landscape of public and private companies is essential for a range of activities. Prior work has often characterized competition using inflexible industry classifications or relied on proprietary data for public companies. The almost total lack of coverage of private companies has also been a severe limitation. This paper addresses these limitations by using a vast and untapped resource for understanding the competitor network for both public and private companies: Web text. We use representation learning techniques to generate robust representations (word embeddings) of companies in a high-dimensional vector space and to accurately identify the competitor network of peers for a focal company. We evaluate the competitor network against multiple downstream applications: (1) predicting profitability; (2) validating the self-identified competitors of a focal company; (3) identifying In-Sector (IS) peers that occur in the same industry sector as the focal company; and (4) determining industry classification codes (NAICS) for a focal company. Our proposed approaches match or improve on the performance of prior baselines that relied on curated corpora, despite the use of noisy Web text. In particular, embedding-based models outperform prior baseline models in identifying IS peers, in identifying peers that are in the same NAICS sector as the focal company, and in predicting the NAICS code for private companies. A use case identifies scenarios where companies can gain the most benefit from high-quality Web text about their products and services.

Date
September 22, 2025
Authors
Gerard Hoberg, Craig Knoblock, Gordon Phillips, Jay Pujara, Zhiqiang Qiu, Louiqa Raschid
Publisher
Working Paper