Rui Tao's Portfolio

Intelligent Company Name Unification Using Semantic Similarity


Overview

In email processing systems, a common challenge is handling variations of company names. The same company can appear under slightly different names across emails, for example "Microsoft Corp", "Microsoft Corporation", or "Microsoft Inc". This post describes how we built a system that unifies these company names using semantic similarity.

Implementation

Core Features

  • OpenAI's text-embedding-3-large model for semantic analysis
  • Batch processing with similarity threshold ≥ 0.7
  • MongoDB-based permanent storage system with local TTLCache for performance
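
The matching step can be sketched as follows. This is a minimal illustration, not the production code: it assumes cosine similarity as the similarity measure and a greedy "first name seen becomes canonical" rule, neither of which is stated above. The names `unify_names`, `cosine_similarity`, and the `embed` callable are hypothetical; in a real system `embed` would wrap a call to the embedding model.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def unify_names(
    names: Sequence[str],
    embed: Callable[[str], Vector],  # hypothetical: returns the embedding of a name
    threshold: float = 0.7,          # threshold from the feature list above
) -> Dict[str, str]:
    """Map each name in the batch to a canonical name.

    Assumption: a name joins the most similar existing canonical name if the
    similarity is >= threshold; otherwise it becomes a new canonical name.
    """
    canonical: List[Tuple[str, Vector]] = []
    mapping: Dict[str, str] = {}
    for name in names:
        vec = embed(name)
        best_name, best_score = None, threshold
        for canon_name, canon_vec in canonical:
            score = cosine_similarity(vec, canon_vec)
            if score >= best_score:
                best_name, best_score = canon_name, score
        if best_name is None:
            canonical.append((name, vec))
            best_name = name
        mapping[name] = best_name
    return mapping
```

Because the result depends on the order in which names arrive, a production version would likely need a stable rule for choosing the canonical form (for example the most frequent variant); that choice is outside the scope of this sketch.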

Performance Improvement

We implemented a two-tier caching system for company embeddings, with MongoDB as permanent storage and a local TTLCache in front of it. Serving embeddings from the cache instead of calling the API made each request roughly 95x faster.

  • Without cache: ~1.8s per API call
  • With cache: ~0.019s per request
  • Speedup: 1.8s / 0.019s ≈ 95x
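
The lookup order of the two-tier cache can be sketched like this. To keep the example self-contained, a small `SimpleTTLCache` class stands in for the local TTLCache and a plain dict stands in for the MongoDB collection; both stand-ins, the `EmbeddingStore` class, and the `fetch` callable are hypothetical names for illustration, not the actual implementation.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

Embedding = List[float]


class SimpleTTLCache:
    """Minimal in-memory cache with per-entry expiry (stand-in for a TTLCache)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: Dict[str, Tuple[float, Embedding]] = {}

    def get(self, key: str) -> Optional[Embedding]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # entry expired: drop it and report a miss
            return None
        return value

    def set(self, key: str, value: Embedding) -> None:
        self._items[key] = (self._clock() + self._ttl, value)


class EmbeddingStore:
    """Tier 1: local TTL cache. Tier 2: permanent store. Fallback: API call."""

    def __init__(
        self,
        local: SimpleTTLCache,
        permanent: Dict[str, Embedding],      # stand-in for the MongoDB collection
        fetch: Callable[[str], Embedding],    # hypothetical: calls the embedding API
    ):
        self._local = local
        self._permanent = permanent
        self._fetch = fetch

    def get(self, name: str) -> Embedding:
        vec = self._local.get(name)
        if vec is not None:
            return vec                         # fastest path: in-process hit
        vec = self._permanent.get(name)
        if vec is None:
            vec = self._fetch(name)            # slow path: one API call
            self._permanent[name] = vec        # persist so it is never fetched again
        self._local.set(name, vec)             # warm the local tier
        return vec
```

With this layout the embedding API is called at most once per distinct name, and the local tier only needs to hold recently used names, since anything that expires can be reloaded from permanent storage.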

Future Improvements

  1. Advanced Matching

    • Subsidiary recognition
    • Multilingual support
  2. Analytics

    • Industry classification
    • Company relationship tracking

This solution significantly improves data consistency and processing efficiency in our email analysis system.