Intelligent Company Name Unification Using Semantic Similarity

Overview
In email processing systems, a common challenge is handling variations of company names. The same company may appear under slightly different names across emails, for example "Microsoft Corp", "Microsoft Corporation", or "Microsoft Inc". This post describes how we implemented a system that unifies these names using the semantic similarity of their embeddings.
Implementation
Core Features
- OpenAI's text-embedding-3-large model for semantic analysis
- Batch processing: names are treated as the same company when their embedding similarity is ≥ 0.7
- MongoDB-based permanent storage system with local TTLCache for performance
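The matching step can be sketched as follows. This is a minimal illustration rather than the production code: it assumes the similarity measure is cosine similarity (the post only says "semantic similarity"), uses small hand-written vectors in place of real text-embedding-3-large output, and picks the first-seen name in each group as the canonical one. The names `cosine_similarity` and `unify_names` are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence

SIMILARITY_THRESHOLD = 0.7  # threshold stated in the post


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def unify_names(
    embeddings: Dict[str, Sequence[float]],
    threshold: float = SIMILARITY_THRESHOLD,
) -> Dict[str, str]:
    """Map every name to a canonical name.

    A name joins the most similar existing group if the similarity to that
    group's canonical name reaches the threshold; otherwise it starts a
    new group and becomes its canonical name.
    """
    canonical: List[str] = []
    mapping: Dict[str, str] = {}
    for name, vector in embeddings.items():
        best_name: Optional[str] = None
        best_score = threshold
        for candidate in canonical:
            score = cosine_similarity(vector, embeddings[candidate])
            if score >= best_score:
                best_name, best_score = candidate, score
        if best_name is None:
            canonical.append(name)
            mapping[name] = name
        else:
            mapping[name] = best_name
    return mapping


# Toy 3-dimensional vectors standing in for real embeddings.
toy_embeddings = {
    "Microsoft Corp": [0.90, 0.10, 0.00],
    "Microsoft Corporation": [0.88, 0.15, 0.02],
    "Apple Inc": [0.10, 0.90, 0.20],
}
unified = unify_names(toy_embeddings)
```

With these toy vectors, the two Microsoft variants land in one group and "Apple Inc" stays on its own. Real embeddings of short company names tend to sit closer together than these hand-picked vectors, so the 0.7 threshold would need to be validated against actual data.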
Performance Improvement
We implemented a two-tier cache for company embeddings, with MongoDB as permanent storage and a local TTLCache in front of it. Requests served from the cache skip the embedding API call entirely, which yields a roughly 95x speedup.
- Without cache: ~1.8 s per API call
- With cache: ~0.019 s per request
- Speedup: roughly 95x (1.8 s / 0.019 s)
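The lookup order of such a two-tier cache might look like the sketch below. It is written with the standard library only so that it runs anywhere: `SimpleTTLCache` is a hypothetical stand-in for the local TTLCache, a plain dict stands in for the MongoDB collection, and `embed_fn` is a placeholder for the call to the embedding API. None of these names come from the original system.

```python
import time
from typing import Callable, Dict, List, MutableMapping, Optional, Tuple

Vector = List[float]


class SimpleTTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._items: Dict[str, Tuple[float, Vector]] = {}

    def get(self, key: str) -> Optional[Vector]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._items[key]  # expired: drop and report a miss
            return None
        return value

    def set(self, key: str, value: Vector) -> None:
        self._items[key] = (time.monotonic() + self._ttl, value)


class EmbeddingCache:
    """Local TTL cache first, then the permanent store, then the API."""

    def __init__(
        self,
        embed_fn: Callable[[str], Vector],
        persistent_store: MutableMapping[str, Vector],
        ttl_seconds: float = 3600.0,  # assumed value, not from the post
    ) -> None:
        self._embed_fn = embed_fn
        self._store = persistent_store  # stand-in for a MongoDB collection
        self._local = SimpleTTLCache(ttl_seconds)
        self.api_calls = 0  # counts how often the slow path is taken

    def get_embedding(self, name: str) -> Vector:
        vector = self._local.get(name)
        if vector is not None:
            return vector  # tier 1 hit: no I/O at all
        vector = self._store.get(name)
        if vector is None:
            vector = self._embed_fn(name)  # slow path: embedding API call
            self.api_calls += 1
            self._store[name] = vector  # persist for future processes
        self._local.set(name, vector)  # warm the local tier
        return vector
```

In this arrangement the permanent store survives restarts, so a fresh process only pays for a database read, while the local tier absorbs repeated lookups of the same name within the TTL window.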
Future Improvements
Advanced Matching
- Company subsidiaries recognition
- Multilingual support
Analytics
- Industry classification
- Company relationship tracking
This solution significantly improves data consistency and processing efficiency in our email analysis system.