Finding Unity
Why Your AI Search Might Not Understand What You're Looking For
Imagine typing a straightforward query into your AI-powered business intelligence platform: "What JDs have unity in their requirements?" You expect to see results like the Mobile Games Developer role, which explicitly calls for "Experience with Unity or other game engines." Instead, the system replies: "None of your job descriptions mention unity as a requirement." Frustrating, right? You (well, I specifically) have just encountered one of the subtle pitfalls of modern semantic search: a technology that's brilliant at grasping nuance but can falter spectacularly on ambiguous terms.
In this article (or moan), I'll unpack why this happens, how semantic search really works under the hood and, most importantly, practical ways to fix it. We'll build on real-world examples like "Unity" to explore solutions, including advanced techniques like query rewriting and domain-specific embeddings, to make your platform more robust to cold, contextless prompts.
The Word That Means Everything (and Nothing)
"Unity" is a classic case of polysemy, a word with multiple, often unrelated meanings. Here's a quick rundown of its many faces:
When you query "Unity" in a job search context, a human might infer the game engine based on surrounding clues. But a general-purpose AI system? On a cold start, lacking any history and without explicit guidance, it has to guess, and it often leans toward the most common or generic interpretation, like "team unity" in HR documents.
This isn't a bug; it's a feature of how AI processes language.
How Semantic Search Actually Works (as far as I can figure)
Unlike old-school keyword searches that simply scan for exact matches, semantic search uses embeddings: mathematical vectors that capture the "meaning" of text. Tools like OpenAI's embeddings or Sentence-BERT convert your query and documents into high-dimensional points in vector space. Similar meanings cluster together, enabling fuzzy matches: a query about "game development experience" can surface a JD that never uses those exact words.
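To make that concrete, here's a minimal sketch of the retrieval step, assuming the open-source sentence-transformers library and its public all-MiniLM-L6-v2 model (illustrative choices, not necessarily what your platform runs):

```python
# Minimal embedding-based retrieval sketch (sentence-transformers assumed).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy corpus: two chunks pulled from different job descriptions.
chunks = [
    "Experience with Unity or other game engines.",
    "We value team unity and cross-functional collaboration.",
]
query = "What JDs have unity in their requirements?"

# Encode query and chunks into dense vectors, then rank by cosine similarity.
query_vec = model.encode(query, convert_to_tensor=True)
chunk_vecs = model.encode(chunks, convert_to_tensor=True)
scores = util.cos_sim(query_vec, chunk_vecs)[0]

for chunk, score in sorted(zip(chunks, scores.tolist()), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {chunk}")
```

Whether the game-engine chunk or the team-harmony chunk wins here depends entirely on where the model has learned to place "unity" in that vector space, which is exactly where the trouble starts.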
This is revolutionary for natural language queries. But ambiguity strikes when a word like "Unity" pulls in multiple directions. The embedding for "Unity" might average out across its senses, drifting toward dominant usages (e.g., philosophical unity in general text corpora) rather than niche ones (e.g., the game engine in tech docs).
The Cold Start Problem: No Context, No Clarity
Humans rely on context to disambiguate. In the JD example, the surrounding phrase "or other game engines" and the job title, Mobile Games Developer, make the intended sense obvious.
But semantic systems often index documents in chunks, splitting long JDs into smaller fragments for efficiency. A chunk with just "Experience with Unity is a plus" loses those clues. Worse, your query is "cold": no conversation history, user profile, or hints. The AI is left interpreting in a vacuum, potentially defaulting to business jargon like "unity in requirements" meaning alignment or process unification.
Add in model biases (general-purpose embeddings are trained on vast internet data, where "unity" as harmony dominates) and you've got a recipe for missed hits.
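One common mitigation, revisited in the solutions section below, is to re-attach document-level context to each chunk before it is embedded, so an isolated sentence still carries its disambiguating clues. A minimal sketch, with illustrative field names rather than any particular platform's schema:

```python
# Sketch: prepend the parent JD's title and section to each chunk before
# embedding, so "Unity" keeps its game-development context in the index.
def contextualize_chunk(jd_title: str, section: str, text: str) -> str:
    """Prefix a chunk with document-level context prior to embedding."""
    return f"Job title: {jd_title}\nSection: {section}\n{text}"

enriched = contextualize_chunk(
    jd_title="Mobile Games Developer",
    section="Requirements",
    text="Experience with Unity is a plus.",
)
# The enriched string, not the bare sentence, is what gets embedded and indexed.
print(enriched)
```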
The Irony of Intelligence: Smarter Systems, Bigger Blind Spots
The more "intelligent" the search, the more it infers intent, which can backfire. A dumb keyword tool would flag "Unity" instantly (case-sensitive or not). But semantic search asks: "What does this really mean?" In a job platform, it might pivot to cultural fit or team dynamics, ignoring the tech angle. This is especially true for proper nouns or domain-specific terms, which embeddings handle poorly without tuning.
Real-World Implications and Examples
This isn't just theoretical. In platforms like LinkedIn or custom HR AI tools, similar issues arise with terms like "Python" (programming language vs. snake), "Java" (code vs. island/coffee), or "Swift" (Apple's language vs. fast). Users building AI search for resumes, patents, or codebases face the same: a query for "React" might return chemistry docs on reactions, not the JavaScript library.
As of now, with advancements in multimodal AI, these problems persist but are easing through specialised models. For instance, tech-focused embeddings from Hugging Face now better isolate senses in code-heavy datasets.
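If you go down that route, swapping the embedding model is often a one-line change. In the sketch below, "your-org/tech-jd-embeddings" is a hypothetical placeholder id, not a real checkpoint; substitute whichever domain-tuned model you actually evaluate:

```python
# Sketch: swap the general-purpose embedder for a domain-tuned one.
# The model id below is a placeholder, not a real Hugging Face checkpoint.
from sentence_transformers import SentenceTransformer

domain_model = SentenceTransformer("your-org/tech-jd-embeddings")  # hypothetical id
vector = domain_model.encode("Experience with Unity or other game engines.")
print(vector.shape)
```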
Solutions: From Quick Fixes to Advanced Overhauls
Fixing this doesn't require scrapping your system. Here's a layered approach, starting simple and scaling up:
- Hybrid search: keep a plain keyword pass alongside the vector index so literal terms like "Unity" are always flagged, then blend the scores (as sketched earlier).
- Context-aware chunking: prepend the JD title and section heading to each chunk before embedding, so fragments keep their disambiguating clues (also sketched earlier).
- Query rewriting: have an LLM, or even a simple rule layer, expand the cold query into an explicit, disambiguated one before it hits the index (see the sketch below).
- Domain-specific embeddings: swap the general-purpose model for one tuned on technical or job-description text, or fine-tune your own.
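As a sketch of that query-rewriting layer, the snippet below asks an LLM to expand the cold query before it reaches the vector index. It assumes the OpenAI Python SDK and the gpt-4o-mini model purely for illustration; the prompt and the example rewrite are my own guesses, not output from any real run:

```python
# Sketch: LLM-based query rewriting to disambiguate a cold search query.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rewrite_query(raw_query: str) -> str:
    """Ask an LLM to expand an ambiguous query with its likely technical sense."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You rewrite search queries for a job-description index. "
                    "Expand ambiguous terms (e.g. Unity, Swift, React) with "
                    "their likely technical senses."
                ),
            },
            {"role": "user", "content": raw_query},
        ],
    )
    return response.choices[0].message.content

print(rewrite_query("What JDs have unity in their requirements?"))
# A plausible rewrite: "job descriptions requiring the Unity game engine
# (Unity3D) or similar game development tools"
```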
So what did I do? Raised a bug and decided to worry about it later :)