LLM-powered language detection
We replaced Ada's fastText-based language detector with an LLM tool call inside the agent's reasoning step. The prompt-engineering work amounted to unlearning two defaults inherited from non-LLM systems and lifted tool-call recall from 78.6% to 97.9%.
Background
Most conversational agents need to detect the language of an incoming message. The detection drives which knowledge base to search, which language to reply in, and which downstream rules apply. We had been doing this with a fine-tuned fastText classifier (a classical, statistically-trained language identifier) sitting in front of the agent. It is fast (around 1ms per call) and accurate on clean inputs, but its failure modes are exactly the ones customers notice: emails whose body and signature are in different languages, closely related pairs that classical models conflate (Spanish vs Catalan, Traditional vs Simplified Chinese), and short acknowledgements interpreted out of context. We moved detection inside the agent's reasoning step, exposing an update_current_language tool that the LLM can invoke as part of its response. The interesting work was not in the move; it was in two prompt-engineering findings, each one a default inherited from classical thinking that we had to unlearn.
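For concreteness, here is a minimal sketch of what the exposed tool surface can look like, assuming an OpenAI-style function-calling schema. Only the tool name update_current_language comes from our setup; the parameter name, enum-free typing, and descriptions are illustrative.

```python
# Minimal sketch of the detection tool, assuming an OpenAI-style
# function-calling interface. Only the tool name comes from our setup;
# the parameter name and descriptions are illustrative.
UPDATE_CURRENT_LANGUAGE_TOOL = {
    "type": "function",
    "function": {
        "name": "update_current_language",
        "description": (
            "Set the conversation language when the customer's latest "
            "message is in a different language than the current one."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "language": {
                    "type": "string",
                    "description": "Code of the detected language, e.g. 'es' or 'zh-tw'.",
                }
            },
            "required": ["language"],
        },
    },
}
```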
Finding 1: Call the tool before generating the answer, not after
The natural prompt structure is: read the message, generate a customer-facing response, then call any tools the response requires. This works for most tools but failed badly here. The model was forgetting to call update_current_language after producing a long response; the trailing call was being silently dropped. Reversing the order (fire the tool first, then generate the response) substantially improved consistency. The fix is small: a prompt rewrite, not a model change. The underlying mechanism is broadly applicable: when an LLM owns both a customer-facing response and a side-effecting tool call, force the tool call early.
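A sketch of how that ordering constraint can be phrased. The wording is illustrative, not our production prompt; the structural point is that the tool call is demanded before any customer-facing text, so a long reply cannot crowd it out.

```python
# Illustrative ordering instruction; not the production prompt.
LANGUAGE_TOOL_ORDERING = """\
Before writing any reply to the customer:
1. Decide the language of the customer's latest message.
2. If it differs from the current conversation language, call
   update_current_language now, before composing your reply.
3. Only after the tool call (or after deciding no call is needed),
   write the customer-facing response in the detected language.
Never leave the tool call until after the reply.
"""
```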
Finding 2: Condition the tool call on the language script, not on self-confidence
Our first prompt asked the model whether it was confident the message was in a supported language and defaulted to "unsupported" under uncertainty. Confidence-gated prompts of this shape systematically under-fire on inputs the model is hesitant about; in our case, most of the prior failures were on languages with non-Latin scripts (Arabic, Cyrillic, Mongolian, Khmer). We replaced the confidence gate with a script-first rule, where script means writing system. The prompt names the relevant non-Latin scripts explicitly; in some cases the script alone is enough of a signal to fire the tool. We added a Latin-script alert for tokens that are real words in non-English languages but surface-read as English ("Reso" in Italian, "TIMOG AFRICA" in Filipino/Tagalog). And we instructed the model that a missed language switch is worse than an extra one.
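Sketched as a prompt fragment, the script-first rule looks roughly like this. The wording and the alert list are illustrative rather than the full production prompt; the examples named are the ones discussed above.

```python
# Illustrative script-first rules; wording and alert list are examples,
# not the full production prompt.
SCRIPT_FIRST_RULES = """\
Decide whether to call update_current_language from the writing system
(script) of the message, not from how confident you feel:
- If the message is written in a non-Latin script (e.g. Arabic, Cyrillic,
  Mongolian, Khmer), the script itself is a strong signal: call the tool
  with the matching language.
- If the message is in Latin script, watch for tokens that look like
  English but are real words in another language (e.g. "Reso" in Italian,
  "TIMOG AFRICA" in Filipino/Tagalog) and use the surrounding context.
- A missed language switch is worse than an extra one: when in doubt,
  call the tool.
"""
```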
An alternative we tested and did not adopt
A short POC constrained the output to a JSON schema, with the detected language emitted as a structured field alongside the user-facing reply. The POC showed added latency on the response key and was paused early; we kept the tool-call architecture.
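A sketch of the structured-output shape we tested, with illustrative field names:

```python
# Sketch of the structured-output alternative: the detected language is
# emitted as a field alongside the user-facing reply. Field names are
# illustrative.
RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "detected_language": {
            "type": "string",
            "description": "Code of the customer's language, e.g. 'it'.",
        },
        "response": {
            "type": "string",
            "description": "The customer-facing reply.",
        },
    },
    "required": ["detected_language", "response"],
}
```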
What we measured
Across ~367k examples spanning 55 languages (expected labels from Google Translate), the LLM detector reached 94.9% overall against the legacy fastText classifier's 91.8% (+3.1pp), with the largest gains on rarer languages: Haitian Creole (+37.6%), Croatian (+25.6%), Slovenian (+17.8%), and Catalan (+17.3%). Significant improvements were also seen in Afrikaans and Swahili (+14.5% each), Norwegian (+11.9%), and Malay, Indonesian, and Japanese (+11.1% each). Chinese (Traditional) is the one language where fastText still came out ahead (-12.92%), and even that is likely overstated. The experiment called the LLM in isolation on a single message, without the conversation history or bot configuration that UR has access to in production; the live agent responds correctly in Traditional Chinese on some cases the isolated experiment marks as failures. A portion of the `zh-tw` labels in the dataset are also character-ambiguous between Traditional and Simplified.
The more interesting benchmark was the prompt-iteration loop. Once detection lived inside the agent, the relevant metric stopped being "is the language label right" and started being "does the model fire the tool when it should." We iterated the prompt against an evaluation dataset of 500 examples replayed three times, for 1500 trials per run, and the three prompt changes above moved the success rate from 78.6% to 97.9%. The prompt instructs the model that a missed call is worse than an extra one, so when in doubt, the system fires. On the should-fire side, 1468 of 1470 cases are correct; both failures are the Italian word "Reso" (`returned/refunded`), which surface-reads as a plausible English fragment and is genuinely ambiguous out of context.
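A sketch of that replay loop, assuming a hypothetical `run_agent` helper and a simple labelled-example format; the scoring logic is the part that matters: each example is replayed three times and the run is scored on whether the tool fired when it should have.

```python
# Sketch of the prompt-iteration loop. `run_agent` and the example format
# are hypothetical; only the 3x replay and the fired/should_fire scoring
# reflect the setup described above.
from collections import Counter

REPLAYS = 3

def evaluate(examples, run_agent):
    """examples: iterable of dicts with 'message' (str) and 'should_fire' (bool)."""
    counts = Counter()
    for example in examples:
        for _ in range(REPLAYS):
            result = run_agent(example["message"])  # returns the tool calls made
            fired = any(call.name == "update_current_language"
                        for call in result.tool_calls)
            counts["total"] += 1
            counts["correct"] += fired == example["should_fire"]
    return counts["correct"] / counts["total"]
```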
What generalizes
Two findings generalize beyond language detection. First, when an LLM owns both a customer-facing response and a side-effecting tool call, force the tool call early in the generation; long responses can cause the model to skip the trailing tool call. Second, when an LLM gates a tool call on its own confidence, it will systematically under-fire. Condition the call on observable surface features instead (script, syntax, structure): features the model can verify against the input rather than against its own uncertainty.