Auto Search: Semantic Search Inside the JVM
By Ching Chew
When you cannot remember the exact name of the thing you are looking for, what do you do? You Ctrl-F a few guesses, then go ask the person who built it.
Imagine you have a reports repository with a few hundred data items across dozens of reports. Users know the concept they want ("how many GPs do we have"), but rarely the exact name an analyst gave it ("GP FTE", "Total practitioner FTE", or something else again). The search is a name match. So the workflow degrades to clicking around reports, then asking someone.

Auto Search is a semantic search bar that goes straight to the matching item across all reports, regardless of the words used. Same query, same dataset.

The search ladder we already know
Most teams reach for this ladder when "the search isn't good enough":
Each rung buys recall, but they all match strings, not meaning. "GP FTE", "doctor hours" and "general practitioner full-time equivalent" describe the same concept and share zero tokens. The synonym list at rung three is a bag of intent the team has to maintain forever.
Semantic embeddings are the next rung up: the search compares meaning, not strings. The catch is the off-the-shelf models do not know your domain.
Why off-the-shelf embeddings fall short
Sentence embedding models are trained on general English. They know "doctor" and "physician" are close. They do not know "GP FTE" and "general practitioner full-time equivalent" are the same thing in an Australian primary care context.
Off-the-shelf all-MiniLM-L6-v2 hits Recall@1 of 0.75 on the workforce planning corpus. bge-small-en-v1.5 is better at 0.80. That is one query in four and one in five missed respectively, and the misses cluster on exactly the domain-specific phrasings that matter most.
Fine-tuning teaches the model the vocabulary. The unglamorous bit is sourcing the training data.
Synthetic training pairs from Claude
There is no human-labelled "golden dataset" yet, so the pipeline manufactures one: ten natural-language queries per item, generated by Claude Haiku, prompted to vary acronyms, synonyms, colloquialisms and specificity.
The prompt:
```python
prompt = (
    f"Data item:\nName: {item['name']}\nDescription: {item.get('description', '')}\n\n"
    f"Generate {QUERIES_PER_ITEM} diverse natural-language queries a health workforce planner "
    f"might type to find this item. Include acronym expansions, synonyms, colloquial phrasings, "
    f"and different specificity levels. Return as a JSON array of strings only."
)
```

A typical run on item GP FTE (description: "Full-time equivalent general practitioners delivering primary care services") produces queries like:
- "GP staffing levels"
- "GP FTE primary care"
- "general practitioner workforce capacity"
- "full-time equivalent doctors in primary care"
- "how many FTE GPs do we have"
After regenerating against changed items in subsequent runs, the corpus accumulates ~5,900 query/item pairs across 350 items. 80% train, 20% holdout. Each pair is one positive example: this query maps to this item. Negatives are sampled in-batch by MultipleNegativesRankingLoss, so every other item in the batch acts as a negative for any given query.
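The 80/20 split is simple enough to sketch. The version below is an illustration, not the pipeline's code: it assumes the split is made at the pair level (so every item still appears in training) and that a fixed seed is used for reproducibility. The function name `split_pairs` and the seed value are made up for the example.

```python
import random

def split_pairs(pairs, holdout_fraction=0.2, seed=42):
    """Shuffle (query, item_id) pairs reproducibly, then cut off a holdout slice."""
    shuffled = list(pairs)                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)       # seeded RNG: same split on every run
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Toy data standing in for the real query/item pairs.
pairs = [(f"query {i}", f"item-{i % 5}") for i in range(50)]
train, holdout = split_pairs(pairs)
print(len(train), len(holdout))  # prints "40 10"
```

A pair-level split means the holdout measures phrasing generalisation, not generalisation to unseen items, which matches how the holdout is described later in the post.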
LLM-generated training data is well-established for this use case; the catch is it teaches the model to recognise the kinds of phrasing the LLM produces. A second, smaller test set with a different phrasing distribution (shorter, keyword-heavy queries) is what tells you whether the model generalises beyond the verbose, conversational patterns Claude produces by default. Both sets are synthetic; a real-user-query set is a follow-up.
Architecture
The pipeline runs offline; the runtime is in-process inside a Spring Boot app.
Offline pipeline
A manifest stores sha256(wpp_id + item_id + name + description) per item. Each run only regenerates pairs and embeddings for new or changed items. The first full run costs about 30c in Claude API calls; incremental runs are free unless the corpus changes.
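A minimal sketch of the manifest check, assuming the manifest is a dict keyed by `wpp_id/item_id` and that items are plain dicts. The separator between fields is my assumption (the post only says the four fields are hashed together), and the helper names and sample items are hypothetical.

```python
import hashlib

def item_key(item):
    """Manifest key; the 'wpp_id/item_id' format is an assumption for this sketch."""
    return f"{item['wpp_id']}/{item['item_id']}"

def item_fingerprint(item):
    """sha256 over the four fields named in the post."""
    parts = [item["wpp_id"], item["item_id"], item["name"], item.get("description", "")]
    # Joining with a unit separator (an assumption) stops "ab"+"c" hashing like "a"+"bc".
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()

def changed_items(corpus, manifest):
    """Items that are new, or whose fingerprint no longer matches the manifest."""
    return [i for i in corpus if manifest.get(item_key(i)) != item_fingerprint(i)]

corpus = [
    {"wpp_id": "wpp-1", "item_id": "gp-fte", "name": "GP FTE",
     "description": "Full-time equivalent general practitioners"},
    {"wpp_id": "wpp-1", "item_id": "nurse-fte", "name": "Nurse FTE", "description": ""},
]
manifest = {item_key(i): item_fingerprint(i) for i in corpus}  # state after a full run

corpus[1]["name"] = "Nurse FTE (revised)"                      # an analyst renames one item
stale = changed_items(corpus, manifest)
print([i["item_id"] for i in stale])  # prints "['nurse-fte']"
```

Only the items in `stale` would need new query pairs and a new embedding; everything else is reused from the previous run.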
Runtime
EmbeddingService loads the ONNX model on app startup. SimilarityService loads the precomputed vectors into memory. The server-side embed + similarity step takes ~5 ms on a laptop (excluding Vue debounce, network and DOM render):
- Vue SearchBar debounces input (300 ms, min 3 chars), POSTs the query
- EmbeddingService.embed() runs the query through ONNX Runtime
- SimilarityService.search() does an in-memory cosine over 350 vectors, returns top-K
- The dropdown renders results with confidence scores
- Selection navigates to the parent report set with ?highlight=itemId

The destination page reads the highlight parameter, scrolls to the item, and fades a CSS highlight over three seconds.

Why ONNX in the JVM, not Bedrock at query time
There is an option to call Amazon Bedrock for the query embedding on every search. This will work but is one more component to manage. It also adds a network hop, a per-query cost and a runtime dependency on a service that can throttle or fail.
ONNX Runtime in the JVM trades that for a one-off model load on app startup:
| Property | Bedrock at query time | ONNX in JVM |
|---|---|---|
| Per-query call (typical) | ~50 ms | ~5 ms |
| Per-query cost | Yes | No |
| Network dependency | Yes | None after startup |
| Deployment surface | +1 service | Same JAR |
| Model swap | Trivial | Rebuild image + republish vectors |
The Bedrock figure is API round-trip; the ONNX figure is in-process inference. The gap is the network, which is the point.
For a feature that needs to feel like Ctrl-F, the latency alone settles it.
The fine-tuned all-MiniLM-L6-v2 is 22M parameters. INT8 quantisation gets the ONNX file under 25 MB. It loads in under a second on a t3.medium.
```java
public EmbeddingService(
        @Value("${search.model.path}") String modelPath,
        @Value("${search.tokenizer.path}") String tokenizerPath
) throws OrtException, IOException {
    // Loaded once at startup and reused for every query.
    this.env = OrtEnvironment.getEnvironment();
    this.session = env.createSession(modelPath, new OrtSession.SessionOptions());
    this.tokenizer = HuggingFaceTokenizer.newInstance(Path.of(tokenizerPath));
}

public float[] embed(String text) throws OrtException {
    Encoding enc = tokenizer.encode(text, true, true);
    // Each input is a batch of one; try-with-resources releases the native tensors.
    try (
        OnnxTensor idsTensor = OnnxTensor.createTensor(env, new long[][]{enc.getIds()});
        OnnxTensor maskTensor = OnnxTensor.createTensor(env, new long[][]{enc.getAttentionMask()});
        OnnxTensor typesTensor = OnnxTensor.createTensor(env, new long[][]{enc.getTypeIds()});
        OrtSession.Result result = session.run(Map.of(
            "input_ids", idsTensor,
            "attention_mask", maskTensor,
            "token_type_ids", typesTensor
        ))
    ) {
        // Output shape is [batch, tokens, hidden]; take the single batch entry.
        float[][][] hidden = (float[][][]) result.get("last_hidden_state").get().getValue();
        return meanPoolAndNormalize(hidden[0], enc.getAttentionMask());
    }
}
```

Mean-pool the token hidden states with the attention mask, L2-normalise the result. Standard sentence-transformers pooling, written by hand because the JVM has no sentence-transformers equivalent.
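The `meanPoolAndNormalize` helper is not shown in the post. As a rough, language-neutral illustration of what it has to compute, here is a plain-Python sketch of masked mean pooling followed by L2 normalisation. It is my reconstruction of the standard technique, not the author's Java code, and the toy vectors are invented.

```python
import math

def mean_pool_and_normalize(hidden, attention_mask):
    """hidden: one vector per token; attention_mask: 1 for real tokens, 0 for padding."""
    dim = len(hidden[0])
    pooled = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden, attention_mask):
        if keep:                                  # padding tokens are skipped entirely
            count += 1
            for j in range(dim):
                pooled[j] += vec[j]
    count = max(count, 1)                         # guard against an all-zero mask
    pooled = [v / count for v in pooled]          # mean over real tokens only
    norm = math.sqrt(sum(v * v for v in pooled)) or 1.0
    return [v / norm for v in pooled]             # unit length, so dot product == cosine

hidden = [[1.0, 0.0], [3.0, 4.0], [100.0, 100.0]]  # third token is padding
mask = [1, 1, 0]
vec = mean_pool_and_normalize(hidden, mask)
```

The masked token contributes nothing: the mean of the two real tokens is `[2.0, 2.0]`, which normalises to a unit vector pointing along the diagonal.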
Similarity is even simpler. With L2-normalised vectors, cosine similarity is the dot product:
```java
public List<SearchResult> search(float[] queryVector, int topK) {
    record Scored(DataItem item, float score) {}
    // Score every item, best first.
    List<Scored> scored = items.stream()
        .map(i -> new Scored(i, cosine(queryVector, i.embedding())))
        .sorted(Comparator.comparingDouble(Scored::score).reversed())
        .toList();
    List<SearchResult> results = new ArrayList<>();
    for (Scored s : scored.subList(0, Math.min(topK, scored.size()))) {
        // Sorted descending, so the first score under the threshold ends the list.
        if (s.score() < MIN_SCORE) break;
        results.add(new SearchResult(s.item().wppId(), s.item().itemId(), s.item().name(), s.score()));
    }
    return results;
}

// Plain dot product; equals cosine similarity only because both vectors are L2-normalised.
static float cosine(float[] a, float[] b) {
    float dot = 0;
    for (int i = 0; i < a.length; i++) dot += a[i] * b[i];
    return dot;
}
```

A for-loop over 350 vectors. No vector DB. At this scale, FAISS or pgvector would be operational dependencies that buy nothing. The day the corpus crosses ~50k items, the loop stops fitting in a request budget; that is when the conversation about a vector index becomes worth having.
Evaluation
Two test sets, both synthetic. The LLM holdout (queries Claude generated, withheld from training, n=1182) measures whether the model learned the patterns. A separate, smaller test set (n=20, shorter and keyword-heavier than the training pairs) is the harder check: same LLM source, different phrasing distribution.
| Model | Recall@1 | MRR@5 |
|---|---|---|
| all-MiniLM-L6-v2 (OOTB baseline) | 0.750 | 0.808 |
| bge-small-en-v1.5 (OOTB baseline) | 0.800 | 0.850 |
| all-MiniLM-L6-v2 (fine-tuned, INT8 ONNX) | 0.850 | 0.908 |
All scores on the secondary test set (n=20). Fine-tuned Recall@1 on the LLM holdout (n=1182): 0.93.
Roughly +5pp Recall@1 over the best off-the-shelf baseline, bge-small-en-v1.5. At n=20 that is a single query, far too small to claim significance, but the fine-tuned model gets there as an INT8 all-MiniLM-L6-v2 while the larger bge-small baseline runs at fp32. The directional result is what matters for production: smaller, faster and more accurate, because the smaller model knows the domain and the larger one does not.
Recall@1 of 0.85 means one query in seven still misses on the top hit. The dropdown shows top-5 with confidence scores, so a near-miss usually surfaces in slot 2 or 3. The minScore = 0.4 threshold suppresses results below that. Better to show nothing than to show a confident-looking wrong answer.
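For readers less familiar with the two metrics in the table, a small sketch of how they are usually computed. The function names and the toy rankings are mine; the evaluation script in the repository may differ in detail.

```python
def recall_at_1(ranked, expected):
    """Fraction of queries whose top result is the expected item."""
    hits = sum(1 for r, e in zip(ranked, expected) if r and r[0] == e)
    return hits / len(expected)

def mrr_at_k(ranked, expected, k=5):
    """Mean reciprocal rank, counting only hits inside the top k."""
    total = 0.0
    for r, e in zip(ranked, expected):
        top = r[:k]
        if e in top:
            total += 1.0 / (top.index(e) + 1)    # rank 1 scores 1, rank 2 scores 1/2, ...
    return total / len(expected)

# Four toy queries, all expecting item "a".
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"], ["b", "c", "d"]]
expected = ["a", "a", "a", "a"]

r1 = recall_at_1(ranked, expected)   # only the first query has "a" on top
mrr = mrr_at_k(ranked, expected)     # (1 + 1/2 + 1/3 + 0) / 4
```

MRR@5 rewards the near-misses that Recall@1 ignores, which is why it is the better proxy for a dropdown that shows five results.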
Lessons learnt
The threshold matters more than the model. minScore = 0.4 was set by eyeballing the score distribution on the secondary test set. Too high and good matches get suppressed. Too low and the dropdown fills with confident-looking nonsense. This needs revisiting any time the model or corpus shifts.
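Eyeballing can be made slightly more systematic with a sweep. The sketch below is one way to do it, not what the post's pipeline does: for each candidate threshold it counts correct top hits that are still shown, wrong top hits that are still shown, and correct top hits that get suppressed. The scores are made up purely for illustration.

```python
def sweep_thresholds(top_hits, thresholds):
    """top_hits: (score, is_correct) for each query's best result."""
    rows = []
    for t in thresholds:
        shown = [(s, ok) for s, ok in top_hits if s >= t]
        correct_shown = sum(1 for _, ok in shown if ok)
        wrong_shown = len(shown) - correct_shown
        correct_suppressed = sum(1 for s, ok in top_hits if ok and s < t)
        rows.append((t, correct_shown, wrong_shown, correct_suppressed))
    return rows

# Invented scores, not measurements from the post.
top_hits = [(0.82, True), (0.61, True), (0.45, True),
            (0.38, False), (0.35, True), (0.22, False)]

for row in sweep_thresholds(top_hits, [0.3, 0.4, 0.5]):
    print(row)
```

On this toy data, 0.3 lets one wrong answer through, 0.5 hides two good ones, and 0.4 sits between them. Rerunning a sweep like this after every model or corpus change is cheaper than rediscovering the threshold by hand.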
The LLM holdout overstates performance. Recall@1 on it is 0.93, on the secondary test set 0.85. The gap is what the model learned about Claude's verbose conversational phrasing rather than the shorter keyword-style phrasing in the secondary set. The 0.85 is the one worth quoting, and even that is a synthetic proxy until real-user logs exist.
MultipleNegativesRankingLoss is the right loss for retrieval. It treats every other item in the batch as an implicit negative, which means batch size affects training quality. Batch 32 gave clean convergence in three epochs on ~4,700 training pairs.
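To make the "implicit negatives" idea concrete, here is a plain-Python sketch of what the loss computes for one batch: a cross-entropy over the query-to-item similarity matrix, where the correct item for row i sits in column i. It assumes L2-normalised vectors and a similarity scale of 20, which is the sentence-transformers default as far as I know. This is an illustration, not the library's implementation.

```python
import math

def mnr_loss(query_vecs, item_vecs, scale=20.0):
    """Mean cross-entropy over the batch; every off-diagonal item is a negative."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * dot(q, d) for d in item_vecs]   # one score per item in the batch
        m = max(logits)                                   # subtract max for numerical stability
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum - logits[i])                # -log softmax of the true pair
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(queries, [[1.0, 0.0], [0.0, 1.0]])     # each query matches its own item
swapped = mnr_loss(queries, [[0.0, 1.0], [1.0, 0.0]])     # each query matches the wrong item
```

Because the softmax runs over the whole batch, a batch of 32 gives each query 31 negatives per step. That is the mechanism behind batch size affecting training quality.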
Embeddings are not free RAM. 350 × 384 × 4 bytes is about half a megabyte. 50k items would be roughly 77 MB (50,000 × 384 × 4 bytes) before any indexing overhead. The "in-memory cosine for everything" pattern has a ceiling.
Where it goes next
More items. The for-loop suffices for hundreds of items. Thousands still fit; tens of thousands need an approximate nearest neighbour index (HNSW via Lucene, or FAISS if you want to leave the JVM). The SimilarityService interface does not change.
More corpora. The pipeline is corpus-agnostic: corpus.json is the only input. Any team with a structured catalogue (data items, configuration entries, product SKUs, anything where users hunt by intent) can drop in their items and rerun.
Hybrid retrieval. Semantic search is strong on intent, weak on rare exact tokens (codes, IDs, version numbers). A BM25 layer on top, with score fusion, is the next obvious upgrade for search bars where both kinds of query happen.
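One common way to combine a keyword ranking with a semantic one is reciprocal rank fusion (RRF). Strictly it fuses ranks, not raw scores, which sidesteps the problem of BM25 and cosine scores living on different scales. The sketch below is a suggestion, not something the post's code implements, and the item IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists; each list contributes 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

semantic = ["gp_fte", "nurse_fte", "icd_j45"]   # hypothetical embedding-search order
keyword = ["icd_j45", "gp_fte"]                 # hypothetical BM25 order
fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)  # prints "['gp_fte', 'icd_j45', 'nurse_fte']"
```

An item that ranks well in both lists beats one that tops only a single list, and an exact-code match that the embedding model ranks last still climbs above items the keyword search never saw.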
Online learning. Logged queries with click-throughs become a real labelled set. Once you have those, the synthetic-pair scaffold can come down.
Code and Further Reading
The full source is on GitHub. The stack is Java 21, Spring Boot, ONNX Runtime, DJL HuggingFace Tokenizers on the runtime side; Python 3.12 with sentence-transformers on the training side; Vue 3 on the frontend.
For the loss function and training loop, the sentence-transformers training docs cover MultipleNegativesRankingLoss in depth.
For ONNX export from sentence-transformers, the Optimum library is the path of least resistance.
If your app has a search bar that fails on intent, the pipeline is on GitHub and corpus.json is the only thing you need to swap. I'd like to hear what breaks when you try it on yours.
AI Tools
Claude Code was used to plan and build the pipeline, the Java services and the Vue frontend. Claude Haiku generated the synthetic training pairs. Claude was used to draft this post.