Auto Search: Semantic Search Inside the JVM

When you cannot remember the exact name of the thing you are looking for, what do you do? You Ctrl-F a few guesses, then go ask the person who built it.

Imagine you have a reports repository with a few hundred data items across dozens of reports. Users know the concept they want ("how many GPs do we have"), but rarely the exact name an analyst gave it ("GP FTE", "Total practitioner FTE", or something else again). The existing search is a plain name match, so the workflow degrades to clicking around reports and then asking someone.

[Figure: keyword search returns nothing for a natural-language query]

Auto Search is a semantic search bar that goes straight to the matching item across all reports, regardless of the words used. Same query, same dataset.

The search ladder we already know

Most teams reach for this ladder when "the search isn't good enough":

Each rung buys recall, but they all match strings, not meaning. "GP FTE", "doctor hours" and "general practitioner full-time equivalent" describe the same concept and share zero tokens. The synonym list at rung three is a bag of intent the team has to maintain forever.

Semantic embeddings are the next rung up: the search compares meaning, not strings. The catch is the off-the-shelf models do not know your domain.

Why off-the-shelf embeddings fall short

Sentence embedding models are trained on general English. They know "doctor" and "physician" are close. They do not know "GP FTE" and "general practitioner full-time equivalent" are the same thing in an Australian primary care context.

Off-the-shelf all-MiniLM-L6-v2 hits Recall@1 of 0.75 on the workforce planning corpus. bge-small-en-v1.5 is better at 0.80. That is still one missed query in four and one in five respectively, and the misses cluster on exactly the domain-specific phrasings that matter most.

Fine-tuning teaches the model the vocabulary. The unglamorous bit is sourcing the training data.

Synthetic training pairs from Claude

There is no human-labelled "golden dataset" yet, so the pipeline manufactures training pairs: ten natural-language queries per item, generated by Claude Haiku, prompted to vary acronyms, synonyms, colloquialisms and specificity.

The prompt:

prompt = (
    f"Data item:\nName: {item['name']}\nDescription: {item.get('description', '')}\n\n"
    f"Generate {QUERIES_PER_ITEM} diverse natural-language queries a health workforce planner "
    f"might type to find this item. Include acronym expansions, synonyms, colloquial phrasings, "
    f"and different specificity levels. Return as a JSON array of strings only."
)

A typical run on item GP FTE (description: "Full-time equivalent general practitioners delivering primary care services") produces queries like:

"GP staffing levels"
"GP FTE primary care"
"general practitioner workforce capacity"
"full-time equivalent doctors in primary care"
"how many FTE GPs do we have"

After regenerating against changed items in subsequent runs, the corpus accumulates ~5,900 query/item pairs across 350 items. 80% train, 20% holdout. Each pair is one positive example: this query maps to this item. Negatives are sampled in-batch by MultipleNegativesRankingLoss, so every other item in the batch acts as a negative for any given query.
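To make the in-batch negatives concrete, here is a minimal pure-Python sketch of the idea behind MultipleNegativesRankingLoss. It is not the library's implementation: the function names and toy 2-d vectors are invented for illustration, and the scale of 20 is an assumption meant to mirror the sentence-transformers default.

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def in_batch_ranking_loss(query_vecs, item_vecs, scale=20.0):
    """Mean cross-entropy where query i's correct 'class' is item i.

    Every other item in the batch serves as a negative for query i.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, item) for item in item_vecs]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch (illustrative values only): three queries and their positive items.
queries = [[1.0, 0.1], [0.1, 1.0], [-1.0, 0.2]]
items_aligned = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
# Same items, wrong order: each query's "positive" is now a mismatch.
items_shuffled = [items_aligned[1], items_aligned[2], items_aligned[0]]

aligned_loss = in_batch_ranking_loss(queries, items_aligned)
shuffled_loss = in_batch_ranking_loss(queries, items_shuffled)
```

The loss is near zero when each query sits closest to its own item and large when it does not. Because the negatives come from the rest of the batch, a larger batch gives each query more items to be separated from.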

LLM-generated training data is well-established for this use case; the catch is it teaches the model to recognise the kinds of phrasing the LLM produces. A second, smaller test set with a different phrasing distribution (shorter, keyword-heavy queries) is what tells you whether the model generalises beyond the verbose, conversational patterns Claude produces by default. Both sets are synthetic; a real-user-query set is a follow-up.

Architecture

The pipeline runs offline; the runtime is in-process inside a Spring Boot app.

Offline pipeline

A manifest stores sha256(wpp_id + item_id + name + description) per item. Each run only regenerates pairs and embeddings for new or changed items. The first full run costs about 30c in Claude API calls; incremental runs are free unless the corpus changes.
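The change detection can be sketched in a few lines. This assumes the manifest is a plain mapping from a per-item key to a hex digest; the key format, the helper names, the item IDs and the "Nurse FTE" item are hypothetical, and UTF-8 encoding of the concatenated fields is also an assumption.

```python
import hashlib

def manifest_key(item):
    """Hypothetical manifest key: report ID plus item ID."""
    return f"{item['wpp_id']}/{item['item_id']}"

def item_fingerprint(item):
    """sha256 over wpp_id + item_id + name + description, hex-encoded."""
    raw = item["wpp_id"] + item["item_id"] + item["name"] + item.get("description", "")
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def changed_items(items, manifest):
    """Items that are new, or whose fingerprint differs from the stored one."""
    return [i for i in items if manifest.get(manifest_key(i)) != item_fingerprint(i)]

# Illustrative corpus; IDs and the second item are made up.
items = [
    {"wpp_id": "wpp-1", "item_id": "gp-fte", "name": "GP FTE",
     "description": "Full-time equivalent general practitioners delivering primary care services"},
    {"wpp_id": "wpp-1", "item_id": "nurse-fte", "name": "Nurse FTE",
     "description": "Hypothetical second item"},
]
manifest = {manifest_key(i): item_fingerprint(i) for i in items}

# Editing one description changes only that item's fingerprint.
edited = [dict(items[0], description="FTE GPs in primary care"), items[1]]
to_regenerate = changed_items(edited, manifest)
```

Only the items in to_regenerate would need new query pairs and embeddings, which is why an incremental run over an unchanged corpus costs nothing.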

Runtime

EmbeddingService loads the ONNX model on app startup. SimilarityService loads the precomputed vectors into memory. The server-side embed + similarity step takes ~5 ms on a laptop (excluding Vue debounce, network and DOM render). The request path:

1. Vue SearchBar debounces input (300 ms, min 3 chars), then POSTs the query
2. EmbeddingService.embed() runs the query through ONNX Runtime
3. SimilarityService.search() does an in-memory cosine over 350 vectors and returns the top-K
4. The dropdown renders results with confidence scores
5. Selection navigates to the parent report set with ?highlight=itemId

The destination page reads the highlight parameter, scrolls to the item, and fades a CSS highlight over three seconds.

[Figure: matched item highlighted on the destination page]

Why ONNX in the JVM, not Bedrock at query time

The alternative is to call Amazon Bedrock for the query embedding on every search. That works, but it is one more component to manage, and it adds a network hop, a per-query cost and a runtime dependency on a service that can throttle or fail.

ONNX Runtime in the JVM trades that for a one-off model load on app startup:

Property	Bedrock at query time	ONNX in JVM
Per-query latency (typical)	~50 ms	~5 ms
Per-query cost	Yes	No
Network dependency	Yes	None after startup
Deployment surface	+1 service	Same JAR
Model swap	Trivial	Rebuild image + republish vectors

The Bedrock figure is API round-trip; the ONNX figure is in-process inference. The gap is the network, which is the point.

For a feature that needs to feel like Ctrl-F, the latency alone settles it.

The fine-tuned all-MiniLM-L6-v2 is 22M parameters. INT8 quantisation gets the ONNX file under 25 MB. It loads in under a second on a t3.medium.

// Load the ONNX session and the tokenizer once, at application startup.
public EmbeddingService(
        @Value("${search.model.path}") String modelPath,
        @Value("${search.tokenizer.path}") String tokenizerPath
) throws OrtException, IOException {
    this.env = OrtEnvironment.getEnvironment();
    this.session = env.createSession(modelPath, new OrtSession.SessionOptions());
    this.tokenizer = HuggingFaceTokenizer.newInstance(Path.of(tokenizerPath));
}

// Embed one query: tokenize, run the model, pool the token states into a single vector.
public float[] embed(String text) throws OrtException {
    Encoding enc = tokenizer.encode(text, true, true);
    // Each input is a batch of one (shape [1, seqLen]); try-with-resources frees the native tensors.
    try (
        OnnxTensor idsTensor   = OnnxTensor.createTensor(env, new long[][]{enc.getIds()});
        OnnxTensor maskTensor  = OnnxTensor.createTensor(env, new long[][]{enc.getAttentionMask()});
        OnnxTensor typesTensor = OnnxTensor.createTensor(env, new long[][]{enc.getTypeIds()});
        OrtSession.Result result = session.run(Map.of(
            "input_ids",      idsTensor,
            "attention_mask", maskTensor,
            "token_type_ids", typesTensor
        ))
    ) {
        // last_hidden_state has shape [batch, seqLen, hiddenSize]; take the single batch entry.
        float[][][] hidden = (float[][][]) result.get("last_hidden_state").get().getValue();
        return meanPoolAndNormalize(hidden[0], enc.getAttentionMask());
    }
}

Mean-pool the token hidden states using the attention mask, then L2-normalise the result. This is standard sentence-transformers pooling, written by hand because the JVM has no sentence-transformers equivalent.
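The post does not show meanPoolAndNormalize itself, so the following is a language-neutral sketch of the pooling step in Python, not the author's Java code. The function name and the toy input are illustrative; the logic is masked mean pooling followed by L2 normalisation.

```python
import math

def mean_pool_and_normalize(hidden, attention_mask):
    """hidden: one vector per token; attention_mask: 1 for real tokens, 0 for padding."""
    dim = len(hidden[0])
    pooled = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden, attention_mask):
        if keep:  # skip padding positions entirely
            count += 1
            for d in range(dim):
                pooled[d] += vec[d]
    count = max(count, 1)  # guard against an all-zero mask
    pooled = [v / count for v in pooled]
    norm = math.sqrt(sum(v * v for v in pooled))
    if norm == 0.0:
        return pooled  # nothing to normalise
    return [v / norm for v in pooled]

# Two real tokens and one padding token whose values must not leak into the result.
hidden = [[3.0, 0.0], [0.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
vec = mean_pool_and_normalize(hidden, mask)  # mean [1.5, 2.0], then unit length
```

The padded third token is ignored, the mean of the first two is [1.5, 2.0], and dividing by its length 2.5 gives a unit vector. That unit length is what lets the similarity step use a bare dot product.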

Similarity is even simpler. With L2-normalised vectors, cosine similarity is the dot product:

public List<SearchResult> search(float[] queryVector, int topK) {
    record Scored(DataItem item, float score) {}
    // Score every item against the query, best match first.
    List<Scored> scored = items.stream()
        .map(i -> new Scored(i, cosine(queryVector, i.embedding())))
        .sorted(Comparator.comparingDouble(Scored::score).reversed())
        .toList();

    // Keep at most topK results; the list is sorted, so stop at the first score below the threshold.
    List<SearchResult> results = new ArrayList<>();
    for (Scored s : scored.subList(0, Math.min(topK, scored.size()))) {
        if (s.score() < MIN_SCORE) break;
        results.add(new SearchResult(s.item().wppId(), s.item().itemId(), s.item().name(), s.score()));
    }
    return results;
}

// Plain dot product; it equals cosine similarity only because both vectors are L2-normalised.
static float cosine(float[] a, float[] b) {
    float dot = 0;
    for (int i = 0; i < a.length; i++) dot += a[i] * b[i];
    return dot;
}

A for-loop over 350 vectors. No vector DB. At this scale, FAISS or pgvector would be operational dependencies that buy nothing. The day the corpus crosses ~50k items, the loop stops fitting in a request budget; that is when the conversation about a vector index becomes worth having.

Evaluation

Two test sets, both synthetic. The LLM holdout (queries Claude generated, withheld from training, n=1182) measures whether the model learned the patterns. A separate, smaller test set (n=20, shorter and keyword-heavier than the training pairs) is the harder check: same LLM source, different phrasing distribution.

Model	Recall@1	MRR@5
all-MiniLM-L6-v2 (OOTB baseline)	0.750	0.808
bge-small-en-v1.5 (OOTB baseline)	0.800	0.850
all-MiniLM-L6-v2 (fine-tuned, INT8 ONNX)	0.850	0.908

All scores on the secondary test set (n=20). Fine-tuned Recall@1 on the LLM holdout (n=1182): 0.93.
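For readers unfamiliar with the two metrics, here is a small sketch of how Recall@1 and MRR@5 are typically computed from ranked results. The function names and the toy rankings are illustrative and are not taken from the evaluation code.

```python
def recall_at_1(ranked_lists, expected):
    """Share of queries whose top-ranked item is the expected one."""
    hits = sum(1 for ranked, want in zip(ranked_lists, expected) if ranked[:1] == [want])
    return hits / len(expected)

def mrr_at_k(ranked_lists, expected, k=5):
    """Mean reciprocal rank, counting only hits within the top k."""
    total = 0.0
    for ranked, want in zip(ranked_lists, expected):
        top = ranked[:k]
        if want in top:
            total += 1.0 / (top.index(want) + 1)
    return total / len(expected)

# Four toy queries, all expecting item "a": found at rank 1, 2, 3 and not at all.
ranked_lists = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"], ["b", "c", "d"]]
expected = ["a", "a", "a", "a"]

r1 = recall_at_1(ranked_lists, expected)   # 1 of 4 queries hit at rank 1
mrr = mrr_at_k(ranked_lists, expected)     # (1 + 1/2 + 1/3 + 0) / 4
```

MRR@5 rewards near-misses that Recall@1 ignores, which is why it sits above Recall@1 for every model in the table.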

That is +5 pp Recall@1 over the best off-the-shelf baseline, bge-small-en-v1.5. At n=20 the gap is a single query (17 correct versus 16), which is too small to claim significance. The directional result is what matters for production: the fine-tuned INT8 MiniLM is smaller, faster and, on this evidence, more accurate than the larger fp32 bge-small baseline, because the smaller model knows the domain and the larger one does not.
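One way to see why n=20 cannot carry a significance claim is to put a confidence interval around each score. The sketch below uses the Wilson score interval, which is my choice of method and not something the evaluation itself reports; the counts 17/20 and 16/20 follow from the Recall@1 values of 0.85 and 0.80.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

fine_tuned = wilson_interval(17, 20)  # Recall@1 = 0.85
baseline = wilson_interval(16, 20)    # Recall@1 = 0.80
```

Each interval spans roughly thirty percentage points and comfortably contains the other model's score, so the +5 pp gap is well inside the noise at this sample size.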

Recall@1 of 0.85 means one query in seven still misses on the top hit. The dropdown shows top-5 with confidence scores, so a near-miss usually surfaces in slot 2 or 3. The minScore = 0.4 threshold suppresses results below that. Better to show nothing than to show a confident-looking wrong answer.

Lessons learnt

The threshold matters more than the model. minScore = 0.4 was set by eyeballing the score distribution on the secondary test set. Too high and good matches get suppressed. Too low and the dropdown fills with confident-looking nonsense. This needs revisiting any time the model or corpus shifts.
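Eyeballing can be backed by a simple sweep. The sketch below counts, for each candidate threshold, how many wrong top hits would still be shown and how many correct ones would be suppressed. The scores are invented for illustration and are not taken from the secondary test set.

```python
def sweep_threshold(top_hits, thresholds):
    """top_hits: (score, is_correct) for each query's best match.

    Returns (threshold, wrong_shown, good_suppressed) per threshold.
    """
    rows = []
    for t in thresholds:
        wrong_shown = sum(1 for s, ok in top_hits if s >= t and not ok)
        good_suppressed = sum(1 for s, ok in top_hits if s < t and ok)
        rows.append((t, wrong_shown, good_suppressed))
    return rows

# Made-up top-hit scores: four correct matches, three wrong ones.
top_hits = [
    (0.82, True), (0.67, True), (0.55, True), (0.43, True),
    (0.46, False), (0.38, False), (0.31, False),
]
rows = sweep_threshold(top_hits, [0.3, 0.4, 0.5])
```

On this toy data, 0.3 shows all three wrong answers, 0.5 hides them all but also drops a correct match, and 0.4 sits in between. Rerunning a sweep like this after each model or corpus change is a cheap way to revisit the threshold.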

The LLM holdout overstates performance. Recall@1 on it is 0.93, on the secondary test set 0.85. The gap is what the model learned about Claude's verbose conversational phrasing rather than the shorter keyword-style phrasing in the secondary set. The 0.85 is the one worth quoting, and even that is a synthetic proxy until real-user logs exist.

MultipleNegativesRankingLoss is the right loss for retrieval. It treats every other item in the batch as an implicit negative, which means batch size affects training quality. Batch 32 gave clean convergence in three epochs on ~4,700 training pairs.

Embeddings are not free RAM. 350 × 384 × 4 bytes is about half a megabyte. 50k items would be about 77 MB before any indexing overhead. The "in-memory cosine for everything" pattern has a ceiling.
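The arithmetic is easy to rerun for other corpus sizes. A minimal sketch, assuming 384-dimensional float32 vectors as in the figures above and ignoring JVM object overhead:

```python
def embedding_bytes(n_items, dim=384, bytes_per_float=4):
    """Raw size of the vector table, excluding object headers and any index."""
    return n_items * dim * bytes_per_float

current = embedding_bytes(350)      # 537,600 bytes, about half a megabyte
future = embedding_bytes(50_000)    # 76,800,000 bytes
```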

Where it goes next

More items. The for-loop suffices for hundreds of items. Thousands still fit; tens of thousands need an approximate nearest neighbour index (HNSW via Lucene, or FAISS if you want to leave the JVM). The SimilarityService interface does not change.

More corpora. The pipeline is corpus-agnostic: corpus.json is the only input. Any team with a structured catalogue (data items, configuration entries, product SKUs, anything where users hunt by intent) can drop in their items and rerun.

Hybrid retrieval. Semantic search is strong on intent, weak on rare exact tokens (codes, IDs, version numbers). A BM25 layer on top, with score fusion, is the next obvious upgrade for search bars where both kinds of query happen.
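As a rough illustration of score fusion, the sketch below min-max normalises each score set and takes a weighted sum. This is one simple fusion scheme among several and not a description of any planned implementation; the function names, the item IDs, the scores and the alpha weight are all hypothetical.

```python
def min_max(scores):
    """Rescale a {item_id: score} map to the 0..1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def fuse(semantic, lexical, alpha=0.5):
    """Weighted sum of normalised scores; alpha weights the semantic side."""
    sem, lex = min_max(semantic), min_max(lexical)
    ids = set(sem) | set(lex)
    fused = {i: alpha * sem.get(i, 0.0) + (1 - alpha) * lex.get(i, 0.0) for i in ids}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Made-up scores for a query containing a rare exact code.
semantic = {"gp-fte": 0.81, "nurse-fte": 0.52, "item-4471": 0.20}   # cosine scores
lexical = {"item-4471": 12.0, "gp-fte": 1.5, "nurse-fte": 1.0}      # BM25-style scores
ranked = fuse(semantic, lexical, alpha=0.3)
```

With the lexical side weighted more heavily, the exact-code match rises to the top even though its semantic score is the lowest. Normalising first matters because BM25 scores and cosine scores live on different scales.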

Online learning. Logged queries with click-throughs become a real labelled set. Once you have those, the synthetic-pair scaffold can come down.

Code and Further Reading

The full source is on GitHub. The stack is Java 21, Spring Boot, ONNX Runtime and DJL HuggingFace Tokenizers on the runtime side; Python 3.12 with sentence-transformers on the training side; and Vue 3 on the frontend.

For the loss function and training loop, the sentence-transformers training docs cover MultipleNegativesRankingLoss in depth.

For ONNX export from sentence-transformers, the Optimum library is the path of least resistance.

If your app has a search bar that fails on intent, the pipeline is on GitHub and corpus.json is the only thing you need to swap. I'd like to hear what breaks when you try it on yours.

AI Tools

Claude Code was used to plan and build the pipeline, the Java services and the Vue frontend. Claude Haiku generated the synthetic training pairs. Claude was used to draft this post.