Optional. Sent to Gemini when scoring each query’s results so scores favor products in this domain (e.g. for an electronics domain, "mobile" ranks phones above accessory-only hits unless the query clearly asks for accessories).
Per page when fetching unique searches
Leave empty for all
Run by domain (generate keywords)
Optional. Sent to Gemini when scoring results. If empty, the keyword "Domain" field below is also used as the scoring context.
Persist this run's keywords for re-run without regenerating.
Run from queries (comma-separated or file)
Optional. Sent to Gemini when scoring each query’s results so relevance reflects this store type.
One query per line
Client data (Catalog API)
Load the catalog from the catalog sync API, then generate test cases across the selected use cases and run a test batch. You can select one, several, or all use cases; the total test case count is divided across the selected use cases, or set a per-use-case count directly.
Catalog and recommendations use this client’s base URL.
Optional. Sent to Gemini when scoring each generated test query’s results (use-case context is included automatically).
Leave empty or 0 to load all products. Otherwise, the maximum number of products to load (page size is chosen automatically, up to 200 per request).
Select one, several, or all use cases to run.
When set, this drives the batch size. "Per use case" is ignored.
Leave empty when using Total above. If both are empty, the server default applies per use case.
Persist this run's test cases for re-run without regenerating.
Run from saved set
Re-run a previously saved test case set. No catalog fetch or Gemini call is needed, so the same queries run every time.
Optional. Sent to Gemini when scoring results for this batch (saved queries are unchanged).
Saved test case sets
Saved sets are stored in the DB. Delete to free space.
Each client has a name and a base URL (e.g. https://search-pg.ecommbeta.com). Recommendations and catalog APIs use the same base: /recommendations and /catalog/sync/products/. Select a client when running a batch.
Add client
For search-pg.ecommbeta.com only
Optional; used when running batch from Unique searches API
ID
Name
Base URL
X-Tenant-ID
Dashboard URL
Actions
Test batches
ID
Name
Created
Runs
Actions
Batch
·
Summary
Runs
Click a run ID to see products with scores and reasons below.
ID
Query
Score
Verdict
Cost
Run —
#
Product
Score
Reason
Runs
ID
Batch
Query
Score
Verdict
Cost
Created
Run
Query:
Score: ·
Verdict: ·
Cost: $
Analysis summary
Products
#
Product
Score
Reason
Used only for the per-search analysis in Run below; it plays the same role as Dashboard → Store domain / persona.
Insights
ID
Created
Run IDs
Report
Evaluation metrics guidelines
This page explains, in plain language, how we score each search using graded relevance—the model’s scores for how well each returned product fits the query. You do not need a technical background to use it.
Each search returns an ordered list of products with a relevance score on each row. Reports and Excel use these graded metrics so you see how strong the matches are, not only pass or fail.
How graded relevance scores are assigned
After the search API returns products, a language model (LLM) looks at the customer’s query and the exact list of product names that came back. It assigns each product a whole-number score from 0 to 100 using a fixed scale: a high score means the product fits the query well (up to a near-perfect match), mid scores mean partly related, and very low scores mean a poor or off-topic match. For weaker items, the model may add a short explanation. It also writes a brief overall verdict about the quality of the whole result set. Those per-product scores are what drive accuracy, precision, coverage, NDCG, and F1 in the graded view.
What “how many results we asked for” means
When you set a limit (for example 50 products), that number is what we compare against for several metrics. Think of it as the depth we expect the search to fill. If the API returns fewer than that, some scores will reflect that gap on purpose.
Accuracy
The average strength of all relevance scores in the list. Every position is treated the same—so unlike precision, shuffling order does not change this number. It is a pure “overall quality of this list” reading.
Empty results: If nothing is returned, metrics read as zero so the test still counts in batch averages.
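As a sketch of the idea (not the app's exact code), accuracy is just the plain average of the per-product relevance scores on the 0–100 scale, with an empty list reading as zero:

```python
def accuracy(scores: list[int]) -> float:
    """Mean of all relevance scores in the returned list.

    Every position counts equally, so reordering the list does not
    change the result. An empty result reads as 0.0 so the test
    still contributes to batch averages.
    """
    if not scores:
        return 0.0
    return sum(scores) / len(scores)
```

For example, a list scored [90, 50, 10] averages to 50, whether the 90 sits first or last.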
Precision
Looks at the top of the list, up to the limit you asked for, and weights higher positions more than lower ones. Strong scores at the top pull precision up; burying the best match at the bottom pulls it down.
Example in words: If the strongest match is only in 8th place, precision will look weaker than if that same item were 1st or 2nd.
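The exact position weights are not spelled out here; a common choice (and one consistent with the behavior described above) is a logarithmic discount, as in DCG. This sketch assumes that weighting and compares the list against a hypothetical perfect list of 100s at the requested depth:

```python
import math

def weighted_precision(scores: list[int], limit: int) -> float:
    """Position-discounted precision sketch (assumed 1/log2 weights).

    Each slot up to `limit` gets weight 1/log2(position + 1), so a
    strong score in slot 1 counts more than the same score in slot 8.
    Missing slots (fewer results than `limit`) score 0.
    """
    weights = [1 / math.log2(i + 2) for i in range(limit)]
    padded = (scores[:limit] + [0] * limit)[:limit]
    achieved = sum(w * s for w, s in zip(weights, padded))
    best = sum(w * 100 for w in weights)  # perfect score in every slot
    return achieved / best
```

With this weighting, [100, 0] beats [0, 100] at limit 2, matching the "burying the best match" behavior above.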
Recall
In exported Excel this column is labeled Recall, but the idea is closer to coverage: how much useful material you got compared to what you asked for, together with how strong those scores are. If you asked for 50 results but only received 5, this number goes down—even if those 5 score very well—because filling the requested depth matters. NDCG, by contrast, only judges whether those returned items were ordered well.
No full ground truth: In most runs we do not have a complete catalog of “every product that should have matched” the query. Textbook recall compares what you found with all relevant items—including strong matches the search never returned. We never see those missed items in the data, so we cannot treat the problem like a full checklist (in technical terms, false negatives are not observable here). The Recall column is therefore a practical coverage-style score built from returned products and their LLM scores, not classic recall against a complete answer key.
When many results come back: We compare the top portion to the best scores that could sit in those slots, so a long list that is both strong and well filled can still score high on this measure.
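A minimal sketch of this coverage-style reading, assuming the simplest form consistent with the description (total returned relevance versus a fully filled list of perfect scores at the requested depth):

```python
def coverage(scores: list[int], limit: int) -> float:
    """Coverage-style 'Recall' sketch.

    Compares the relevance actually returned (top `limit` items) to a
    hypothetical list that fills every requested slot with a perfect
    score of 100. Five perfect items against a requested depth of 50
    still score low, because filling the depth matters.
    """
    if limit <= 0:
        return 0.0
    achieved = sum(scores[:limit])
    best = 100 * limit
    return achieved / best
```

So five 100-scored products against a limit of 50 yield 0.1, while fifty of them yield 1.0.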
F1 score
One combined number that balances precision and coverage. Use it when you want a single headline instead of reading both.
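The balance is the standard harmonic mean, which punishes a weak side harder than a plain average would; a short sketch, taking precision and coverage as already-computed 0–1 values:

```python
def f1(precision: float, coverage: float) -> float:
    """Harmonic mean of precision and coverage.

    Both sides must be healthy for a good score: if either is 0,
    F1 is 0, unlike a simple average which would still read 0.5
    when one side is perfect and the other is empty.
    """
    if precision + coverage == 0:
        return 0.0
    return 2 * precision * coverage / (precision + coverage)
```

For example, perfect precision with zero coverage still reads 0, which is why this works as a single headline number.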
NDCG ranking
Answers: Were the returned products sorted in a sensible order by their relevance scores? It compares the actual order to the best possible order using the same items that came back. A short list can still score well if those items are ordered well. Shortfalls on “we asked for many but got few” are reflected in coverage, not here.
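In sketch form, assuming the textbook log2-discounted formulation: the actual ordering's discounted gain is divided by the gain of the best possible ordering of the same returned scores, so only order matters, not list length:

```python
import math

def ndcg(scores: list[int]) -> float:
    """NDCG over the returned items only.

    Compares the actual order to the ideal (descending) order of the
    same scores. A short list in the right order scores 1.0; depth
    shortfalls are left to the coverage metric, not penalized here.
    """
    if not scores:
        return 0.0
    dcg = sum(s / math.log2(i + 2) for i, s in enumerate(scores))
    ideal = sum(s / math.log2(i + 2)
                for i, s in enumerate(sorted(scores, reverse=True)))
    return dcg / ideal if ideal > 0 else 0.0
```

So [100, 50, 10] scores a perfect 1.0, while the same three items in reverse order score below it.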