Optional. Sent to Gemini when scoring each query’s results so scores favor products in this domain (e.g. for an electronics domain, "mobile" ranks phones above accessory-only hits unless the query clearly asks for accessories).
Per page when fetching unique searches
Leave empty for all
Run by domain (generate keywords)
Optional. Sent to Gemini when scoring results. If empty, the keyword "Domain" field below is also used as the scoring context.
Persist this run's keywords for re-run without regenerating.
Run from queries (comma-separated or file)
Optional. Sent to Gemini when scoring each query’s results so relevance reflects this store type.
One query per line
Client data (Catalog API)
Load the catalog from the catalog sync API, then generate test cases across the selected use cases and run a test batch. You can select one, several, or all use cases; the total test case count is divided across the selected use cases, or set a per-use-case count directly.
Catalog and recommendations use this client’s base URL.
Optional. Sent to Gemini when scoring each generated test query’s results (use-case context is included automatically).
Leave empty or 0 to load all products. Otherwise, the maximum number of products to load (page size is chosen automatically, up to 200 per request).
Select one, several, or all use cases to run.
When set, this drives the batch size. "Per use case" is ignored.
Leave empty when using Total above. If both are empty, the server default applies per use case.
Persist this run's test cases for re-run without regenerating.
Run from saved set
Re-run a previously saved test case set. No catalog fetch or Gemini call is needed, so the same queries run every time.
Optional. Sent to Gemini when scoring results for this batch (saved queries are unchanged).
Saved test case sets
Saved sets are stored in the DB. Delete to free space.
Each client has a name and a base URL (e.g. https://search-pg.ecommbeta.com). Recommendations and catalog APIs use the same base: /recommendations and /catalog/sync/products/. Select a client when running a batch.
Add client
For search-pg.ecommbeta.com only
Optional; used when running batch from Unique searches API
ID
Name
Base URL
X-Tenant-ID
Dashboard URL
Actions
Test batches
ID
Name
Created
Runs
Actions
Batch
·
Summary
Runs
Click a run ID to see products with scores and reasons below.
ID
Query
Score
Verdict
Cost
Run —
#
Product
Score
Reason
Runs
ID
Batch
Query
Score
Verdict
Cost
Created
Run
Query:
Score: ·
Verdict: ·
Cost: $
Analysis summary
Products
#
Product
Score
Reason
Used only for the per-search analysis in Run below; it plays the same role as Dashboard → Store domain / persona.
Insights
ID
Created
Run IDs
Report
Evaluation metrics guidelines
This page explains, in plain language, how we score each search using graded relevance—the model’s scores for how well each returned product fits the query. You do not need a technical background to use it.
Each search returns an ordered list of products with a relevance score on each row. Reports and Excel use these graded metrics so you see how strong the matches are, not only pass or fail.
How graded relevance scores are assigned
After the search API returns products, a language model (LLM) looks at the customer’s query and the exact list of product names that came back. It assigns each product a whole-number score from 0 to 100 using a fixed scale: a high score means the product fits the query well (up to a near-perfect match), mid scores mean partly related, and very low scores mean a poor or off-topic match. For weaker items, the model may add a short explanation. It also writes a brief overall verdict about the quality of the whole result set. Those per-product scores are what drive accuracy, precision, coverage, NDCG, and F1 in the graded view.
What “how many results we asked for” means
When you set a limit (for example 50 products), that number is what we compare against for several metrics. Think of it as the depth we expect the search to fill. If the API returns fewer than that, some scores will reflect that gap on purpose.
Accuracy
The average strength of all relevance scores in the list. Every position is treated the same—so unlike precision, shuffling order does not change this number. It is a pure “overall quality of this list” reading.
Empty results: If nothing is returned, metrics read as zero so the test still counts in batch averages.
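As a sketch of the idea (not the app's exact code), accuracy is just the plain average of the per-product relevance scores on the 0–100 scale, with an empty list reading as zero:

```python
def accuracy(scores: list[int]) -> float:
    """Mean of all relevance scores in the returned list.

    Every position counts equally, so reordering the list does not
    change the result. An empty result reads as 0.0 so the test
    still contributes to batch averages.
    """
    if not scores:
        return 0.0
    return sum(scores) / len(scores)
```

For example, a list scored [90, 50, 10] averages to 50, whether the 90 sits first or last.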
Precision
Looks at the top of the list, up to the limit you asked for, and weights higher positions more than lower ones. Strong scores at the top pull precision up; burying the best match at the bottom pulls it down.
Example in words: If the strongest match is only in 8th place, precision will look weaker than if that same item were 1st or 2nd.
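The exact position weights are not spelled out here; a common choice (and one consistent with the behavior described above) is a logarithmic discount, as in DCG. This sketch assumes that weighting and compares the list against a hypothetical perfect list of 100s at the requested depth:

```python
import math

def weighted_precision(scores: list[int], limit: int) -> float:
    """Position-discounted precision sketch (assumed 1/log2 weights).

    Each slot up to `limit` gets weight 1/log2(position + 1), so a
    strong score in slot 1 counts more than the same score in slot 8.
    Missing slots (fewer results than `limit`) score 0.
    """
    weights = [1 / math.log2(i + 2) for i in range(limit)]
    padded = (scores[:limit] + [0] * limit)[:limit]
    achieved = sum(w * s for w, s in zip(weights, padded))
    best = sum(w * 100 for w in weights)  # perfect score in every slot
    return achieved / best
```

With this weighting, [100, 0] beats [0, 100] at limit 2, matching the "burying the best match" behavior above.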
Recall
In exported Excel this column is labeled Recall, but the idea is closer to coverage: how much useful material you got compared to what you asked for, together with how strong those scores are. If you asked for 50 results but only received 5, this number goes down—even if those 5 score very well—because filling the requested depth matters. NDCG, by contrast, only judges whether those returned items were ordered well.
No full ground truth: In most runs we do not have a complete catalog of “every product that should have matched” the query. Textbook recall compares what you found with all relevant items—including strong matches the search never returned. We never see those missed items in the data, so we cannot treat the problem like a full checklist (in technical terms, false negatives are not observable here). The Recall column is therefore a practical coverage-style score built from returned products and their LLM scores, not classic recall against a complete answer key.
When many results come back: We compare the top portion to the best scores that could sit in those slots, so a long list that is both strong and well filled can still score high on this measure.
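A minimal sketch of this coverage-style reading, assuming the simplest form consistent with the description (total returned relevance versus a fully filled list of perfect scores at the requested depth):

```python
def coverage(scores: list[int], limit: int) -> float:
    """Coverage-style 'Recall' sketch.

    Compares the relevance actually returned (top `limit` items) to a
    hypothetical list that fills every requested slot with a perfect
    score of 100. Five perfect items against a requested depth of 50
    still score low, because filling the depth matters.
    """
    if limit <= 0:
        return 0.0
    achieved = sum(scores[:limit])
    best = 100 * limit
    return achieved / best
```

So five 100-scored products against a limit of 50 yield 0.1, while fifty of them yield 1.0.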
F1 score
One combined number that balances precision and coverage. Use it when you want a single headline instead of reading both.
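The balance is the standard harmonic mean, which punishes a weak side harder than a plain average would; a short sketch, taking precision and coverage as already-computed 0–1 values:

```python
def f1(precision: float, coverage: float) -> float:
    """Harmonic mean of precision and coverage.

    Both sides must be healthy for a good score: if either is 0,
    F1 is 0, unlike a simple average which would still read 0.5
    when one side is perfect and the other is empty.
    """
    if precision + coverage == 0:
        return 0.0
    return 2 * precision * coverage / (precision + coverage)
```

For example, perfect precision with zero coverage still reads 0, which is why this works as a single headline number.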
NDCG ranking
Answers: Were the returned products sorted in a sensible order by their relevance scores? It compares the actual order to the best possible order using the same items that came back. A short list can still score well if those items are ordered well. Shortfalls on “we asked for many but got few” are reflected in coverage, not here.
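In sketch form, assuming the textbook log2-discounted formulation: the actual ordering's discounted gain is divided by the gain of the best possible ordering of the same returned scores, so only order matters, not list length:

```python
import math

def ndcg(scores: list[int]) -> float:
    """NDCG over the returned items only.

    Compares the actual order to the ideal (descending) order of the
    same scores. A short list in the right order scores 1.0; depth
    shortfalls are left to the coverage metric, not penalized here.
    """
    if not scores:
        return 0.0
    dcg = sum(s / math.log2(i + 2) for i, s in enumerate(scores))
    ideal = sum(s / math.log2(i + 2)
                for i, s in enumerate(sorted(scores, reverse=True)))
    return dcg / ideal if ideal > 0 else 0.0
```

So [100, 50, 10] scores a perfect 1.0, while the same three items in reverse order score below it.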