Skip to content

Amgix Now Benchmarks: Relevance and Latencies

This post is part of Amgix Now Benchmarks Series

Amgix Now - Benchmarks: Relevance

This report focuses on Amgix Now benchmarks, specifically search relevance and latencies across four different datasets. For context, we are including results from three other popular search engines: Typesense, Meilisearch, and Elasticsearch, tested against the same datasets.

Jump to: Benchmarks, Results, or Takeaways

TL;DR In Charts

Typesense Meilisearch Elasticsearch Amgix Now
PC Parts (opt) 0.8438 0.8750 0.9133 0.9325
SciFact 0.3386 0.3770 0.6953 0.6637
Quora 0.2521 0.2650 0.8058 0.8018
NQ 0.0364 0.0506 0.3116 0.3121

* Relevance chart is using boosted configuration scores where available (see details below)

Typesense Meilisearch Elasticsearch Amgix Now
PC Parts (avg) 3.4 3.3 5.3 3.1
SciFact 7.0 5.6 8.5 4.3
Quora 9.5 6.0 6.0 3.6
NQ 64.9 19.7 7.2 9.6

* Latency chart for PC Parts dataset is using average values between Defaults and Optimized runs (see details below)

Typesense Meilisearch Elasticsearch Amgix Now
PC Parts (avg) 5.3 5.1 7.7 4.4
SciFact 35.3 9.6 11.5 6.3
Quora 32.6 10.4 9.0 5.1
NQ 226.5 39.7 10.3 13.3

* Latency chart for PC Parts dataset is using average values between Defaults and Optimized runs (see details below)

Typesense Meilisearch Elasticsearch Amgix Now
PC Parts (Defaults) 1.9 1.5 2.9 0.8
PC Parts (Optimized) 1.9 2.1 1.9 1.8
SciFact 28.3 4.0 3.0 2.0
Quora 23.1 4.4 3.0 1.5
NQ 161.6 20.0 3.1 3.7
Average Latency Variance 43.4 6.4 2.8 2.0

Benchmarks

Introduction

You may have noticed, if you read our blog posts and documentation pages, that we love benchmarks. Well, maybe not "love" (because they are a lot of work), but we do them a lot. Part of it is that the Amgix project is young and benchmarks are our way to tell the world that we are serious and we may be worth your consideration. The other part is that we like precision. When something is fast, we want to know how fast and when something is precise we want to know how precise. Benchmarks give us a tangible way to measure what we are building.

What are We Testing

The primary goal of these tests is for us to find out and share with you how Amgix Now performs on various datasets.

Modern search engines offer a staggering array of features and options, but no matter the level of sophistication of these solutions, there are two core functions that any user expects from a search system: relevance and speed.

Can you find what I'm looking for and can you find it quickly?

To that end, the tests below focus on these metrics by measuring relevance (nDCG@k), recall (Recall@k) and search query latencies (p50, p95). We also report some operational observations along the way.

Note

The following tests are keyword-only. While Amgix Now is a hybrid search engine, capable of handling multi-vector searches using both keyword tokenizers and ML models, the model-based search latencies are dominated by model inference times, which doesn't give us a good representation of search engine performance characteristics.

Frame of Reference

The results below include the Amgix Now numbers side-by-side with the numbers for the three widely used and popular search engines: Typesense, Meilisearch, and Elasticsearch. These are mature projects/businesses with decades of software engineering expertise, countless features, established communities and dedicated users and customers. Listing Amgix Now side-by-side with these giants seems... well... immodest.

We didn't take the decision to include their numbers in this report lightly. For one, it opens us up to all kinds of criticism, fair or unfair. But it is also a ton of work to research and test everything on four different systems.

Nevertheless, we decided to go ahead and publish all the numbers. And here is why:

  1. Publishing only Amgix Now numbers in isolation lacks context. Is 22ms on p50 on this dataset good or meh? Including the numbers from the other systems gives us a frame of reference. This is how some of the other search engines perform on the same task.

  2. Even if you have no interest in Amgix, it may be useful for the engineering community to see the detailed metrics for the established search engines on the same hardware, on the same tests. Not to see who is better or worse. All systems have their strengths and weaknesses. Search relevance and speed are hardly the only metrics that one takes into account when selecting a search solution. But to be able to make more informed decisions when you need to. In fact, we would have loved it if the detailed benchmarks like this existed somewhere on the internet, so we can just cite them and not do all the work.

A Word About WMTR

Amgix Now includes all the same tokenizers that Amgix comes with: WMTR, Full Text, Trigrams and Whitespace. For these tests we are using WMTR as it is the default keyword tokenizer.

WMTR (Weighted Multilevel Token Representation) is Amgix's custom lexical tokenizer. It represents text through multiple lexical views at once: a surface-form view that stays closer to the original tokens, a language-aware normalized view built with Unicode word boundaries, stopword filtering, and stemming, and a character-level view that captures short local patterns inside the text. Those signals are then weighted together into a single sparse vector representation.

Because it requires no ML model, it's fast and lightweight, making it a great default for many use cases where you simply want high-quality lexical search. It is a good fit both for ordinary natural-language text and for noisier datasets (ERP, E-commerce, science, etc.) where short identifiers, mixed-format strings, and token structure still carry meaning.

BEIR Benchmark Context

The BEIR (Benchmarking IR) benchmark suite provides standardized datasets for evaluating information retrieval systems. In short, it's a framework that allows us to measure relevance of the results returned by a search system.

For context, we will include BM25 Multifield baseline scores from Pyserini's BEIR regression results with each BEIR dataset tested.

Test Setup

Expand collapsed sections for details:

Hardware

All tests are performed on a single bare-metal machine with the following specifications:

  • CPU: AMD Ryzen™ 9 5900X × 12 cores (24 threads)
  • RAM: 64GB
  • GPU: NVIDIA GeForce RTX 5060 Ti (16GB)
  • Storage: SSD
  • OS: Ubuntu 24.04.4 LTS

Note

This is hardly a clean room test setup. It's not even a server. It's a desktop Ubuntu workstation with many other processes running on it at the same time: browser windows, other applications, etc. But we did not have any heavy processes running at the time of the tests.

Methodology
  • All datasets below were tested on single containers with 4 CPU cores. The exception was the large NQ dataset, where we gave all the containers 8 CPU cores.

  • Container memory was not restricted. Elasticsearch had its heap size set at 16GB for all datasets.

  • Document uploads and BEIR tests were executed with custom python scripts that use BEIR library to do evaluations.
  • Documents are uploaded in batches of 100.
  • Search queries are executed in parallel, 4 at a time.
  • We followed this procedure on every dataset and every server:

    1. Start container with clean data directory (no indexes/collections).
    2. Create the index/collection and upload documents.
    3. Wait for the container CPU utilization to appear idle (allowing it to finish any background work).
    4. Stop the container.
    5. Start the container.
    6. Run the search tests two times to warm up any caches.
    7. Run the search tests one more time and record reported numbers (relevance and latencies).
    8. Where available, re-run the search tests with boosted queries to record nDCG@10 (boost) values only. See Boosted Queries section below for the details.
Server Versions

The following server versions were used for these tests:

  • Typesense: 29.0
  • Meilisearch: 1.37
  • Elasticsearch: 8.19.6
  • Amgix Now: 0.1.1
Typesense on Natural Language Datasets

While testing Typesense on Natural Language (NL) datasets (SciFact, Quora, NQ) we've realized that there is a significant tension between relevance and latencies when it comes to some of their query parameters. That presented us with a dilemma of what configuration options we should use when we present their numbers.

  • drop_tokens_threshold

    To the best of our understanding, Typesense is using AND semantics when searching the documents. It first finds all the documents in the corpus where all of the words (tokens) of the query have a match. It then looks at the drop_tokens_threshold value and if the number of the documents found in the first pass is below the value, it performs additional searches by omitting the query tokens until they find the desired number of the documents. The default value for drop_tokens_threshold is 1.

    This setting has a dramatic effect on both the latency of the searches and the relevance of the results. Consider the following numbers we recorded by running Typesense on Quora dataset with different values of drop_tokens_threshold:

    drop_tokens_threshold 0 1 10
    nDCG@10 0.0989 0.2521 0.2879
    p50 (ms) 4.8 9.5 16.1
    p95 (ms) 12.8 32.6 42.0
    • With drop_tokens_threshold of 0 they are lightning fast, but nDCG@10 of 0.0989 indicates that they find almost no relevant documents.

    • With drop_tokens_threshold of 10 their relevance is much better, but their latencies are much worse.

    • Going from 1 to 10 improved the relevance marginally, but at the huge expense of the speed.

    For that reason, on all the NL datasets, we ran Typesense with the default setting of 1. It seems to provide the best balance between the relevance and the latencies.

  • num_typos and typo_tokens_threshold

    These settings control typo tolerance of the search queries. NL datasets do not generally have typos, so enabling or disabling the typo tolerance of Typesense had almost no impact on relevance scores. However, keeping typo tolerance on (default) produced dramatically slower results. On the same Quora dataset:

    • p50: 9.5ms (no typos) -> 30.9 (with typos)
    • p95: 32.6ms (no typos) -> 226.4 (with typos)

    Due to these observations, and because NL datasets do not benefit from typo tolerance being enabled for relevance scores, we ran all the Typesense tests on NL datasets with typo tolerance disabled.

Meilisearch on Natural Language Datasets

With Meilisearch we found that typo tolerance on Natural Language (NL) datasets (SciFact, Quora, NQ) improved their relevance scores a little bit and didn't seem to impact their latencies at all. For that reason, we ran Meilisearch with all the default settings (typo tolerance is enabled by default).

May 15: Updated Amgix Now Configuration (and Results) for Natural Language Datasets

After initially publishing the results, we realized that Amgix Now was not optimally configured for natural language datasets (SciFact, Quora, NQ). WMTR was configured with trigrams enabled (default). Trigrams provide typo tolerance and partial string matching for datasets like PC Parts. On NL datasets they add noise and hurt performance. We have re-tested and updated Amgix Now numbers below.

Boosted Queries

On the datasets with two fields (SciFact, NQ) we applied simple scoring tweaks (where available) in an attempt to boost nDCG@k scores for better relevance. The general assumption being that text/content field matches are twice as important as name/title fields.

  • For Typesense and Elasticsearch that meant boosting text field weight to 2
  • We didn't find a corresponding setting for Meilisearch
  • For Amgix Now we set fusion_mode to linear and the weight of content field to 2.

The resulting numbers are reported as nDCG@10 (boost).

Disclaimer

We are not experts at running and configuring third-party search engines. While we studied the settings and tried to give every system appropriate configuration for the tested dataset, it's quite possible that we've missed something and a better configuration may exist. If you notice something in the configuration of these systems that may have affected the test results, please let us know, we'll be happy to re-test with the more optimal configurations.

Results

PC Parts Dataset

The PC Parts dataset is not a standard BEIR dataset. It's a list of 6,636 video card names (Asus TUF GAMING, Asus EAH6450 SILENT/DI/512MD3(LP), Gigabyte GV-N660OC-2GD, Gigabyte WINDFORCE). There are only 16 queries: short partial strings, typos, non alpha-numeric characters.

The data lacks linguistic richness that most tokenizers (model-based or not) are optimized for. However, this is representative of the data that exists in many domains (ERP, E-commerce, etc.), where you may have thousands of records with part numbers like Pin 12LP'-x03/5-XL and short or non-existent descriptions. Users searching these datasets often know the exact model/part number they want to find, so they search for unique bits of identifiers, special characters and all, to quickly find what they came for. Full-text tokenizers (for example) struggle with this, they typically strip out all special characters, drop single digits, etc. FT tokenizer may represent the above part number as pin 12lp x03 xl, which drops some or all differentiators, meaningful for this context, reducing recall and precision of searches.

Because the dataset is unique, not a Natural Language friendly data, we tested two different configs of search engines: Defaults (out-of-the-box experience) and Optimized (settings tuned for better relevance).

Defaults
Search Engine Configurations (expand to see details)

Collection definition:

{
    "vectors": [{"name": "keyword", "type": "wmtr", "index_fields": ["name", "content"]}]
}

Query:

{
    "query": "<QUERY_TEXT>",
    "limit": 100
}

Collection definition:

{
    "name": "<COLLECTION_NAME>",
    "fields": [
        {
            "name": "name", 
            "type": "string"
        },
        {
            "name": "text",
            "type": "string"
        },
    ]
}

Query:

{
    "q": "<QUERY_TEXT>",
    "query_by": "name,text",
    "per_page": 100
}

Index definition:

{"uid": "<INDEX_NAME>", "primaryKey": "id"}

Document definition:

{
    "id": "<DOC_ID>",
    "name": "<TITLE>",
    "text": "<TEXT>"
}

Query:

{
    "q": "<QUERY_TEXT>",
    "limit": 100
}

Index definition:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "id_std": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop", "english_stemmer"]
                }
            },
            "filter": {
                "english_stemmer": {"type": "stemmer", "language": "english"},
            }
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "id_std"
            },
            "content": {
                "type": "text",
                "analyzer": "id_std"
            }
        }
    }
}

Query:

{
    "size": 100,
    "track_total_hits": false,
    "query": {
        "multi_match": {
            "query": "<QUERY_TEXT>",
            "fields": ["name", "content"],
            "type": "most_fields"
        }
    }
}
Typesense Meilisearch Elasticsearch Amgix Now
nDCG@1 0.7500 0.7500 0.8125 0.9375
nDCG@10 0.7500 0.7500 0.8317 0.9325
Recall@10 0.7500 0.7500 0.8438 0.9375
p50 (ms) 2.6 2.9 4.6 3.4
p95 (ms) 4.5 4.4 7.5 4.2

Warning

At sub-10ms latencies with very few queries, small differences should be interpreted with caution as they may fall within normal measurement variance.

Optimized
Search Engine Configurations (expand to see details)

We tried different options to see if we can boost the relevance scores, but couldn't improve on defaults. So we used the same default configuration.

Collection definition:

{
    "vectors": [{"name": "keyword", "type": "wmtr", "index_fields": ["name", "content"]}]
}

Query:

{
    "query": "<QUERY_TEXT>",
    "limit": 100
}

For Typesense we enabled infix and set num_typos to 3.

Collection definition:

{
    "name": "<COLLECTION_NAME>",
    "fields": [
        {
            "name": "name", 
            "type": "string",
            "infix": true
        },
        {
            "name": "text",
            "type": "string",
            "infix": true
        },
    ]
}

Query:

{
    "q": "<QUERY_TEXT>",
    "query_by": "name,text",
    "per_page": 100,
    "infix": "always",
    "num_typos": 3
}

For Meilisearch we adjusted typo-tolerance settings (see below).

Index definition:

{"uid": "<INDEX_NAME>", "primaryKey": "id"}

Document definition:

{
    "id": "<DOC_ID>",
    "name": "<TITLE>",
    "text": "<TEXT>"
}

Query:

{
    "q": "<QUERY_TEXT>",
    "limit": 100
}

Additional config:

curl -X PATCH 'http://localhost:7700/indexes/<INDEX_NAME>/settings/typo-tolerance' \
    -H 'Content-Type: application/json' \
    -d '{
        "minWordSizeForTypos": {
            "oneTypo": 3,
            "twoTypos": 3
        }
    }'

For Elasticsearch we had to index and search with ngrams (trigrams).

Index definition:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "id_std": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop", "english_stemmer"]
                },
                "id_ngram3": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase", "ngram_3"]
                }
            },
            "filter": {
                "english_stemmer": {"type": "stemmer", "language": "english"},
                "ngram_3": {"type": "ngram", "min_gram": 3, "max_gram": 3}
            },
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "id_std",
                "fields": {
                    "ngram": {"type": "text", "analyzer": "id_ngram3"},
                }
            },
            "content": {
                "type": "text",
                "analyzer": "id_std",
                "fields": {
                    "ngram": {"type": "text", "analyzer": "id_ngram3"},
                }
            }
        }
    }
}

Query:

{
    "size": 100,
    "track_total_hits": false,
    "query": {
        "multi_match": {
            "query": "<QUERY_TEXT>",
            "fields": ["name^1", "name.ngram^1", "content^1", "content.ngram^1"],
            "type": "most_fields"
        }
    }
}
Typesense Meilisearch Elasticsearch Amgix Now
nDCG@1 0.8125 0.8750 0.8750 0.9375
nDCG@10 0.8438 0.8750 0.9133 0.9325
Recall@10 0.8750 0.8750 0.9375 0.9375
p50 (ms) 4.2 3.7 6.0 2.8
p95 (ms) 6.1 5.8 7.9 4.6

Warning

At sub-10ms latencies with very few queries, small differences should be interpreted with caution as they may fall within normal measurement variance.

BEIR SciFact Dataset

The SciFact dataset contains 5,183 documents and 300 queries focused on scientific claims and evidence verification. This makes it a great benchmark for evaluating search performance on domain-specific scientific content.

Search Engine Configurations (expand to see details)

Collection definition:

{
    "vectors": [{
        "name": "keyword", 
        "type": "wmtr", 
        "index_fields": ["name", "content"],
        "wmtr_word_weight": 100
    }]
}

Query:

{
    "query": "<QUERY_TEXT>",
    "limit": 100
}

Boosted query:

{
    "query": "<QUERY_TEXT>",
    "limit": 100,
    "fusion_mode": "linear",
    "vector_weights": [
        {"vector_name": "keyword", "field": "name", "weight": 1.0},
        {"vector_name": "keyword", "field": "content", "weight": 2.0},
    ]
}

Collection definition:

{
    "name": "<COLLECTION_NAME>",
    "fields": [
        {
            "name": "name", 
            "type": "string"
        },
        {
            "name": "text",
            "type": "string"
        },
    ]
}

Query:

{
    "q": "<QUERY_TEXT>",
    "query_by": "name,text",
    "per_page": 100,
    "num_typos": 0,
    "typo_tokens_threshold": 0
}

Boosted query:

{
    "q": "<QUERY_TEXT>",
    "query_by": "name,text",
    "per_page": 100,
    "num_typos": 0,
    "typo_tokens_threshold": 0,
    "query_by_weights": "1,2"
}

Index definition:

{"uid": "<INDEX_NAME>", "primaryKey": "id"}

Document definition:

{
    "id": "<DOC_ID>",
    "name": "<TITLE>",
    "text": "<TEXT>"
}

Query:

{
    "q": "<QUERY_TEXT>",
    "limit": 100
}

Index definition:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "id_std": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop", "english_stemmer"]
                }
            },
            "filter": {
                "english_stemmer": {"type": "stemmer", "language": "english"},
            }
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "id_std"
            },
            "content": {
                "type": "text",
                "analyzer": "id_std"
            }
        }
    }
}

Query:

{
    "size": 100,
    "track_total_hits": false,
    "query": {
        "multi_match": {
            "query": "<QUERY_TEXT>",
            "fields": ["name", "content"],
            "type": "most_fields"
        }
    }
}

Boosted query:

{
    "size": 100,
    "track_total_hits": false,
    "query": {
        "multi_match": {
            "query": "<QUERY_TEXT>",
            "fields": ["name^1", "content^2"],
            "type": "most_fields"
        }
    }
}

May 15: Updated Amgix Now Numbers

After initially publishing these results, we realized that Amgix Now was not configured optimally for this dataset. WMTR was configured with trigrams enabled, which only add noise and hurt performance on natural language datasets. We have re-tested and updated Amgix Now numbers below.

Typesense Meilisearch Elasticsearch Amgix Now
nDCG@1 0.3233 0.2900 0.5433 0.4800
nDCG@10 0.3386 0.3770 0.6704 0.6280
nDCG@10 (boost) 0.3386 n/a 0.6953 0.6637
Recall@10 0.3509 0.4637 0.8053 0.7769
p50 (ms) 7.0 5.6 8.5 4.3
p95 (ms) 35.3 9.6 11.5 6.3

BEIR Quora Dataset

The Quora dataset contains 523K documents and 10,000 queries, and pairs duplicate questions from Quora. Documents are question text only.

Search Engine Configurations (expand to see details)

Collection definition:

{
    "vectors": [{
        "name": "keyword", 
        "type": "wmtr", 
        "index_fields": ["name", "content"],
        "wmtr_word_weight": 100
    }]
}

Query:

{
    "query": "<QUERY_TEXT>",
    "limit": 100
}

Collection definition:

{
    "name": "<COLLECTION_NAME>",
    "fields": [
        {
            "name": "name", 
            "type": "string"
        },
        {
            "name": "text",
            "type": "string"
        },
    ]
}

Query:

{
    "q": "<QUERY_TEXT>",
    "query_by": "name,text",
    "per_page": 100,
    "num_typos": 0,
    "typo_tokens_threshold": 0
}

Index definition:

{"uid": "<INDEX_NAME>", "primaryKey": "id"}

Document definition:

{
    "id": "<DOC_ID>",
    "name": "<TITLE>",
    "text": "<TEXT>"
}

Query:

{
    "q": "<QUERY_TEXT>",
    "limit": 100
}

Index definition:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "id_std": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop", "english_stemmer"]
                }
            },
            "filter": {
                "english_stemmer": {"type": "stemmer", "language": "english"},
            }
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "id_std"
            },
            "content": {
                "type": "text",
                "analyzer": "id_std"
            }
        }
    }
}

Query:

{
    "size": 100,
    "track_total_hits": false,
    "query": {
        "multi_match": {
            "query": "<QUERY_TEXT>",
            "fields": ["name", "content"],
            "type": "most_fields"
        }
    }
}

May 15: Updated Amgix Now Numbers

After initially publishing these results, we realized that Amgix Now was not configured optimally for this dataset. WMTR was configured with trigrams enabled, which only add noise and hurt performance on natural language datasets. We have re-tested and updated Amgix Now numbers below.

Typesense Meilisearch Elasticsearch Amgix Now
nDCG@1 0.2606 0.2502 0.7218 0.7120
nDCG@10 0.2521 0.2650 0.8058 0.8018
Recall@10 0.2556 0.2914 0.8999 0.8974
p50 (ms) 9.5 6.0 6.0 3.6
p95 (ms) 32.6 10.4 9.0 5.1

Errors

While testing Quora dataset with Meilisearch we have observed some Connection Reset errors on multiple occasions. We are not sure if we were hitting some bug in Meilisearch or it was caused by some issue in our test environment. We haven't observed any errors in Meilisearch logs or with any other datasets. Every time this happened, we have aborted the tests, restarted Meilisearch container and re-tested until we got clean test runs.

BEIR NQ Dataset

The Natural Questions (NQ) dataset contains 2.6M documents and 3,452 queries, and uses Wikipedia passages and factoid-style queries. This dataset is difficult for keyword searches, relevance scores are usually pretty low, but it gives us a chance to see how search engines perform on a large dataset.

Search Engine Configurations (expand to see details)

Collection definition:

{
    "vectors": [{
        "name": "keyword", 
        "type": "wmtr", 
        "index_fields": ["name", "content"],
        "wmtr_word_weight": 100
    }]
}

Query:

{
    "query": "<QUERY_TEXT>",
    "limit": 100
}

Boosted query:

{
    "query": "<QUERY_TEXT>",
    "limit": 100,
    "fusion_mode": "linear",
    "vector_weights": [
        {"vector_name": "keyword", "field": "name", "weight": 1.0},
        {"vector_name": "keyword", "field": "content", "weight": 2.0},
    ]
}

Collection definition:

{
    "name": "<COLLECTION_NAME>",
    "fields": [
        {
            "name": "name", 
            "type": "string"
        },
        {
            "name": "text",
            "type": "string"
        },
    ]
}

Query:

{
    "q": "<QUERY_TEXT>",
    "query_by": "name,text",
    "per_page": 100,
    "num_typos": 0,
    "typo_tokens_threshold": 0
}

Boosted query:

{
    "q": "<QUERY_TEXT>",
    "query_by": "name,text",
    "per_page": 100,
    "num_typos": 0,
    "typo_tokens_threshold": 0,
    "query_by_weights": "1,2"
}

Index definition:

{"uid": "<INDEX_NAME>", "primaryKey": "id"}

Document definition:

{
    "id": "<DOC_ID>",
    "name": "<TITLE>",
    "text": "<TEXT>"
}

Query:

{
    "q": "<QUERY_TEXT>",
    "limit": 100
}

Index definition:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "id_std": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop", "english_stemmer"]
                }
            },
            "filter": {
                "english_stemmer": {"type": "stemmer", "language": "english"},
            }
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "id_std"
            },
            "content": {
                "type": "text",
                "analyzer": "id_std"
            }
        }
    }
}

Query:

{
    "size": 100,
    "track_total_hits": false,
    "query": {
        "multi_match": {
            "query": "<QUERY_TEXT>",
            "fields": ["name", "content"],
            "type": "most_fields"
        }
    }
}

Boosted query:

{
    "size": 100,
    "track_total_hits": false,
    "query": {
        "multi_match": {
            "query": "<QUERY_TEXT>",
            "fields": ["name^1", "content^2"],
            "type": "most_fields"
        }
    }
}
Operational Observations

Because NQ is such a large dataset, we thought it would be a good opportunity to observe and record some additional metrics, like upload times, memory utilization at various points, and disk space used. Expand this section to see the details of how we recorded the numbers:

How we recorded these numbers

We recorded some operational observations by following the following procedure:

  1. Start container fresh with empty data directory. No indexes or collections.
  2. Record container memory utilization after start as Start Memory.
  3. Bulk upload all documents in 100 documents batches.
  4. Record the time it took to issue all bulk upload requests as Upload Time.
  5. We were able to record Index Time on Meilisearch by querying index stats and waiting for isIndexing value to be false. Typesense is synchronous, there is no separate indexing phase. Both Elasticsearch and Amgix Now (by default, it can be changed) have a "near real-time" indexing, so indexing may go on for some time after uploads are done, so we recorded it as "?".
  6. Wait for CPU activity to die down. Assuming indexing is done at this point.
  7. Record Index Memory - amount of memory consumed by the container after indexing.
  8. Shutdown the container and start it again.
  9. Run BEIR tests.
  10. Record Search Memory - amount of memory consumed by the container after searches ran.
  11. Shutdown the container and record Disk Usage by measuring disk utilization of the data directory.

Warning

Memory on all containers, except for Elasticsearch, fluctuated quite a bit, so the values recorded are just for a general idea and are not definitive.

Typesense Meilisearch Elasticsearch Amgix Now
Upload time 3810 695 212 754
Index Time n/a 168 ? ?
Start Memory 126MB 56MB 16.8GB 116MB
Index Memory 1.7GB 7.2GB 17.0GB 2.4GB
Search Memory 1.6GB 0.3GB 16.8GB 0.7GB
Disk Usage 3.1GB 18.0GB 1.5GB 6.0GB

Startup Times

When we restarted Typesense container it took over 5 minutes to load the data in memory and before the server endpoints became responsive. The rest of the systems became available in seconds.

Relevance and Latency

May 15: Updated Amgix Now Numbers

After initially publishing these results, we realized that Amgix Now was not configured optimally for this dataset. WMTR was configured with trigrams enabled, which only add noise and hurt performance on natural language datasets. We have re-tested and updated Amgix Now numbers below.

Typesense Meilisearch Elasticsearch Amgix Now
nDCG@1 0.0269 0.0330 0.1756 0.1344
nDCG@10 0.0362 0.0506 0.3127 0.2559
nDCG@10 (boost) 0.0364 n/a 0.3116 0.3121
Recall@10 0.0460 0.0694 0.4804 0.4155
p50 (ms) 64.9 19.7 7.2 9.6
p95 (ms) 226.5 39.7 10.3 13.3

Takeaways

Note

Because our primary goal was to evaluate Amgix Now, we focused our takeaways on its performance. The full results for all systems are included above for your own analysis.

Relevance

The relevance numbers for Amgix Now are not new to us. It uses the same WMTR tokenizer as Amgix that we have extensively tested before in our early benchmarks, WMTR benchmarks, and the discussion about Linear fusion.

Amgix Now performed very well on PC Parts dataset and matched or closely approached the published BEIR BM25 baseline on the BEIR datasets.

Latencies

p50 Story
  • On the small PC Parts with very few queries, Amgix Now stayed within the pack and probably within the normal measurement variance numbers. Seemingly, a little better on some runs (Optimized) and in the middle on others (Defaults). If you average the numbers between Defaults and Optimized results, Amgix Now was a tiny bit ahead of the pack with the average of 3.1ms.

  • On the small SciFact and the medium Quora datasets with plenty of queries, Amgix Now led the pack.

  • On the large NQ dataset, Amgix Now was a little behind the leader.

p95 Story

Here, Amgix Now leads on all datasets, except for NQ, showing impressive stability of search latencies. In fact, if you look at the latency variance (p95 - p50) across the datasets you can see this clearly:

Typesense Meilisearch Elasticsearch Amgix Now
PC Parts (Defaults) 1.9 1.5 2.9 0.8
PC Parts (Optimized) 1.9 2.1 1.9 1.8
SciFact 28.3 4.0 3.0 2.0
Quora 23.1 4.4 3.0 1.5
NQ 161.6 20.0 3.1 3.7
Average Latency Variance 43.4 6.4 2.8 2.0

Overall

Amgix Now performed very competitively across the datasets on both relevance and latency. It's new and nowhere near as battle-tested as the other systems included in this report, but we believe the numbers show that it's well worth consideration when you are choosing a search engine for your next project.