Amgix - Open-Source Hybrid Search System

Amgix (pronounced a-MAG-ix) - short for Amalgam Index
amalgam: a mixture or blend of different elements

Amgix is an open-source system that handles ingestion, embedding, and hybrid retrieval behind one REST API. You do not need to stitch together queues, a vector database, and ranking or fusion logic in your application.

Try it — Amgix One bundles everything you need in a single container:

docker run -d -p 8234:8234 -v <path/on/host>:/data amgixio/amgix-one:1

Full walkthrough: Getting started.

Beyond that single-container start, Amgix scales into independently deployable API, ingestion, query, communication, and storage tiers. It natively understands messy enterprise data (part numbers, SKUs, mixed alphanumeric strings) through a custom WMTR tokenizer. Even while coordinating the full pipeline — from ingestion to embedding to fused ranking — it delivers typeahead-level latency on multi-million-document corpora (see benchmarks).

The Pipeline Is Built-In

To get hybrid search working today, teams usually have to build a fragile machine: an ML embedding service, a message broker, a vector database, and custom fusion code in the application layer.

Amgix replaces all of that with a single system boundary:

Zero Glue Code: You POST a document. Amgix handles the queueing, deduplication, distributed locking, retries, and ML embeddings internally.
Autonomous MLOps: Encoder worker nodes self-orchestrate and route embeddings dynamically based on real-time resource availability and demand. You don’t have to manually pin models to specific machines.
Server-Side Fusion: Define dense vectors, sparse models (like SPLADE), and keyword tokens on the same collection. The database searches them in parallel and fuses the results server-side.
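
This page does not say which fusion method Amgix applies. As a general illustration of what fusing several ranked lists involves, here is a minimal sketch of reciprocal rank fusion (RRF), one common technique. The function name, the document IDs, and the three per-retriever result lists are hypothetical; none of this is taken from the Amgix codebase or API.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default for RRF.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from three retrievers for the same query.
dense = ["doc3", "doc1", "doc7"]
sparse = ["doc1", "doc3", "doc9"]
keyword = ["doc1", "doc9", "doc3"]

fused = reciprocal_rank_fusion([dense, sparse, keyword])
print(fused)
```

With server-side fusion, this kind of merging happens inside the search system, so the application receives one ranked list instead of combining three.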

Built for Messy Data

Standard search engines are built for clean paragraphs. They aggressively strip punctuation and short numbers, which ruins searches for SKUs, part numbers, mixed alphanumeric strings, and other identifier-heavy data.

Amgix ships with WMTR (Weighted Multilevel Token Representation), a custom tokenizer built specifically for “ugly” data. It represents text through multiple lexical views at once: a surface-form view that stays closer to the original tokens; a language-aware normalized view built with Unicode word boundaries, stopword filtering, and stemming; and a character-level view that captures short local patterns inside the text. Those signals are then weighted together into a single sparse representation.
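
To make the idea of several weighted lexical views concrete, here is a rough sketch in Python. It is not the WMTR implementation: the view weights, the stopword list, the trigram size, and every function name are assumptions chosen for illustration, and the normalized view omits the stemming and Unicode word-boundary handling described above.

```python
import re
from collections import Counter

# Hypothetical weights and stopwords, chosen only for illustration.
VIEW_WEIGHTS = {"surface": 1.0, "normalized": 0.7, "char": 0.3}
STOPWORDS = {"a", "an", "the", "for", "of", "and"}


def surface_tokens(text):
    # Split on whitespace only, so "AB-1234/X" survives as one token.
    return text.split()


def normalized_tokens(text):
    # Lowercase, split on non-alphanumerics, and drop stopwords.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def char_ngrams(text, n=3):
    # Character n-grams inside each lowercased whitespace-delimited token.
    grams = []
    for token in text.lower().split():
        grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


def multi_view_sparse(text):
    """Combine the three views into one sparse {feature: weight} mapping."""
    views = {
        "surface": surface_tokens(text),
        "normalized": normalized_tokens(text),
        "char": char_ngrams(text),
    }
    weights = Counter()
    for view, tokens in views.items():
        for token, count in Counter(tokens).items():
            weights[f"{view}:{token}"] += VIEW_WEIGHTS[view] * count
    return dict(weights)


vec = multi_view_sparse("Gasket for AB-1234/X")
```

In this sketch the part number contributes an exact surface feature (`surface:AB-1234/X`), normalized pieces such as `normalized:1234`, and character trigrams such as `char:123`, so a query for a partial identifier can still overlap with the document.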

Storage Backends

You don’t even need a dedicated vector database to start. Amgix can run its entirely asynchronous ingestion queue and vector storage natively on the PostgreSQL or MariaDB instances you already operate. If you need maximum scale, simply point Amgix at a Qdrant database. The API remains exactly the same.

What we think is cool about Amgix · Why we built it

Documentation · GitHub Repo