Introduction

InferNex Bridge

Alauda Build of InferNex Bridge is based on the openFuyao InferNex project. InferNex Bridge connects KServe LLMInferenceService workloads with the InferNex inference acceleration stack, and also provides native InferNexService APIs for environments that do not use KServe.

The operator installs the InferNex Bridge controller, admission webhooks, RBAC, and the following custom resources:

  • InferNexService: A managed LLM inference service that can deploy inference engines, Hermes Router, Mooncake KV cache, cache-indexer, PD-Orchestrator, Eagle-Eye, and related resources.
  • InferNexServiceConfig: A reusable configuration template referenced by InferNexService through spec.baseRefs.

Deployment Modes

InferNex Bridge supports two deployment entry points. Choose one entry point for each inference service and do not deploy the same service through both paths.

InferNex Bridge currently supports NPU inference workloads only.
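
The "one entry point per service" rule can be checked before manifests are applied. The sketch below is illustrative only: it assumes that "the same service" means the same namespace and name, which this page does not state, and it treats a KServe LLMInferenceService as using the Bridge path only when it carries the infernex.io/runtime: "true" label described in the next section. The helper name services_on_both_paths is hypothetical.

```python
RUNTIME_LABEL = "infernex.io/runtime"  # opt-in label key named in this document


def services_on_both_paths(manifests):
    """Return sorted (namespace, name) pairs declared through both entry points.

    Assumption: two manifests describe the same service when they share a
    namespace and a name. Adjust the key if your naming scheme differs.
    """
    kserve_path, native_path = set(), set()
    for manifest in manifests:
        meta = manifest.get("metadata") or {}
        key = (meta.get("namespace", "default"), meta.get("name"))
        labels = meta.get("labels") or {}
        kind = manifest.get("kind")
        if kind == "LLMInferenceService" and labels.get(RUNTIME_LABEL) == "true":
            kserve_path.add(key)
        elif kind == "InferNexService":
            native_path.add(key)
    return sorted(kserve_path & native_path)


# Hypothetical manifests, reduced to the fields the check reads.
sample = [
    {"kind": "LLMInferenceService",
     "metadata": {"name": "chat", "namespace": "demo",
                  "labels": {RUNTIME_LABEL: "true"}}},
    {"kind": "InferNexService",
     "metadata": {"name": "chat", "namespace": "demo"}},
    {"kind": "InferNexService",
     "metadata": {"name": "embed", "namespace": "demo"}},
]

conflicts = services_on_both_paths(sample)
```

With the sample above, conflicts contains only ("demo", "chat"), because that service is declared through both paths.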

KServe LLMInferenceService

Use this mode when KServe is already installed and you want to keep the KServe LLMInferenceService workflow.

Add the infernex.io/runtime: "true" label to an LLMInferenceService. KServe continues to reconcile the inference engine, Hermes Router, Gateway, HTTPRoute, and InferencePool. InferNex Bridge reconciles the InferNex enhancement components, such as Mooncake KV cache, cache-indexer, PD-Orchestrator, and Eagle-Eye, as well as the KServe runtime compatibility patches.
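
As a rough illustration of the opt-in step, the following sketch adds the label to an LLMInferenceService manifest held as a Python dictionary. The label key and value come from this page. The apiVersion value, the resource name, the namespace, and the helper name opt_in_to_infernex are assumptions made for the example, so check them against your installed KServe version.

```python
import copy
import json

RUNTIME_LABEL = "infernex.io/runtime"  # label key taken from this document


def opt_in_to_infernex(manifest):
    """Return a copy of an LLMInferenceService manifest with the opt-in label."""
    if manifest.get("kind") != "LLMInferenceService":
        raise ValueError("expected an LLMInferenceService manifest")
    patched = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    labels = patched.setdefault("metadata", {}).setdefault("labels", {})
    # Kubernetes label values are strings, so "true" is a string, not a boolean.
    labels[RUNTIME_LABEL] = "true"
    return patched


# Hypothetical manifest: the apiVersion, name, and namespace are assumptions.
example = {
    "apiVersion": "serving.kserve.io/v1alpha1",
    "kind": "LLMInferenceService",
    "metadata": {"name": "demo-llm", "namespace": "demo"},
    "spec": {},
}

if __name__ == "__main__":
    print(json.dumps(opt_in_to_infernex(example), indent=2))
```

The helper copies the manifest rather than mutating it, so the unlabeled original stays available for comparison.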

InferNexService

Use this mode when you want InferNex Bridge to manage the full inference service without using KServe as the entry point.

Create an InferNexService that references one or more InferNexServiceConfig templates. InferNex Bridge reconciles the inference engine, Hermes Router, enhancement components, and, when intelligent gateway routing is enabled, Gateway API resources.
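
The relationship between an InferNexService and its templates might be sketched as follows. Only the kind names and the spec.baseRefs field path come from this page. The apiVersion is a placeholder, the shape of each baseRefs entry (an object with a name key) is a guess, and the helper name build_infernex_service is hypothetical, so consult the CRD schema installed by the operator before relying on any of them.

```python
def build_infernex_service(name, namespace, base_refs):
    """Build an InferNexService manifest that references config templates.

    base_refs: names of InferNexServiceConfig templates. The text says a
    service references one or more templates, so an empty list is rejected.
    """
    if not base_refs:
        raise ValueError("at least one InferNexServiceConfig reference is required")
    return {
        # Placeholder group/version: NOT the real API group, replace it.
        "apiVersion": "infernex.example/v1alpha1",
        "kind": "InferNexService",
        "metadata": {"name": name, "namespace": namespace},
        # Assumed entry shape: each reference is an object with a "name" key.
        "spec": {"baseRefs": [{"name": ref} for ref in base_refs]},
    }


# Hypothetical service and template names.
service = build_infernex_service("demo-llm", "demo", ["npu-defaults", "pd-disagg"])
```

Keeping shared settings in templates and listing them under spec.baseRefs lets several services reuse one configuration while each service manifest stays short.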

Capabilities

  • KServe compatibility: Use the existing KServe LLMInferenceService workflow and opt in to InferNex acceleration with the infernex.io/runtime: "true" label.
  • Native InferNex APIs: Deploy inference services directly with InferNexService and reusable InferNexServiceConfig templates.
  • Prefill-decode disaggregation: Run P/D inference patterns with proxy-server coordination for prefill and decode workloads.
  • Mooncake KV cache: Deploy Mooncake KV cache and cache-indexer components for KV cache reuse and coordination.
  • Intelligent gateway routing: Integrate Hermes Router and Gateway API resources for model-aware request routing.
  • Elastic orchestration: Use PD-Orchestrator components such as Elastic-Scaler, Tidal, and ResourceScalingGroup when the inference engine replica fields are left for the scaler to manage.
  • Hardware observability: Integrate Eagle-Eye hardware monitor and diagnosis components when the required observability dependencies are installed.

For installation on the platform, see Install InferNex Bridge.

Documentation

InferNex Bridge upstream documentation and key dependencies: