POST /v1/deploy · Montreal, Canada

Deploy models. Serve inference. Scale with confidence.

InferenceHub runs the hub-and-spoke infrastructure behind production AI — model serving, API gateways and edge deployment engineered for sub-50ms responses across Canada.

$ ihb serve --model llm-7b --gateway edge --region ca-qc

InferenceHub model serving hub schematic
Gateway live 41 ms p99
GET /packages

Deployment packages, priced in CAD

Four ways to get inference into production — from a single endpoint to a multi-region serving mesh.

tier/launch

Endpoint Launch

From CAD 3,500

One model, one managed inference endpoint with autoscaling and monitoring.

  • Single model endpoint
  • API key & basic gateway
  • Latency dashboard
tier/scale

Serving Mesh

From CAD 18,000

Multiple models behind one gateway with routing, versioning and canary rollout.

  • Multi-model routing
  • Canary & rollback
  • Throughput monitoring
tier/edge

Edge Hub

From CAD 46,000

Low-latency inference pushed to edge nodes for real-time, in-region serving.

  • Edge node deployment
  • Regional failover
  • Sub-50ms targets
tier/platform

Inference Platform

From CAD 120,000

Full MLOps platform: registry, CI/CD, observability and on-call operations.

  • Model registry & CI/CD
  • Full observability stack
  • Managed on-call
runtime/stack

The inference stack we run on

Battle-tested serving runtimes and gateways, wired into a single hub.

vLLM · Triton · TensorRT-LLM · ONNX Runtime · Ray Serve · KServe · Kubernetes · Envoy Gateway · Prometheus · Grafana
why/inferencehub

Built for production serving, not notebooks

01 · latency

Latency-first serving

Every endpoint is profiled and tuned for p99 budgets, with batching and quantisation handled for you.
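
As a rough illustration of what checking an endpoint against a p99 budget can look like, here is a minimal sketch. The function names and the 50 ms default are assumptions made for this example (the default echoes the sub-50ms target mentioned on this page); this is not InferenceHub's actual tooling.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies (ms)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # 1-based nearest rank: ceil(0.99 * n), done in integer arithmetic
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def within_budget(latencies_ms, budget_ms=50.0):
    """True if the observed p99 is at or under the budget (hypothetical default)."""
    return p99(latencies_ms) <= budget_ms


samples = [12.0] * 98 + [48.0, 210.0]  # one slow outlier in 100 requests
print(p99(samples), within_budget(samples))
```

With 100 samples the nearest-rank p99 is the 99th smallest value, so a single outlier does not break the budget, but two would.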

02 · residency

Canadian data residency

Inference runs in-region on infrastructure aligned with PIPEDA and Quebec Law 25 expectations.

03 · scale

Autoscaling by default

Hub-and-spoke routing scales spokes up and down with live traffic, so you pay for what you serve.
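
One common way to express this kind of scaling decision is to size the replica count from live request rate and clamp it to a floor and a ceiling. The sketch below assumes a hypothetical per-replica throughput target and hypothetical bounds; it illustrates the idea only and does not describe the hub's actual autoscaler.

```python
import math


def desired_replicas(current_rps, target_rps_per_replica,
                     min_replicas=1, max_replicas=20):
    """Replicas needed to serve current_rps, clamped to [min, max].

    All parameter names and defaults are illustrative assumptions.
    """
    if target_rps_per_replica <= 0:
        raise ValueError("target_rps_per_replica must be positive")
    needed = math.ceil(current_rps / target_rps_per_replica)
    return max(min_replicas, min(max_replicas, needed))


print(desired_replicas(450, 100))   # traffic needs 4.5 replicas, rounds up
print(desired_replicas(0, 100))     # idle traffic falls back to the floor
print(desired_replicas(5000, 100))  # burst is capped at the ceiling
```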

04 · observability

Full observability

Token throughput, GPU utilisation and request traces stream into dashboards from day one.

05 · rollout

Safe rollouts

Canary releases, shadow traffic and instant rollback keep new model versions from breaking production.
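
A canary release typically sends a small, stable share of requests to the new model version, and rollback amounts to setting that share back to zero. The sketch below assumes hash-based bucketing on a request ID; the version labels and function name are hypothetical and not taken from the InferenceHub gateway.

```python
import hashlib


def pick_version(request_id, canary_percent, stable="v1", canary="v2"):
    """Route a request to the canary if its hash bucket falls under the share.

    Hashing the request ID keeps routing deterministic, so the same ID
    always lands on the same version for a given canary_percent.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return canary if bucket < canary_percent else stable


# Roughly 10% of synthetic request IDs reach the canary.
hits = sum(pick_version(f"req-{i}", 10) == "v2" for i in range(10_000))
print(hits)

# Rollback: a 0% share sends every request to the stable version.
print(pick_version("req-42", 0))
```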

06 · team

Senior serving engineers

Small, accountable pods of platform and MLOps engineers — no hand-off to junior benches.

Cloud inference cluster serving models
cluster/online

Your models, online and serving in weeks

We stand up your first inference endpoint on a managed cluster, instrument it end to end, and hand you a runbook your team can own.

Explore services
GET /services

How the hub works

svc.serve

Model serving

Managed endpoints for LLMs, vision and tabular models with autoscaling and batching tuned to your traffic.

svc.gateway

API gateway setup

A single, authenticated gateway that routes, rate-limits and versions every inference request.
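
Rate limiting at a gateway is often implemented with a token bucket: each client earns tokens at a steady rate up to a burst capacity, and each request spends one. The following is a minimal single-threaded sketch with assumed parameter names; it is not the gateway's actual implementation, and the production stack listed above would normally handle this at the proxy layer.

```python
import time


class TokenBucket:
    """Token-bucket limiter; the clock is injectable to make testing easy."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec        # tokens added per second
        self.capacity = capacity        # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)   # start with a full bucket
        self.updated = clock()

    def allow(self):
        """Refill based on elapsed time, then spend one token if available."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_sec=5, capacity=2)
print([bucket.allow() for _ in range(3)])  # burst of 2 passes, third is limited
```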

svc.edge

Edge deployment

Push models to edge and regional nodes for real-time inference close to your users.

client/feedback

Teams serving on the hub

"InferenceHub took our LLM from a flaky demo to a 38ms gateway serving every product team. The rollback safety alone paid for itself."

Camille Lavoie, VP Engineering, Montreal FinTech

"They deployed our vision models to edge nodes across three regions in five weeks. Throughput tripled and the dashboards are excellent."

Raj Bhatt, Head of AI, Logistics Co.

"The platform package gave us a real model registry and CI/CD. New versions ship in minutes instead of days, with full observability."

Sophie Tremblay, Director of Data, HealthTech
POST /v1/deploy

Ready to serve inference at scale?

Book a deployment session and we will have your first endpoint live, monitored and documented within weeks.

Deploy inference