Edge AI & Embedded Inference — AI That Runs When the Cloud Doesn't
Cloud AI inference costs $0.002 per call, adds 200ms of latency, and fails the moment connectivity drops. For IoT devices, factory-floor systems, and mobile apps that need real-time AI, on-device inference isn't a nice-to-have — it's the architecture.
| Dimension | Cloud inference (API call per request) | Edge inference (on-device, zero network) |
|---|---|---|
| Latency | 150–400ms round trip | 5–50ms on-device |
| Works offline | No | Yes |
| Cost at scale | $0.002+ per call | Hardware amortised |
| Data privacy | Data leaves device | Nothing leaves device |
| Model updates | Instant (server-side) | OTA pipeline needed |
| Right for | Low-frequency, large models | Real-time, always-on, IoT |
When it fits
- Inference latency under 50ms is a product requirement — cloud round-trips can't meet it
- The device is intermittently or never connected (industrial, rural, in-vehicle, wearable)
- Per-call cost at your inference volume makes cloud APIs the wrong unit economics
- User privacy requires data to stay on-device — no raw sensor data leaves the device
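The unit-economics point can be sketched as a back-of-envelope break-even calculation. The hardware cost and amortisation period below are illustrative assumptions, not quotes; only the per-call price comes from the comparison table.

```python
# Back-of-envelope break-even: cloud per-call pricing vs. amortised edge hardware.
CLOUD_COST_PER_CALL = 0.002      # $ per inference (from the comparison table)
EDGE_HW_COST = 150.0             # $ per device — assumed for illustration
DEVICE_LIFETIME_DAYS = 3 * 365   # assumed 3-year amortisation window

# Daily call volume at which lifetime cloud spend equals the
# one-time hardware cost for a single device:
break_even_calls_per_day = EDGE_HW_COST / (CLOUD_COST_PER_CALL * DEVICE_LIFETIME_DAYS)
print(f"Break-even: ~{break_even_calls_per_day:.0f} calls/day per device")
```

Under these assumptions the crossover sits well under a hundred calls per day per device; an always-on sensor doing one inference per second crosses it thousands of times over.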
When it doesn't
- The model genuinely needs 70B+ parameters and the device can't carry a quantized version without losing the accuracy the use case requires
- Inference happens at most a few times per day — cloud is cheaper and simpler at that frequency
- You don't own the hardware — if the target is a generic browser, on-device inference is limited to WebAssembly builds of ONNX models and is typically not worth the effort
Process
- Week 1: hardware survey and accuracy-vs-size trade-off analysis.
- Weeks 2–4: model quantization and initial on-device port.
- Weeks 5–7: hardware-specific NPU/GPU optimization and benchmarking.
- Weeks 8–10: offline/sync strategy, OTA delivery pipeline, power and thermal testing.
Pricing
Fixed-price by platform and model complexity. Single-platform port: $60–140k. Multi-platform with OTA delivery: $120–280k. Hardware-specific NPU optimization: add $30–60k per chip target. We'll tell you in discovery if the model can actually run on your hardware at the accuracy you need.
FAQ
- How much does quantization hurt accuracy?
- INT8 quantization typically costs 0.5–2% accuracy on classification tasks. For detection and generation tasks the range is wider. We measure the actual delta against your eval set during week 1 — you get a number, not a range, before any optimization work starts.
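The mechanism behind that accuracy delta can be shown in a minimal sketch of symmetric per-tensor INT8 quantization. This is pure Python for illustration with made-up weight values; a real port would use the framework's own quantizer rather than hand-rolled code.

```python
# Symmetric INT8 weight quantization and the round-trip error it introduces.
def quantize_int8(weights):
    """Map floats to int8 with a single symmetric scale per tensor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.731, -0.052, 0.348, -0.916, 0.004, 0.577]  # toy example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Worst-case round-trip error per weight is bounded by half the quantization step;
# accumulated across layers, this is what shows up as the accuracy delta.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.5f}  max round-trip error={max_err:.5f}")
```

Small weights near zero suffer the largest relative error under a single symmetric scale, which is one reason per-channel scales and calibration data narrow the measured delta.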
- Which chips are you optimized for?
- Apple Neural Engine (A/M series), Qualcomm Hexagon DSP, Arm Ethos NPU, NVIDIA Jetson (TensorRT), and Intel OpenVINO targets. For MCU deployments: STM32, Nordic nRF, and ESP32 series. We'll tell you in discovery if your target chip has enough headroom for the model you need.
- How do you handle model updates without an app release?
- OTA weight delivery via a signed CDN (weights only, not code) with a rollback mechanism and a version gate that prevents a degraded model from shipping. Wired into your existing CI so a model improvement triggers an OTA push the same way a dependency update triggers a build.
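The gate described above can be sketched as a verify-then-swap step on the device. This is a simplified illustration: HMAC with a shared key stands in for the real asymmetric signature scheme, and the `install` function and its slot-flipping behaviour are hypothetical, not a real API.

```python
# Sketch of an OTA update gate: verify the weight blob's signature and
# refuse version downgrades before swapping the active model.
import hashlib
import hmac

SIGNING_KEY = b"replace-with-real-key-material"  # placeholder, not production practice

def sign(blob: bytes) -> str:
    return hmac.new(SIGNING_KEY, blob, hashlib.sha256).hexdigest()

def install(blob: bytes, signature: str, version: int, current_version: int) -> int:
    """Return the active model version after the update attempt."""
    if not hmac.compare_digest(sign(blob), signature):
        return current_version   # tampered or corrupted blob: keep current model
    if version <= current_version:
        return current_version   # version gate: no downgrades
    # ...write blob to the inactive slot, then flip the active pointer...
    return version

weights = b"\x00" * 1024  # stand-in for the quantized weight blob
good = install(weights, sign(weights), version=8, current_version=7)
bad = install(weights, "forged", version=9, current_version=8)
print(good, bad)  # 8 8 — the forged update is rejected, the known-good model stays
```

Keeping the previous weights in an inactive slot is what makes rollback cheap: a failed verification or a tripped accuracy gate simply never flips the pointer.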