The last few years have seen tremendous progress in AI vision systems. Cameras that once merely streamed video can now detect people, vehicles, animals, infrastructure defects, and even understand complex scenes. At the same time, the industry continues to debate where intelligence should live. Should cameras simply transmit raw video to powerful servers, or should they become intelligent edge devices that process information locally?
For many practical applications, the answer is increasingly clear: intelligence belongs at the edge.
Raw video is one of the most expensive forms of data we can transport and process. A single 1080p stream at 30 frames per second contains roughly 62 million pixels every second, most of which carry little useful information. Large portions of a scene remain unchanged between frames, and many environments contain long periods where nothing relevant happens at all. Sending every pixel to a cloud server and asking an AI model to repeatedly analyze the entire image is often inefficient from both a networking and a computational perspective.
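To put a number on that cost, here is a back-of-the-envelope calculation of the raw data rate. It assumes 8-bit RGB with three bytes per pixel, which is an illustrative choice and not something every sensor uses.

```python
# Raw (uncompressed) data rate of a single 1080p stream at 30 fps.
WIDTH, HEIGHT, FPS = 1920, 1080, 30
BYTES_PER_PIXEL = 3  # assumption: 8-bit RGB, no chroma subsampling

pixels_per_second = WIDTH * HEIGHT * FPS
raw_bytes_per_second = pixels_per_second * BYTES_PER_PIXEL
raw_megabits_per_second = raw_bytes_per_second * 8 / 1_000_000

print(f"{pixels_per_second:,} pixels/s")                       # 62,208,000 pixels/s
print(f"{raw_megabits_per_second:,.0f} Mbit/s uncompressed")   # 1,493 Mbit/s uncompressed
```

Deployed cameras normally send compressed video, so the figure on the wire is far lower. The point of the number is the volume of pixels a downstream model still has to decode and analyze.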
Edge computing changes this equation. Instead of treating a camera as a passive sensor, the camera becomes an active participant in understanding the environment. Lightweight neural networks running on embedded NPUs can perform object detection, tracking, segmentation, and scene analysis directly on the device. Rather than transmitting every frame, the camera can publish meaningful information such as detected objects, positions, confidence scores, trajectories, and metadata.
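As a rough illustration of what such a published message could look like, the sketch below serializes two detections to JSON and compares the size with one raw frame. The field names, the `Detection` type, and the `encode_frame_observations` helper are all hypothetical and do not represent a standard schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Detection:
    """One detected object in a frame (hypothetical schema)."""
    label: str
    confidence: float
    bbox: tuple      # (x, y, width, height) in pixels
    track_id: int


def encode_frame_observations(camera_id, timestamp_ms, detections):
    """Pack the detections for one frame into a compact JSON payload."""
    message = {
        "camera_id": camera_id,
        "timestamp_ms": timestamp_ms,
        "detections": [asdict(d) for d in detections],
    }
    return json.dumps(message, separators=(",", ":")).encode("utf-8")


payload = encode_frame_observations(
    "cam-01",
    1700000000000,
    [
        Detection("person", 0.91, (412, 230, 64, 128), 7),
        Detection("vehicle", 0.84, (900, 510, 220, 140), 12),
    ],
)

raw_frame_bytes = 1920 * 1080 * 3  # one uncompressed 8-bit RGB 1080p frame
print(len(payload), "bytes of metadata vs", raw_frame_bytes, "bytes of raw pixels")
```

JSON is used here only because it is easy to read. A binary encoding would shrink the payload further, at the cost of needing a shared schema.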
This approach becomes even more important as the industry moves toward larger multimodal systems and emerging World Models. While these models offer remarkable capabilities, they do not necessarily require access to every pixel from every camera. In many cases, a World Model benefits more from structured observations than from raw imagery. A stream of object detections, motion vectors, classifications, geospatial coordinates, and contextual events is often more valuable than a compressed video stream that must be decoded and analyzed again.
Consider a drone monitoring an area. The onboard vision system can detect vehicles, people, boats, roads, and obstacles locally. Instead of continuously transmitting high-bandwidth video to a remote AI service, the drone can publish a compact stream of observations. A higher-level World Model can then reason about behavior, patterns, intentions, and mission objectives using pre-processed information. The expensive visual processing occurs once, at the edge, while the strategic reasoning layer operates on a significantly smaller and richer dataset.
This philosophy scales particularly well in distributed systems. A fleet of drones, robots, vehicles, or smart cameras can each perform local perception and then contribute structured knowledge to a shared operational picture. Bandwidth requirements decrease, latency improves, and the overall system becomes more resilient when connectivity is limited or intermittent.
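One way to picture the shared operational picture is as a small store that keeps the latest observation per device and track, tolerates late messages, and lets stale entries age out. The sketch below is a minimal version of that idea. The `Observation` fields, the `OperationalPicture` class, and the 10-second expiry are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """A structured observation reported by one edge device (hypothetical)."""
    device_id: str
    track_id: int
    label: str
    lat: float
    lon: float
    timestamp_s: float


class OperationalPicture:
    """Keeps the newest observation per (device, track) and expires old ones."""

    def __init__(self, max_age_s=10.0):
        self.max_age_s = max_age_s
        self._latest = {}

    def ingest(self, obs):
        key = (obs.device_id, obs.track_id)
        current = self._latest.get(key)
        # Drop out-of-order messages, which are plausible on intermittent links.
        if current is None or obs.timestamp_s >= current.timestamp_s:
            self._latest[key] = obs

    def snapshot(self, now_s):
        # Only observations that are still fresh count as part of the picture.
        return [o for o in self._latest.values()
                if now_s - o.timestamp_s <= self.max_age_s]


picture = OperationalPicture(max_age_s=10.0)
picture.ingest(Observation("drone-1", 3, "boat", 59.91, 10.75, 100.0))
picture.ingest(Observation("drone-2", 1, "vehicle", 59.92, 10.76, 104.0))
picture.ingest(Observation("drone-1", 3, "boat", 59.90, 10.74, 98.0))  # late, ignored

live = picture.snapshot(now_s=112.0)
print([o.device_id for o in live])  # ['drone-2']: drone-1's report is 12 s old
```

Because each device reports independently, a drone that loses its link simply drops out of the snapshot after the expiry window while the rest of the picture stays current.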
The messaging layer plays a critical role in this architecture. The ongoing MQTT versus Zenoh debate often frames the two as competing technologies, but practical edge systems should embrace both. MQTT remains one of the most mature, widely deployed, and operationally proven protocols in industrial IoT. Its ecosystem, tooling, and broker implementations make it an excellent choice for telemetry, commands, events, and integration with existing infrastructure.
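To make the MQTT side concrete, the sketch below lays out a possible topic hierarchy for camera detections and shows how subscribers could select streams with the protocol's `+` (single-level) and `#` (multi-level) wildcards. The `edge/<site>/<camera>/detections` layout and both helper functions are hypothetical. The matcher is a simplified illustration that skips rules such as the special handling of `$`-prefixed topics, and a real deployment would leave matching to the broker.

```python
def detections_topic(site, camera_id):
    """Build the topic a camera publishes its detections on (hypothetical layout)."""
    return f"edge/{site}/{camera_id}/detections"


def topic_matches(topic_filter, topic):
    """Simplified MQTT filter matching: '+' is one level, '#' is the remainder."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # '#' is only valid as the last level and matches everything below.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


topic = detections_topic("harbor", "cam-01")
print(topic)                                               # edge/harbor/cam-01/detections
print(topic_matches("edge/harbor/+/detections", topic))    # True: every harbor camera
print(topic_matches("edge/#", topic))                      # True: everything under edge/
print(topic_matches("edge/harbor/+/commands", topic))      # False: a different channel
```

Putting the site and camera in fixed positions lets a dashboard subscribe to one camera, one site, or the whole fleet without any change on the publishing side.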