In modern distributed systems, keeping network switches performing well while managing Site-2-Site connections in a mesh architecture is a persistent challenge. At ShitOps, we set out to optimize our network switches and improve Device Telemetry KPIs across our infrastructure, using asynchronous programming, Cloudflare's edge network, WebAssembly, and ArgoCD for continuous deployment.

Problem Statement

Our primary objective was to enhance the efficiency and responsiveness of switches in our Site-2-Site mesh network. The network's dynamism, combined with the need to collect and process vast amounts of Device Telemetry asynchronously, required us to rethink the architecture fundamentally. Traditional synchronous approaches led to latency spikes and dropped telemetry data, ultimately affecting key performance indicators critical to our operations.

To address these concerns, we designed a layered solution that orchestrates asynchronous telemetry processing, switch configuration, and mesh state synchronization.

Architectural Overview

Mesh Architecture with Site-2-Site Switches

The baseline is a mesh topology where multiple sites interconnect via Site-2-Site VPN switches. Each switch is responsible for forwarding telemetry data while maintaining a resilient and adaptive routing mechanism.
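
To get a feel for how such a topology grows, here is a minimal sketch that enumerates the tunnels in a full mesh. It assumes every site connects directly to every other site, which the description above does not state explicitly, and the site names are hypothetical.

```python
from itertools import combinations

def full_mesh_links(sites):
    """Return every unordered site pair, i.e. one tunnel per pair in a full mesh."""
    return list(combinations(sorted(sites), 2))

sites = ["ams", "fra", "lhr", "nyc"]  # hypothetical site names
links = full_mesh_links(sites)
print(len(links))  # 4 * 3 / 2 = 6 tunnels
```

With n sites a full mesh needs n(n-1)/2 tunnels, so the number of links each switch has to track rises quickly as sites are added.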

Asynchronous Telemetry Aggregation

We use an event-driven, asynchronous programming model: telemetry from devices connected to the switches is streamed in real time without blocking on I/O, so we can process and react to telemetry events promptly.
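
The pattern can be sketched with Python's `asyncio`. This is an illustration under assumptions, not our production code: the `device`, `aggregate`, and `run` functions, the device names, and the queue size are all hypothetical.

```python
import asyncio

async def device(name, queue, samples):
    """Hypothetical device: pushes telemetry samples onto a shared queue."""
    for value in samples:
        await queue.put((name, value))
        await asyncio.sleep(0)  # yield to the event loop between samples

async def aggregate(queue, totals):
    """Consume samples as they arrive and collect them per device."""
    while True:
        name, value = await queue.get()
        totals.setdefault(name, []).append(value)
        queue.task_done()

async def run(streams):
    # Bounded queue (assumed size): a slow consumer makes producers wait
    # instead of letting samples pile up without limit.
    queue = asyncio.Queue(maxsize=100)
    totals = {}
    consumer = asyncio.create_task(aggregate(queue, totals))
    await asyncio.gather(*(device(n, queue, s) for n, s in streams.items()))
    await queue.join()  # wait until every queued sample has been processed
    consumer.cancel()
    try:
        await consumer
    except asyncio.CancelledError:
        pass
    return totals

if __name__ == "__main__":
    print(asyncio.run(run({"sw1-port1": [3, 5], "sw1-port2": [7]})))
```

The bounded queue is the design choice worth noting: producers are suspended when the consumer falls behind, which is one way to avoid silently dropping telemetry.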

WebAssembly (Wasm) for Edge Processing

We deployed WebAssembly modules at the edge nodes. They run lightweight telemetry processors directly on the switches' management controllers, which reduces latency and processing overhead.
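
As an illustration of the kind of work such a processor might do, the sketch below collapses a window of raw samples into one summary record before it leaves the switch. It is written in Python as a stand-in for the logic only, not as Wasm, and the choice of summary fields is an assumption.

```python
from statistics import mean

def summarize(window):
    """Collapse a window of raw samples into one compact summary record."""
    if not window:
        return None  # nothing to report for an empty window
    return {
        "count": len(window),
        "min": min(window),
        "max": max(window),
        "mean": mean(window),
    }

# Hypothetical latency samples (ms) collected on one port during one window.
print(summarize([10, 20, 30]))
```

Sending one record per window instead of every raw sample is where the reduction in upstream traffic would come from.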

Cloudflare Edge for Enhanced Security and Load Balancing

Traffic between site switches and central processing nodes routes through Cloudflare's global edge network, ensuring secure, low-latency, and distributed ingress.

ArgoCD for Continuous Delivery of Configurations

Configurations for switches, telemetry processing workflows, and Wasm modules are managed declaratively and deployed via ArgoCD. This ensures consistency and rapid rollouts across all nodes in the mesh.
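
Declarative delivery means comparing the state declared in Git with the live state and applying only the difference. The sketch below illustrates that comparison in plain Python. It does not use any ArgoCD API, and `plan_sync` and the configuration keys are hypothetical.

```python
def plan_sync(desired, live):
    """Return the changes needed to move the live config to the desired one."""
    plan = {"set": {}, "remove": []}
    for key, value in desired.items():
        # Set keys that are missing or differ from the declared value.
        if key not in live or live[key] != value:
            plan["set"][key] = value
    # Remove keys that exist on the device but are no longer declared.
    plan["remove"] = sorted(k for k in live if k not in desired)
    return plan

desired = {"mtu": 9000, "telemetry_interval_s": 10}  # hypothetical declared config
live = {"mtu": 1500, "telemetry_interval_s": 10, "legacy_vlan": 42}
print(plan_sync(desired, live))
```

When the desired and live states already match, the plan is empty, which is what makes repeated syncs safe.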

Detailed Workflow

sequenceDiagram
    participant Device
    participant Switch
    participant WasmModule
    participant CloudflareEdge
    participant CentralTelemetryProcessor
    participant ArgoCD
    Device->>Switch: Send telemetry data asynchronously
    Switch->>WasmModule: Execute telemetry pre-processing (Wasm)
    WasmModule->>CloudflareEdge: Forward processed telemetry
    CloudflareEdge->>CentralTelemetryProcessor: Distribute telemetry data
    CentralTelemetryProcessor-->>ArgoCD: Monitor KPIs and configurations
    ArgoCD-->>Switch: Update switch configurations asynchronously
    Switch-->>Device: Acknowledge telemetry reception

Implementation Highlights

KPI Monitoring and Feedback

Devices generate a large volume of telemetry data points, which we aggregate and analyze in near real time to track network KPIs such as packet loss, latency, and switch utilization. When a KPI deviates from its expected range, ArgoCD automatically triggers configuration updates, deploying patch sets to the affected switches over the Site-2-Site links.
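
The deviation check can be sketched as a simple threshold comparison. The KPI names, the limits, and the `switches_needing_update` helper are assumptions for illustration; how a breach is then turned into a configuration rollout is not shown.

```python
# Hypothetical upper limits per KPI.
THRESHOLDS = {"packet_loss_pct": 1.0, "latency_ms": 50.0, "utilization_pct": 85.0}

def switches_needing_update(kpis, thresholds=THRESHOLDS):
    """Map each switch to the sorted list of KPIs that exceed their limit."""
    flagged = {}
    for switch, metrics in kpis.items():
        breaches = sorted(
            name for name, limit in thresholds.items()
            if metrics.get(name, 0.0) > limit  # a missing KPI counts as healthy
        )
        if breaches:
            flagged[switch] = breaches
    return flagged

kpis = {
    "sw-a": {"packet_loss_pct": 0.2, "latency_ms": 80.0, "utilization_pct": 90.0},
    "sw-b": {"packet_loss_pct": 0.1, "latency_ms": 12.0, "utilization_pct": 40.0},
}
print(switches_needing_update(kpis))  # {'sw-a': ['latency_ms', 'utilization_pct']}
```

Only switches with at least one breach appear in the result, so healthy switches are left untouched.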

Conclusion

By combining asynchronous programming, WebAssembly at the edge, Cloudflare's edge network, and ArgoCD-driven continuous delivery, we built a responsive and scalable solution for optimizing switch performance in Site-2-Site mesh architectures. The result shows how modern distributed systems technologies can extend traditional network management and improve both telemetry processing quality and operational KPIs.

Stay tuned for more architectures and engineering breakthroughs from ShitOps!