To attract order flow and effectively match orders, matching engines have been designed to handle extremely high message rates at low latencies. Historically, these environments were built with proprietary technology deployed on dedicated on-premise infrastructure to minimize latency and control jitter. We believe that innovations in cloud infrastructure allow the configuration and implementation of an equally low-latency trading platform in a cloud environment.
Earlier this year, BJSS published a white paper documenting our performance testing of an FX Trading System in three public clouds (AWS, Azure and OCI). We demonstrated deterministic latency of under half a millisecond for an FX trade flow between an Order Book and a Matching Engine at a rate of 10,000 orders/second. We believe these tests are important because they show that cloud-based applications can attain deterministic latency at the sub-millisecond level. Jitter can be mitigated by applying standard tuning techniques and by eliminating noisy neighbours through deployment on bare-metal instances or non-shared virtual instances. Many public cloud providers offer these capabilities.
However, these results are constrained by the latencies inherent in a TCP/Ethernet-based network architecture. We felt that we could achieve lower latencies by replacing it with a low-latency fabric that supports cut-through routing and replaces buffer copying and kernel interrupts with direct transfers between the network card and application memory. RDMA (Remote Direct Memory Access) supports this over InfiniBand and Converged Ethernet.
Oracle agreed to support this work by providing access to their HPC (High-Performance Computing) cluster in OCI (Oracle Cloud Infrastructure) and to their HPC team. This environment provides access to a RoCE v2 low-latency fabric. RoCE (RDMA over Converged Ethernet) is a network protocol that carries RDMA traffic over an Ethernet network. It copies data directly between memory spaces on two different hosts, bypassing the operating system and CPU, which reduces CPU usage and latency compared with standard TCP over Ethernet.
We ran our trading system in this environment at 10,000 orders/second and reduced latency to a minimum of 6.66μs, a 99th percentile of 12.48μs and a maximum of 402μs. We then increased throughput to 500,000 orders/second and achieved a 99th-percentile latency of 11.20μs, but we measured significantly higher outliers at this volume (a maximum of 37ms). Details of our results are captured in a white paper.
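For readers who want to produce the same kind of summary from their own measurements, the following is a minimal Python sketch. It is not the tooling used in our tests. It assumes latencies have been collected as a list of microsecond values, and the function names (percentile, summarize, inter_arrival_us) and the nearest-rank percentile method are illustrative choices. The sample values in the example are synthetic.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples_us):
    """Return the three figures quoted in the text: min, p99 and max."""
    return {
        "min": min(samples_us),
        "p99": percentile(samples_us, 99),
        "max": max(samples_us),
    }


def inter_arrival_us(orders_per_second):
    """Average spacing between orders, in microseconds, at a steady rate."""
    return 1_000_000 / orders_per_second


if __name__ == "__main__":
    # Synthetic data (an assumption for illustration only):
    # 99 fast samples and one slow outlier.
    synthetic = [7.0] * 99 + [400.0]
    print(summarize(synthetic))
    print(inter_arrival_us(10_000), inter_arrival_us(500_000))
```

With the synthetic data, the 99th percentile stays at the fast value while the maximum captures the single outlier, which is why both figures are worth reporting. The spacing calculation also shows how much less headroom there is at the higher rate: orders arrive on average every 100μs at 10,000 orders/second, but every 2μs at 500,000 orders/second.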
We believe that further tuning could reduce jitter, especially the high outliers. We think these results demonstrate that a public cloud service provider, in this case OCI, can support deterministic latencies at the 10μs level at very high message volumes. There is sufficient evidence to justify exploring the deployment of latency-sensitive applications to OCI. This is significant because services requiring this service level could avoid expensive on-site deployments.