Antipode: Enforcing Cross-Service Causal Consistency in Distributed Applications
Status:: 🟧
Links:: Data Consistency in Distributed Systems
Metadata
Authors:: Ferreira Loff, João; Porto, Daniel; Garcia, João; Mace, Jonathan; Rodrigues, Rodrigo
Title:: Antipode: Enforcing Cross-Service Causal Consistency in Distributed Applications
Date:: 2023
Publisher:: Association for Computing Machinery
URL:: https://dl.acm.org/doi/10.1145/3600006.3613176
DOI:: 10.1145/3600006.3613176
Bibliography
Ferreira Loff, J., Porto, D., Garcia, J., Mace, J., & Rodrigues, R. (2023). Antipode: Enforcing Cross-Service Causal Consistency in Distributed Applications. Proceedings of the 29th Symposium on Operating Systems Principles, 298–313. https://doi.org/10.1145/3600006.3613176
Zotero
Type:: #zotero/conferencePaper
Keywords:: [Consistency, Distributed Transcations, Saga Pattern, Transactions]
Relations
Abstract
Modern internet-scale applications suffer from cross-service inconsistencies, arising because applications combine multiple independent and mutually-oblivious datastores. The end-to-end execution flow of each user request spans many different services and datastores along the way, implicitly establishing ordering dependencies among operations at different datastores. Readers should observe this ordering and, in today's systems, they do not. In this work, we present Antipode, a bolt-on technique for preventing cross-service consistency violations in distributed applications. It enforces cross-service consistency by propagating lineages of datastore operations both alongside end-to-end requests and within datastores. Antipode enables a novel cross-service causal consistency model, which extends existing causality models, and whose enforcement requires us to bring in a series of technical contributions to address fundamental semantic, scalability, and deployment challenges. We implemented Antipode as an application-level library, which can easily be integrated into existing applications with minimal effort, is incrementally deployable, and does not require global knowledge of all datastore operations. We apply Antipode to eight open-source and public cloud datastores and two microservice benchmark applications. Our evaluation demonstrates that Antipode is able to prevent cross-service inconsistencies with limited programming effort and less than 2% impact on end-user latency and throughput.
Notes & Annotations
📑 Annotations (imported on 2024-01-16#10:57:11)
Although consistency is well-studied in the context of individual distributed datastores, new issues appear when an application uses multiple independent distributed datastores. In particular, a single end-to-end request can make multiple writes to multiple different datastores over the course of its execution; these writes are issued by the different services (and machines) traversed by the request. As a whole, the request establishes an implicit visibility ordering for its writes, which readers must respect if they are to be consistent
To address these challenges, we present Antipode, a system that enforces cross-service causal consistency for applications with requests that span multiple processes and interact with multiple datastores.
Furthermore, the complexity of end-to-end request flows and resulting graphs of interacting services can be daunting: a single user request may span hundreds of sub-queries that traverse multiple microservices [2, 36, 59]. We therefore argue that these architectural features are fertile ground for cross-service inconsistencies.
[2] Phillipe Ajoux, Nathan Bronson, Sanjeev Kumar, Wyatt Lloyd, and Kaushik Veeraraghavan. 2015. Challenges to Adopting Stronger Consistency at Scale. In 15th Workshop on Hot Topics in Operating Systems (HotOS’15). https://www.usenix.org/conference/hotos15/workshop-program/presentation/ajoux (§1, 2.1, 2.2, and 5.1).
[36] Yu Gan, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Yanqi Zhang, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, Christina Delimitrou, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, and Brian Ritchken. 2019. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems. In 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’19). https://doi.org/10.1145/3297858.3304013 (§1, 2.1, and 7).
→Microservices Reference and Benchmark Applications#DeathStarBench
[59] Xiao Shi, Scott Pruett, Kevin Doherty, Jinyu Han, Dmitri Petrov, Jim Carrig, John Hugg, and Nathan Bronson. 2020. FlightTracker: Consistency across read-optimized online stores at Facebook. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’20). https://www.usenix.org/conference/osdi20/presentation/shi (§1, 2.1, 3.4, 6.1, and 8).
In terms of the scale of the deployment, we found that out of Alibaba’s more than 17k microservices, more than 80% are stateful services, namely databases, caches or message queues. The prevalence of stateful services in the requests’ large call graphs is also very high.
Uber has also disclosed comparable findings regarding the intricacies of their call graphs [65]. In particular, they report that a single request can call up to 1400 unique endpoints, has an average of 112 RPC calls per request – and a maximum of 275k – and has an average request depth of 8.5 and a maximum of 35.
[65] Zhizhou Zhang, Murali Krishna Ramanathan, Prithvi Raj, Abhishek Parwal, Timothy Sherwood, and Milind Chabbi. 2022. CRISP: Critical Path Analysis of Large-Scale Microservice Architectures. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). https://www.usenix.org/conference/atc22/presentation/zhang-zhizhou (§1 and 2.1).
Developed as a testbed for replicating industrial faults, the TrainTicket benchmark [66] is a microservice-based application that provides typical ticket booking functionalities, such as ticket reservation and payment. It is implemented in Java and it consists of more than 40 services including web servers, datastores and queues.
[66] Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Chao Ji, Wenhai Li, and Dan Ding. 2021. Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study. IEEE Transactions on Software Engineering 22, 4 (2021), 243–260. https://doi.org/10.1109/TSE.2018.2887384 (§1, 7, 7.1, and 7.4).
Existing noteworthy systems that address cross-service consistency take one of the following approaches: they either wrap requests in other abstractions, resort to centralized coordination mechanisms, add transactions to the design, or propose an overall revision of the system architecture.
Developed at Facebook, the FlightTracker [59] metadata server was designed to provide ready-our-writes (RYW) guarantees across a variety of datastores. It identifies a user session through a ticket abstraction, to which all of a user’s write operations are associated. Tickets are created and updated through a metadata server, and are passed between different services and datastores.
Traditionally, for multiple systems to interact, the use of a logically centralized coordination mechanism like Zookeeper [39] was the natural design choice. However, strongly consistent coordination systems introduce a performance bottleneck and go against microservice principles. In particular, the proponents of this class of architectures encourage the community to embrace eventual consistency due to its better performance and higher scalability [34].
[39] Patrick Hunt, Mahadev Konar, Flavio Paiva Junqueira, and Benjamin Reed. 2010. ZooKeeper: Wait-free Coordination for Internetscale Systems. In 2010 USENIX Annual Technical Conference (ATC ’10). https://www.usenix.org/conference/usenix-atc-10/zookeeperwait-free-coordination-internet-scale-systems (§8).
[34] Martin Fowler. 2015. Microservice Trade-Offs. https://martinfowler.com/articles/microservice-trade-offs.html (§1, 2.1, 3.3, and 8).
Distributed transactions can also be applied in this context, e.g., in the form of 2PC protocols [15]. Just as coordination-based approaches, distributed transactions suffer from low performance [61]. Faced with this problem, the community migrated towards an approach known as Sagas [37].
[15] Philip A Bernstein, Nathan Goodman, and Vassos Hadzilacos. 1987. Concurrency Control and Recovery in Database Systems. Addison-Wesley. https://dl.acm.org/doi/book/10.5555/17299 (§8).
[61] Michael Stonebraker and Ugur Çetintemel. 2005. "One Size Fits All": An Idea Whose Time Has Come and Gone. In 21st International Conference on Data Engineering (ICDE ’05). https://doi.org/10.1109/ICDE.2005.1 (§8).
[37] Hector Garcia-Molina and Kenneth Salem. 1987. Sagas. ACM SIGMOD Record 16, 3 (1987), 249–259. https://doi.org/10.1145/38714.38742 (§8).
→@Garcia-Molina.Salem.1987.Sagas
Although sagas gained acceptance within the community [22, 28], they fall short when compensating mechanisms are not possible or hard to achieve. Reversal is especially challenging when transactions trigger third-party side effects [4]. Furthermore, Sagas often still rely on an orchestrator-like entity that sequences the steps of a saga.
[22] Chris Richardson. 2021. Microservices.io. https://microservices.io/ (§8).
[28] Eventuate. 2021. Eventuate. https://eventuate.io/ (§8).
[4] Remzi Can Aksoy and Manos Kapritsos. 2019. Aegean: replication beyond the client-server model. In 27th ACM Symposium on Operating Systems Principles (SOSP ’19). https://doi.org/10.1145/3341301.3359663 (§1 and 8).
There are also proposals that opt for a complete redesign of the application, often ending up merging different services – and their respective datastores – into a single one. A good example of this approach is Diamond [64], where a reactive application with two different datastores (distributed storage and notification service), was merged into a single service that provided both functionalities.
[64] Irene Zhang, Niel Lebeck, Ariadna Norberg, Pedro Fonseca, Brandon Holt, Raymond Cheng, Arvind Krishnamurthy, and Henry M Levy. 2016. Diamond: Automating Data Management and Storage for Wide-area, Reactive Applications. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16). https://www.usenix.org/conference/osdi16/technical-sessions/presentation/zhang-irene (§8).
In fact, service re-design is a hot topic in the microservice community [22, 35]. Aegean [4] proposes a redesign of the replication layer of datastores that are based on state machine replication [57], with the intention of ensuring that dependent services always see a strongly consistent view of ongoing operations.
[35] Jonas Fritzsch, Justus Bogner, Stefan Wagner, and Alfred Zimmermann. 2019. Microservices Migration in Industry: Intentions, Strategies, and Challenges. In EEE International Conference on Software Maintenance and Evolution (ICSME ’19). https://doi.org/10.1109/ICSME.2019.00081 (§8).
[4] Remzi Can Aksoy and Manos Kapritsos. 2019. Aegean: replication beyond the client-server model. In 27th ACM Symposium on Operating Systems Principles (SOSP ’19). https://doi.org/10.1145/3341301.3359663 (§1 and 8).
[57] Fred B. Schneider. 1990. Implementing fault-tolerant services using the state machine approach: a tutorial. Comput. Surveys 22, 4 (1990), 299–319. https://doi.org/10.1145/98163.98167 (§8 and A).
Microservices emerged as an architectural style that provides loosely coupled and independently deployable services, leading to good scalability, performance, and maintainability. And while they fulfilled this promise, data consistency was sacrificed and developers were left with accepting eventual consistency as the norm.
📑 Annotations (imported on 2024-01-16#11:41:33)
As straw man solutions, we could ameliorate both classes of problems by strengthening the guarantees of post-storage to make its replication synchronous, but this introduces undesirable delays that are discouraged in practice [34, 47, 48].
[34] Martin Fowler. 2015. Microservice Trade-Offs. https://martinfowler. com/articles/microservice-trade-offs.html (§1, 2.1, 3.3, and 8).
[47] Haonan Lu, Kaushik Veeraraghavan, Philippe Ajoux, Jim Hunt, Yee Jiun Song, Wendy Tobagus, Sanjeev Kumar, and Wyatt Lloyd. 2015. Existential consistency. In 25th ACM Symposium on Operating Systems Principles (SOSP ’15). https://doi.org/10.1145/2815400.2815426 (§3.3).
[48] Shutian Luo, Huanle Xu, Chengzhi Lu, Kejiang Ye, Guoyao Xu, Liping Zhang, Yu Ding, Jian He, and Chengzhong Xu. 2021. Characterizing microservice dependency and performance: Alibaba trace analysis. In 2021 ACM Symposium on Cloud Computing (SoCC ’21). https://doi.org/10.1145/3472883.3487003 (§1, 2.1, 3.1, 3.2, 3.3, and 4.1).]
We could also try to incorporate more global knowledge about end-to-end requests. For example, the notifier service could manually check the post-storage service before delivering the notification. Alternatively, all datastores could synchronize their replication progress. Generically, this requires developers to enforce consistency at an application-wide scale – which, although it is the status-quo in microservice-based applications [34, 42], is precisely the burden we aim to minimize. Overall, these approaches break the design philosophy of microservice applications, which intentionally imposes strict boundaries and loose coupling between services, to enable rapid and independent development [34, 42].
[42] Rodrigo Laigner, Yongluan Zhou, Marcos Antonio Vaz Salles, Yijian Liu, and Marcos Kalinowski. 2021. Data management in microservices. VLDB Endowment 14, 13 (2021), 3328–3361. https://doi.org/10.14778/3484224.3484232 (§1, 2.1, and 3.3).
Recently, Google introduced Service Weaver [38], a framework that enables developers to build microservice-based application using a programming model similar to writing a monolith application. While this framework helps developers tame the complexity of managing a microservice deployment, it is not meant to address either data placement or possible cross-service inconsistencies.
[38] Sanjay Ghemawat, Robert Grandl, Srdjan Petrovic, Michael Whittaker, Parveen Patel, Ivan Posva, and Amin Vahdat. 2023. Towards Modern Development of Cloud Applications. In 19th Workshop on Hot Topics in Operating SystemsJune 2023 (HotOS ’23). https://doi.org/10.1145/3593856.3595909 (§3.3).