License: Creative Commons Attribution 3.0 Unported license (CC BY 3.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.ECOOP.2017.1
URN: urn:nbn:de:0030-drops-72503
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2017/7250/


Schulte, Wolfram

Challenges to Achieving High Availability at Scale (Invited Talk)

pdf-format:
LIPIcs-ECOOP-2017-1.pdf (0.2 MB)


Abstract

Facebook is a social network that connects more than 1.8 billion people. Serving this many users requires infrastructure composed of thousands of interdependent systems that span geographically distributed data centers. But what is the guiding principle for building and operating these systems?

For Facebook’s infrastructure teams the answer is: Systems must always be available and never lose data. This talk will explore this quest. We will focus on three aspects.

Availability and consistency. What form of consistency do Facebook’s systems guarantee? Strong consistency is easy to understand but incurs latency penalties; weak consistency is fast but difficult for developers and users to reason about. We describe our use of eventual consistency and delve into how Facebook constructs its caching and replicated storage systems to minimize the time needed to reach consistency. We share empirical data that measures the effectiveness of our design.

Availability and correctness. With network partitions, relaxed forms of consistency, and software bugs, how do we guarantee a consistent state? We present two systems to find and repair structural errors in Facebook’s social graph, one batch and one real-time.

Availability and scale. Sharding is one of the standard answers to operating at scale. But how can we develop one system that can shard storage as well as compute? We will introduce a new Sharding-as-a-Service component. We will show and evaluate how its design and service policies control for latency, failure tolerance, and operational efficiency.

BibTeX - Entry

@InProceedings{schulte:LIPIcs:2017:7250,
  author =	{Wolfram Schulte},
  title =	{{Challenges to Achieving High Availability at Scale (Invited Talk)}},
  booktitle =	{31st European Conference on Object-Oriented Programming (ECOOP 2017)},
  pages =	{1:1--1:1},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-035-4},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{74},
  editor =	{Peter M{\"u}ller},
  publisher =	{Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{http://drops.dagstuhl.de/opus/volltexte/2017/7250},
  URN =		{urn:nbn:de:0030-drops-72503},
  doi =		{10.4230/LIPIcs.ECOOP.2017.1},
  annote =	{Keywords: Distributed Systems, Availability, Reliability, Fault Tolerance, Consistency, Scalability, Replication, Sharding, Caching}
}

Keywords: Distributed Systems, Availability, Reliability, Fault Tolerance, Consistency, Scalability, Replication, Sharding, Caching
Collection: 31st European Conference on Object-Oriented Programming (ECOOP 2017)
Issue Date: 2017
Date of publication: 16.06.2017

