# Spanner: Google's Globally-Distributed Database
  2.  
  3. ## Abstract
Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.
  5. 1 Introduction
  6. Spanner is a scalable, globally-distributed database designed,
  7. built, and deployed at Google. At the highest
  8. level of abstraction, it is a database that shards data
  9. across many sets of Paxos [21] state machines in datacenters
  10. spread all over the world. Replication is used for
  11. global availability and geographic locality; clients automatically
  12. failover between replicas. Spanner automatically
  13. reshards data across machines as the amount of data
  14. or the number of servers changes, and it automatically
  15. migrates data across machines (even across datacenters)
  16. to balance load and in response to failures. Spanner is
  17. designed to scale up to millions of machines across hundreds
  18. of datacenters and trillions of database rows.
  19. Applications can use Spanner for high availability,
  20. even in the face of wide-area natural disasters, by replicating
  21. their data within or even across continents. Our
  22. initial customer was F1 [35], a rewrite of Google’s advertising
  23. backend. F1 uses five replicas spread across
  24. the United States. Most other applications will probably
  25. replicate their data across 3 to 5 datacenters in one geographic
  26. region, but with relatively independent failure
  27. modes. That is, most applications will choose lower latency
  28. over higher availability, as long as they can survive
  29. 1 or 2 datacenter failures.
  30. Spanner’s main focus is managing cross-datacenter
  31. replicated data, but we have also spent a great deal of
  32. time in designing and implementing important database
  33. features on top of our distributed-systems infrastructure.
  34. Even though many projects happily use Bigtable [9], we
  35. have also consistently received complaints from users
  36. that Bigtable can be difficult to use for some kinds of applications:
  37. those that have complex, evolving schemas,
  38. or those that want strong consistency in the presence of
  39. wide-area replication. (Similar claims have been made
  40. by other authors [37].) Many applications at Google
have chosen to use Megastore [5] because of its semi-relational data model and support for synchronous replication,
  43. despite its relatively poor write throughput. As a
  44. consequence, Spanner has evolved from a Bigtable-like
  45. versioned key-value store into a temporal multi-version
  46. database. Data is stored in schematized semi-relational
  47. tables; data is versioned, and each version is automatically
  48. timestamped with its commit time; old versions of
  49. data are subject to configurable garbage-collection policies;
  50. and applications can read data at old timestamps.
  51. Spanner supports general-purpose transactions, and provides
  52. a SQL-based query language.
  53. As a globally-distributed database, Spanner provides
  54. several interesting features. First, the replication configurations
  55. for data can be dynamically controlled at a
  56. fine grain by applications. Applications can specify constraints
  57. to control which datacenters contain which data,
  58. how far data is from its users (to control read latency),
  59. how far replicas are from each other (to control write latency),
  60. and how many replicas are maintained (to control
  61. durability, availability, and read performance). Data
  62. can also be dynamically and transparently moved between
  63. datacenters by the system to balance resource usage
  64. across datacenters. Second, Spanner has two features
  65. that are difficult to implement in a distributed database: it
  67. provides externally consistent [16] reads and writes, and
  68. globally-consistent reads across the database at a timestamp.
  69. These features enable Spanner to support consistent
  70. backups, consistent MapReduce executions [12],
  71. and atomic schema updates, all at global scale, and even
  72. in the presence of ongoing transactions.
  73. These features are enabled by the fact that Spanner assigns
  74. globally-meaningful commit timestamps to transactions,
  75. even though transactions may be distributed.
  76. The timestamps reflect serialization order. In addition,
  77. the serialization order satisfies external consistency (or
  78. equivalently, linearizability [20]): if a transaction T1
  79. commits before another transaction T2 starts, then T1’s
  80. commit timestamp is smaller than T2’s. Spanner is the
  81. first system to provide such guarantees at global scale.
  82. The key enabler of these properties is a new TrueTime
  83. API and its implementation. The API directly exposes
  84. clock uncertainty, and the guarantees on Spanner’s timestamps
  85. depend on the bounds that the implementation provides.
  86. If the uncertainty is large, Spanner slows down to
  87. wait out that uncertainty. Google’s cluster-management
  88. software provides an implementation of the TrueTime
  89. API. This implementation keeps uncertainty small (generally
  90. less than 10ms) by using multiple modern clock
  91. references (GPS and atomic clocks).
  92. Section 2 describes the structure of Spanner’s implementation,
  93. its feature set, and the engineering decisions
  94. that went into their design. Section 3 describes our new
  95. TrueTime API and sketches its implementation. Section
  96. 4 describes how Spanner uses TrueTime to implement
externally-consistent distributed transactions, lock-free
  98. read-only transactions, and atomic schema updates.
  99. Section 5 provides some benchmarks on Spanner’s performance
  100. and TrueTime behavior, and discusses the experiences
  101. of F1. Sections 6, 7, and 8 describe related and
  102. future work, and summarize our conclusions.
  103. 2 Implementation
  104. This section describes the structure of and rationale underlying
  105. Spanner’s implementation. It then describes the
  106. directory abstraction, which is used to manage replication
  107. and locality, and is the unit of data movement. Finally,
  108. it describes our data model, why Spanner looks
  109. like a relational database instead of a key-value store, and
  110. how applications can control data locality.
  111. A Spanner deployment is called a universe. Given
  112. that Spanner manages data globally, there will be only
  113. a handful of running universes. We currently run a
  114. test/playground universe, a development/production universe,
  115. and a production-only universe.
  116. Spanner is organized as a set of zones, where each
zone is the rough analog of a deployment of Bigtable servers [9].

Figure 1: Spanner server organization.

Zones are the unit of administrative deployment.
  120. The set of zones is also the set of locations across
  121. which data can be replicated. Zones can be added to or
  122. removed from a running system as new datacenters are
  123. brought into service and old ones are turned off, respectively.
  124. Zones are also the unit of physical isolation: there
  125. may be one or more zones in a datacenter, for example,
  126. if different applications’ data must be partitioned across
  127. different sets of servers in the same datacenter.
  128. Figure 1 illustrates the servers in a Spanner universe.
  129. A zone has one zonemaster and between one hundred
  130. and several thousand spanservers. The former assigns
  131. data to spanservers; the latter serve data to clients. The
  132. per-zone location proxies are used by clients to locate
  133. the spanservers assigned to serve their data. The universe
  134. master and the placement driver are currently singletons.
  135. The universe master is primarily a console that
  136. displays status information about all the zones for interactive
  137. debugging. The placement driver handles automated
  138. movement of data across zones on the timescale
  139. of minutes. The placement driver periodically communicates
  140. with the spanservers to find data that needs to be
  141. moved, either to meet updated replication constraints or
  142. to balance load. For space reasons, we will only describe
  143. the spanserver in any detail.
  144. 2.1 Spanserver Software Stack
  145. This section focuses on the spanserver implementation
  146. to illustrate how replication and distributed transactions
  147. have been layered onto our Bigtable-based implementation.
  148. The software stack is shown in Figure 2. At the
  149. bottom, each spanserver is responsible for between 100
  150. and 1000 instances of a data structure called a tablet. A
  151. tablet is similar to Bigtable’s tablet abstraction, in that it
  152. implements a bag of the following mappings:
(key:string, timestamp:int64) → string
  154. Unlike Bigtable, Spanner assigns timestamps to data,
  155. which is an important way in which Spanner is more
like a multi-version database than a key-value store.

Figure 2: Spanserver software stack.

A tablet's state is stored in a set of B-tree-like files and a
  160. write-ahead log, all on a distributed file system called
  161. Colossus (the successor to the Google File System [15]).
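To make the mapping concrete, here is a minimal sketch (in Python, purely illustrative and not Spanner's implementation) of a multi-version tablet: a bag of (key, timestamp) → value entries in which a read at a timestamp returns the newest version at or below that timestamp. The class and method names are hypothetical.

```python
import bisect
from collections import defaultdict

class Tablet:
    """Toy multi-version tablet: (key, timestamp) -> value."""
    def __init__(self):
        # For each key, keep (timestamp, value) pairs sorted so reads can binary-search.
        self._versions = defaultdict(list)

    def write(self, key, timestamp, value):
        bisect.insort(self._versions[key], (timestamp, value))

    def read(self, key, timestamp):
        """Return the newest value whose version timestamp <= the read timestamp."""
        versions = self._versions[key]
        i = bisect.bisect_right(versions, (timestamp, chr(0x10FFFF)))
        return versions[i - 1][1] if i else None

t = Tablet()
t.write("users/1", timestamp=10, value="alice")
t.write("users/1", timestamp=20, value="alice-v2")
assert t.read("users/1", 15) == "alice"      # read in the past
assert t.read("users/1", 25) == "alice-v2"   # latest version
```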
  162. To support replication, each spanserver implements a
  163. single Paxos state machine on top of each tablet. (An
  164. early Spanner incarnation supported multiple Paxos state
  165. machines per tablet, which allowed for more flexible
  166. replication configurations. The complexity of that design
  167. led us to abandon it.) Each state machine stores
  168. its metadata and log in its corresponding tablet. Our
  169. Paxos implementation supports long-lived leaders with
  170. time-based leader leases, whose length defaults to 10
  171. seconds. The current Spanner implementation logs every
  172. Paxos write twice: once in the tablet’s log, and once
  173. in the Paxos log. This choice was made out of expediency,
  174. and we are likely to remedy this eventually. Our
  175. implementation of Paxos is pipelined, so as to improve
  176. Spanner’s throughput in the presence of WAN latencies;
  177. but writes are applied by Paxos in order (a fact on which
  178. we will depend in Section 4).
  179. The Paxos state machines are used to implement a
  180. consistently replicated bag of mappings. The key-value
  181. mapping state of each replica is stored in its corresponding
  182. tablet. Writes must initiate the Paxos protocol at the
  183. leader; reads access state directly from the underlying
  184. tablet at any replica that is sufficiently up-to-date. The
  185. set of replicas is collectively a Paxos group.
  186. At every replica that is a leader, each spanserver implements
  187. a lock table to implement concurrency control.
  188. The lock table contains the state for two-phase locking:
  189. it maps ranges of keys to lock states. (Note that
  190. having a long-lived Paxos leader is critical to efficiently
  191. managing the lock table.) In both Bigtable and Spanner,
  192. we designed for long-lived transactions (for example,
  193. for report generation, which might take on the order
  194. of minutes), which perform poorly under optimistic concurrency
control in the presence of conflicts. Operations that require synchronization, such as transactional reads, acquire locks in the lock table; other operations bypass the lock table.

Figure 3: Directories are the unit of data movement between Paxos groups.
  201. At every replica that is a leader, each spanserver also
  202. implements a transaction manager to support distributed
  203. transactions. The transaction manager is used to implement
  204. a participant leader; the other replicas in the group
  205. will be referred to as participant slaves. If a transaction
  206. involves only one Paxos group (as is the case for
  207. most transactions), it can bypass the transaction manager,
  208. since the lock table and Paxos together provide transactionality.
  209. If a transaction involves more than one Paxos
group, those groups' leaders coordinate to perform two-phase
  211. commit. One of the participant groups is chosen as
  212. the coordinator: the participant leader of that group will
  213. be referred to as the coordinator leader, and the slaves of
  214. that group as coordinator slaves. The state of each transaction
  215. manager is stored in the underlying Paxos group
  216. (and therefore is replicated).
  217. 2.2 Directories and Placement
  218. On top of the bag of key-value mappings, the Spanner
  219. implementation supports a bucketing abstraction called a
  220. directory, which is a set of contiguous keys that share a
  221. common prefix. (The choice of the term directory is a
  222. historical accident; a better term might be bucket.) We
  223. will explain the source of that prefix in Section 2.3. Supporting
  224. directories allows applications to control the locality
  225. of their data by choosing keys carefully.
  226. A directory is the unit of data placement. All data in
  227. a directory has the same replication configuration. When
  228. data is moved between Paxos groups, it is moved directory
  229. by directory, as shown in Figure 3. Spanner might
  230. move a directory to shed load from a Paxos group; to put
  231. directories that are frequently accessed together into the
  232. same group; or to move a directory into a group that is
  233. closer to its accessors. Directories can be moved while
  234. client operations are ongoing. One could expect that a
  235. 50MB directory can be moved in a few seconds.
  236. The fact that a Paxos group may contain multiple directories
  237. implies that a Spanner tablet is different from
  239. a Bigtable tablet: the former is not necessarily a single
  240. lexicographically contiguous partition of the row space.
  241. Instead, a Spanner tablet is a container that may encapsulate
  242. multiple partitions of the row space. We made this
  243. decision so that it would be possible to colocate multiple
  244. directories that are frequently accessed together.
  245. Movedir is the background task used to move directories
  246. between Paxos groups [14]. Movedir is also used
  247. to add or remove replicas to Paxos groups [25], because
  248. Spanner does not yet support in-Paxos configuration
  249. changes. Movedir is not implemented as a single
  250. transaction, so as to avoid blocking ongoing reads and
  251. writes on a bulky data move. Instead, movedir registers
  252. the fact that it is starting to move data and moves the data
  253. in the background. When it has moved all but a nominal
  254. amount of the data, it uses a transaction to atomically
  255. move that nominal amount and update the metadata for
  256. the two Paxos groups.
A directory is also the smallest unit whose geographic-replication
  258. properties (or placement, for short) can
  259. be specified by an application. The design of our
  260. placement-specification language separates responsibilities
  261. for managing replication configurations. Administrators
  262. control two dimensions: the number and types of
  263. replicas, and the geographic placement of those replicas.
  264. They create a menu of named options in these two dimensions
  265. (e.g., North America, replicated 5 ways with
  266. 1 witness). An application controls how data is replicated,
  267. by tagging each database and/or individual directories
  268. with a combination of those options. For example,
  269. an application might store each end-user’s data in its own
  270. directory, which would enable user A’s data to have three
  271. replicas in Europe, and user B’s data to have five replicas
  272. in North America.
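As a rough illustration of this split of responsibilities, the sketch below shows an administrator-defined menu of placement options and per-directory tags. The names and structure are hypothetical, not Spanner's actual placement-specification language.

```python
# Hypothetical placement menu created by administrators: each option fixes
# the number/types of replicas and their geographic placement.
PLACEMENT_OPTIONS = {
    "us-5way-1witness": {"region": "North America", "replicas": 5, "witnesses": 1},
    "eu-3way":          {"region": "Europe",        "replicas": 3, "witnesses": 0},
}

# Applications tag databases and/or individual directories with an option.
directory_placement = {
    "users/userA/": "eu-3way",            # user A's data: 3 replicas in Europe
    "users/userB/": "us-5way-1witness",   # user B's data: 5 replicas in North America
}
```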
  273. For expository clarity we have over-simplified. In fact,
  274. Spanner will shard a directory into multiple fragments
  275. if it grows too large. Fragments may be served from
  276. different Paxos groups (and therefore different servers).
  277. Movedir actually moves fragments, and not whole directories,
  278. between groups.
  279. 2.3 Data Model
  280. Spanner exposes the following set of data features
  281. to applications: a data model based on schematized
  282. semi-relational tables, a query language, and generalpurpose
  283. transactions. The move towards supporting
  284. these features was driven by many factors. The
  285. need to support schematized semi-relational tables and
  286. synchronous replication is supported by the popularity
  287. of Megastore [5]. At least 300 applications within
  288. Google use Megastore (despite its relatively low performance)
  289. because its data model is simpler to manage
  290. than Bigtable’s, and because of its support for synchronous
  291. replication across datacenters. (Bigtable only
  292. supports eventually-consistent replication across datacenters.)
  293. Examples of well-known Google applications
  294. that use Megastore are Gmail, Picasa, Calendar, Android
Market, and AppEngine. The need to support a SQL-like query language in Spanner was also clear, given the popularity of Dremel [28] as an interactive data-analysis
  298. tool. Finally, the lack of cross-row transactions
  299. in Bigtable led to frequent complaints; Percolator [32]
  300. was in part built to address this failing. Some authors
  301. have claimed that general two-phase commit is too expensive
  302. to support, because of the performance or availability
  303. problems that it brings [9, 10, 19]. We believe it
  304. is better to have application programmers deal with performance
  305. problems due to overuse of transactions as bottlenecks
  306. arise, rather than always coding around the lack
  307. of transactions. Running two-phase commit over Paxos
  308. mitigates the availability problems.
  309. The application data model is layered on top of the
  310. directory-bucketed key-value mappings supported by the
  311. implementation. An application creates one or more
  312. databases in a universe. Each database can contain an
  313. unlimited number of schematized tables. Tables look
  314. like relational-database tables, with rows, columns, and
  315. versioned values. We will not go into detail about the
  316. query language for Spanner. It looks like SQL with some
  317. extensions to support protocol-buffer-valued fields.
  318. Spanner’s data model is not purely relational, in that
  319. rows must have names. More precisely, every table is required
  320. to have an ordered set of one or more primary-key
  321. columns. This requirement is where Spanner still looks
  322. like a key-value store: the primary keys form the name
  323. for a row, and each table defines a mapping from the
  324. primary-key columns to the non-primary-key columns.
  325. A row has existence only if some value (even if it is
  326. NULL) is defined for the row’s keys. Imposing this structure
  327. is useful because it lets applications control data locality
  328. through their choices of keys.
  329. Figure 4 contains an example Spanner schema for storing
  330. photo metadata on a per-user, per-album basis. The
  331. schema language is similar to Megastore’s, with the additional
  332. requirement that every Spanner database must
  333. be partitioned by clients into one or more hierarchies
  334. of tables. Client applications declare the hierarchies in
  335. database schemas via the INTERLEAVE IN declarations.
  336. The table at the top of a hierarchy is a directory
  337. table. Each row in a directory table with key K, together
  338. with all of the rows in descendant tables that start with K
  339. in lexicographic order, forms a directory. ON DELETE
  340. CASCADE says that deleting a row in the directory table
deletes any associated child rows.

CREATE TABLE Users {
  uid INT64 NOT NULL, email STRING
} PRIMARY KEY (uid), DIRECTORY;

CREATE TABLE Albums {
  uid INT64 NOT NULL, aid INT64 NOT NULL,
  name STRING
} PRIMARY KEY (uid, aid),
  INTERLEAVE IN PARENT Users ON DELETE CASCADE;

Figure 4: Example Spanner schema for photo metadata, and the interleaving implied by INTERLEAVE IN.

The figure also illustrates the interleaved layout for the example database: for example, Albums(2,1) represents the row from the
  355. Albums table for user id 2, album id 1. This
  356. interleaving of tables to form directories is significant
  357. because it allows clients to describe the locality relationships
  358. that exist between multiple tables, which is necessary
  359. for good performance in a sharded, distributed
  360. database. Without it, Spanner would not know the most
  361. important locality relationships.
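The interleaving can be pictured as a single key order in which each Users row is immediately followed by its Albums rows; a directory is one such contiguous run. The sketch below uses a hypothetical, human-readable key encoding (not Spanner's on-disk format) to show the layout for two users.

```python
# Rows in lexicographic key order; each user's directory is contiguous.
interleaved_rows = [
    "Users(1)",        # directory for uid 1 starts here
    "Albums(1,1)",
    "Albums(1,2)",
    "Users(2)",        # directory for uid 2 starts here
    "Albums(2,1)",
    "Albums(2,2)",
    "Albums(2,3)",
]

def directory(uid, rows):
    """All rows that share the given user's key prefix."""
    return [r for r in rows
            if r == f"Users({uid})" or r.startswith(f"Albums({uid},")]

assert directory(2, interleaved_rows) == [
    "Users(2)", "Albums(2,1)", "Albums(2,2)", "Albums(2,3)"]
```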
  362. 3 TrueTime
| Method | Returns |
|---|---|
| TT.now() | TTinterval: [earliest, latest] |
| TT.after(t) | true if t has definitely passed |
| TT.before(t) | true if t has definitely not arrived |

Table 1: TrueTime API. The argument t is of type TTstamp.
  368. This section describes the TrueTime API and sketches
  369. its implementation. We leave most of the details for another
  370. paper: our goal is to demonstrate the power of
  371. having such an API. Table 1 lists the methods of the
  372. API. TrueTime explicitly represents time as a TTinterval,
  373. which is an interval with bounded time uncertainty (unlike
  374. standard time interfaces that give clients no notion
  375. of uncertainty). The endpoints of a TTinterval are of
  376. type TTstamp. The TT.now() method returns a TTinterval
  377. that is guaranteed to contain the absolute time during
  378. which TT.now() was invoked. The time epoch is analogous
  379. to UNIX time with leap-second smearing. Define
the instantaneous error bound as ε, which is half of the interval's width, and the average error bound as ε̄. The TT.after() and TT.before() methods are convenience wrappers around TT.now().

Denote the absolute time of an event e by the function t_abs(e). In more formal terms, TrueTime guarantees that for an invocation tt = TT.now(), tt.earliest ≤ t_abs(e_now) ≤ tt.latest, where e_now is the invocation event.
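A minimal sketch of this interface, assuming access to a local clock and an uncertainty bound ε supplied by some external source (the class, field, and helper names here are illustrative, not Spanner's code):

```python
import time
from dataclasses import dataclass

@dataclass
class TTinterval:
    earliest: float   # seconds since the epoch
    latest: float

class TrueTime:
    def __init__(self, epsilon_fn):
        # epsilon_fn returns the current instantaneous error bound in seconds,
        # e.g. as advertised by a timeslave daemon.
        self._epsilon = epsilon_fn

    def now(self):
        t = time.time()
        eps = self._epsilon()
        # Assumed to contain the absolute time at which this call executed.
        return TTinterval(earliest=t - eps, latest=t + eps)

    def after(self, t):
        # True only if t has definitely passed.
        return self.now().earliest > t

    def before(self, t):
        # True only if t has definitely not arrived.
        return self.now().latest < t

tt = TrueTime(epsilon_fn=lambda: 0.004)   # e.g. ~4 ms error bound
interval = tt.now()
```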
  388. The underlying time references used by TrueTime
  389. are GPS and atomic clocks. TrueTime uses two forms
  390. of time reference because they have different failure
  391. modes. GPS reference-source vulnerabilities include antenna
  392. and receiver failures, local radio interference, correlated
failures (e.g., design faults such as incorrect leap-second
  394. handling and spoofing), and GPS system outages.
  395. Atomic clocks can fail in ways uncorrelated to GPS and
  396. each other, and over long periods of time can drift significantly
  397. due to frequency error.
  398. TrueTime is implemented by a set of time master machines
  399. per datacenter and a timeslave daemon per machine.
  400. The majority of masters have GPS receivers with
  401. dedicated antennas; these masters are separated physically
  402. to reduce the effects of antenna failures, radio interference,
  403. and spoofing. The remaining masters (which
  404. we refer to as Armageddon masters) are equipped with
  405. atomic clocks. An atomic clock is not that expensive:
  406. the cost of an Armageddon master is of the same order
  407. as that of a GPS master. All masters’ time references
  408. are regularly compared against each other. Each master
  409. also cross-checks the rate at which its reference advances
  410. time against its own local clock, and evicts itself
  411. if there is substantial divergence. Between synchronizations,
  412. Armageddon masters advertise a slowly increasing
  413. time uncertainty that is derived from conservatively applied
  414. worst-case clock drift. GPS masters advertise uncertainty
  415. that is typically close to zero.
  416. Every daemon polls a variety of masters [29] to reduce
  417. vulnerability to errors from any one master. Some
  418. are GPS masters chosen from nearby datacenters; the
  419. rest are GPS masters from farther datacenters, as well
  420. as some Armageddon masters. Daemons apply a variant
  421. of Marzullo’s algorithm [27] to detect and reject liars,
  422. and synchronize the local machine clocks to the nonliars.
  423. To protect against broken local clocks, machines
that exhibit frequency excursions larger than the worst-case
  425. bound derived from component specifications and
  426. operating environment are evicted.
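The liar-rejection step can be illustrated with a simplified Marzullo-style sweep: given each master's advertised interval [earliest, latest], find the sub-interval contained in the largest number of masters' intervals, and treat masters whose intervals miss that sub-interval as liars. This is only a sketch of the idea, not Spanner's implementation.

```python
def best_interval(intervals):
    """Return (count, lo, hi): the sub-interval covered by the most source intervals."""
    events = []
    for lo, hi in intervals:
        events.append((lo, +1))   # interval opens
        events.append((hi, -1))   # interval closes
    # Process opens before closes at the same point so touching intervals overlap.
    events.sort(key=lambda e: (e[0], -e[1]))
    best, count = (0, None, None), 0
    for i, (point, delta) in enumerate(events):
        count += delta
        if delta == +1 and count > best[0]:
            # The best sub-interval runs from this open point to the next event point.
            best = (count, point, events[i + 1][0])
    return best

def reject_liars(intervals):
    count, lo, hi = best_interval(intervals)
    # Keep only masters whose intervals contain the agreed sub-interval.
    return [(a, b) for (a, b) in intervals if a <= lo and hi <= b]

masters = [(10.0, 10.2), (10.1, 10.3), (9.0, 9.1)]   # the third master is a liar
print(reject_liars(masters))                          # -> [(10.0, 10.2), (10.1, 10.3)]
```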
Between synchronizations, a daemon advertises a slowly increasing time uncertainty. ε is derived from conservatively applied worst-case local clock drift. ε also depends on time-master uncertainty and communication delay to the time masters. In our production environment, ε is typically a sawtooth function of time, varying from about 1 to 7 ms over each poll interval. ε̄ is therefore 4 ms most of the time. The daemon's poll interval is currently 30 seconds, and the current applied drift rate is set at 200 microseconds/second, which together account for the sawtooth bounds from 0 to 6 ms. The remaining 1 ms comes from the communication delay to the time masters. Excursions from this sawtooth are possible in the presence of failures. For example, occasional time-master unavailability can cause datacenter-wide increases in ε. Similarly, overloaded machines and network links can result in occasional localized ε spikes.

| Operation | Timestamp Discussion | Concurrency Control | Replica Required |
|---|---|---|---|
| Read-Write Transaction | § 4.1.2 | pessimistic | leader |
| Read-Only Transaction | § 4.1.4 | lock-free | leader for timestamp; any for read, subject to § 4.1.3 |
| Snapshot Read, client-provided timestamp | — | lock-free | any, subject to § 4.1.3 |
| Snapshot Read, client-provided bound | § 4.1.3 | lock-free | any, subject to § 4.1.3 |

Table 2: Types of reads and writes in Spanner, and how they compare.
  454. 4 Concurrency Control
  455. This section describes how TrueTime is used to guarantee
  456. the correctness properties around concurrency control,
  457. and how those properties are used to implement
features such as externally consistent transactions, lock-free
  459. read-only transactions, and non-blocking reads in
  460. the past. These features enable, for example, the guarantee
  461. that a whole-database audit read at a timestamp t
  462. will see exactly the effects of every transaction that has
  463. committed as of t.
  464. Going forward, it will be important to distinguish
  465. writes as seen by Paxos (which we will refer to as Paxos
  466. writes unless the context is clear) from Spanner client
  467. writes. For example, two-phase commit generates a
  468. Paxos write for the prepare phase that has no corresponding
  469. Spanner client write.
  470. 4.1 Timestamp Management
  471. Table 2 lists the types of operations that Spanner supports.
The Spanner implementation supports read-write
  473. transactions, read-only transactions (predeclared
  474. snapshot-isolation transactions), and snapshot reads.
  475. Standalone writes are implemented as read-write transactions;
  476. non-snapshot standalone reads are implemented
  477. as read-only transactions. Both are internally retried
  478. (clients need not write their own retry loops).
  479. A read-only transaction is a kind of transaction that
  480. has the performance benefits of snapshot isolation [6].
  481. A read-only transaction must be predeclared as not having
  482. any writes; it is not simply a read-write transaction
  483. without any writes. Reads in a read-only transaction execute
  484. at a system-chosen timestamp without locking, so
  485. that incoming writes are not blocked. The execution of
  486. the reads in a read-only transaction can proceed on any
  487. replica that is sufficiently up-to-date (Section 4.1.3).
  488. A snapshot read is a read in the past that executes without
  489. locking. A client can either specify a timestamp for a
  490. snapshot read, or provide an upper bound on the desired
  491. timestamp’s staleness and let Spanner choose a timestamp.
  492. In either case, the execution of a snapshot read
  493. proceeds at any replica that is sufficiently up-to-date.
  494. For both read-only transactions and snapshot reads,
  495. commit is inevitable once a timestamp has been chosen,
  496. unless the data at that timestamp has been garbagecollected.
  497. As a result, clients can avoid buffering results
  498. inside a retry loop. When a server fails, clients can internally
  499. continue the query on a different server by repeating
  500. the timestamp and the current read position.
  501. 4.1.1 Paxos Leader Leases
  502. Spanner’s Paxos implementation uses timed leases to
  503. make leadership long-lived (10 seconds by default). A
  504. potential leader sends requests for timed lease votes;
  505. upon receiving a quorum of lease votes the leader knows
  506. it has a lease. A replica extends its lease vote implicitly
  507. on a successful write, and the leader requests lease-vote
  508. extensions if they are near expiration. Define a leader’s
  509. lease interval as starting when it discovers it has a quorum
  510. of lease votes, and as ending when it no longer has
  511. a quorum of lease votes (because some have expired).
  512. Spanner depends on the following disjointness invariant:
  513. for each Paxos group, each Paxos leader’s lease interval
  514. is disjoint from every other leader’s. Appendix A describes
  515. how this invariant is enforced.
  516. The Spanner implementation permits a Paxos leader
  517. to abdicate by releasing its slaves from their lease votes.
  518. To preserve the disjointness invariant, Spanner constrains
  519. when abdication is permissible. Define smax to be the
  520. maximum timestamp used by a leader. Subsequent sections
  521. will describe when smax is advanced. Before abdicating,
  522. a leader must wait until TT.after(smax) is true.
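A sketch of this abdication rule under the definitions above (the object and helper names are hypothetical, and lease bookkeeping is elided):

```python
import time

def abdicate(leader, truetime):
    """Leader abdication sketch: wait out smax before releasing lease votes."""
    s_max = leader.smax   # maximum timestamp this leader has used
    # Preserve the disjointness invariant: smax must be definitely in the past
    # before a successor can hold an overlapping lease interval.
    while not truetime.after(s_max):
        time.sleep(0.001)   # poll TrueTime until TT.after(smax) holds
    leader.release_lease_votes()
```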
  523. 4.1.2 Assigning Timestamps to RW Transactions
  524. Transactional reads and writes use two-phase locking.
  525. As a result, they can be assigned timestamps at any time
  527. when all locks have been acquired, but before any locks
  528. have been released. For a given transaction, Spanner assigns
  529. it the timestamp that Paxos assigns to the Paxos
  530. write that represents the transaction commit.
  531. Spanner depends on the following monotonicity invariant:
  532. within each Paxos group, Spanner assigns timestamps
  533. to Paxos writes in monotonically increasing order,
  534. even across leaders. A single leader replica can trivially
  535. assign timestamps in monotonically increasing order.
  536. This invariant is enforced across leaders by making
  537. use of the disjointness invariant: a leader must only assign
  538. timestamps within the interval of its leader lease.
  539. Note that whenever a timestamp s is assigned, smax is
  540. advanced to s to preserve disjointness.
Spanner also enforces the following external-consistency invariant: if the start of a transaction T2 occurs after the commit of a transaction T1, then the commit timestamp of T2 must be greater than the commit timestamp of T1. Define the start and commit events for a transaction Ti by e_i^start and e_i^commit, and the commit timestamp of a transaction Ti by s_i. The invariant becomes t_abs(e_1^commit) < t_abs(e_2^start) ⇒ s_1 < s_2. The protocol for executing transactions and assigning timestamps obeys two rules, which together guarantee this invariant, as shown below. Define the arrival event of the commit request at the coordinator leader for a write Ti to be e_i^server.

Start  The coordinator leader for a write Ti assigns a commit timestamp s_i no less than the value of TT.now().latest, computed after e_i^server. Note that the participant leaders do not matter here; Section 4.2.1 describes how they are involved in the implementation of the next rule.

Commit Wait  The coordinator leader ensures that clients cannot see any data committed by Ti until TT.after(s_i) is true. Commit wait ensures that s_i is less than the absolute commit time of Ti, or s_i < t_abs(e_i^commit). The implementation of commit wait is described in Section 4.2.1. Proof:

    s_1 < t_abs(e_1^commit)                (commit wait)
    t_abs(e_1^commit) < t_abs(e_2^start)   (assumption)
    t_abs(e_2^start) ≤ t_abs(e_2^server)   (causality)
    t_abs(e_2^server) ≤ s_2                (start)
    s_1 < s_2                              (transitivity)
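Putting the two rules together, a coordinator-side sketch might look like the following. The helper names are hypothetical, timestamps are plain numbers, and Paxos replication is elided; Section 4.2.1 describes the real protocol.

```python
import time

def commit(coordinator, truetime, prepare_timestamps):
    # Start rule: the commit timestamp must be at least TT.now().latest as of
    # the arrival of the commit request, at least every prepare timestamp, and
    # strictly greater than any timestamp this leader has previously assigned
    # (the "+ 1" stands in for "strictly greater").
    s = max(truetime.now().latest,
            max(prepare_timestamps, default=0),
            coordinator.last_assigned_timestamp + 1)
    coordinator.log_commit_record(s)          # replicated through Paxos

    # Commit Wait rule: do not let clients see the transaction's effects until
    # s is guaranteed to be in the past, i.e. s < t_abs(e_commit).
    while not truetime.after(s):
        time.sleep(0.001)
    coordinator.apply_and_release_locks(s)    # now visible to clients
    return s
```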
  584. 4.1.3 Serving Reads at a Timestamp
  585. The monotonicity invariant described in Section 4.1.2 allows
  586. Spanner to correctly determine whether a replica’s
  587. state is sufficiently up-to-date to satisfy a read. Every
replica tracks a value called safe time t_safe, which is the maximum timestamp at which a replica is up-to-date. A replica can satisfy a read at a timestamp t if t ≤ t_safe.

Define t_safe = min(t_safe^Paxos, t_safe^TM), where each Paxos state machine has a safe time t_safe^Paxos and each transaction manager has a safe time t_safe^TM. t_safe^Paxos is simpler: it is the timestamp of the highest-applied Paxos write. Because timestamps increase monotonically and writes are applied in order, writes will no longer occur at or below t_safe^Paxos with respect to Paxos.

t_safe^TM is ∞ at a replica if there are zero prepared (but not committed) transactions, that is, transactions in between the two phases of two-phase commit. (For a participant slave, t_safe^TM actually refers to the replica's leader's transaction manager, whose state the slave can infer through metadata passed on Paxos writes.) If there are any such transactions, then the state affected by those transactions is indeterminate: a participant replica does not know yet whether such transactions will commit. As we discuss in Section 4.2.1, the commit protocol ensures that every participant knows a lower bound on a prepared transaction's timestamp. Every participant leader (for a group g) for a transaction Ti assigns a prepare timestamp s_{i,g}^prepare to its prepare record. The coordinator leader ensures that the transaction's commit timestamp s_i ≥ s_{i,g}^prepare over all participant groups g. Therefore, for every replica in a group g, over all transactions Ti prepared at g, t_safe^TM = min_i(s_{i,g}^prepare) − 1.
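A sketch of the safe-time computation at a replica under these definitions (the field names are hypothetical):

```python
def t_safe(replica):
    """Maximum timestamp at which this replica can serve reads."""
    # Paxos safe time: timestamp of the highest-applied Paxos write.
    t_paxos_safe = replica.highest_applied_paxos_write_timestamp

    # Transaction-manager safe time: unbounded if nothing is prepared but not
    # yet committed; otherwise just below the smallest prepare timestamp.
    prepared = replica.prepared_transaction_timestamps   # list of s_prepare values
    t_tm_safe = float("inf") if not prepared else min(prepared) - 1

    return min(t_paxos_safe, t_tm_safe)

def can_serve_read(replica, t):
    return t <= t_safe(replica)
```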
  629. 4.1.4 Assigning Timestamps to RO Transactions
  630. A read-only transaction executes in two phases: assign
  631. a timestamp sread [8], and then execute the transaction’s
  632. reads as snapshot reads at sread. The snapshot reads can
  633. execute at any replicas that are sufficiently up-to-date.
  634. The simple assignment of sread = TT.now().latest, at
  635. any time after a transaction starts, preserves external consistency
  636. by an argument analogous to that presented for
  637. writes in Section 4.1.2. However, such a timestamp may
  638. require the execution of the data reads at sread to block
  639. if tsafe has not advanced sufficiently. (In addition, note
  640. that choosing a value of sread may also advance smax to
  641. preserve disjointness.) To reduce the chances of blocking,
  642. Spanner should assign the oldest timestamp that preserves
  643. external consistency. Section 4.2.2 explains how
  644. such a timestamp can be chosen.
  645. 4.2 Details
  646. This section explains some of the practical details of
  647. read-write transactions and read-only transactions elided
  648. earlier, as well as the implementation of a special transaction
  649. type used to implement atomic schema changes.
  651. It then describes some refinements of the basic schemes
  652. as described.
  653. 4.2.1 Read-Write Transactions
  654. Like Bigtable, writes that occur in a transaction are
  655. buffered at the client until commit. As a result, reads
  656. in a transaction do not see the effects of the transaction’s
  657. writes. This design works well in Spanner because a read
  658. returns the timestamps of any data read, and uncommitted
  659. writes have not yet been assigned timestamps.
Reads within read-write transactions use wound-wait [33] to avoid deadlocks. The client issues reads
  662. to the leader replica of the appropriate group, which
  663. acquires read locks and then reads the most recent
  664. data. While a client transaction remains open, it sends
  665. keepalive messages to prevent participant leaders from
  666. timing out its transaction. When a client has completed
  667. all reads and buffered all writes, it begins two-phase
  668. commit. The client chooses a coordinator group and
  669. sends a commit message to each participant’s leader with
  670. the identity of the coordinator and any buffered writes.
  671. Having the client drive two-phase commit avoids sending
  672. data twice across wide-area links.
  673. A non-coordinator-participant leader first acquires
  674. write locks. It then chooses a prepare timestamp that
  675. must be larger than any timestamps it has assigned to previous
  676. transactions (to preserve monotonicity), and logs a
  677. prepare record through Paxos. Each participant then notifies
  678. the coordinator of its prepare timestamp.
  679. The coordinator leader also first acquires write locks,
  680. but skips the prepare phase. It chooses a timestamp for
  681. the entire transaction after hearing from all other participant
  682. leaders. The commit timestamp s must be greater or
  683. equal to all prepare timestamps (to satisfy the constraints
  684. discussed in Section 4.1.3), greater than TT.now().latest
  685. at the time the coordinator received its commit message,
  686. and greater than any timestamps the leader has assigned
  687. to previous transactions (again, to preserve monotonicity).
  688. The coordinator leader then logs a commit record
  689. through Paxos (or an abort if it timed out while waiting
  690. on the other participants).
  691. Before allowing any coordinator replica to apply
  692. the commit record, the coordinator leader waits until
  693. TT.after(s), so as to obey the commit-wait rule described
  694. in Section 4.1.2. Because the coordinator leader chose s
  695. based on TT.now().latest, and now waits until that timestamp
  696. is guaranteed to be in the past, the expected wait
is at least 2 · ε̄. This wait is typically overlapped with
  698. Paxos communication. After commit wait, the coordinator
  699. sends the commit timestamp to the client and all
  700. other participant leaders. Each participant leader logs the
  701. transaction’s outcome through Paxos. All participants
  702. apply at the same timestamp and then release locks.
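The flow just described can be summarized in a sketch. The helper names are hypothetical; Paxos replication, keepalives, wound-wait, and failure handling are elided.

```python
import time

def wait_until(condition, poll_seconds=0.001):
    while not condition():
        time.sleep(poll_seconds)

def participant_prepare(leader, txn):
    leader.acquire_write_locks(txn.keys)
    # The prepare timestamp must exceed anything this leader has assigned before.
    s_prepare = leader.next_timestamp_after(leader.last_assigned_timestamp)
    leader.paxos_log("prepare", txn.id, s_prepare)
    return s_prepare

def coordinator_commit(coord, truetime, txn, prepare_timestamps):
    coord.acquire_write_locks(txn.keys)          # the coordinator skips its own prepare
    s = max([truetime.now().latest,
             coord.last_assigned_timestamp + 1] + prepare_timestamps)
    coord.paxos_log("commit", txn.id, s)
    wait_until(lambda: truetime.after(s))        # commit wait (Section 4.1.2)
    for p in txn.participants:                   # everyone applies at s, then releases locks
        p.notify_outcome(txn.id, "commit", s)
    return s
```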
  703. 4.2.2 Read-Only Transactions
  704. Assigning a timestamp requires a negotiation phase between
  705. all of the Paxos groups that are involved in the
  706. reads. As a result, Spanner requires a scope expression
  707. for every read-only transaction, which is an expression
  708. that summarizes the keys that will be read by the entire
  709. transaction. Spanner automatically infers the scope for
  710. standalone queries.
  711. If the scope’s values are served by a single Paxos
  712. group, then the client issues the read-only transaction to
  713. that group’s leader. (The current Spanner implementation
  714. only chooses a timestamp for a read-only transaction
  715. at a Paxos leader.) That leader assigns sread and executes
  716. the read. For a single-site read, Spanner generally
  717. does better than TT.now().latest. Define LastTS() to
  718. be the timestamp of the last committed write at a Paxos
  719. group. If there are no prepared transactions, the assignment
  720. sread = LastTS() trivially satisfies external consistency:
  721. the transaction will see the result of the last write,
  722. and therefore be ordered after it.
  723. If the scope’s values are served by multiple Paxos
  724. groups, there are several options. The most complicated
  725. option is to do a round of communication with all of
the groups' leaders to negotiate sread based on LastTS().
  727. Spanner currently implements a simpler choice. The
  728. client avoids a negotiation round, and just has its reads
  729. execute at sread = TT.now().latest (which may wait for
  730. safe time to advance). All reads in the transaction can be
  731. sent to replicas that are sufficiently up-to-date.
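A sketch of the timestamp choice for read-only transactions as just described, where LastTS() is the timestamp of the last committed write at a group (the helper names are hypothetical):

```python
def choose_read_timestamp(groups, truetime):
    """Pick s_read for a read-only transaction over the given Paxos groups."""
    if len(groups) == 1:
        g = groups[0]
        # Single-site read at the group's leader: with no prepared transactions,
        # LastTS() already orders the read after the last committed write.
        if not g.has_prepared_transactions():
            return g.last_committed_write_timestamp()   # LastTS()
        return truetime.now().latest
    # Multiple groups: skip the negotiation round and read at TT.now().latest,
    # which may have to wait for replicas' safe time to advance.
    return truetime.now().latest
```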
  732. 4.2.3 Schema-Change Transactions
  733. TrueTime enables Spanner to support atomic schema
  734. changes. It would be infeasible to use a standard transaction,
  735. because the number of participants (the number of
  736. groups in a database) could be in the millions. Bigtable
  737. supports atomic schema changes in one datacenter, but
  738. its schema changes block all operations.
  739. A Spanner schema-change transaction is a generally
  740. non-blocking variant of a standard transaction. First, it
  741. is explicitly assigned a timestamp in the future, which
  742. is registered in the prepare phase. As a result, schema
  743. changes across thousands of servers can complete with
  744. minimal disruption to other concurrent activity. Second,
  745. reads and writes, which implicitly depend on the
  746. schema, synchronize with any registered schema-change
  747. timestamp at time t: they may proceed if their timestamps
precede t, but they must block behind the schema-change
  749. transaction if their timestamps are after t. Without
  750. TrueTime, defining the schema change to happen at t
  751. would be meaningless.
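A sketch of how an individual read or write synchronizes with a registered schema-change timestamp t (the function and parameter names are hypothetical):

```python
def maybe_wait_for_schema_change(op_timestamp, registered_schema_change_ts,
                                 wait_until_applied):
    """Reads and writes implicitly depend on the schema, so they order around t."""
    t = registered_schema_change_ts
    if t is None or op_timestamp < t:
        return            # operation precedes the schema change; proceed
    # Operation is at or after t: block until the schema-change transaction
    # has been applied, then proceed under the new schema.
    wait_until_applied(t)
```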
| replicas | write latency (ms) | read-only txn latency (ms) | snapshot read latency (ms) | write throughput (Kops/sec) | read-only txn throughput (Kops/sec) | snapshot read throughput (Kops/sec) |
|---|---|---|---|---|---|---|
| 1D | 9.4±.6 | — | — | 4.0±.3 | — | — |
| 1 | 14.4±1.0 | 1.4±.1 | 1.3±.1 | 4.1±.05 | 10.9±.4 | 13.5±.1 |
| 3 | 13.9±.6 | 1.3±.1 | 1.2±.1 | 2.2±.5 | 13.8±3.2 | 38.5±.3 |
| 5 | 14.4±.4 | 1.4±.05 | 1.3±.04 | 2.8±.3 | 25.3±5.2 | 50.0±1.1 |

Table 3: Operation microbenchmarks. Mean and standard deviation over 10 runs. 1D means one replica with commit wait disabled.
  760. 4.2.4 Refinements
t_safe^TM as defined above has a weakness, in that a single prepared transaction prevents t_safe from advancing. As a result, no reads can occur at later timestamps, even if the reads do not conflict with the transaction. Such false conflicts can be removed by augmenting t_safe^TM with a fine-grained mapping from key ranges to prepared-transaction timestamps. This information can be stored in the lock table, which already maps key ranges to lock metadata. When a read arrives, it only needs to be checked against the fine-grained safe time for key ranges with which the read conflicts.

LastTS() as defined above has a similar weakness: if a transaction has just committed, a non-conflicting read-only transaction must still be assigned sread so as to follow that transaction. As a result, the execution of the read could be delayed. This weakness can be remedied similarly by augmenting LastTS() with a fine-grained mapping from key ranges to commit timestamps in the lock table. (We have not yet implemented this optimization.) When a read-only transaction arrives, its timestamp can be assigned by taking the maximum value of LastTS() for the key ranges with which the transaction conflicts, unless there is a conflicting prepared transaction (which can be determined from fine-grained safe time).

t_safe^Paxos as defined above has a weakness in that it cannot advance in the absence of Paxos writes. That is, a snapshot read at t cannot execute at Paxos groups whose last write happened before t. Spanner addresses this problem by taking advantage of the disjointness of leader-lease intervals. Each Paxos leader advances t_safe^Paxos by keeping a threshold above which future writes' timestamps will occur: it maintains a mapping MinNextTS(n) from Paxos sequence number n to the minimum timestamp that may be assigned to Paxos sequence number n + 1. A replica can advance t_safe^Paxos to MinNextTS(n) − 1 when it has applied through n.
  802. A single leader can enforce its MinNextTS()
  803. promises easily. Because the timestamps promised
  804. by MinNextTS() lie within a leader’s lease, the disjointness
  805. invariant enforces MinNextTS() promises across
  806. leaders. If a leader wishes to advance MinNextTS()
  807. beyond the end of its leader lease, it must first extend its
  808. lease. Note that smax is always advanced to the highest
  809. value in MinNextTS() to preserve disjointness.
  810. A leader by default advances MinNextTS() values every
  811. 8 seconds. Thus, in the absence of prepared transactions,
  812. healthy slaves in an idle Paxos group can serve
  813. reads at timestamps greater than 8 seconds old in the
  814. worst case. A leader may also advance MinNextTS() values
  815. on demand from slaves.
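A sketch of how a leader's MinNextTS() promises let replicas advance t_safe^Paxos in an idle group, using the default 8-second advance period (the object and field names are hypothetical):

```python
def advance_min_next_ts(leader, period_seconds=8):
    """Leader promise: no write at sequence number n+1 will get a smaller timestamp."""
    n = leader.current_paxos_sequence_number
    promised = leader.now() + period_seconds
    # Promises must stay within the leader's lease; extend the lease first if needed.
    if promised > leader.lease_end:
        leader.extend_lease()
    leader.min_next_ts[n] = promised
    leader.smax = max(leader.smax, promised)   # preserve disjointness

def replica_paxos_safe_time(replica):
    n = replica.highest_applied_sequence_number
    # Having applied through n, no future write can be below MinNextTS(n),
    # so it is safe to serve reads up to MinNextTS(n) - 1.
    return replica.min_next_ts[n] - 1
```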
  816. 5 Evaluation
  817. We first measure Spanner’s performance with respect to
  818. replication, transactions, and availability. We then provide
  819. some data on TrueTime behavior, and a case study
  820. of our first client, F1.
  821. 5.1 Microbenchmarks
  822. Table 3 presents some microbenchmarks for Spanner.
  823. These measurements were taken on timeshared machines:
  824. each spanserver ran on scheduling units of 4GB
  825. RAM and 4 cores (AMD Barcelona 2200MHz). Clients
  826. were run on separate machines. Each zone contained one
  827. spanserver. Clients and zones were placed in a set of datacenters
  828. with network distance of less than 1ms. (Such a
  829. layout should be commonplace: most applications do not
  830. need to distribute all of their data worldwide.) The test
  831. database was created with 50 Paxos groups with 2500 directories.
  832. Operations were standalone reads and writes of
  833. 4KB. All reads were served out of memory after a compaction,
  834. so that we are only measuring the overhead of
  835. Spanner’s call stack. In addition, one unmeasured round
  836. of reads was done first to warm any location caches.
  837. For the latency experiments, clients issued sufficiently
  838. few operations so as to avoid queuing at the servers.
  839. From the 1-replica experiments, commit wait is about
  840. 5ms, and Paxos latency is about 9ms. As the number
  841. of replicas increases, the latency stays roughly constant
  842. with less standard deviation because Paxos executes in
  843. parallel at a group’s replicas. As the number of replicas
  844. increases, the latency to achieve a quorum becomes less
  845. sensitive to slowness at one slave replica.
| participants | mean latency (ms) | 99th percentile latency (ms) |
|---|---|---|
| 1 | 17.0±1.4 | 75.0±34.9 |
| 2 | 24.5±2.5 | 87.6±35.9 |
| 5 | 31.5±6.2 | 104.5±52.2 |
| 10 | 30.0±3.7 | 95.6±25.4 |
| 25 | 35.5±5.6 | 100.4±42.7 |
| 50 | 42.7±4.1 | 93.7±22.9 |
| 100 | 71.4±7.6 | 131.2±17.6 |
| 200 | 150.5±11.0 | 320.3±35.1 |

Table 4: Two-phase commit scalability. Mean and standard deviations over 10 runs.

For the throughput experiments, clients issued sufficiently many operations so as to saturate the servers' CPUs. Snapshot reads can execute at any up-to-date
  862. replicas, so their throughput increases almost linearly
  863. with the number of replicas. Single-read read-only transactions
  864. only execute at leaders because timestamp assignment
  865. must happen at leaders. Read-only-transaction
  866. throughput increases with the number of replicas because
  867. the number of effective spanservers increases: in the
  868. experimental setup, the number of spanservers equaled
  869. the number of replicas, and leaders were randomly distributed
  870. among the zones. Write throughput benefits
  871. from the same experimental artifact (which explains the
  872. increase in throughput from 3 to 5 replicas), but that benefit
  873. is outweighed by the linear increase in the amount of
  874. work performed per write, as the number of replicas increases.
  875. Table 4 demonstrates that two-phase commit can scale
  876. to a reasonable number of participants: it summarizes
  877. a set of experiments run across 3 zones, each with 25
  878. spanservers. Scaling up to 50 participants is reasonable
  879. in both mean and 99th-percentile, and latencies start to
  880. rise noticeably at 100 participants.
  881. 5.2 Availability
  882. Figure 5 illustrates the availability benefits of running
  883. Spanner in multiple datacenters. It shows the results of
  884. three experiments on throughput in the presence of datacenter
  885. failure, all of which are overlaid onto the same
  886. time scale. The test universe consisted of 5 zones Zi,
  887. each of which had 25 spanservers. The test database was
  888. sharded into 1250 Paxos groups, and 100 test clients constantly
issued non-snapshot reads at an aggregate rate
  890. of 50K reads/second. All of the leaders were explicitly
  891. placed in Z1. Five seconds into each test, all of
  892. the servers in one zone were killed: non-leader kills Z2;
  893. leader-hard kills Z1; leader-soft kills Z1, but it gives notifications
  894. to all of the servers that they should handoff
  895. leadership first.
Figure 5: Effect of killing servers on throughput (cumulative reads completed versus time in seconds, for the non-leader, leader-soft, and leader-hard experiments).

Killing Z2 has no effect on read throughput. Killing Z1 while giving the leaders time to handoff leadership to
  912. a different zone has a minor effect: the throughput drop
  913. is not visible in the graph, but is around 3-4%. On the
  914. other hand, killing Z1 with no warning has a severe effect:
  915. the rate of completion drops almost to 0. As leaders
  916. get re-elected, though, the throughput of the system rises
  917. to approximately 100K reads/second because of two artifacts
  918. of our experiment: there is extra capacity in the
  919. system, and operations are queued while the leader is unavailable.
  920. As a result, the throughput of the system rises
  921. before leveling off again at its steady-state rate.
  922. We can also see the effect of the fact that Paxos leader
  923. leases are set to 10 seconds. When we kill the zone,
  924. the leader-lease expiration times for the groups should
  925. be evenly distributed over the next 10 seconds. Soon after
  926. each lease from a dead leader expires, a new leader is
  927. elected. Approximately 10 seconds after the kill time, all
  928. of the groups have leaders and throughput has recovered.
  929. Shorter lease times would reduce the effect of server
  930. deaths on availability, but would require greater amounts
  931. of lease-renewal network traffic. We are in the process of
  932. designing and implementing a mechanism that will cause
  933. slaves to release Paxos leader leases upon leader failure.
  934. 5.3 TrueTime
Two questions must be answered with respect to TrueTime: is ε truly a bound on clock uncertainty, and how bad does ε get? For the former, the most serious problem would be if a local clock's drift were greater than 200us/sec: that would break assumptions made by TrueTime. Our machine statistics show that bad CPUs are 6 times more likely than bad clocks. That is, clock issues are extremely infrequent, relative to much more serious hardware problems. As a result, we believe that TrueTime's implementation is as trustworthy as any other
  945. piece of software upon which Spanner depends.
Figure 6 presents TrueTime data taken at several thousand spanserver machines across datacenters up to 2200 km apart. It plots the 90th, 99th, and 99.9th percentiles of ε, sampled at timeslave daemons immediately after polling the time masters. This sampling elides the sawtooth in ε due to local-clock uncertainty, and therefore measures time-master uncertainty (which is generally 0) plus communication delay to the time masters.

Figure 6: Distribution of TrueTime ε values, sampled right after the timeslave daemon polls the time masters. The 90th, 99th, and 99.9th percentiles are graphed.
The data shows that these two factors in determining the base value of ε are generally not a problem. However, there can be significant tail-latency issues that cause higher values of ε. The reduction in tail latencies beginning on March 30 was due to networking improvements that reduced transient network-link congestion. The increase in ε on April 13, approximately one hour in duration,
  984. resulted from the shutdown of 2 time masters at a
  985. datacenter for routine maintenance. We continue to investigate
  986. and remove causes of TrueTime spikes.
  987. 5.4 F1
  988. Spanner started being experimentally evaluated under
  989. production workloads in early 2011, as part of a rewrite
  990. of Google’s advertising backend called F1 [35]. This
  991. backend was originally based on a MySQL database that
  992. was manually sharded many ways. The uncompressed
  993. dataset is tens of terabytes, which is small compared to
  994. many NoSQL instances, but was large enough to cause
  995. difficulties with sharded MySQL. The MySQL sharding
  996. scheme assigned each customer and all related data to a
  997. fixed shard. This layout enabled the use of indexes and
  998. complex query processing on a per-customer basis, but
  999. required some knowledge of the sharding in application
  1000. business logic. Resharding this revenue-critical database
  1001. as it grew in the number of customers and their data was
  1002. extremely costly. The last resharding took over two years
  1003. of intense effort, and involved coordination and testing
  1004. across dozens of teams to minimize risk. This operation
  1005. was too complex to do regularly: as a result, the team had
  1006. to limit growth on the MySQL database by storing some
data in external Bigtables, which compromised transactional behavior and the ability to query across all data.

| # fragments | # directories |
|---|---|
| 1 | >100M |
| 2–4 | 341 |
| 5–9 | 5336 |
| 10–14 | 232 |
| 15–99 | 34 |
| 100–500 | 7 |

Table 5: Distribution of directory-fragment counts in F1.
  1017. The F1 team chose to use Spanner for several reasons.
  1018. First, Spanner removes the need to manually reshard.
  1019. Second, Spanner provides synchronous replication
  1020. and automatic failover. With MySQL master-slave
  1021. replication, failover was difficult, and risked data loss
  1022. and downtime. Third, F1 requires strong transactional
  1023. semantics, which made using other NoSQL systems impractical.
  1024. Application semantics requires transactions
  1025. across arbitrary data, and consistent reads. The F1 team
  1026. also needed secondary indexes on their data (since Spanner
  1027. does not yet provide automatic support for secondary
  1028. indexes), and was able to implement their own consistent
  1029. global indexes using Spanner transactions.
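The general shape of such an index is easy to sketch: the index is just another table, and every base-table write updates the corresponding index rows in the same transaction, so a reader can never observe the index out of sync with the data. The sketch below only illustrates that idea; the client interface and table names are invented, not F1's or Spanner's actual API.

```python
# Hedged sketch of a consistent global secondary index maintained with
# transactions: the base row and its index entries commit atomically.
# The client interface (run_transaction, txn.read/update/insert/delete) and the
# table names are invented for illustration; they are not F1's or Spanner's API.

def update_customer_email(db, customer_id, new_email):
    def work(txn):
        row = txn.read("Customers", key=customer_id)
        old_email = row["email"]
        # Base-table write and index maintenance in the same transaction:
        txn.update("Customers", key=customer_id, values={"email": new_email})
        txn.delete("CustomersByEmail", key=(old_email, customer_id))
        txn.insert("CustomersByEmail", key=(new_email, customer_id), values={})
    # Either both the base table and the index reflect the change, or neither does.
    db.run_transaction(work)
```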
  1030. All application writes are now by default sent through
  1031. F1 to Spanner, instead of the MySQL-based application
  1032. stack. F1 has 2 replicas on the west coast of the US, and
  1033. 3 on the east coast. This choice of replica sites was made
  1034. to cope with outages due to potential major natural disasters,
and also to be near their frontend sites. Anecdotally,
  1036. Spanner’s automatic failover has been nearly invisible
  1037. to them. Although there have been unplanned cluster
  1038. failures in the last few months, the most that the F1 team
  1039. has had to do is update their database’s schema to tell
  1040. Spanner where to preferentially place Paxos leaders, so
  1041. as to keep them close to where their frontends moved.
  1042. Spanner’s timestamp semantics made it efficient for
  1043. F1 to maintain in-memory data structures computed from
  1044. the database state. F1 maintains a logical history log of
  1045. all changes, which is written into Spanner itself as part
  1046. of every transaction. F1 takes full snapshots of data at a
  1047. timestamp to initialize its data structures, and then reads
  1048. incremental changes to update them.
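A rough sketch of that pattern follows. Only the idea of reading at a fixed timestamp and then catching up from a per-transaction history log comes from the paper; the function names and the "ChangeLog" table layout are invented for illustration.

```python
# Sketch of the snapshot-then-catch-up pattern described above: initialize
# in-memory state from a snapshot at timestamp t0, then replay history-log
# entries whose commit timestamps are greater than the last one applied.
# snapshot_read(), current_timestamp(), and the "ChangeLog" table are invented
# names for illustration, not the actual F1 or Spanner interfaces.

def build_state(db):
    state = {}
    t0 = db.current_timestamp()                      # snapshot timestamp
    for row in db.snapshot_read("Customers", timestamp=t0):
        state[row["id"]] = row                       # full snapshot at t0
    return state, t0

def catch_up(db, state, last_applied_ts):
    # The history log is written by every transaction, so replaying entries with
    # commit_ts > last_applied_ts brings the in-memory state up to date.
    now = db.current_timestamp()
    for change in db.snapshot_read("ChangeLog", timestamp=now,
                                   min_commit_ts=last_applied_ts):
        state[change["row_id"]] = change["new_value"]
        last_applied_ts = max(last_applied_ts, change["commit_ts"])
    return last_applied_ts
```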
  1049. Table 5 illustrates the distribution of the number of
  1050. fragments per directory in F1. Each directory typically
  1051. corresponds to a customer in the application stack above
  1052. F1. The vast majority of directories (and therefore customers)
  1053. consist of only 1 fragment, which means that
  1054. reads and writes to those customers’ data are guaranteed
  1055. to occur on only a single server. The directories with
  1056. more than 100 fragments are all tables that contain F1
  1057. secondary indexes: writes to more than a few fragments
of such tables are extremely uncommon. The F1 team has only seen such behavior when they do untuned bulk data loads as transactions.

Table 6: F1-perceived operation latencies measured over the course of 24 hours.
  operation            mean (ms)   std dev (ms)   count
  all reads            8.7         376.4          21.5B
  single-site commit   72.3        112.8          31.2M
  multi-site commit    103.0       52.2           32.1M
  1069. Table 6 presents Spanner operation latencies as measured
from F1 servers. Replicas in the east-coast datacenters
are given higher priority in choosing Paxos leaders.
The data in the table is measured from F1 servers
in those datacenters. The large standard deviation in
write latencies is caused by a heavy tail due to lock
conflicts. The even larger standard deviation in read latencies
is partially due to the fact that Paxos leaders are
spread across two datacenters, only one of which has
  1078. machines with SSDs. In addition, the measurement includes
  1079. every read in the system from two datacenters:
  1080. the mean and standard deviation of the bytes read were
  1081. roughly 1.6KB and 119KB, respectively.
  1082. 6 Related Work
  1083. Consistent replication across datacenters as a storage
  1084. service has been provided by Megastore [5] and DynamoDB
  1085. [3]. DynamoDB presents a key-value interface,
  1086. and only replicates within a region. Spanner follows
  1087. Megastore in providing a semi-relational data model,
  1088. and even a similar schema language. Megastore does
  1089. not achieve high performance. It is layered on top of
  1090. Bigtable, which imposes high communication costs. It
  1091. also does not support long-lived leaders: multiple replicas
  1092. may initiate writes. All writes from different replicas
  1093. necessarily conflict in the Paxos protocol, even if
  1094. they do not logically conflict: throughput collapses on
  1095. a Paxos group at several writes per second. Spanner provides
  1096. higher performance, general-purpose transactions,
  1097. and external consistency.
  1098. Pavlo et al. [31] have compared the performance of
  1099. databases and MapReduce [12]. They point to several
  1100. other efforts that have been made to explore database
  1101. functionality layered on distributed key-value stores [1,
  1102. 4, 7, 41] as evidence that the two worlds are converging.
  1103. We agree with the conclusion, but demonstrate that integrating
  1104. multiple layers has its advantages: integrating
  1105. concurrency control with replication reduces the cost of
  1106. commit wait in Spanner, for example.
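To make that point concrete with a toy latency model (all numbers below are invented; the overlap of commit wait with Paxos replication follows the paper's concurrency-control description, where the average wait is roughly 2ε):

```python
# Toy latency model (all numbers invented) of why integrating concurrency
# control with replication helps: Spanner's commit wait (~2 * epsilon on
# average) can overlap with Paxos replication of the commit record, so a
# commit pays roughly the max of the two rather than their sum.

AVG_EPSILON_MS = 4.0
COMMIT_WAIT_MS = 2 * AVG_EPSILON_MS   # wait until TT.after(commit timestamp)
PAXOS_REPLICATION_MS = 10.0           # assumed cross-replica round trip

layered = PAXOS_REPLICATION_MS + COMMIT_WAIT_MS         # separate layers pay both in sequence
integrated = max(PAXOS_REPLICATION_MS, COMMIT_WAIT_MS)  # overlapped within one layer

print(f"layered design:    ~{layered:.0f} ms")     # ~18 ms
print(f"integrated design: ~{integrated:.0f} ms")  # ~10 ms
```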
  1107. The notion of layering transactions on top of a replicated
  1108. store dates at least as far back as Gifford’s dissertation
  1109. [16]. Scatter [17] is a recent DHT-based key-value
  1110. store that layers transactions on top of consistent replication.
  1111. Spanner focuses on providing a higher-level interface
  1112. than Scatter does. Gray and Lamport [18] describe
  1113. a non-blocking commit protocol based on Paxos.
Their protocol incurs more messaging costs than two-phase
  1115. commit, which would aggravate the cost of commit
  1116. over widely distributed groups. Walter [36] provides
  1117. a variant of snapshot isolation that works within, but not
  1118. across datacenters. In contrast, our read-only transactions
  1119. provide a more natural semantics, because we support
  1120. external consistency over all operations.
  1121. There has been a spate of recent work on reducing
  1122. or eliminating locking overheads. Calvin [40] eliminates
  1123. concurrency control: it pre-assigns timestamps and
then executes the transactions in timestamp order. H-Store
  1125. [39] and Granola [11] each supported their own
  1126. classification of transaction types, some of which could
  1127. avoid locking. None of these systems provides external
  1128. consistency. Spanner addresses the contention issue by
  1129. providing support for snapshot isolation.
  1130. VoltDB [42] is a sharded in-memory database that
  1131. supports master-slave replication over the wide area for
  1132. disaster recovery, but not more general replication configurations.
  1133. It is an example of what has been called
  1134. NewSQL, which is a marketplace push to support scalable
  1135. SQL [38]. A number of commercial databases implement
  1136. reads in the past, such as MarkLogic [26] and
  1137. Oracle’s Total Recall [30]. Lomet and Li [24] describe an
  1138. implementation strategy for such a temporal database.
  1139. Farsite derived bounds on clock uncertainty (much
  1140. looser than TrueTime’s) relative to a trusted clock reference
  1141. [13]: server leases in Farsite were maintained in the
  1142. same way that Spanner maintains Paxos leases. Loosely
synchronized clocks have been used for concurrency-control
  1144. purposes in prior work [2, 23]. We have shown
  1145. that TrueTime lets one reason about global time across
  1146. sets of Paxos state machines.
  1147. 7 Future Work
  1148. We have spent most of the last year working with the
  1149. F1 team to transition Google’s advertising backend from
  1150. MySQL to Spanner. We are actively improving its monitoring
  1151. and support tools, as well as tuning its performance.
  1152. In addition, we have been working on improving
  1153. the functionality and performance of our backup/restore
  1154. system. We are currently implementing the Spanner
  1155. schema language, automatic maintenance of secondary
  1156. indices, and automatic load-based resharding. Longer
  1157. term, there are a couple of features that we plan to in-
  1159. vestigate. Optimistically doing reads in parallel may be
  1160. a valuable strategy to pursue, but initial experiments have
  1161. indicated that the right implementation is non-trivial. In
  1162. addition, we plan to eventually support direct changes of
  1163. Paxos configurations [22, 34].
  1164. Given that we expect many applications to replicate
  1165. their data across datacenters that are relatively close to
each other, TrueTime ε may noticeably affect performance.
We see no insurmountable obstacle to reducing
ε below 1 ms. Time-master-query intervals can be
  1169. reduced, and better clock crystals are relatively cheap.
  1170. Time-master query latency could be reduced with improved
  1171. networking technology, or possibly even avoided
  1172. through alternate time-distribution technology.
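As rough arithmetic (the 30 s poll interval and 200 μs/sec applied drift are the figures given earlier in the paper; the improved values are purely illustrative): today the sawtooth contributes up to 200 μs/sec × 30 s = 6 ms on top of a roughly 1 ms base. Polling every 10 s would cut the sawtooth to 2 ms, and a crystal good enough to justify a 50 μs/sec applied drift rate would cut it to 0.5 ms, at which point the communication-delay term dominates and ε near or below 1 ms looks attainable.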
  1173. Finally, there are obvious areas for improvement. Although
  1174. Spanner is scalable in the number of nodes, the
  1175. node-local data structures have relatively poor performance
  1176. on complex SQL queries, because they were designed
  1177. for simple key-value accesses. Algorithms and
data structures from DB literature could improve single-node
  1179. performance a great deal. Second, moving data automatically
  1180. between datacenters in response to changes
  1181. in client load has long been a goal of ours, but to make
  1182. that goal effective, we would also need the ability to
  1183. move client-application processes between datacenters in
  1184. an automated, coordinated fashion. Moving processes
  1185. raises the even more difficult problem of managing resource
  1186. acquisition and allocation between datacenters.
  1187. 8 Conclusions
To summarize, Spanner combines and extends ideas
  1189. from two research communities: from the database community,
  1190. a familiar, easy-to-use, semi-relational interface,
  1191. transactions, and an SQL-based query language; from
  1192. the systems community, scalability, automatic sharding,
  1193. fault tolerance, consistent replication, external consistency,
  1194. and wide-area distribution. Since Spanner’s inception,
  1195. we have taken more than 5 years to iterate to the
  1196. current design and implementation. Part of this long iteration
  1197. phase was due to a slow realization that Spanner
should do more than tackle the problem of a globally-replicated
  1199. namespace, and should also focus on database
  1200. features that Bigtable was missing.
  1201. One aspect of our design stands out: the linchpin of
  1202. Spanner’s feature set is TrueTime. We have shown that
  1203. reifying clock uncertainty in the time API makes it possible
  1204. to build distributed systems with much stronger time
  1205. semantics. In addition, as the underlying system enforces
  1206. tighter bounds on clock uncertainty, the overhead
  1207. of the stronger semantics decreases. As a community, we
  1208. should no longer depend on loosely synchronized clocks
  1209. and weak time APIs in designing distributed algorithms.
  1210. Acknowledgements
  1211. Many people have helped to improve this paper: our
  1212. shepherd Jon Howell, who went above and beyond
  1213. his responsibilities; the anonymous referees; and many
  1214. Googlers: Atul Adya, Fay Chang, Frank Dabek, Sean
  1215. Dorward, Bob Gruber, David Held, Nick Kline, Alex
  1216. Thomson, and Joel Wein. Our management has been
  1217. very supportive of both our work and of publishing this
paper: Aristotle Balogh, Bill Coughran, Urs Hölzle,
Doron Meyer, Cos Nicolaou, Kathy Polizzi, Sridhar Ramaswamy,
  1220. and Shivakumar Venkataraman.
  1221. We have built upon the work of the Bigtable and
  1222. Megastore teams. The F1 team, and Jeff Shute in particular,
  1223. worked closely with us in developing our data model
  1224. and helped immensely in tracking down performance and
  1225. correctness bugs. The Platforms team, and Luiz Barroso
  1226. and Bob Felderman in particular, helped to make True-
  1227. Time happen. Finally, a lot of Googlers used to be on our
  1228. team: Ken Ashcraft, Paul Cychosz, Krzysztof Ostrowski,
  1229. Amir Voskoboynik, Matthew Weaver, Theo Vassilakis,
  1230. and Eric Veach; or have joined our team recently: Nathan
  1231. Bales, Adam Beberg, Vadim Borisov, Ken Chen, Brian
  1232. Cooper, Cian Cullinan, Robert-Jan Huijsman, Milind
  1233. Joshi, Andrey Khorlin, Dawid Kuroczko, Laramie Leavitt,
  1234. Eric Li, Mike Mammarella, Sunil Mushran, Simon
  1235. Nielsen, Ovidiu Platon, Ananth Shrinivas, Vadim Suvorov,
  1236. and Marcel van der Holst.