1
00:00:06,320 --> 00:00:11,499
[Music]

2
00:00:16,080 --> 00:00:20,640
third time's a charm um so i've caused

3
00:00:18,640 --> 00:00:22,640
worse outages than myself at work so

4
00:00:20,640 --> 00:00:24,640
this is fine um

5
00:00:22,640 --> 00:00:27,039
and now for our final keynote speaker of

6
00:00:24,640 --> 00:00:29,439
the conference liz fong jones

7
00:00:27,039 --> 00:00:31,199
as a bit of an aside as followers of liz

8
00:00:29,439 --> 00:00:32,800
on social media we'll know she's had

9
00:00:31,199 --> 00:00:34,559
something of an odyssey over the last

10
00:00:32,800 --> 00:00:36,719
couple days having to fly from sydney to

11
00:00:34,559 --> 00:00:38,800
the west coast of usa to get some urgent

12
00:00:36,719 --> 00:00:40,239
documents notarized then

13
00:00:38,800 --> 00:00:42,079
getting back on a plane to sydney

14
00:00:40,239 --> 00:00:45,200
returning yesterday morning and it all

15
00:00:42,079 --> 00:00:47,280
seems to be sorted now about liz liz is

16
00:00:45,200 --> 00:00:49,840
a developer advocate labor and ethics

17
00:00:47,280 --> 00:00:51,760
organizer and site reliability engineer

18
00:00:49,840 --> 00:00:52,879
with 16 plus years of

19
00:00:51,760 --> 00:00:55,840
experience

20
00:00:52,879 --> 00:00:56,800
she's an advocate at honeycomb for sre

21
00:00:55,840 --> 00:00:58,879
and the

22
00:00:56,800 --> 00:01:00,960
observability communities

23
00:00:58,879 --> 00:01:03,120
and previously was an sre working on

24
00:01:00,960 --> 00:01:05,600
projects ranging from the google cloud

25
00:01:03,120 --> 00:01:07,840
load balancer to google flights

26
00:01:05,600 --> 00:01:10,000
liz will be talking about cultivating

27
00:01:07,840 --> 00:01:13,560
production excellence

28
00:01:10,000 --> 00:01:13,560
over to you liz

29
00:01:15,920 --> 00:01:20,880
good day everyone and thank you for

30
00:01:18,320 --> 00:01:22,960
joining me um i wanted to acknowledge

31
00:01:20,880 --> 00:01:25,280
that i am living on the lands of the

32
00:01:22,960 --> 00:01:27,119
gadigal people and i also wanted to

33
00:01:25,280 --> 00:01:28,960
acknowledge that there are

34
00:01:27,119 --> 00:01:31,840
things happening in the world right now

35
00:01:28,960 --> 00:01:33,680
and that um tonga is under threat from

36
00:01:31,840 --> 00:01:36,240
climate change and from the volcano that

37
00:01:33,680 --> 00:01:38,640
has just erupted

38
00:01:36,240 --> 00:01:42,240
okay let us go ahead and begin

39
00:01:38,640 --> 00:01:44,399
so today i'm really excited to be

40
00:01:42,240 --> 00:01:47,119
telling you about all the lessons i've

41
00:01:44,399 --> 00:01:49,280
learned

42
00:01:47,119 --> 00:01:51,840
from the past 15 years of my experience

43
00:01:49,280 --> 00:01:54,560
working in the field of site reliability

44
00:01:51,840 --> 00:01:56,560
engineering at companies large and small

45
00:01:54,560 --> 00:01:58,960
and how we can take those lessons and

46
00:01:56,560 --> 00:02:00,960
make our workplaces more humane for us

47
00:01:58,960 --> 00:02:03,680
and to make our systems more operable

48
00:02:00,960 --> 00:02:03,680
and reliable

49
00:02:03,840 --> 00:02:07,280
so

50
00:02:04,640 --> 00:02:09,119
those of us who are attending lca are

51
00:02:07,280 --> 00:02:10,720
very familiar with the idea of operating

52
00:02:09,119 --> 00:02:12,720
in production

53
00:02:10,720 --> 00:02:15,520
but some of our colleagues may not

54
00:02:12,720 --> 00:02:17,680
necessarily have that same notion so to

55
00:02:15,520 --> 00:02:20,239
get everyone on the same page we have to

56
00:02:17,680 --> 00:02:22,080
orient around what the objectives are of

57
00:02:20,239 --> 00:02:24,319
the business what the objectives are of

58
00:02:22,080 --> 00:02:26,640
the code that we are writing

59
00:02:24,319 --> 00:02:28,959
and fundamentally we're writing code in

60
00:02:26,640 --> 00:02:30,879
order to solve problems but

61
00:02:28,959 --> 00:02:32,879
the problem is that just landing the

62
00:02:30,879 --> 00:02:35,519
code in the main branch of git does not

63
00:02:32,879 --> 00:02:37,680
necessarily mean that your job is done

64
00:02:35,519 --> 00:02:39,360
that there are so many additional steps

65
00:02:37,680 --> 00:02:41,280
that you have to complete in order to

66
00:02:39,360 --> 00:02:44,319
make sure that your services are

67
00:02:41,280 --> 00:02:46,319
delivering value to end users

68
00:02:44,319 --> 00:02:48,239
because if your change has to be

69
00:02:46,319 --> 00:02:50,720
repeatedly rolled back or it takes

70
00:02:48,239 --> 00:02:54,640
months to push out you're not actually

71
00:02:50,720 --> 00:02:54,640
delivering that value until months later

72
00:02:54,800 --> 00:02:57,760
another challenge that we frequently

73
00:02:56,239 --> 00:02:59,599
encounter is that production is

74
00:02:57,760 --> 00:03:02,319
increasingly complex

75
00:02:59,599 --> 00:03:05,200
that we have all of these challenges of

76
00:03:02,319 --> 00:03:07,120
scale

77
00:03:05,200 --> 00:03:09,599
and reliability needs and future

78
00:03:07,120 --> 00:03:12,000
needs and we've developed these layers

79
00:03:09,599 --> 00:03:13,360
of infrastructure to try to

80
00:03:12,000 --> 00:03:15,200
make it easier

81
00:03:13,360 --> 00:03:16,720
but we've also made it hard to fit that

82
00:03:15,200 --> 00:03:18,480
all into the head of one individual

83
00:03:16,720 --> 00:03:20,879
person

84
00:03:18,480 --> 00:03:23,120
that with microservices the theory was

85
00:03:20,879 --> 00:03:25,040
that we could decouple our services from

86
00:03:23,120 --> 00:03:27,440
each other and decouple the release and

87
00:03:25,040 --> 00:03:29,120
launch cycles so that each team could

88
00:03:27,440 --> 00:03:31,200
ship independently

89
00:03:29,120 --> 00:03:33,040
but that now means that a service three

90
00:03:31,200 --> 00:03:34,959
layers away from you can break you in a

91
00:03:33,040 --> 00:03:37,040
way that you didn't anticipate in a way

92
00:03:34,959 --> 00:03:39,680
that was not a failure mode of a more

93
00:03:37,040 --> 00:03:41,599
tightly coupled system

94
00:03:39,680 --> 00:03:44,159
we also have problems with big data

95
00:03:41,599 --> 00:03:47,040
where big data can result in increasing

96
00:03:44,159 --> 00:03:49,040
demands upon compute and networking and

97
00:03:47,040 --> 00:03:52,080
we have to be able to scale our systems

98
00:03:49,040 --> 00:03:53,840
to meet and address that demand

99
00:03:52,080 --> 00:03:56,480
so some of the systems that we're

100
00:03:53,840 --> 00:03:58,879
creating add complexity in order to try

101
00:03:56,480 --> 00:04:01,920
to solve some of these challenges

102
00:03:58,879 --> 00:04:04,319
but when the systems don't work exactly

103
00:04:01,920 --> 00:04:06,480
as planned for instance as the audio

104
00:04:04,319 --> 00:04:08,799
feed this morning did not necessarily

105
00:04:06,480 --> 00:04:10,720
work right it takes a substantial amount

106
00:04:08,799 --> 00:04:13,680
of effort to try to figure out what's

107
00:04:10,720 --> 00:04:15,200
going on and how do we fix it

108
00:04:13,680 --> 00:04:18,320
so that's what the subject of today's

109
00:04:15,200 --> 00:04:21,519
talk is how do we fix our problems with

110
00:04:18,320 --> 00:04:23,360
more confidence more quickly

111
00:04:21,519 --> 00:04:25,120
but let's start with defining what the

112
00:04:23,360 --> 00:04:26,840
problem is

113
00:04:25,120 --> 00:04:30,000
what does uptime

114
00:04:26,840 --> 00:04:33,360
mean when i was a wee game developer

115
00:04:30,000 --> 00:04:35,680
when i was about 16 or 17 years old

116
00:04:33,360 --> 00:04:38,320
i worked at a small game studio and we

117
00:04:35,680 --> 00:04:40,560
had the database and we had the game

118
00:04:38,320 --> 00:04:42,639
world server and if those were up

119
00:04:40,560 --> 00:04:45,280
everything was fine but if they were not

120
00:04:42,639 --> 00:04:46,960
up everything was 100% down

121
00:04:45,280 --> 00:04:49,840
but that's no longer the world we live

122
00:04:46,960 --> 00:04:51,840
in today where you may have hundreds or

123
00:04:49,840 --> 00:04:53,919
thousands of linux servers that are

124
00:04:51,840 --> 00:04:57,919
running as vms on amazon's

125
00:04:53,919 --> 00:05:00,080
infrastructure or in azure or on gcp

126
00:04:57,919 --> 00:05:01,680
and we can't wait until our customers

127
00:05:00,080 --> 00:05:03,120
complain to us and ring us up on the

128
00:05:01,680 --> 00:05:05,600
phone and say i'd like to cancel my

129
00:05:03,120 --> 00:05:07,360
subscription right like we would like to

130
00:05:05,600 --> 00:05:09,120
be a little bit more proactive at

131
00:05:07,360 --> 00:05:11,520
detecting when our customers are having

132
00:05:09,120 --> 00:05:13,440
problems without necessarily getting

133
00:05:11,520 --> 00:05:15,759
paged every single time a single server

134
00:05:13,440 --> 00:05:17,520
flaps

135
00:05:15,759 --> 00:05:18,960
there are all of these demands for

136
00:05:17,520 --> 00:05:21,199
reliability

137
00:05:18,960 --> 00:05:22,400
for features and

138
00:05:21,199 --> 00:05:23,759
for investment in future

139
00:05:22,400 --> 00:05:26,320
scalability

140
00:05:23,759 --> 00:05:28,880
it's a lot to deal with

141
00:05:26,320 --> 00:05:31,440
and honestly as someone who has been in

142
00:05:28,880 --> 00:05:34,160
the role of a hero in an organization

143
00:05:31,440 --> 00:05:36,639
who's the one person holding up the

144
00:05:34,160 --> 00:05:39,360
reliability on her shoulders

145
00:05:36,639 --> 00:05:41,360
it doesn't work not for you know

146
00:05:39,360 --> 00:05:43,039
a decade two decades you get really

147
00:05:41,360 --> 00:05:45,440
tired

148
00:05:43,039 --> 00:05:48,240
so what strategies can we employ in

149
00:05:45,440 --> 00:05:51,199
order to make our services more reliable

150
00:05:48,240 --> 00:05:54,240
and friendlier for the people

151
00:05:51,199 --> 00:05:56,160
so as miles introduced me yes hi my name

152
00:05:54,240 --> 00:05:58,400
is liz and i'm a principal developer

153
00:05:56,160 --> 00:06:00,800
advocate at honeycomb

154
00:05:58,400 --> 00:06:02,720
and honeycomb is a company that aims to

155
00:06:00,800 --> 00:06:04,639
help make developers

156
00:06:02,720 --> 00:06:08,560
happier and more productive through

157
00:06:04,639 --> 00:06:10,400
improving observability into production

158
00:06:08,560 --> 00:06:11,680
so here are some of the lessons i've

159
00:06:10,400 --> 00:06:14,000
learned from both working with

160
00:06:11,680 --> 00:06:16,000
honeycomb's clients as well as in my

161
00:06:14,000 --> 00:06:18,319
previous life at google on the customer

162
00:06:16,000 --> 00:06:21,520
reliability engineering team as well as

163
00:06:18,319 --> 00:06:23,039
on various other sre teams at google

164
00:06:21,520 --> 00:06:24,880
the lesson that i learned is that

165
00:06:23,039 --> 00:06:26,639
heroism doesn't work and that we need

166
00:06:24,880 --> 00:06:28,240
different strategies

167
00:06:26,639 --> 00:06:30,720
and that those strategies cannot

168
00:06:28,240 --> 00:06:33,840
necessarily just be buying tools and

169
00:06:30,720 --> 00:06:35,600
expecting them to fix things

170
00:06:33,840 --> 00:06:37,199
and i know that this is ironic because i

171
00:06:35,600 --> 00:06:39,520
do work at a company that sells

172
00:06:37,199 --> 00:06:43,199
developer tooling and here i am telling

173
00:06:39,520 --> 00:06:46,240
you don't buy devops right why is that

174
00:06:43,199 --> 00:06:47,919
the answer is that when you have all of

175
00:06:46,240 --> 00:06:49,680
these technologies that you're trying to

176
00:06:47,919 --> 00:06:51,599
integrate together that you're trying to

177
00:06:49,680 --> 00:06:53,039
make work together sometimes you're

178
00:06:51,599 --> 00:06:54,880
adding to your workload rather than

179
00:06:53,039 --> 00:06:57,120
reducing your workload

180
00:06:54,880 --> 00:06:58,400
when you actually try to glom on all

181
00:06:57,120 --> 00:07:01,199
these things that are supposed to make

182
00:06:58,400 --> 00:07:02,880
your developers' lives easier

183
00:07:01,199 --> 00:07:04,880
for instance

184
00:07:02,880 --> 00:07:06,319
you get told to write more tests but

185
00:07:04,880 --> 00:07:07,840
what happens if you add more tests and

186
00:07:06,319 --> 00:07:09,919
add more tests and add more tests and

187
00:07:07,840 --> 00:07:12,080
they start flaking all the time and then

188
00:07:09,919 --> 00:07:13,599
people ignore the red build because oh

189
00:07:12,080 --> 00:07:16,160
that's just flaky tests i'm just going

190
00:07:13,599 --> 00:07:17,599
to merge anyways right congratulations

191
00:07:16,160 --> 00:07:19,680
you've just made your problem harder not

192
00:07:17,599 --> 00:07:22,080
easier

193
00:07:19,680 --> 00:07:24,080
or let's talk about the drive recently

194
00:07:22,080 --> 00:07:26,400
to kind of encourage people to push to

195
00:07:24,080 --> 00:07:27,919
production quickly recklessly almost

196
00:07:26,400 --> 00:07:30,000
right what happens if your continuous

197
00:07:27,919 --> 00:07:32,479
integration and continuous delivery

198
00:07:30,000 --> 00:07:34,960
that's meant to ship [ __ ] fast

199
00:07:32,479 --> 00:07:37,120
ships [ __ ] fast

200
00:07:34,960 --> 00:07:37,919
oops

201
00:07:37,120 --> 00:07:39,520
or

202
00:07:37,919 --> 00:07:41,039
those of us who have been around the

203
00:07:39,520 --> 00:07:44,000
community for a long time you've

204
00:07:41,039 --> 00:07:45,919
probably run rm -rf

205
00:07:44,000 --> 00:07:47,360
a couple of times before

206
00:07:45,919 --> 00:07:49,360
right now imagine doing that to your

207
00:07:47,360 --> 00:07:50,960
entire production infrastructure well

208
00:07:49,360 --> 00:07:52,319
congratulations you can do that now with

209
00:07:50,960 --> 00:07:53,759
infrastructure as code you can delete

210
00:07:52,319 --> 00:07:56,800
your entire production infrastructure

211
00:07:53,759 --> 00:07:56,800
with one stray commit

212
00:07:56,840 --> 00:08:03,759
whoops or let's talk about kubernetes

213
00:08:00,960 --> 00:08:06,000
not everyone needs kubernetes it's a lot

214
00:08:03,759 --> 00:08:08,560
of complexity and it's only worth it to

215
00:08:06,000 --> 00:08:10,240
solve certain problems

216
00:08:08,560 --> 00:08:11,599
but the number one thing that i see

217
00:08:10,240 --> 00:08:14,479
people getting wrong

218
00:08:11,599 --> 00:08:16,319
is adopting the idea of production

219
00:08:14,479 --> 00:08:18,319
ownership or let's put everyone into

220
00:08:16,319 --> 00:08:20,240
pagerduty

221
00:08:18,319 --> 00:08:22,000
and i think that that can have some

222
00:08:20,240 --> 00:08:24,000
negative side effects let's talk about

223
00:08:22,000 --> 00:08:26,080
them

224
00:08:24,000 --> 00:08:27,759
when you put everyone on call for

225
00:08:26,080 --> 00:08:29,520
systems that they are not prepared to

226
00:08:27,759 --> 00:08:30,960
run right those of us who have scar

227
00:08:29,520 --> 00:08:33,360
tissue from years and years of being

228
00:08:30,960 --> 00:08:36,240
around production we can handle to an

229
00:08:33,360 --> 00:08:37,680
extent um being paged at 3am

230
00:08:36,240 --> 00:08:39,760
but let's suppose you're a brand new

231
00:08:37,680 --> 00:08:44,159
engineer and your first experience with

232
00:08:39,760 --> 00:08:47,200
on-call is being paged at 1 am 2 am 3 am

233
00:08:44,159 --> 00:08:48,000
multiple times per week

234
00:08:47,200 --> 00:08:50,399
and

235
00:08:48,000 --> 00:08:52,399
you eventually are going to say take me

236
00:08:50,399 --> 00:08:53,440
off this on call rotation or i quit

237
00:08:52,399 --> 00:08:55,760
right

238
00:08:53,440 --> 00:08:57,839
it's not a happy situation if you have a

239
00:08:55,760 --> 00:08:58,959
system that is continuously generating

240
00:08:57,839 --> 00:09:00,959
noise

241
00:08:58,959 --> 00:09:03,279
and even when you do try to debug it you

242
00:09:00,959 --> 00:09:05,600
don't know heads from tails where do i

243
00:09:03,279 --> 00:09:07,839
get started

244
00:09:05,600 --> 00:09:09,200
and if you have dashboards those

245
00:09:07,839 --> 00:09:11,600
dashboards are often a source of

246
00:09:09,200 --> 00:09:13,839
technical debt in and of themselves

247
00:09:11,600 --> 00:09:15,839
because they're chasing your last outage

248
00:09:13,839 --> 00:09:17,920
there are you know 20 different graphs

249
00:09:15,839 --> 00:09:19,120
on 20 different pages and you're trying

250
00:09:17,920 --> 00:09:21,519
to figure out which line wiggled at the same

251
00:09:19,120 --> 00:09:23,200
time as some other line right like

252
00:09:21,519 --> 00:09:25,040
this isn't actually helping you solve

253
00:09:23,200 --> 00:09:26,880
the problem all that's happening is

254
00:09:25,040 --> 00:09:28,880
you're spending time in a state of

255
00:09:26,880 --> 00:09:30,720
cognitive overload while your customers

256
00:09:28,880 --> 00:09:33,519
are suffering and incidents are taking

257
00:09:30,720 --> 00:09:35,440
forever to fix

258
00:09:33,519 --> 00:09:38,160
so finally you pick up the phone and you

259
00:09:35,440 --> 00:09:40,720
call someone like me or like miles and

260
00:09:38,160 --> 00:09:42,399
we you know tell you oh you just

261
00:09:40,720 --> 00:09:43,920
frob the thing restart it it'll be fine

262
00:09:42,399 --> 00:09:45,360
in the morning right

263
00:09:43,920 --> 00:09:47,200
except for someone like me or miles

264
00:09:45,360 --> 00:09:49,200
like we've gotten really tired of

265
00:09:47,200 --> 00:09:50,959
getting woken up every single week at

266
00:09:49,200 --> 00:09:54,080
least once a week even if we're not on

267
00:09:50,959 --> 00:09:54,080
call for years on end

268
00:09:54,160 --> 00:09:57,040
and finally

269
00:09:55,360 --> 00:09:59,120
you go back to sleep you wake up in the

270
00:09:57,040 --> 00:10:01,200
morning cup of coffee in the hand and

271
00:09:59,120 --> 00:10:03,200
you realize that you can't actually push

272
00:10:01,200 --> 00:10:04,399
a fix because someone has broken the build

273
00:10:03,200 --> 00:10:06,480
overnight

274
00:10:04,399 --> 00:10:08,160
and even though each individual set of

275
00:10:06,480 --> 00:10:09,760
unit tests passes

276
00:10:08,160 --> 00:10:12,160
the integration tests don't because

277
00:10:09,760 --> 00:10:13,680
there is a problem in the gaps between

278
00:10:12,160 --> 00:10:16,000
our services that is causing things to

279
00:10:13,680 --> 00:10:17,760
flake

280
00:10:16,000 --> 00:10:19,920
so this is what we describe in the

281
00:10:17,760 --> 00:10:22,320
field of site reliability engineering as

282
00:10:19,920 --> 00:10:24,320
a state of operational overload

283
00:10:22,320 --> 00:10:26,560
where there's no time to do projects and

284
00:10:24,320 --> 00:10:28,560
even if you did have time to do projects

285
00:10:26,560 --> 00:10:30,640
there's really not much of a plan of how

286
00:10:28,560 --> 00:10:32,959
do i get myself out of this situation

287
00:10:30,640 --> 00:10:35,040
how do i spend one or two spare hours in

288
00:10:32,959 --> 00:10:38,399
order to chip away at this pile of

289
00:10:35,040 --> 00:10:40,320
technical and operational debt

290
00:10:38,399 --> 00:10:42,480
so this can feel very very draining this

291
00:10:40,320 --> 00:10:43,600
can feel stressful right like many of us

292
00:10:42,480 --> 00:10:45,040
have been here

293
00:10:43,600 --> 00:10:46,720
and it can feel like you're barely

294
00:10:45,040 --> 00:10:49,279
hanging on to the edge of the cliff with

295
00:10:46,720 --> 00:10:51,519
your fingernails

296
00:10:49,279 --> 00:10:52,800
so how do we make this better what are

297
00:10:51,519 --> 00:10:55,839
we missing

298
00:10:52,800 --> 00:10:57,680
and why is it that tools don't help

299
00:10:55,839 --> 00:11:00,800
well i think the thing that we need to

300
00:10:57,680 --> 00:11:02,480
focus on is that people are the ones who

301
00:11:00,800 --> 00:11:05,040
operate your systems that you cannot

302
00:11:02,480 --> 00:11:07,600
have a healthy system without healthy

303
00:11:05,040 --> 00:11:10,640
people standing behind it

304
00:11:07,600 --> 00:11:12,240
for instance let's take me as an example

305
00:11:10,640 --> 00:11:15,120
every single time i've probably given

306
00:11:12,240 --> 00:11:17,040
this talk several dozen times now

307
00:11:15,120 --> 00:11:19,279
every time i go through this section my

308
00:11:17,040 --> 00:11:22,079
heart rate goes up right like i have all

309
00:11:19,279 --> 00:11:25,600
that pent-up anxiety about production

310
00:11:22,079 --> 00:11:28,560
outages of getting paged at 3 am

311
00:11:25,600 --> 00:11:30,480
and it gets really challenging right but

312
00:11:28,560 --> 00:11:32,079
i need to take a moment and focus on me

313
00:11:30,480 --> 00:11:34,240
i encourage you to focus on yourself

314
00:11:32,079 --> 00:11:35,839
right like to take care of your needs in

315
00:11:34,240 --> 00:11:37,200
my case i'm going to take a deep breath

316
00:11:35,839 --> 00:11:38,160
and i'm going to take a drink of

317
00:11:37,200 --> 00:11:40,079
water

318
00:11:38,160 --> 00:11:42,079
and that will help me deliver a better

319
00:11:40,079 --> 00:11:44,480
talk in the same way you should make

320
00:11:42,079 --> 00:11:46,240
sure to take a deep breath everything's

321
00:11:44,480 --> 00:11:47,600
going to be fine and drink water even

322
00:11:46,240 --> 00:11:51,800
when you're in the middle of an outage

323
00:11:47,600 --> 00:11:51,800
focus on the people first

324
00:11:58,399 --> 00:12:01,680
so much better right

325
00:12:00,079 --> 00:12:03,360
so you cannot run a healthy system

326
00:12:01,680 --> 00:12:06,000
without healthy people

327
00:12:03,360 --> 00:12:07,920
and that means that tools cannot fix a

328
00:12:06,000 --> 00:12:09,600
culture if your people are not healthy

329
00:12:07,920 --> 00:12:11,600
right

330
00:12:09,600 --> 00:12:13,519
that your tools can help automate

331
00:12:11,600 --> 00:12:15,519
processes that you already are inclined

332
00:12:13,519 --> 00:12:17,360
to do um but it can't fix a culture

333
00:12:15,519 --> 00:12:18,959
where people are blamed where people are

334
00:12:17,360 --> 00:12:20,079
on call all the time and stressed out

335
00:12:18,959 --> 00:12:21,680
right like

336
00:12:20,079 --> 00:12:24,240
tooling that generates more alerts is

337
00:12:21,680 --> 00:12:26,160
not going to be helpful there

338
00:12:24,240 --> 00:12:28,079
so what should we do instead

339
00:12:26,160 --> 00:12:29,680
i argue that we should invest in people

340
00:12:28,079 --> 00:12:32,000
culture and process

341
00:12:29,680 --> 00:12:33,920
that investing in those three things is

342
00:12:32,000 --> 00:12:37,600
how we get out of this mess of

343
00:12:33,920 --> 00:12:38,959
operational overload and production pain

344
00:12:37,600 --> 00:12:41,120
this is what i call production

345
00:12:38,959 --> 00:12:43,760
excellence the idea that our system

346
00:12:41,120 --> 00:12:46,160
should be more reliable and friendlier

347
00:12:43,760 --> 00:12:49,839
that we shouldn't have to feed computers

348
00:12:46,160 --> 00:12:51,600
with the blood and sweat of human beings

349
00:12:49,839 --> 00:12:54,079
that we have to plan intentionally and

350
00:12:51,600 --> 00:12:56,880
figure out how are we getting there

351
00:12:54,079 --> 00:12:58,560
what milestones are we aiming for

352
00:12:56,880 --> 00:13:00,639
and what key performance indicators are

353
00:12:58,560 --> 00:13:02,079
we measuring

354
00:13:00,639 --> 00:13:04,079
we also have to think about getting the

355
00:13:02,079 --> 00:13:05,360
right set of people into the room

356
00:13:04,079 --> 00:13:06,959
and this means

357
00:13:05,360 --> 00:13:08,880
a lot of the people who should be in

358
00:13:06,959 --> 00:13:11,600
this room are not necessarily in this

359
00:13:08,880 --> 00:13:13,760
room uh even at lca

360
00:13:11,600 --> 00:13:15,839
that we need to involve everyone from

361
00:13:13,760 --> 00:13:17,680
the business level to the support teams

362
00:13:15,839 --> 00:13:19,200
right we need to involve people who are

363
00:13:17,680 --> 00:13:21,040
even in sales right like we need to

364
00:13:19,200 --> 00:13:22,880
involve people all across the spectrum

365
00:13:21,040 --> 00:13:24,880
in order to make sure that we're aligned

366
00:13:22,880 --> 00:13:27,839
about delivering value to customers in a

367
00:13:24,880 --> 00:13:27,839
sustainable fashion

368
00:13:28,000 --> 00:13:30,560
so that means that we need a

369
00:13:28,959 --> 00:13:32,880
psychologically safe environment that

370
00:13:30,560 --> 00:13:34,320
people need to be able to contribute and

371
00:13:32,880 --> 00:13:36,000
that if you cannot feel safe to

372
00:13:34,320 --> 00:13:38,639
contribute none of the rest of this talk

373
00:13:36,000 --> 00:13:40,240
is going to make sense

374
00:13:38,639 --> 00:13:42,160
but let's talk about the four key

375
00:13:40,240 --> 00:13:44,639
elements of production excellence that

376
00:13:42,160 --> 00:13:46,720
will make your life easier

377
00:13:44,639 --> 00:13:48,399
first of all in order to operate an

378
00:13:46,720 --> 00:13:50,480
excellent system you need to know when

379
00:13:48,399 --> 00:13:52,480
that system is too broken

380
00:13:50,480 --> 00:13:55,360
when it's outside of the bounds of the

381
00:13:52,480 --> 00:13:57,360
normal expected system behavior

382
00:13:55,360 --> 00:13:59,199
secondly you need to be able to debug

383
00:13:57,360 --> 00:14:01,120
those failures in order to restore them

384
00:13:59,199 --> 00:14:02,720
to working order

385
00:14:01,120 --> 00:14:04,560
third you have to be able to collaborate

386
00:14:02,720 --> 00:14:06,240
across multiple teams

387
00:14:04,560 --> 00:14:08,560
in order to solve these problems across

388
00:14:06,240 --> 00:14:10,399
multiple microservices

389
00:14:08,560 --> 00:14:12,560
and then fourth and finally you need to

390
00:14:10,399 --> 00:14:14,480
close that feedback loop we don't live

391
00:14:12,560 --> 00:14:16,560
in the world of groundhog day it's not

392
00:14:14,480 --> 00:14:18,880
okay to repeat the same outages over and

393
00:14:16,560 --> 00:14:20,720
over so we need that feedback loop to

394
00:14:18,880 --> 00:14:22,880
solve the longer running issues that we

395
00:14:20,720 --> 00:14:24,639
face as engineers

396
00:14:22,880 --> 00:14:26,560
so if you do those four things you'll

397
00:14:24,639 --> 00:14:28,000
have a system that is much friendlier to

398
00:14:26,560 --> 00:14:30,320
your human beings who are operating the

399
00:14:28,000 --> 00:14:32,399
system

400
00:14:30,320 --> 00:14:34,320
so why did i say

401
00:14:32,399 --> 00:14:35,839
know when our systems are failing too much

402
00:14:34,320 --> 00:14:37,920
right why did i not say know when our

403
00:14:35,839 --> 00:14:39,519
systems are failing at all

404
00:14:37,920 --> 00:14:41,920
the answer is that

405
00:14:39,519 --> 00:14:44,480
if you got alerted every single time a

406
00:14:41,920 --> 00:14:46,160
packet dropped in the uh in the fiber

407
00:14:44,480 --> 00:14:48,399
optic channel between here and la right

408
00:14:46,160 --> 00:14:50,160
like you'd get paged all the time right

409
00:14:48,399 --> 00:14:53,040
our systems are always failing in small

410
00:14:50,160 --> 00:14:55,199
microscopic ways but we build redundancy

411
00:14:53,040 --> 00:14:57,120
into our systems in order to try to

412
00:14:55,199 --> 00:14:58,639
solve those problems

413
00:14:57,120 --> 00:15:00,480
so instead of thinking about are the

414
00:14:58,639 --> 00:15:02,560
systems failing at all right like is the

415
00:15:00,480 --> 00:15:04,399
lawn outside on the domain outside where

416
00:15:02,560 --> 00:15:06,800
i am right like is every single blade of

417
00:15:04,399 --> 00:15:09,279
grass on that lawn green no

418
00:15:06,800 --> 00:15:11,360
but overall does it look green enough

419
00:15:09,279 --> 00:15:13,279
yes right like it's soft it's lush i

420
00:15:11,360 --> 00:15:14,800
can go lie down and and have a picnic

421
00:15:13,279 --> 00:15:16,399
right like that i think is good

422
00:15:14,800 --> 00:15:17,680
enough

423
00:15:16,399 --> 00:15:19,600
so instead of saying you know is the

424
00:15:17,680 --> 00:15:22,880
system failing at all we need to talk

425
00:15:19,600 --> 00:15:24,560
about is the system failing too much

426
00:15:22,880 --> 00:15:26,160
and that means we need a quantitative

427
00:15:24,560 --> 00:15:28,480
measure of that that we can use to

428
00:15:26,160 --> 00:15:30,720
operate the system

429
00:15:28,480 --> 00:15:32,639
this is the idea from site reliability

430
00:15:30,720 --> 00:15:34,399
engineering at google the idea of the

431
00:15:32,639 --> 00:15:37,759
service level indicator and its

432
00:15:34,399 --> 00:15:40,000
companion the service level objective

433
00:15:37,759 --> 00:15:41,279
a service level indicator and service

434
00:15:40,000 --> 00:15:43,279
level objective

435
00:15:41,279 --> 00:15:45,920
what they represent is a common language

436
00:15:43,279 --> 00:15:47,600
between us as engineers our business

437
00:15:45,920 --> 00:15:49,759
stakeholders and our customers about

438
00:15:47,600 --> 00:15:51,920
what the expected level of reliability

439
00:15:49,759 --> 00:15:53,839
is for our services

440
00:15:51,920 --> 00:15:56,240
and we need to think about our services

441
00:15:53,839 --> 00:15:57,680
in terms of the broader context

442
00:15:56,240 --> 00:15:58,800
what is it that our customers are trying

443
00:15:57,680 --> 00:16:00,720
to achieve

444
00:15:58,800 --> 00:16:03,680
what is that critical user journey that

445
00:16:00,720 --> 00:16:03,680
people are trying to do

446
00:16:04,240 --> 00:16:08,720
for instance maybe it's people buying

447
00:16:06,800 --> 00:16:10,639
tickets to attend the outdoor film

448
00:16:08,720 --> 00:16:13,279
festival in sydney

449
00:16:10,639 --> 00:16:16,000
or maybe it's someone who is trying to

450
00:16:13,279 --> 00:16:18,160
place an order for uh for computer parts

451
00:16:16,000 --> 00:16:20,079
from scorptec right

452
00:16:18,160 --> 00:16:22,240
either way there's some action that a

453
00:16:20,079 --> 00:16:23,680
customer is trying to achieve and we

454
00:16:22,240 --> 00:16:25,199
need to make sure that we're measuring

455
00:16:23,680 --> 00:16:27,040
it to ensure that it is of a

456
00:16:25,199 --> 00:16:28,720
satisfactory level of performance and

457
00:16:27,040 --> 00:16:30,800
quality

458
00:16:28,720 --> 00:16:32,560
so we need to understand is a given

459
00:16:30,800 --> 00:16:33,519
interaction between our customers and

460
00:16:32,560 --> 00:16:35,040
ourselves

461
00:16:33,519 --> 00:16:36,720
creating a good or bad experience for

462
00:16:35,040 --> 00:16:38,079
that customer

463
00:16:36,720 --> 00:16:40,000
how do you know

464
00:16:38,079 --> 00:16:42,399
well this is where product managers and

465
00:16:40,000 --> 00:16:44,399
user experience researchers and customer

466
00:16:42,399 --> 00:16:46,639
success can really help you out to help

467
00:16:44,399 --> 00:16:50,240
you understand what differentiates a

468
00:16:46,639 --> 00:16:52,160
successful from a failing transaction

469
00:16:50,240 --> 00:16:54,720
or if you work in a field where you are

470
00:16:52,160 --> 00:16:56,079
able to experiment on your own systems

471
00:16:54,720 --> 00:16:58,240
and where you use those systems

472
00:16:56,079 --> 00:17:00,320
yourselves my main piece of advice is to

473
00:16:58,240 --> 00:17:02,399
do chaos engineering to deliberately

474
00:17:00,320 --> 00:17:04,079
slow down transactions

475
00:17:02,399 --> 00:17:06,400
so you can understand where does it

476
00:17:04,079 --> 00:17:09,280
start to feel sluggish and then set your

477
00:17:06,400 --> 00:17:11,280
threshold just shy of there

478
00:17:09,280 --> 00:17:13,199
because our goal is to give machines

479
00:17:11,280 --> 00:17:15,760
rules for deciding whether customers are

480
00:17:13,199 --> 00:17:18,319
having a good or bad experience

481
00:17:15,760 --> 00:17:19,360
and sure we're all used to trans-pacific

482
00:17:18,319 --> 00:17:21,199
latency

483
00:17:19,360 --> 00:17:22,079
but we can in general say you know hey

484
00:17:21,199 --> 00:17:24,319
like

485
00:17:22,079 --> 00:17:26,559
a request is successful if it responds

486
00:17:24,319 --> 00:17:28,640
in you know maybe not 100 milliseconds

487
00:17:26,559 --> 00:17:30,320
but 300 milliseconds right that's fast

488
00:17:28,640 --> 00:17:32,400
enough and then it has to return a

489
00:17:30,320 --> 00:17:36,080
success it can't return a fast fail is

490
00:17:32,400 --> 00:17:36,080
not a success to us as a customer

491
00:17:36,160 --> 00:17:39,600
and then we can think about what's the

492
00:17:37,840 --> 00:17:41,919
total denominator right how many

493
00:17:39,600 --> 00:17:44,480
eligible events did we see

494
00:17:41,919 --> 00:17:46,640
how many transactions were attempted

495
00:17:44,480 --> 00:17:47,919
and i don't just mean like you know just

496
00:17:46,640 --> 00:17:49,039
take everything that comes into your load

497
00:17:47,919 --> 00:17:50,960
balancer

498
00:17:49,039 --> 00:17:52,720
because there's often health check

499
00:17:50,960 --> 00:17:54,880
traffic there's often denial of service

500
00:17:52,720 --> 00:17:56,559
attack traffic right like we need to

501
00:17:54,880 --> 00:17:59,520
think only about the real customer

502
00:17:56,559 --> 00:18:01,760
traffic not synthetic events

503
00:17:59,520 --> 00:18:04,320
and then we can compute our actual

504
00:18:01,760 --> 00:18:06,080
achieved availability the number of good

505
00:18:04,320 --> 00:18:07,120
events divided by the number of eligible

506
00:18:06,080 --> 00:18:08,960
events

507
00:18:07,120 --> 00:18:10,799
and we can set a target that we can

508
00:18:08,960 --> 00:18:12,160
measure that is our service level

509
00:18:10,799 --> 00:18:14,720
objective

510
00:18:12,160 --> 00:18:16,320
so the service level indicator earlier

511
00:18:14,720 --> 00:18:18,799
helped separate good events from bad

512
00:18:16,320 --> 00:18:21,120
events but the service level objective

513
00:18:18,799 --> 00:18:23,360
helps us decide over a longer period of

514
00:18:21,120 --> 00:18:26,840
time what percentage of the events in

515
00:18:23,360 --> 00:18:29,679
the sli should be

516
00:18:26,840 --> 00:18:32,000
successful so for instance

517
00:18:29,679 --> 00:18:34,480
if the website that i'm operating was

518
00:18:32,000 --> 00:18:36,000
100% down today

519
00:18:34,480 --> 00:18:38,080
but i came to you and said you know what

520
00:18:36,000 --> 00:18:39,919
but tomorrow we're going to be 100% up

521
00:18:38,080 --> 00:18:41,600
you'd laugh at me right

522
00:18:39,919 --> 00:18:43,600
so it turns out that human beings have

523
00:18:41,600 --> 00:18:45,679
memories longer than a day so you need to

524
00:18:43,600 --> 00:18:47,360
think about measuring reliability and

525
00:18:45,679 --> 00:18:50,160
setting goals for it over a longer

526
00:18:47,360 --> 00:18:51,919
period like 30 or 90 days

527
00:18:50,160 --> 00:18:53,600
we also need to set an appropriate

528
00:18:51,919 --> 00:18:56,799
target percentage

529
00:18:53,600 --> 00:18:58,960
for instance i might aim to have 99.9% of

530
00:18:56,799 --> 00:19:00,960
events be successful over the past 30

531
00:18:58,960 --> 00:19:03,200
days where an event is successful if it

532
00:19:00,960 --> 00:19:07,120
was served in less than 300 milliseconds

533
00:19:03,200 --> 00:19:07,919
and with an http response code of 200.
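
A minimal sketch of the SLI and SLO just described, assuming each request is recorded with a latency, an HTTP status, and a flag marking synthetic traffic. The `Event` type and its field names are hypothetical; the 300 ms threshold, the 200 response code, and the 99.9% target are the figures from the talk.

```python
from dataclasses import dataclass

LATENCY_THRESHOLD_MS = 300   # "fast enough" threshold from the talk
SLO_TARGET = 0.999           # 99.9% of eligible events should be good


@dataclass
class Event:
    latency_ms: float
    status: int
    synthetic: bool = False  # hypothetical flag: health checks, attack traffic


def is_eligible(event: Event) -> bool:
    """Only real customer traffic counts toward the denominator."""
    return not event.synthetic


def is_good(event: Event) -> bool:
    """SLI: a success response served under the latency threshold."""
    return event.status == 200 and event.latency_ms < LATENCY_THRESHOLD_MS


def availability(events: list[Event]) -> float:
    """Achieved availability: good events divided by eligible events."""
    eligible = [e for e in events if is_eligible(e)]
    if not eligible:
        return 1.0  # no eligible traffic, so nothing has failed
    return sum(is_good(e) for e in eligible) / len(eligible)


events = [
    Event(120, 200),
    Event(450, 200),                # too slow, counts as bad
    Event(20, 500),                 # a fast failure is still a failure
    Event(5, 200, synthetic=True),  # health check, excluded entirely
    Event(250, 200),
]
achieved = availability(events)
print(f"achieved={achieved:.3f} meets_slo={achieved >= SLO_TARGET}")
```

In practice the list of events would cover the whole 30-day window rather than five sample requests.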

534
00:19:07,120 --> 00:19:10,280
so

535
00:19:07,919 --> 00:19:12,400
why did i not say aim for

536
00:19:10,280 --> 00:19:14,799
99.99999% why should i not have infinite

537
00:19:12,400 --> 00:19:16,160
nines right

538
00:19:14,799 --> 00:19:18,960
the answer is

539
00:19:16,160 --> 00:19:20,240
that every nine is additionally costly

540
00:19:18,960 --> 00:19:23,120
right

541
00:19:20,240 --> 00:19:24,799
and indeed we see this in australia with

542
00:19:23,120 --> 00:19:26,160
nbn right

543
00:19:24,799 --> 00:19:28,000
that sure

544
00:19:26,160 --> 00:19:29,679
we could spend

545
00:19:28,000 --> 00:19:31,679
trillions of dollars

546
00:19:29,679 --> 00:19:33,600
putting you know dozens of

547
00:19:31,679 --> 00:19:35,760
satellites into space

548
00:19:33,600 --> 00:19:37,840
in order to fulfill

549
00:19:35,760 --> 00:19:39,919
99.99999%

550
00:19:37,840 --> 00:19:42,080
population coverage

551
00:19:39,919 --> 00:19:43,919
or we could put two satellites into

552
00:19:42,080 --> 00:19:46,160
space and say you know what we're

553
00:19:43,919 --> 00:19:48,400
covering four nines of the population

554
00:19:46,160 --> 00:19:50,799
that's good enough right

555
00:19:48,400 --> 00:19:52,640
there is a cost to each incremental

556
00:19:50,799 --> 00:19:54,960
failure that you prevent

557
00:19:52,640 --> 00:19:56,880
and an incremental benefit associated

558
00:19:54,960 --> 00:19:58,000
with that individual failure that was

559
00:19:56,880 --> 00:20:00,400
prevented

560
00:19:58,000 --> 00:20:01,919
at some point those two lines cross over

561
00:20:00,400 --> 00:20:04,720
so it's up to you to figure out how

562
00:20:01,919 --> 00:20:06,799
critical is a service and then decide

563
00:20:04,720 --> 00:20:08,400
okay what is the level of reliability

564
00:20:06,799 --> 00:20:11,120
that we're aiming for and what's going

565
00:20:08,400 --> 00:20:13,039
to be cost effective

566
00:20:11,120 --> 00:20:14,559
now what do i do with the service level

567
00:20:13,039 --> 00:20:16,240
objective right all this is abstract

568
00:20:14,559 --> 00:20:17,919
right setting goals why do these goals

569
00:20:16,240 --> 00:20:20,000
matter how do i make my life easier as

570
00:20:17,919 --> 00:20:21,679
an operations engineer let me tell you

571
00:20:20,000 --> 00:20:23,840
why

572
00:20:21,679 --> 00:20:25,600
the answer is going back to that first

573
00:20:23,840 --> 00:20:27,840
point i made about operationally

574
00:20:25,600 --> 00:20:30,400
overloaded engineers

575
00:20:27,840 --> 00:20:33,120
often when you get paged at 2 am there's

576
00:20:30,400 --> 00:20:35,039
no actual end user impact it's just

577
00:20:33,120 --> 00:20:37,520
something paging you because the disk

578
00:20:35,039 --> 00:20:41,520
usage got to 90.1%

579
00:20:37,520 --> 00:20:43,919
90.01% oh no such an emergency right

580
00:20:41,520 --> 00:20:46,840
were any users impacted

581
00:20:43,919 --> 00:20:50,559
no was this worth waking someone up for

582
00:20:46,840 --> 00:20:53,120
no so we can instead think about using

583
00:20:50,559 --> 00:20:55,200
our measurements of what the reliability

584
00:20:53,120 --> 00:20:57,039
level experienced by our customers is

585
00:20:55,200 --> 00:20:59,919
and use that to decide is this important

586
00:20:57,039 --> 00:21:02,400
enough or not to wake a human up

587
00:20:59,919 --> 00:21:03,840
so for instance let's talk about

588
00:21:02,400 --> 00:21:06,000
the idea of the error budget which is

589
00:21:03,840 --> 00:21:09,600
the inverse of your service level

590
00:21:06,000 --> 00:21:11,280
objective if i'm targeting 99.9%

591
00:21:09,600 --> 00:21:14,080
reliability

592
00:21:11,280 --> 00:21:15,760
that means 1 in 1,000 requests are

593
00:21:14,080 --> 00:21:18,400
allowed to fail

594
00:21:15,760 --> 00:21:21,200
so if i'm serving 10 million requests

595
00:21:18,400 --> 00:21:23,360
per month that means 10,000 requests per

596
00:21:21,200 --> 00:21:25,120
month can fail

597
00:21:23,360 --> 00:21:27,919
and i can have a different degree of

598
00:21:25,120 --> 00:21:30,080
urgency based off of the error rate

599
00:21:27,919 --> 00:21:33,440
right it's no longer an emergency if one

600
00:21:30,080 --> 00:21:35,120
out of one requests fails at 2 am

601
00:21:33,440 --> 00:21:36,960
right instead i ask

602
00:21:35,120 --> 00:21:39,360
if i continue having this number of

603
00:21:36,960 --> 00:21:40,640
errors how many hours will it be until i

604
00:21:39,360 --> 00:21:43,039
run out

605
00:21:40,640 --> 00:21:45,360
if i'm serving a thousand bad requests

606
00:21:43,039 --> 00:21:47,360
per hour and i'm allowed to have 10,000

607
00:21:45,360 --> 00:21:49,039
bad requests per month i'm going to run

608
00:21:47,360 --> 00:21:50,720
out in 10 hours that's probably worth

609
00:21:49,039 --> 00:21:52,720
waking someone up for

610
00:21:50,720 --> 00:21:54,480
but if i'm only going to run out in days

611
00:21:52,720 --> 00:21:56,640
it can wait until the next business day

612
00:21:54,480 --> 00:21:58,000
it's okay
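
The error budget arithmetic above can be sketched as follows. The request counts are the ones from the talk; the 24-hour cutoff between paging someone and filing a ticket is an assumption chosen for illustration, not a figure the speaker gives.

```python
SLO_TARGET = 0.999  # 99.9% target from the talk


def error_budget(total_requests: int, slo_target: float = SLO_TARGET) -> float:
    """Number of requests allowed to fail over the SLO window."""
    return total_requests * (1 - slo_target)


def hours_until_exhausted(budget_remaining: float, bad_per_hour: float) -> float:
    """How long the remaining budget lasts at the current failure rate."""
    if bad_per_hour <= 0:
        return float("inf")
    return budget_remaining / bad_per_hour


def urgency(hours_left: float, page_within_hours: float = 24.0) -> str:
    """Hypothetical policy: page if the budget runs out within a day."""
    return "page" if hours_left <= page_within_hours else "ticket"


budget = error_budget(10_000_000)             # roughly 10,000 requests
hours = hours_until_exhausted(budget, 1_000)  # roughly 10 hours
print(round(budget), round(hours, 1), urgency(hours))
```

At 10 bad requests per hour the same budget lasts roughly 1,000 hours, which is longer than the month itself, so under this policy it becomes a ticket for the next business day.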

613
00:21:56,640 --> 00:21:59,919
and the beauty of this is that it

614
00:21:58,000 --> 00:22:02,880
catches subtle failures as well that are

615
00:21:59,919 --> 00:22:04,799
not entirely up or down

616
00:22:02,880 --> 00:22:06,320
here's an example of an actual failure

617
00:22:04,799 --> 00:22:08,240
that we encountered at honeycomb when we

618
00:22:06,320 --> 00:22:11,360
were first turning on service level

619
00:22:08,240 --> 00:22:13,280
objectives for our own services

620
00:22:11,360 --> 00:22:14,880
we started having an intermittent brownout

621
00:22:13,280 --> 00:22:14,880
where two percent of our traffic

622
00:22:14,880 --> 00:22:19,760
would fail for 20 minutes at a time

623
00:22:16,799 --> 00:22:21,440
repeating every three hours

624
00:22:19,760 --> 00:22:23,600
now it turns out that there is a memory

625
00:22:21,440 --> 00:22:25,280
leak and all of our servers had started

626
00:22:23,600 --> 00:22:26,720
at the same time and were crashing and

627
00:22:25,280 --> 00:22:28,559
stampeding at the same time but we

628
00:22:26,720 --> 00:22:29,919
didn't know that at the time

629
00:22:28,559 --> 00:22:32,880
but what we did know was that our

630
00:22:29,919 --> 00:22:35,039
service level objective fired saying

631
00:22:32,880 --> 00:22:36,799
that too many users were experiencing

632
00:22:35,039 --> 00:22:38,080
problems uploading their telemetry to

633
00:22:36,799 --> 00:22:39,840
honeycomb

634
00:22:38,080 --> 00:22:42,320
whereas our conventional black box

635
00:22:39,840 --> 00:22:44,640
alerts were not able to catch the issue

636
00:22:42,320 --> 00:22:46,080
because of the fact that they wait for

637
00:22:44,640 --> 00:22:48,080
two consecutive

638
00:22:46,080 --> 00:22:49,919
probes in a row to fail and we did not

639
00:22:48,080 --> 00:22:52,159
have two consecutive probe failures in a

640
00:22:49,919 --> 00:22:52,159
row
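
A back-of-the-envelope check of why the two kinds of alert behaved differently, using the brownout figures from the talk (2% of traffic failing for 20 minutes, repeating every three hours) and assuming a 99.9% target. Two further assumptions are mine: traffic is spread uniformly over time, and each probe fails independently with the same probability as a user request.

```python
SLO_TARGET = 0.999                   # assumed target, as in the earlier example
FAILURE_RATE_DURING_BROWNOUT = 0.02  # two percent of traffic failing
BROWNOUT_MINUTES = 20
CYCLE_MINUTES = 3 * 60               # repeating every three hours

# Average fraction of requests failing across a whole three-hour cycle.
average_error_rate = (
    FAILURE_RATE_DURING_BROWNOUT * BROWNOUT_MINUTES / CYCLE_MINUTES
)

# Fraction of requests the SLO allows to fail.
allowed_error_rate = 1 - SLO_TARGET

# Burn rate above 1 means the budget runs out before the window ends.
burn_rate = average_error_rate / allowed_error_rate

# Chance that two back-to-back probes both fail during a brownout.
p_two_consecutive = FAILURE_RATE_DURING_BROWNOUT ** 2

print(f"average error rate: {average_error_rate:.4%}")
print(f"burn rate: {burn_rate:.2f}x")
print(f"two consecutive probe failures: {p_two_consecutive:.4%}")
```

Under these assumptions the budget is being spent at a bit more than twice the sustainable rate, while any given pair of consecutive probes almost never fails together, which is consistent with the SLO alert firing and the probe-based alert staying quiet.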

641
00:22:52,240 --> 00:22:56,240
so besides making your alerting much

642
00:22:54,559 --> 00:22:58,000
higher fidelity the other thing that

643
00:22:56,240 --> 00:23:00,880
service level objectives do is that they

644
00:22:58,000 --> 00:23:03,200
help you navigate the tension between

645
00:23:00,880 --> 00:23:04,559
product development and reliability

646
00:23:03,200 --> 00:23:06,080
instead of having product managers

647
00:23:04,559 --> 00:23:08,159
saying ship more features i want more

648
00:23:06,080 --> 00:23:10,400
features right like now you have a

649
00:23:08,159 --> 00:23:12,400
framework to treat reliability as a

650
00:23:10,400 --> 00:23:14,159
product feature

651
00:23:12,400 --> 00:23:16,480
so for instance if you're having too

652
00:23:14,159 --> 00:23:18,000
much reliability that can almost be a

653
00:23:16,480 --> 00:23:19,919
bad thing because people's expectations

654
00:23:18,000 --> 00:23:21,600
will ratchet up and your competitors

655
00:23:19,919 --> 00:23:23,200
might be innovating faster than you if

656
00:23:21,600 --> 00:23:25,200
you're focusing on delivering

657
00:23:23,200 --> 00:23:26,559
reliability at the expense of product

658
00:23:25,200 --> 00:23:28,480
features

659
00:23:26,559 --> 00:23:30,559
so you might decide i'm going to do an

660
00:23:28,480 --> 00:23:32,080
experiment and i'm going to

661
00:23:30,559 --> 00:23:33,679
roll something out to one percent or two

662
00:23:32,080 --> 00:23:36,080
percent or five percent of my users

663
00:23:33,679 --> 00:23:37,600
knowing that even if it fails 100%

664
00:23:36,080 --> 00:23:39,600
you know of five percent

665
00:23:37,600 --> 00:23:40,720
of users you can roll it back within

666
00:23:39,600 --> 00:23:44,240
five minutes and you're not going to

667
00:23:40,720 --> 00:23:45,679
damage that many people's experiences
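
To make the cost of such an experiment concrete, here is a rough estimate of how much error budget a failed rollout spends. It assumes a 99.9% target over a 30-day window and traffic spread evenly across users and time; the function name and parameters are hypothetical.

```python
SLO_TARGET = 0.999
WINDOW_MINUTES = 30 * 24 * 60  # 30-day SLO window


def budget_fraction_spent(
    user_fraction: float, failure_rate: float, minutes: float
) -> float:
    """Share of the window's error budget consumed by one incident."""
    bad_fraction = user_fraction * failure_rate * minutes / WINDOW_MINUTES
    return bad_fraction / (1 - SLO_TARGET)


# 5% of users failing 100% of the time, rolled back after 5 minutes.
spent = budget_fraction_spent(user_fraction=0.05, failure_rate=1.0, minutes=5)
print(f"{spent:.2%} of the monthly error budget")
```

Under these assumptions the failed experiment costs well under one percent of the monthly budget, which is the quantitative version of "not going to damage that many people's experiences".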

668
00:23:44,240 --> 00:23:47,600
but conversely if you had a set of

669
00:23:45,679 --> 00:23:49,120
really bad outages recently

670
00:23:47,600 --> 00:23:50,799
you can think instead about how do we

671
00:23:49,120 --> 00:23:52,240
invest in more reliability how do we

672
00:23:50,799 --> 00:23:54,400
make the business case right the answer

673
00:23:52,240 --> 00:23:56,159
is we're failing our objectives

674
00:23:54,400 --> 00:23:58,000
therefore our customers are at risk of

675
00:23:56,159 --> 00:23:59,520
no longer trusting our service no

676
00:23:58,000 --> 00:24:01,360
feature that we ship is going to matter

677
00:23:59,520 --> 00:24:03,600
unless we bring reliability back up to

678
00:24:01,360 --> 00:24:03,600
par

679
00:24:03,760 --> 00:24:07,279
now you don't have to have super complex

680
00:24:05,840 --> 00:24:08,640
slos to start with even your load

681
00:24:07,279 --> 00:24:10,320
balancer logs are great because they

682
00:24:08,640 --> 00:24:11,600
help you understand

683
00:24:10,320 --> 00:24:13,760
what is it that your customers are

684
00:24:11,600 --> 00:24:15,520
experiencing from a neutral-ish point of

685
00:24:13,760 --> 00:24:16,400
view right it helps you measure what you

686
00:24:15,520 --> 00:24:18,559
can

687
00:24:16,400 --> 00:24:20,720
in order to deliver a better customer

688
00:24:18,559 --> 00:24:22,720
experience and actually put yourselves

689
00:24:20,720 --> 00:24:24,240
in the shoes of those users

690
00:24:22,720 --> 00:24:25,919
and over time you can iterate to make

691
00:24:24,240 --> 00:24:27,520
those slos better right like you can

692
00:24:25,919 --> 00:24:29,360
start incorporating things like real

693
00:24:27,520 --> 00:24:31,039
user monitoring from your client devices

694
00:24:29,360 --> 00:24:32,480
right like there are a million things

695
00:24:31,039 --> 00:24:34,880
you can do

696
00:24:32,480 --> 00:24:36,559
but start by meeting your user needs and

697
00:24:34,880 --> 00:24:38,159
then if users are having problems that

698
00:24:36,559 --> 00:24:40,720
your slo isn't firing you need to

699
00:24:38,159 --> 00:24:42,880
correct your slo

700
00:24:40,720 --> 00:24:44,960
so slos help you reduce alerting noise

701
00:24:42,880 --> 00:24:46,400
and really help you hone in only on what

702
00:24:44,960 --> 00:24:49,039
matters but that's only half of the

703
00:24:46,400 --> 00:24:51,440
story to me why

704
00:24:49,039 --> 00:24:53,039
because we don't just need to focus on

705
00:24:51,440 --> 00:24:55,200
the issue of operator burnout we also

706
00:24:53,039 --> 00:24:57,039
need to focus on restoring customer

707
00:24:55,200 --> 00:24:58,799
experience as quickly as possible when

708
00:24:57,039 --> 00:25:00,799
we do confirm that there's a genuine

709
00:24:58,799 --> 00:25:03,039
issue

710
00:25:00,799 --> 00:25:05,120
so let's talk about how that works

711
00:25:03,039 --> 00:25:06,720
our outages are never exactly identical

712
00:25:05,120 --> 00:25:08,559
right that we

713
00:25:06,720 --> 00:25:10,480
always have if we're doing our job as

714
00:25:08,559 --> 00:25:12,000
engineers we're always going to have

715
00:25:10,480 --> 00:25:14,320
these new kinds of failures that are

716
00:25:12,000 --> 00:25:15,360
happening that we didn't anticipate

717
00:25:14,320 --> 00:25:16,799
because you wouldn't have written the

718
00:25:15,360 --> 00:25:18,640
bug in the first place if you knew it

719
00:25:16,799 --> 00:25:20,240
was going to break so therefore by

720
00:25:18,640 --> 00:25:21,600
definition anything that goes wrong in

721
00:25:20,240 --> 00:25:23,840
production is going to be something

722
00:25:21,600 --> 00:25:25,520
unexpected

723
00:25:23,840 --> 00:25:27,200
and not only that

724
00:25:25,520 --> 00:25:29,360
it may be something that is challenging

725
00:25:27,200 --> 00:25:31,279
or hard to reproduce in staging because

726
00:25:29,360 --> 00:25:32,559
staging is not production

727
00:25:31,279 --> 00:25:34,480
right staging doesn't have the same

728
00:25:32,559 --> 00:25:37,120
scale and it's a futile waste of money

729
00:25:34,480 --> 00:25:39,120
to make staging look like production

730
00:25:37,120 --> 00:25:40,720
we have to be able to debug things as

731
00:25:39,120 --> 00:25:42,559
they're happening live in production

732
00:25:40,720 --> 00:25:44,240
rather than waiting weeks to reproduce

733
00:25:42,559 --> 00:25:45,919
them

734
00:25:44,240 --> 00:25:48,000
and we have to avoid creating these

735
00:25:45,919 --> 00:25:51,360
silos between different tooling across

736
00:25:48,000 --> 00:25:52,880
different uh problem domains services uh

737
00:25:51,360 --> 00:25:55,120
environments right like we have to be

738
00:25:52,880 --> 00:25:56,720
able to use the same tooling to very

739
00:25:55,120 --> 00:25:58,559
quickly iterate and understand what's

740
00:25:56,720 --> 00:26:00,080
happening

741
00:25:58,559 --> 00:26:02,080
now the thing that i've discovered is

742
00:26:00,080 --> 00:26:04,240
that in my 15 years as a site

743
00:26:02,080 --> 00:26:05,760
reliability engineer

744
00:26:04,240 --> 00:26:07,520
when we have an outage when something is

745
00:26:05,760 --> 00:26:08,960
bumped in the middle of the night

746
00:26:07,520 --> 00:26:11,279
the thing that takes the longest is

747
00:26:08,960 --> 00:26:13,120
figuring out what do we think is going

748
00:26:11,279 --> 00:26:14,880
wrong and how can we verify that as

749
00:26:13,120 --> 00:26:16,720
quickly as possible that's what takes

750
00:26:14,880 --> 00:26:19,679
the most time

751
00:26:16,720 --> 00:26:20,880
or kind of going to um fighter pilot

752
00:26:19,679 --> 00:26:23,279
school for a moment right like there's

753
00:26:20,880 --> 00:26:25,760
this idea in fighter pilot school of uh

754
00:26:23,279 --> 00:26:27,440
observe orient decide act

755
00:26:25,760 --> 00:26:29,520
right the ooda loop

756
00:26:27,440 --> 00:26:31,600
so we have to think about how do we

757
00:26:29,520 --> 00:26:33,279
accelerate that orient and observe part

758
00:26:31,600 --> 00:26:35,600
as quickly as possible

759
00:26:33,279 --> 00:26:37,520
how do we actually explore that data in

760
00:26:35,600 --> 00:26:39,440
order to ask new questions rather than

761
00:26:37,520 --> 00:26:40,960
just leafing through the existing

762
00:26:39,440 --> 00:26:43,600
dashboards that showed us the questions

763
00:26:40,960 --> 00:26:45,360
we thought to ask before

764
00:26:43,600 --> 00:26:47,440
all this is to say that our services

765
00:26:45,360 --> 00:26:49,760
must be observable

766
00:26:47,440 --> 00:26:51,919
in control theory observability is the

767
00:26:49,760 --> 00:26:55,120
ability to based off of the outputs of a

768
00:26:51,919 --> 00:26:56,799
system infer its inner state

769
00:26:55,120 --> 00:26:59,120
but i prefer to in the systems

770
00:26:56,799 --> 00:27:00,240
engineering context of computer systems

771
00:26:59,120 --> 00:27:02,880
think about

772
00:27:00,240 --> 00:27:04,320
how do we ask and answer unknown

773
00:27:02,880 --> 00:27:05,679
unknowns things that we didn't

774
00:27:04,320 --> 00:27:07,279
anticipate would break in the first

775
00:27:05,679 --> 00:27:09,279
place

776
00:27:07,279 --> 00:27:10,720
this requires us to be able to examine

777
00:27:09,279 --> 00:27:12,720
the events that are happening inside of

778
00:27:10,720 --> 00:27:14,960
our system in their full context to

779
00:27:12,720 --> 00:27:18,159
understand properties like

780
00:27:14,960 --> 00:27:20,240
you know which request is this from

781
00:27:18,159 --> 00:27:22,720
this happens to be some of the innards

782
00:27:20,240 --> 00:27:25,120
of our uh query engine

783
00:27:22,720 --> 00:27:27,840
and we can understand things like

784
00:27:25,120 --> 00:27:29,919
for this query which service issued it

785
00:27:27,840 --> 00:27:32,799
um you know was it an end user was it

786
00:27:29,919 --> 00:27:35,120
was it our alerting service

787
00:27:32,799 --> 00:27:36,960
and where do we get slow right did it

788
00:27:35,120 --> 00:27:39,600
get slow waiting on aws lambda did it

789
00:27:36,960 --> 00:27:42,080
get slow uh reading files off of disk um

790
00:27:39,600 --> 00:27:44,240
was it blocked on cpu

791
00:27:42,080 --> 00:27:45,840
and which user is it right like not

792
00:27:44,240 --> 00:27:47,760
just kind of mushing all of our users

793
00:27:45,840 --> 00:27:49,200
together but instead thinking about how

794
00:27:47,760 --> 00:27:50,799
can we break apart and understand the

795
00:27:49,200 --> 00:27:52,640
behavior of each individual user in

796
00:27:50,799 --> 00:27:53,840
isolation

797
00:27:52,640 --> 00:27:54,960
we have to be able to explain the

798
00:27:53,840 --> 00:27:56,320
variance we have to be able to

799
00:27:54,960 --> 00:27:58,640
understand

800
00:27:56,320 --> 00:28:01,520
what separates a good user experience

801
00:27:58,640 --> 00:28:03,679
right a succeeding sli from a bad user

802
00:28:01,520 --> 00:28:05,600
experience a failing sli

803
00:28:03,679 --> 00:28:07,520
so for instance this is one way that you

804
00:28:05,600 --> 00:28:10,559
might visualize this data right to look

805
00:28:07,520 --> 00:28:11,760
at how can i see what dimensions are

806
00:28:10,559 --> 00:28:13,840
different between the succeeding and

807
00:28:11,760 --> 00:28:15,760
failing requests
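
One simple way to look for the dimensions that differ between succeeding and failing requests is to compare how often each field value appears in each group. This is only an illustrative sketch, not how any particular tool does it; the event fields (`status`, `build`, `region`) and the function name are made up for the example.

```python
from collections import Counter


def dimension_differences(events, is_good, ignore=(), top=3):
    """Rank (field, value) pairs by how over-represented they are in failures."""
    good = [e for e in events if is_good(e)]
    bad = [e for e in events if not is_good(e)]
    if not good or not bad:
        return []  # nothing to compare against

    def shares(group):
        # Fraction of events in the group carrying each (field, value) pair.
        counts = Counter(
            (k, v) for e in group for k, v in e.items() if k not in ignore
        )
        return {pair: n / len(group) for pair, n in counts.items()}

    good_share, bad_share = shares(good), shares(bad)
    diffs = [
        (pair, bad_share.get(pair, 0.0) - good_share.get(pair, 0.0))
        for pair in set(good_share) | set(bad_share)
    ]
    diffs.sort(key=lambda item: item[1], reverse=True)
    return diffs[:top]


events = [
    {"status": 200, "build": "v1", "region": "syd"},
    {"status": 200, "build": "v1", "region": "mel"},
    {"status": 500, "build": "v2", "region": "syd"},
    {"status": 500, "build": "v2", "region": "mel"},
    {"status": 200, "build": "v1", "region": "syd"},
]
result = dimension_differences(
    events, is_good=lambda e: e["status"] == 200, ignore={"status"}
)
for (field, value), diff in result:
    print(f"{field}={value}: {diff:+.2f}")
```

In this toy data every failing request carries `build=v2` and no succeeding request does, so that pair ranks first, while region barely separates the two groups.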

808
00:28:13,840 --> 00:28:17,440
but even better yet why should we have

809
00:28:15,760 --> 00:28:18,960
to do this at 2am when we're at our

810
00:28:17,440 --> 00:28:21,279
worst

811
00:28:18,960 --> 00:28:23,760
i do not think that we should need to be

812
00:28:21,279 --> 00:28:27,200
doing debugging at 2am

813
00:28:23,760 --> 00:28:28,640
it's about mitigating impact first

814
00:28:27,200 --> 00:28:30,720
as long as you're collecting enough

815
00:28:28,640 --> 00:28:32,880
telemetry along the way and you don't

816
00:28:30,720 --> 00:28:34,559
have to catch it live in the act

817
00:28:32,880 --> 00:28:36,000
then you can roll back the bad release

818
00:28:34,559 --> 00:28:37,840
drain the bad data center and then you

819
00:28:36,000 --> 00:28:40,159
can debug it sitting cup of coffee in

820
00:28:37,840 --> 00:28:42,320
hand at 9am instead i think that's a lot

821
00:28:40,159 --> 00:28:44,399
better

822
00:28:42,320 --> 00:28:46,640
but to me observability is not just a

823
00:28:44,399 --> 00:28:48,799
benefit to the break fix it's not just a

824
00:28:46,640 --> 00:28:50,240
benefit to ops it's a benefit to

825
00:28:48,799 --> 00:28:52,320
everyone

826
00:28:50,240 --> 00:28:54,000
because observability helps us gain

827
00:28:52,320 --> 00:28:56,159
better confidence in what our code is

828
00:28:54,000 --> 00:28:57,520
doing in the first place why are tests

829
00:28:56,159 --> 00:28:59,679
failing

830
00:28:57,520 --> 00:29:01,679
or why is it that our deployment loop is

831
00:28:59,679 --> 00:29:03,919
taking two hours to run where it

832
00:29:01,679 --> 00:29:05,760
previously took 30 minutes

833
00:29:03,919 --> 00:29:07,200
or what are these users actually doing

834
00:29:05,760 --> 00:29:09,120
with these features that we released

835
00:29:07,200 --> 00:29:11,840
last monday which users are making the

836
00:29:09,120 --> 00:29:13,600
most use of those features

837
00:29:11,840 --> 00:29:15,120
and do i have any single points of

838
00:29:13,600 --> 00:29:16,559
failure in my system or circular

839
00:29:15,120 --> 00:29:18,320
dependencies that i should be working to

840
00:29:16,559 --> 00:29:19,919
fix over the longer term

841
00:29:18,320 --> 00:29:21,840
those are all use cases that i think

842
00:29:19,919 --> 00:29:24,640
everyone can benefit from not just the

843
00:29:21,840 --> 00:29:26,080
people who are on call

844
00:29:24,640 --> 00:29:27,919
the other misconception that i often

845
00:29:26,080 --> 00:29:29,279
hear is that observability is logs

846
00:29:27,919 --> 00:29:31,360
traces and metrics that it's kind of

847
00:29:29,279 --> 00:29:33,440
these three pillars

848
00:29:31,360 --> 00:29:35,679
and that's not true to me

849
00:29:33,440 --> 00:29:37,399
observability as i described is a

850
00:29:35,679 --> 00:29:40,159
capability it is in fact a

851
00:29:37,399 --> 00:29:42,559
socio-technical capability it's one

852
00:29:40,159 --> 00:29:44,320
where the humans have to be able to use

853
00:29:42,559 --> 00:29:46,880
the tools appropriately to be able to

854
00:29:44,320 --> 00:29:48,799
answer their own questions

855
00:29:46,880 --> 00:29:50,240
and that means that it should be as easy

856
00:29:48,799 --> 00:29:51,919
to add

857
00:29:50,240 --> 00:29:53,520
the necessary debugging and

858
00:29:51,919 --> 00:29:55,520
instrumentation

859
00:29:53,520 --> 00:29:57,360
as it is to add a printf debug line

860
00:29:55,520 --> 00:29:59,279
because as much as i have done a lot of

861
00:29:57,360 --> 00:30:01,039
printf debugging and got here before

862
00:29:59,279 --> 00:30:02,720
like that's not a scalable method for

863
00:30:01,039 --> 00:30:05,120
working in prod how do we make it as

864
00:30:02,720 --> 00:30:07,360
easy to add good observability as it is

865
00:30:05,120 --> 00:30:09,120
to do a printf

866
00:30:07,360 --> 00:30:11,279
how do we send that data and store it in

867
00:30:09,120 --> 00:30:13,840
an economical fashion and finally most

868
00:30:11,279 --> 00:30:15,919
importantly can you actually query that

869
00:30:13,840 --> 00:30:17,600
data if you cannot actually query that

870
00:30:15,919 --> 00:30:18,640
data and answer your unknown unknown

871
00:30:17,600 --> 00:30:20,159
questions

872
00:30:18,640 --> 00:30:24,159
you don't have observability you just

873
00:30:20,159 --> 00:30:26,240
have a very expensive /dev/null

874
00:30:24,159 --> 00:30:28,799
so hopefully this elucidates for you why

875
00:30:26,240 --> 00:30:31,039
slos and observability go together

876
00:30:28,799 --> 00:30:32,880
because slos help you understand when

877
00:30:31,039 --> 00:30:34,720
things have gone too wrong and

878
00:30:32,880 --> 00:30:37,600
observability helps you piece together

879
00:30:34,720 --> 00:30:39,919
why so you can remediate the issue

880
00:30:37,600 --> 00:30:40,960
and then finally deliver a lasting fix

881
00:30:39,919 --> 00:30:44,240
to it

882
00:30:40,960 --> 00:30:46,000
as soon as it's convenient to you

883
00:30:44,240 --> 00:30:47,360
but we need to talk about the two other

884
00:30:46,000 --> 00:30:49,679
elements of production excellence

885
00:30:47,360 --> 00:30:51,760
besides slos and observability we need

886
00:30:49,679 --> 00:30:53,760
to talk about collaboration

887
00:30:51,760 --> 00:30:55,840
because the reality is that no

888
00:30:53,760 --> 00:30:58,080
individual human debugs and solves

889
00:30:55,840 --> 00:31:01,360
things alone no individual human is on

890
00:30:58,080 --> 00:31:03,360
call for services 24/7/365 that just is

891
00:31:01,360 --> 00:31:05,200
not sustainable anymore

892
00:31:03,360 --> 00:31:07,919
people deserve to go on vacations people

893
00:31:05,200 --> 00:31:09,440
deserve to uh retire right like people

894
00:31:07,919 --> 00:31:11,360
might

895
00:31:09,440 --> 00:31:13,279
be out on sick leave because of covid

896
00:31:11,360 --> 00:31:16,000
right we have to be able to allow that

897
00:31:13,279 --> 00:31:18,080
slack in our system

898
00:31:16,000 --> 00:31:20,000
so how do we make it possible for

899
00:31:18,080 --> 00:31:22,399
everyone to collaborate and debug

900
00:31:20,000 --> 00:31:24,559
together how do we raise every human

901
00:31:22,399 --> 00:31:25,919
being to the level of the best debugger

902
00:31:24,559 --> 00:31:28,320
on their team

903
00:31:25,919 --> 00:31:30,720
and how do we make sure that information

904
00:31:28,320 --> 00:31:33,039
is not lost as we cross organizational

905
00:31:30,720 --> 00:31:34,960
gaps

906
00:31:33,039 --> 00:31:36,559
we have to think about a broader set of

907
00:31:34,960 --> 00:31:39,279
users we need to think about the

908
00:31:36,559 --> 00:31:42,720
customer support agent or the uh or the

909
00:31:39,279 --> 00:31:44,399
cloud provider as equal customers to our

910
00:31:42,720 --> 00:31:46,720
engineering teams as far as who is a

911
00:31:44,399 --> 00:31:48,640
client of our debugging

912
00:31:46,720 --> 00:31:52,399
and are we working together have we

913
00:31:48,640 --> 00:31:54,240
practiced this at 3 pm and not at 3 am

914
00:31:52,399 --> 00:31:56,240
are we doing game days right are we

915
00:31:54,240 --> 00:31:58,840
doing disaster drills so that we work

916
00:31:56,240 --> 00:32:00,399
out those kinks rather than doing it

917
00:31:58,840 --> 00:32:02,320
live

918
00:32:00,399 --> 00:32:03,279
and we need to make sure that we have

919
00:32:02,320 --> 00:32:05,519
you know

920
00:32:03,279 --> 00:32:07,760
not service selfishness right service

921
00:32:05,519 --> 00:32:09,120
ownership does not mean selfishness

922
00:32:07,760 --> 00:32:10,720
we have to make sure that people

923
00:32:09,120 --> 00:32:12,320
understand that

924
00:32:10,720 --> 00:32:14,320
we are working together and that

925
00:32:12,320 --> 00:32:16,000
hoarding knowledge inside of your head

926
00:32:14,320 --> 00:32:17,760
you know sure

927
00:32:16,000 --> 00:32:20,559
many of us have read the uh bastard

928
00:32:17,760 --> 00:32:21,679
operator from hell the bofh comics

929
00:32:20,559 --> 00:32:23,600
right

930
00:32:21,679 --> 00:32:25,679
but you know that's not a model to

931
00:32:23,600 --> 00:32:27,440
follow right it's not okay to have your

932
00:32:25,679 --> 00:32:29,360
job security tied up in how much you

933
00:32:27,440 --> 00:32:31,200
alone know

934
00:32:29,360 --> 00:32:33,760
a modern system is built up of people

935
00:32:31,200 --> 00:32:35,760
who are working together

936
00:32:33,760 --> 00:32:38,080
my colleague jessica kerr calls this the

937
00:32:35,760 --> 00:32:40,080
idea of symmathesy the idea of learning

938
00:32:38,080 --> 00:32:42,080
systems that are learning together made

939
00:32:40,080 --> 00:32:44,000
up of humans and machines they're trying

940
00:32:42,080 --> 00:32:45,679
to do better and to iterate and to work

941
00:32:44,000 --> 00:32:47,679
better together and build better tools

942
00:32:45,679 --> 00:32:48,960
over time

943
00:32:47,679 --> 00:32:51,039
we have to be able to work together and

944
00:32:48,960 --> 00:32:52,080
lean on our teams right we have to be

945
00:32:51,039 --> 00:32:54,159
able to hand off on-call

946
00:32:52,080 --> 00:32:56,159
responsibilities we have to recognize

947
00:32:54,159 --> 00:32:58,000
that on-call is a team level

948
00:32:56,159 --> 00:32:59,679
responsibility not an individual level

949
00:32:58,000 --> 00:33:01,600
responsibility

950
00:32:59,679 --> 00:33:03,200
if someone is a new parent and they are

951
00:33:01,600 --> 00:33:06,240
having sleepless nights because of their

952
00:33:03,200 --> 00:33:08,320
child don't put them on call right

953
00:33:06,240 --> 00:33:10,240
while i do believe that every developer

954
00:33:08,320 --> 00:33:11,919
should have some exposure to production

955
00:33:10,240 --> 00:33:13,919
it doesn't have to take the form of

956
00:33:11,919 --> 00:33:16,000
on-call it can take the form of working

957
00:33:13,919 --> 00:33:18,000
tickets during normal business hours

958
00:33:16,000 --> 00:33:20,000
right it just involves some form of

959
00:33:18,000 --> 00:33:21,679
feedback loops that you are exposed to

960
00:33:20,000 --> 00:33:23,279
the consequences of the technical

961
00:33:21,679 --> 00:33:24,720
decisions of your team

962
00:33:23,279 --> 00:33:27,440
regardless of whether it's during

963
00:33:24,720 --> 00:33:29,440
working hours or not

964
00:33:27,440 --> 00:33:30,640
and we have to document things so that

965
00:33:29,440 --> 00:33:33,120
we're leaving

966
00:33:30,640 --> 00:33:35,120
as my colleague tanya reilly says we're

967
00:33:33,120 --> 00:33:36,240
leaving cookies for our future self not

968
00:33:35,120 --> 00:33:37,919
traps

969
00:33:36,240 --> 00:33:40,320
we have to keep things organized and not

970
00:33:37,919 --> 00:33:41,919
just you know have misleading documents

971
00:33:40,320 --> 00:33:43,279
that that lead us astray that have

972
00:33:41,919 --> 00:33:45,039
gotten out of date but we have to at

973
00:33:43,279 --> 00:33:45,840
least have these common patterns right

974
00:33:45,039 --> 00:33:48,159
of

975
00:33:45,840 --> 00:33:50,399
what is the service for how do i shut it

976
00:33:48,159 --> 00:33:52,000
off how do i update it what happens if

977
00:33:50,399 --> 00:33:55,279
this breaks right those are kind of the

978
00:33:52,000 --> 00:33:55,279
key things that we need to remember

979
00:33:55,440 --> 00:33:58,320
and we need to make sure that you know

980
00:33:56,799 --> 00:33:59,840
you don't have that single person like

981
00:33:58,320 --> 00:34:01,600
miles or myself you know who has been

982
00:33:59,840 --> 00:34:03,919
that hero for ages and ages right we

983
00:34:01,600 --> 00:34:05,679
have to share that knowledge and we have

984
00:34:03,919 --> 00:34:07,840
to make sure that people are able to

985
00:34:05,679 --> 00:34:09,599
collaborate and having common sources of

986
00:34:07,840 --> 00:34:11,280
data

987
00:34:09,599 --> 00:34:13,599
one common source of data that i want to

988
00:34:11,280 --> 00:34:15,679
point out is opentelemetry

989
00:34:13,599 --> 00:34:18,000
which is a vendor neutral standard that

990
00:34:15,679 --> 00:34:20,560
is developed by end users and multiple

991
00:34:18,000 --> 00:34:21,919
vendors working together in order to

992
00:34:20,560 --> 00:34:24,240
make sure that when you produce

993
00:34:21,919 --> 00:34:25,919
observability data that is a single

994
00:34:24,240 --> 00:34:27,200
source of truth that you can pipe to

995
00:34:25,919 --> 00:34:29,679
anywhere you like

996
00:34:27,200 --> 00:34:31,280
without experiencing lock-in and that's

997
00:34:29,679 --> 00:34:33,119
really powerful because it means that

998
00:34:31,280 --> 00:34:35,119
you no longer have data that is tied to

999
00:34:33,119 --> 00:34:37,280
one specific platform that for instance

1000
00:34:35,119 --> 00:34:39,440
no one else besides one team can access

1001
00:34:37,280 --> 00:34:41,200
or that you know you lose your data if

1002
00:34:39,440 --> 00:34:42,720
you wind up having to uh change

1003
00:34:41,200 --> 00:34:44,480
providers right like

1004
00:34:42,720 --> 00:34:45,839
having the shared understanding and

1005
00:34:44,480 --> 00:34:48,560
ground truth about what's happening in

1006
00:34:45,839 --> 00:34:49,839
our systems is really powerful

1007
00:34:48,560 --> 00:34:52,159
we also have to make sure that we're

1008
00:34:49,839 --> 00:34:54,639
rewarding curiosity and teamwork right

1009
00:34:52,159 --> 00:34:56,240
that instead of saying you know uh hey

1010
00:34:54,639 --> 00:34:57,920
jess i can't believe you didn't know

1011
00:34:56,240 --> 00:34:59,839
that right like do you think jess is gonna

1012
00:34:57,920 --> 00:35:01,040
ask me another question after that no

1013
00:34:59,839 --> 00:35:02,400
she's not

1014
00:35:01,040 --> 00:35:03,599
so instead we have to say things like

1015
00:35:02,400 --> 00:35:04,960
you know hey

1016
00:35:03,599 --> 00:35:06,560
thank you for asking the question i'm

1017
00:35:04,960 --> 00:35:08,240
sorry that the documentation wasn't

1018
00:35:06,560 --> 00:35:09,599
clear let's work together to document

1019
00:35:08,240 --> 00:35:11,839
that better so that no one else has to

1020
00:35:09,599 --> 00:35:14,079
stumble into that again

1021
00:35:11,839 --> 00:35:15,760
so by rewarding curiosity and teamwork

1022
00:35:14,079 --> 00:35:17,760
that really creates a healthier culture

1023
00:35:15,760 --> 00:35:19,280
of collaboration

1024
00:35:17,760 --> 00:35:21,119
but it's not just about collaborating

1025
00:35:19,280 --> 00:35:22,960
with your current colleagues it's also

1026
00:35:21,119 --> 00:35:25,280
about collaborating with your present

1027
00:35:22,960 --> 00:35:28,320
past and future self

1028
00:35:25,280 --> 00:35:29,760
that often you know we pull up git blame

1029
00:35:28,320 --> 00:35:31,520
and git blame says you know who that

1030
00:35:29,760 --> 00:35:33,520
idiot was who wrote that code that that

1031
00:35:31,520 --> 00:35:36,800
crashed

1032
00:35:33,520 --> 00:35:38,720
yeah it was me it was me again

1033
00:35:36,800 --> 00:35:40,400
right so you have to leave yourself

1034
00:35:38,720 --> 00:35:41,440
breadcrumbs and be kind to yourself

1035
00:35:40,400 --> 00:35:43,520
right to make sure that you're

1036
00:35:41,440 --> 00:35:45,040
documenting things for future you so

1037
00:35:43,520 --> 00:35:46,880
that you're not cursing future you in

1038
00:35:45,040 --> 00:35:49,040
the future but be kind to yourself about

1039
00:35:46,880 --> 00:35:50,560
it

1040
00:35:49,040 --> 00:35:53,119
but while i did say earlier that

1041
00:35:50,560 --> 00:35:55,200
outages are not exactly the same

1042
00:35:53,119 --> 00:35:56,960
there are common patterns that we often

1043
00:35:55,200 --> 00:35:58,400
see in outages and it behooves us as

1044
00:35:56,960 --> 00:36:00,400
engineers to think about closing

1045
00:35:58,400 --> 00:36:04,400
feedback loops and eliminating some of

1046
00:36:00,400 --> 00:36:06,320
the more common categories of outages

1047
00:36:04,400 --> 00:36:08,000
and we can think about employing risk

1048
00:36:06,320 --> 00:36:10,839
analysis to help us become more

1049
00:36:08,000 --> 00:36:12,400
proactive in approaching uh our failure

1050
00:36:10,839 --> 00:36:14,160
cases

1051
00:36:12,400 --> 00:36:17,359
so i have out my window the sydney

1052
00:36:14,160 --> 00:36:20,079
harbour bridge and let's suppose that um

1053
00:36:17,359 --> 00:36:21,599
there's been uh some lack of maintenance

1054
00:36:20,079 --> 00:36:23,920
and cars are falling through the road

1055
00:36:21,599 --> 00:36:26,240
bed and the sydney transport trains are

1056
00:36:23,920 --> 00:36:28,400
falling through it's no good right and

1057
00:36:26,240 --> 00:36:30,000
also we all know um especially in light

1058
00:36:28,400 --> 00:36:31,680
of the volcanic activity that at some

1059
00:36:30,000 --> 00:36:34,400
point you know we live along the pacific

1060
00:36:31,680 --> 00:36:37,599
rim the earthquake's gonna happen right

1061
00:36:34,400 --> 00:36:38,800
and finally it's probably long past time

1062
00:36:37,599 --> 00:36:41,359
that we took down the christmas lighting

1063
00:36:38,800 --> 00:36:42,800
on the sydney harbour bridge right

1064
00:36:41,359 --> 00:36:44,720
which one of those three things do we

1065
00:36:42,800 --> 00:36:46,480
address first

1066
00:36:44,720 --> 00:36:48,000
we fix the cars that are falling through

1067
00:36:46,480 --> 00:36:49,040
the road bed and the trains that are

1068
00:36:48,000 --> 00:36:50,560
traveling along the tracks they're

1069
00:36:49,040 --> 00:36:52,000
leading nowhere right like we fix that

1070
00:36:50,560 --> 00:36:55,119
first because it's having the highest

1071
00:36:52,000 --> 00:36:56,640
most critical impact on our users

1072
00:36:55,119 --> 00:37:00,720
so we need to think about what's the

1073
00:36:56,640 --> 00:37:02,480
frequency and impact of a risk to users

1074
00:37:00,720 --> 00:37:05,359
now let's move back from the physical

1075
00:37:02,480 --> 00:37:07,359
world to the computer world

1076
00:37:05,359 --> 00:37:08,960
how many of you have that shudder down

1077
00:37:07,359 --> 00:37:12,400
your spine when i say

1078
00:37:08,960 --> 00:37:12,400
the mysql database

1079
00:37:13,359 --> 00:37:16,640
right

1080
00:37:14,800 --> 00:37:17,520
that is a single point of failure and we

1081
00:37:16,640 --> 00:37:20,000
know

1082
00:37:17,520 --> 00:37:21,520
as systems engineers that the mysql

1083
00:37:20,000 --> 00:37:23,040
database is going to fail at some point

1084
00:37:21,520 --> 00:37:24,400
in the next year right it's going to

1085
00:37:23,040 --> 00:37:25,680
happen

1086
00:37:24,400 --> 00:37:27,119
you don't often have control over

1087
00:37:25,680 --> 00:37:29,119
frequency but what you do have control

1088
00:37:27,119 --> 00:37:30,960
over is impact so how can we make the

1089
00:37:29,119 --> 00:37:33,119
mysql database failing

1090
00:37:30,960 --> 00:37:34,880
not take out everything well

1091
00:37:33,119 --> 00:37:36,960
you could shard the database right you

1092
00:37:34,880 --> 00:37:39,040
could um have it only take down two

1093
00:37:36,960 --> 00:37:40,400
percent of users data at a time if it

1094
00:37:39,040 --> 00:37:42,480
goes out

1095
00:37:40,400 --> 00:37:43,839
you could decrease the amount of time

1096
00:37:42,480 --> 00:37:45,520
that it takes to identify that it's a

1097
00:37:43,839 --> 00:37:46,800
mysql database

1098
00:37:45,520 --> 00:37:48,160
you could also decrease the amount of

1099
00:37:46,800 --> 00:37:49,520
time that it takes to fail over right

1100
00:37:48,160 --> 00:37:50,720
you could have a hot spare running

1101
00:37:49,520 --> 00:37:52,800
instead of needing to restore from a

1102
00:37:50,720 --> 00:37:55,359
backup right that cuts the time from you

1103
00:37:52,800 --> 00:37:57,760
know two hours to restore the backup to

1104
00:37:55,359 --> 00:37:59,040
two seconds before the uh the watchdog

1105
00:37:57,760 --> 00:38:01,680
notices and kicks things over to the

1106
00:37:59,040 --> 00:38:03,280
replica right

1107
00:38:01,680 --> 00:38:04,880
so think about what are the risks that

1108
00:38:03,280 --> 00:38:06,560
are the most significant what's

1109
00:38:04,880 --> 00:38:07,839
happening with the highest impact

1110
00:38:06,560 --> 00:38:10,800
impacting the highest percentage of

1111
00:38:07,839 --> 00:38:12,640
users and lasting the longest

1112
00:38:10,800 --> 00:38:14,880
and then trying to move that needle

1113
00:38:12,640 --> 00:38:16,880
trying to work on on reducing those

1114
00:38:14,880 --> 00:38:18,320
significant impacts that doesn't mean

1115
00:38:16,880 --> 00:38:20,880
you have to eliminate them entirely but

1116
00:38:18,320 --> 00:38:23,680
you can think about ways to mitigate the

1117
00:38:20,880 --> 00:38:25,280
impact that they're less bad

1118
00:38:23,680 --> 00:38:27,040
and how do i decide what's a bad risk

1119
00:38:25,280 --> 00:38:28,880
what's not acceptable well the answer is

1120
00:38:27,040 --> 00:38:30,480
we just defined a service level

1121
00:38:28,880 --> 00:38:32,160
objective right

1122
00:38:30,480 --> 00:38:34,880
so if you have a service level objective

1123
00:38:32,160 --> 00:38:37,040
in mind then that helps you figure out

1124
00:38:34,880 --> 00:38:38,480
okay i'm allowed to have 10,000 bad

1125
00:38:37,040 --> 00:38:40,640
requests per month

1126
00:38:38,480 --> 00:38:43,040
this one failure case for instance the

1127
00:38:40,640 --> 00:38:44,800
mysql database is responsible for you

1128
00:38:43,040 --> 00:38:46,960
know in expectation it's going to cause

1129
00:38:44,800 --> 00:38:48,720
5000 failed requests per month

1130
00:38:46,960 --> 00:38:50,880
that's too high because that doesn't

1131
00:38:48,720 --> 00:38:52,400
leave us enough room for our unknown

1132
00:38:50,880 --> 00:38:54,720
unknowns for things that we didn't think

1133
00:38:52,400 --> 00:38:56,240
were going to happen

1134
00:38:54,720 --> 00:38:58,320
so there right there that's your

1135
00:38:56,240 --> 00:39:00,320
business case for being proactive for

1136
00:38:58,320 --> 00:39:01,839
fixing those issues before they cause

1137
00:39:00,320 --> 00:39:02,800
you to burn through your entire error

1138
00:39:01,839 --> 00:39:04,640
budget

1139
00:39:02,800 --> 00:39:06,560
because it's not possible to operate a

1140
00:39:04,640 --> 00:39:08,400
sustainable error budget

1141
00:39:06,560 --> 00:39:10,320
if that error budget is already filled

1142
00:39:08,400 --> 00:39:12,560
with known things that are

1143
00:39:10,320 --> 00:39:14,160
going to break

1144
00:39:12,560 --> 00:39:15,599
and finally this also helps give you a

1145
00:39:14,160 --> 00:39:18,000
framework for how to approach your

1146
00:39:15,599 --> 00:39:19,839
retrospectives from incidents

1147
00:39:18,000 --> 00:39:21,359
many of us when we create retrospectives

1148
00:39:19,839 --> 00:39:22,800
kind of create this list of 10 things

1149
00:39:21,359 --> 00:39:24,560
that would have prevented this specific

1150
00:39:22,800 --> 00:39:26,720
outage right it's not the right way to

1151
00:39:24,560 --> 00:39:28,800
think about it instead think about what

1152
00:39:26,720 --> 00:39:30,240
were the highest impact things that

1153
00:39:28,800 --> 00:39:32,160
resulted in

1154
00:39:30,240 --> 00:39:33,839
an outage or an outage similar to this

1155
00:39:32,160 --> 00:39:35,280
being able to happen in the future right

1156
00:39:33,839 --> 00:39:36,320
pick those one or two most important

1157
00:39:35,280 --> 00:39:38,079
things

1158
00:39:36,320 --> 00:39:40,800
based off of

1159
00:39:38,079 --> 00:39:41,920
the frequency and the blast radius

1160
00:39:40,800 --> 00:39:43,680
and those are the ones that you should

1161
00:39:41,920 --> 00:39:46,400
actually fix and you can put the other

1162
00:39:43,680 --> 00:39:48,800
ones on a bug bash list or on a or a

1163
00:39:46,400 --> 00:39:50,640
hack week list right don't waste your

1164
00:39:48,800 --> 00:39:53,359
precious production improvement time

1165
00:39:50,640 --> 00:39:55,280
doing chrome polishing

1166
00:39:53,359 --> 00:39:57,040
but i want to call out two things

1167
00:39:55,280 --> 00:39:59,520
first of all that if you have a lack of

1168
00:39:57,040 --> 00:40:01,760
observability that is systematic risk

1169
00:39:59,520 --> 00:40:04,079
that if every outage is taking half an

1170
00:40:01,760 --> 00:40:05,599
hour an hour two hours three hours even

1171
00:40:04,079 --> 00:40:07,359
for people to figure out what's even

1172
00:40:05,599 --> 00:40:08,960
going on what services are impacted

1173
00:40:07,359 --> 00:40:10,480
who's impacted like what do we think is

1174
00:40:08,960 --> 00:40:12,079
happening right like

1175
00:40:10,480 --> 00:40:13,119
that is time that your users are

1176
00:40:12,079 --> 00:40:14,880
suffering

1177
00:40:13,119 --> 00:40:17,040
and if you shrink the amount of time

1178
00:40:14,880 --> 00:40:18,880
that it takes to debug things

1179
00:40:17,040 --> 00:40:21,200
and know like you know we don't believe

1180
00:40:18,880 --> 00:40:22,800
in the idea of mean time to recovery but

1181
00:40:21,200 --> 00:40:25,280
certainly you know you can think about

1182
00:40:22,800 --> 00:40:26,880
the number of dropped queries right as

1183
00:40:25,280 --> 00:40:29,040
users are experiencing pain as you're

1184
00:40:26,880 --> 00:40:30,560
figuring out what's even going on right

1185
00:40:29,040 --> 00:40:32,800
so better observability is something

1186
00:40:30,560 --> 00:40:34,800
that impacts every item on that list of

1187
00:40:32,800 --> 00:40:36,079
risks

1188
00:40:34,800 --> 00:40:38,000
and similarly if you have a lack of

1189
00:40:36,079 --> 00:40:39,920
collaboration right if i'd talked to the

1190
00:40:38,000 --> 00:40:41,760
av team just now and i and i'd said

1191
00:40:39,920 --> 00:40:43,200
something like you know can't believe

1192
00:40:41,760 --> 00:40:45,119
this get your [ __ ] together how could

1193
00:40:43,200 --> 00:40:46,160
this possibly really

1194
00:40:45,119 --> 00:40:48,160
do you think that's going to make them

1195
00:40:46,160 --> 00:40:50,720
work better no that's going to make them

1196
00:40:48,160 --> 00:40:52,400
work worse right like

1197
00:40:50,720 --> 00:40:54,640
being friendly being supportive right

1198
00:40:52,400 --> 00:40:56,240
like that's how you ensure that people

1199
00:40:54,640 --> 00:40:59,839
report issues early that people

1200
00:40:56,240 --> 00:40:59,839
collaborate that they're transparent

1201
00:41:00,000 --> 00:41:04,400
in 2018 google cloud had a large outage

1202
00:41:02,800 --> 00:41:07,200
that impacted global networking for

1203
00:41:04,400 --> 00:41:08,400
about 45 or 50 minutes

1204
00:41:07,200 --> 00:41:10,400
do you know what caused that outage to be

1205
00:41:08,400 --> 00:41:11,760
solved in 45 50 minutes and not three

1206
00:41:10,400 --> 00:41:13,760
hours

1207
00:41:11,760 --> 00:41:14,720
the developer who had pushed the code

1208
00:41:13,760 --> 00:41:16,400
change

1209
00:41:14,720 --> 00:41:18,480
that had inadvertently tickled that

1210
00:41:16,400 --> 00:41:20,240
outage

1211
00:41:18,480 --> 00:41:21,599
they raised their hand and they said i

1212
00:41:20,240 --> 00:41:24,560
think it might be my change and i'm

1213
00:41:21,599 --> 00:41:26,720
already reverting it instead of cowering

1214
00:41:24,560 --> 00:41:28,800
and being in fear and not speaking up

1215
00:41:26,720 --> 00:41:31,200
for fear of being fired right

1216
00:41:28,800 --> 00:41:32,839
good collaboration decreases the amount

1217
00:41:31,200 --> 00:41:35,760
of time that outages

1218
00:41:32,839 --> 00:41:37,200
take so you don't have to be a hero in

1219
00:41:35,760 --> 00:41:39,599
order to have a

1220
00:41:37,200 --> 00:41:41,760
system that is production excellent

1221
00:41:39,599 --> 00:41:43,599
you just have to follow this recipe of

1222
00:41:41,760 --> 00:41:45,200
those four things

1223
00:41:43,599 --> 00:41:47,200
and i'm not just talking the abstract

1224
00:41:45,200 --> 00:41:48,880
i'm talking about what i've learned from

1225
00:41:47,200 --> 00:41:51,760
my time at honeycomb

1226
00:41:48,880 --> 00:41:53,280
that we're today about 40 50 engineers

1227
00:41:51,760 --> 00:41:54,720
but when i started we're 10 engineers

1228
00:41:53,280 --> 00:41:57,040
right and we're competing with companies

1229
00:41:54,720 --> 00:41:59,119
that are 10 times our size

1230
00:41:57,040 --> 00:42:00,640
and we managed to pull it off only

1231
00:41:59,119 --> 00:42:03,280
because of some of these factors of

1232
00:42:00,640 --> 00:42:06,000
production excellence of feeling like we

1233
00:42:03,280 --> 00:42:07,839
have the freedom to deploy on fridays to

1234
00:42:06,000 --> 00:42:10,160
deploy confidently

1235
00:42:07,839 --> 00:42:12,480
but also to have the responsibility to

1236
00:42:10,160 --> 00:42:15,119
look after changes after they go out

1237
00:42:12,480 --> 00:42:17,680
we do not push and run right like

1238
00:42:15,119 --> 00:42:19,839
it's not safe to push at 6 pm and walk

1239
00:42:17,680 --> 00:42:21,359
out the door and go get a and and go get

1240
00:42:19,839 --> 00:42:22,720
a beer right

1241
00:42:21,359 --> 00:42:24,240
no matter what day of the week it is on

1242
00:42:22,720 --> 00:42:25,520
the other hand if it's friday morning go

1243
00:42:24,240 --> 00:42:27,680
ahead and push as long as you're going

1244
00:42:25,520 --> 00:42:29,440
to be awake and around to look at it

1245
00:42:27,680 --> 00:42:32,400
so you can see here that we push up to

1246
00:42:29,440 --> 00:42:34,160
14 times per day

1247
00:42:32,400 --> 00:42:35,920
and we've done this all while traffic

1248
00:42:34,160 --> 00:42:37,440
has gone up three to five times in a

1249
00:42:35,920 --> 00:42:39,200
single year

1250
00:42:37,440 --> 00:42:40,960
covid has been really interesting for a

1251
00:42:39,200 --> 00:42:42,560
business because more things are moving

1252
00:42:40,960 --> 00:42:45,680
online

1253
00:42:42,560 --> 00:42:47,839
we have had basically a tripling of the

1254
00:42:45,680 --> 00:42:49,200
amount of write workload of the amount

1255
00:42:47,839 --> 00:42:51,680
of telemetry data coming from our

1256
00:42:49,200 --> 00:42:53,839
customers into honeycomb

1257
00:42:51,680 --> 00:42:56,000
and we've also had three times as many

1258
00:42:53,839 --> 00:42:57,200
people who are asking us questions

1259
00:42:56,000 --> 00:42:59,440
right people who are trying to

1260
00:42:57,200 --> 00:43:01,280
understand the behavior of their systems

1261
00:42:59,440 --> 00:43:03,040
so we've had to add all these features

1262
00:43:01,280 --> 00:43:04,960
and scale out our system and keep it

1263
00:43:03,040 --> 00:43:07,040
reliable at the same time

1264
00:43:04,960 --> 00:43:07,920
and how did we do that

1265
00:43:07,040 --> 00:43:10,240
well

1266
00:43:07,920 --> 00:43:12,800
we defined our slos

1267
00:43:10,240 --> 00:43:14,560
we thought about our areas of risk

1268
00:43:12,800 --> 00:43:16,240
and then we designed experiments to

1269
00:43:14,560 --> 00:43:19,760
validate that that risk was not there

1270
00:43:16,240 --> 00:43:22,480
and to fix those risks if we found them

1271
00:43:19,760 --> 00:43:24,839
right that's how we approach this

1272
00:43:22,480 --> 00:43:28,000
so we actually practice slos at

1273
00:43:24,839 --> 00:43:30,079
honeycomb and our slos reflect the user

1274
00:43:28,000 --> 00:43:32,160
value that we provide

1275
00:43:30,079 --> 00:43:34,319
remember when i said earlier that we're

1276
00:43:32,160 --> 00:43:36,079
an observability provider right what that

1277
00:43:34,319 --> 00:43:38,400
means fundamentally is we are a big data

1278
00:43:36,079 --> 00:43:40,960
platform right we ingest telemetry in

1279
00:43:38,400 --> 00:43:42,880
for instance opentelemetry format

1280
00:43:40,960 --> 00:43:45,040
and we have to break apart that data and

1281
00:43:42,880 --> 00:43:46,640
provide kind of almost an index on it so

1282
00:43:45,040 --> 00:43:49,040
that it can be quickly retrieved and

1283
00:43:46,640 --> 00:43:51,359
queried later

1284
00:43:49,040 --> 00:43:53,680
so therefore our slos are shaped like

1285
00:43:51,359 --> 00:43:56,400
those user journeys that if you view the

1286
00:43:53,680 --> 00:43:58,240
home page 99.9% of the time it should

1287
00:43:56,400 --> 00:43:59,920
load quickly enough within 250

1288
00:43:58,240 --> 00:44:02,000
milliseconds

1289
00:43:59,920 --> 00:44:04,319
but that it's okay for some queries to

1290
00:44:02,000 --> 00:44:05,599
fail even up to one percent of arbitrary

1291
00:44:04,319 --> 00:44:07,440
queries that those could take longer

1292
00:44:05,599 --> 00:44:08,480
than 10 seconds right like because the

1293
00:44:07,440 --> 00:44:10,079
idea is

1294
00:44:08,480 --> 00:44:12,000
you might be able to hit retry right

1295
00:44:10,079 --> 00:44:13,520
there that's okay

1296
00:44:12,000 --> 00:44:15,440
but the one thing that we do not mess

1297
00:44:13,520 --> 00:44:17,440
around with is user data coming in

1298
00:44:15,440 --> 00:44:19,359
because we know we in general have like

1299
00:44:17,440 --> 00:44:20,800
one maybe two chances to get it right if

1300
00:44:19,359 --> 00:44:23,040
you happen to have a buffer and retry

1301
00:44:20,800 --> 00:44:24,720
right that if we drop customer data on

1302
00:44:23,040 --> 00:44:27,119
the floor that's going to result in a

1303
00:44:24,720 --> 00:44:30,800
divot in the graph of every

1304
00:44:27,119 --> 00:44:30,800
honeycomb customer in perpetuity

1305
00:44:31,040 --> 00:44:35,440
so that means that we have to think

1306
00:44:32,720 --> 00:44:37,440
about how do we adhere to such a high

1307
00:44:35,440 --> 00:44:40,319
slo right

1308
00:44:37,440 --> 00:44:42,319
right that's 99.99% that's 4.3 minutes of

1309
00:44:40,319 --> 00:44:43,920
violation a month that can be really

1310
00:44:42,319 --> 00:44:46,720
hairy

1311
00:44:43,920 --> 00:44:48,640
so how do we stay within slo

1312
00:44:46,720 --> 00:44:50,400
the answer is the accelerate state

1313
00:44:48,640 --> 00:44:52,480
of devops metrics right which tell us

1314
00:44:50,400 --> 00:44:55,520
that if you deploy on demand multiple

1315
00:44:52,480 --> 00:44:56,319
times per day it's safer to do that

1316
00:44:55,520 --> 00:44:57,760
right

1317
00:44:56,319 --> 00:44:59,200
no amount of qa checking is going to

1318
00:44:57,760 --> 00:45:01,599
find these big giant issues so it's

1319
00:44:59,200 --> 00:45:04,720
better almost like riding a bicycle the

1320
00:45:01,599 --> 00:45:06,319
faster you ride to a degree the more

1321
00:45:04,720 --> 00:45:09,040
stable you are because you kind of have

1322
00:45:06,319 --> 00:45:10,720
that spinning gyroscope right

1323
00:45:09,040 --> 00:45:12,560
that the quicker the feedback loop is

1324
00:45:10,720 --> 00:45:14,400
between you pushing a change into

1325
00:45:12,560 --> 00:45:15,760
main and it's landing in production the

1326
00:45:14,400 --> 00:45:17,680
more likely it is that you'll remember

1327
00:45:15,760 --> 00:45:19,520
what was going on and have context to

1328
00:45:17,680 --> 00:45:21,119
figure out what's happening

1329
00:45:19,520 --> 00:45:23,280
and that if there is a problem right

1330
00:45:21,119 --> 00:45:25,520
like you'll notice that the dora metrics

1331
00:45:23,280 --> 00:45:28,720
say that 15% of companies

1332
00:45:25,520 --> 00:45:30,800
or 15% of the changes pushed by elite

1333
00:45:28,720 --> 00:45:32,640
companies can fail that's okay the thing

1334
00:45:30,800 --> 00:45:34,000
that distinguishes them is that failures

1335
00:45:32,640 --> 00:45:35,920
are not expensive that they can roll

1336
00:45:34,000 --> 00:45:38,800
back in minutes or at most an hour

1337
00:45:35,920 --> 00:45:40,480
rather than days

1338
00:45:38,800 --> 00:45:43,040
so for us

1339
00:45:40,480 --> 00:45:45,040
it starts with lead time we think about

1340
00:45:43,040 --> 00:45:46,400
how do you actually

1341
00:45:45,040 --> 00:45:47,520
take less than three hours from the time

1342
00:45:46,400 --> 00:45:49,200
that you're sitting hands on keyboard

1343
00:45:47,520 --> 00:45:51,280
writing the tests making sure they pass

1344
00:45:49,200 --> 00:45:53,119
to having it running in production

1345
00:45:51,280 --> 00:45:55,760
and the answer is we keep the build time

1346
00:45:53,119 --> 00:45:57,359
fast less than 10 minutes per build and

1347
00:45:55,760 --> 00:46:00,800
every time it gets higher than that we

1348
00:45:57,359 --> 00:46:03,119
do some debugging to figure out why

1349
00:46:00,800 --> 00:46:04,800
we automatically push once an hour and

1350
00:46:03,119 --> 00:46:06,480
the reason for that is that we are

1351
00:46:04,800 --> 00:46:08,160
trying to keep the number of commits per

1352
00:46:06,480 --> 00:46:09,839
build artifact low

1353
00:46:08,160 --> 00:46:11,760
and again to keep that lead time very

1354
00:46:09,839 --> 00:46:12,640
low

1355
00:46:11,760 --> 00:46:14,240
and

1356
00:46:12,640 --> 00:46:15,520
while it is true that you know maybe

1357
00:46:14,240 --> 00:46:16,720
five percent of the time our changes

1358
00:46:15,520 --> 00:46:18,480
don't work exactly the way that we

1359
00:46:16,720 --> 00:46:20,079
anticipated

1360
00:46:18,480 --> 00:46:21,920
we only have about one in a thousand

1361
00:46:20,079 --> 00:46:24,160
changes fail in a way that we cannot

1362
00:46:21,920 --> 00:46:25,920
quickly remediate right that require

1363
00:46:24,160 --> 00:46:26,960
actually doing a fix forward or a

1364
00:46:25,920 --> 00:46:30,880
rollback

1365
00:46:26,960 --> 00:46:30,880
rather than just doing a flag flip

1366
00:46:30,960 --> 00:46:34,240
and that means that we've optimized

1367
00:46:32,480 --> 00:46:38,000
around that path of making things take

1368
00:46:34,240 --> 00:46:39,760
as little time as possible to repair

1369
00:46:38,000 --> 00:46:42,000
so what this means is that in practice

1370
00:46:39,760 --> 00:46:44,319
we add instrumentation right we add

1371
00:46:42,000 --> 00:46:45,920
opentelemetry spans as we write our code so

1372
00:46:44,319 --> 00:46:47,599
we can better understand that behavior

1373
00:46:45,920 --> 00:46:49,440
both as we're coding

1374
00:46:47,599 --> 00:46:51,599
writing our tests and also as it reaches

1375
00:46:49,440 --> 00:46:53,839
production

1376
00:46:51,599 --> 00:46:57,680
we don't do kind of heavyweight

1377
00:46:53,839 --> 00:46:59,680
you know selenium tests we do dom tests

1378
00:46:57,680 --> 00:47:02,000
to kind of serialize and diff the dom to

1379
00:46:59,680 --> 00:47:03,760
make sure that it's good enough

1380
00:47:02,000 --> 00:47:05,280
and we make sure every major change has

1381
00:47:03,760 --> 00:47:08,640
the ability to turn on and off so that

1382
00:47:05,280 --> 00:47:10,240
we decouple a release from a deploy

1383
00:47:08,640 --> 00:47:12,480
and we really prioritize making those

1384
00:47:10,240 --> 00:47:14,319
unit tests fast as i said earlier a

1385
00:47:12,480 --> 00:47:15,920
build time 10 minute or bust right and

1386
00:47:14,319 --> 00:47:18,160
if it takes longer than that we're doing

1387
00:47:15,920 --> 00:47:21,040
some digging to figure out why

1388
00:47:18,160 --> 00:47:23,440
and also human beings prioritize doing

1389
00:47:21,040 --> 00:47:25,040
their reviews as quickly as possible

1390
00:47:23,440 --> 00:47:26,160
so that people don't drop things on the floor

1391
00:47:25,040 --> 00:47:27,920
and people are not blocking on each

1392
00:47:26,160 --> 00:47:29,280
other that doesn't mean you interrupt

1393
00:47:27,920 --> 00:47:30,640
each other all the time but like i'll

1394
00:47:29,280 --> 00:47:32,079
check once an hour to see whether there

1395
00:47:30,640 --> 00:47:33,839
are code reviews waiting for me because

1396
00:47:32,079 --> 00:47:36,319
i know that that's blocking someone from

1397
00:47:33,839 --> 00:47:38,160
getting their work into production

1398
00:47:36,319 --> 00:47:39,760
but what we don't do is we don't kind of

1399
00:47:38,160 --> 00:47:41,920
hold things in the holding pen forever

1400
00:47:39,760 --> 00:47:42,880
right as soon as the tests are green we

1401
00:47:41,920 --> 00:47:45,119
merge

1402
00:47:42,880 --> 00:47:46,960
and not only do we merge we push the

1403
00:47:45,119 --> 00:47:48,800
latest green build once an hour we

1404
00:47:46,960 --> 00:47:51,040
automatically push through each of the

1405
00:47:48,800 --> 00:47:53,119
environments in sequence

1406
00:47:51,040 --> 00:47:55,599
but we have the ability to stop to kind

1407
00:47:53,119 --> 00:47:57,280
of pull an andon cord to stop releases if

1408
00:47:55,599 --> 00:47:58,640
we think that there's a problem so

1409
00:47:57,280 --> 00:48:01,520
that's kind of how we keep the assembly

1410
00:47:58,640 --> 00:48:03,280
line of production changes rolling out

1411
00:48:01,520 --> 00:48:04,880
but the most important thing that we do

1412
00:48:03,280 --> 00:48:06,559
is that we observe real customer

1413
00:48:04,880 --> 00:48:08,800
behavior in production

1414
00:48:06,559 --> 00:48:11,440
that we have a set of environments not

1415
00:48:08,800 --> 00:48:13,599
just for customers to observe their data

1416
00:48:11,440 --> 00:48:15,520
but for us to observe what's happening

1417
00:48:13,599 --> 00:48:17,520
inside production how are customers

1418
00:48:15,520 --> 00:48:20,240
experiencing honeycomb

1419
00:48:17,520 --> 00:48:21,760
and are there any problems right like

1420
00:48:20,240 --> 00:48:23,440
for instance how are people actually

1421
00:48:21,760 --> 00:48:25,760
using the feature that we built how fast

1422
00:48:23,440 --> 00:48:27,359
is it how performant is it

1423
00:48:25,760 --> 00:48:28,880
and we also have to be able to observe

1424
00:48:27,359 --> 00:48:30,000
dog food so of course we have a third

1425
00:48:28,880 --> 00:48:32,160
environment for that because it's

1426
00:48:30,000 --> 00:48:33,760
turtles all the way down

1427
00:48:32,160 --> 00:48:36,960
and that's how we have 40 engineers

1428
00:48:33,760 --> 00:48:38,240
deploying 18 times per day

1429
00:48:36,960 --> 00:48:39,760
but it's not just kind of product

1430
00:48:38,240 --> 00:48:41,440
deployment we also have applied that

1431
00:48:39,760 --> 00:48:43,200
same mentality to our to our

1432
00:48:41,440 --> 00:48:45,359
infrastructure

1433
00:48:43,200 --> 00:48:47,040
so for instance we think about using

1434
00:48:45,359 --> 00:48:48,880
terraform and chef and kind of all these

1435
00:48:47,040 --> 00:48:49,920
lovely tools to manage our linux vms

1436
00:48:48,880 --> 00:48:52,000
every day

1437
00:48:49,920 --> 00:48:54,319
and we apply these approaches of

1438
00:48:52,000 --> 00:48:56,319
continuous integration right

1439
00:48:54,319 --> 00:48:58,240
so for instance we use terraform cloud

1440
00:48:56,319 --> 00:49:00,960
to automatically push the latest green

1441
00:48:58,240 --> 00:49:02,640
build so we can't drift out of sync

1442
00:49:00,960 --> 00:49:05,359
we use feature flags to handle things

1443
00:49:02,640 --> 00:49:07,440
like you know hey if i'm behind i can

1444
00:49:05,359 --> 00:49:10,800
toggle a feature flag in terraform to

1445
00:49:07,440 --> 00:49:12,960
stand up a catch-up fleet automatically

1446
00:49:10,800 --> 00:49:14,400
and i can quarantine bad traffic that's

1447
00:49:12,960 --> 00:49:16,240
causing crashes or that i want to

1448
00:49:14,400 --> 00:49:17,680
profile just with a feature flag in our

1449
00:49:16,240 --> 00:49:20,240
terraform code

1450
00:49:17,680 --> 00:49:22,800
and that really helps make life a lot

1451
00:49:20,240 --> 00:49:24,880
easier for us as as people who are

1452
00:49:22,800 --> 00:49:26,800
responsible for the platform who are

1453
00:49:24,880 --> 00:49:28,880
responsible for making it possible for

1454
00:49:26,800 --> 00:49:31,839
the product to move fast and scale on

1455
00:49:28,880 --> 00:49:31,839
top of our infrastructure

1456
00:49:32,240 --> 00:49:35,520
but we don't just design in the abstract we

1457
00:49:33,839 --> 00:49:37,839
actually validate we actually test are

1458
00:49:35,520 --> 00:49:41,680
these things working correctly

1459
00:49:37,839 --> 00:49:43,200
so we experiment using our error budgets

1460
00:49:41,680 --> 00:49:44,640
now when we talk about chaos engineering

1461
00:49:43,200 --> 00:49:45,680
we're not just talking about uncontrolled

1462
00:49:44,640 --> 00:49:47,920
chaos

1463
00:49:45,680 --> 00:49:49,599
we have a goal of the experiment in mind

1464
00:49:47,920 --> 00:49:51,359
what are we validating

1465
00:49:49,599 --> 00:49:53,440
and is there a stop button is there

1466
00:49:51,359 --> 00:49:55,760
ability to pause or reset an experiment

1467
00:49:53,440 --> 00:49:57,359
if it's causing a problem

1468
00:49:55,760 --> 00:49:59,920
so we'll use feature flags for instance

1469
00:49:57,359 --> 00:50:02,000
to control this sort of thing

1470
00:49:59,920 --> 00:50:04,000
in the event of our persistent services

1471
00:50:02,000 --> 00:50:05,359
we have to do a lot of work in order to

1472
00:50:04,000 --> 00:50:08,240
make sure that the persistence

1473
00:50:05,359 --> 00:50:10,400
mechanisms work as we anticipate

1474
00:50:08,240 --> 00:50:12,640
because about half of our microservices

1475
00:50:10,400 --> 00:50:14,079
are stateless but half are stateful and

1476
00:50:12,640 --> 00:50:15,440
we have to treat them as separate

1477
00:50:14,079 --> 00:50:17,119
classes

1478
00:50:15,440 --> 00:50:19,359
now there's been a lot said in the past

1479
00:50:17,119 --> 00:50:21,280
about kind of uh you know chaos monkey

1480
00:50:19,359 --> 00:50:22,640
of automatically restarting stateless

1481
00:50:21,280 --> 00:50:24,880
servers i'm not going to hone in on that

1482
00:50:22,640 --> 00:50:27,040
too much what i want to hone in on is

1483
00:50:24,880 --> 00:50:28,640
our stateful workload

1484
00:50:27,040 --> 00:50:30,000
where we have to be able to tolerate

1485
00:50:28,640 --> 00:50:30,720
things like

1486
00:50:30,000 --> 00:50:32,559
an

1487
00:50:30,720 --> 00:50:34,720
individual kafka broker going away or

1488
00:50:32,559 --> 00:50:36,880
one of our indexing workers going away

1489
00:50:34,720 --> 00:50:39,040
we have to be able to validate that we

1490
00:50:36,880 --> 00:50:40,559
are able to deploy safely that amazon

1491
00:50:39,040 --> 00:50:41,760
is able to terminate our instances

1492
00:50:40,559 --> 00:50:43,280
without causing failures to our

1493
00:50:41,760 --> 00:50:44,800
customers

1494
00:50:43,280 --> 00:50:47,280
so when we have these kind of

1495
00:50:44,800 --> 00:50:49,680
long-running storage instances that need

1496
00:50:47,280 --> 00:50:51,920
data integrity and consistency how do we

1497
00:50:49,680 --> 00:50:53,599
manage that

1498
00:50:51,920 --> 00:50:56,160
well the answer is we test these

1499
00:50:53,599 --> 00:50:58,240
failover dances we test the system to

1500
00:50:56,160 --> 00:50:59,760
make sure that in practice it works as

1501
00:50:58,240 --> 00:51:03,040
designed

1502
00:50:59,760 --> 00:51:05,599
for instance if a kafka broker is lost

1503
00:51:03,040 --> 00:51:07,920
are we able to successfully

1504
00:51:05,599 --> 00:51:08,800
start reading or writing to the new

1505
00:51:07,920 --> 00:51:10,880
leader

1506
00:51:08,800 --> 00:51:12,480
and in the background replicate in a new

1507
00:51:10,880 --> 00:51:14,319
kafka broker

1508
00:51:12,480 --> 00:51:16,960
or if we take out an indexing worker is

1509
00:51:14,319 --> 00:51:19,680
it able to replace successfully

1510
00:51:16,960 --> 00:51:22,880
based off of a snapshot stored to s3

1511
00:51:19,680 --> 00:51:25,680
plus replaying off of kafka

1512
00:51:22,880 --> 00:51:28,400
so we do experiments in production

1513
00:51:25,680 --> 00:51:29,599
so we restart one server one service at

1514
00:51:28,400 --> 00:51:31,040
a time right controlled experiments

1515
00:51:29,599 --> 00:51:33,200
we're not restarting all five things at

1516
00:51:31,040 --> 00:51:34,960
once we're just trying to test one thing

1517
00:51:33,200 --> 00:51:38,319
at a time to make sure

1518
00:51:34,960 --> 00:51:40,000
at 3 pm and not 3 a.m for two reasons

1519
00:51:38,319 --> 00:51:42,240
number one all hands are on deck right

1520
00:51:40,000 --> 00:51:44,000
bugs are more shallow with more eyes

1521
00:51:42,240 --> 00:51:46,240
but also number two

1522
00:51:44,000 --> 00:51:48,240
you know 2 p.m at least in the us is our

1523
00:51:46,240 --> 00:51:50,480
peak traffic time right if a failure

1524
00:51:48,240 --> 00:51:52,559
happens during peak traffic that's kind

1525
00:51:50,480 --> 00:51:53,760
of the pessimal situation as far as you

1526
00:51:52,559 --> 00:51:55,599
know the amount of load on the system

1527
00:51:53,760 --> 00:51:57,040
and we want to verify we can always

1528
00:51:55,599 --> 00:51:59,280
catch up and make progress even under

1529
00:51:57,040 --> 00:52:01,599
the most load

1530
00:51:59,280 --> 00:52:03,520
and we are monitoring to make sure that

1531
00:52:01,599 --> 00:52:06,000
we are not damaging user experience

1532
00:52:03,520 --> 00:52:08,079
using our slos and slis

1533
00:52:06,000 --> 00:52:10,079
and we are using our observability

1534
00:52:08,079 --> 00:52:11,760
infrastructure to debug to understand if

1535
00:52:10,079 --> 00:52:13,280
our experiment didn't go the way that we

1536
00:52:11,760 --> 00:52:15,200
anticipated

1537
00:52:13,280 --> 00:52:16,720
why is it happening and how do we repair

1538
00:52:15,200 --> 00:52:18,640
it

1539
00:52:16,720 --> 00:52:20,559
and we also need to make sure that we

1540
00:52:18,640 --> 00:52:21,920
are not you know dropping telemetry

1541
00:52:20,559 --> 00:52:24,000
right like that if we perform one of

1542
00:52:21,920 --> 00:52:25,680
these experiments this actually happened

1543
00:52:24,000 --> 00:52:27,440
uh last week

1544
00:52:25,680 --> 00:52:28,800
a kafka broker failed to come back

1545
00:52:27,440 --> 00:52:30,640
correctly after being killed and

1546
00:52:28,800 --> 00:52:32,480
restarted

1547
00:52:30,640 --> 00:52:34,319
and we didn't have the telemetry to tell

1548
00:52:32,480 --> 00:52:36,480
us that the kafka broker was missing it

1549
00:52:34,319 --> 00:52:37,839
didn't report itself as missing and

1550
00:52:36,480 --> 00:52:39,200
therefore we didn't actually have

1551
00:52:37,839 --> 00:52:41,680
telemetry to tell us that there was a

1552
00:52:39,200 --> 00:52:43,440
problem so that was a useful scenario

1553
00:52:41,680 --> 00:52:44,839
for us to know that that that was a

1554
00:52:43,440 --> 00:52:47,520
problem that we needed to

1555
00:52:44,839 --> 00:52:48,720
fix and then you verify your fixes right

1556
00:52:47,520 --> 00:52:51,119
like if something doesn't go according

1557
00:52:48,720 --> 00:52:53,920
to plan you gotta keep doing the painful

1558
00:52:51,119 --> 00:52:55,920
thing until it stops being painful

1559
00:52:53,920 --> 00:52:58,160
so for instance taking out indexing

1560
00:52:55,920 --> 00:53:01,520
workers taking out kafka brokers

1561
00:52:58,160 --> 00:53:02,400
guess what we do that once per week

1562
00:53:01,520 --> 00:53:04,079
right

1563
00:53:02,400 --> 00:53:06,079
we try to not keep things running

1564
00:53:04,079 --> 00:53:08,079
forever so that we can validate that if

1565
00:53:06,079 --> 00:53:09,440
things do go wrong it's not been more

1566
00:53:08,079 --> 00:53:11,680
than a week since our most recent

1567
00:53:09,440 --> 00:53:13,280
failure test

1568
00:53:11,680 --> 00:53:14,800
same thing with zookeeper right like we

1569
00:53:13,280 --> 00:53:18,000
discovered oops

1570
00:53:14,800 --> 00:53:20,240
our zookeeper um coordinators

1571
00:53:18,000 --> 00:53:21,839
we were only polling the first

1572
00:53:20,240 --> 00:53:24,079
zookeeper worker

1573
00:53:21,839 --> 00:53:25,359
and that if you took out the zookeeper

1574
00:53:24,079 --> 00:53:28,319
first worker

1575
00:53:25,359 --> 00:53:30,079
none of our alerts would run oops so we

1576
00:53:28,319 --> 00:53:31,760
fixed that right

1577
00:53:30,079 --> 00:53:33,680
when you de-risk things with the design

1578
00:53:31,760 --> 00:53:35,599
and automation it makes it a lot easier

1579
00:53:33,680 --> 00:53:36,880
to run a sustainable system that is not

1580
00:53:35,599 --> 00:53:38,800
going to page you in the middle of the

1581
00:53:36,880 --> 00:53:41,040
night

1582
00:53:38,800 --> 00:53:42,880
so we just have to continuously verify

1583
00:53:41,040 --> 00:53:45,119
and keep on making sure that our system

1584
00:53:42,880 --> 00:53:47,200
is working the way that we intend

1585
00:53:45,119 --> 00:53:49,440
and this doesn't have benefits just for

1586
00:53:47,200 --> 00:53:50,960
reliability it also has benefits for

1587
00:53:49,440 --> 00:53:52,400
cost

1588
00:53:50,960 --> 00:53:54,800
why because if you're allowed to

1589
00:53:52,400 --> 00:53:57,839
tolerate failures more often

1590
00:53:54,800 --> 00:54:01,200
you can do things like adopt spot

1591
00:53:57,839 --> 00:54:02,960
instances or preemptible uh instances

1592
00:54:01,200 --> 00:54:05,520
because you know that you can tolerate

1593
00:54:02,960 --> 00:54:07,200
losing those stateless workers

1594
00:54:05,520 --> 00:54:09,200
or for instance

1595
00:54:07,200 --> 00:54:12,000
if you know that there's a well-managed

1596
00:54:09,200 --> 00:54:13,760
process for replacing worker nodes

1597
00:54:12,000 --> 00:54:15,680
that means you can gradually and

1598
00:54:13,760 --> 00:54:18,000
incrementally roll your fleet without

1599
00:54:15,680 --> 00:54:20,960
fear of damaging user experience

1600
00:54:18,000 --> 00:54:23,359
including rolling out arm64

1601
00:54:20,960 --> 00:54:24,960
architecture instances

1602
00:54:23,359 --> 00:54:27,119
this is almost an entire talk in itself

1603
00:54:24,960 --> 00:54:29,040
but i firmly believe that the future of

1604
00:54:27,119 --> 00:54:30,640
solving climate change within the tech

1605
00:54:29,040 --> 00:54:33,440
industry at least

1606
00:54:30,640 --> 00:54:35,680
is by consuming less power per compute

1607
00:54:33,440 --> 00:54:37,839
and when you adopt arm instances you're

1608
00:54:35,680 --> 00:54:39,520
playing a direct role in fighting the

1609
00:54:37,839 --> 00:54:41,119
amount of carbon emissions that your

1610
00:54:39,520 --> 00:54:44,000
services produce

1611
00:54:41,119 --> 00:54:47,280
as well as of course saving about 40

1612
00:54:44,000 --> 00:54:47,280
to 60 percent on your bill

1613
00:54:47,760 --> 00:54:51,119
but not every experiment is going to

1614
00:54:49,359 --> 00:54:53,040
succeed and that's okay right you can

1615
00:54:51,119 --> 00:54:55,359
mitigate those risks and it's just

1616
00:54:53,040 --> 00:54:57,200
important to think about how do i bound

1617
00:54:55,359 --> 00:54:59,680
the risk to make it acceptable to do

1618
00:54:57,200 --> 00:55:01,520
these experiments how do i make it safe

1619
00:54:59,680 --> 00:55:03,280
to do these things and design for

1620
00:55:01,520 --> 00:55:06,720
reliability through my life cycle how do

1621
00:55:03,280 --> 00:55:07,440
i make my system excellent in production

1622
00:55:06,720 --> 00:55:10,079
so

1623
00:55:07,440 --> 00:55:12,640
feature flags can help um kind of doing

1624
00:55:10,079 --> 00:55:14,160
proactive risk experiments can help

1625
00:55:12,640 --> 00:55:15,359
but above all i think kind of you have

1626
00:55:14,160 --> 00:55:17,599
to ground yourself in the four things

1627
00:55:15,359 --> 00:55:20,160
i said originally right slos

1628
00:55:17,599 --> 00:55:22,240
observability collaboration and risk

1629
00:55:20,160 --> 00:55:25,520
management if you do those things you're

1630
00:55:22,240 --> 00:55:25,520
going to be in a much better place

1631
00:55:26,160 --> 00:55:30,880
and always be prepared always expect the

1632
00:55:28,799 --> 00:55:32,640
unexpected because we're always working

1633
00:55:30,880 --> 00:55:34,960
in the turbulent situations of

1634
00:55:32,640 --> 00:55:37,599
production

1635
00:55:34,960 --> 00:55:40,240
make experimentation routine and you can

1636
00:55:37,599 --> 00:55:42,559
not so jokingly talk about the idea of

1637
00:55:40,240 --> 00:55:44,559
oh i'm just going to set a chaos

1638
00:55:42,559 --> 00:55:46,720
monkey running loose on our fleet you

1639
00:55:44,559 --> 00:55:48,160
don't get there you know from nothing you have

1640
00:55:46,720 --> 00:55:51,520
to do the groundwork you have to do the

1641
00:55:48,160 --> 00:55:51,520
preparation in order to get there

1642
00:55:51,760 --> 00:55:54,880
so

1643
00:55:52,480 --> 00:55:56,640
we're all part of sociotechnical systems

1644
00:55:54,880 --> 00:55:58,799
as customers as engineers and as

1645
00:55:56,640 --> 00:56:00,960
stakeholders

1646
00:55:58,799 --> 00:56:03,040
learn from your outages invest

1647
00:56:00,960 --> 00:56:04,720
appropriately in observability

1648
00:56:03,040 --> 00:56:06,559
and have those conversations about how

1649
00:56:04,720 --> 00:56:08,160
we can improve our tools and improve

1650
00:56:06,559 --> 00:56:11,520
ourselves and improve our processes in

1651
00:56:08,160 --> 00:56:13,680
order to react better in the future

1652
00:56:11,520 --> 00:56:15,440
and yes you know tools can help buy the

1653
00:56:13,680 --> 00:56:17,359
right tools buy just the right amount of

1654
00:56:15,440 --> 00:56:19,200
tools right it's almost like uh that

1655
00:56:17,359 --> 00:56:21,280
author who said you know

1656
00:56:19,200 --> 00:56:23,200
eat food not too much mostly grains

1657
00:56:21,280 --> 00:56:25,200
right i think that's that's true that's

1658
00:56:23,200 --> 00:56:27,280
true about about our socio-technical

1659
00:56:25,200 --> 00:56:28,720
systems too right like you know run your

1660
00:56:27,280 --> 00:56:31,280
systems

1661
00:56:28,720 --> 00:56:33,440
buy some tools buy the right tools

1662
00:56:31,280 --> 00:56:36,400
but above all else focus on your culture

1663
00:56:33,440 --> 00:56:38,960
first and the rest will follow

1664
00:56:36,400 --> 00:56:41,440
so i implore you think about how do you

1665
00:56:38,960 --> 00:56:42,559
measure your your reliability levels how

1666
00:56:41,440 --> 00:56:43,760
do you actually debug how do you

1667
00:56:42,559 --> 00:56:45,440
actually understand what's happening in

1668
00:56:43,760 --> 00:56:46,960
production

1669
00:56:45,440 --> 00:56:48,960
do you have the ability to collaborate

1670
00:56:46,960 --> 00:56:50,880
across teams

1671
00:56:48,960 --> 00:56:52,480
and are you actually investing time in

1672
00:56:50,880 --> 00:56:53,920
closing that feedback loop and not just

1673
00:56:52,480 --> 00:56:56,559
repeating the same outages over and over

1674
00:56:53,920 --> 00:56:58,240
like groundhog day

1675
00:56:56,559 --> 00:56:59,599
so that's all that i have for you today

1676
00:56:58,240 --> 00:57:01,599
and i understand that we're a little bit

1677
00:56:59,599 --> 00:57:03,359
behind schedule because of the av issues

1678
00:57:01,599 --> 00:57:05,040
so i'm not going to delay you too long

1679
00:57:03,359 --> 00:57:08,319
um but if you'd like a copy of these slides

1680
00:57:05,040 --> 00:57:09,920
go check out honeycomb.io/liz and also

1681
00:57:08,319 --> 00:57:11,280
um i happen to be around sydney at least

1682
00:57:09,920 --> 00:57:12,960
for the next couple of months so if

1683
00:57:11,280 --> 00:57:15,599
you'd like to go for a walk in hyde park

1684
00:57:12,960 --> 00:57:17,760
um just uh i'll drop a link into the

1685
00:57:15,599 --> 00:57:19,200
conference chat and you can uh just book

1686
00:57:17,760 --> 00:57:21,280
an hour with me and i'd love to go walk

1687
00:57:19,200 --> 00:57:23,839
around and meet some of you

1688
00:57:21,280 --> 00:57:25,839
thanks cheers everyone

1689
00:57:23,839 --> 00:57:27,839
thanks liz that was that was really good

1690
00:57:25,839 --> 00:57:29,440
again another great speak that uh speech

1691
00:57:27,839 --> 00:57:31,040
that uh was looking at the chat and then

1692
00:57:29,440 --> 00:57:32,640
they were saying oh this is really hitting

1693
00:57:31,040 --> 00:57:35,520
with a lot of people

1694
00:57:32,640 --> 00:57:36,960
um really good stuff um i especially

1695
00:57:35,520 --> 00:57:39,040
like all those little illustrations

1696
00:57:36,960 --> 00:57:41,119
you've got they're fantastic

1697
00:57:39,040 --> 00:57:43,040
yes they're by the wonderful emily griffin

1698
00:57:41,119 --> 00:57:44,400
uh @emilywithcurls highly encourage

1699
00:57:43,040 --> 00:57:45,839
checking her out

1700
00:57:44,400 --> 00:57:48,319
excellent great

1701
00:57:45,839 --> 00:57:50,160
anyway so we're about to finish off this

1702
00:57:48,319 --> 00:57:51,760
session um we're just coming up to the

1703
00:57:50,160 --> 00:57:53,280
first break of the day sorry about all

1704
00:57:51,760 --> 00:57:54,960
the delays there we've still got a fair

1705
00:57:53,280 --> 00:57:57,040
amount of time

1706
00:57:54,960 --> 00:57:58,559
remember that the main sessions start at

1707
00:57:57,040 --> 00:58:00,000
10:45

1708
00:57:58,559 --> 00:58:02,799
and then the lunch break will be around

1709
00:58:00,000 --> 00:58:05,280
12:25 depending on when stuff finishes

1710
00:58:02,799 --> 00:58:06,880
the afternoon session starts at 1:30 and

1711
00:58:05,280 --> 00:58:09,280
don't forget the conference close at 5:35

1712
00:58:06,880 --> 00:58:11,440
where you'll get to you know cry and

1713
00:58:09,280 --> 00:58:13,200
cheer and all that stuff so please go

1714
00:58:11,440 --> 00:58:16,599
and have fun and we'll see you a little

1715
00:58:13,200 --> 00:58:16,599
bit later on