Hey, everyone, and welcome to the June 13th standup. Today for planning I have mostly questions, catching up with the chats and the issues in regards to 1.9. Thanks, guys, for cutting an RC yesterday and deploying it. I'm looking to see or hear any updates and questions we might have on what's happening with v1.9.0-rc2. If anybody wants to give a quick overview about where we're at, that would be great.
- Just had a look. Seems like today we dropped some mesh peers, but on beta previously I deployed a branch to test the batch delete stuff, and when I look at the seven-day chart it seems very stable except for today. So maybe it's nothing; we need to monitor more.
- Right, OK. Then in the chat as well, I think there was an idea put up, Cayman's, to deploy it on some mainnet validators, suggesting the CIP canary validators, to hopefully get some better metrics on that. Is there any sort of contention with that idea?
- Yeah, my thought was that we could just deploy it to some nodes that have validators attached, 'cause it seems like the scoring is gonna be a little different: we're gonna be giving the network actual useful data rather than just forwarding stuff. Sorry, I lied, I said deploy to the whole CIP fleet.
- Yeah. I mean, the worst that could happen is a bit fewer rewards, and we don't care, so we might as well use them.
>> Okay, yeah, that makes sense. Okay. And then the...
>> I feel pretty confident about this release. I feel like it's just the network we're seeing with this mesh peer stuff; it hasn't come up before in any of our other testing. And...
>> Yeah, for stable mainnet, it has been stable over the last seven days except for the last few hours. So yeah, it's just some incident in the network.
- Yeah, I think it's pretty interesting, 'cause not all of the beacon nodes that we have deployed necessarily perform the same. A lot of it seems to depend on the luck of the mesh peering that you get. I don't know if I'm mistaken in this, but there are definitely some beacon nodes that we had on mainnet with Lido that don't have this stable mesh peering, and it's very specific ones sometimes. So I'm not entirely sure what the solution is when we start seeing poor mesh peering on specific nodes.
- I would say maybe we should make an issue about this just so we can track it. I'm not exactly sure how we would go about debugging it, but I guess I would tend to agree that there's some luck involved: if you're connecting to nodes that are better connected to the rest of the network, you might be getting messages a little bit faster, and then not having as many missed attestations because you're better synced. I don't know, I have no data to back this up.
- OK, yeah, sounds good.
All right, well, we definitely have some next steps here in regards to seeing if we can push out v1.9 RC2, so that's probably the most important thing on my list right now. There are a couple of people waiting to try this out, some larger node operators like RockLogic, and I've also reached out to some relayers as another target group of users. We should all be in a chat with Aestus Relay on Telegram now, so they will be sort of our test client, or customer, in regards to diversifying that aspect of infrastructure on the network. Does anybody have any points for planning that they'd like to bring up specifically?
- Maybe we can talk about the network thread, the different strategies that we have for how to improve performance, and prioritize them.
- Yeah, that sounds good.
- I would say: has someone confirmed that it's mainly the nodes with a lot of keys that experience issues? Say the nodes that have zero keys or few keys, are they doing fine?
- With the network thread enabled?
- No, in general.
- In general, I've noticed that the CIP validators seem to have better effectiveness compared to the Lido nodes. When you say few keys, I don't know if there's a specific number you're considering, but our CIP validators have a split of, I think, 16 canaries. So one node has 16 keys and the other one has 56, whereas all of our Lido ones have, I think, 200 per beacon node at this point.
- Go ahead.
- Yeah, also when I looked at the performance of the Lido nodes, when I debugged that, I saw that there are way more late attestations and way more missed attestations compared to my private server. And the performance rating, it was about a 3% difference; I think the timeframe was 30 days. So it makes a huge difference, I think.
- I see the same thing with my personal validators too.
- Got it.
I think there's also a piece that whenever we talk with operators, we can say confidently that Lodestar is at a level we are comfortable recommending for, say, not-huge stakers, and that for bigger stakers we are working on it. Then we can put this size-relative beta tag or stable tag on the software, so we can cover both grounds.
Cool. Then, do we want the network thread (I guess the answer is yes) to be enabled by default eventually?
>> Absolutely. I think... well, it seems like we're able to actually process all of the messages. For larger stakers that are connected to more subnets, we're actually performing well for the network's sake. I remember we were talking about being a good network participant, and how we aren't always a great network participant right now. I think enabling the network thread by default is kind of key to doing more of the work that we should already be doing.
- Okay, I agree. I can share the findings I have so far, which are not very conclusive. Tuyen has been able to take regular CPU profiles with the Chrome DevTools. Building a hack in the API, I was able to take a perf record of the whole process and then label stacks by thread ID, so we can look exclusively at the network thread. At least from there, nothing obvious stands out in either of the two. So it looks like the node is doing what it's supposed to do; it's just doing gossip, mplex, crypto, allocations, TCP handling. There are two minor issues that Tuyen is already handling, like the PeerId conversion and deconversion we do. That's stupid, but we're not going to do it anymore, and it's about 4% of CPU time. So it just looks like for some reason the thread is overloaded. The thing I don't understand, to be honest, is why the main thread was doing fine and now the network thread is clogged. That's a question I haven't been able to answer yet. I don't know if someone has.
- There's a definite overhead to spinning up a second isolate, so it's definitely doing a lot of work just handling another node instance. One thing to clarify: the main setup was not a thread, it was a process, and this network thread is the first time we introduced an actual thread. So there's definitely going to be context switching at the CPU level, because if you have two separate sets of worker pools, it's basically switching between the two sets in order to accommodate all the work at the processor level.
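A rough sketch of the distinction being drawn here, assuming Node.js worker_threads (illustrative, not Lodestar's actual wiring): a Worker is a second V8 isolate and event loop inside the same OS process, whereas the old model ran the network on the main event loop of its own process.

```ts
// Illustrative only: spawn network handling on a worker thread (ESM).
import { Worker, isMainThread, parentPort } from "node:worker_threads";

if (isMainThread) {
  // Main thread: spawn a second isolate inside this same process.
  const worker = new Worker(new URL(import.meta.url));
  worker.on("message", (msg) => console.log("from network thread:", msg));
  worker.postMessage({ cmd: "start" });
} else {
  // Worker thread: its own event loop, sharing the process with main.
  parentPort?.on("message", () => {
    parentPort?.postMessage("network thread started");
  });
}
```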
- It's possible. I mean, I don't really buy that, because CPUs are very optimized for that purpose, and if you look at the profile you don't see anything that would relate to it. One difference that Tuyen was able to see in the profiles he took: when we have everything in the main thread, we explicitly yield back to the macro queue, rather than having a bunch of these promises awaited in a row, so we don't have huge uninterrupted periods of micro-queue tasks. We explicitly yield back to the macro queue so that timers and other things can be triggered at the right time. If you don't do that, your timers won't be able to fire, because there's too much work to be done: a huge uninterrupted stream of micro-queue tasks. And that's one of the things we saw in the profile.
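A minimal sketch of the starvation problem and the explicit yield being described (names illustrative):

```ts
// Yielding to the macro queue lets timers and I/O callbacks run; a pure
// promise chain stays in the micro queue and can starve them.
const yieldToMacroQueue = (): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, 0));

async function processBatch<T>(items: T[], handle: (item: T) => Promise<void>): Promise<void> {
  for (let i = 0; i < items.length; i++) {
    await handle(items[i]); // may resolve without ever leaving the micro queue
    if (i % 100 === 0) {
      await yieldToMacroQueue(); // give timers and socket polling a turn
    }
  }
}
```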
- So that makes sense, but I would be cautious: when you look at the timeline of the profile, you have to take into account that it's sampled. If there is a sample at time X with a stack that's rooted in the micro queue, and the next one is also in the micro queue, but in between there happened to be a yield to the macro queue and there was no sample at that specific moment, it would look like one continuous micro-queue run when that's not true. Just an FYI.
- I feel like we do have a metric that is kind of a counterpoint to that, though: we see the event loop lag, which is the time between macro-queue runs, and that is up to a second or more when we're running the network thread. It should be in milliseconds.
- We have to measure it ourselves, 'cause I don't know why the hell Node.js doesn't expose that in a nice way. I'm not sure if they do.
So, prom-client captures the metric in two ways, and if you look at the metrics in the Grafana dashboard, you will see that there are two different charts for event loop lag which are wildly different. Also for the GC metrics there are two different sets of metrics, because they use different techniques, and they are wildly different too, which is really annoying. So we just don't know what's happening. But anyway...
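For what it's worth, Node does expose an event-loop-delay histogram via perf_hooks, and prom-client's default metrics report lag both from that histogram and from a simpler timer-drift measurement, which is plausibly why the two charts disagree. A minimal sketch of reading it directly:

```ts
import { monitorEventLoopDelay } from "node:perf_hooks";

const h = monitorEventLoopDelay({ resolution: 20 }); // sample every 20ms
h.enable();

setInterval(() => {
  // Histogram values are in nanoseconds.
  console.log(
    `event loop lag p50=${(h.percentile(50) / 1e6).toFixed(1)}ms ` +
      `p99=${(h.percentile(99) / 1e6).toFixed(1)}ms`
  );
  h.reset();
}, 10_000);
```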
- Well, that would make sense, because of the micro queue and garbage collection. The full garbage collection, I don't think, runs except on the full cycle, and polling also runs on the full event loop. If you keep inserting promises into the micro queue, the loop doesn't actually get to the next step: it's not polling the network, it's not pulling new stuff off sockets, it's not running the major GC. But the minor GC is basically just a memory pointer that runs back and forth, so that would be running in between. Just a thought.
- Got it. Okay, so if we want to enable it, then we have different strategies. Something that I'm not sure we can do, but: would the network thread be able to self-regulate so it doesn't choke? That would be great, but I'm not sure we can do it. Then... I mean, we are already doing the network thread, so we are scaling horizontally, moving load across different CPU threads. We cannot do that any further, because it's already one thread, and it wouldn't make sense to have multiple threads. Then we can try to do fewer things: reintroduce some mechanism to drop messages, so we shed load if the thread is overloaded. I mean, it would be cool to do that.
- This is why I was bringing up: how is Yamux doing? Because Yamux has backpressure built into it and may be able to help in these situations, because it will stop sending window updates, so you will stop receiving things. But I think Yamux is blocked.
- Because the performance was bad, right?
- No, there was a memory leak last time we tested it out.
- Okay.
- Because if the network thread is not doing anything stupid, which it's not, once we fix those two little things, then what options do we have? We just do less of the things we're already doing and have to do, or we optimize the pipeline further, which would be another option, but that would take a while. Like, how is it going in libp2p land?
- Yeah, so I was going to mention this in my update: I've got a branch open for upgrading to 0.45, which is the latest version of libp2p. I think we definitely need to prioritize that after we get this release out, because we have some fixes in the TCP library and some improvements in open PRs on gossipsub, and all of those are blocked because we're running several versions behind in production, several versions behind on gossipsub and on TCP. So we want to get back up to date and get all these latest goodies. That'll also let us retest Yamux; there have been a few fixes, and that memory issue may have been resolved.
The other interesting thing, kind of related to this, is that in a future version of libp2p, maybe 0.46 or maybe 0.47, they are thinking about replacing the underlying implementation of the streams. There's a standard for streams, called web streams or WHATWG streams, and they basically provide a readable and a writable stream. And there is a promise of them maybe being more performant, because you can use a readable stream in such a way that the consumer of the stream passes a buffer in to the stream to have data written directly to it, avoiding additional memory copies when you're dealing with binary data. So this is pretty nice for us. This would also allow us to stop... we're doing a lot of stupid wrapping of these streams to make them abortable at different levels of the stack, and the reason we have to do that is that our underlying streams are not abortable by default. So that's another piece of the puzzle: this implementation is abortable at the lowest level, so we don't have to do additional wrapping. And then a third thing about it is that there's built-in backpressure with these streams. It's not even just Yamux backpressure, where we're sending data; backpressure is built in, so we might be able to basically tell the TCP socket that we're not able to write more data, or not able to read more data. There may be some ability to handle backpressure that way too. That's all in the future, like version 0.46 or 0.47. But we need to get to 0.45 first; with 0.45 we can start testing Yamux, and hopefully in 0.46 and 0.47 we get more goodies.
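A minimal sketch of the zero-copy reading mode being referred to, the BYOB ("bring your own buffer") mode of WHATWG byte streams (available as globals in modern Node and browsers):

```ts
const rs = new ReadableStream({
  type: "bytes",
  pull(controller) {
    const req = controller.byobRequest;
    const view = req?.view;
    if (req && view) {
      // Write directly into the consumer-supplied buffer: no extra copy.
      new Uint8Array(view.buffer, view.byteOffset, view.byteLength)[0] = 42;
      req.respond(1); // we filled 1 byte
    }
  },
});

const reader = rs.getReader({ mode: "byob" });
// The consumer hands its own buffer to the stream and gets it back filled.
const { value } = await reader.read(new Uint8Array(1024));
```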
- So with async iterables, we do have backpressure, because you are requesting each individual next item; if you don't request more items, the source should not produce more items. The classic stream backpressure is that you have small buffers everywhere, but with async iterables you have backpressure with what is effectively a zero-item buffer, right?
- Okay, so with this, the queuing strategy, the buffering strategy, is more explicit: you can tell this readable stream how to buffer the data and what to do when you're full. Right now we have these it-pushables in different places performing that task, where we're buffering things and we have to set a max buffer size for those it-pushables. So it's kind of built in.
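A short sketch contrasting the two models just described: pull-based backpressure from async iterables versus an explicit queuing strategy on a WHATWG stream (handleChunk is a hypothetical slow consumer):

```ts
// Async iterables: the generator body stays suspended at `yield` until the
// consumer asks for the next item, i.e. a zero-item buffer.
async function* source(): AsyncGenerator<Uint8Array> {
  for (;;) yield new Uint8Array(1024);
}

async function handleChunk(_chunk: Uint8Array): Promise<void> {
  await new Promise((r) => setTimeout(r, 10)); // pretend to do slow work
}

let n = 0;
for await (const chunk of source()) {
  await handleChunk(chunk); // a slow consumer automatically slows the source
  if (++n >= 3) break;
}

// WHATWG streams make the buffering policy explicit instead: this stream
// stops calling pull() once 16 chunks are queued.
const stream = new ReadableStream<Uint8Array>(
  { pull: (controller) => controller.enqueue(new Uint8Array(1024)) },
  new CountQueuingStrategy({ highWaterMark: 16 })
);
```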
- So, related to this, I'm confident that with the current design we could get rid of all the abort sources. Probably the issue is that libp2p does the wiring of the transports, but if we could declare at the libp2p constructor level "I am aware of my muxer and I am aware of my protocols, and I know they all have abortable sources, so please don't wrap them", we could do that today, with today's implementation. Whereas moving away from async iterables into something else is a big thing, and I'm not sure it's the right one.
- The thing is, these streams implement async iterables, so we're not losing async iterables. It's just that there is a concrete implementation backing everything.
- Can you send me an issue, if you're designing this in the open?
- Yes. I commented on your... you opened a draft PR in gossipsub to remove the abortable source, and I commented there, where some of the discussion is happening.
- Okay.
- Is there a performance test somewhere in libp2p with which we can test the full throughput of the stack?
- Not yet. They're working on it.
- I would like to test this hypothesis that our stack is slow, because maybe it's not true; maybe it's doing fine, I don't know. We have had surprises before.
- I actually just thought of something: if we're stuck in the micro task queue and not getting out of it, the socket data might not be loading into L1 cache, because we aren't going through the loop each time. It's basically backing up all the socket data into L2 or L3 cache, and it just takes much longer for the data to get to the CPU to process. So it would look like the CPU is doing what it's supposed to be doing; it's just waiting for data.
- Yeah, that's a hypothesis I don't know how to test.
- I don't either, though.
- I'd love to see what performance looks like if we just... I think we have something like this on the main thread, where it checks the last time there was a macro-queue event loop turn, and if a certain amount of time has passed, it sleeps zero: it yields back to the macro queue just to avoid any long periods of micro-queue tasks. And see what that looks like.
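A minimal sketch of that guard, with illustrative names and budget (not the actual Lodestar code):

```ts
let lastYieldMs = Date.now();
const YIELD_BUDGET_MS = 50; // illustrative: max time to hog the loop

async function yieldIfBusy(): Promise<void> {
  if (Date.now() - lastYieldMs > YIELD_BUDGET_MS) {
    // "Sleep zero": hop to the macro queue so timers and I/O can run.
    await new Promise<void>((resolve) => setTimeout(resolve, 0));
    lastYieldMs = Date.now();
  }
}
```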
- Yeah, actually, there was a very old issue I opened that I never got to, because I didn't have the expertise, but maybe, Matthew, you can take it on: investigate the hypothesis that our OS socket buffers are being read and written slower than they should be due to the event loop being clogged. I think with the knowledge you have now, you should be able to confirm that.
- At least be able to dig in and find out. That's a good question.
- Because the hypothesis is: if we write the attestation to the socket and then we stay busy for a while, it will take maybe five loops to copy all the data, so the message will actually be sent on the wire much later than we thought.
- And that would definitely cause it to be pushed back to slower cache at the processor level, for sure.
- Cool. So regarding planning: we have multiple things we can do to optimize the network. I think they all take a while, but we'll work on it. Just to confirm with Tuyen: the performance decrease that we're seeing now, the main issue is that we process blocks late, so we vote on the wrong head. Is that correct?
- Yeah, I think with the single-thread model we mostly receive blocks late, process blocks late, and then validators vote for the wrong head.
- And does that explain all the attestation problems that we have, or is it also that our mesh sometimes gets into a bad situation, so we cannot send attestations to the right aggregators on time?
- I think when we publish, we publish to all of the topic peers, not just mesh peers, so maybe it's not related to that.
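If memory serves, this is gossipsub's flood-publish behavior: your own messages go to every known (well-scored) subscriber of the topic, not only your mesh peers, so a degraded mesh mostly hurts what you receive. A hedged sketch, with the option name as I recall it from ChainSafe's implementation:

```ts
import { gossipsub } from "@chainsafe/libp2p-gossipsub";

// floodPublish: publish our own messages to all topic peers, not just the mesh.
const pubsubService = gossipsub({
  floodPublish: true,
});
```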
- Cool, so all we have to focus on now is getting those blocks in as fast as possible.
- Yes. And on the network thread, it seems like we're getting them a lot faster, but our peer counts are a lot lower, so we just have an unstable peer set that's constantly rotating out, and that becomes our problem. But we're getting things a lot faster, and we're processing them a lot faster.
- I think after we get 1.9 out, we can also enable the network thread on the CIP nodes too.
- Cool. Great. That's all I wanted to discuss.
- Okay, cool. So I guess, what are we on, day two of testing that RC, pretty much? So we can probably make a decision on that as early as Thursday, but we'll still deploy to the CIP nodes anyway, for sure. And then we'll enable... actually, are we enabling the network thread immediately, or is that something we're gonna wait on?
- I was thinking let's test the RC this week and use it as additional test data for Thursday, and then once we cut the full 1.9, deploy with the network thread to our CIP nodes.
- Cool. And through observation on that, we'll figure out whether we want to consider it for the Lido nodes, if we get good data. But we can make that decision later based on what we see. Any other points for planning?
Otherwise, we'll do a quick round of updates, and that should cover the last 20 minutes. All right, cool. Let's start with Gajinder. How are things going on the devnet?
- Yep. So last week was all about pre-devnet-6, and things were going quite well. But today devnet 6 started, and it increased the number of blobs that one can use in a block to six. Some issues have cropped up, and I'm debugging them. I have also generated a fix, and we are trying to sync back to devnet 6 so that we can bring it to a healthy state. Apart from that, I did a few PRs with fixes here and there. Yeah, that's mostly it.
- With the increase in blobs, does that actually make a huge dent in our performance or anything like that?
- No, no. Basically the problem started when I sent 500 transactions with 500 blobs into the network, for them to be included. So each block was consequently getting six blobs, which was fine. But then the problems started showing up, not in the network but with the EL clients: earlier, the EL clients were not agreeing. And there was one problem with Lodestar where, when somebody did a blob sidecars by-range request, it was sending all six blob sidecars together rather than chunking them up one by one. That was basically because of a typo, for which a PR has been generated, and I've updated them there as well. But all of this is right now running in one data center, so there would be no issues with respect to network latency, though I don't expect network latency to be an issue here anyway.
- Cool, thanks, Gajinder. All right, next up we have Matt.
- Good morning. So I spent some time studying how the network processor and the network thread work in the beacon node, and got BLST brought in there properly, so it's only doing attestations and aggregate-and-proofs. I have a draft PR up, and I'm hoping to get that deployed today, but it was not the most efficient week. As you can tell from my background, I'm here where Jordan's got surgery in a couple of hours; it's only 6:30 in the morning here, so I'm just trying to get ready and be here for her, and then I'm gonna have to go offline, 'cause we're gonna go wait with the doctor. So I'll probably be off most of today. I'm gonna bring my computer with me while I wait and try to get this BLST version deployed to a feature node and get some metrics, 'cause hopefully that will help with the network stuff; just taking some of the load off the main CPU might help with some of the other issues we're seeing. That's my goal for today. And then I do have a small build issue, a Linux build issue that I wasn't seeing on Mac. I'll resolve that pretty quickly, I would guess, 'cause I've seen it before, I just don't remember what the fix was. And keep on going. And then I have the second piece ready for Gajinder, but I know he's been super busy with getting the next step of BLST approved.
- Great, thanks, Matt. All right, moving on, we've got Cayman, if you have anything else to add.
- Yeah, so two things. One, as we mentioned, I'm working on that libp2p branch, trying to keep it up to date; I think I'll push any fixes or latest updates on that today. The other thing: I've been working on getting us ready for Node 20, which came out a while ago. I had some performance improvements, and I've got a PR open to simplify our snappy frame decompression. We were using some kind of mildly supported, mildly unsupported libraries, and it could all just go away and be simpler. In the process, we're updating the native library we're using, snappy, to the latest version, which is Node 20 compatible. And that's it for me.
- Oh, actually, I've got one thing. I made a comment in the networking channel, but it's kind of a cool type hack I found out about called branded types. It's a way of creating unique types; they call them nominal types. Basically, if we wanted to distinguish between a peer-id string and, I don't know, a normal string, you can create this type, call it PeerIdStr, which has a special little twist to it. Anything that's typed as a plain string does not satisfy PeerIdStr; you would have to explicitly typecast it to PeerIdStr, or have some kind of function that does the typecast for you. So it provides a little bit of assurance that you're not going to accidentally use a string where it needs to be validated first or whatever. So yeah, if anyone's interested in that, I wrote up a little comment and I've got an example library that uses it, so feel free to take a look at that.
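A minimal sketch of the branded-type trick using the peer-id example (names are illustrative):

```ts
declare const brand: unique symbol;
type Branded<T, B extends string> = T & { readonly [brand]: B };

type PeerIdStr = Branded<string, "PeerIdStr">;

// The one sanctioned way to make a PeerIdStr; validation here is illustrative.
function toPeerIdStr(s: string): PeerIdStr {
  if (s.length === 0) throw new Error("invalid peer id");
  return s as PeerIdStr;
}

function connect(peer: PeerIdStr): void {
  // ...
}

connect(toPeerIdStr("16Uiu2HAm...")); // ok
// connect("just-a-string");          // type error: string is not PeerIdStr
```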
- Very cool, Cayman, thanks. All right, next up we have Tuyen.
- Hi. So I focused on 1.9.0 to investigate the external memory issue. Finally, I have a fix for the batch delete, and it seems good. On the other side, I tried reverting some of the API work; the PR I found is the one that introduced the head block hack, which has nothing to do with the external memory, so I think we just leave it as is. For now, I think the external memory is good. The other thing is investigating the network thread. I was able to take a profile, as shared in the PR, and from that I found some low-hanging fruit in gossipsub. One is to not convert the PeerId when we report the validation result to gossipsub; the other is to unbundle two levels of metrics. Both of these PRs are likely to save us about four percent of CPU time, but that's not the root cause of the network thread issue. We'll continue with the next steps.
- Cool, thanks for the update. Next up we have Nico.
- So I guess the main thing was making the thread pool we use to decrypt keystores reusable. I also improved the error handling there a bit and found some other issues; for example, we could not terminate the decryption without force-closing the process. This should all be fixed now. Then I also submitted the PR so we can use that in the key manager API, which was then quite a simple change, so it's not that huge of a diff; thanks for the review there. And all those things are in RC2, yay. That's quite the advantage, I guess, of having delayed it so much: I think we now got the most important things in that we actually planned for 1.9. Besides that, it was just fixing a few smaller things that came up on Discord or in GitHub issues.
And one other issue I noticed: if the beacon node has been running for a while, sometimes it does not seem to exit cleanly; the process just keeps running. I'm not sure what the cause is. I investigated a bit but did not find the handle that keeps it active; I could not figure out what it is. So maybe someone has an idea. It seems really random and not really testable, but it definitely only happens after maybe 10 minutes of running or so. My idea was that we explicitly process.exit once the beacon node is closed, to avoid that. But it seems so rare, so I'm not sure.
- Does it happen on Mac or Linux?
- I only tested on Linux. So, yeah.
- Because we run some nodes in the E2E tests and the simulations, they run over 10 minutes, and they exit fine, at least on CI.
- Yeah. I'm also still not sure what happens in Docker. When I updated my mainnet node, which runs in Docker, it looked like it was updating the container only after 60 seconds, which indicates to me that that was the timeout after which Docker force-closed it. So let's see, I still want to investigate that. Maybe we have to explicitly exit, which I kind of want to avoid. But yeah, that's it.
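One hedged middle ground (illustrative names, not the actual Lodestar shutdown path): after a clean close, arm an unref'd failsafe timer that force-exits only if some stray handle is still keeping the process alive. A clean exit never fires it, so unclean shutdowns stay observable via the warning:

```ts
async function shutdown(beaconNode: { close(): Promise<void> }): Promise<void> {
  await beaconNode.close();
  const failsafe = setTimeout(() => {
    console.warn("process still alive after close, forcing exit");
    process.exit(0);
  }, 5_000);
  // unref: the failsafe itself must not keep the process alive.
  failsafe.unref();
}
```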
- Interesting. Thanks, Nico.
>> I just wanted to make one comment on the explicit exiting. I know that for Geth, you need Ctrl-C to kill the process, but then if it doesn't die, it tells you it's trying to shut down, and if you hit Ctrl-C again... I don't know, maybe it's not the second time, maybe you have to do it like 10 times or something like that, and then the 10th time it kills it. But that's always another option.
- Yeah, I mean, most process managers just have a timeout at some point, usually 30 seconds or one minute. But I guess it's still annoying if you update Lodestar and the container hangs for a minute or so when in reality it already shut down after a few seconds.
- Actually, Geth may need like 20 minutes of graceful shutdown. At least in Dappnode we used to have this problem: Geth keeps an insane amount of data in memory and needs the clean shutdown to persist it, and if you don't wait those 20 minutes, people had to spend two days re-syncing. Luckily, we don't have that issue.
- Wow.
So what are, I guess, the side effects of doing that, Nico? Like force-closing, is there anything that...
- Actually, no. What happened before is that we exited abruptly because of this library we were using, which exited uncontrolled in the middle of the shutdown process. Now, what I observed is that this beacon node close method we have always succeeds, but after that there might still be an active handle, in really rare cases, which I need to identify. But I think nothing bad really happens if we exit there. I guess one disadvantage would be that we would not detect as easily whether the beacon node shuts down cleanly or not, because we'd always explicitly exit. Maybe we could add a test for that, that just checks.
- That would be nice to have.
- Cool. Thanks for that, Nico.
All right, and then we have Nazar.
- Thank you. I was working last week on some final features and refactoring of the prover. One of them was the batch request, which turned out to be a bit tricky, because some providers, for example ethers.js, do not have a public interface for batch requests, while on the other hand web3.js does have one, and we wanted our prover to be compatible with both at the same time. That was tricky and took a lot of time to finalize. So the PR is open; in that PR I covered a couple of things which were left from the major epic. Once this PR is merged, I will hopefully close the epic issue that we have, and the only thing left out from the epic will be the P2P interface for the light client. I will open a separate issue for that particular task and then close the epic.
In addition, I was going through some types and I saw a very useful ESLint rule for unnecessary typecasting. I enabled the rule and found out that there was a lot of unnecessary typecasting around our source code, so I opened a PR. If you guys find it fine, then we can merge it; or, if you think keeping those unnecessary typecasts is necessary in our source code, for example `as string` where the value is already a string, we can close the PR.
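The rule in question is presumably @typescript-eslint/no-unnecessary-type-assertion; a small example of what it flags:

```ts
function getName(): string {
  return "lodestar";
}

const a = getName() as string; // flagged: getName() is already a string
const b = getName()!;          // flagged: the value is not nullable
const c = getName();           // fine
```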
In addition, I was doing research on MetaMask Snaps for the future target of integrating the prover with MetaMask. It turns out that MetaMask Snaps may not be the right framework or architecture for integrating the prover, because as per their documentation and architecture, they suggest that there should not be a long-running process in a snap. On the other hand, we need to run a light client that keeps running always. So we have to figure out a way of using snaps, or whether there's another way to do it, but I'm not sure yet. I'm doing a bit more research on this topic, and next week I will continue that research. Hopefully, when v1.9 is released, we will make the prover package public as well, and then I will update the light client demo that we have to use the prover package, instead of having this boilerplate code from the light client package itself. Yeah, that's all from me.
- Awesome, thanks, Nazar. If that's the case, maybe we should escalate a little further and just talk to the MetaMask guys directly. We should have a connect channel on Slack with them to sort of figure this out.
- Yeah. Once the research on our side is complete, then we can talk to them. Because the first thing I foresee they're going to suggest is to use snaps, since that's what they developed to extend the behavior of MetaMask. But in our case, we have to think about how snaps can fit our use case.
- Right.
- I will update you on my findings by tomorrow, hopefully.
- Okay, thank you.
- Cool. And, Lain, if you have any additional points you want to add... otherwise, if you're good, I think that about covers it for today. Okay, cool. Thanks, guys. I'll get a summary of the notes out in a bit today. Have a good week, and we'll talk to you on Discord.
>> Sounds good.
>> Thanks.
>> Bye-bye.
>> Thank you.
>> Bye-bye.