philknows
Aug 10th, 2023
  1. Recording in progress.
  2. Okay, so I'm doing this from my phone today, so we'll see how this goes.
  3. Wasn't able to charge my laptop in the car, but we'll figure this out. Anyway, I think Nazar is stuck in something and also Gajinder is in his virtual call, so
  4. that's why we're recording this.
  5. Okay, to get started today here.
  6. Let's talk a little bit, I guess, about some of the issues that we're getting on the Big Boy testnet. I think Lion has really outlined it pretty well in some of the issues.
  7. But I don't know if there's additional discussion that you might want to have while we're all on the call here about this.
  8. But yeah, if there are any sort of comments
  9. in regards to what Lion has written up in issues
  10. 5855 and 5857,
  11. please feel free to go ahead.
  12. By the way, today... I was just going to add...
  13. Go for it.
  14. Oh, yeah, I was just going to add, I think,
  15. Lion already messaged, but they're doing another DevNet,
  16. I guess, replicating the mainnet environment
  17. with the same amount of validators and allocation, I think.
  18. And that should be happening today at some point.
  19. But yeah, go ahead, Cayman.
  20. So I thought that the reducing cached beacon state size issue is very interesting.
  21. Yeah, it's the kind of change that can give a huge result, but it's also,
  22. like, a really big change.
  23. So I'm thinking about other options
  24. that are way less invasive and are kind of patches.
  25. And we may not have to do this whole crazy thing.
  26. Because it's really-- if we can represent pubkeys
  27. and withdrawal keys efficiently, that's basically--
  28. I mean--
  29. OK.
  30. - Yeah, we could have more gains, but it's not that bad.
  31. - So, pubkeys and withdrawal keys.
  32. Right now, we don't even support forking
  33. of the pubkey cache, do we?
  34. - That's a separate thing.
  35. And, like, in reality, in the happy case,
  36. all of this data is structurally shared.
  37. So it's not that bad, to be honest.
  38. Like we know in terms of heavy forking,
  39. it's gonna die, we're gonna die anyway.
  40. So yeah, so a single state is very big,
  41. but you usually only pay it once.
  42. [silence]
  43. What I'm worried about and do not understand yet is
  44. why memory went up to 12 gigs.
  45. That doesn't have anything to do with the state being big.
  46. This is a problem of the cache.
  47. Yeah, and I think, Tuyen, did you also see on one of the servers as well that we're
  48. nearing 12 gigs of memory somewhere?
  49. I remember reading that as well.
  50. And I'm not sure if that has anything to do with it.
  51. No, I see it on the Big Boy net only.
  52. I think there are more than 20 checkpoints there.
  53. Maybe that's the reason.
  54. Okay. Well, we should be getting some more results to sort of see how it does with the network worker enabled as well.
  55. Lion already passed all the flags for them to run on the new v1.10. So we should hopefully get some more data shortly
  56. with this new test. But yeah, in terms of the next steps to resolving this problem,
  57. I don't know if there's anything else that you guys want to add to this discussion about
  58. how to resolve it, but if not, we can continue forward and take that async.
  59. Okay.
  60. [silence]
  61. Okay.
  62. All right, so we'll wait and see the results of the next test or DevNet
  63. that they're putting up later today.
  64. The next thing on my list here was just to quickly discuss some of the memory leak stuff.
  65. I guess most of it was potentially resolved through upgrading to Node.js 18.17.
  66. I guess we haven't confirmed that yet.
  67. If that's correct, right, Tuyen?
  68. Yeah, I think there's no leak there.
  69. However, the RSS is 12 gigs.
  70. It's way bigger than the current version.
  71. So I'm not sure if it can run on a normal instance,
  72. 'cause I just tested on an AX-41 only.
  73. Maybe it's worth deploying to the whole group.
  74. - So I think, 'cause I deployed a bunch of different SHAs,
  75. and I think I found between which two commits,
  76. something changed, and I think it might be
  77. the native fetch.
  78. So it's possible that we could back that out
  79. until whatever that is gets fixed
  80. 'cause that is a known issue.
  81. So it's possible we might be able to revert that
  82. and not see the memory issue; like, get the gains back.
  83. - But a fix for that was already in, April, I think.
  84. So if you look at the release notes
  85. I linked in the private chat,
  86. they did a patch that was just not yet
  87. in Node.js, basically.
  88. - So we can't say conclusively that it was that, right?
  89. - Yeah, I think so.
  90. Right, well, I mean, we tested it.
  91. It's not happening on node 20.
  92. And it was also not leaking on the version of node 18--
  93. was it 18.17, where they patched undici.
  94. So I think we can--
  95. >>I did see two other leaks, though,
  96. that were also relating to 18.16 and 25
  97. that were posted on the node board and the issues board.
  98. And looking at feature one large, it's leaking,
  99. but feature one medium seems like it's not.
  100. So it's like, basically I put a bunch of commits
  101. on different servers on that group,
  102. and some are leaking and some are not,
  103. like the older ones are not.
  104. So it's somewhere in between there.
  105. So we have kind of like a breadcrumb
  106. of which commit is initiating it.
  107. So I think we'll be able to resolve it.
  108. - And is it also leaking on node 20?
  109. - Let me check, I've got both versions.
  110. - I checked in my git Tree and it's not like that.
  111. For all instances, I think earlier,
  112. my source is still in git Tree as M1V,
  113. but that still runs Node 18.
  114. - Is that 18.17 or?
  115. - 18.16.
  116. - By the way, if that's related to native fetch,
  117. we would also see it on the validator client.
  118. So it does; it leaks in validator old space and main thread large objects.
  119. So both were climbing and I'm looking at beta.
  120. I'm going to have to do a little more research, but like it's kind of hit or miss.
  121. Some of them are leaking, some of them are not.
  122. So I'm going to keep at it, but I don't want to analyze it now while we're on the call.
  123. [silence]
  124. So I have beta group and feature one have the same sets of SHAs deployed,
  125. but one of them is on node 18, one of them is on node 20.
  126. [silence]
  127. Okay, is there anything else being tested on feat two and feat three right now?
  128. [silence]
  129. Yeah, feature two, I think, is where I put out BLST.
  130. So I got that brought in.
  131. And then I think Tuyen had something on feature three.
  132. [INAUDIBLE]
  133. OK, cool.
  134. Yeah, let's see if we can get to, I guess, the bottom of this
  135. and figure out if it is that commit
  136. and see if 18.17 will fix this.
  137. But this kind of goes into the conversation of just how much
  138. longer we want to support Node 18 as well,
  139. which I think Nico made a comment about: we
  140. should keep supporting it until Node 20 is LTS.
  141. If anybody has any objections to that, please speak up.
  142. I tend to agree also. As long as it's LTS, I think we should.
  143. I mean, it's the official node.
  144. Yeah. I mean, it's still on the main page,
  145. and people are still using it, so it makes sense.
  146. But yeah, I think the swap over to Node 20 LTS
  147. is, I think, sometime in October, I believe.
  148. But--
  149. >>And I apologize.
  150. It's feature one that's where I have BLST,
  151. and feature two is where we're looking for the leaks.
  152. I had that backwards.
  153. >>OK.
  154. Great, yeah.
  155. Just as long as--
  156. >>I've got kind of an opposing idea
  157. that I think we should move to node 20
  158. and ask people to use node 20 if possible, as soon as possible.
  159. But I mean, supporting node 18 is fine,
  160. but there's no reason why we should be using node 18,
  161. especially if it's causing us more headache.
  162. - Yeah, I think that's pretty in line
  163. with my view as well.
  164. Just not, like, hard dropping it or updating it
  165. in the package.json; we wouldn't force it, right?
  166. - Right.
  167. But I think we did that pretty well with the last release.
  168. I think, yeah, it's in the announcement release notes
  169. anyways.
  170. - Right, we asked people to use it.
  171. And if they use Docker, which is the
  172. recommended installation,
  173. then they're gonna get Node 20 anyway.
  174. - In our docs, what does it say?
  175. Because I guess that would probably be the last thing
  176. we would need to update to really be like,
  177. hey, we don't really want to support 18 anymore.
  178. Or at least, we don't recommend you to run it.
  179. - Yeah, the only reference right now is really
  180. in the package.json.
  181. Everything else is on 20.
  182. Cool.
  183. Well, I mean, I don't see any harm in putting 20 into that,
  184. unless there's some objection to it.
  185. But--
  186. >>I think we should keep it as 18 until--
  187. because that reference will hard-enforce
  188. not using Node 18.
  189. So if you try to install Lodestar with Node 18,
  190. it will throw an error,
  191. which I think is not what we're wanting.
  192. I think we want it...
  193. Yeah.
  194. ...to still be allowed.
  195. Yeah, that is how I feel as well,
  196. where it doesn't error on 18,
  197. but we run everything on 20, as far as our fleet and our images go,
  198. but it just doesn't error trying to install on 18.
  199. Would it be smart for us to run every--
  200. I guess we wouldn't be testing any further things on 18 anyway,
  201. but we also wouldn't know if anything broke, right?
  202. I guess when we're doing releases and stuff,
  203. nothing runs on 18, really.
  204. But either way, I think we have it pretty correct for now,
  205. and we'll just keep an eye on when we should actually
  206. change what the package.json requires.
  207. But it sounds like to me, everything is fine the way it is.
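
For reference, the package.json field being discussed is "engines"; Yarn enforces it at install time, which is why bumping it would hard-block Node 18. A minimal sketch (the exact version floor here is illustrative):

    {
      "engines": {
        "node": ">=18.15.0"
      }
    }

Left at an 18.x floor, installs on Node 18 keep working; raising it to ">=20" is what would make them throw the error mentioned above.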
  208. Okay, next up on my list, I have just,
  209. I guess discussion points on the interoperability
  210. with other clients. I think we just posted something today
  211. for Nazar to look into, but we have had issues,
  212. I guess, with interoperability with a bunch of other clients,
  213. and Nico made a point that right now,
  214. a lot of our fallback stuff is on the VC side,
  215. which is not, I guess, fully compatible with some;
  216. like, you wouldn't be able to use a Lodestar VC,
  217. sorry, a Lighthouse VC on a Lodestar beacon node, as an example. Is this the strategy that we want to
  218. stick to? Like how do we want to go about, I guess, increasing interoperability? Because I
  219. think that that is an important thing for us to be able to do.
  220. Does anybody have any points for this? Otherwise, we should try to maximize the priority of interoperability and make that happen, especially with Lighthouse and Prysm being the two most popular clients right now.
  221. I mean, I think there are three most common setups.
  222. So one is if people run a solo staking rig on Rocket Pool, for example, then they
  223. might use a different VC.
  224. The other is DVT, where we have only the Lodestar VC right now that runs with a Lighthouse
  225. beacon node, where there have been no issues so far.
  226. And then we have the, I guess, big operator issue where they want to use Lodestar as fallback,
  227. I guess.
  228. And I mean, the issues, at least that I've noticed so far, are really these missing attestations,
  229. or rather, not attestations, but aggregates.
  230. So Lighthouse and Nimbus, and maybe Teku as well, I'm not sure, can't
  231. produce aggregates with Lodestar for some reason.
  232. And then, yeah, I think the bigger issue is really with this fallback logic that we have
  233. on the validator side, because other VCs assume this is done by the beacon node.
  234. So yeah, you basically miss the block if the MEV boost relay fails, because they only request
  235. the blinded block and expect the beacon node to do the fallback behavior.
  236. So yeah.
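
A minimal sketch of the beacon-node-side fallback Nico is describing, where the beacon node (not the VC) falls back to a locally built block when the builder/relay path fails. All names here are illustrative assumptions, not Lodestar's actual API:

    // Hypothetical beacon-node-side block production fallback.
    interface BlockSource {
      produceBlindedBlock(slot: number): Promise<unknown>; // builder / MEV-boost relay path
      produceLocalBlock(slot: number): Promise<unknown>;   // local execution engine path
    }

    async function produceBlockWithFallback(src: BlockSource, slot: number): Promise<unknown> {
      try {
        // Preferred path: blinded block via the relay
        return await src.produceBlindedBlock(slot);
      } catch {
        // Relay failed or timed out: build locally so the proposal is not missed
        return await src.produceLocalBlock(slot);
      }
    }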
  237. I guess based on what you're seeing, Nico,
  238. are there a lot of people actually doing
  239. this sort of setup, which requires us to be compatible
  240. with all these other clients?
  241. 'Cause I'm trying to gauge how important
  242. something like this is in regards to fixing or figuring it out.
  243. (silence)
  244. Like I think it's just the one guy on Discord right now
  245. who's running this type of setup specifically
  246. unless I'm missing a group of people here.
  247. - I am also running it, but yeah.
  248. These are, at least from my side, the observations.
  249. So I also have a setup on Goerli now running Lodestar
  250. with all four validator clients, or with all four others.
  251. - Right.
  252. - But yeah, all but Prysm.
  253. I think with Prysm,
  254. you cannot make it work.
  255. So they have their own stuff going on Prysm.
  256. So I'm not sure if they are even compatible
  257. with any other clients
  258. 'cause they use a completely different API to communicate.
  259. Yeah.
  260. So it sounds to me like this is quite a big difference
  261. that would require some investigation
  262. and quite a bit of work, it sounds like to me,
  263. to make it work, especially with Prysm,
  264. 'cause Prysm is like 40% of the network right now,
  265. or something like that.
  266. (mouse clicking)
  267. - I'm not sure if they have a flag or something
  268. so that you can use REST APIs,
  269. but I think by default they use gRPC or something.
  270. - Yeah, I think so.
  271. Yeah, I mean, we should definitely try, in my opinion,
  272. try to be as compatible
  273. with the other clients as much as possible.
  274. Some of it, I guess, will require more work than others,
  275. but that's what I would like to see is that those issues
  276. in regards to interoperability
  277. are resolved in the near future.
  278. But that's just my opinion on it.
  279. Does anybody else want to add anything to that?
  280. But I think, in terms of the people
  281. that we're really trying to lure into using Lodestar,
  282. it would be larger node operators.
  283. And a lot of them have expressed
  284. that their setups are also multi-client.
  285. And if we're gonna try to target those,
  286. we should try to be as compatible as we can.
  287. That's sort of my rationale for it.
  288. - Yeah, I agree.
  289. I mean, it helps the lodestar adoption case
  290. for those people.
  291. So we should definitely be supporting it.
  292. And it seems like if we get tests,
  293. then we can kind of ensure that we're not regressing.
  294. So I think it makes sense.
  295. Okay. Yeah, I think that
  296. issue specifically with the...
  297. I think, well, we'll get
  298. Nazar to look into this. I
  299. think, if he's on the call
  300. there. Okay. Oh, yeah, he is.
  301. I'd like to see it looked into just a little bit more so that we can try to resolve
  302. the issues with other clients if possible.
  303. Okay, I have one question from reading the comment in the Discord.
  304. This fallback mechanism which Nico is referring to, that in other implementations is on the beacon node
  305. side, and in our case it's on the validator side.
  306. Is it not part of the spec or what?
  307. >> No, I don't think so.
  308. I think it's just a feature that all the clients have kind of copied from each other or independently
  309. found to be useful.
  310. It just seems that everyone else is doing it on the beacon node side.
  311. And if we are the only one different from the others, is there any particular rationale
  312. behind when we implemented it?
  313. I think the rationale would be that the more work we can hoist to the validator, the less
  314. we're having to do on the beacon node side would be my interpretation of it.
  315. But I also didn't implement it.
  316. I think that was Gajinder, so you'd have to ask him.
  317. Okay, I will check with him.
  318. But also, I think there will be spec enforcement. So the v3 produce block endpoint, I think, enforces that,
  319. but I think it's not yet merged. But I also saw that Gajinder mentioned some points there.
  320. I think maybe there are some rationales also for why we implemented that fallback in
  321. the validator client. Okay. Yeah, I guess we'll need to get Gajinder to continue
  322. that. I don't have anything
  323. further to add on that. But are
  324. there any other additional
  325. points or questions in
  326. regards to this?
  327.  
  328.  
  329. If not,
  330. I just want to get an
  331. update from anybody who's been
  332. working with the network worker
  333. on what's being done in relation to that.
  334. But I don't have the latest update
  335. on what the status of the network worker thread
  336. is, if anybody has anything to throw in here.
  337. I was adjusting the memory, both with and without the worker,
  338. by bumping the new space.
  339. And it actually does drop event loop time,
  340. whether the network is on the main thread or on the worker thread.
  341. So that is definitely a good solution.
  342. But it came up when we started to see the leak.
  343. So now that we've kind of narrowed the leak down,
  344. I'm going to deploy with the new space update on 20
  345. and see how it runs with that now.
  346. - All right, great.
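
For context, "bumping the new space" refers to resizing V8's young generation, which Node exposes as a command-line flag (value in MiB). A hedged example; the size and script path here are illustrative:

    # Enlarge V8's semi-space (the "new space" young generation)
    node --max-semi-space-size=64 ./packages/cli/bin/lodestar.js beacon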
  347. - And maybe also one thing,
  348. so I also posted something in our issue
  349. that we have with the network worker,
  350. what we might consider because I saw some benchmarks
  351. that look pretty bad for worker threads.
  352. And it might be just the case
  353. that we are just using them wrong here.
  354. And it might be better to use a child process instead,
  355. because from what I've read and also what I see
  356. from node maintainers, when they mentioned the drawbacks
  357. and advantages of worker threads,
  358. it's mostly that they should run short-lived tasks
  359. basically that are blocking the main thread
  360. and are CPU intensive.
  361. And this is not really the case for the network.
  362. We have a lot of IO and it's a long-lived process.
  363. And I think what it comes down to, as far as I understood it,
  364. is really just the OS allocating resources.
  365. So there's the main difference
  366. between a thread and a worker.
  367. But yeah, not sure.
  368. Also, I noticed that Ben mentioned Node's cluster module a lot of times
  369. in the emails that he wrote.
  370. And yeah, that uses a child process as well.
  371. So maybe we could give that a try.
  372. I at least wanted to give that a try and get some metrics.
  373. Yeah, but this doesn't satisfy me.
  374. Like, we're just playing crazy guessing games here.
  375. I want to understand why a worker thread would work better or worse than a fork at a fundamental level.
  376. Do we have the answers?
  377. Yeah, that's a difficult question, actually.
  378. I researched a lot there.
  379. And I think it really comes down to the OS allocating resources.
  380. Because it's another process, basically,
  381. and the thread has shared memory with the main process and so on,
  382. which I think is also not great in our case.
  383. And why does it have an impact?
  384. Yeah, you really need to, I'm probably the wrong person to ask here.
  385. So they should be on separate threads though,
  386. like, they should be able to be scheduled on separate cores.
  387. If you're using child processes and workers, it should schedule
  388. them separately, as opposed to trying to interleave them, because a child process is a separate PID,
  389. while a worker thread is still under the same PID. So it will tend to put child processes on different
  390. cores. Worker threads do not have an independent PID.
  391. Exactly, correct. But when you do a child process, it's a separate PID with all separate
  392. address space. And it would also load all the dynamic libraries and shared libraries again, so it
  393. may have additional memory overhead, but I don't know that for sure.
  394. And I think there's slightly more cost in IPC, because it would use a Unix socket instead of,
  395. I don't know, doing it over memory directly.
  396. But I think I also saw that in one of Ben's emails
  397. that, basically on Linux,
  398. it's not noticeable at least.
  399. And I think we are not using any shared memory, right?
  400. So at least from reviewing the code,
  401. I didn't see that we are doing that.
  402. No, we're not.
  403. Because the worker threads library serializes everything
  404. to a string and then just passes the string.
  405. And no, but I think for the key stores, we use transferable objects, which are shared array
  406. buffers. So.
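
A small sketch of the two models being compared, using only Node built-ins (module paths illustrative): a worker thread shares the parent's PID and can share memory, while a forked child is a separate process with its own PID that talks over an IPC channel:

    import { Worker } from "node:worker_threads";
    import { fork } from "node:child_process";

    // Worker thread: same PID, messages pass via structured clone (or transferables)
    const worker = new Worker("./network.js");
    worker.postMessage({ cmd: "start" });

    // Forked child: separate PID and address space, messages serialized over IPC
    const child = fork("./network.js");
    child.send({ cmd: "start" });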
  407. Although it seems like, since Tuyen brought the extra setTimeout into place in the network
  408. thread, it's actually brought loop time way down to where it would be expected to be
  409. anyway. So I wonder if it's a moot point, because it seems like the node, well, again, this is where
  410. I'd cede to you guys. But it seems like it's doing what it needs to do at the moment, that
  411. we don't have the crazy loop times like we used to have.
  412. Okay, I just wanted to know where the status of that was, and I guess sort of our next steps
  413. are
  414. in trying to
  415. improve it.
  416. Like, how close are we, perhaps,
  417. to understanding whether or not this is something that will be ready within, I'd say, the next two minor releases?
  418. Or is it pretty hard to say still at this point?
  419. I mean, it seems like it performs about the same now, whether the network is on a worker thread or on the main thread, as far as just loop times and memory usage and whatnot.
  420. But, I mean, it seems like it to me.
  421. I mean.
  422. But I would leave that.
  423. Yeah, that was kind of what I saw too:
  424. I didn't see any improvement and I didn't really
  425. see any degradation, but...
  426. I feel like we still need to do more testing.
  427. Like I was saying.
  428. I didn't realize that we had it,
  429. that we were testing on Node 18.
  430. So it'd be good to bump our fleet back up to 20
  431. and test again.
  432. And we should actually, now that you say that,
  433. set the variable in Ansible to 20
  434. so that whenever we deploy, it automatically goes to 20 now.
  435. OK, great.
  436. Yeah, we'll keep going down the path of more investigation
  437. testing.
  438. I think we have a bit of a roadmap here to see the benefits.
  439. And if we do want to try something else,
  440. we can further discuss some other options.
  441. Does anybody have anything to add to the network thread
  442. discussion?
  443. Is there, like, a documented acceptance criterion or something
  444. that we want to target?
  445. Like, I think I'm hearing there's no
  446. performance increase, but there is no degradation. If that's
  447. what we expected from this refactoring from the start, then I
  448. think it's okay. But if that is not what we were expecting, then
  449. maybe we take our time to dig further.
  450. - Well, it does bring it down;
  451. like, 'cause when you put everything,
  452. network and main thread, on one thread,
  453. the loop times are substantially higher.
  454. When you break it up, it looks like, I think,
  455. I mean, it's like 30 milliseconds versus 15 on the
  456. two loops.
  457. So it does seem like it brings the overall loop time down,
  458. but it doesn't really change API response times,
  459. because the latency between the two,
  460. I think, is what's causing it:
  461. each loop is performing better,
  462. but the communication between the two
  463. is making the API response about net zero.
  464. - And when you refer to a loop time,
  465. does it mean, like, event loop time?
  466. - Yeah, exactly.
  467. - But we wanted to-- we introduced this worker thread
  468. milestone into Lodestar
  469. because we wanted to improve the performance.
  470. So if it's not reaching that level,
  471. then maybe we should step back and think again about
  472. how we implemented it,
  473. and maybe there's some better way that we overlooked,
  474. because the worker thread concept,
  475. it was like, it looked promising at the time.
  476. - It does.
  477. - Right, and I think,
  478. I feel like the landscape changed
  479. when we moved to deterministic long-lived subnets
  480. where we're only subscribing to two subnets
  481. instead of possibly up to all 64.
  482. And now it's just like, because we're doing so much less work,
  483. whatever gain we thought we had or we thought
  484. we were going to get is less important or not
  485. as noticeable or less measurable because there's just
  486. so much less work being done in general.
  487. Yeah, we must test every single worker
  488. with Subscribe all subnets.
  489. Otherwise, it's not useful data.
  490. Because as you say, otherwise, it's so idle
  491. that it probably doesn't matter.
  492. Yeah.
  493. Well, that was what we were seeing.
  494. We were testing it without Subscribe all subnets.
  495. And it was like, well, it's kind of the same.
  496. Yeah.
  497. Like, the feeling I have is, if we don't test with subscribe all subnets, and then we say, oh, the worker looks good, and then we ship it and people use it with all subnets and then lots of nodes die, that's not okay. That's not okay.
  498. Yeah, we need to compare with the stable mainnet node.
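
For reference, Lodestar exposes a flag for exactly this, so a test deployment along these lines would exercise the worker under realistic load (other flags and setup omitted):

    # Stress the node the way heavy users will: subscribe to all attestation subnets
    lodestar beacon --network mainnet --subscribeAllSubnets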
  499. Okay, sounds like we just need to gather some more data here then,
  500. before we can make any sort of decision on what sort of benefits we get from it.
  501. But while I would love to have a more fundamental understanding of the differences between worker threads and
  502. a forked process, I think it's worth it to do a test. So if someone wants to do it, please go for it.
  503. Yeah, I mean, I definitely want to take a look at that. Also, there's this other
  504. worker pool implementation that's maintained by mostly Node core devs, and there I also asked;
  505. let's see if we get a good response there.
  506. Okay, anything else to add to the network thread point?
  507. Okay, sorry, was someone going to say something?
  508. I was going to say that I implemented a similar structure in another project,
  509. and I opted for detached child processes
  510. to maximize the hardware performance.
  511. Because whether it is a child process or a worker thread,
  512. either way, there is a lot of dependency
  513. between the main thread and a worker thread or child process
  514. created by the Node.js environment.
  515. So if we clearly want to maximize and
  516. utilize the performance of all available cores, we need to have a detached child process that
  517. can utilize a full core. And when I implemented it, I used a third-party serialization library,
  518. which was performing around 300 MB per second for real-time serialization. So there was not
  519. any impact transmitting data between two child processes that do not have an IPC
  520. connection. Because there is one: if you spin up a normal child process, there is an IPC connection
  521. created with the Node environment, which has an overhead on top of the child process.
  522. So if we really, really want to achieve full-scale performance of the available cores, then
  523. maybe we start looking into this pattern.
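
A rough sketch of the detached pattern being described, using Node built-ins. Omitting "ipc" from stdio is what avoids the implicit Node IPC channel; the paths are illustrative, and the replacement transport (e.g. a Unix socket with a custom serializer) would have to be managed separately:

    import { spawn } from "node:child_process";

    const child = spawn(process.execPath, ["./network-proc.js"], {
      detached: true,                          // own process group, scheduled independently
      stdio: ["ignore", "inherit", "inherit"], // no "ipc" entry: no Node IPC channel created
    });
    child.unref(); // don't keep the parent's event loop alive for this child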
  524. Okay, I will share some implementation details later on in the Discord if someone wants to look
  525. at it. Yeah, yeah, that would be interesting to look at. Thanks for sharing, sir.
  526. Okay, any other points for that?
  527.  
  528.  
  529. Okay, let's do a quick round of updates. Let's start today with NC. How are you doing?
  530. All right, hey guys. Okay, so for the ePBS side, we had a first
  531. meeting with the Prysm folks last week. And it seems like, you know, things are still going pretty
  532. slow. But at least we started splitting up work and we have a weekly meeting set up.
  533. So that's pretty cool. And now, on the other side, you know, to keep myself productive,
  534. I started looking into the 6110 implementation on Lodestar.
  535. Seems like there is a prerequisite on the pubkey cache.
  536. So I need to do the refactoring and also, you know,
  537. have like two new sets of pubkey cache
  538. attached to the beacon state.
  539. Still looking into it, not much else to update.
  540. So, right, that's all from me.
  541. Are you and Lion also working on some other ePBS-type
  542. implementation as well, or is that the main one?
  543. - Right, so right now we're only looking
  544. into the PTC design.
  545. There are obviously other designs out there,
  546. but I think we're just focusing on the PTC.
  547. - Got it, okay, thank you.
  548.  
  549.  
  550. All right, let's move forward with Lion.
  551. Hey, so last week we spent a bunch of time on Whisk, debating different optimizations
  552. and doing more security analysis.
  553. There is one that's very promising that could reduce the state size increase from doubling to 33% more.
  554. So I took it as far as I could, and now it's in the cryptography team's hands to
  555. see what comes out, but it's exciting. And this week, yeah, I spent a bunch of time thinking
  556. about the devnet, the Big Boy issues that we already discussed. So all good.
  557. Thank you.
  558.  
  559. All right. Let's go ahead with Nazar.
  560. Thank you. Based on my last week's update, I created an EL provider proxy which shows 100
  561. ETH for every account you try to connect to, no matter how much balance it has, to test
  562. whether our prover is working fine or not. And during my testing, I found out that the prover
  563. was not working. It was not verifying the balance, and it was a surprise because all of our tests
  564. were working fine. Yeah, I spent a lot of time, like a day, on it figuring out what could be the reason,
  565. but it turns out that it was Web3.js version 4.x, which implemented RPC in a different way than
  566. I was expecting. So I opened a PR to make our provider compatible with the Web3.js
  567. 4.x version. And yeah, when I was doing it, there were some very weird TypeScript issues,
  568. which are causing this PR to be delayed. But hopefully, it will be completed today, and then
  569. I will finish this PR.
  570. And I will add a documentation section inside the readme on how to test this unverified
  571. provider with our Lodestar prover.
  572. And yes, then I will be working on the issue of the simulation tests for different beacon
  573. node and validator client configurations.
  574. Yeah, that's all from me.
  575. Thank you, Nazar.
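
A minimal sketch of the kind of EL proxy Nazar describes: intercept eth_getBalance and always report 100 ETH, so a correctly verifying prover must reject the response. The port, upstream URL, and overall shape are illustrative assumptions:

    import http from "node:http";

    const UPSTREAM = "http://localhost:8545"; // the real execution node (assumed)

    http.createServer(async (req, res) => {
      let body = "";
      for await (const chunk of req) body += chunk;
      const rpc = JSON.parse(body);
      if (rpc.method === "eth_getBalance") {
        // Always answer 100 ETH in wei (hex), regardless of the real balance
        res.end(JSON.stringify({ jsonrpc: "2.0", id: rpc.id, result: "0x56bc75e2d63100000" }));
        return;
      }
      // Pass every other method through to the real node
      const upstream = await fetch(UPSTREAM, { method: "POST", body });
      res.end(await upstream.text());
    }).listen(8546);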
  576.  
  577.  
  578. Okay, next up we got Gajinder.
  579. How are you doing?
  580. Hey, Phil.
  581. Hey, everyone.
  582. So I've worked a little bit on Verkle.
  583. basically I was successfully able to read local genesis after basically doing the changes
  584. which basically differed a little bit in the types from what were implemented.
  585. And I also attempted a sync but I was not able to decode the blocks that were served by
  586. lighthouse. So then I extracted some beacon block JSONs from the lighthouse and tried to load
  587. on our types. So there were a few issues that were discovered and I have sort of raised them.
  588. And hopefully I will try to make local changes so that I can currently sync on Constatine network.
  589. and I have raised the issues so that they can be addressed before the network is relaunched
  590. so that when the network is relaunched we can load and sync it in a proper way. I will still try to
  591. sync the current network in the current format so that you know on the next relaunch we are sure
  592. that we can participate in the network. Apart from that, did finalized PR regarding fee recipient
  593. and I did some mock test to see whether the fee recipient that is being passed
  594. actually reaches the notifyFCU calls to the execution engine. So that should basically
  595. give us quite a good amount of confidence in terms of our expectations with regard to fee recipient
  596. and finalized, incorporated some of the changes that we reviewed on the Free the Blobs PR
  597. and also tested it with Ethereum.js for DevNet 8.
  598. So, currently seems like it works well with Ethereum.js.
  599. No other EL DevNet 8 branches or images are out yet.
  600. So when they will come, I'll test against them as well.
  601. And did some reviews and some small PRs on, for example, execution,
  602. engine straight tracking, and also helped EF dev guys run Lodestar as a boot node.
  603. So they were having some issues with ENR and which basically we have seen these kind of issues
  604. before and had added NAT flag for them. So I basically helped them use loadstar as boot nodes
  605. and to basically then other nodes could sync from Lodestar.
  606. Yeah, so I am currently planning to work on making sure that the race that we run between
  607. builder and execution, we want to move it to beacon so that we can be compatible with
  608. other beacon nodes and validators in terms of how they run the block production that should resolve
  609. some of our interop issues that Nico has also seen. And I will continue working on syncing the
  610. the verkle testnet.
  611. Thank you, Gajinder.
  612.  
  613.  
  614. All right, let's move forward with Nico.
  615. Hey, so yeah, as I mentioned before, I looked a bit into the worker threads versus child process
  616. topic.
  617. I'm still not satisfied there with my understanding, and it's really hard to find good information
  618. on the performance,
  619. but hopefully, yeah, getting some benchmarks there
  620. and maybe some responses on GitHub will help.
  621. Besides that, after the update I did last week
  622. to the boot nodes, I was reviewing that a bit.
  623. So there was this one issue where a user,
  624. I think it was Mika, I'm not sure who it was actually,
  625. but yeah, there was an issue if you set
  626. connect to boot nodes, and there was this parsing issue,
  627. and then, yeah, I basically just reviewed the code
  628. to understand how that works.
  629. Besides that, yeah, I opened an issue
  630. where I discussed a strategy for
  631. how we should maintain boot nodes
  632. better in the future, maybe.
  633. Besides that, there was something with DappNode
  634. where some user asked if there's a possibility
  635. to maybe disable doppelganger protection,
  636. because that's enabled by default,
  637. so I'm looking into that.
  638. Maybe we can provide a better option for
  639. how they can more easily disable it.
  640. But yeah, I think there's still improvement we can do on that end,
  641. in how we improve the implementation in Lodestar.
  642. I think right now, for all clients, they just wait two or three epochs before they start attesting.
  643. But I think if we know that the instance did the attestation in the previous epoch, we can just start right away.
  644. So I documented that in the issue.
  645. And there are even security improvements, in my opinion, if we do that,
  646. which I also wrote down there.
  647. Which I also wrote down there.
  648. And yeah, so the plan for this week is I guess,
  649. testing this child process stuff
  650. and also updating the boot nodes.
  651. And then I also want to further look into
  652. the whole state cache and regen topic
  653. and review some more code there.
  654. - Cool, thanks Nico.
  655.  
  656. All right, let's go with Matt.
  657. - Hi. I had a little bug in Ansible doing some deploys,
  658. so I put up a PR to update one of the dependencies
  659. when switching between Node versions,
  660. put up a couple of small PRs for just dashboard updates,
  661. and got those cleaned up and they got merged.
  662. Investigated the memory leak issue a little bit
  663. and then also was trying to dig more on the line zero,
  664. but I ended up just adding that to the email
  665. that I sent to Ben.
  666. I did finally get that sent over to him
  667. and asked the four questions that were pressing,
  668. And then I got BLST updated, as far as in the BLST repo,
  669. and then got the updated code into Lodestar,
  670. and then updated the critical pieces in Lodestar.
  671. So I basically took out the old BLST bindings
  672. in the state transition and CLI and all the other server-side stuff. I only left the Herumi version
  673. in the light client and in the prover, because those are going to run client-side. And
  674. that was branching off of an older version from about two weeks ago, which is where that work started.
  675. So I got that standing and working and collecting metrics on it, just to see how it is relative to
  676. when all the subnets were subscribed, and it looks like it's brought down CPU usage by 40%,
  677. which is nice, by keeping it using the libuv worker pool versus the separate workers.
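
(Aside: the libuv worker pool mentioned here is the thread pool that Node's native async bindings run on; its size defaults to 4 and is set via an environment variable. The value and script path below are illustrative:)

    UV_THREADPOOL_SIZE=8 node ./packages/cli/bin/lodestar.js beacon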
  678. So I also got all of the work that Tuyen did in the BLS area, because there was a ton of work there:
  679. updating the gossip for re-verifying multiples of the same message
  680. really changed a lot of that code, enough that it wasn't really possible to do a merge commit with
  681. it. So I basically just had to separate it into two files and I'm just kind of
  682. manually putting the pieces together. So that's almost done, and I'll be able to deploy that as
  683. well, in order to see how it looks with the rest of the metrics. So we can see it, before we did
  684. that, kind of with the metrics that we're used to seeing, and then with the new
  685. metrics that we're seeing now, so we kind of get a more holistic look at how it's actually
  686. doing. But it seems like it's working okay, because the event loop is only 15 milliseconds
  687. on the old version, before we did any updates at all, from two weeks ago. So that's great.
  688. And then I also did some investigations with the new space and the semi-space;
  689. both were promising. And actually, when I was messing with that, it's what highlighted the memory leak.
  690. So I'm going to go back to putting that on; now that we can kind of get the memory leak out of there,
  691. we'll be able to see exactly what the heck that's actually doing.
  692. And then my goal for this week is I want to...
  693. I'm going to finish that merge, and hopefully I'll get that done today,
  694. and we can see what that looks like.
  695. And then Lion gave me some updates on the deduplicate payloads,
  696. and I'm going to go back to working on that one.
  697. I think the issue that I was having, when I was having a bunch of errors,
  698. was it was random. I was basically regenerating
  699. a whole bunch of blocks and using a whole bunch of randomness,
  700. and I think that was just pooping my CPU,
  701. just trying to regenerate a whole bunch of random data.
  702. So, in order to avoid that, I'm going to save those to just a fixture,
  703. so it's not actually generating the randomness every time,
  704. and then see how that does.
  705. So I've got a couple of strategies there.
  706. And then I'm also going to try to get the--
  707. so, the memory thing and the BLST thing.
  708. And I'll start on the deduplicate payloads,
  709. and then I'll respond to Ben, because there are
  710. going to be a couple of things that are probably
  711. going to come out of there.
  712. Thanks, Matt.
  713.  
  714.  
  715. All right, let's move on to Tuyen,
  716. and then we'll finish off with Cayman.
  717. Tuyen, you still there?
  718. Okay.
  719. So, last week, I worked on an issue where Lodestar has more than max peers. I found that we have a configuration to close the server when we reach max connections, but we count the inbound connections only.
  720. So I ended up opening an issue in js-libp2p.
  721. So Cayman, please have a look at that to see
  722. if we should work on that with the libp2p team or someone else.
  723. Next, Nethermind has an issue of not being able to sync
  724. from the EF checkpoint sync URL
  725. when it is actually seven days out of date.
  726. And so the change is just to print out an error
  727. with more detail.
  728. Not sure why the other guys don't have the issue,
  729. but Nethermind does; we'll ask them.
  730. Next, I worked on some gossipsub things.
  731. One is to update protobufs to protons.
  732. The benchmark is good, but when I tried to sort out the memory leak,
  733. I thought it was an issue in protons,
  734. but then I chatted with Matt and he found it's our issue.
  735. So we'll get back to that, and may upgrade the protobuf
  736. version from v2 to v3 there
  737. so that we can use the latest version of protons.
  738. Also, when we updated Lodestar,
  739. there were some broken Prometheus
  740. metrics from gossipsub:
  741. we see some reject or ignore messages without the topic.
  742. So I created a PR in gossipsub in order to fix that.
  743. Other than that, I investigated memory still a little bit
  744. and found that Node 20 is good.
  745. Next, I will try to wrap up my work on the indexed gossip queue.
  746. I just discussed with Lion and he said,
  747. as long as the performance is better,
  748. maybe we can go with that queue for now,
  749. and we can improve that later if needed.
  750. So we will ask Lion to review this PR.
  751. Also, I will look into Lighthouse,
  752. and how they can maintain having zero historical states.
  753. Maybe we can apply that or not.
  754. We'll look at and study the code a little bit.
  755. That's it for me.
  756. Oh, also, there was an issue with libp2p where some script may attack
  757. our node.
  758. I have a branch for that,
  759. and we've tested it.
  760. It just updates some flags on the work that was already done on the TCP side.
  761. Thank you, Tuyen.
  762.  
  763. Okay, and Cayman.
  764. Hey, so last week I was working with Alex a little bit,
  765. Alex at the libp2p team, on this varint library,
  766. that we were--
  767. like, what the strategy should be for using varints
  768. across the different libraries
  769. and how to unify on a single implementation.
  770. I put out a PR that I'm hoping he'll review soon.
  771. But we got roughly a 10x improvement in decoding speed
  772. and roughly a 5x improvement on encoding.
  773. So, I like that.
  774. But it is only a small part of our total CPU time.
  775. I think it's like somewhere around 3% to 5%.
  776. It's still-- it'll be nice to get that down way lower.
  777. Other than that, did some investigation
  778. with Nazar on the type of JavaScript
  779. that we're outputting through TypeScript.
  780. And we kind of determined that we can bump up to ES2021.
  781. We were outputting ES2019.
  782. And that made things like nullish coalescing really ugly
  783. in the output JavaScript, which I
  784. don't know that it has any performance implications.
  785. But it would be nice to just use the latest JavaScript
  786. if we can.
  787. So put out a PR for that.
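
The compiler change being described is just the target in tsconfig.json; with ES2021 (nullish coalescing landed in ES2020), the ?? operator is emitted natively instead of being transpiled into a verbose equivalent:

    {
      "compilerOptions": {
        "target": "ES2021"
      }
    }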
  788. And then one other thing I noticed
  789. was that we still have our max mesh peer count at 9
  790. when the spec suggested it should be up at 12.
  791. And we previously lowered it from 12 to 9
  792. because we were having issues with performance.
  793. And so I figure this is a decent time
  794. to reinvestigate whether or not we can bump that back up.
  795. But I think there needs to be testing done on that
  796. before we merge it.
  797. But I opened the PR just to get the conversation started.
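
For context, the mesh bound in question maps to gossipsub's degree parameters. The option names below follow js-libp2p-gossipsub's options as I understand them and should be treated as assumptions:

    // Mesh degree parameters: D is the target, Dlo/Dhi the bounds.
    const gossipsubOpts = {
      D: 8,    // target number of mesh peers per topic
      Dlo: 6,  // graft more peers below this
      Dhi: 12, // prune above this (previously capped at 9)
    };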
  798. And other than that, this week I'm
  799. going to be bumping up to the latest version of LibP2P.
  800. Again, fingers crossed, this should
  801. be a relatively straightforward update.
  802. But we need to be the latest version
  803. to be able to get all the fixes and changes we're getting
  804. in the Gossipsub library.
  805. And also, if we are wanting to test Yamux, which
  806. I would like to start testing again,
  807. we also need the latest version.
  808. So we'll be doing that.
  809. That's it for me.
  810. Thanks Cayman.
  811. OK, any last minute points?
  812. All right. Thanks, everyone.
  813. We'll see you on Discord.
  814. Take care, y'all.
  815. Thanks.
  816. Thank you.
  817. Bye.
  818.  