- Hey everyone, we are on the August 15th, 2023 stand up.
- Wow, it's been August.
- That is crazy how fast time flies.
- Back to school everybody.
- All right, what do I have on my list here?
- Okay, we'll do a little bit of the v1.11 planning.
- There are some things we have to reprioritize.
- Um?
- Yamux was one of them that was on my
- list here, because, yeah,
- Mplex was removed, but I think Pawan
- is reintegrating it.
- If I'm not mistaken, it came in.
- Yeah, I don't think they're going
- to actually cut the release
- without Mplex support.
- But right?
- Yeah, I mean, I guess it's still
- kind of a kick in the pants to get Yamux over the finish line.
- Is there a reason why they're implementing that? Is it more efficient,
- or is it just a change? Yeah, so Mplex is deprecated by the libp2p specs,
- and it got replaced by Yamux, which is supposedly better
- and faster and all these things.
- - But is it whiz bang?
- - Our implementation isn't.
- So that's been why we haven't integrated it.
- I think there's a memory leak in it as well.
- - Right.
- - So discv5 memory leak.
- Is that related to the discv5 thing
- or are those different?
- - Different.
- Yeah, I know we wanted to work on that, and then other things just kept
- getting in the way. There was a time where we wanted to continue, but, well, we
- got libp2p upgraded. So now, I don't know if that makes a difference, though, does it,
- for the Yamux?
- Well, it just unblocks, I guess, the integration of Yamux.
- Okay.
- Because I think it was only Lighthouse that was, I guess, wanting
- to sort of deprecate Mplex. But do you still see this as, well,
- I mean, we should still do it as soon as we can,
- but is this something that we wanna try to maybe push
- into v1.11?
- I don't know if we'll get it fixed by then,
- but what was your opinion on it?
- - I think it's important, but not very urgent.
- - Okay.
- If the time comes for v1.11
- and it's available, that'd be great.
- And it is something that I started looking into yesterday.
- So I'll try to keep pushing on it.
- - Cool.
- All right, thanks.
- All right.
- Just one of those things that's a good reminder
- that when things fall to the side or wherever,
- we just gotta keep pushing to the finish line
- on these things.
- But yeah, let's see what else do I have for v1.11 planning.
- I've been mostly focused on the Holesky stuff
- just to ensure that we're caught up
- with whatever we need to do to ensure
- that we will be successful on that testnet.
- - I think I did.
- - Oh, sorry.
- - No, go ahead.
- I didn't, I thought you were done.
- No, I was just going to say that the keys have been generated.
- We've uploaded them for Genesis.
- We plan to do 5,000 keys per server,
- but that has never been done before.
- So the infrastructure team is aware that things may break,
- and we may have to lower the amount of keys on each server.
- But anyways, that's all uploaded now and taken care of.
- Genesis should hopefully be within the next two weeks, I think,
- and we'll see how that goes. One of the things that was mentioned as well, thanks Tuyen for
- reminding me, is that we should figure out a way to deal with finalized archived states.
- So I put up an issue literally just before this call to see if anybody can implement
- some pruning features for that, because if we can limit the amount of beacon data that grows with
- that, it might help with some of the Holesky servers. That way, we don't keep running out
- of disk space and stuff. And I don't believe we need any of those finalized or archived states
- for a testnet like that. So if someone can take a look at that after the call and see if they
- can implement it.
- I tagged it for v1.11 as well
- to see if we can get it in there.
- That would hopefully help with that.
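The pruning idea discussed here (dropping finalized archived states the node no longer needs, to stop the beacon data from growing) might be sketched roughly as below. The `ArchiveDb` interface and `pruneArchivedStates` name are hypothetical illustrations, not Lodestar's actual API:

```typescript
// Hypothetical sketch: delete archived state entries at or below the
// finalized slot, keeping the newest few as a restart anchor.
interface ArchiveDb {
  keys(): number[]; // slots of archived states
  delete(slot: number): void;
}

function pruneArchivedStates(db: ArchiveDb, finalizedSlot: number, keepLatest = 1): number[] {
  // Archived slots at or below the finalized slot, oldest first
  const prunable = db.keys().filter((s) => s <= finalizedSlot).sort((a, b) => a - b);
  // Keep the newest `keepLatest` of them; delete the rest
  const toDelete = prunable.slice(0, Math.max(0, prunable.length - keepLatest));
  for (const slot of toDelete) db.delete(slot);
  return toDelete;
}
```

With an in-memory map standing in for the database, pruning at finalized slot 300 would delete the archived states at slots 100 and 200 while keeping 300 and 400.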
- - Well, I'm actively working on the blocks again.
- Lion gave me some pointers in order to be able to do that.
- It's not state, but it's blocks,
- which kind of get stored in the state as well.
- So I got some headway on it this week
- and I'm gonna keep working on it.
- - Okay.
- Great.
- >> What else do I have on here?
- >> There's something that I would like to add
- and make sure it gets into v1.11:
- I've got a draft PR open for adding a boot node CLI command.
- So you can run a Lodestar boot node, and it basically will just spin up a discv5
- server and not spin up a beacon node.
- And we want to eventually, or after that,
- we want to spin up at least three different Lodestar boot
- nodes in different geographic areas
- and contribute to the boot node ecosystem.
- Because all the other CL clients are running boot nodes,
- so we really should as well.
- And I do think that this is an area where we can actually meaningfully impact the security of
- the network, because I think everyone else is pretty much using very similar providers,
- very similar jurisdictions.
- I think it's all like AWS.
- And then like, I think there's a bunch of people running things in Frankfurt.
- So it's like, it's not very well distributed among jurisdictions and providers.
- so we could do a lot better.
- Oh, and we'll have a sub-cluster in India.
- Thanks, Gajinder, for your contributions.
- OK.
- Anything else that people want to tag as for v1.11?
- I'm hoping to get this thing potentially ready
- soon; I would say
- we can maybe even get a beta out by like a week from now.
- But the thing is I'm not,
- like I'm taking the last week of August off,
- so I wouldn't be able to help sort of push this through.
- So if--
- - I'd like to see the stuff that Tuyen's working on,
- the fork choice. - The fork choice, yeah.
- Fork choice performance improvements, that'd be nice.
- - So I do have a-
- - So I would like to have the protons upgrade
- in gossipsub too, Cayman,
- 'cause we left that for a long time.
- The benchmark is just the same as protobufjs,
- so there's no issue there,
- and there's the memory leak.
- I will test it on the feature 3 group
- and maybe let it run the whole day to test it.
- >>Okay, cool.
- Yeah, I haven't looked too much into that PR,
- but I had one question about it,
- which was, I think we had some non-standard,
- or we added some limits to the protobuf decoding,
- where it would limit the number of messages
- that could be inside of a single protobuf,
- or a single RPC. I think it limited the number of control
- messages, the number of IWants and IHaves, things like that.
- But at the decoding level, as an additional security check,
- basically.
- And I was wondering, in generating the protobuf
- decoding stuff using protons, if that feature is still
- going to be there, or how we would tackle that.
- I don't see it being there. I can see the codec source code that was generated, so we
- just do the same thing Lion did: I bring the generated source code into a separate file,
- and then do our own codec with the limit option there.
- So it's not difficult.
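The decoding-limit idea discussed above, capping how many messages and control entries a single decoded RPC may carry as an extra security check, might look roughly like this. The RPC shape and the limit values are illustrative, not gossipsub's actual generated types:

```typescript
// Hedged sketch: reject a decoded RPC that carries too many messages or
// too many control entries (IHAVE/IWANT). Shapes and limits are illustrative.
interface DecodedRpc {
  messages: unknown[];
  control?: {ihave: unknown[]; iwant: unknown[]};
}

const MAX_MESSAGES = 100; // illustrative limits, not gossipsub's real values
const MAX_CONTROL = 200;

function assertRpcWithinLimits(rpc: DecodedRpc): DecodedRpc {
  if (rpc.messages.length > MAX_MESSAGES) {
    throw new Error(`too many messages in RPC: ${rpc.messages.length}`);
  }
  const controlCount = (rpc.control?.ihave.length ?? 0) + (rpc.control?.iwant.length ?? 0);
  if (controlCount > MAX_CONTROL) {
    throw new Error(`too many control messages in RPC: ${controlCount}`);
  }
  return rpc;
}
```

Wrapping the generated codec in a guard like this keeps the limit logic in one place even when the codec itself is regenerated.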
- Okay.
- I want to make sure I've tagged all of your fork choice PRs for this, Tuyen.
- If you can just send me the PRs, I want to make sure I got them all.
- Let's see. Oh, um, the memory leak issue. I feel like I should make a tweet for this,
- because I think the reason why Nico reopened it was because somebody didn't see that issue,
- because we closed it, right? Yeah, a few people brought it up. Yeah. Okay, I feel like
- I should make some sort of announcement on Discord and Twitter about this. So people
- What else?
- Anything else that's on your radars that could possibly get in within the next like week
- or two weeks at most?
- Like, no, nevermind.
- BLS?
- I would love to.
- It's running, it's looking nice.
- Like it's looking nice, but like to be fair,
- it's such a big change that like,
- I would like everybody's blessing and it runs a few days
- and lots of perf reports, but it looks nice.
- It's running.
- Feature two, if you guys see it, I mean like a lot of like,
- mesh peers are looking good,
- length of connections are good,
- message data bytes per sec,
- like all of the, like the metrics are looking nice.
- - What about state transition?
- - It's aggregating state faster.
- They're faster, like there's less lag in the state transitions.
- Like it's like across the board, like there are certain things that just, I mean, it looks nice.
- I'd love other eyes on it to make sure that I'm like, do you know what I mean?
- But it's on feature two.
- So you'd say it's ready for, like, a final review?
- It's ready for, like, looking at the metrics and stuff.
- And then the PR for actually integrating it is a lot.
- There was a load of changes in there,
- so it might be worth breaking it into a couple pieces, because I added a couple flags
- and renamed a couple flags. Also, it touches a lot of files, because I didn't want any crossover of the classes,
- so basically I had to change the imports in lots of files where the BLS was being imported or the types were being
- imported, so they're all uniform. You know what I'm saying? So there's just a lot of little changes
- that are spread out. So it's a big PR.
- And we have to publish the package and get that done, and the CI, and the blst library.
- So there's a lot of little things left.
- But I mean, it's all doable, as long as everybody's okay with the metrics. The only thing is the gossip score.
- It's more average across the gossip scores,
- but there's a couple deeper spikes, and I'm not sure why those
- spikes are there. So there's a couple things that give me pause, but otherwise most stuff looks pretty good.
- Like, the BLS metrics are all working. It's, yeah.
- You said this is feature two or feature three?
- Two.
- One of the other things I want to get an update on as well is the...
- I'm trying to think, did we run subscribe all subnets
- on any of the production nodes or anything on mainnet at all
- to see our performance?
- I don't know if we have, but we should look into that a little bit more
- because Tuyen also has his issue to increase target peers at some point.
- I'm not sure when we would want to do that.
- I guess we need to ensure that we're able to handle this, correct?
- Actually, I should probably run BLST with subscribe all subnets as well
- and see what happens when you boost it up and it's doing more gossip transitions.
- That's when it would probably do even better.
- Hopefully, knock on wood.
- So, regarding the target peers, I can work on that next.
- It's just that feature three is busy testing the protons.
- Once I'm done with that, I can test with the max peers of 100.
- Okay.
- Yeah, sounds good.
- And actually, another thing that might be able to get in is adjusting new space, because I've been playing with that all week, and it significantly reduces GC
- time. So we've got, for instance, on the mainnet nodes, it goes from 10 to 12% down to like two, two
- and a half percent of node time doing GC. So there's a couple really good things there, which is
- affecting block times and a few of the other bits and pieces. I have noticed some regression in
- scrape duration and a couple other random things, which is why I asked that question in private, of
- what are the other things that I really should be wary of, maybe that I'm not paying attention to,
- just to make sure that it doesn't degrade node performance inadvertently in some places.
- But that's a good one that we can include as well.
- Um. All right.
- Uh, we have quite a few things already in the v1.11 milestone, so I think unless there's something
- pressing, we might be good for v1.11 planning here at this point.
- And there was one thing that I
- wanted to mention while
- here. Just trying to find it
- again. Yeah, I don't know if
- there was any advancement in
- Lodestar beacon node, the aggregated attestation errors,
- do you have any update on what's going on there, Nico?
- - So I posted the update on the issue.
- Sorry, Nazar, go ahead.
- - I was working on a PR
- where we can run different combinations of validator
- and beacon nodes in the simulation test.
- Would that detect if the aggregate is not produced?
- - Yes, it will.
- - Yeah, so based on what I checked
- and debugged a bit:
- we have this cache where,
- when the VC wants to get an aggregate,
- we check in a cache.
- And I think this is filled from the previous attestation.
- And it looks like the data root is different;
- at least we don't have anything in the cache
- that matches the data root that Lighthouse is sending.
- - Okay.
- Yeah, I just have a tag that's like v1.11,
- but I suspect that this is gonna get pushed,
- but I just wanted to check up on it.
- So thanks.
- - Yeah, so I posted detailed logs and stuff on the issue.
- If anyone has ideas why this could be,
- and maybe we can discuss there.
- - Yeah.
- You're also in our Telegram chats
- with Lighthouse, correct?
- Just in case we need to contact them.
- - I don't think so.
- - No? Okay.
- I'll make sure. - Not with Lighthouse, no.
- - Okay. I'll make sure you are,
- just in case there are some questions
- that need to be routed to them as well.
- Okay, why don't we just go with a round of updates
- if there's nothing else for planning.
- So anybody have anything else
- they want to bring up for planning today?
- Cool.
- All right, let's start with Tuyen today.
- Hi, so I investigated the issue with the bigboy testnet.
- The main issue we have is a long updateHead call, which may take up to eight seconds.
- That's why we have very low peers.
- And there were two PRs to improve that.
- The main issue is that in the devnet test there are a lot of unfinalized proto nodes, and the updateHead
- call grows exponentially.
- The main issue is there's a lot of checking whether
- the node has the same finalized checkpoint.
- So the main fix was in nodeIsViableForHead.
- With that, it reduced a lot of time,
- and I think we will not have this same issue
- if we use the next version of Lodestar on that testnet.
- The other one is to track votes by index, which improves the compute
- deltas function.
- In the performance test it is like 2x or 3x better, but somehow when tested on a mainnet node it's like eight times faster. Normally an updateHead call is like 240 milliseconds,
- but it reduced to like 30 to 40 milliseconds. So, please review that.
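The compute deltas function mentioned here follows the general proto-array fork-choice pattern: for each validator, remove its old balance from the node it previously voted for and add its new balance to the node it now votes for. A simplified, self-contained sketch of that general algorithm (not Lodestar's exact code):

```typescript
// Simplified proto-array computeDeltas sketch. `indices` maps block roots to
// positions in the deltas array; each vote records the previously applied
// root and the newly attested root.
interface Vote {
  currentRoot: string | null;
  nextRoot: string | null;
}

function computeDeltas(
  indices: Map<string, number>,
  votes: Vote[],
  oldBalances: number[],
  newBalances: number[]
): number[] {
  const deltas = new Array<number>(indices.size).fill(0);
  for (let i = 0; i < votes.length; i++) {
    const vote = votes[i];
    const oldBalance = oldBalances[i] ?? 0;
    const newBalance = newBalances[i] ?? 0;
    // Only do work if the vote or the balance changed
    if (vote.currentRoot !== vote.nextRoot || oldBalance !== newBalance) {
      const currentIndex = vote.currentRoot === null ? undefined : indices.get(vote.currentRoot);
      if (currentIndex !== undefined) deltas[currentIndex] -= oldBalance;
      const nextIndex = vote.nextRoot === null ? undefined : indices.get(vote.nextRoot);
      if (nextIndex !== undefined) deltas[nextIndex] += newBalance;
      vote.currentRoot = vote.nextRoot;
    }
  }
  return deltas;
}
```

The deltas are then propagated up the proto-array to update node weights; skipping unchanged votes is where tracking votes by index saves time.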
- The other thing I'm working on is the protons migration. So, the protobuf in gossipsub
- is implemented by protobufjs, and we have a plan to migrate to protons, and we left that for a while. Now I think it's a good time to implement that in the system, and I'm testing that in the group.
- The other work is in gossipsub: there was a small PR to de-duplicate metrics.
- The issue is that at the time I had a PR to unbundle metrics, and we have two validation phases, one in gossipsub and the other one at the application level, and the names are very similar, like, counting the number of invalid or valid messages. And then the names are duplicated. So, in the PR, I suggest we de-duplicate the names; in gossipsub we can call it the pre-validation result, mainly to differentiate it from
- the validation result from the application. The main point is just to de-duplicate it, but if someone has any better ideas, please comment on the PR.
- But that's important because when I run Lodestar with the latest gossipsub, I cannot operate without that PR.
- And last thing is the index gossip queue.
- The implementation of the queue was merged. Now I'm working on a PR to consume that, as the
- last PR for this work.
- That's it for me.
- Awesome.
- Thanks, Tuyen.
- Let's move forward with Nico.
- Okay, so one thing I looked at was that Lighthouse issue, and I documented a bit about that.
- Then I also looked at updating the bootnodes.
- So that's already in.
- I basically just pulled the latest changes and updated the hard-coded values, basically
- the ENRs that we have in our code.
- Besides that, yesterday I looked at this one issue,
- which we can also get in v1.11, where we don't support the authorization header anymore, or basically
- basic auth is not supported. That was rather a bug in the previous implementation we had,
- because it didn't follow the spec. And actually, if we had updated to node-fetch version 3, we would
- have had the same thing. So this PR just builds the header beforehand, basically, because the
- fetch library itself doesn't do that. Yeah, besides that, I was mostly looking at moving
- the network to a process instead of a thread. So I got that to kind of work. But the problem
- right now is that all of a sudden, it just stops. So where I'm at right now
- is that the main thread is just not receiving
- any message events at some point.
- So I'm not sure what the reason is for that yet,
- but we are definitely sending a ton of events
- from the worker to the main thread.
- So could be related to that, maybe.
- Not sure if it's a bug in the child process implementation,
- still need to figure that out.
- Yeah, so basically stuck on that right now,
- trying to debug that.
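The basic-auth fix described a moment ago, building the Authorization header from URL credentials because the fetch API won't do it itself, can be sketched roughly like this. The helper name is illustrative, not Lodestar's actual implementation:

```typescript
// Hedged sketch: if the configured beacon URL carries user:pass credentials,
// build the Basic Authorization header ourselves and strip the credentials
// from the URL that is actually sent.
function basicAuthHeader(urlStr: string): {url: string; headers: Record<string, string>} {
  const url = new URL(urlStr);
  const headers: Record<string, string> = {};
  if (url.username || url.password) {
    const credentials = `${decodeURIComponent(url.username)}:${decodeURIComponent(url.password)}`;
    // btoa is available globally in browsers and in Node >= 16
    headers["Authorization"] = `Basic ${btoa(credentials)}`;
    url.username = "";
    url.password = "";
  }
  return {url: url.toString(), headers};
}
```

For example, `http://user:pass@localhost:9596/` would yield an `Authorization: Basic ...` header and a credential-free URL.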
- Thanks for the update, Nico.
- All right, Gajinder.
- Hey guys.
- So mostly, I worked
- on the block proposal improvement tracker.
- So I'm trying to integrate produce block V3,
- which is basically a combined API
- for both execution block and a builder block,
- and which basically will move our block builder
- versus execution race to the beacon node.
- And apart from that, I also tried
- a bit on broadcasting the proposals,
- the proposal data that validators send.
- But then there are many questions that came to mind
- over there. For example, in the validator,
- should we treat each beacon node URL
- as a separate beacon node?
- Because, for example, in produce block,
- you know, when we send produce block API calls to these,
- we should send in parallel,
- even though they are configured as fallback,
- because of any delay: if, for example,
- the zeroth URL doesn't respond,
- then we, as a fallback,
- send the request at a later point,
- which basically causes a delay.
- So should we send parallel requests
- or should we race them like how we race
- builder versus execution because it makes sense.
- For example, one beacon node might not be connected
- to a builder, while others could be connected to a builder.
- So one of the execution nodes could resolve
- way before that, one of the beacon nodes
- could resolve way before the other beacon nodes.
- and that basically will not give us the optimal block value
- that we might be trying to produce.
- So maybe we should consider all these beacon URLs
- as independent block producers and race
- and apply our cutoff and race module over there.
- So these are the questions that came to mind.
- So I basically put that on the back burner
- and trying to push V3 out.
- Also, I looked into the Whisk spec,
- dug through it, sort of got into it
- and understood it, and had some discussions with Lion about this.
- And yeah, that's mostly it.
- And I did a couple of reviews.
- - Thanks, Gajinder.
- I'm curious, did anybody give any thought
- to this sort of fallback URL race at all
- and how it should be structured?
- I'm interested to know like the sort of trade-offs
- in the way that we're designing this.
- - I mean, I think Vouch is the only one
- which races multiple beacon nodes.
- Most of the other validators are sort of hooked to one node;
- that's the basic assumption. I'm not really sure what the other clients' observations
- have been on the fallback node. Maybe I'll ask someone. But Vouch does race multiple beacon nodes
- to get the most optimal proposal. And they must be doing some sort of a wait for sure, because
- it is very possible they are hooked up onto the same system where there is the same builder
- attached to the beacon node. So it might be a totally homogeneous kind of setup for the
- beacon nodes; they are attached to the same kind of builders. So it could be that. But if,
- for example, you know, your registration to one
- particular beacon node failed, it will alter the timelines and the value of the block it is
- producing, which will anyway impact the final blocks that the validator should be weighing to see
- which one it should pick for block proposal. And obviously they will come at different times, so
- you can't really simply say that, okay, the first one to resolve will win. I think our strategy, where
- we say that there is a two-second cutoff, and past that,
- if not all resolved or a few errored, then we basically race to see whoever resolves first,
- so we wait till the cutoff and then pick the winner, is an optimal strategy
- to go for, and I'm not sure how other clients are implementing that.
- But even if we move our race to the beacon node,
- where again, we'll use the same race
- against builder and execution.
- But I think it would still make sense
- with regard to having fallback nodes,
- that we should race them
- and we should basically start proposal flow parallelly on,
- we should at least start the produce block flow parallel
- on all of them and basically again, cut off and race.
- And for publishing the blocks again,
- we should broadcast rather than saying that, okay,
- we'll send to one and then,
- if it fails, we'll fall back to the other nodes.
- I think these are the things that we can do
- and we'll need to sort of have a different kind of treatment
- for how fallback beacons are treated in validator.
- So I'm thinking maybe I'll introduce a method
- where basically the HTTP client will race all the URLs
- with some cutoff and timeout.
- So we can get this very generic interface
- that we can easily plug into any calls that we want to run through this mode.
- Okay, sounds good. Yeah, I was just curious because, with Vouch, I'm not sure
- how much the timing matters compared to the value of the blocks.
- There is one Lido node operator who is running Lodestar with Vouch, or at least I asked him
- to, but I don't have any data back yet to see how we perform against the other clients
- on Vouch yet.
- But that would be a very interesting metric point to see how often Lodestar is actually
- viable for them on Vouch.
- I mean, if he knows the specifics of how Vouch is running the proposal flow, I think
- it would be nice to discuss it with them.
- Otherwise, we can definitely dig into the code.
- But if somehow someone can give us the knowledge of how Vouch is doing it, why not?
- Okay.
- Okay, thanks, Gajinder.
- All right, let's move forward with NC. How's it going?
- Hey, guys.
- All right. So last week, on the ePBS side, I think we focused on the inclusion list design in terms of the discussion.
- And there are a lot of inclusion list designs out there.
- You have a forward inclusion list, you have same slot or same block,
- top of block, bottom of block design.
- But nonetheless, in terms of engine API,
- I think they pretty much share the same specs.
- So I wrote that up.
- Here, I'll send out the link.
- So for anyone who is interested,
- any feedback is welcome on this engine API part.
- Right, and then for the rest,
- regarding the validator spec, builder spec,
- I think, you know, the Prysm folks are still,
- you know, picking their brains
- and there's just a lot of like, you know,
- small details that need to be considered.
- That's all on the ePBS.
- And then for 6110, I did an analysis on the pubkey cache
- in the current code base of Lodestar.
- Right.
- Personally, I think
- there is really no need for the unfinalized
- index-to-pubkey cache.
- Not only because the beacon API
- doesn't really use index to pubkey,
- like the beacon API only uses the other way around,
- pubkey to index,
- but also, for any single use case out there
- using index to pubkey,
- it always requires an active validator.
- So there's no need for the unfinalized cache.
- I saw like comments from Gajinder.
- There's one point saying like non-finality.
- This is something that still need to think about.
- But nonetheless, like I still, you know,
- wanna get some more input from Lion
- and hopefully we can settle down with the refactoring
- with the pubkey cache design,
- and I can go ahead and start the coding.
- Right, that's pretty much for me.
- - Thanks, NC.
- Awesome.
- We'll take a look at that in a bit.
- All right, let's move forward with Cayman.
- Hey, y'all, so last week I got libp2p updated
- to the latest version.
- Fairly straightforward update this time.
- Basically, there was a lot of package renaming.
- And now js-libp2p is a monorepo.
- So if you decide to contribute, just keep that in mind.
- It's all in the libp2p/js-libp2p repo.
- Except for the ChainSafe maintained packages,
- gossipsub, noise and yamux.
- So now that that's in, that unblocks Tuyen
- and some of the things he's doing with gossipsub
- and yamux and doing basically any other upgrades.
- We could not upgrade any of those dependencies
- without upgrading libp2p first because of all the breaking
- changes.
- And the other thing I worked on was adding a boot node CLI
- command.
- It's functional, but I just wanted
- to see if I could do a little more cleanup on it
- and see if we can reuse more code between the beacon node
- CLI initialization process and the boot node initialization
- process.
- So hoping to get that polished this week.
- And the other thing I was working on
- was digging into Yamux.
- So currently Yamux is still slightly slower than Mplex.
- And I'm really puzzled why.
- But I'm basically trying to add all the little tweaks that
- are added to Mplex, little performance tweaks,
- to see if that brings things back to a comparable place.
- But so far, I have not gotten it,
- at least in these very naive comparison tests,
- to be one to one.
- It's still a little slower.
- Like, I don't know, like 5% slower, 10% slower,
- something like that.
- And I think I was able to see a memory leak just
- in the comparison tests.
- So sending a bunch of messages, I
- was able to see memory rise somewhat
- and took a heap snapshot.
- But I'm having problems viewing the snapshot.
- So as far as I know, there's only a single tool
- that can visualize a heap snapshot,
- and that's the Chrome DevTools.
- Anyone who knows any other tooling that
- can help show a heap snapshot, please tell me.
- But my Chrome DevTools is basically just spinning.
- It's basically waiting.
- It says building the dominator tree.
- >>Can you try the Brave DevTool?
- In the past, I used to have the same issue.
- It could not be viewed in the Chrome DevTools,
- but the Brave DevTools somehow worked.
- >>Brave DevTool.
- OK.
- Thanks.
- Excellent.
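As an aside on the heap snapshot workflow being discussed: snapshots can also be captured programmatically with Node's built-in `v8.writeHeapSnapshot` (available since Node 11.13), producing the same `.heapsnapshot` files that Chromium-based DevTools (Chrome, Brave, Edge) can load:

```typescript
import {writeHeapSnapshot} from "node:v8";

// writeHeapSnapshot blocks while it serializes the heap and returns the
// filename it wrote; with no argument, Node picks a timestamped name in cwd.
function captureHeapSnapshot(path?: string): string {
  return writeHeapSnapshot(path);
}
```

This is handy for grabbing a snapshot from a long-running node at a suspicious moment, without attaching an inspector.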
- So yeah, this week, I'm going to finish polishing up
- the boot nodes PR and keep digging into the Yamux,
- see if I can squash this memory leak,
- and hopefully add some little performance tweaks
- to try to get it up to the level of Mplex.
- And that's it for me.
- Thanks, Cayman.
- All right, let's go with Nazar.
- Thank you.
- The last-- can you guys hear me?
- So last week, I was working on making
- the Prover work with Web3.js version 4.
- There were some issues, so I opened one PR to fix those.
- Now it is working fine.
- And while I was doing that, I found out
- that we have a bug in the browser logger.
- So it was not logging anything in the browser.
- I fixed that, and along the way,
- I saw that there are no unit tests
- for the environment logger and browser logger
- and some of the other logging stuff.
- Because of that, we were unable to identify it earlier.
- So I wrote the unit test for the logger package.
- And yes, currently I finished up the Lightclient demo PR.
- It's ready for review.
- If someone is available, can you review it?
- So that is like using the Prover now
- instead of the Light client directly.
- And yeah, currently I'm working on a PR
- to decouple the beacon and validator in the simulation tests,
- so that we can mix them up the way we can right now mix
- the execution and the consensus,
- so we can mix up execution, beacon,
- and validator in the simulation tests.
- So this PR will be done today.
- It's almost finished, and
- during this week,
- I have planned to start working on
- integration with MetaMask.
- I started that draft earlier.
- So I already have something to work on.
- I will complete that draft and make it
- like a first presentable demo
- of our Prover working inside MetaMask.
- Yep, that's all from me.
- - Thanks, Nazar.
- And we have Matt.
- All right, so a couple things that I worked on this week.
- I got the blst PR done and integrated into Lodestar.
- Actually, all of the rest
- of the remaining PRs that were up in the blst repo
- got approved.
- So just CI and a couple other things there.
- Looking at the metrics in Lodestar.
- So it was running slower with four libuv threads,
- because we had 10 worker threads
- or eight worker threads originally, depending.
- So once I boosted the worker pool
- up to match the number of threads that we had,
- it definitely seems like it runs a lot smoother.
- A lot of the metrics are within five, 10%
- of what they were originally by that.
- Most of them are a little bit better.
- Nothing is really knocking your socks off.
- Like there's a couple of things that look good,
- like the number of aggregated keys,
- number of signature sets,
- and a few other of the blst metrics look much better.
- Things like head drift
- or the block epoch transition times
- look pretty good as well.
- Block production times are doing okay.
- The long live peers are doing better.
- There are a few things that look like they're doing a little worse.
- So I don't know if that's just how the implementation happened,
- or if it's the library itself.
- But overall, it seems like it's either net neutral or a little bit
- better by a fraction.
- But the big thing is the CPU time has dropped by 30%.
- So whereas if we have 64% CPU on the unstable large,
- it's running 48% on feature two.
- So there's a lot of opportunity in there
- just by opening up resources.
- It is a little bit heavier on memory,
- but there's opportunities there as well, I think.
- And so it could use some more tuning,
- but it's getting a lot closer
- and it's starting to feel like it's working okay.
- But obviously I leave that to everybody else
- to just take a look and make sure I'm not missing something
- because a lot of these metrics are still relatively new.
- And then I also have been playing
- with the semi-space and new space.
- I deployed several different versions over the week.
- And it definitely, the more new space we have,
- the better it performs.
- We're garbage collecting roughly 200 gigs on average
- on mainnet and spending 10 to 12%
- of time doing network processing garbage collection.
- By increasing the new space,
- the more we increase it, up to that threshold
- of what our average garbage collection currently is
- on the scavenge side,
- it basically just continues to reduce
- the garbage collection time, the pause time.
- I'm gonna deploy one more where I'm gonna go way over
- to 512, which should be double of what we're actually doing
- for garbage collection to see how that affects it.
- I've been paying attention, though,
- mostly to the smaller nodes,
- because they're actually only doing 30 to 50 megabytes
- of garbage collection on the scavenge side.
- And even though we're setting it to 64, 128, or 256,
- it's really not affecting the performance.
- So setting it much higher
- than the scavenge quantity
- doesn't seem to affect it on the small ones.
- So I'm expecting that same behavior
- on the large instances
- when setting it higher than needed,
- but I just wanna prove that theory out
- by deploying it and making sure
- the assumption is correct.
- And I'll do another write up.
- I put up some information in there just for posterity.
- And then this information is all gonna directly tie over
- to the worker thread when we deploy that.
- So it's applicable whether the network is on the main thread
- or on the worker thread.
- And the rates are gonna be the same,
- because I was actually playing with this
- a couple of weeks ago as well, with the new space,
- and I was seeing the same thing with the same numbers.
- That's why I chose those numbers this week
- without the worker thread:
- to make sure that the space sizing
- is commensurate on the main thread
- with what it is on a worker.
- So it'll apply to both situations.
- And then I spent a little bit more time trying to research.
- Basically what I did was go through GitHub
- and just look for the error zero.
- And I found it in a couple of interesting places
- that I think are really driving our issue.
- And then I was gonna try to spend a little bit of time
- this week looking at that, to see if I can actually home in on it.
- But I think it's coming from the serialization
- and deserialization used for caching,
- and whether we build or don't build with Nx in the monorepo.
- Because it's basically part of the serializer: during startup it's deserializing and serializing
- the cached state to see if we've changed anything, is what I think is happening.
- I might be totally off base, but that's the thread I found when I was looking through a bunch of the different errors.
- I posted a few of them into our Discord channel, just for posterity there as well.
- And then hopefully, pulling on that thread, we'll find something.
- And then I also spent a little bit of time, not a ton, on the blinded blocks.
- Blind had put together, not a POC, but a couple of ideas on a branch
- for how to update the blinded blocks, and it's a little bit different
- from the implementation I was using,
- and he's got some really good ideas in there. So I studied what he did, brought that into
- my own branch, and I'm going to keep chunking off bits and
- pieces of what he had done, bringing them in, merging my code with his, and seeing
- if I can get that to run. And then I'm going to start with unit tests on just the transition
- functions first, instead of trying to build the whole thing, and see if I can piecemeal it
- and narrow down some of the complexity, because it's a fairly difficult PR,
- as I realized last time. So that's probably going to be the big heart of it, plus anything
- necessary to get the little bits and pieces in for BLST. But again, Cayman, I would love it
- if you want to just tear into the metrics on feature2 and be brutally honest:
- "this looks good, this looks like shit," let me know what it is, so that I can
- borrow your eyes, see what you see, and train myself
- to make sure that I'm looking at the right things as well. That'd be super helpful.
- And anything the rest of you see would be amazing too, because I want to make sure
- that I'm reading it correctly. And then I'll focus on the blinded blocks,
- because I know that's going to go into v1.10 or v1.11.
- All right, thanks, Matt.
- Makes me wonder if you guys have an opinion on whether or not three feature groups is enough
- for you to do all your testing on. That's cool, but if you do feel like we are
- consistently running out of server groups to test things on, let me know; I can inquire
- with infrastructure to set up a feature four or something. When we're not testing things,
- you can also use beta as well, just as another option.
- I have been. Okay, perfect. Cool.
- Because, as a matter of fact, I was testing 64, 128, and 256, and then I put BLST
- up also, and I took one down, so I was using three this week. I'm going to be taking some of
- those down, so three will free up. But yeah, I definitely felt like I was monopolizing,
- so I apologize. Now that I've got a couple of days of data on them,
- I can basically pull them down.