Just some recommendations from me on running a node and aiming to be as highly reputable as possible with little downtime:

- For everyone intending to run a node on main-net, please don't use light Ethereum clients. This includes both Geth and Parity. When you run a light node, you're solely dependent on the quality of your peers, which could leave you open to people singling you out and exploiting you for collateral.
- Based on that, I'd suggest getting your full node synced as soon as you can. You can run Geth in fast mode, and that will take up around 100-110GB currently. My own suggestion is using a t2.medium for this and setting your disk size to leave enough room for future blockchain growth. That's around 10GB a month currently from what Vitalik has mentioned before, although Geth 1.8 enables automatic data pruning, so this should stay roughly the same. If you hit your disk space limit, your Geth node will get into a reboot loop and you'd have to template it, create an AMI and then spin up a new box with a bigger disk (there's a sketch of this after the list). With around 100GB of data, templating takes around 20-30 minutes.
- If you wait until main-net release to create your full main-net node, it could delay your node operation by around a week, maybe less, maybe more. Getting your Geth client synced early and ready for release will be a good opportunity to build your reputation very early on.
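
To give an idea of the templating step, here's a rough Python (boto3) sketch of creating an AMI from a synced Geth box and then launching a replacement with a bigger disk. The instance ID, names and volume size below are just placeholders, not anything official:

import boto3

# Assumes the synced Geth box is an EC2 instance in us-east-1; IDs/names are placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")

# Create an AMI (template) from the synced Geth instance. NoReboot avoids stopping
# the box, at the cost of a slightly less consistent snapshot.
image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",
    Name="geth-mainnet-synced",
    Description="Synced Geth full node template",
    NoReboot=True,
)
print("Creating AMI:", image["ImageId"])

# Later, launch a replacement box from that AMI with a larger root volume.
ec2.run_instances(
    ImageId=image["ImageId"],
    InstanceType="t2.medium",
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sda1",
        "Ebs": {"VolumeSize": 250, "VolumeType": "gp2"},  # extra headroom for chain growth
    }],
)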

IMO, a large part of being a good quality operator is making sure you keep your node/Geth/adapters up to date on the latest versions. This can cause some issues in production if you've not tested upgrading your nodes. Some of my own suggestions to make this easier:

- Front your Geth instances with a load-balancer. This will be a huge gain for you: when you upgrade any instance on AWS, the way to eliminate downtime is to boot up the new version of Geth alongside your old one and flip the switch once the new version has synced. Without a load-balancer, you'd have to stop your node, change the config to point to the new Geth instance and then bring it back up again. With a load-balancer, you add the new version once it's synced while turning off the old one, and there's zero downtime because the load-balancer simply forwards requests to the new Geth instance (first sketch after this list).
- To aid with spinning up new Geth instances, I'd recommend some form of backup that exports and stores the Geth main-net chain. With that, you'll be able to create new Geth instances from a template and drastically reduce the time it takes to sync a new one, since you only need to sync, say, a few weeks' worth of blocks rather than starting afresh. You could use something like S3 or EFS to store the Geth backup, then have your new instances copy from those locations and import it into Geth on boot (second sketch after this list).
- On the flip side, if you dockerise the Geth instances, you could quickly stop and boot up a new version of Geth. This means you wouldn't have to template your instance, but you would see some downtime. I don't think this would be much of an issue early on, but once features like off-chain computation are a thing, you could have a higher chance of incurring penalties and losing reputation from upgrades. The penalties may not be that large, but I feel the impact on your reputation would affect you more than any slight penalty hits. One thing to note is that I have encountered corrupted chains when upgrading Geth versions on odd occasions, so I would recommend using new instances from templates rather than taking down production boxes and rebooting.
- Those considerations for Geth apply equally to your node. For example, if you're upgrading to a new version, you'd need to boot up a new instance with the same db and wallet. You could do this manually by copying the data over and then starting the new version of the node, since the data isn't that large. When you copy over your data and wallet, your new node will be a carbon copy of the old one, keeping the same address etc.
- I'd also highly recommend backing up your node instances. AWS EBS volumes can and do corrupt on rare occasions; if that happens, you'd lose everything if you've not backed it up properly. You could again use S3/EFS for this, with a daily/weekly cronjob syncing your node's data to one of those locations.
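
To make the load-balancer point concrete, here's a rough Python (boto3) sketch of swapping a new Geth instance in behind a Classic Load Balancer with zero downtime. The load-balancer name and instance IDs are placeholders:

import time
import boto3

# Assumes a Classic Load Balancer already fronts your Geth boxes
# (see the AWS tutorial linked at the bottom); names/IDs are placeholders.
elb = boto3.client("elb", region_name="us-east-1")
LB_NAME = "geth-mainnet-lb"
OLD_INSTANCE = "i-0aaaaaaaaaaaaaaaa"
NEW_INSTANCE = "i-0bbbbbbbbbbbbbbbb"

# 1. Register the freshly synced Geth instance behind the load balancer.
elb.register_instances_with_load_balancer(
    LoadBalancerName=LB_NAME,
    Instances=[{"InstanceId": NEW_INSTANCE}],
)

# 2. Wait until the ELB health check reports it InService.
while True:
    health = elb.describe_instance_health(
        LoadBalancerName=LB_NAME,
        Instances=[{"InstanceId": NEW_INSTANCE}],
    )
    if health["InstanceStates"][0]["State"] == "InService":
        break
    time.sleep(10)

# 3. Only then take the old Geth instance out of rotation -- zero downtime.
elb.deregister_instances_from_load_balancer(
    LoadBalancerName=LB_NAME,
    Instances=[{"InstanceId": OLD_INSTANCE}],
)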
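
And for the backup point, a minimal sketch of exporting the chain and pushing it to S3, then pulling and importing it on a new box. The bucket name, datadir and file paths are placeholders, and Geth should be stopped while exporting so the export command can lock the datadir:

import subprocess
import boto3

BACKUP_FILE = "/tmp/geth-mainnet-backup.rlp"
BUCKET = "my-geth-backups"  # placeholder bucket

# Export the chain from the datadir using geth's built-in export command
# (run while geth is stopped), then push it to S3 for new instances to pull.
subprocess.run(
    ["geth", "--datadir", "/home/ubuntu/.ethereum", "export", BACKUP_FILE],
    check=True,
)
s3 = boto3.client("s3")
s3.upload_file(BACKUP_FILE, BUCKET, "geth-mainnet-backup.rlp")

# On a new instance, do the reverse before starting geth:
# s3.download_file(BUCKET, "geth-mainnet-backup.rlp", BACKUP_FILE)
# subprocess.run(
#     ["geth", "--datadir", "/home/ubuntu/.ethereum", "import", BACKUP_FILE],
#     check=True,
# )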

This next point adds quite a lot of complexity to your setup, but it's something that could have a huge impact on the overall health of the ChainLink network on go-live. This is regarding regions and availability zones in AWS:

- Most people will probably spin up a node on AWS and manually create a box in the default region and availability zone, i.e. us-east-1, availability zone A. AWS availability zones have seen downtime in the past; it's very, very rare, but it can happen. If for any reason us-east-1 A went down, you'd suddenly see an enormous chunk of the ChainLink node capacity taken down with it, essentially exposing a degree of centralisation within the network. To alleviate this, you can create an Auto Scaling group in AWS for Geth and your node(s). By doing this, if an AZ is hit with downtime, your scaling group will notice and boot up a new instance in a working availability zone (sketched below).
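
Here's a rough boto3 sketch of that, assuming you've already baked an AMI of your box and created a launch configuration from it; the names below are placeholders:

import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="chainlink-node-asg",    # placeholder names
    LaunchConfigurationName="chainlink-node-lc",
    MinSize=1,
    MaxSize=1,
    DesiredCapacity=1,
    # Spread across several availability zones so a single-AZ outage just means
    # a replacement instance boots elsewhere.
    AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
    HealthCheckType="EC2",
    HealthCheckGracePeriod=300,
)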

Just to expand on that though, no-one is forced to use AWS; the more variety the better. This could be any other similar cloud provider, with Google Cloud, Azure or Dreamhost as examples.

Another important aspect is logging and alerting. Ideally, if you're using AWS, it'd be good to stream your logs into CloudWatch so you can keep tabs on your node. It'd also be good to set up alerting for high CPU/RAM usage, any downtime and any errors which show up in the logs, so you can react promptly.
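
As one example, here's a rough boto3 sketch of a high-CPU alarm wired to an SNS topic; the instance ID and topic ARN are placeholders, and you'd set up similar alarms for the other conditions (memory needs the CloudWatch agent, and log errors can be caught with metric filters on the streamed logs):

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="geth-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    Statistic="Average",
    Period=300,                # 5-minute windows
    EvaluationPeriods=2,       # sustained for 10 minutes
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:node-alerts"],  # placeholder SNS topic
)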

The most important point: test, test, test, test! Do dry runs of upgrades, do some disaster testing and try to break things. The more testing you do, the more familiar you become with your setup, the better you'll understand anything that might change in the future, and the quicker you can recover your instances if problems occur.

As always, I'm more than happy to help people with any of this and to expand on or clarify anything I've said. There are some AWS articles with nice step-by-step tutorials to help with some of this:

- Tutorial: Create a Classic Load Balancer: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-getting-started.html
- Getting Started with Amazon Elastic File System: https://docs.aws.amazon.com/efs/latest/ug/getting-started.html
- Tutorial: Set Up a Scaled and Load-Balanced Application: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-register-lbs-with-asg.html