
Could you please do a sysadmin test for us?

You are responsible for the initial deployment and later support of a web application in a cluster of approximately 100 servers (the exact number doesn't matter, but assume that there are many physical servers to take care of).  This web application is code written by a separate team of developers.  It is intended to run under Apache/Linux and use an SQL database (say, MySQL) as a backend.  The application is expected to run in 24x7 mode.

Please outline how you would deploy this web application and what process you'd set up to ensure that
a) it is available in 24x7 mode,
b) there is a seamless way to upgrade this web application when a new version is released (this happens weekly).

A three- to four-paragraph answer is enough.

Several software products for service and management automation can be used to solve such a task, and an in-house solution can also be considered for deployment. Either way, the solution should satisfy the following points.
1.	We assume that network redundancy and DC fault tolerance issues are already solved. Our solution requires that a distributed, redundant, fault-tolerant system with sufficiently fast network interconnects has been built.
2.	We assume that the servers belonging to our cluster are already assembled and set up; the OS and system software are properly installed and configured, as is SSH public key authentication for all the servers we need.
3.	All web servers sit behind load balancers based on nginx or similar software, which provide both redundancy and load balancing; redundancy of the load balancers themselves is provided by CARP. Alternatively, hardware load balancers such as Cisco SLB devices can be used, with VRRP set up for redundancy.
4.	The application update server should implement the following algorithm:
4.1.	Our script instructs the balancer(s) to exclude a web server node from the balancing scheme.
4.2.	We update the application from our SVN repository (or a similar version control system) by connecting to the node being updated and running svn up (or the equivalent) through the appropriate scripts; then we regenerate the node-specific configuration using a template engine.
Alternatively, we can update the application the same way but from files delivered through a standard package manager, so that the package’s pre/post-install scripts handle the transition of the local configuration to the new version. In this case, creating and maintaining the packages should be the developers’ task.
4.3.	The Apache service is restarted.
4.4.	Our script instructs the load balancers to include the web server node in the balancing scheme again.
5.	It is desirable that the servers are not all updated at the same time (a flash cut); the «one, some, many» strategy is recommended. When we install an update that is not fully backward compatible (for example, when significant changes to the database structure are necessary), we should temporarily split our system into two independent parts, update one part first, and then update and rejoin the second part under the system administrators’ control. Each part of the system should be capable of serving the entire workload on its own. Complex application updates should be developed so that the application code supports multiple versions of the database schema.
6.	To provide MySQL redundancy, master-slave replication can be used, with a virtual IP for the master and a switch to the slave in case of failure.
7.	We should deploy and configure a monitoring system, Zabbix or similar software, that tracks the availability and performance of every part of our system: hardware, OS, services, network channels, etc.
8.	We should develop and put into operation a complete set of procedures for incident escalation and engineer notification in case of any failure, as well as automated scripts and manual intervention sequences for all typical failures, so that the availability and redundancy of any part of the system can be recovered in a predictable way.
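
The update algorithm from point 4 and the «one, some, many» strategy from point 5 could be sketched roughly as follows. This is only an illustration under stated assumptions, not a finished tool: the Hooks class and all function names are hypothetical, each hook stands in for a real SSH command or balancer API call that is not shown, and the health check between the restart and the re-enable step is an added assumption that does not appear in the steps above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Hooks:
    """Hypothetical per-node actions; each would wrap a real SSH or balancer call."""
    drain: Callable[[str], None]    # step 4.1: exclude the node from balancing
    deploy: Callable[[str], None]   # step 4.2: svn up or package install, regenerate config
    restart: Callable[[str], None]  # step 4.3: restart Apache
    healthy: Callable[[str], bool]  # assumed extra check, not part of steps 4.1-4.4
    enable: Callable[[str], None]   # step 4.4: include the node in balancing again


def one_some_many(nodes: Sequence[str], some: int = 5) -> List[List[str]]:
    """Split nodes into up to three waves: one node, `some` nodes, then the rest."""
    items = list(nodes)
    waves = [items[:1], items[1:1 + some], items[1 + some:]]
    return [wave for wave in waves if wave]


def update_node(node: str, hooks: Hooks) -> bool:
    """Run steps 4.1-4.4 on one node; return False if it must stay out of rotation."""
    hooks.drain(node)
    hooks.deploy(node)
    hooks.restart(node)
    if not hooks.healthy(node):
        return False  # the node is left excluded from the balancing scheme
    hooks.enable(node)
    return True


def rolling_update(nodes: Sequence[str], hooks: Hooks, some: int = 5) -> List[str]:
    """Update nodes one at a time, wave by wave; stop at the first unhealthy node."""
    updated: List[str] = []
    for wave in one_some_many(nodes, some):
        for node in wave:
            if not update_node(node, hooks):
                raise RuntimeError(f"{node} failed its health check; rollout stopped")
            updated.append(node)
    return updated


if __name__ == "__main__":
    def step(name: str) -> Callable[[str], None]:
        # Demo hook that only prints what a real hook would do.
        return lambda node: print(f"{name}: {node}")

    demo = Hooks(step("drain"), step("deploy"), step("restart"),
                 lambda node: True, step("enable"))
    rolling_update([f"web{i:03d}" for i in range(1, 9)], demo, some=3)
```

Nodes are updated strictly one after another, so at most one web server is out of the balancing scheme at any moment; the waves only mark the points where an administrator would normally pause and check monitoring before continuing. Stopping at the first failed node keeps a bad release from spreading beyond the first wave.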