- First, let's show the watchdog priority of all the machines (sorry, there are multiple entries per host; I've been troubleshooting some strange looping logic in how the priority gets assigned per machine, but the third time's the charm :P)
- ubuntu@ip-172-31-2-26:~$ sudo salt '*' cmd.run "cat /etc/pgpool2/3.5.3/pgpool.conf| grep wd_priority"
- ip-172-31-2-7.us-west-1.compute.internal:
- wd_priority = 1
- wd_priority = 5
- wd_priority = 3
- ip-172-31-2-6.us-west-1.compute.internal:
- wd_priority = 1
- wd_priority = 5
- wd_priority = 2
- ip-172-31-2-8.us-west-1.compute.internal:
- wd_priority = 1
- wd_priority = 1
- wd_priority = 4
- ip-172-31-2-9.us-west-1.compute.internal:
- wd_priority = 1
- wd_priority = 3
- wd_priority = 5
- ip-172-31-2-4.us-west-1.compute.internal:
- wd_priority = 1
- wd_priority = 6
- wd_priority = 6
- ip-172-31-2-5.us-west-1.compute.internal:
- wd_priority = 1
- wd_priority = 4
- wd_priority = 1
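- As a side note, the duplicate wd_priority entries above could be cleaned up by keeping only the last one in each pgpool.conf. A minimal sketch (the demo operates on a scratch copy with hypothetical contents, since the real cause of the duplicates is the looping assignment logic mentioned above):

```shell
# Demo on a scratch file: keep only the LAST wd_priority line, drop earlier duplicates.
conf=$(mktemp)
printf 'wd_priority = 1\nwd_priority = 5\nwd_priority = 3\nlisten_addresses = *\n' > "$conf"
# find the line number of the last wd_priority entry
last=$(grep -n '^wd_priority' "$conf" | tail -n 1 | cut -d: -f1)
# keep that line plus every non-wd_priority line
awk -v keep="$last" 'NR == keep || $0 !~ /^wd_priority/' "$conf" > "$conf.dedup"
cat "$conf.dedup"
```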
- Now let's make sure that 172.31.2.4 is in fact the master
- ubuntu@ip-172-31-2-26:~$ sudo salt -G 'roles:postgres-master' test.ping
- ip-172-31-2-4.us-west-1.compute.internal:
- True
- Salt seems to think so; now let's make sure replication is alive and well on it
- ubuntu@ip-172-31-2-4:~$ sudo -u postgres psql
- psql (9.1.22)
- Type "help" for help.
- postgres=# select * from pg_stat_replication;
- procpid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
- ---------+----------+-------------+------------------+-------------+-----------------+-------------+-------------------------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
- 28655 | 16384 | replication | walreceiver | 172.31.2.5 | | 44840 | 2016-08-02 22:21:34.352537+00 | streaming | 0/6000000 | 0/6000000 | 0/6000000 | 0/6000000 | 0 | async
- 28901 | 16384 | replication | walreceiver | 172.31.2.6 | | 56999 | 2016-08-02 22:26:35.648484+00 | streaming | 0/6000000 | 0/6000000 | 0/6000000 | 0/6000000 | 0 | async
- 28902 | 16384 | replication | walreceiver | 172.31.2.8 | | 55492 | 2016-08-02 22:26:35.751934+00 | streaming | 0/6000000 | 0/6000000 | 0/6000000 | 0/6000000 | 0 | async
- 28903 | 16384 | replication | walreceiver | 172.31.2.7 | | 48535 | 2016-08-02 22:26:35.786888+00 | streaming | 0/6000000 | 0/6000000 | 0/6000000 | 0/6000000 | 0 | async
- 28904 | 16384 | replication | walreceiver | 172.31.2.9 | | 51240 | 2016-08-02 22:26:35.818891+00 | streaming | 0/6000000 | 0/6000000 | 0/6000000 | 0/6000000 | 0 | async
- (5 rows)
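- A quick way to sanity-check that every expected standby shows up in that output is to diff the client_addr column against the inventory. A sketch (replicas.txt here stands in for the output of something like `sudo -u postgres psql -Atc "select client_addr from pg_stat_replication"`; the file contents below just mirror the query result above):

```shell
# standbys we expect, per our inventory
expected="172.31.2.5 172.31.2.6 172.31.2.7 172.31.2.8 172.31.2.9"
replicas=$(mktemp)
# stand-in for the client_addr column of pg_stat_replication
printf '%s\n' 172.31.2.5 172.31.2.6 172.31.2.8 172.31.2.7 172.31.2.9 > "$replicas"
missing=""
for ip in $expected; do
  grep -qxF "$ip" "$replicas" || missing="$missing $ip"
done
echo "missing standbys:${missing:- none}"
```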
- Looks like replication is set up correctly with the five other slaves.
- Okay, now let's fire up pgpool across the nodes:
- ubuntu@ip-172-31-2-26:~$ sudo salt '*' cmd.run "service pgpool2 start"
- ip-172-31-2-7.us-west-1.compute.internal:
- * Starting pgpool-II pgpool
- FD - Starting pgpool-II by executing:
- /usr/sbin/pgpool -n -f /etc/pgpool2/3.5.3/pgpool.conf -F /etc/pgpool2/3.5.3/pcp.conf -a /etc/pgpool2/3.5.3/pool_hba.conf >> /var/log/pgpool/pgpool.log 2>&1 &
- ...done.
- ip-172-31-2-8.us-west-1.compute.internal:
- * Starting pgpool-II pgpool
- FD - Starting pgpool-II by executing:
- /usr/sbin/pgpool -n -f /etc/pgpool2/3.5.3/pgpool.conf -F /etc/pgpool2/3.5.3/pcp.conf -a /etc/pgpool2/3.5.3/pool_hba.conf >> /var/log/pgpool/pgpool.log 2>&1 &
- ...done.
- ip-172-31-2-9.us-west-1.compute.internal:
- * Starting pgpool-II pgpool
- FD - Starting pgpool-II by executing:
- /usr/sbin/pgpool -n -f /etc/pgpool2/3.5.3/pgpool.conf -F /etc/pgpool2/3.5.3/pcp.conf -a /etc/pgpool2/3.5.3/pool_hba.conf >> /var/log/pgpool/pgpool.log 2>&1 &
- ...done.
- ip-172-31-2-5.us-west-1.compute.internal:
- * Starting pgpool-II pgpool
- FD - Starting pgpool-II by executing:
- /usr/sbin/pgpool -n -f /etc/pgpool2/3.5.3/pgpool.conf -F /etc/pgpool2/3.5.3/pcp.conf -a /etc/pgpool2/3.5.3/pool_hba.conf >> /var/log/pgpool/pgpool.log 2>&1 &
- ...done.
- ip-172-31-2-6.us-west-1.compute.internal:
- * Starting pgpool-II pgpool
- FD - Starting pgpool-II by executing:
- /usr/sbin/pgpool -n -f /etc/pgpool2/3.5.3/pgpool.conf -F /etc/pgpool2/3.5.3/pcp.conf -a /etc/pgpool2/3.5.3/pool_hba.conf >> /var/log/pgpool/pgpool.log 2>&1 &
- ...done.
- ip-172-31-2-4.us-west-1.compute.internal:
- * Starting pgpool-II pgpool
- FD - Starting pgpool-II by executing:
- /usr/sbin/pgpool -n -f /etc/pgpool2/3.5.3/pgpool.conf -F /etc/pgpool2/3.5.3/pcp.conf -a /etc/pgpool2/3.5.3/pool_hba.conf >> /var/log/pgpool/pgpool.log 2>&1 &
- ...done.
- Now let's connect to the VIP, and show pool nodes:
- ubuntu@ip-172-31-2-26:~$ psql -U postgres -h 172.31.2.30
- psql (9.1.21, server 9.1.22)
- Type "help" for help.
- postgres=# show pool_nodes;
- node_id | hostname | port | status | lb_weight | role | select_cnt
- ---------+------------------------------------------+------+--------+-----------+---------+------------
- 0 | ip-172-31-2-6.us-west-1.compute.internal | 5433 | 2 | 0.166667 | standby | 0
- 1 | ip-172-31-2-5.us-west-1.compute.internal | 5433 | 2 | 0.166667 | standby | 0
- 2 | ip-172-31-2-9.us-west-1.compute.internal | 5433 | 2 | 0.166667 | standby | 0
- 3 | ip-172-31-2-4.us-west-1.compute.internal | 5433 | 2 | 0.166667 | primary | 0
- 4 | ip-172-31-2-8.us-west-1.compute.internal | 5433 | 2 | 0.166667 | standby | 0
- 5 | ip-172-31-2-7.us-west-1.compute.internal | 5433 | 2 | 0.166667 | standby | 0
- (6 rows)
- Here we can see our master designated as the primary server, which is the expected behavior. At this point we've confirmed that replication is running and that each node has a unique watchdog priority (with the master having the highest priority).
- Now I will go into the EC2 control panel and stop the primary node, causing a failover scenario. Then we run show pool_nodes; again:
- postgres=# show pool_nodes;
- node_id | hostname | port | status | lb_weight | role | select_cnt
- ---------+------------------------------------------+------+--------+-----------+---------+------------
- 0 | ip-172-31-2-6.us-west-1.compute.internal | 5433 | 2 | 0.166667 | primary | 0
- 1 | ip-172-31-2-5.us-west-1.compute.internal | 5433 | 2 | 0.166667 | standby | 0
- 2 | ip-172-31-2-9.us-west-1.compute.internal | 5433 | 2 | 0.166667 | standby | 0
- 3 | ip-172-31-2-4.us-west-1.compute.internal | 5433 | 3 | 0.166667 | standby | 0
- 4 | ip-172-31-2-8.us-west-1.compute.internal | 5433 | 2 | 0.166667 | standby | 0
- 5 | ip-172-31-2-7.us-west-1.compute.internal | 5433 | 2 | 0.166667 | standby | 0
- (6 rows)
- Now we see that 172.31.2.6 is the new primary and that the old master, 172.31.2.4, has a status of 3 (unavailable). It shows the other standby servers as available, but is that truly the case? Let's check the status of PostgreSQL on the still-running servers:
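- For reference, the status column in show pool_nodes is a numeric backend state code. A small sketch of the mapping (labels paraphrased from the pgpool-II documentation):

```shell
# Map pgpool-II backend status codes, as shown by "show pool_nodes", to labels.
status_label() {
  case "$1" in
    0) echo "initializing" ;;
    1) echo "up, no connections yet" ;;
    2) echo "up and connected" ;;
    3) echo "down" ;;
    *) echo "unknown" ;;
  esac
}
status_label 2   # the healthy backends above
status_label 3   # the stopped old master
```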
- ubuntu@ip-172-31-2-26:~$ sudo salt '*' cmd.run "service postgresql status"
- ip-172-31-2-7.us-west-1.compute.internal:
- 9.1/main (port 5433): online,recovery
- ip-172-31-2-5.us-west-1.compute.internal:
- 9.1/main (port 5433): online,recovery
- ip-172-31-2-9.us-west-1.compute.internal:
- 9.1/main (port 5433): online,recovery
- ip-172-31-2-8.us-west-1.compute.internal:
- 9.1/main (port 5433): online,recovery
- ip-172-31-2-6.us-west-1.compute.internal:
- 9.1/main (port 5433): online
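- The distinction in that Salt output is the ",recovery" suffix: a node reporting "online,recovery" is still running as a streaming standby, while plain "online" means it is no longer in recovery, i.e. the (new) primary. A sketch that classifies those status lines:

```shell
# Classify a pg_ctlcluster-style status line from the Salt output above.
classify() {
  case "$1" in
    *online,recovery*) echo "standby" ;;
    *online*)          echo "primary" ;;
    *)                 echo "down or unknown" ;;
  esac
}
classify "9.1/main (port 5433): online,recovery"
classify "9.1/main (port 5433): online"
```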
- Looks like .6 is our primary node; let's check replication on it:
- ubuntu@ip-172-31-2-6:~$ sudo -u postgres psql
- psql (9.1.22)
- Type "help" for help.
- postgres=# select * from pg_stat_replication;
- procpid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
- ---------+----------+---------+------------------+-------------+-----------------+-------------+---------------+-------+---------------+----------------+----------------+-----------------+---------------+------------
- (0 rows)
- So I think this is expected behavior, since we need to repair the slaves ourselves. My question is: why are the other nodes that are no longer replicating still listed as available in pgpool? Even more puzzling, when I reinitiate replication on a slave to point it at the new master, pgpool then marks that node as unavailable. For example, let's set replication back up on 172.31.2.5:
- ubuntu@ip-172-31-2-5:~$ sudo -u postgres /etc/postgresql/9.1/main/replscripts/initiate_replication.sh -f -H ip-172-31-2-6.us-west-1.compute.internal -P 5433 -p replication
- initiate_replication - Start
- DEBUG: The script will be executed with the following arguments:
- DEBUG: --trigger-file=/etc/postgresql/9.1/main/im_the_master
- DEBUG: --standby_file=/etc/postgresql/9.1/main/im_slave
- DEBUG: --primary-host=ip-172-31-2-6.us-west-1.compute.internal
- DEBUG: --primary-port=5433
- DEBUG: --slot-name=ip_172_31_2_5
- DEBUG: --user=replication
- DEBUG: --password=replication
- DEBUG: --force
- INFO: Checking if trigger file exists...
- INFO: Checking if standby file exists...
- INFO: Standby file not found. Creating new one...
- INFO: Ensuring replication user and password in password file (.pgpass)...
- * Stopping PostgreSQL 9.1 database server [ OK ]
- INFO: Deleting old data...
- INFO: Getting the initial backup...
- wal_e.main INFO MSG: starting WAL-E
- DETAIL: The subcommand is "backup-fetch".
- STRUCTURED: time=2016-08-02T23:00:15.870621-00 pid=4027
- wal_e.worker.s3.s3_worker INFO MSG: beginning partition download
- DETAIL: The partition being downloaded is part_00000000.tar.lzo.
- HINT: The absolute S3 key is wal-e-stage/basebackups_005/base_000000010000000000000002_00000032/tar_partitions/part_00000000.tar.lzo.
- STRUCTURED: time=2016-08-02T23:00:22.453123-00 pid=4027
- INFO: Creating recovery.conf file...
- INFO: Removing old postgresql.conf file...
- INFO: Copying new postgresql.conf file...
- INFO: Starting postgresql service...
- * Starting PostgreSQL 9.1 database server [ OK ]
- initiate_replication - Done!
- ubuntu@ip-172-31-2-5:~$
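- For reference, the recovery.conf that initiate_replication.sh generates presumably looks something like the following. This is a sketch reconstructed from the script's DEBUG output above; the restore_command (and the envdir path in it) is purely an assumption based on the WAL-E backup-fetch seen in the log:

```
# recovery.conf (sketch; exact contents depend on the script)
standby_mode = 'on'
primary_conninfo = 'host=ip-172-31-2-6.us-west-1.compute.internal port=5433 user=replication password=replication'
trigger_file = '/etc/postgresql/9.1/main/im_the_master'
# assumption: WAL segments restored via WAL-E, matching the backup-fetch above
restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'
```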
- Now let's check the master
- ubuntu@ip-172-31-2-6:~$ sudo -u postgres psql
- psql (9.1.22)
- Type "help" for help.
- postgres=# select * from pg_stat_replication;
- procpid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
- ---------+----------+-------------+------------------+-------------+-----------------+-------------+------------------------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
- 3368 | 16384 | replication | walreceiver | 172.31.2.5 | | 49348 | 2016-08-02 23:00:30.00704+00 | streaming | 0/C000000 | 0/C000000 | 0/C000000 | 0/C000000 | 0 | async
- (1 row)
- Looks like it's added the new slave; now on to checking pgpool:
- ubuntu@ip-172-31-2-26:~$ psql -U postgres -h 172.31.2.30
- psql (9.1.21, server 9.1.22)
- Type "help" for help.
- postgres=# show pool_nodes;
- node_id | hostname | port | status | lb_weight | role | select_cnt
- ---------+------------------------------------------+------+--------+-----------+---------+------------
- 0 | ip-172-31-2-6.us-west-1.compute.internal | 5433 | 2 | 0.166667 | primary | 0
- 1 | ip-172-31-2-5.us-west-1.compute.internal | 5433 | 3 | 0.166667 | standby | 0
- 2 | ip-172-31-2-9.us-west-1.compute.internal | 5433 | 2 | 0.166667 | standby | 0
- 3 | ip-172-31-2-4.us-west-1.compute.internal | 5433 | 3 | 0.166667 | standby | 0
- 4 | ip-172-31-2-8.us-west-1.compute.internal | 5433 | 2 | 0.166667 | standby | 0
- 5 | ip-172-31-2-7.us-west-1.compute.internal | 5433 | 2 | 0.166667 | standby | 0
- (6 rows)
- It shows the replicating slave as unavailable; doesn't that seem strange? Sorry if this is too much information or I've confused you anywhere along the way. Let me know if I'm doing something fundamentally wrong, or if any of this is unclear.
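- One thing worth trying for the reattachment issue: as far as I know, pgpool-II does not automatically mark a recovered backend as up again; it stays in status 3 until it is reattached through the PCP interface. A hedged sketch (the VIP address, PCP port 9898, and user are placeholders for whatever is in your pcp.conf):

```shell
# node_id 1 = ip-172-31-2-5 in the "show pool_nodes" output above
NODE_ID=1
if command -v pcp_attach_node >/dev/null 2>&1; then
  # pgpool-II 3.5 syntax: pcp_attach_node [options] node_id
  pcp_attach_node -h 172.31.2.30 -p 9898 -U postgres -w "$NODE_ID"
else
  echo "pcp_attach_node not found on this host"
fi
```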