First, let's show the watchdog priority of all the machines. (Sorry there are multiple entries per host; I've been troubleshooting some strange looping logic in how the priority gets assigned per machine, but third time's the charm :P)

ubuntu@ip-172-31-2-26:~$ sudo salt '*' cmd.run "cat /etc/pgpool2/3.5.3/pgpool.conf | grep wd_priority"
ip-172-31-2-7.us-west-1.compute.internal:
    wd_priority = 1
    wd_priority = 5
    wd_priority = 3
ip-172-31-2-6.us-west-1.compute.internal:
    wd_priority = 1
    wd_priority = 5
    wd_priority = 2
ip-172-31-2-8.us-west-1.compute.internal:
    wd_priority = 1
    wd_priority = 1
    wd_priority = 4
ip-172-31-2-9.us-west-1.compute.internal:
    wd_priority = 1
    wd_priority = 3
    wd_priority = 5
ip-172-31-2-4.us-west-1.compute.internal:
    wd_priority = 1
    wd_priority = 6
    wd_priority = 6
ip-172-31-2-5.us-west-1.compute.internal:
    wd_priority = 1
    wd_priority = 4
    wd_priority = 1

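If you only care about the effective value per host (assuming pgpool, like postgresql.conf, takes the last occurrence of a duplicated parameter, which is what my priorities rely on), a quick way to filter the noise is to print just the last match:

ubuntu@ip-172-31-2-26:~$ sudo salt '*' cmd.run "grep wd_priority /etc/pgpool2/3.5.3/pgpool.conf | tail -n 1"

That should show exactly one wd_priority per host, which works out to 1 through 6 across the six machines above.
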
Now let's make sure that 172.31.2.4 is in fact the master.
ubuntu@ip-172-31-2-26:~$ sudo salt -G 'roles:postgres-master' test.ping
ip-172-31-2-4.us-west-1.compute.internal:
    True

Salt seems to think so; now let's make sure replication is alive and well on it.
ubuntu@ip-172-31-2-4:~$ sudo -u postgres psql
psql (9.1.22)
Type "help" for help.

postgres=# select * from pg_stat_replication;
 procpid | usesysid |   usename   | application_name | client_addr | client_hostname | client_port |         backend_start         |   state   | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
---------+----------+-------------+------------------+-------------+-----------------+-------------+-------------------------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
   28655 |    16384 | replication | walreceiver      | 172.31.2.5  |                 |       44840 | 2016-08-02 22:21:34.352537+00 | streaming | 0/6000000     | 0/6000000      | 0/6000000      | 0/6000000       |             0 | async
   28901 |    16384 | replication | walreceiver      | 172.31.2.6  |                 |       56999 | 2016-08-02 22:26:35.648484+00 | streaming | 0/6000000     | 0/6000000      | 0/6000000      | 0/6000000       |             0 | async
   28902 |    16384 | replication | walreceiver      | 172.31.2.8  |                 |       55492 | 2016-08-02 22:26:35.751934+00 | streaming | 0/6000000     | 0/6000000      | 0/6000000      | 0/6000000       |             0 | async
   28903 |    16384 | replication | walreceiver      | 172.31.2.7  |                 |       48535 | 2016-08-02 22:26:35.786888+00 | streaming | 0/6000000     | 0/6000000      | 0/6000000      | 0/6000000       |             0 | async
   28904 |    16384 | replication | walreceiver      | 172.31.2.9  |                 |       51240 | 2016-08-02 22:26:35.818891+00 | streaming | 0/6000000     | 0/6000000      | 0/6000000      | 0/6000000       |             0 | async
(5 rows)

Looks like replication is set up correctly with the 5 other slaves.

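To sanity-check lag as well (this query isn't from the original run, it's just the one I'd use), you can compare the master's current WAL location against what each standby has replayed; on an idle cluster these should all line up:

postgres=# SELECT client_addr, state, pg_current_xlog_location() AS master_location, sent_location, replay_location FROM pg_stat_replication;

(Those are the 9.1-era function and column names, matching the output above.)
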
Okay, now let's fire up pgpool across the nodes.

ubuntu@ip-172-31-2-26:~$ sudo salt '*' cmd.run "service pgpool2 start"
ip-172-31-2-7.us-west-1.compute.internal:
     * Starting pgpool-II pgpool
    FD - Starting pgpool-II by executing:
    /usr/sbin/pgpool -n -f /etc/pgpool2/3.5.3/pgpool.conf -F /etc/pgpool2/3.5.3/pcp.conf -a /etc/pgpool2/3.5.3/pool_hba.conf >> /var/log/pgpool/pgpool.log 2>&1 &
       ...done.
ip-172-31-2-8.us-west-1.compute.internal:
     * Starting pgpool-II pgpool
    FD - Starting pgpool-II by executing:
    /usr/sbin/pgpool -n -f /etc/pgpool2/3.5.3/pgpool.conf -F /etc/pgpool2/3.5.3/pcp.conf -a /etc/pgpool2/3.5.3/pool_hba.conf >> /var/log/pgpool/pgpool.log 2>&1 &
       ...done.
ip-172-31-2-9.us-west-1.compute.internal:
     * Starting pgpool-II pgpool
    FD - Starting pgpool-II by executing:
    /usr/sbin/pgpool -n -f /etc/pgpool2/3.5.3/pgpool.conf -F /etc/pgpool2/3.5.3/pcp.conf -a /etc/pgpool2/3.5.3/pool_hba.conf >> /var/log/pgpool/pgpool.log 2>&1 &
       ...done.
ip-172-31-2-5.us-west-1.compute.internal:
     * Starting pgpool-II pgpool
    FD - Starting pgpool-II by executing:
    /usr/sbin/pgpool -n -f /etc/pgpool2/3.5.3/pgpool.conf -F /etc/pgpool2/3.5.3/pcp.conf -a /etc/pgpool2/3.5.3/pool_hba.conf >> /var/log/pgpool/pgpool.log 2>&1 &
       ...done.
ip-172-31-2-6.us-west-1.compute.internal:
     * Starting pgpool-II pgpool
    FD - Starting pgpool-II by executing:
    /usr/sbin/pgpool -n -f /etc/pgpool2/3.5.3/pgpool.conf -F /etc/pgpool2/3.5.3/pcp.conf -a /etc/pgpool2/3.5.3/pool_hba.conf >> /var/log/pgpool/pgpool.log 2>&1 &
       ...done.
ip-172-31-2-4.us-west-1.compute.internal:
     * Starting pgpool-II pgpool
    FD - Starting pgpool-II by executing:
    /usr/sbin/pgpool -n -f /etc/pgpool2/3.5.3/pgpool.conf -F /etc/pgpool2/3.5.3/pcp.conf -a /etc/pgpool2/3.5.3/pool_hba.conf >> /var/log/pgpool/pgpool.log 2>&1 &
       ...done.

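Before connecting through the VIP, it doesn't hurt to confirm the watchdog actually brought 172.31.2.30 up somewhere, e.g. with a quick grep of the interfaces (the || true just keeps salt from flagging the hosts that don't hold the VIP):

ubuntu@ip-172-31-2-26:~$ sudo salt '*' cmd.run "ip addr show | grep 172.31.2.30 || true"

Only the current watchdog master should report the address.
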
Now let's connect to the VIP and run show pool_nodes:
ubuntu@ip-172-31-2-26:~$ psql -U postgres -h 172.31.2.30
psql (9.1.21, server 9.1.22)
Type "help" for help.

postgres=# show pool_nodes;
 node_id |                 hostname                 | port | status | lb_weight |  role   | select_cnt
---------+------------------------------------------+------+--------+-----------+---------+------------
 0       | ip-172-31-2-6.us-west-1.compute.internal | 5433 | 2      | 0.166667  | standby | 0
 1       | ip-172-31-2-5.us-west-1.compute.internal | 5433 | 2      | 0.166667  | standby | 0
 2       | ip-172-31-2-9.us-west-1.compute.internal | 5433 | 2      | 0.166667  | standby | 0
 3       | ip-172-31-2-4.us-west-1.compute.internal | 5433 | 2      | 0.166667  | primary | 0
 4       | ip-172-31-2-8.us-west-1.compute.internal | 5433 | 2      | 0.166667  | standby | 0
 5       | ip-172-31-2-7.us-west-1.compute.internal | 5433 | 2      | 0.166667  | standby | 0
(6 rows)

Here we can see our master designated as the primary server, which is expected behavior (status 2 means the backend is up). So we've confirmed that replication is running and that each node has a unique watchdog priority, with the master having the highest priority.

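As one last sanity check before pulling the plug, the watchdog's own view can be queried with pcp_watchdog_info. I'm showing the pgpool-II 3.5 option style here with the default pcp port 9898 and a user from pcp.conf, so adjust both to whatever your setup uses:

ubuntu@ip-172-31-2-4:~$ pcp_watchdog_info -h localhost -p 9898 -U postgres

This should report the watchdog cluster state and which node is currently the watchdog master holding the virtual IP.
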
Now I will go into the EC2 control panel and shut down the primary node, causing a failover scenario. Then we run show pool_nodes; again:

postgres=# show pool_nodes;
 node_id |                 hostname                 | port | status | lb_weight |  role   | select_cnt
---------+------------------------------------------+------+--------+-----------+---------+------------
 0       | ip-172-31-2-6.us-west-1.compute.internal | 5433 | 2      | 0.166667  | primary | 0
 1       | ip-172-31-2-5.us-west-1.compute.internal | 5433 | 2      | 0.166667  | standby | 0
 2       | ip-172-31-2-9.us-west-1.compute.internal | 5433 | 2      | 0.166667  | standby | 0
 3       | ip-172-31-2-4.us-west-1.compute.internal | 5433 | 3      | 0.166667  | standby | 0
 4       | ip-172-31-2-8.us-west-1.compute.internal | 5433 | 2      | 0.166667  | standby | 0
 5       | ip-172-31-2-7.us-west-1.compute.internal | 5433 | 2      | 0.166667  | standby | 0
(6 rows)

Now we see that 172.31.2.6 is the new primary and that the old master, 172.31.2.4, has a status of 3 (unavailable). It shows the other standby servers as available, but is that truly the case? Let's check the status of Postgres on the still-running servers.

ubuntu@ip-172-31-2-26:~$ sudo salt '*' cmd.run "service postgresql status"
ip-172-31-2-7.us-west-1.compute.internal:
    9.1/main (port 5433): online,recovery
ip-172-31-2-5.us-west-1.compute.internal:
    9.1/main (port 5433): online,recovery
ip-172-31-2-9.us-west-1.compute.internal:
    9.1/main (port 5433): online,recovery
ip-172-31-2-8.us-west-1.compute.internal:
    9.1/main (port 5433): online,recovery
ip-172-31-2-6.us-west-1.compute.internal:
    9.1/main (port 5433): online

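To be doubly sure which node is actually out of recovery, rather than trusting the init script's status line, you can ask Postgres directly; something like:

ubuntu@ip-172-31-2-26:~$ sudo salt '*' cmd.run "sudo -u postgres psql -p 5433 -tAc 'SELECT pg_is_in_recovery();'"

Every standby should answer t, only the new primary (.6) should answer f, and the downed .4 simply won't respond.
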
Looks like 172.31.2.6 is our primary node; let's check replication on it:

ubuntu@ip-172-31-2-6:~$ sudo -u postgres psql
psql (9.1.22)
Type "help" for help.

postgres=# select * from pg_stat_replication;
 procpid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
---------+----------+---------+------------------+-------------+-----------------+-------------+---------------+-------+---------------+----------------+----------------+-----------------+---------------+------------
(0 rows)

So I think this is expected behavior, since we need to repair the slaves ourselves. My question is: why are the other nodes, which are no longer replicating, still listed as available in pgpool? And beyond that, when I re-initiate replication on a slave so that it points at the new master, that slave then shows up as unavailable in pgpool. For example, let's set replication back up on 172.31.2.5.

ubuntu@ip-172-31-2-5:~$ sudo -u postgres /etc/postgresql/9.1/main/replscripts/initiate_replication.sh -f -H ip-172-31-2-6.us-west-1.compute.internal -P 5433 -p replication
initiate_replication - Start
DEBUG: The script will be executed with the following arguments:
DEBUG: --trigger-file=/etc/postgresql/9.1/main/im_the_master
DEBUG: --standby_file=/etc/postgresql/9.1/main/im_slave
DEBUG: --primary-host=ip-172-31-2-6.us-west-1.compute.internal
DEBUG: --primary-port=5433
DEBUG: --slot-name=ip_172_31_2_5
DEBUG: --user=replication
DEBUG: --password=replication
DEBUG: --force
INFO: Checking if trigger file exists...
INFO: Checking if standby file exists...
INFO: Standby file not found. Creating new one...
INFO: Ensuring replication user and password in password file (.pgpass)...
 * Stopping PostgreSQL 9.1 database server    [ OK ]
INFO: Deleting old data...
INFO: Getting the initial backup...
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "backup-fetch".
        STRUCTURED: time=2016-08-02T23:00:15.870621-00 pid=4027
wal_e.worker.s3.s3_worker INFO     MSG: beginning partition download
        DETAIL: The partition being downloaded is part_00000000.tar.lzo.
        HINT: The absolute S3 key is wal-e-stage/basebackups_005/base_000000010000000000000002_00000032/tar_partitions/part_00000000.tar.lzo.
        STRUCTURED: time=2016-08-02T23:00:22.453123-00 pid=4027
INFO: Creating recovery.conf file...
INFO: Removing old postgresql.conf file...
INFO: Copying new postgresql.conf file...
INFO: Starting postgresql service...
 * Starting PostgreSQL 9.1 database server    [ OK ]
initiate_replication - Done!
ubuntu@ip-172-31-2-5:~$

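For context, the recovery.conf the script drops on the slave should end up looking roughly like the following. This is reconstructed from the DEBUG arguments above, so treat it as an assumption rather than the script's literal output (and since 9.1 has no replication slots, I assume the --slot-name is only used on the WAL-E side):

standby_mode = 'on'
primary_conninfo = 'host=ip-172-31-2-6.us-west-1.compute.internal port=5433 user=replication password=replication'
trigger_file = '/etc/postgresql/9.1/main/im_the_master'
# restore_command presumably points at wal-e wal-fetch, but I haven't verified what the script writes
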
Now let's check the master:
ubuntu@ip-172-31-2-6:~$ sudo -u postgres psql
psql (9.1.22)
Type "help" for help.

postgres=# select * from pg_stat_replication;
 procpid | usesysid |   usename   | application_name | client_addr | client_hostname | client_port |        backend_start         |   state   | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
---------+----------+-------------+------------------+-------------+-----------------+-------------+------------------------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
    3368 |    16384 | replication | walreceiver      | 172.31.2.5  |                 |       49348 | 2016-08-02 23:00:30.00704+00 | streaming | 0/C000000     | 0/C000000      | 0/C000000      | 0/C000000       |             0 | async
(1 row)

Looks like it's added the new slave. Now on to checking pgpool:

ubuntu@ip-172-31-2-26:~$ psql -U postgres -h 172.31.2.30
psql (9.1.21, server 9.1.22)
Type "help" for help.

postgres=# show pool_nodes;
 node_id |                 hostname                 | port | status | lb_weight |  role   | select_cnt
---------+------------------------------------------+------+--------+-----------+---------+------------
 0       | ip-172-31-2-6.us-west-1.compute.internal | 5433 | 2      | 0.166667  | primary | 0
 1       | ip-172-31-2-5.us-west-1.compute.internal | 5433 | 3      | 0.166667  | standby | 0
 2       | ip-172-31-2-9.us-west-1.compute.internal | 5433 | 2      | 0.166667  | standby | 0
 3       | ip-172-31-2-4.us-west-1.compute.internal | 5433 | 3      | 0.166667  | standby | 0
 4       | ip-172-31-2-8.us-west-1.compute.internal | 5433 | 2      | 0.166667  | standby | 0
 5       | ip-172-31-2-7.us-west-1.compute.internal | 5433 | 2      | 0.166667  | standby | 0
(6 rows)

It shows the replicating slave as unavailable; does this seem strange? Sorry if this is too much information or if I've confused you anywhere along the way. Let me know if I'm doing something fundamentally wrong or if any of this is unclear.
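One related question while I'm at it: once replication to .5 was re-established, is the expected workflow to re-attach the node to pgpool by hand with pcp_attach_node, something along the lines of the command below? (Node id 1 is taken from the pool_nodes output above; the pcp port and user are assumptions, and I'm not sure whether the 3.5 tools want the node id via -n or positionally.) And is the earlier "still available" behaviour simply because health_check only tests that pgpool can connect to the backend, not whether it is actually replicating?

ubuntu@ip-172-31-2-26:~$ pcp_attach_node -h 172.31.2.30 -p 9898 -U postgres -n 1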