# Support Documentation For Log Check Issue
The commands listed in this document use BOSH CLI v2+.

## Background
The current Crunchy PostgreSQL for PCF tile release includes high availability features: in the event of a problem on the Primary PostgreSQL server, our configuration automatically fences the Primary and promotes a replica to Primary status. Our recent `v04.090513.001` tile release added some additional statistics gathering as part of our health check process.

The high availability functions are based on the status of each server in Consul. On each of the PostgreSQL servers there is a script at `/var/vcap/store/service/healthcheck.sh`; every service (haproxy, pgbackrest, postgresql, etc.) has one in the same location. The Consul agent on each VM runs that script: if it exits with a status code of 0 the service is in the `passing` state, if it exits with 1 it is in the `warning` state, and anything 2 or above is marked `critical`. Our HA configuration is such that if a PostgreSQL server is marked `critical` in Consul, our Crunchy Cluster Manager (CCM) fences it, looks for the next server that is a replica, and promotes it to Primary. Our built-in health checks are designed so that only the Primary server can receive an exit code of 2; the others will receive an exit code of 0 or 1.
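To make the exit-code convention concrete, below is a minimal sketch of a script check in that style. It is illustrative only: the `psql` invocation, the replica test, and the `statistics_push` stub are assumptions, not the contents of the shipped `healthcheck.sh`.
```
#!/bin/bash
# Minimal sketch of a Consul script check following the convention above:
#   exit 0 = passing, exit 1 = warning, exit 2+ = critical.
# Hypothetical; not the shipped /var/vcap/store/service/healthcheck.sh.

statistics_push() {
  # Placeholder stub so this sketch runs standalone; per the documentation,
  # the real routine gathers memory/CPU/log statistics and publishes them to Consul.
  true
}

statistics_push

# Only a failing Primary should ever return 2, so CCM fails over only the Primary.
if psql -U vcap -d postgres -c 'SELECT 1;' >/dev/null 2>&1; then
  exit 0   # passing
elif [ -f /var/vcap/store/postgresql/data/recovery.conf ]; then
  exit 1   # replica with a problem: warning only
else
  exit 2   # Primary failing: critical, triggers a failover
fi
```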

## Known Issue
As part of the health check script we also do some statistics gathering (memory status, CPU status, etc.) and publish that to Consul. Part of that information gathering is a log check that scans the PostgreSQL logs for PANIC|ERROR|FATAL|WARN messages. When a customer application generates enough transactions, the PostgreSQL logs grow large enough (or are still being processed in memory) that the health check hits Consul's timeout while waiting for a response, and the check is killed. The check likely receives a return code of 128 or higher (a process killed with `kill -9` exits with 137, i.e. 128 + 9), though the scenario in which this happens makes it difficult to determine the exact error code. Since anything 2 or above is marked `critical`, our CCM kicks in and executes a failover. Part of the reason this was difficult to catch is that the output of the health check script reports only the status of the postgresql server, the pgbackrest server, and the replica servers; it does not capture the output of the statistics generation other than a `true` value when it completes successfully. We ultimately found the issue by comparing, word for word, the response message from a critical event against one from a passing event.
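As a rough, hypothetical reproduction of that failure mode, the snippet below mimics a log scan that is killed when it exceeds a check timeout; the log path and the 10-second limit are assumptions for illustration, not the tile's actual configuration.
```
# Hypothetical: emulate a log scan killed by a script-check timeout.
# The log path and the 10s limit are assumed values, not the tile's settings.
timeout --signal=KILL 10s \
  grep -cE 'PANIC|ERROR|FATAL|WARN' /var/vcap/sys/log/postgresql/*.log
echo "exit code: $?"   # 137 (128 + SIGKILL) if the scan was killed;
                       # Consul treats any exit code of 2 or above as critical.
```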

As a result, customer service instances can experience frequent failovers of the Primary PostgreSQL server and frequent interruptions to their applications. If a customer reports frequent dropped connections in their application, or difficulty when running BOSH tasks against the PostgreSQL VMs, this issue is a likely culprit.

## Resolution
To fix this issue until the next release, the customer needs to remove the `statistics_push` function call from the health check script. This fix is applied on a per-cluster basis, so it will need to be repeated for each cluster that exhibits the issue.

1. The customer will first need to determine the PostgreSQL servers that exist in the cluster.
- `bosh -e $ENV -d $SERVICE_INSTANCE vms | grep 'postgresql/'`
- An example:
```
$ bosh -e vbox -d service-instance_f9def4ea-7231-4344-b5ae-b1f0ca333f18 vms | grep 'postgresql/'
postgresql/8b702651-42b0-47f8-9152-f883dc305c38 running z2 10.244.10.3 bd38401d-2d7f-47c1-627b-c77349899a8a crunchy-small false
postgresql/dab128a1-5be5-4b0d-bb0b-bb4371e53603 running z1 10.244.9.7 3423495e-4567-45ff-599b-3d0be8333e51 crunchy-small false
```
1. Next, for each of the PostgreSQL servers, run a `sed` command to remove the `statistics_push` call.
- `bosh -e $ENV -d $SERVICE_INSTANCE ssh $POSTGRESQL_SERVER -c "sudo -u vcap sed -i '/statistics_push/d' /var/vcap/store/service/healthcheck.sh"`
- An example:
```
bosh -e vbox -d service-instance_f9def4ea-7231-4344-b5ae-b1f0ca333f18 ssh postgresql/dab128a1-5be5-4b0d-bb0b-bb4371e53603 -c "sudo -u vcap sed -i '/statistics_push/d' /var/vcap/store/service/healthcheck.sh"
Using environment '192.168.50.6' as client 'admin'

Using deployment 'service-instance_f9def4ea-7231-4344-b5ae-b1f0ca333f18'

Task 594. Done
postgresql/dab128a1-5be5-4b0d-bb0b-bb4371e53603: stderr | Unauthorized use is strictly prohibited. All access and activity
postgresql/dab128a1-5be5-4b0d-bb0b-bb4371e53603: stderr | is subject to logging and monitoring.
postgresql/dab128a1-5be5-4b0d-bb0b-bb4371e53603: stderr | Connection to 10.244.9.7 closed.

Succeeded
```
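- Optionally, the change can be confirmed with a quick grep; this verification step is our suggestion rather than part of the official procedure. An output of `0` indicates the call has been removed:
- `bosh -e $ENV -d $SERVICE_INSTANCE ssh $POSTGRESQL_SERVER -c "sudo -u vcap grep -c statistics_push /var/vcap/store/service/healthcheck.sh || true"`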
1. The `healthcheck.sh` script is invoked fresh on each health check run, so no services on the instance need to be restarted after the edit.

## Code Fix
We currently have the fix in our staging environment; it is targeted to go out with our next releases, `v04.090513.003` and `v04.100400.003`. The fix removes the statistics gathering from the health check and moves it into an independent cron job.
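For reference, the sketch below shows roughly what that direction could look like; the schedule, script name, and log path are assumptions, not the shipped change.
```
# Hypothetical cron entry (e.g. in /etc/cron.d/): run the statistics gathering
# every minute, independently of the Consul health check. The schedule,
# script name, and log path are assumed, not the actual fix.
* * * * * vcap /var/vcap/store/service/statistics_push.sh >> /var/vcap/sys/log/service/statistics_push.log 2>&1
```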