Advertisement
Guest User

Untitled

a guest
Jun 23rd, 2017
517
0
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
text 6.24 KB | None | 0 0
  1. ### Finding long running queries in Postgres
  2.  
  3. There have been a few times in the last several months at $DAY_JOB during which the following query,
  4. or some slight variation of it,
  5. has been used to identify long-running and potentially problematic queries:
  6.  
  7. ```
  8. select pid,
  9. now() - query_start as duration,
  10. query,
  11. state
  12. from pg_stat_activity
  13. ```
  14.  
  15. I believe that this can yield misleading results as other things need to be taken into consideration.
  16. To illustrate this, I'm going to iterate on the above query in one session,
  17. and issue other test queries in another one.
  18.  
  19. OK... let's start with a clean slate and run this query for the first time;
  20. on my local machine I happen to see the following:
  21.  
  22. ```
  23. pid | query | state | duration
  24. -------+-----------------------------------------+--------+-----------------
  25. 84306 | SHOW TRANSACTION ISOLATION LEVEL | idle | 00:25:31.513976
  26. 84307 | SHOW TRANSACTION ISOLATION LEVEL | idle | 00:25:31.509142
  27. 84318 | SHOW TRANSACTION ISOLATION LEVEL | idle | 00:25:01.31746
  28. 84319 | SHOW TRANSACTION ISOLATION LEVEL | idle | 00:25:01.311403
  29. 83673 | select pid, +| active | 00:00:00
  30. | query, +| |
  31. | state, +| |
  32. | now() - query_start as duration+| |
  33. | from pg_stat_activity; | |
  34. (5 rows)
  35. ```
  36.  
  37. This is all stuff we can ignore for this demonstration so I'm going to alter the query a little bit to exclude them:
  38.  
  39. ```
  40. select pid,
  41. now() - query_start as duration,
  42. query,
  43. state
  44. from pg_stat_activity
  45. where query not ilike '%pg_stat_activity%'
  46. and query not ilike '%transaction%';
  47. ```
  48.  
  49. This is a little better:
  50.  
  51. ```
  52. pid | duration | query | state
  53. -----+----------+-------+-------
  54. (0 rows)
  55. ```
  56.  
  57. Let's issue a query from another session:
  58.  
  59. ```
  60. genome_development=# \timing
  61. Timing is on.
  62. genome_development=# select 'Hi, I''m not a long-running query!';
  63. ?column?
  64. -----------------------------------
  65. Hi, I'm not a long-running query!
  66. (1 row)
  67.  
  68. Time: 0.314 ms
  69. ```
  70.  
  71. ... and then wait a few moments before looking at things in our main session:
  72.  
  73. ```
  74. genome_development=# select pid,
  75. now() - query_start as duration,
  76. query,
  77. state
  78. from pg_stat_activity
  79. where query not ilike '%pg_stat_activity%'
  80. and query not ilike '%transaction%';
  81. pid | duration | query | state
  82. -------+----------------+----------------------------------------------+-------
  83. 85188 | 00:01:08.68768 | select 'Hi, I''m not a long-running query!'; | idle
  84. (1 row)
  85. ```
  86.  
  87. What's this? This says that the duration is over 1 minute...
  88. but `psql` in the other session said that our query took 0.314 ms.
  89. What's happening here?
  90. The problem here is that there are other fields in this view that we need to look at to determine what's _actually_ taking a long time.
  91. Notice that the state of this row is `idle`.
  92. That means the process that this row represents is _not_ currently executing the query.
  93. To see this, let's issue this which is purposefully constructed to take 30 seconds to run:
  94.  
  95. ```
  96. select 'But _I_ am indeed a long-running query.' from pg_sleep(30);
  97. ```
  98.  
  99. ... and quickly switch back to the main session to see what the view is tells us:
  100.  
  101. ```
  102. genome_development=# select pid,
  103. now() - query_start as duration,
  104. query,
  105. state
  106. pg_stat_activity
  107. where query not ilike '%pg_stat_activity%'
  108. and query not ilike '%transaction%';
  109. pid | duration | query | state
  110. -------+-----------------+---------------------------------------------------------------------+--------
  111. 85188 | 00:00:03.551807 | select 'But _I_ am indeed a long-running query.' from pg_sleep(30); | active
  112. (1 row)
  113. ```
  114.  
  115. Wait... is this saying that the query took just under 4 seconds?
  116. No... because the process state is still `active` and the query is still running.
  117. If we wait for the full thirty seconds to pass, we should see a state change:
  118.  
  119. ```
  120. genome_development=# select pid,
  121. now() - query_start as duration,
  122. query,
  123. state
  124. pg_stat_activity
  125. where query not ilike '%pg_stat_activity%'
  126. and query not ilike '%transaction%';
  127. pid | duration | query | state
  128. -------+-----------------+---------------------------------------------------------------------+-------
  129. 85188 | 00:00:45.365226 | select 'But _I_ am indeed a long-running query.' from pg_sleep(30); | idle
  130. (1 row)
  131. ```
  132.  
  133. Aha... the process is `idle` now... but wait... the duration is wrong again.
  134. What is going on?
  135. The problem this time is that we need to look at something in addition to the `query_start` field to determine how long a query actually took to run.
  136. It turns out that the `pg_stat_activity` view also has a `state_change` field that we can exploit to properly figure this out.
  137. According to the documentation, the `state_change` field is the "Time when the state was last changed".
  138. (Duh.)
  139. So, if a process completes executing a query, then its state will go from `active` to `idle`, and thus we should see that the difference between the `state_change` and `query_start` field will be the actual execution time:
  140. Let's see:
  141.  
  142. ```
  143. genome_development=# select pid,
  144. query_start,
  145. state_change,
  146. query,
  147. state
  148. from pg_stat_activity
  149. where query not ilike '%pg_stat_activity%'
  150. and query not ilike '%transaction%';
  151. pid | query_start | state_change | query | state
  152. -------+-------------------------------+-------------------------------+---------------------------------------------------------------------+-------
  153. 85188 | 2017-06-23 13:48:23.776142-04 | 2017-06-23 13:48:53.778372-04 | select 'But _I_ am indeed a long-running query.' from pg_sleep(30); | idle
  154. (1 row)
  155. ```
  156.  
  157. Well, look at that... they're 30 seconds apart, just as we expected.
  158.  
  159. So... what did we learn here?
  160. What we need to do is make sure that we _only_ look at processes that are currently `active`;
  161. those are the ones that are actually running queries.
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement