Service Interruption: Can’t Destroy Machine, Deploy, or Restart
Questions / Help · App not working, rails
Jul 18

sudojosh · 3d
I’m also getting this - just like @mfwgenerics, I worked around it by scaling to create a new machine, but the original machine is still in a state where it can’t be destroyed:

Error: could not get machine [machine ID]: failed to get VM [machine ID]: unavailable: dial tcp [ipv6]:3593: connect: connection refused

My staging environment is in the same state, but I’m not adding more VMs (which I’ll surely be billed for) until I know I can clean up the zombies.

Just like OP, I see the same error in the dashboard about emergency maintenance. That’s been there for 15 hours, with no other information.

This is the second time I’ve had this kind of issue with Fly, where my service just goes down, Fly reports everything healthy, and there’s literally no information and nothing I can do other than wait and hope it comes back up sometime (hours later, probably).

I appreciate the convenience that Fly offers, but these kinds of problems erode my trust in the platform completely. Heroku had its faults, but I was never left scouring a forum trying to get my service back up - if a host was unhealthy, my dyno would be automatically moved, no worries. I’m running a small-scale golden-path Rails app with Postgres; I can’t imagine trying to fix these kinds of problems on a more complex app.
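
For anyone stuck in the same spot, the workaround amounts to something like the sketch below (assuming a 2023-era flyctl; the app name and machine ID are placeholders):

    # scale up so Fly allocates a fresh machine on a healthy host
    fly scale count 2 --app my-rails-app

    # list machines to identify the zombie on the dead host
    fly machine list --app my-rails-app

    # destroying the zombie fails while its host is unreachable:
    fly machine destroy 3d8d9999c11111 --app my-rails-app
    # => Error: could not get machine ...: connect: connection refused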

finntechnz · 3d
Adding some more information here, since I’m also surprised that this is still ongoing 12 hours later with no response.

We had four machines (app + Postgres for staging and production) running yesterday, and three of the four (including both databases) are still down and can’t be accessed. I can replicate the issues others have mentioned here.

This is our company’s external API app, so the issue broke all of our integrations.

Our team ended up setting up a new project in Fly to spin up an instance to keep us going, which took a couple of hours (backfilling environment variables and configuration etc. - not a bad test of our DR ability).

There is no way I can find to get the data off the db machines. Thank goodness this isn’t our main production db, and we were able to reverse-engineer what we needed into the new one.

Very keen to hear what’s happening with this, and why after so many hours there’s been no further info or updates.

sevenseacat · 3d
As an aside, it’s kind of a kick in the teeth to see the status page for our organization reporting no incidents - the same page that lists our apps as under maintenance and inaccessible!

mfwgenerics · 3d
Confirming my deployment is in syd too. I’m still seeing the zombie VM and observing failing CLI commands against the machine.

south-paw · 3d
[screenshot attachment - 723×356]
We have syd deployments for all our apps too.

I’m feeling very lucky that none of our paid production apps or databases are affected currently (only our development environment is), but I’m also really surprised that the issue has been ongoing for 17 hours now with no status page update, no notifications (beyond Betterstack letting us know it was down), and one note on the app with not much info as to what’s going on.

It really worries me what would happen if one of our paid production instances were affected - the data we’re working with can’t simply be ‘recovered’ later; it’d just get dropped until service resumed or we migrated to another region to get things running again.

Keen to know what’s wrong and what’s being done about it.

south-paw · 2d
The message on the app has now been updated:

Service Interruption (20 hours ago)
We are continuing to investigate an infrastructure related issue on this host.

Still no incident listed on the status page for the SYD region though :thinking:

Tutello · 2d
Not sure if it’s connected, but I had a Redis app in lhr fall into suspended status overnight, which killed an important demo.

The machine is a zombie…

machine [id] was found and is currently in a stopped state, attempting to kill…
Error: could not kill machine [id]: failed to kill VM [id]: failed_precondition: machine not in known state for signaling, stopped
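
For what it’s worth, that failed_precondition means the platform already considers the machine stopped, so a plain kill has nothing to signal. The usual recovery attempts look roughly like this (a sketch, assuming a 2023-era flyctl; the machine ID and app name are placeholders):

    # check what state the platform thinks the machine is in
    fly machine status 148e1234f56789 --app my-redis-app

    # try starting it; if it is a true zombie, force-destroy it
    fly machine start 148e1234f56789 --app my-redis-app
    fly machine destroy 148e1234f56789 --app my-redis-app --force

None of these seem to help while the underlying host itself is down, but they’re the first things to try.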

sevenseacat · 2d
I got a response from support a few hours ago:

“Unfortunately this host managed to get into an extremely poor state, and a fix is taking longer than expected. We have a team continuing to work on it, but no estimated resolution time to share right now. As soon as we have an update we will let you know.”

So I guess we just wait…

jl1 · 2d
Same issue here for me, on a host in syd. It’s completely broken a pg cluster.

The absence of any proactive status updates on this issue has been really poor.

south-paw · 2d, replying to sevenseacat
Thank you for sharing that update - surprised there’s still no status update from Fly though :cold_sweat:

I can appreciate that the issue might be taking up a lot of time and they want to focus on fixing it first - but even just a message from the staff here earlier would put me at ease about our production apps that are running.

sevenseacat · 2d
We worked out that we could create a new Postgres cluster from one of the snapshots of the currently-down app - so we’re back up and running.

(We had to create it with a different name, and when we then tried to make another one with the previous name, flyctl put the cluster on the same currently-down host! Oops.)
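
For others in the same spot, the snapshot route looks roughly like this (a sketch, assuming a 2023-era flyctl; the app names, volume ID, and snapshot ID are placeholders):

    # find the volume backing the down Postgres cluster
    fly volumes list --app my-old-pg

    # list the snapshots Fly has taken of that volume
    fly volumes snapshots list vol_0123456789abcdef

    # create a brand-new cluster (note: a new name) from a snapshot
    fly postgres create --name my-new-pg --snapshot-id vs_0123456789abcdef

The app then needs to be pointed at the new cluster, e.g. via fly postgres attach.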

mplatts · 1d
Also having this issue. Scaling worked for the Phoenix server, but the Postgres server is also dead.

And I can’t even restore the Postgres one:

Error: failed to create volume: Couldn't allocate volume, not enough compute capacity left in yyz

sevenseacat · 1d
There’s a known incident listed on the status page for YYZ; it might be related: Fly.io Status - “We are undergoing emergency vendor hardware replacement in YYZ region.”

Still crickets for the down host in SYD though :frowning:

mplatts · 1d
Yes, mine has been fixed.

south-paw · 1d, replying to sevenseacat
Really weird how radio-silent it’s been on this :thinking: - we’re coming up on 48 hours now.

sudojosh · 1d
Just a note that the status update for me now states that the service interruption was resolved 7 hours ago:

Service Interruption - resolved 7 hours ago
We are continuing to investigate an infrastructure related issue on this host.

I still had to manually restart the machine to actually bring my app back up, but I’ve been able to interact with machines again, so I guess it is resolved.
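
For completeness, the manual restart is the usual one-liner - sketched here with placeholder names, assuming a 2023-era flyctl; it just wasn’t possible until the host came back:

    # once the host is reachable again, bounce the machine
    fly machine restart e2865641a12345 --app my-app

    # then confirm the app is serving again
    fly status --app my-app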

Never heard anything from Fly - just complete silence. No status page updates either. I’m sympathetic to the problems I imagine Fly has scaling their service and support, but the takeaway from this experience is that if something outside my control happens with Fly, there’s no way to find out what’s going on, when it will be resolved, or whether there’s anything I can do about it. It sounds like even the paid email support has multi-hour response times, and even then it’s just a “we’re working on it”. I can’t recommend Fly professionally after that kind of experience, and I’m not sure I can even tolerate it for personal apps.

mfwgenerics · 9h
Update: my bad VM has finally been restored after a couple of days.

I am concerned about the lack of clarity and communication around what happened, but I’m happy to put this situation down to growing pains on Fly’s part. I think I’ll be sticking to non-critical, non-stateful workloads for the near term though. :sweat_smile:

south-paw · 17m
This topic has also now been made private and can’t be found if you haven’t already commented on it.

There was no notification (email or comment here) that services had been restored.

If one of our (paid) production instances had been affected rather than a dev one - frankly, I’d be completely reevaluating whether we would continue using Fly for hosting. The zero communication on this has really shocked me, and it worries me about how something more important to us would be handled.

All I’m asking for is some official message or reassurance about what happened here - I don’t think that’s too much to ask. Our customers would expect it of our company if something went down for 3 days.