Service Interruption: Can’t Destroy Machine, Deploy, or Restart
Questions / Help · App not working, rails
Jul 18

sudojosh · 3d
I’m also getting this - just like @mfwgenerics, I worked around it by scaling to create a new machine, but the original machine is still in a state where it can’t be destroyed:

Error: could not get machine [machine ID]: failed to get VM [machine ID]: unavailable: dial tcp [ipv6]:3593: connect: connection refused

My staging environment is in the same state, but I’m not adding more VMs (which I’ll surely be billed for) until I know I can clean up the zombies.

Just like OP, I see the same error in the dashboard about emergency maintenance. That’s been there for 15 hours, with no other information.

This is the second time I’ve had this kind of issue with Fly, where my service just goes down, Fly reports everything healthy, and there’s literally no information and nothing I can do other than wait and hope it comes back up sometime (hours later, probably).

I appreciate the convenience that Fly offers, but these kinds of problems erode my trust in the platform completely. Heroku had its faults, but I was never left scouring a forum trying to get my service back up - if a host was unhealthy, my dyno would be automatically moved, no worries. I’m running a small-scale golden-path Rails app with Postgres; I can’t imagine trying to fix these kinds of problems on a more complex app.
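
For anyone stuck in the same spot, the workaround amounts to something like the sketch below (assuming a 2023-era flyctl; the app name and machine ID are placeholders):

    # scale up so Fly allocates a fresh machine on a healthy host
    fly scale count 2 --app my-rails-app

    # list machines to identify the zombie on the dead host
    fly machine list --app my-rails-app

    # destroying the zombie fails while its host is unreachable:
    fly machine destroy 3d8d9999c11111 --app my-rails-app
    # => Error: could not get machine ...: connect: connection refused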

finntechnz · 3d
Adding some more information here, since I’m also surprised that this is still ongoing 12 hours later with no response.

We had four machines (app + Postgres for staging and production) running yesterday, and three of the four (including both databases) are still down and can’t be accessed. I can replicate the issues others have mentioned here.

This is our company’s external API app, so the issue broke all of our integrations.

Our team ended up setting up a new project in Fly to spin up an instance to keep us going, which took a couple of hours (backfilling environment variables and configuration etc. - not a bad test of our DR ability).

There is no way I can find to get the data off the db machines. Thank goodness this isn’t our main production db, and we were able to reverse-engineer what we needed into the new one.

Very keen to hear what’s happening with this, and why after so many hours there’s been no further info or updates.

sevenseacat · 3d
As an aside, it’s kind of a kick in the teeth to see the status page for our organization reporting no incidents - the same page that lists our apps as under maintenance and inaccessible!

mfwgenerics · 3d
Confirming my deployment is in syd too. I’m still seeing the zombie VM and observing failing CLI commands against the machine.

south-paw · 3d
[screenshot attachment - 723×356]
We have syd deployments for all our apps too.

I’m feeling very lucky that none of our paid production apps or databases are affected currently (only our development environment is), but I’m also really surprised that the issue has been ongoing for 17 hours now with no status page update, no notifications (beyond Betterstack letting us know it was down), and one note on the app with not much info as to what’s going on.

It really worries me what would happen if one of our paid production instances were affected - the data we’re working with can’t simply be ‘recovered’ later; it’d just get dropped until service resumed or we migrated to another region to get things running again.

Keen to know what’s wrong and what’s being done about it.

south-paw · 2d
The message on the app has now been updated:

Service Interruption (20 hours ago)
We are continuing to investigate an infrastructure related issue on this host.

Still no incident listed on the status page for the SYD region though :thinking:

Tutello · 2d
Not sure if it’s connected, but I had a Redis app in lhr fall into suspended status overnight, which killed an important demo.

The machine is a zombie…

machine [id] was found and is currently in a stopped state, attempting to kill…
Error: could not kill machine [id]: failed to kill VM [id]: failed_precondition: machine not in known state for signaling, stopped
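
For what it’s worth, that failed_precondition means the platform already considers the machine stopped, so a plain kill has nothing to signal. The usual recovery attempts look roughly like this (a sketch, assuming a 2023-era flyctl; the machine ID and app name are placeholders):

    # check what state the platform thinks the machine is in
    fly machine status 148e1234f56789 --app my-redis-app

    # try starting it; if it is a true zombie, force-destroy it
    fly machine start 148e1234f56789 --app my-redis-app
    fly machine destroy 148e1234f56789 --app my-redis-app --force

None of these seem to help while the underlying host itself is down, but they’re the first things to try.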

sevenseacat · 2d
I got a response from support a few hours ago:

“Unfortunately this host managed to get into an extremely poor state, and a fix is taking longer than expected. We have a team continuing to work on it, but no estimated resolution time to share right now. As soon as we have an update we will let you know.”

So I guess we just wait…

jl1 · 2d
Same issue here for me, on a host in syd. It’s completely broken a pg cluster.

The absence of any proactive status updates on this issue has been really poor.

south-paw · 2d, replying to sevenseacat
Thank you for sharing that update - surprised there’s still no status update from Fly though :cold_sweat:

I can appreciate that the issue might be taking up a lot of time and they want to focus on fixing it first - but even just a message from the staff here earlier would put me at ease about our production apps that are running.

sevenseacat · 2d
We worked out that we could create a new Postgres cluster from one of the snapshots of the currently-down app - so we’re back up and running.

(We had to create it with a different name, and when we then tried to make another one with the previous name, flyctl put the cluster on the same currently-down host! Oops.)
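
For others in the same spot, the snapshot route looks roughly like this (a sketch, assuming a 2023-era flyctl; the app names, volume ID, and snapshot ID are placeholders):

    # find the volume backing the down Postgres cluster
    fly volumes list --app my-old-pg

    # list the snapshots Fly has taken of that volume
    fly volumes snapshots list vol_0123456789abcdef

    # create a brand-new cluster (note: a new name) from a snapshot
    fly postgres create --name my-new-pg --snapshot-id vs_0123456789abcdef

The app then needs to be pointed at the new cluster, e.g. via fly postgres attach.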

mplatts · 1d
Also having this issue. Scaling worked for the Phoenix server, but the Postgres server is also dead.

And I can’t even restore the Postgres one:

Error: failed to create volume: Couldn't allocate volume, not enough compute capacity left in yyz

sevenseacat · 1d
There’s a known incident listed on the status page for YYZ; it might be related: Fly.io Status - “We are undergoing emergency vendor hardware replacement in YYZ region.”

Still crickets for the down host in SYD though :frowning:

mplatts · 1d
Yes, mine has been fixed.

south-paw · 1d, replying to sevenseacat
Really weird how radio-silent it’s been on this :thinking: - we’re coming up on 48 hours now.

sudojosh · 1d
Just a note that the status update for me now states that the service interruption was resolved 7 hours ago:

Service Interruption - resolved 7 hours ago
We are continuing to investigate an infrastructure related issue on this host.

I still had to manually restart the machine to actually bring my app back up, but I’ve been able to interact with machines again, so I guess it is resolved.
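
For completeness, the manual restart is the usual one-liner - sketched here with placeholder names, assuming a 2023-era flyctl; it just wasn’t possible until the host came back:

    # once the host is reachable again, bounce the machine
    fly machine restart e2865641a12345 --app my-app

    # then confirm the app is serving again
    fly status --app my-app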

Never heard anything from Fly - just complete silence. No status page updates either. I’m sympathetic to the problems I imagine Fly has scaling their service and support, but the takeaway from this experience is that if something outside my control happens with Fly, there’s no way to find out what’s going on, when it will be resolved, or whether there’s anything I can do about it. It sounds like even the paid email support has multi-hour response times, and even then it’s just a “we’re working on it”. I can’t recommend Fly professionally after that kind of experience, and I’m not sure I can even tolerate it for personal apps.

mfwgenerics · 9h
Update: my bad VM has finally been restored after a couple of days.

I am concerned about the lack of clarity and communication around what happened, but I’m happy to put this situation down to growing pains on Fly’s part. I think I’ll be sticking to non-critical, non-stateful workloads for the near term though. :sweat_smile:

south-paw · 17m
This topic has also now been made private and can’t be found if you haven’t already commented on it.

There was no notification (email or comment here) that services had been restored.

If one of our (paid) production instances had been affected rather than a dev one - frankly, I’d be completely reevaluating whether we would continue using Fly for hosting. The zero communication on this has really shocked me, and it worries me about how something more important to us would be handled.

All I’m asking for is some official message or reassurance about what happened here - I don’t think that’s too much to ask. Our customers would expect it of our company if something went down for 3 days.