Sep 4th, 2017
- dispatch_get_global_queue() is in practice on of the worst thing that the dispatch API provides, because despite all the best efforts of the runtime, there aren't enough information at runtime about your operations/actors/... to understand what your intent is and optimize for it. Which means that the language should make sure that (1) the anti-pattern is not the default behavior and (2) the interface provides a way to give and propagate the hints (dependency relationships, ordering, execution context, priorities, ...) the runtime will potentially need upfront.
- In general suspending/resuming work is very difficult to handle for a runtime (implementation wise), has large memory costs, and breaks priority inversion avoidance. dispatch_suspend()/dispatch_resume() is one of the banes of my existence when it comes to dispatch API surface. It only makes sense for dispatch source "I don't want to receive these events anymore for a while" is a perfectly valid thing to say or do. But suspending a queue or work is ripping the carpet from under the feet of the OS as you just basically make all work that is depending on the suspended one invisible and impossible to reason about.
- FWIW dispatch sources, and more importantly dispatch mach channels (which is the private interface that is used to implement XPC Connections) have a design that try really really really hard to not fall into any these pitfalls, are priority inheritance friendly, execute on *distributed* execution contexts, and have a state machine exposed through "on$Event" callbacks. We should benefit from the many years of experience that are condensed in these implementations when thinking about Actors and the primitives they provide.
- The place where GCD "fails" at is that if you target your individual serial queues to the global concurrent queues (a.k.a. root queues) which means "please pool, do your job", then yes it doesn't scale, because we take these individual serial queues as proxies for OS threads.
- In GCD there's a very big difference between the one queue at the root of your graph (just above the thread pool) and any other that is within. The number that doesn't scale is the number of the former contexts, not the latter.
- The real problem is that if you go async you need to be async all the way. Node.js and other similar projects have understood that a very long time ago. If you express dependencies between asynchronous execution context with a blocking relationship such as above, then you're just committing performance suicide. GCD handles this by adding more threads and overcommitting the system.
- The real world is far messier though. In practice, real world code blocks all of the time. In the case of GCD tasks, this is often tolerable for most apps, because their CPU usage is bursty and any accidental “thread explosion” that is created is super temporary. That being said, programs that create thousands of queues/closures that block on I/O will naturally get thousands of threads. GCD is efficient but not magic.
- I think the real problem is that programmers cannot pretend that resources are infinite.
- NSOperation has several implementation issues, and using it to encapsulate asynchronous work means that you don't get the correct priorities (I don't say it cant' be fixed, I honnestly don't know, I just know from the mouth of the maintainer that NSOperation makes only guarantees if you do all your work from -[NSOperation main]).
- In dispatch, there are 2 kind of queues at the API level in that regard:
- - the global queues, which aren't queues like the other and really is just an abstraction on top of the thread pool
- - all the other queues that you can target on each other the way you want.
- It is clear today that it was a mistake and that there should have been 3 kind of queues:
- - the global queues, which aren't real queues but represent which family of system attributes your execution context requires (mostly priorities), and we should have disallowed enqueuing raw work on it
- - the bottom queues (which GCD since last year tracks and call "bases" in the source code) that are known to the kernel when they have work enqueued
- - any other "inner" queue, which the kernel couldn't care less about
- In dispatch, we regret every passing day that the difference between the 2nd and 3rd group of queues wasn't made clear in the API originally.
- I like to call the 2nd category execution contexts, but I can also see why you want to pass them as Actors, it's probably more uniform (and GCD did the same by presenting them both as queues). Such top-level "Actors" should be few, because if they all become active at once, they will need as many threads in your process, and this is not a resource that scales. This is why it is important to distinguish them. And like we're discussing they usually also wrap some kind of shared mutable state, resource, or similar, which inner actors probably won't do.
- Private concurrent queues are not a success in dispatch and cause several issues, these queues are second class citizens in GCD in terms of feature they support, and building something with concurrency *within* is hard. I would keep it as "that's where we'll go some day" but not try to attempt it until we've build the simpler (or rather less hard) purely serial case first.
- This is why our belief (in my larger K(ernel) & R(untime) team) is that instead of trying to paper over the fact that there's a real OS running your stuff, we need to embrace it and explain to people that everywhere in a traditional POSIX world they would have used a real pthread_create()d thread to perform the work of a given subsystem, they create one such category #2 bottom queue that represents this thread (and you make this subsystem an Actor), and that the same way in POSIX you can't quite expect a process to have thousands of threads, you can't really have thousands of such top level actors either. It doesn't mean that there can't be some anonymous pool to run stuff on it for the stuff that is less your raison d'être, it just means that the things your app that are really what it was built to do should be organized in a resource aware way to take maximum advantage of the system. Throwing a million actors to the system without more upfront organization and initial thinking by the developers is not optimizable.
Please, Sign In to add comment