I look at cross core communication as a 100x latency penalty. Everything follows from there. The dependencies in the workload ultimately determine how it should be spread across the cores (or not!). The real elephant in the room is that oftentimes it's much faster to just do the whole job on a single core even if you have 255 others available. Some workloads do not care what kind of clever scheduler you have in hand. If everything constantly depends on the prior action you will never get any uplift.
You see this most obviously (visually) in places like game engines. In Unity, the difference between non-burst and burst-compiled code is very extreme. The difference between single and multi core for the job system is often irrelevant by comparison. If the amount of cpu time being spent on each job isn't high enough, the benefit of multicore evaporates. Sending a job to be ran on the fleet has a lot of overhead. It has to be worth that one time 100x latency cost both ways.
The GPU is the ultimate example of this. There are some workloads that benefit dramatically from the incredible parallelism. Others are entirely infeasible by comparison. This is at the heart of my problem with the current machine learning research paradigm. Some ML techniques are terrible at running on the GPU, but it seems as if we've convinced ourselves that GPU is a prerequisite for any kind of ML work. It all boils down to the latency of the compute. Getting data in and out of a GPU takes an eternity compared to L1. There are other fundamental problems with GPUs (warp divergence) that preclude clever workarounds.
Astute points. I've worked on an extremely performant facial recognition system (tens of millions of face compares per second per core) that lives in L1 and does not use the GPU for the FR inference at all, only for the display of the video and the tracked people within. I rarely even bother telling ML/DL/AI people it does not use the GPU, because I'm just tired of the argument that "we're doing it wrong".
How are you doing tens of millions of faces per second per core, first of all assuming a 5ghz processor, that gives you 500 cycles per image if you do ten million a second, that's not nearly enough to do anything image related. Second of all L1 cache is at most in the hundreds of kilobytes, so the faces aren't in L1 but must be retrieved from elsewhere...??
You can't look at it like _that_. Biometrics has its own "things". I don't know what OP is actually doing, but it's probably not classical image processing. Most probably facial features are going through some "form of LGBPHS binarized and encoded which is then fed into an adaptive bloom filter based transform"[0].
Paper quotes 76,800 bits per template (less compressed) and with 64-bit words it's what, 1200 64-bit bitwise ops. at 4.5 Ghz it's 4.5b ops per second / 1200 ops per per comparison which is ~3.75 million recognitions per second. Give or take some overhead, it's definitely possible.
Correct, it’s probably distance of a vector or something like that after the bloom. Take the facial points as a vec<T> as you only have a little over a dozen and it’s going to fit nicely in L1.
> assuming a 5ghz processor, that gives you 500 cycles per image if you do ten million a second
Modern CPUs don't quite work this way. Many instructions can be retired per clock cycle.
> Second of all L1 cache is at most in the hundreds of kilobytes, so the faces aren't in L1 but must be retrieved from elsewhere...??
Yea, from L2 cache. It's caches all the way down. That's how we make it go really fast. The prefetcher can make this look like magic if the access patterns are predictable (linear).
The keyword is CAN, there can also be huge penalties (random main-memory accesses are over a cycles typically), the parent was probably considering a regular image transform/comparison and 20 pixels per cycle even for low resolution 100x100 images is way above what we do today.
As others have mentioned, they're probably doing some kind of embedding like search primarily and then 500 cycles per face makes more sense, but it's not a full comparison.
Back in the old days of "Eigenfaces", you could project faces into 12- or 13-dimensional space using SVD and do k-nearest-neighbor. This fit into cache even back in the 90s, at least if your faces were pre-cropped to (say) 100x100 pixels.
I don't know the application, but just guessing that you don't need to compare an entire full-resolution camera image, but perhaps some smaller representation like an embedding space or pieces of the image
You can handle hundreds of millions of transactions per second if you are thoughtful enough in your engineering. ValueDisruptor in .NET can handle nearly half a billion items per second per core. The Java version is what is typically used to run the actual exchanges (no value types), so we can go even faster if we needed to without moving to some exotic compute or GPU technology.
That's fine, but a work-stealing scheduler doesn't redistribute work willy-nilly. Locally-submitted tasks are likely to remain local, and are generally stolen when stealing does pay off. If everything is more-or-less evenly distributed, you'll get little or no stealing.
That's not to say it's perfect. The problem is in anticipating how much workload is about to arrive and deciding how many worker threads to spawn. If you overestimate and have too many worker threads running, you will get wasteful stealing; if you're overly conservative and slow to respond to growing workload (to avoid over-stealing), you'll wait for threads to spawn and hurt your latencies just as the workload begins to spike.
There’s secondary costs though - because you might run on any thread you have to sprinkle atomics and/or mutexes all over the place (in Rust parlance the tasks spawned must be Send) which have all sorts of implicit performance costs that stack up even if you never transfer the task.
In other words, you could probably easily do 10m op/s per core on a thread per core design but struggle to get 1m op/s on a work stealing design. And the work stealing will be total throughput for the machine whereas the 10m op/s design will generally continue scaling with the number of CPUs.
An occasional successful CAS (on an owned cache line) has very little cost, but if you have to sprinkle atomics/mutexes all over the place, then there's something that's clearly not scalable in your design regardless of the concurrency implementation (you're expecting contention in a lot of places).
An atomic add on a 6ghz high end desktop CPU (13900) is I believe on the order of 4-10ns. If it’s in your hot path your hot path can’t go faster than 50-100 million operations/s - that’s the cost of 1 such instruction in your hotpath (down from the 24 billion non-atomic additions your 6ghz could do otherwise). A CAS brings this down to ~20-50 Mops/s. So it’s quite a meaningful slowdown if you actually want to use the full throughput of your CPU. And if that cache line is cached on another CPU you pay an additional hidden latency that could be anywhere from 40-200ns further reducing your hotpath to a maximum of 5-25MHz (and ignoring secondary effects of slowing down those cores without them even doing anything). God forbid there’s any contention - you’re looking at a variance of 20x between the optimal and worst case of how much of a throughput reduction you see by having a single CAS in your hot loop. And this is just talking about the task scheduler - at least in Rust you’ll need to have thread-safe data structures being accessed within the task itself - that’s what I was referring to as “sprinkled”. If you really want to target something running at 10Mops/s on a single core, I don’t think you can possibly get there with a task stealing approach.
These aren’t task queues as are being discussed here. It’s more like rayon - I have a par_iter and I want that to go as fast as possible on a large number of elements. Slightly different use case than thread per core vs work stealing runtime.
I was with a similar assumption that thread per core might be the best approach for one of my OpenSource Rust libraries that is a Workflow Orchestration engine. The engine is focused on payment processing.
The perv version had thread local engine and focused on thread per core. When I moved to a pure async based engine using tokio runtime and all underlying libraries made thread safe, it improved the performance 2x. The entire workload being fully CPU driven with no IO. I was assuming tokio mostly does better only for IO based workloads, however my tests proved me wrong.
Now am not moving away from async approach.
https://github.com/GoPlasmatic/dataflow-rs
I'd say it's pretty normal for a workflow. If you have a lot of things that can proceed independently of each other, you're likely to see that characterized as "multiple workflows".
Say you're making a four-course meal. In the abstract, each course is independent of the other three, but internally the steps of its preparation have exactly this kind of dependence, where step 3 is scheduled after step 2 because doing those steps in the other order will ruin the food.
If you ever want to make just one of those courses -- maybe you're going to a potluck -- now you've got an almost fully sequential workflow.
(And in practice, the full four-course meal is much more sequential than it appears in the abstract, because many of the steps of each course must contend for scarce resources, such as the stove, with steps of other courses.)
The thing with GPUs is that for many problems really dumb and simple algorithms (think bubble sort equivalent) are many times faster than very fancy CPU algorithms (think quick sort equivalent). Your typical non-neural-network GPU algorithm is rarely using more than 50% of it's power, yet still outperforms carefully written CPU algorithms.
Pre-work time + pack up time + send time + unpack time + work time + pack up time + send time + unpack time + post-work time.
All remote work has these properties. Even something 'simple' like a remote REST call. If 'remote work time' plus all that other stuff is less than your local calls then it is time wise worth sending it remote. If not local CPU would win.
That in many cases right now the GPU is 'winning' that race.
There are some neat tricks to remove almost all the pack and unpack time. Apache Arrow can help a ton there (uses the same data format on both CPU and GPU or other accelerator). And on some unified memory systems even the send time can be very low.
Except it is only worth doing, if when taking into account loading data into the GPU and getting the results back, is still faster than total execution on the CPU.
It doesn't help that GPU beats the CPU in compute, if a plain SIMD approach outperforms the total execution time.
> If everything constantly depends on the prior action you will never get any uplift.
Not always. For differential equations with large enough matrices, the independent work each core can do outperforms the communication overhead of core-to-core latency.
State can depend on the previous time point, or even the same time point. I see this misconception often in audio programming "you cannot parallelise work because it depends on the previous sample". As long as you can find parallelism somewhere and it's less than the overhead, you can benefit. Obviously if there's zero parallelism in the problem, no amount of cores will help.
I've worked on several thread-per-core systems that were purpose-built for extreme dynamic data and load skew. They work beautifully at very high scales on the largest hardware. The mechanics of how you design thread-per-core systems that provide uniform distribution of load without work-stealing or high-touch thread coordination have idiomatic architectures at this point. People have been putting thread-per-core architectures in production for 15+ years now and the designs have evolved dramatically.
The architectures from circa 2010 were a bit rough. While the article has some validity for architectures from 10+ years ago, the state-of-the-art for thread-per-core today looks nothing like those architectures and largely doesn't have the issues raised.
News of thread-per-core's demise has been greatly exaggerated. The benefits have measurably increased in practice as the hardware has evolved, especially for ultra-scale data infrastructure.
Are there any resources/learning material about the more modern thread-per-core approaches? It’s a particular area of interest for me, but I’ve had relatively little success finding more learning material, so I assume there’s lots of tightly guarded institutional knowledge.
Unfortunately, not really. I worked in HPC when it was developed as a concept there, which is where I learned it. I brought it over into databases which was my primary area of expertise because I saw the obvious cross-over application to some scaling challenges in databases. Over time, other people have adopted the ideas but a lot of database R&D is never published.
Writing a series of articles about the history and theory of thread-per-core software architecture has been on my eternal TODO list. HPC in particular is famously an area of software that does a lot of interesting research but rarely publishes, in part due to its historical national security ties.
The original thought exercise was “what if we treated every core like a node in a supercomputing cluster” because classical multithreading was scaling poorly on early multi-core systems once the core count was 8+. The difference is that some things are much cheaper to move between cores than an HPC cluster and so you adapt the architecture to leverage the things that are cheap that you would never do on a cluster while still keeping the abstraction of a cluster.
As an example, while moving work across cores is relatively expensive (e.g. work stealing), moving data across cores is relatively cheap and low-contention. The design problem then becomes how to make moving data between cores maximally cheap, especially given modern hardware. It turns out that all of these things have elegant solutions in most cases.
There isn’t a one-size-fits-all architecture but you can arrive at architectures that have broad applicability. They just don’t look like the architectures you learn at university.
I'll toss $20-50 your way to bump up the priority on writing that knowledge down, only strings are it has to actually get done and be publicly available
As someone with workloads that can benefit from these techniques, but limited resources to put them into practice, my working thesis has been:
* Use a multi-threaded tokio runtime that's allocated a thread-per-core
* Focus on application development, so that tasks are well scoped / skewed and don't _need_ stealing in the typical case
* Over time, the smart people working on Tokio will apply research to minimize the cost of work-stealing that's not actually needed.
* At the limit, where long-lived tasks can be distributed across cores and all cores are busy, the performance will be near-optimal as compared with a true thread-per-core model.
What's your hot take? Are there fundamental optimizations to a modern thread-per-core architecture which seem _impossible_ to capture in a work-stealing architecture like Tokio's?
A core assumption underlying thread-per-core architecture is that you will be designing a custom I/O and execution scheduler that is purpose-built for your software and workload at a very granular level. Most expectations of large performance benefits follow from this assumption.
At some point, people started using thread-per-core style while delegating scheduling to a third-party runtime, which almost completely defeats the purpose. If you let tokio et al do that for you, you are leaving a lot of performance and scale on the table. This is an NP-Hard problem; the point of solving it at compile-time is that it is computationally intractable for generic code to create a good schedule at runtime unless it is a trivial case. We need schedulers to consistently make excellent decisions extremely efficiently. I think this point is often lost in discussions of thread-per-core. In the old days we didn’t have runtimes, it was just assumed you would be designing an exotic scheduler. The lack of discussion around this may have led people to believe it wasn’t a critical aspect.
The reality that designing excellent workload-optimized I/O and execution schedulers is an esoteric, high-skill endeavor. It requires enormous amounts of patience and craft, it doesn’t lend itself to quick-and-dirty prototypes. If you aren’t willing to spend months designing the many touch points for the scheduler throughout your software, the algorithms for how events across those touch points interact, and analyzing the scheduler at a systems level for equilibria and boundary conditions then thread-per-core might not be worth the effort.
That said, it isn’t rocket science to design a reasonable schedule for software that is e.g. just taking data off the wire and doing something with it. Most systems are not nearly as complex as e.g. a full-featured database kernel.
If I remember correctly, these work stealing task schedulers started getting pushed around the mid 2000s as a result of Intel failing to scale the Pentium 4 architecture to expected single-thread performance levels.
Libraries like .NET's Task Parallel Library or Intel Threaded Building Blocks pretty much cemented these work-stealing task architectures. It's not that they didn't work well enough, but Intel Core came along, and single-threaded perf scaling was possible again, so these libraries became less of a focus.
I feel I'm still doing it the old 2010 way, with all my hand-crafted dpdk-and-pipelines-and-lockless-queues-and-homemade-taskgraph-scheduler, any modern reference (apart from 'use seastar' ? ... which fair if it fills your needs) ?
That being said, there are some things that are generally true for the long term: use a pinned thread per core, maximize locality (of data and code, wherever relevant), use asynchronous programming if performance is necessary. To incorporate the OP, give control where it's due to each entity (here, the scheduler). Cross-core data movement was never the enemy, but unprincipled cross-core data movement can be. If even distribution of work is important, work-stealing is excellent, as long as it's done carefully. Details like how concurrency is implemented (shared-state, here) or who controls the data are specific to the circumstances.
I did mass scale performance benchmarking on highly optimized workloads using lockfree queues and fibers, and locking to a core almost never was faster. There were a few topologies where it was, but they were outliers.
This was on a wide variety of intel, AMD, NUMA, ARM processors with different architectures, OSes and memory configurations.
Part of the reason is hyper threading (or threadripper type archs) but even locking to groups wasn’t usually faster.
This was even moreso the case when you had competing workloads stealing cores from the OS scheduler.
Most high-performance workloads are limited by memory-bandwidth these days. Even in HPC that became the primary bottleneck for a large percentage of workloads in the 2000s. High-performance data infrastructure is largely the same. You can drive 200 GB/s of I/O on a server in real systems today.
The memory-bandwidth bound cases is where thread-per-core tends to shine. It was the problem in HPC that thread-per-core was invented to solve and it empirically had significant performance benefits. Today we use it in high-scale databases and other I/O intensive infrastructure if performance and scalability are paramount.
That said, it is an architecture that does not degrade gracefully. I've seen more thread-per-core implementations in the wild that were broken by design than ones that were implemented correctly. It requires a commitment to rigor and thoroughness in the architecture that most software devs are not used to.
I think workload might be as (if not more) the factor than the uniqueness of the topology itself for how much pinning matters. If your workload is purely computationally limited then it doesn't matter. Same if it's actually I/O limited. If it's memory bandwidth limited then it depends on things like how much fits in per core cache vs shared cache vs going to RAM, and how is RAM actually fed to the cores.
A really interesting niche is all of the performance considerations around the design/use of VPP (Vector Packet Processing) in the networking context. It's just one example of a single niche, but it can give a good idea of how both "changing the way the computation works" and "changing the locality and pinning" can come together at the same time. I forget the username but the person behind VPP is actually on HN often, and a pretty cool guy to chat with.
Or, as vacuity put it, "there are no hard rules; use principles flexibly".
Thanks for sharing. Aside from what the other replies to you have shared, I admittedly have less experience, and I'm mainly interested in the OS perspective. Balancing global and local optimizations is hard, so the OS deserves some leeway, but as I see it, mainstream OSes tend to be awkward no matter what. It's long past time for OS schedulers to consider high-level metadata to get a rough idea of the idiosyncrasies of the workload. In the extreme case, designing the OS from the ground up to minimize cross-core contention[0] gives the most control, maximizing potential performance. As jandrewrogers says in a sibling reply, this requires a commitment to rigor, treacherous and nonportable as it is. In any case, with improved infrastructure ("with sufficiently smart compilers"...), thread-per-core gains power.
Context switches (when you change the thread running on a specific core) is one of the most computational expensive things computers do. If somehow you can't use a threadpool and some sort of task abstraction, you probably shouldn't be doing anything with multiple threads or asynchronous code.
I have absolutely no idea why anyone would think breaking the thread per core model is better and I seriously question the knowledge of anyone proposing another model without some VERY good explanation. The GP isn't even close to this in any way.
Changing task is some fraction as bad as changing thread because less state is changed, but some state is still changed. For example, if you run unrelated tasks, they all start with cold caches. It might not clear the IBPB, TLB etc for security because it doesn't go through the kernel, but if the task was completely unrelated, none of those caches were helping with the transition anyway. Usually, the task is related to some small degree.
> a task can yield, which, conceptually, creates a new piece of work that gets shoved onto the work queues (which is "resume that task"). You might not think of it as "this task is suspended and will be resumed later" as much as *"this piece of work is done and has spawned a new piece of work."*
Never thought of it that way, but it’s indeed true — a new task does get enqueued in that case. Thanks for the insight!
Async etc is also a function of dynamic work loads sometimes exasperated by the fact socket/channel A is slow so while waiting there deal with channels b,c,d,.. which are also slow for various reasons.
Per core threads and not much else are fairly required for nyse, trading, oms, and i bet things like switches. A web browser might be their polar opposite.
"At that time, ensuring maximum CPU utilization was not so important, since you’d typically be bound by other things, but things like disk speed has improved dramatically in the last 10 years while CPU speeds have not."
I'm going to quibble with that observation. CPUs HAVE improved dramatically in the past 10 years. It just doesn't look dramatic if your comparison point is storage speed.
Many runtimes and OS APIs have the possibility to attach decisions to which threads on which cores get used.
Java, .NET, Delphi, and C++ co-routines, all provide mechanisms to provide our own scheduler, which can then be used to say what goes where.
Maybe cool languages should look more into the ideas of these not so cool our parents ecosystems kind of languages. There are some interesting ideas there.
How so? AFAIK BEAM is pretty much agnostic between work-stealing and work-sharding* architectures.
* I prefer the term "work-sharding" over "thread-per-core", because work-stealing architectures usually also use one thread per core, so it tends to confuse people.
The BEAM schedulers are work stealing, and there's no way to bind a process to a scheduler (or at least, there's no publically documented way in upstream OTP).
You can adjust some settings for how schedulers work with respect to balancing load, but afaik, work stealing cannot be disabled... when a scheduler has no runnable processes, it will look at the runqueue of another scheduler and steal a runnable process if any are availabke (in priority order).
It does default to one 'cpu scheduler' per cpu thread, plus some i/o schedulers and maybe some dirty schedulers.
I look at cross core communication as a 100x latency penalty. Everything follows from there. The dependencies in the workload ultimately determine how it should be spread across the cores (or not!). The real elephant in the room is that oftentimes it's much faster to just do the whole job on a single core even if you have 255 others available. Some workloads do not care what kind of clever scheduler you have in hand. If everything constantly depends on the prior action you will never get any uplift.
You see this most obviously (visually) in places like game engines. In Unity, the difference between non-burst and burst-compiled code is very extreme. The difference between single and multi core for the job system is often irrelevant by comparison. If the amount of cpu time being spent on each job isn't high enough, the benefit of multicore evaporates. Sending a job to be ran on the fleet has a lot of overhead. It has to be worth that one time 100x latency cost both ways.
The GPU is the ultimate example of this. There are some workloads that benefit dramatically from the incredible parallelism. Others are entirely infeasible by comparison. This is at the heart of my problem with the current machine learning research paradigm. Some ML techniques are terrible at running on the GPU, but it seems as if we've convinced ourselves that GPU is a prerequisite for any kind of ML work. It all boils down to the latency of the compute. Getting data in and out of a GPU takes an eternity compared to L1. There are other fundamental problems with GPUs (warp divergence) that preclude clever workarounds.
Astute points. I've worked on an extremely performant facial recognition system (tens of millions of face compares per second per core) that lives in L1 and does not use the GPU for the FR inference at all, only for the display of the video and the tracked people within. I rarely even bother telling ML/DL/AI people it does not use the GPU, because I'm just tired of the argument that "we're doing it wrong".
How are you doing tens of millions of faces per second per core, first of all assuming a 5ghz processor, that gives you 500 cycles per image if you do ten million a second, that's not nearly enough to do anything image related. Second of all L1 cache is at most in the hundreds of kilobytes, so the faces aren't in L1 but must be retrieved from elsewhere...??
You can't look at it like _that_. Biometrics has its own "things". I don't know what OP is actually doing, but it's probably not classical image processing. Most probably facial features are going through some "form of LGBPHS binarized and encoded which is then fed into an adaptive bloom filter based transform"[0].
Paper quotes 76,800 bits per template (less compressed) and with 64-bit words it's what, 1200 64-bit bitwise ops. at 4.5 Ghz it's 4.5b ops per second / 1200 ops per per comparison which is ~3.75 million recognitions per second. Give or take some overhead, it's definitely possible.
[0] https://www.christoph-busch.de/files/Gomez-FaceBloomFilter-I...
Cache locality is a thing. Like in raytracing and the old confucian adage that says "Primary rays cache, secondary trash".
Correct, it’s probably distance of a vector or something like that after the bloom. Take the facial points as a vec<T> as you only have a little over a dozen and it’s going to fit nicely in L1.
NDA prevents me from saying anything beyond the compares are minimal representatives of a face template, and those stream through the core's caches.
A public report from the employer about the tech https://cyberextruder.com/wp-content/uploads/2022/06/Accurac... (I no longer work there.)
Queue the “If I were to build it…” ;)
> assuming a 5ghz processor, that gives you 500 cycles per image if you do ten million a second
Modern CPUs don't quite work this way. Many instructions can be retired per clock cycle.
> Second of all L1 cache is at most in the hundreds of kilobytes, so the faces aren't in L1 but must be retrieved from elsewhere...??
Yea, from L2 cache. It's caches all the way down. That's how we make it go really fast. The prefetcher can make this look like magic if the access patterns are predictable (linear).
The keyword is CAN, there can also be huge penalties (random main-memory accesses are over a cycles typically), the parent was probably considering a regular image transform/comparison and 20 pixels per cycle even for low resolution 100x100 images is way above what we do today.
As others have mentioned, they're probably doing some kind of embedding like search primarily and then 500 cycles per face makes more sense, but it's not a full comparison.
Back in the old days of "Eigenfaces", you could project faces into 12- or 13-dimensional space using SVD and do k-nearest-neighbor. This fit into cache even back in the 90s, at least if your faces were pre-cropped to (say) 100x100 pixels.
I don't know the application, but just guessing that you don't need to compare an entire full-resolution camera image, but perhaps some smaller representation like an embedding space or pieces of the image
Could you tell me a bit about how you were able to ensure the model is close to the cache?
the secret is to keep things ˢᵐᵒˡ
Do you work for Flock?
No shot are you doing tens of millions of anything useful per second per core. That's like beyond HFT numbers.
You can handle hundreds of millions of transactions per second if you are thoughtful enough in your engineering. ValueDisruptor in .NET can handle nearly half a billion items per second per core. The Java version is what is typically used to run the actual exchanges (no value types), so we can go even faster if we needed to without moving to some exotic compute or GPU technology.
It's so sad to see how many people not knowing how incredibly fast our CPUs are
> I look at cross core communication as a 100x latency penalty
if your workload is majority cpu-bound then this is true, sometimes, and at best
most workloads are io (i.e. syscall) bound, and io/syscall overhead is >> cross-core communication overhead
That's fine, but a work-stealing scheduler doesn't redistribute work willy-nilly. Locally-submitted tasks are likely to remain local, and are generally stolen when stealing does pay off. If everything is more-or-less evenly distributed, you'll get little or no stealing.
That's not to say it's perfect. The problem is in anticipating how much workload is about to arrive and deciding how many worker threads to spawn. If you overestimate and have too many worker threads running, you will get wasteful stealing; if you're overly conservative and slow to respond to growing workload (to avoid over-stealing), you'll wait for threads to spawn and hurt your latencies just as the workload begins to spike.
There’s secondary costs though - because you might run on any thread you have to sprinkle atomics and/or mutexes all over the place (in Rust parlance the tasks spawned must be Send) which have all sorts of implicit performance costs that stack up even if you never transfer the task.
In other words, you could probably easily do 10m op/s per core on a thread per core design but struggle to get 1m op/s on a work stealing design. And the work stealing will be total throughput for the machine whereas the 10m op/s design will generally continue scaling with the number of CPUs.
An occasional successful CAS (on an owned cache line) has very little cost, but if you have to sprinkle atomics/mutexes all over the place, then there's something that's clearly not scalable in your design regardless of the concurrency implementation (you're expecting contention in a lot of places).
An atomic add on a 6ghz high end desktop CPU (13900) is I believe on the order of 4-10ns. If it’s in your hot path your hot path can’t go faster than 50-100 million operations/s - that’s the cost of 1 such instruction in your hotpath (down from the 24 billion non-atomic additions your 6ghz could do otherwise). A CAS brings this down to ~20-50 Mops/s. So it’s quite a meaningful slowdown if you actually want to use the full throughput of your CPU. And if that cache line is cached on another CPU you pay an additional hidden latency that could be anywhere from 40-200ns further reducing your hotpath to a maximum of 5-25MHz (and ignoring secondary effects of slowing down those cores without them even doing anything). God forbid there’s any contention - you’re looking at a variance of 20x between the optimal and worst case of how much of a throughput reduction you see by having a single CAS in your hot loop. And this is just talking about the task scheduler - at least in Rust you’ll need to have thread-safe data structures being accessed within the task itself - that’s what I was referring to as “sprinkled”. If you really want to target something running at 10Mops/s on a single core, I don’t think you can possibly get there with a task stealing approach.
Is that best case latency? e.g., with only one thread adding to that location?
What about using something like https://github.com/judofyr/spice?
These aren’t task queues as are being discussed here. It’s more like rayon - I have a par_iter and I want that to go as fast as possible on a large number of elements. Slightly different use case than thread per core vs work stealing runtime.
I was with a similar assumption that thread per core might be the best approach for one of my OpenSource Rust libraries that is a Workflow Orchestration engine. The engine is focused on payment processing. The perv version had thread local engine and focused on thread per core. When I moved to a pure async based engine using tokio runtime and all underlying libraries made thread safe, it improved the performance 2x. The entire workload being fully CPU driven with no IO. I was assuming tokio mostly does better only for IO based workloads, however my tests proved me wrong. Now am not moving away from async approach. https://github.com/GoPlasmatic/dataflow-rs
> If everything constantly depends on the prior action you will never get any uplift.
I mean... that's kind of a pathological case, no?
I'd say it's pretty normal for a workflow. If you have a lot of things that can proceed independently of each other, you're likely to see that characterized as "multiple workflows".
Say you're making a four-course meal. In the abstract, each course is independent of the other three, but internally the steps of its preparation have exactly this kind of dependence, where step 3 is scheduled after step 2 because doing those steps in the other order will ruin the food.
If you ever want to make just one of those courses -- maybe you're going to a potluck -- now you've got an almost fully sequential workflow.
(And in practice, the full four-course meal is much more sequential than it appears in the abstract, because many of the steps of each course must contend for scarce resources, such as the stove, with steps of other courses.)
The thing with GPUs is that for many problems really dumb and simple algorithms (think bubble sort equivalent) are many times faster than very fancy CPU algorithms (think quick sort equivalent). Your typical non-neural-network GPU algorithm is rarely using more than 50% of it's power, yet still outperforms carefully written CPU algorithms.
That is application of the formula
Pre-work time + pack up time + send time + unpack time + work time + pack up time + send time + unpack time + post-work time.
All remote work has these properties. Even something 'simple' like a remote REST call. If 'remote work time' plus all that other stuff is less than your local calls then it is time wise worth sending it remote. If not local CPU would win.
That in many cases right now the GPU is 'winning' that race.
There are some neat tricks to remove almost all the pack and unpack time. Apache Arrow can help a ton there (uses the same data format on both CPU and GPU or other accelerator). And on some unified memory systems even the send time can be very low.
Except it is only worth doing, if when taking into account loading data into the GPU and getting the results back, is still faster than total execution on the CPU.
It doesn't help that GPU beats the CPU in compute, if a plain SIMD approach outperforms the total execution time.
Especially if you're saving watts in the process. And not utilizing a capital-intensive asset.
The "GPU as accelerator" vs. "GPU-native software" split. The former usually results in or from poor, generic architectures.
> If everything constantly depends on the prior action you will never get any uplift.
Not always. For differential equations with large enough matrices, the independent work each core can do outperforms the communication overhead of core-to-core latency.
If its independent work then it's work that doesn't rely on the prior action... At least not in the way the parent means.
State can depend on the previous time point, or even the same time point. I see this misconception often in audio programming "you cannot parallelise work because it depends on the previous sample". As long as you can find parallelism somewhere and it's less than the overhead, you can benefit. Obviously if there's zero parallelism in the problem, no amount of cores will help.
I've worked on several thread-per-core systems that were purpose-built for extreme dynamic data and load skew. They work beautifully at very high scales on the largest hardware. The mechanics of how you design thread-per-core systems that provide uniform distribution of load without work-stealing or high-touch thread coordination have idiomatic architectures at this point. People have been putting thread-per-core architectures in production for 15+ years now and the designs have evolved dramatically.
The architectures from circa 2010 were a bit rough. While the article has some validity for architectures from 10+ years ago, the state-of-the-art for thread-per-core today looks nothing like those architectures and largely doesn't have the issues raised.
News of thread-per-core's demise has been greatly exaggerated. The benefits have measurably increased in practice as the hardware has evolved, especially for ultra-scale data infrastructure.
Are there any resources/learning material about the more modern thread-per-core approaches? It’s a particular area of interest for me, but I’ve had relatively little success finding more learning material, so I assume there’s lots of tightly guarded institutional knowledge.
Unfortunately, not really. I worked in HPC when it was developed as a concept there, which is where I learned it. I brought it over into databases which was my primary area of expertise because I saw the obvious cross-over application to some scaling challenges in databases. Over time, other people have adopted the ideas but a lot of database R&D is never published.
Writing a series of articles about the history and theory of thread-per-core software architecture has been on my eternal TODO list. HPC in particular is famously an area of software that does a lot of interesting research but rarely publishes, in part due to its historical national security ties.
The original thought exercise was “what if we treated every core like a node in a supercomputing cluster” because classical multithreading was scaling poorly on early multi-core systems once the core count was 8+. The difference is that some things are much cheaper to move between cores than an HPC cluster and so you adapt the architecture to leverage the things that are cheap that you would never do on a cluster while still keeping the abstraction of a cluster.
As an example, while moving work across cores is relatively expensive (e.g. work stealing), moving data across cores is relatively cheap and low-contention. The design problem then becomes how to make moving data between cores maximally cheap, especially given modern hardware. It turns out that all of these things have elegant solutions in most cases.
There isn’t a one-size-fits-all architecture but you can arrive at architectures that have broad applicability. They just don’t look like the architectures you learn at university.
I'll toss $20-50 your way to bump up the priority on writing that knowledge down, only strings are it has to actually get done and be publicly available
> Writing a series of articles about the history and theory of thread-per-core software architecture has been on my eternal TODO list
Your past has already been super interesting, so if you ever do get around to writing this, I’d be very excited to read it!
As someone with workloads that can benefit from these techniques, but limited resources to put them into practice, my working thesis has been:
* Use a multi-threaded tokio runtime that's allocated a thread-per-core * Focus on application development, so that tasks are well scoped / skewed and don't _need_ stealing in the typical case * Over time, the smart people working on Tokio will apply research to minimize the cost of work-stealing that's not actually needed. * At the limit, where long-lived tasks can be distributed across cores and all cores are busy, the performance will be near-optimal as compared with a true thread-per-core model.
What's your hot take? Are there fundamental optimizations to a modern thread-per-core architecture which seem _impossible_ to capture in a work-stealing architecture like Tokio's?
A core assumption underlying thread-per-core architecture is that you will be designing a custom I/O and execution scheduler that is purpose-built for your software and workload at a very granular level. Most expectations of large performance benefits follow from this assumption.
At some point, people started using thread-per-core style while delegating scheduling to a third-party runtime, which almost completely defeats the purpose. If you let tokio et al do that for you, you are leaving a lot of performance and scale on the table. This is an NP-Hard problem; the point of solving it at compile-time is that it is computationally intractable for generic code to create a good schedule at runtime unless it is a trivial case. We need schedulers to consistently make excellent decisions extremely efficiently. I think this point is often lost in discussions of thread-per-core. In the old days we didn’t have runtimes, it was just assumed you would be designing an exotic scheduler. The lack of discussion around this may have led people to believe it wasn’t a critical aspect.
The reality that designing excellent workload-optimized I/O and execution schedulers is an esoteric, high-skill endeavor. It requires enormous amounts of patience and craft, it doesn’t lend itself to quick-and-dirty prototypes. If you aren’t willing to spend months designing the many touch points for the scheduler throughout your software, the algorithms for how events across those touch points interact, and analyzing the scheduler at a systems level for equilibria and boundary conditions then thread-per-core might not be worth the effort.
That said, it isn’t rocket science to design a reasonable schedule for software that is e.g. just taking data off the wire and doing something with it. Most systems are not nearly as complex as e.g. a full-featured database kernel.
The ScyllaDB team wrote a bunch (and p99conf is on today)
If I remember correctly, these work stealing task schedulers started getting pushed around the mid 2000s as a result of Intel failing to scale the Pentium 4 architecture to expected single-thread performance levels.
Libraries like .NET's Task Parallel Library or Intel Threaded Building Blocks pretty much cemented these work-stealing task architectures. It's not that they didn't work well enough, but Intel Core came along, and single-threaded perf scaling was possible again, so these libraries became less of a focus.
It seems multi-core interest is back.
I feel I'm still doing it the old 2010 way, with all my hand-crafted dpdk-and-pipelines-and-lockless-queues-and-homemade-taskgraph-scheduler, any modern reference (apart from 'use seastar' ? ... which fair if it fills your needs) ?
There are no hard rules; use principles flexibly.
That being said, there are some things that are generally true for the long term: use a pinned thread per core, maximize locality (of data and code, wherever relevant), use asynchronous programming if performance is necessary. To incorporate the OP, give control where it's due to each entity (here, the scheduler). Cross-core data movement was never the enemy, but unprincipled cross-core data movement can be. If even distribution of work is important, work-stealing is excellent, as long as it's done carefully. Details like how concurrency is implemented (shared-state, here) or who controls the data are specific to the circumstances.
I did mass scale performance benchmarking on highly optimized workloads using lockfree queues and fibers, and locking to a core almost never was faster. There were a few topologies where it was, but they were outliers.
This was on a wide variety of intel, AMD, NUMA, ARM processors with different architectures, OSes and memory configurations.
Part of the reason is hyper threading (or threadripper type archs) but even locking to groups wasn’t usually faster.
This was even moreso the case when you had competing workloads stealing cores from the OS scheduler.
Most high-performance workloads are limited by memory-bandwidth these days. Even in HPC that became the primary bottleneck for a large percentage of workloads in the 2000s. High-performance data infrastructure is largely the same. You can drive 200 GB/s of I/O on a server in real systems today.
The memory-bandwidth bound cases is where thread-per-core tends to shine. It was the problem in HPC that thread-per-core was invented to solve and it empirically had significant performance benefits. Today we use it in high-scale databases and other I/O intensive infrastructure if performance and scalability are paramount.
That said, it is an architecture that does not degrade gracefully. I've seen more thread-per-core implementations in the wild that were broken by design than ones that were implemented correctly. It requires a commitment to rigor and thoroughness in the architecture that most software devs are not used to.
I think workload might be as (if not more) the factor than the uniqueness of the topology itself for how much pinning matters. If your workload is purely computationally limited then it doesn't matter. Same if it's actually I/O limited. If it's memory bandwidth limited then it depends on things like how much fits in per core cache vs shared cache vs going to RAM, and how is RAM actually fed to the cores.
A really interesting niche is all of the performance considerations around the design/use of VPP (Vector Packet Processing) in the networking context. It's just one example of a single niche, but it can give a good idea of how both "changing the way the computation works" and "changing the locality and pinning" can come together at the same time. I forget the username but the person behind VPP is actually on HN often, and a pretty cool guy to chat with.
Or, as vacuity put it, "there are no hard rules; use principles flexibly".
Thanks for sharing. Aside from what the other replies to you have shared, I admittedly have less experience, and I'm mainly interested in the OS perspective. Balancing global and local optimizations is hard, so the OS deserves some leeway, but as I see it, mainstream OSes tend to be awkward no matter what. It's long past time for OS schedulers to consider high-level metadata to get a rough idea of the idiosyncrasies of the workload. In the extreme case, designing the OS from the ground up to minimize cross-core contention[0] gives the most control, maximizing potential performance. As jandrewrogers says in a sibling reply, this requires a commitment to rigor, treacherous and nonportable as it is. In any case, with improved infrastructure ("with sufficiently smart compilers"...), thread-per-core gains power.
[0] https://news.ycombinator.com/item?id=45651183
What type of workloads?
Context switches (when you change the thread running on a specific core) is one of the most computational expensive things computers do. If somehow you can't use a threadpool and some sort of task abstraction, you probably shouldn't be doing anything with multiple threads or asynchronous code.
I have absolutely no idea why anyone would think breaking the thread per core model is better and I seriously question the knowledge of anyone proposing another model without some VERY good explanation. The GP isn't even close to this in any way.
Changing task is some fraction as bad as changing thread because less state is changed, but some state is still changed. For example, if you run unrelated tasks, they all start with cold caches. It might not clear the IBPB, TLB etc for security because it doesn't go through the kernel, but if the task was completely unrelated, none of those caches were helping with the transition anyway. Usually, the task is related to some small degree.
> a task can yield, which, conceptually, creates a new piece of work that gets shoved onto the work queues (which is "resume that task"). You might not think of it as "this task is suspended and will be resumed later" as much as *"this piece of work is done and has spawned a new piece of work."*
Never thought of it that way, but it’s indeed true — a new task does get enqueued in that case. Thanks for the insight!
Morsel driven parallelism is working great in DuckDB, KuzuDB and now Ladybug (fork of Kuzu after archival).
Also Umbra / CedarDB.
Async etc is also a function of dynamic work loads sometimes exasperated by the fact socket/channel A is slow so while waiting there deal with channels b,c,d,.. which are also slow for various reasons.
Per core threads and not much else are fairly required for nyse, trading, oms, and i bet things like switches. A web browser might be their polar opposite.
An interesting observation:
"At that time, ensuring maximum CPU utilization was not so important, since you’d typically be bound by other things, but things like disk speed has improved dramatically in the last 10 years while CPU speeds have not."
I'm going to quibble with that observation. CPUs HAVE improved dramatically in the past 10 years. It just doesn't look dramatic if your comparison point is storage speed.
Many runtimes and OS APIs have the possibility to attach decisions to which threads on which cores get used.
Java, .NET, Delphi, and C++ co-routines, all provide mechanisms to provide our own scheduler, which can then be used to say what goes where.
Maybe cool languages should look more into the ideas of these not so cool our parents ecosystems kind of languages. There are some interesting ideas there.
Isn't this what Erlang/Elixir BEAM is all about?
How so? AFAIK BEAM is pretty much agnostic between work-stealing and work-sharding* architectures.
* I prefer the term "work-sharding" over "thread-per-core", because work-stealing architectures usually also use one thread per core, so it tends to confuse people.
The BEAM schedulers are work stealing, and there's no way to bind a process to a scheduler (or at least, there's no publically documented way in upstream OTP).
You can adjust some settings for how schedulers work with respect to balancing load, but afaik, work stealing cannot be disabled... when a scheduler has no runnable processes, it will look at the runqueue of another scheduler and steal a runnable process if any are availabke (in priority order).
It does default to one 'cpu scheduler' per cpu thread, plus some i/o schedulers and maybe some dirty schedulers.
No not at all. Seastar framework and others like it is what this is about.