
WebAssembly techniques to speed up matrix multiplication

251 points | 2 years ago | jott.live
phkahler2 years ago

Another technique is to transpose the left matrix so each dot product is scanned in row-order and hence more cache friendly.
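
A minimal sketch of the transpose idea (not the commenter's code), written against the article's flat row-major layout with A as M x K, B as K x N and C as M x N; here the operand that would otherwise be walked column-wise is copied into transposed form so the inner dot product reads both inputs sequentially:

  // Pre-transposing costs O(N*K); the multiply costs O(M*N*K), so the extra
  // pass is cheap for large matrices.
  #include <cstddef>
  #include <vector>

  void matmul_bt(const float* a, const float* b, float* c, int M, int N, int K) {
    std::vector<float> bt(static_cast<std::size_t>(N) * K);  // B^T, stored N x K
    for (int k = 0; k < K; ++k)
      for (int n = 0; n < N; ++n)
        bt[n * K + k] = b[k * N + n];

    for (int m = 0; m < M; ++m)
      for (int n = 0; n < N; ++n) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)               // both reads are now row-order
          acc += a[m * K + k] * bt[n * K + k];
        c[m * N + n] = acc;
      }
  }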

Another one I tried ages ago is to use a single loop counter and "de-interleave" the bits to get what would normally be 3 distinct loop variables. For this you need to modify the entry in the result matrix rather than having it write-only. It has the effect of accessing like a z-order curve but in 3 dimensions. It's a bit of overhead, but you can also unroll say 8 iterations (2x2x2) which helps make up for it. This ends up making good use of both caches and even virtual memory if things don't fit in RAM. OTOH it tends to prefer sizes that are a power of 2.
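
A rough sketch of the de-interleaving trick (my reconstruction, not the original code), assuming square power-of-two matrices in the same flat layout; a single counter is split into the three loop coordinates, giving a 3-D z-order walk over the iteration space:

  #include <cstdint>

  // Extract every 3rd bit of t, starting at bit 'offset'.
  static inline uint32_t deinterleave3(uint64_t t, unsigned offset) {
    uint32_t v = 0;
    for (unsigned bit = 0; offset + 3 * bit < 64; ++bit)
      v |= static_cast<uint32_t>((t >> (offset + 3 * bit)) & 1u) << bit;
    return v;
  }

  // dim must be a power of two; c must be zero-initialized, because each result
  // entry is accumulated across many visits instead of being written once.
  void matmul_zorder(const float* a, const float* b, float* c, uint32_t dim) {
    uint64_t total = static_cast<uint64_t>(dim) * dim * dim;
    for (uint64_t t = 0; t < total; ++t) {
      uint32_t m = deinterleave3(t, 0);
      uint32_t n = deinterleave3(t, 1);
      uint32_t k = deinterleave3(t, 2);
      c[m * dim + n] += a[m * dim + k] * b[k * dim + n];
    }
  }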

mynameismon2 years ago

Another very very interesting optimisation can be found in these lecture slides [0]. (Scroll to slide 28, although the entire slide deck is amazing)

[0]: https://ocw.mit.edu/courses/electrical-engineering-and-compu...

infogulch2 years ago

Loop order and in-memory striding choice in general is a pretty interesting. I worked on an old voxel-based game engine that used a very complicated on-disk storage format. It seemed.. unnecessary, so I experimented with very simplified strategy of literally writing out voxel color bytes to a stream and gzipping it, which turned out to be both faster and smaller in my application. iirc it was like 10x smaller but only if I used the right striding order.

I knew that the striding order benefits would only work on voxel data similar to the data I was testing on, and I wanted to generalize it so that it could be configured dynamically so each dataset could use the best striding order for that particular data. I tried setting up numpy ndarray so that you could configure the memory striding order independently of the array indices so that my code wouldn't have to change if the striding order changed, but I could never get it to work and unfortunately I wasn't able to convince numpy folks I interacted with that this feature would be helpful.
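
A hypothetical sketch of that kind of decoupling (not the engine's code, and nothing to do with numpy internals): a tiny voxel view whose logical (x, y, z) indices stay fixed while the physical striding order is configured per dataset, so the calling code never changes when the layout does:

  #include <array>
  #include <cstddef>
  #include <vector>

  struct VoxelView {
    std::vector<unsigned char> data;
    std::array<std::ptrdiff_t, 3> stride;  // elements to step per x, y, z

    // dims = extents per axis; order = permutation of {0,1,2}, fastest axis last.
    VoxelView(std::array<std::size_t, 3> dims, std::array<int, 3> order)
        : data(dims[0] * dims[1] * dims[2]) {
      std::ptrdiff_t s = 1;
      for (int i = 2; i >= 0; --i) {
        stride[order[i]] = s;
        s *= static_cast<std::ptrdiff_t>(dims[order[i]]);
      }
    }

    unsigned char& at(std::size_t x, std::size_t y, std::size_t z) {
      return data[x * stride[0] + y * stride[1] + z * stride[2]];
    }
  };

Writing the voxels out in whatever order matches the chosen striding then gives long sequential runs, which is the part that compresses well.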

gfd2 years ago

I really like these set of lecture notes for optimizing matrix multiplication: https://ppc.cs.aalto.fi/ch2/v7/ (The transpose trick is used in v1)

eigenvalue2 years ago

I find it surprising that, even after using all those tricks, they are still only able to achieve around 50% of the theoretical peak performance of the chip in terms of GFLOPS. And that's for matrix multiplication, which is a nearly ideal case for these techniques.

gfd2 years ago

Hmm where did you see that? In the graph at the bottom of v7, it got "93% of the theoretical maximum performance"

progbits2 years ago

This deserves its own submission, wonderful resource!

cerved2 years ago

You can get extremely creative in optimizing matrix multiplication for cache and SIMD.

jacobolus2 years ago

For contexts like the web, also check out cache-oblivious matrix multiplication https://dspace.mit.edu/bitstream/handle/1721.1/80568/4355819...
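
The heart of the cache-oblivious approach is just recursive splitting of the largest dimension, so every cache level gets reuse without ever knowing its size. A rough sketch (not the paper's code; the base-case threshold is arbitrary), with explicit row strides and an accumulating C that must start zeroed:

  void matmul_rec(const float* a, const float* b, float* c,
                  int M, int N, int K, int lda, int ldb, int ldc) {
    if (M + N + K <= 48) {                      // small enough: plain loops
      for (int m = 0; m < M; ++m)
        for (int k = 0; k < K; ++k)
          for (int n = 0; n < N; ++n)
            c[m * ldc + n] += a[m * lda + k] * b[k * ldb + n];
      return;
    }
    if (M >= N && M >= K) {                     // split A and C by rows
      matmul_rec(a, b, c, M / 2, N, K, lda, ldb, ldc);
      matmul_rec(a + (M / 2) * lda, b, c + (M / 2) * ldc,
                 M - M / 2, N, K, lda, ldb, ldc);
    } else if (N >= K) {                        // split B and C by columns
      matmul_rec(a, b, c, M, N / 2, K, lda, ldb, ldc);
      matmul_rec(a, b + N / 2, c + N / 2, M, N - N / 2, K, lda, ldb, ldc);
    } else {                                    // split the reduction dimension
      matmul_rec(a, b, c, M, N, K / 2, lda, ldb, ldc);
      matmul_rec(a + K / 2, b + (K / 2) * ldb, c, M, N, K - K / 2, lda, ldb, ldc);
    }
  }

For the article's flat layout this would be called as matmul_rec(a, b, c, M, N, K, K, N, N).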

dralley2 years ago

The compiler will sometimes do this transpose for you, but as with all compiler optimizations it might sometimes break.

melissalobos2 years ago

That sounds very interesting, is there anywhere I can read more about this optimization? I didn't know any compiler could do optimizations like that.

kanaffa123452 years ago

there is no way a general purpose compiler will figure this out. op is probably talking about something like halide or tvm or torchscript jit.

phkahler2 years ago

The transpose is irrelevant after moving on to z-order access patterns and the 3D equivalent. It's not quite possible to reach peak sustained performance that way, but it's well worth the trade in simplicity IMO.

38362936482 years ago

Assuming you're only using the left matrix once, surely the action of transposing it would slow it down so much that any gains are lost

hddqsb2 years ago

Matrix multiplication is O(n^3), while transposing is O(n^2). So for a large matrix, transposing will only take a tiny fraction of the time spent in multiplication. (But the loop reordering trick mentioned in @mynameismon's link is a very nice alternative.)
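
For reference, a sketch of that loop-reordering alternative against the article's flat layout (C must start zeroed): swapping the two inner loops from (m, n, k) to (m, k, n) makes the innermost accesses to both b and c unit-stride, so no transpose pass is needed at all.

  void matmul_mkn(const float* a, const float* b, float* c, int M, int N, int K) {
    for (int m = 0; m < M; ++m)
      for (int k = 0; k < K; ++k) {
        float a_mk = a[m * K + k];       // reused across the whole inner loop
        for (int n = 0; n < N; ++n)
          c[m * N + n] += a_mk * b[k * N + n];
      }
  }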

kanaffa123452 years ago

you don't transpose it before the matmul, you always have it transposed (i.e., when you print the weights of a linear layer in pytorch, you're actually seeing (A^t)^t and what's stored is A^t).

zwieback2 years ago

Very cool!

My question: will this kind of thing become more mainstream? I've seen the web emerge, go from static pages to entire apps being delivered and executed in the browser. The last bastion of native apps and libraries seems to be highly optimized algorithms but maybe those will also migrate to a deliver-from-the-web and execute in some kind of browser sandbox.

Java promised to deliver some version of native code execution but the Java app/applet idea never seemed to take off. In some ways it seems superior to what we have now but maybe the security concerns we had during that era held Java back too much. Or am I misunderstanding what WebAssembly can bring to the game?

halpert2 years ago

The web always feels terrible compared to a native app, especially on a phone. Some of the difference is due to Safari being a bad browser (probably intentional), but a bigger part is that the threading model makes it really difficult to have a responsive UI. Not to mention the browser’s gestures often clash with the application’s gestures.

arendtio2 years ago

> The web always feels terrible compared to a native app

'always' is certainly not true. Yes, with modern frameworks it is very easy to build websites which are slow. But it is also possible to build websites with butter smooth animations and instant responses.

I hope that in the future we will get frameworks that make it easier to create lightweight web apps, so that we will see more high performance apps.

halpert2 years ago

I made another comment further down, but basically web apps can’t run as fast as native apps for a variety of reasons.

One reason is the thread model. There are two main threads that need to be synchronized (browser main thread and JS main thread) which will always be slower than a single main thread.

Another reason is that layout and measurement of elements in HTML is really complicated. Native apps heavily encourage deferred measurement which lets the app measure and lay itself out once per render pass. In JavaScript, layouts may need to happen immediately based on what properties of the dom you’re reading and setting.

eyelidlessness2 years ago

I generally find Safari’s performance better than other browsers. Also, every mainstream JS runtime is multithreaded. There are limitations on what can be shared between threads, but you can optimize a lot despite those limitations (including using WASM on the web, and native extensions/FFI on Node/Deno).

halpert2 years ago

On iOS, the perf of Safari is definitely better than every other browser, because every other browser is mandated to use WkWebView by Apple. They aren’t allowed to implement their own engine. Of course Apple isn’t subject to the same restriction.

jacobolus2 years ago

On a Mac, Safari Javascript generally outperforms Chrome and Firefox (other browsing tasks are also generally better performing), but there are some workloads where Safari turns out slower, especially when the developer has put a lot of work into Chrome-specific optimization.

Safari also generally uses a lot less memory and CPU for the same websites. Chrome in particular burns through my battery very quickly, and is basically completely incapable of keeping up with my browser use style (it just crashes when I try to open a few hundred tabs). Presumably nobody with authority at Google is a heavy laptop web-user or prioritizes client-side resource use: Google’s websites are also among the biggest browser resource hogs, even when sitting idle in a background tab.

Safari often takes a couple years longer than other browsers to implement cutting-edge features. This seems to me like a perfectly reasonable design decision; some web developers love complaining about it though, and some sites that were only developed against the most recent versions of Chrome don’t work correctly in Safari.

amelius2 years ago

Soon, other browsers can simply run themselves inside WASM which then runs inside a WkWebView :)

Uehreka2 years ago

In my experience, if the thing I’m working on runs in Safari, it’s buttery smooth. And if it doesn’t run, it completely shits the bed. Stuff like “we don’t support that way of doing shadows in SVG, so rather than simply not implement that property, we’ve turned your entire SVG element black”.

johncolanduoni2 years ago

Every mainstream JS runtime has some sort of background garbage collection and parsing/compilation, but for any execution that has to interact with JavaScript heap or stack state you’re still in single threaded territory. SharedArrayBuffer can help if you’re willing to give up on JavaScript objects entirely, and for WebAssembly this is less of a burden, but that’s not going to help you perform rendering on most web apps concurrently. JSCore goes a little bit further and can run javascript in the same runtime instance on multiple threads with some significant limitations, but this isn’t exposed to developers on Safari.

jsheard2 years ago

WASM threads are available in all modern browsers now, including Safari. It's very early days in terms of ecosystem but we're steadily getting there.

halpert2 years ago

The issue isn’t so much having additional threads, it’s needing two main threads. The browser has one main thread for accepting user input and then dispatches the relevant user events to the JS main thread.

The browser can either dispatch the events asynchronously, leading to the events being handled in a noticeably delayed way, or the browser can block its main thread until the JS dispatch finishes, leading to fewer UI events being handled. Either way is an inferior experience.

danielvaughn2 years ago

Also aren't service workers technically multi-threading? That's been a thing for a while in the browser now.

jsheard2 years ago

Technically yeah, but the threads could only communicate through message passing which isn't ideal for performance. The more recent major improvement is for workers to be able to share a single memory space similar to how threading works in native applications.

nicoburns2 years ago

> Some of the difference is due to Safari being a bad browser (probably intentional), but a bigger part is that only having one thread makes it really difficult to have a responsive UI

Interestingly Safari is actually generally better than other browsers for running a responsive smooth UI. Not sure how much of that is the Safari engine and how much of it is better CPU on iPhones. But even on a first generation iPad or an iPhone 4 it was possible to get 60fps rendering fairly easily. The same could not be said for even higher end android phones of the time.

kitsunesoba2 years ago

Anecdotally, the only time I’ve had issues with unresponsiveness for pages in Safari is with sites that were written with Chrome specifically in mind.

halpert2 years ago

Really? Even something simple like Wordle feels janky with the way Safari’s chrome overlaps the keyboard.

javajosh2 years ago

The web solved software distribution. Full stop. There only remain the edge cases, and that's where webasm wants to help.

Sun/Java wanted badly to solve this problem, but tried to do too much too soon. Java gave devs in 1999 cutting edge OOP tools for doing GUIs (e.g. Swing) but distribution was always the problem. Installing and running browser plugins was always error prone, and it turned out the browser was itself just good enough to deliver value, so it won. (With the happy side-effect of giving the world-wide community of devs one of the gooeyist languages ever to express their fever dreams of what application code should be).

The question in my mind is whether there is enough demand for the kinds of software webasm enables, especially given that other routes of distribution (app stores) have filled in the gaps of what the web delivers, and are generally a lot more pleasant and predictable to native devs.

brrrrrm2 years ago

I'm not really equipped to predict anything, but I think recent surge in popularity of simple RISC-y[1] architectures like ARM will allow the WebAssembly standard to stay small yet efficient. I'm hopeful, but standards often have a way of not keeping up with the newest technology so we'll see.

[1] https://en.wikipedia.org/wiki/Reduced_instruction_set_comput...

pjscott2 years ago

I don't expect ARM to have much effect here. The wasm instruction set is basically a low-level compiler intermediate representation, and to go from that to machine code for x86 or ARM is about equally difficult in both cases.

pcwalton2 years ago

I really like this writeup. Note that it may not be worth using the SIMD in this way (horizontal SIMD) if you know you will be multiplying many matrices that are the same size. It may be better to do vertical SIMD and simply perform the scalar algorithm on 4 or 8 matrices at a time, like GPUs would do for vertex shaders. This does mean that you may have to interleave your matrices in an odd way to optimize memory access, though.
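
A sketch of that vertical layout (assumptions on my part: four independent 4x4 matrices, interleaved lane-wise; this is not the article's code), using the WASM SIMD intrinsics from Emscripten's <wasm_simd128.h>:

  #include <wasm_simd128.h>

  // a, b, c each hold four interleaved 4x4 matrices:
  // element (m, n) of matrix L lives at index (m * 4 + n) * 4 + L.
  void matmul4x4_x4(const float* a, const float* b, float* c) {
    for (int m = 0; m < 4; ++m)
      for (int n = 0; n < 4; ++n) {
        v128_t acc = wasm_f32x4_splat(0.0f);
        for (int k = 0; k < 4; ++k) {
          v128_t va = wasm_v128_load(a + (m * 4 + k) * 4);  // lane L = matrix L
          v128_t vb = wasm_v128_load(b + (k * 4 + n) * 4);
          acc = wasm_f32x4_add(acc, wasm_f32x4_mul(va, vb));
        }
        wasm_v128_store(c + (m * 4 + n) * 4, acc);
      }
  }

The scalar algorithm stays untouched; the speedup comes entirely from solving four problems per lane, which is why the interleaved memory layout matters.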

marginalia_nu2 years ago

It's kind of bizarre how it's an accomplishment to get your code closer to what the hardware is capable of. In a sane world, that should be the default environment you're working in. Anything else is wasteful.

Kilenaitor2 years ago

There's always been a tradeoff in writing code between developer experience and taking full advantage of what the hardware is capable of. That "waste" in execution efficiency is often worth it for the sake of representing helpful abstractions and generally helping developer productivity.

The real win here is when we can have both because of smart toolchains that can transform those high-level constructs and representations into the most efficient implementation for the hardware.

Posts like this demonstrate what's possible with the right optimizations so tools like compilers and assemblers are able to take advantage of these when given the high-level code. That way we can achieve what you're hoping for: the default being optimal implementations.

AnIdiotOnTheNet2 years ago

> That "waste" in execution efficiency is often worth it for the sake of representing helpful abstractions and generally helping developer productivity.

That's arguable at best. I for one am sick of 'developer productivity' being the excuse for why my goddamned supercomputer crawls when performing tasks that were trivial even on hardware 15 years older.

> The real win here is when we can have both because of smart toolchains that can transform those high-level constructs and representations into the most efficient implementation for the hardware.

That's been the promise for a long time and it still hasn't been realized. If anything things seem to be getting less and less optimal.

adamc2 years ago

No, it's really not even arguable. Lots and lots of software is written in business contexts where the cost of developing reliable code is a lot more important than its performance. Not everything is a commercial product aimed at a wide audience.

What you're "sick of" is mostly irrelevant unless you represent a market that is willing to pay more for a more efficient product. I use commercial apps every day that clearly could work a lot better than they do. But... would I pay a lot for that? No. They are too small a factor in my workday.

Saving money is part of engineering too.

bruce3434342 years ago

People have been sick of slow programs and slow computers since literally forever. I think you live in a bubble or are complacent.

No one I know has anything good to say about microsoft teams, for instance. And that's just one of the recent "desktop apps" which are actually framed browsers.

nicoburns2 years ago

> I for one am sick of 'developer productivity' being the excuse for why my goddamned supercomputer crawls when performing tasks that were trivial even on hardware 15 years older.

The problem here is developer salaries. So long as developers are as expensive as they are the incentive will be to optimise for developer productivity over runtime efficiency.

eternityforest2 years ago

CPU intensive tasks are highly optimized already.

As far as I can tell, a slow computer is due to swapping from having a bunch of stuff open, or an inherently heavy background task and an OS that doesn't know how to allocate resources.

Sometimes there's some kind of network access bogging everything down, now that SASS is everywhere(In which case, we need more abstractions and developer productivity to enable offline work!).

Sometimes things are slow because we are doing a lot of heavy media processing. That's not inefficiency, it's just inherently heavy work. In fact, simplicity might actively slow things even more, because what you really need for that stuff is to use the GPU, and the "bloated mega library" you're avoiding might already do that.

Android is kind of the proof. Android is always pretty fast. Anything 15yo hardware could do trivially, Android can probably do faster. And Android is not lightweight.

There may be some slowness out there caused by abstraction layers, but as far as I can tell without digging into the code of every slow app I've ever seen, the real problem is the keep it simple mindset that discourages even basic optimization tricks like caching, and the NIH that makes people write things themselves and assume it's faster just because it's smaller.

Plus, the UNIXy pipeline mentality that did a number on computing. GUI workflows are not pipelines and there's lots of things that are very messy to do in a pipe style model, like changing one thing and selectively recomputing only what needs changing.

The "Data processing" way of thinking leads to producing a batch of stuff, then passing it to something that computes with it, instead of working directly with a state.

lijogdfljk2 years ago

> when performing tasks that were trivial even on hardware 15 years older.

Did the software to perform those tasks stop working?

danieldk2 years ago

There's always been a tradeoff in writing code between developer experience and taking full advantage of what the hardware is capable of. That "waste" in execution efficiency is often worth it for the sake of representing helpful abstractions and generally helping developer productivity.

The GFLOP/s is 1/28th of what you'd get when using the native Accelerate framework on M1 Macs [1]. I am all in for powerful abstractions, but not using native APIs for this (even if it's just the browser calling Accelerate in some way) is just a huge waste of everyone's CPU cycles and electricity.

[1] https://github.com/danieldk/gemm-benchmark#1-to-16-threads

dr_zoidberg2 years ago

Wasteful of computing resources, yes, but for a long time we've been prioritizing developer time. That happens because you can get faster hardware cheaper than you can get more developer time (and not all developers' time is equal; say, Carmack can do in a few hours things I couldn't do in months).

I do agree that we'd get fantastic performance out of our systems if we had the important layers optimized like this (or more), but it seems few (if any) have been pushing in that direction.

marginalia_nu2 years ago

But we've got far more developers now, and by and large, we aren't producing more capable software than we did 20 years ago. It's gotten very complex, but that is by and large just a failure to keep things simple.

Spotify is slower to launch on my Ryzen 3900X than my operating system, and lacks many of the basic features that WinAmp had in 1998. Now you're thinking "Aha! But WinAmp just played mp3s, it didn't stream over the internet!" Yes it could. It was also, by and large, developed by one guy, over the course of a couple of months.

I don't know where this superior developer productivity is going, but it sure doesn't seem to be producing particularly spectacular results.

dr_zoidberg2 years ago

Back when I was younger we had to develop a few simulations in university, and we spent half the semester coding the building blocks in C. I was slightly good at it, and having seen my brother stumble a couple of years before, I knew I had to be careful with the data structures and general code layout to keep things simple and working.

As this was a group project, there were other hands on deck. One weekend I caught a nasty cold and I couldn't assist the weekly meeting to work on the project. Monday comes and we have to show our advances. The code was butchered. It took me a day to fix what had been broken (and keeping egos at bay, it would've been easier to just throw everything away and implement things from my last stable version).

Now I can fire up python, import numpy and scipy, and make far more complex simulations in a couple of minutes and a few lines of code. Sure, back then python and numpy did exist, I just didn't know about them. But you know what didn't exist 10-15 years ago? Pytorch, TensorFlow, JAX and all the deep learning ecosystem (probably scikit-learn did exstist, it's been around forever!). Thanks to those, I can program/train deep learning algorithms for which I probably wouldn't be able to code the lower-level abstractions to make them work. JAX comes with a numpy that runs on hardware "for free" (and there was PyCUDA before that if you had a compatible GPU).

That's the programmer productivity you're not seeing. Sure, these are very specific examples, but we have many more building blocks available to make interesting things without having to worry about the lower layers.

You can also complain about Electron being a bloated layer, and that's OK. There's your comparison about how Spotify is slow and Winamp is/was fast.

kortex2 years ago

That's kinda like saying Bob built a go-kart in his garage over a couple of months, it moves you from A to B, so I don't see why a Toyota Corolla needs a massive factory. Spotify's product isn't just a streaming media player. It's also all the infrastructure to produce the streams, at scale, for millions of users.

How often are you actually launching spotify? I start it once when I boot and that's it until my next reboot, weeks/months later.

Now you might of course ask, "why isn't the velocity 6554x that of winamp, even when correcting for non-eng staff, management, overhead, etc". Well, they probably simply aren't allocating that much to the client development.

Also often times one dev who knows exactly what he is doing can be way more effective than a team; no bikeshedding, communication, PRs, etc. What happens when they get hit by a bus?

terafo2 years ago

But you can't get faster hardware cheaper anymore. Not naively faster hardware anyways. You are getting more and more optimization opportunities nowadays though. Vectorize your code, offload some work to the GPU or one of the countless other accelerators that are present on modern SOC, change your I/O stack so you can utilize SSDs efficiently, etc. I think it's a matter of time until someone puts FPGA onto mainstream SOC, and the gap between efficient and mainstream software will only widen from that point.

dr_zoidberg2 years ago

You are precisely telling me the ways in which I can get faster hardware: GPU, accelerators, the I/O stack and SSDs, etc.

I agree that the software layer has become slow, crufty, bloated, etc. But it's still cheaper to get faster hardware (or wait a bit for it, see M1, Alder Lake, Zen 3, to name a few, and those are getting successors later this year) than to get a good programmer to optimize your code.

And I know that we'd get much better performance out of current (and probably future) hardware if we had more optimized software, but it's rare to see companies and projects tackle on such optimization efforts.

ska2 years ago

> Anything else is wasteful.

Everything has a cost. If the developer is a slave to machine architecture, development is slow and error prone. If the machine is a slave to an abstraction, everything will run slowly. Unsurprisingly, the real trick is finding the appropriate balance for your situation.

Of course you can make things worse, in both directions.

peterhunt2 years ago

The real issue here is that the hardware isn't capable of sandboxing without introducing tons of side channel attacks. Lots of applications are willing to sacrifice a lot of performance in order to gain the distribution advantages from a safe, sandboxed execution environment.

jacobr12 years ago

This is the real issue. The other comments here are talking about dev-productivity, but they have the causal chain backwards. The browser overhead is about running in a sandbox, on arbitrary/unknown hardware. Web development had poorer DX than desktop app development for a long time, until the critical mass of web-devs (driven by the distribution advantages) finally drove the DX investment to heavier on the web side.

Zababa2 years ago

On the other hand, in your sane world, productivity would be a fraction of what it currently is, for developers and users. You favor computer time over developer time. While computer time can be a proxy for user time, it isn't always, since developer time can be used to speed up user time too. A single-minded focus on computer time sounds like a case of throwing out metrics like developer time and user time because they are harder to measure than computer time. In any case, it sounds like a mistake to me.

Salgat2 years ago

Once you realize that it's a completely sandboxed environment that works on any platform, it's a lot more impressive.

marginalia_nu2 years ago

Well so is Java, and it's had near-native performance for well over a decade.

not2b2 years ago

In a sane world (which is the world that we live in), it's best to find a well-optimized library for common operations like matrix multiplication. But if you want to do something unusual (multiply large matrices inside a browser, quickly) you've exited the sane world so you'll have to work at it.

LeanderK2 years ago

I was hopeful WebAssembly will speed up the web. What's missing? Native Browser APIs? Native IO Apis?

Let's say I want to interactively plot some complicated function without slowing down the rest. Can I do this in WebAssembly now?

acdha2 years ago

> Let's say I want to interactively plot some complicated function without slowing down the rest. Can I do this in WebAssembly now?

You've been able to do that for a while now and it would likely be fast enough even in pure JavaScript. The things which tend to slow the web down come back to developers not caring about performance (and thus not even measuring) or the cross-purposes of things like ad sales where a significant amount of JavaScript is deployed to do things the user doesn't care about.

johndough2 years ago

Browsers will always lag behind desktop applications by a decade or two because everything is designed by committee and takes forever to arrive (see e.g. SSE, AVX, GPGPU compute). And even if everyone can eventually agree on and implement the smallest common denominator, hardware will already have evolved beyond that.

In addition, browsers have to fight all kinds of nefarious attackers, so it is a very hostile environment to develop in. For example, we can't even measure time accurately (or do proper multithreading with shared memory) in the browser thanks to the Spectre and Meltdown vulnerabilities. https://meltdownattack.com/

That being said, WebGL implements extremely gimped shaders. Yet, they are still more than enough to render all kinds of functions. For example, see https://www.shadertoy.com/ or https://glslsandbox.com/ which are huge collections of functions which take screen coordinates and time as input and compute pixel colors from that, i.e. f(x, y, t) -> (r, g, b). This might not sound very impressive at first glance, but people have been amazingly creative within this extremely constrained environment, resulting in all kinds of fancy 3D renderings.

sharikous2 years ago

> Let's say I want to interactively plot some complicated function without slowing down the rest. Can I do this in WebAssembly now?

Yep, with a Web Worker for the secondary thread. However, the environment is still young and it's difficult to use heavy computation libraries. Besides, for some reason SIMD instructions are present only for 128-bit units (2 doubles or 4 floats). Another problem is that, no matter what you do, it is a layer over the hardware, so it will be slower than specialized machine code (if what you do is not in the JS API).

LeanderK2 years ago

> difficult to use heavy computation libraries

what do you mean? What's the blocker? Something like numpy for js would fill this role, calling wasm in the background. Just the missing SIMD instructions? Some quick googling shows that one can't compile BLAS for wasm yet. This might be due to wasm64 not being available yet, I think? So would this help to tap into the existing ecosystem of optimised mathematical routines?

Ideally... I would leave js and just use python ;) It has the whole ecosystem at hand with numpy, scipy, statsmodels etc. But nobody is doing it and idk why. I think it might be due to Fortran not compiling to wasm.

sharikous2 years ago

Yes, I forgot about wasm64 not being available still. Yes, that's a big blocker.

About numpy for js I believe js is still not comfortable enough for this kind of use, especially with the lack of operator overloading.

Anyway there are some builds of BLAS (or equivalents) to wasm and even of python. Check out pyodide and brython

onion2k2 years ago

Let's say I want to interactively plot some complicated function without slowing down the rest

If you can draw your plot in a shader then you can do it in WebGL very easily. You'd only need to update the input uniforms and everything else would happen on the GPU.

VHRanger2 years ago

This'll end up inevitably as a WASM BLAS library

Which wouldn't be a bad thing

LeanderK2 years ago

Why can't we just compile BLAS to WASM? Isn't this the point of WASM?

brandmeyer2 years ago

The BLAS have historically relied on a wide range of microarchitecture-specific optimizations to get the most out of each processor generation. An ideal solution would be for the browser to provide that to the application in such a way that it is difficult to fingerprint.

See also the history of Atlas, GotoBLAS, Intel MKL, etc.

bee_rider2 years ago

libflame/BLIS might be a good starting point, they've created a framework where you bring your compute kernels, and they make them into a BLAS (plus some other nice functionality). I believe most of the framework itself is in C, so I guess that could somehow be made to spit out wasm (I know nothing about wasm). Then, getting the browser to be aware of the actual real assembly kernels might be a pain.

owlbite2 years ago

How does this compare to the native BLAS in the Accelerate library?

conradludgate2 years ago

Accelerate on the M1 is ridiculously fast (thanks to its special core set and specific instructions).

Some benchmarks I've done have it beating out CUDA on my RTX 2070. I still have to get a proper gflops number though.

danieldk2 years ago

It's going to absolutely blow this away. Here are some of my single precision GEMM benchmarks for the M1 and M1 Pro:

https://github.com/danieldk/gemm-benchmark#1-to-16-threads

tl;dr, the M1 can do ~1300 GFLOP/s and the M1 Pro up to ~2700GFLOP/s.

On the vanilla M1, that's 28 times faster than the best result in the post.

The difference (besides years of optimizing linear algebra libraries) is that Accelerate uses the AMX matrix multiplication co-processors through Apple's proprietary instructions.

bee_rider2 years ago

People are asking about BLAS already in various threads, but if they know the size of their matrices beforehand, it might be interesting to try EIGEN. EIGEN also has the benefit that it is an all-template C++ library so I guess it should be somehow possible to spit out the WASM (I know nothing about WASM).

Of course usually BLAS beats EIGEN, but for small, known-sized matrices, it might have a chance.
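
For what it's worth, the fixed-size Eigen version is only a few lines (a sketch, assuming Eigen compiles under the wasm toolchain, which is an open question here):

  #include <Eigen/Dense>

  using Mat16 = Eigen::Matrix<float, 16, 16>;  // dimensions known at compile time

  Mat16 multiply(const Mat16& a, const Mat16& b) {
    return a * b;  // fixed-size products can be unrolled/vectorized at compile time
  }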

MauranKilom2 years ago

Can you specify what "small" constitutes for you?

bee_rider2 years ago

Small enough that MKL doesn't optimize for it. Since GEMM is important, 16x16 or something silly like that I guess.

wdroz2 years ago

Since wasm supports threads, I wonder if you can speed up these operations even further by using multiple threads.

brrrrrm2 years ago

That's a good point: you certainly could. There's some fun exploration to be done with atomic operations.

The issue is that threaded execution requires cross-origin isolation, which isn't trivial to integrate. (Example server that will serve the required headers: https://github.com/bwasti/wasmblr/blob/main/thread_example/s...)
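
A rough sketch of what the threaded version could look like on the C++ side (my assumption, not wasmblr code): split the output rows across std::thread, which Emscripten maps onto Web Workers with shared memory when built with -pthread and served with the cross-origin isolation headers mentioned above.

  #include <algorithm>
  #include <thread>
  #include <vector>

  // Each worker owns a disjoint block of output rows, so no atomics are needed.
  void matmul_rows(const float* a, const float* b, float* c, int N, int K,
                   int row_begin, int row_end) {
    for (int m = row_begin; m < row_end; ++m)
      for (int n = 0; n < N; ++n) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
          acc += a[m * K + k] * b[k * N + n];
        c[m * N + n] = acc;
      }
  }

  void matmul_parallel(const float* a, const float* b, float* c, int M, int N, int K,
                       int num_threads) {
    std::vector<std::thread> workers;
    int chunk = (M + num_threads - 1) / num_threads;
    for (int t = 0; t < num_threads; ++t) {
      int begin = t * chunk;
      int end = std::min(M, begin + chunk);
      if (begin >= end) break;
      workers.emplace_back(matmul_rows, a, b, c, N, K, begin, end);
    }
    for (auto& w : workers) w.join();
  }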

ginko2 years ago

Isn't the idea of webassembly to compile native C/C++/Rust/Whatever code to be able to run in the browser?

Why not just compile OpenBLAS or another computer numerics library like that to WA?

remus2 years ago

I'm by no means an expert but my understanding is that a lot of the performance from libraries like openBLAS comes from targeting specific architectures (e.g. particular instruction sets on a series of processors). You can probably milk some more performance by targeting the web assembly architecture specifically (assuming openBLAS hasn't started doing similar themselves).

magoghm2 years ago

I tested it on my M1 Mac and it reached 46.78 Gigaflops, which is quite amazing for a CPU running at 3.2 GHz. Isn't that like an average of 14.6 floating point operations per clock cycle?

danieldk2 years ago

I hate to post this multiple times, but the M1 has a dedicated matrix co-processor, it can do matrix multiplication at >1300GFLOP/s if you use the native Accelerate framework (which uses the standard BLAS API). The M1 Pro/Max can even do double that (>2600 GFLOP/s) [1].

46.78 GFLOP/s is not even that great on non-specialized hardware. E.g., a Ryzen 5900X, can do ~150 GFLOP/s single-threaded with MKL.

[1] https://github.com/danieldk/gemm-benchmark#1-to-16-threads

lostmsu2 years ago

If you look at the comment above about the regular GEMM implementation, M1 actually can do that at 1.6 Teraflops.

bertman2 years ago

Very cool writeup!

Unfortunately, the bar graphs at the bottom of the article have different y-axis scaling, but even so:

It's sad how Firefox's performance pales in comparison to Chrome's.

bruce3434342 years ago

I don't understand the naming and notation of this article because the author is assuming context that I don't have.

Section baseline: What are N, M, K? 3 matrices or? Laid out as a flat array, or what? `c[m * N + n] += a[m * K + k] * b[k * N + n];`, ah, apparently a b and c are the matrices? How does this work?

Section body: What is the mathy "C′=αC+A⋅B"? derivative of a constant is the angle times the constant plus the dot product of A and B???

Please, if you write a public blog post, use your head. Not everybody will understand your terse notes.

engmgrmgr2 years ago

Not to be too snarky, but perhaps the onus is on you to do some homework if you want to understand a niche article for which you lack context?

Laying out matrices like that is pretty standard, especially for a post about vectorization.

bruce3434342 years ago

you're probably right, but at least a small prelude explaining the single character variable names would not be out of place.

wheelerof4te2 years ago

I admire people who can read and understand this.

Kilenaitor2 years ago

Which parts are you unable to read and understand? I'm sure some of us here could help explain if you have specific questions or hangups.

bruce3434342 years ago

I don't understand the naming and notation of this article because the author is assuming context that I don't have.

Section baseline: What are N, M, K? 3 matrices or? Laid out as a flat array, or what? `c[m * N + n] += a[m * K + k] * b[k * N + n];`, ah, apparently a b and c are the matrices? How does this work?

Section body: What is the mathy "C′=αC+A⋅B"? derivative of a constant is the angle times the constant plus the dot product of A and B???

conradludgate2 years ago

There are 3 matrices in question: A, B and C. They have dimensions (M * K), (K * N) and (M * N) respectively.

They are laid out not as nested arrays but as a single contiguous block of memory that can be interpreted as having a matrix shape. That's where the `m * N + n` comes from (m rows down and n columns in).

  C' = alpha C + A.B
This is the 'generalised matrix-matrix multiplication' (GEMM) operation. It multiplies the matrices A and B, adds the result to a scaled version of C, and writes it back into C. Setting alpha to 0 gives you a basic matmul.
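
Spelled out as a loop over the same flat layout, the whole operation is just (a sketch restating the explanation above, not code from the article):

  // C (M x N), A (M x K), B (K x N), each stored as one contiguous array.
  void gemm(float alpha, const float* a, const float* b, float* c,
            int M, int N, int K) {
    for (int m = 0; m < M; ++m)
      for (int n = 0; n < N; ++n) {
        float acc = alpha * c[m * N + n];   // scaled copy of the old C entry
        for (int k = 0; k < K; ++k)
          acc += a[m * K + k] * b[k * N + n];
        c[m * N + n] = acc;                 // alpha = 0 gives plain matmul
      }
  }
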
wheelerof4te2 years ago

Thank you for this detailed explanation. Making the matrix one-dimensional makes sense from the performance standpoint.

wheelerof4te2 years ago

The math stuff :)

JavaScript code is readable, at least.

Kilenaitor2 years ago

Is "the math stuff" all the optimizations being performed e.g. vectorizing multiplication?

Not trying to sound dismissive here but the core math the post is working with is actually a pretty straightforward matrix multiplication.

The bulk of the discussion focuses on optimizing the execution of that straightforward multiplication algorithm [triple-nested for loop; O(n^3)] rather than making algorithmic/mathematic optimizations.

And again, specific questions are easier to answer. :)

djur2 years ago

Matrix multiplication isn't exactly intuitive if you've never worked with it before.

ausbah2 years ago

shouldn't compilers handle stuff like this?

brrrrrm2 years ago

In an ideal world, absolutely! It's a hard problem and there are many attempts to make that happen automatically including polyhedral optimization (Polly[1]) and tensor compiler libraries (XLA[2] and TVM[3]). I work on a project called LoopTool[4] which is researching ways to dramatically reduce the representations of the other projects to simplify optimization scope.

[1] https://polly.llvm.org

[2] https://www.tensorflow.org/xla

[3] https://tvm.apache.org

[4] https://github.com/facebookresearch/loop_tool

visarga2 years ago

If they worked so well AMD would not be in such a bad position with their GPUs in ML. They would just need to compile to their arch.

juloo2 years ago

Another way is WebGL.

riddleronroof2 years ago

This is very cool
