
Arm AArch64 Adds Memcpy() Instructions

158 points | 3 years ago | community.arm.com
zibzab3 years ago

CISC jokes aside, this is an interesting turn of events.

Classic ARM had LDM/STM, which could load/store a list of registers. While very handy, it was a nightmare from a hardware POV. For example, it made error handling and rollback much, much more complex in out-of-order implementations.

ARMv8 removed those in AArch64 and introduced LDP/STP, which only handle two registers at a time (the P is for Pair, the M for Multiple). This made things much easier, but it seems the performance hit was not negligible.

Now with v8.8 and v9.3 we get this, which looks much nicer than Intel's ancient string instructions that have been around since the 8086. But I am curious how it affects other aspects of the CPU, especially those with very long and wide pipelines.
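
For reference, the new copy is split into prologue/main/epilogue instructions so that hardware of different widths can carve the work up differently. A rough sketch of how it might be inlined from C (GNU extended asm; the CPYF* mnemonics and operand syntax are my reading of the announcement, and you'd need a toolchain that accepts something like -march=armv8.8-a):

    #include <stddef.h>

    /* Sketch only: the FEAT_MOPS forward-copy triple. Destination, source
       and remaining length are all updated by the instructions themselves,
       which is what makes the sequence interruptible and
       implementation-flexible. */
    static inline void mops_memcpy(void *dst, const void *src, size_t n)
    {
        asm volatile("cpyfp [%0]!, [%1]!, %2!\n\t"
                     "cpyfm [%0]!, [%1]!, %2!\n\t"
                     "cpyfe [%0]!, [%1]!, %2!"
                     : "+r"(dst), "+r"(src), "+r"(n)
                     :
                     : "memory");
    }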

dvdkhlng3 years ago

Note that in ARM-based microcontrollers, LDM/STM also have a non-negligible impact on interrupt latency. These are defined in a way that they cannot be interrupted mid-instruction, so worst-case interrupt latency is higher than would be expected with a RISC CPU (especially if LDM/STM happen to hit a somewhat slower memory region).

AFAICS x86 "rep" prefixed instructions are defined so that they can in fact be interrupted without problems. The remaining count is kept in (e)cx, so just doing an iret into "rep stosb" etc. will continue its operation.

I think VIA's hash/aes instruction set extension also made use of the "rep" prefix and kept all encryption/hash state in the x86 register set, so that they could in fact hash large memory regions on a single opcode without hampering interrupts.
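
A minimal illustration of that restartability (x86-64 GNU inline asm; a sketch, not a tuned implementation): the destination and remaining count live in rdi/rcx, so an interrupt that returns to the instruction simply continues.

    #include <stddef.h>

    /* memset-like fill via "rep stosb": rdi = destination, rcx = count,
       al = fill byte. rdi and rcx are updated as the instruction runs,
       which is why it can be resumed after an interrupt. */
    static inline void rep_stosb(void *dst, unsigned char val, size_t n)
    {
        asm volatile("rep stosb"
                     : "+D"(dst), "+c"(n)
                     : "a"(val)
                     : "memory");
    }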

duskwuff3 years ago

> These are defined in a way that they cannot be interrupted mid-instruction...

Usually. Cortex-M3 and M4 cores allow LDM/STM to be interrupted by default, and offer a flag to disable that (SCB->ACTLR.DISMCYCINT).

https://developer.arm.com/documentation/ddi0439/b/System-Con...

dvdkhlng3 years ago

Yes, looks like they added a few bits to the PSR register to capture the internal state of LDM/STM . Like a small version of x86's CS register.

userbinator3 years ago

> AFAICS x86 "rep" prefixed instructions are defined so that they can in fact be interrupted without problems. The remaining count is kept in (e)cx, so just doing an iret into "rep stosb" etc. will continue its operation.

The 8086/8088 had a bug (one of very few!) where segment override prefixes were lost after an interrupted string instruction:

https://www.pcjs.org/documents/manuals/intel/8086/

I believe it was fixed in later versions.

JoeAltmaier3 years ago

I'm concerned this is another patch on a very difficult problem. There are something like 16 different combinations of source alignment, destination alignment, partial starting word, and partial ending word to handle in a memory-move operation. What's needed is an efficient move that does the right thing at runtime, which is to fetch the largest bus-limited chunks and align as it goes.

This includes a pipeline to re-align from source to destination; partial fill of the pipeline at the start and partial dump at the end; and page-sensitive fault and restart logic throughout.

Multiple versions of memcpy are suspicious to start with: is the compiler expected to know the alignment statically at code generation time? It might be called with arbitrary pointers. Alignment is best determined at runtime. Each pass through the same memcpy code may have different alignment, and so on.

Years ago I debugged the standard Linux copy on a RISC machine. It had a dozen bugs related to this. I remember thinking at the time that this should all be resolved at runtime by microcode inside the processor. It's been years now, and we get this. Sigh. It's a step anyway.
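
For what it's worth, the runtime-alignment idea looks roughly like this in C (a minimal sketch, ignoring the fault/restart machinery and source-realignment pipeline described above; the word copies also play fast and loose with strict aliasing, as memcpy implementations tend to):

    #include <stddef.h>
    #include <stdint.h>

    static void *memcpy_runtime_align(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;

        /* Head: copy bytes until the destination is word-aligned. */
        while (n && ((uintptr_t)d & (sizeof(uintptr_t) - 1))) {
            *d++ = *s++;
            n--;
        }
        /* Bulk: word copies, but only if the source is now aligned too. */
        if (((uintptr_t)s & (sizeof(uintptr_t) - 1)) == 0) {
            while (n >= sizeof(uintptr_t)) {
                *(uintptr_t *)(void *)d = *(const uintptr_t *)(const void *)s;
                d += sizeof(uintptr_t);
                s += sizeof(uintptr_t);
                n -= sizeof(uintptr_t);
            }
        }
        /* Tail: whatever is left. */
        while (n--)
            *d++ = *s++;
        return dst;
    }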

geerlingguy3 years ago

Maybe someone at Arm has sympathy for my plight to get graphics cards running on a Pi—I've had to replace memcpy calls (and memset) to get many parts of the drivers to work at all on arm64.

Note that the Pi also has a not-fully-standard PCIe bus implementation, so that doesn't really help things either.

my1233 years ago

If it's because of having to map the BARs as device memory, then it's a problem that will be taken care of. That said, the details of the right approach to work around this aren't quite ironed out yet.

On Apple M1, trying to map the BARs as normal memory directly causes an SError.

As a reminder, the Arm Base System Architecture spec mandates that PCIe BARs must be mappable as normal non-cacheable memory.

Ballas3 years ago

I have watched most of your videos on the subject but cannot recall - have you tried running an external GPU on a Nvidia Jetson? Perhaps that is a place to start? (or perhaps I am just letting my ignorance on the matter show)

gradschoolfail3 years ago

You mean using the M.2 Key E Slot? Been trying to find a good adaptor for that, any recommendations? (Including nonstandard carrier boards with 8x pcie slots, etc)

consp3 years ago

If you want to do a bit of DIY and go cheap, you can buy the M.2 to SFF-8643 cards and Dremel a proper keying into it, and get a Linkreal LRFC6911 for the PCIe slot side.

cbm-vic-203 years ago

Sounds like a job for Red Shirt Jeff.

Ballas3 years ago

That is a possibility, but if you like to do things the easy way there is a full-size PCIe slot on the Xavier AGX dev kit.

gradschoolfail3 years ago

Ah right. Was thinking of Nano. As alluded to in the sibling comment, Xaviers seem to be officially out of stock.

girvo3 years ago

Just wanted to say I adore your videos -- and it will be interesting to see where ARM goes in the future to make projects like yours easier (even if only incidentally)

Milner083 years ago

I'll take this opportunity to echo this person's words. Keep the great videos coming.

Unklejoe3 years ago

How come? What is special about how memcpy works compared to a regular load from BAR memory?

forty3 years ago

Extended Zawinski's Law: "Every Instruction Set attempts to expand until it can read mail. Those Instruction Sets which cannot so expand are replaced by ones which can." ;)

cmrdporcupine3 years ago

I feel like every instruction set eventually becomes VAX.

CoastalCoder3 years ago

Is it true that VAX allowed customers to extend the ISA themselves?

I think I learned about that many years ago, but I couldn't find anything about that recently when skimming the 11/780 user manual.

wrs3 years ago

Yes. Search for “writable control store”.

https://en.wikipedia.org/wiki/Control_store#Writable_stores

Here’s an example usage:

http://hps.ece.utexas.edu/pub/gee_micro19.pdf

jhgb3 years ago

Was this about the KU780 option?

GeorgeTirebiter3 years ago

Absolutely correct, KU780 was the Writeable Control Store described here http://bitsavers.trailing-edge.com/pdf/dec/vax/handbook/VAX_...

Sophisticated customers have been hacking the instruction sets of their machines pretty much since the beginning. The earliest I personally know of is Prof Jack Dennis hacking MIT's PDP-1 to support timesharing, sometime in 1961. Commercial machines like the Burroughs B1700 had a WCS that was designed so various compiled languages could be optimized - e.g. a FORTRAN instruction set, a COBOL instruction set, etc. (https://en.wikipedia.org/wiki/Burroughs_B1700). It was also in the IBM 360s, because they had to emulate IBM 1401 software (although I don't know if the capability was open to users to modify).

Today of course you have the various optional features of the RISC-V ecosystem --- easy to load up on an FPGA.

Perhaps we should remember that we are in the very very early days of Computers, and we should expect continued modification / experimentation.

addaon3 years ago

Thus begins the slide from RISC to (what POWER/PowerPC ended up calling) FISC. It's not about reducing the instruction set, it's about designing a fast instruction set with easy-to-generate, generalizable instructions. Even more than PowerPC (which generally added interesting but less primitive register-to-register ops), this is going straight to richer memory-to-memory ops.

brigade3 years ago

Begins? Where do SVE2's histogram instructions fit? Or even NEON's VLD3/VLD4, dating to armv7? (which can decode into over two dozen µops, depending on CPU)

RISC has been definitively dead since Dennard scaling ran out; complex instructions are nothing new for ARM.

ksec3 years ago

>RISC has been definitively dead since Dennard scaling ran out

Except this is still not agreed upon on HN. In every single thread you see more than half of the replies being about RISC and RISC-V, and how ARMv8 / POWER are no longer RISC, hence RISC-V is going to win.

foxfluff3 years ago

The RISC-V hype is crazy, but I feel like it must be a product of marketing. Or I'm missing something big. I've read the (unprivileged) instruction set spec and while it's a nice tidy ISA, it also feels like pretty much a textbook RISC with nothing to set it apart, no features to make it interesting in 2021. And it's not the first open ISA out there. Why is there so much hype surrounding it?

If anything, I got the vibe that they were more concerned about cost of implementation and "scaling it down" than about a future-looking, high-performance ISA. And I'd prefer an ISA designed for 2040s high-end PCs rather than one for 2000s microcontrollers.

sharikone3 years ago

Their SIMD vectorized instructions are very neat and clean up the horrible mess of x64 ISA (I am not familiar enough with Neon and SVE so I don't know if ARM is a mess too)

userbinator3 years ago

I don't get the hype either. RISC-V is basically MIPS, and probably will replace the latter in all the miscellaneous places it's currently used.

SavantIdiot3 years ago

It is both.

RISC-V has some big names promoting it heavily.

What open ISA would be a real competitor to RISC-V?

smoldesu3 years ago

> Why is there so much hype surrounding it?

I Am Not A RISC Expert (IANARE), but I think it boils down to how reprogrammable each core is. My understanding is that each core has degrees of flexibility that can be used to easily hardware-accelerate workflows. As the other commenter mentioned, SIMD also works hand-in-hand with this technology to make a solid case for replacing x86 someday.

Here's a thought: the hype around ARM is crazy. In a decade or so, when x86 is being run out of datacenters en masse, we're going to need to pick a new server-side architecture. ARM is utterly terrible at these kinds of workloads, at least in my experience (and I've owned a Raspberry Pi since Rev1). There's simply not enough gas in the ARM tank to get people where they want it to go, whereas RISC-V has enough fundamental differences and advantages that it's just overtly better than its contemporaries. The downside is that RISC-V is still in a heavy experimentation phase, and even the "physical" RISC-V boards that you can buy are really just glorified FPGAs running someone's virtual machine.

garmaine3 years ago

RISC-V?

pjmlp3 years ago

Just wait until they get finished with all ongoing extensions.

hyperman13 years ago

I think one less visible aspect of RISC is the more orthogonal instruction set.

Consider a CISC instruction set with all kinds of exceptional cases requiring specific registers. Humans writing assembly didn't care much. Once code was written in higher-level languages and compilers did more advanced optimizations, instruction sets had to adapt to this style: a more regular instruction set, more registers, and simpler ways to use each register with each instruction. This was also part of the RISC movement.

Consider the 8086, eg with http://mlsite.net/8086/

* Temporary results belong in AX, so note in rows 9 and A how some common instructions have shorter encodings if you use AL/AX as the register.

* Counters belong in CX, so variable-count shifts and rotates only take the count in CL, and there is a specific JCXZ, jump if CX is zero, intended for loops.

* Memory is pointed at with BX, BP, SI, or DI; the mod r/m byte simply has no encoding for the other registers.

* There are instructions such as XLAT or AAM that are almost impossible for a compiler to use.

* Multiplication and division have DX:AX as the implicit register pair for one operand.

* Conditional jumps had a short range of roughly +/- 128 bytes; jumping further required an unconditional jump.

Starting with the 80386's 32-bit mode, a lot of this was cleaned up and made more accessible to compilers: EAX, EBX, ECX, EDX, ESI and EDI became more or less interchangeable. Multiplication, shifting and memory access became possible with all these registers. Conditional jumps could reach the whole address space.

I heard people at the time describing the x86 instruction set as more RISC-like starting with the 80386.

lokedhs3 years ago

I think this is specific to x86, which is not the only CISC CPU. Other CISC architectures are much more regular. I'm familiar with M68k which is both regular and CISC.

You then have others, like the PDP-10 and S/370, which are also regular but don't have these register-specific requirements that the Intel CPUs are stuck with.

hyperman13 years ago

True, the 8086 instruction set is ugly as hell. The 68000 was much better. I never saw PDP-10 or S/370 assembly, so I can't comment there.

AFAIK it was a quick and dirty stopgap processor to drop into the 8080-shaped hole until they could finish the iAPX 432. Intel wanted 8080 code to be almost auto-translatable to 8086 code and give their customers a way out of the 8-bit, 64K world. So they designed the instructions and memory layout to make this possible, at the cost of orthogonality.

Then IBM hacked together the quick and dirty PC based on the quick and dirty processor, and somehow one of the worst possible designs became the industry standard.

Thinking of it, the 80386 might be Intel coming to terms with the fact that everyone was stuck with this ugly design, and making the best of it. See also the 80186, a CPU incompatible with the PC. Maybe a sign Intel didn't believe in the future of the PC?

leeter3 years ago

I think Intel and IBM didn't expect the need for compatibility to be an issue. After all, when the IBM PC was built, turning it on and getting a BASIC environment was generally considered good enough. IBM added DOS so that CP/M customers would feel comfortable, and it shows in PC DOS 1.0, which is insanely bare-bones. So it was not unreasonable for both IBM and Intel to assume that things like the PCjr made sense, because backwards compatibility was the exception at that point, not the rule. IBM in particular didn't take the PC market seriously and paid for it by getting their lunch eaten by the clones.

It's the clones we have to thank for the situation we're in today. If Compaq hadn't done a viable clone and survived the lawsuit, we'd probably be using something else (Amiga?). But they did, and the rest is history: computing on IBM PC compatible hardware became affordable, and despite better alternatives (sometimes at near-equal cost) the PC won out.

monocasa3 years ago

> See also the 80186, a CPU incompatible with the PC. Maybe a sign Intel didn't believe in the future of the PC?

The 80186 was already well in its design phase when the PC was developed. And the PC wasn't even what Intel thought a personal computer should look like; they were pushing the multibus based systems hard at the time with their iSBC line.

dehrmann3 years ago

When transistor density is growing and clock speed isn't, specialized instructions make a lot of sense.

sifar3 years ago

Some references for FISC.

The Medium article [1]; the Ars Technica article it refers to [2]; and the paper which the Ars Technica article refers to [3].

[1] https://medium.com/macoclock/interesting-remarks-on-risc-vs-...

[2] https://archive.arstechnica.com/cpu/4q99/risc-cisc/rvc-5.htm...

[3] http://www.eng.ucy.ac.cy/theocharides/Courses/ECE656/beyond-...

marcodiego3 years ago

> Thus begins the slide from RISC to (what POWER/PowerPC ended up calling) FISC.

You mean from RISC to CISC, right?

addaon3 years ago

No, although one could make that argument. RISC (reduced instruction set) has a few characteristics besides just the number of instructions -- most "working" instructions are register-to-register, with load/store instructions being the main memory-touching instructions; instructions are of a fixed size with a handful of simple encodings; instructions tend to be of low and similar latency. CISC starts at the other side -- memory-to-register and memory-to-memory "working" instructions, variable-length encodings, instructions of arbitrary latency, etc.

FISC ("fast instruction set") was a term used for POWER/PowerPC to describe a philosophy that started very much with the RISC world, but considered the actual number of instructions to /not/ be a priority. Instructions were freely added when one instruction would take the place of several others, allowing higher code density and performance while staying more-or-less in line with the "core" RISC principles.

None of the RISC principles are widely held by ARM today -- this thread is an example of non-trivial memory operations, Thumb adds many additional instruction encodings of variable length, load/store multiple already had pretty arbitrary latency (not to mention things like division)... but ARM still feels more RISC-like than CISC-like. In my mind, the fundamental reason for this is that ARM feels like it's intended to be the target of a compiler, not the target of a programmer writing assembly code. And, of the many ways we've described instruction sets, in my mind FISC is the best fit for this philosophy.

microtherion3 years ago

> RISC (reduced instruction set) has a few characteristics besides just the number of instructions

Many (or all?) of the RISC pioneers have claimed that RISC was never about keeping the number of instructions low, but about the complexity of those instructions (uniform encoding, register-to-register operations, etc, as you list).

"Not a `reduced instruction set' but a `set of reduced instructions'" was the phrase I recall.

monocasa3 years ago

It's both.

Most of RISC falls out of the ability to assume the presence of dedicated I caches. Once you have a pseudo Harvard arch and your I fetches don't fight with your D fetches for bandwidth in inner loops, most of the benefit of microcode is gone, and a simpler ISA that looks like vertical microcode makes a lot more sense. Why have a single instruction that can crank away for hundreds of cycles computing a polynomial like VAX did if you can just write it yourself with the same perf?

Gaelan3 years ago

> Thumb adds many additional instruction encodings of variable length, load/store multiple already had pretty arbitrary latency

Worth noting that, AFAIK, both of these were removed on aarch64 (and aarch64-only cores do exist, notably Amazon's Graviton2)

galdosdi3 years ago

Fair, haha. But I think the distinction intended lies in that the old CISC ISAs were complex out of a desire to provide the assembly programmer ergonomic creature comforts, backwards compatibility, etc. Today's instruction sets are designed for a world where the vast majority of machine code is generated by an optimizing compiler, not hand crafted through an assembler, and I think that was part of what the RISC revolution was about.

jamesfinlayson3 years ago

Looks like FISC is Fast Instruction Set Computing (maybe - all I could find was a Medium article that says that).

fay593 years ago

ARM already has several instructions that aren’t exactly RISC. ldp and stp can load/store two registers at a given address and also update the value of the address register.

baybal23 years ago

Both x86 and 68xxx started that way. Old silicon actually put a premium on smarter cores, which could do tricks like µop fusion to compensate for smaller decoders.

RISC was originally about getting reasonably small cores which do what they advertise and nothing more, and µop fusion was certainly outside of that scope.

Now silicon is definitely cheaper, and both the decoders and the other front-end smarts are completely microscopic in comparison to the other parts of a modern SoC.

ncmncm3 years ago

How/when will we ever be able to confidently tell Gcc to generate these instructions, when we generally only know the code will be expected to run on some or other Aaargh64?

It is the same problem as POPCNT on Amd64, and practically everything on RISC-V. Checking some status flag at program start is OK for choosing computation kernels that will run for microseconds or longer, but for things that take only a few cycles anyway, at best, checking first makes them take much longer.

I imagine monkeypatching at startup, the way link relocations used to get patched at load time, back in the days before we had ISAs that supported PIC. But that is miserable.

ndesaulniers3 years ago

Good questions.

For compiler support, generally you would pass a -mcpu= flag (or maybe -mattr=, but that might be a compiler-internal flag, I forget). Obviously that's not portable and has implications on the ABI. I didn't read the article but I suspect they might be in ARMv9.0, hopefully, otherwise "better luck next major revision."

For monkey patching, the Linux kernel already does this aggressively since it generally has permission to read the relevant machine specific registers (MSRs). Doesn't help userspace, but userspace can do something similar with hwcaps and ifuncs.
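
A rough sketch of the hwcaps + ifunc approach in userspace (GNU extensions, Linux-specific; HWCAP2_MOPS is the arm64 hwcap name I'd expect from <asm/hwcap.h>, guarded here so the sketch builds elsewhere too):

    #include <stddef.h>
    #include <sys/auxv.h>

    static void *memcpy_generic(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n--)
            *d++ = *s++;
        return dst;
    }

    static void *memcpy_mops(void *dst, const void *src, size_t n)
    {
        /* On a FEAT_MOPS part this would use the CPYFP/CPYFM/CPYFE
           sequence; fall back to the byte loop to keep the sketch
           buildable everywhere. */
        return memcpy_generic(dst, src, n);
    }

    /* The resolver runs once, at load time, and picks an implementation. */
    static void *(*resolve_my_memcpy(void))(void *, const void *, size_t)
    {
        unsigned long caps = getauxval(AT_HWCAP2);
    #ifdef HWCAP2_MOPS
        if (caps & HWCAP2_MOPS)
            return memcpy_mops;
    #endif
        (void)caps;
        return memcpy_generic;
    }

    void *my_memcpy(void *dst, const void *src, size_t n)
        __attribute__((ifunc("resolve_my_memcpy")));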

Unklejoe3 years ago

I guess the first step could be to handle it in the C library using some capability check and function pointers, then perhaps later on in the compiler if some -mcpu flag or something is provided.

floatboth3 years ago

All implementation selection should be done with ifuncs. Sadly lots of programs still do it with just function pointers.

th3typh00n3 years ago

ifuncs are a non-standard compiler extension that only works on certain operating systems.

Developers that care about portability are obviously going to stay far away from such things.

hawk_3 years ago

Sorry, I'm out of the loop here. What specifically is the problem with POPCNT on AMD64?

ncmncm3 years ago

POPCNT wasn't part of the original amd64 from 2003; it was only added a few years later. Because there are still amd64 machines without it, compilers don't produce POPCNT instructions unless directed to target a later chip.

MSVC emits them to implement its __popcnt intrinsic, but does not use that in its stdlib. GCC, without a directive, expands __builtin_popcount to much slower code.

You can check for a "capability", but testing and branching before a POPCNT instruction adds latency and burns a precious branch prediction slot.

Most of the useful instructions on RISC-V are in optional extensions. Sometimes this is OK because you can put a whole loop behind a feature test. But some of these instructions would tend to be isolated and appear all over. That is the case for memcpy and memset, too, which often operate over very small blocks.
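
As a concrete example of "put a whole loop behind a feature test" (GCC/Clang on x86; a sketch using the target attribute and __builtin_cpu_supports, so the branch is paid once per call rather than once per POPCNT):

    #include <stddef.h>
    #include <stdint.h>

    __attribute__((target("popcnt")))
    static size_t count_bits_popcnt(const uint64_t *v, size_t n)
    {
        size_t total = 0;
        for (size_t i = 0; i < n; i++)
            total += (size_t)__builtin_popcountll(v[i]);  /* compiles to POPCNT */
        return total;
    }

    static size_t count_bits_generic(const uint64_t *v, size_t n)
    {
        size_t total = 0;
        for (size_t i = 0; i < n; i++)
            total += (size_t)__builtin_popcountll(v[i]);  /* table/bit-trick expansion */
        return total;
    }

    size_t count_bits(const uint64_t *v, size_t n)
    {
        return __builtin_cpu_supports("popcnt") ? count_bits_popcnt(v, n)
                                                : count_bits_generic(v, n);
    }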

userbinator3 years ago

...so it only took them over three decades to realise the power of REP MOVS/STOS? ;-)

On x86, it's been there since the 8086, and can do cacheline-sized pieces at a time on the newer CPUs. This behaviour is detectable in certain edge-cases:

https://repzret.org/p/rep-prefix-and-detecting-valgrind/

gatronicus3 years ago

Except that for decades REP MOVS/STOS were avoided on x86 because they were much slower than hand written assembly. This only changed recently.

userbinator3 years ago

That was really only in the 286-486 era. On the 8086 it was the fastest, and since the Pentium II, which introduced cacheline-sized moves, it's basically nearly the same as the huge unrolled SIMD implementations that are marginally faster in microbenchmarks.

Linus Torvalds has some good comments on that here: https://www.realworldtech.com/forum/?threadid=196054&curpost...

josefx3 years ago

Linus seems to consider rep mov still too slow for small copies:

https://www.realworldtech.com/forum/?threadid=196054&curpost...

https://www.realworldtech.com/forum/?threadid=196054&curpost...

It seems to me that rep movs is so bad that you want to avoid it, but trying to write a fast generic memcpy results in so much bloat to handle edge cases that rep movs remains competitive in the generic case.

mackman3 years ago

I remember implementing memcpy for a PS3 game. If you were doing a lot of copying (which we were, for some streaming systems) it was hugely beneficial to add some explicit memory prefetching with a handful of compiler intrinsics. I think the PPC core on that lacked out-of-order execution, so you would stall a thread waiting for memory all too easily.
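
Roughly what that trick looks like, expressed with portable GCC intrinsics (a sketch; the 128-byte prefetch distance is a made-up illustration, not a measured tuning value):

    #include <stddef.h>

    static void copy_with_prefetch(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;
        size_t i;

        for (i = 0; i + 16 <= n; i += 16) {
            __builtin_prefetch(s + i + 128, 0, 0); /* pull a line in well ahead of use */
            __builtin_memcpy(d + i, s + i, 16);    /* let the compiler pick the loads */
        }
        for (; i < n; i++)                         /* tail */
            d[i] = s[i];
    }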

mhh__3 years ago

23-stage in-order pipeline according to Wikipedia

https://en.wikipedia.org/wiki/Cell_(microprocessor)#Power_Pr...

dvdkhlng3 years ago

Well, the Cell CPU also had DMA engines that were fully integrated into the MMU memory-mapping, so you would have been able to asynchronously do a memcpy() while the CPU's execution resources were busy running computations in parallel.

wbsun3 years ago

A reminder that ARM is short for Advanced RISC Machines or previously Acorn RISC Machine[1].

[1]: https://en.wikipedia.org/wiki/ARM_architecture

nneonneo3 years ago

These could be really great if they get optimized well in hardware - as single instructions, they’d be easy to inline, reducing both function call overhead and code size all at once. I do wish they’d included some documentation with this update so it’d be clearer how these instructions can be used, though.

cjensen3 years ago

Instructions like this need to be interruptible since they take longer than standard instructions. I assume the ARM designers have thought about this?

brandmeyer3 years ago

ARM has been managing interruptible instructions with partial execution state for a long time.

In ARM assembly syntax, the exclamation point in an addressing mode indicates writeback. It's difficult to be certain without seeing the architecture reference manual, but it would be consistent for the instruction to write back all three of the source pointer, destination pointer, and length registers.

A memcpy is interruptible without replaying the entire instruction (say, because it hit a page that needed to be faulted-in by the operating system) if it wrote back a consistent view of all three registers prior to transferring control to an interrupt handler.

addaon3 years ago

The old ARM<=7 load multiple / store multiple instructions were interruptible on most implementations. My recollection is that some implementations checkpointed and resumed, but at least the smaller cores tended to do a full-restart (so no guarantee of forward progress when approaching livelock). I'd expect the same here, with perhaps more designs leaning towards checkpointing.

baybal23 years ago

It's well known in the ARM world, and it's the reason we were complaining for years about the impossibility of using a DMA controller from userspace to do large memcpys.

More importantly today, using DMA to do large memcpy for non-latency-sensitive tasks allows cores to sleep more often, and it's a godsend for I/O intensive stuff like modern Java apps on Android which are full of giant bitmaps.

brandmeyer3 years ago

ARMv7-M squirrels away the progress of the ldm/stm in the program status register to avoid restarting it completely.

colonwqbang3 years ago

It's strange that such features seem to not be standard in CPUs. I wonder why? Copy-based APIs are not ideal but they seem to be hard to avoid.

In those ARM cores I've programmed, the core has a few extra DMA channels which can be used for such things. However, using them from userspace has always seemed a bit of a hassle.

hannob3 years ago

I haven't done assembler for a long time, but if my memory serves me well, on x86 there's the rep movsb instruction, which effectively does a memcpy-like operation.

hyperman13 years ago

Correct. There is the whole rep family doing all kinds of fun stuff. You can add the rep/repnz/repz prefixes to at least:

movs[b|w|d]: move data in bytes/words/doublewords aka memcpy

stos[b|w|d]: put a value in bytes/words/dwords aka memset

cmps[b|w|d]: compare values aka memcmp

scas[b|w|d]: scan for a value aka memchr

ins[b|w|maybe d]: read from IO port

outs[b|w|maybe d]: write to IO port.

lods[b|w|d] : read from memory was probably not meant to be combined with rep as it would just throw everything but the last byte away. I once saw a rep lodsb to do repeated reads from EGA video ram. The video card saw which bytes were touched and did something to them based on plane mask. This way touching 1 bit changed the color of a whole 4 bit pixel, speeding up things with a factor 4.

Then one day, someone found that rep movs was not the fastest way to copy data on an x86, and they all went out of vogue. I think rep movs/stos recently came back as the fastest memcpy/memset once CPUs got a very specific optimization for them (Intel's "enhanced rep movsb/stosb", ERMSB, I believe).

Update: See https://stackoverflow.com/questions/33480999/how-can-the-rep...
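
For concreteness, the movs flavour as it's usually wrapped from C (x86-64 GNU inline asm; a sketch rather than a tuned memcpy):

    #include <stddef.h>

    /* rsi/rdi/rcx carry the source, destination and remaining count, and
       are updated as the copy proceeds, so the instruction is also
       trivially resumable after an interrupt. */
    static inline void *memcpy_rep_movsb(void *dst, const void *src, size_t n)
    {
        void *ret = dst;
        asm volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
        return ret;
    }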

vardump3 years ago

"Copy-based APIs are not ideal but they seem to be hard to avoid."

If everything resides in CPU L1 cache, it hardly matters at all. Other than L1 cache pressure, of course.

Another example is copying DMA-transferred data and then immediately consuming said data. In this case too, the copy often effectively just brings the data into the CPU cache, and the consuming code reads from cache. Of course it does increase overall memory write bandwidth use when the cache line(s) are eventually evicted, but the total performance degradation can be pretty minimal for anything that fits in L1.

Aissen3 years ago

I saw this the other day, wanted to read it and failed; and again today. Luckily Google has it in cache:

http://webcache.googleusercontent.com/search?q=cache%3Ahttps...

Aissen3 years ago

I'm wondering if this isn't solving a problem only with a local optimum. How much better would it be to have a standard way (i.e., not device-specific) to memzero (or memset) directly in the DRAM chips? Or to use DMA for memcpy while the CPU does other things? Now of course, this could be a nightmare for cache coherency, but I've seen worse things done for performance.

dvdkhlng3 years ago

In fact the Cell CPU [1] had a DMA facility accessible from the SPU cores by non-privileged software [2]. This worked cleanly, as all DMA operations were subject to normal virtual memory paging rules.

But then the SPU did not have direct RAM access (only 256 kB of local SRAM addressable from the CPU instructions), so DMA was something that followed naturally from the general design. Also, not having any cache meant there were none of the usual cache coherency problems (though you may run into coherency problems during concurrent DMA to shared memory from multiple SPUs).

[edit] note also that the SPUs did not usually do any multitasking / multi-threading, which also simplified handling of DMA. Otherwise task switches would have to capture and restore the whole DMA unit's state (and also potentially all 256 kB of local storage as these cannot be paged).

[1] https://en.wikipedia.org/wiki/Cell_(microprocessor)

[2] https://arcb.csc.ncsu.edu/~mueller/cluster/ps3/SDK3.0/docs/a...

petermcneeley3 years ago

Tis a real shame we did not see SPU-like cores in later generations. The problem that I saw was that instead of embracing the power of a new architectural paradigm people just considered it weird and difficult.

I think had they provided a (very slow but) normal path for accessing memory, it would have made the situation much more acceptable to nominal developers.

The difficulty in adopting the PS3 basically killed the idea of Many-Core as the future for high performance gaming architecture.

Unklejoe3 years ago

> all DMA operations were subject to normal virtual memory paging rules.

That's the key right there. Many embedded SoCs I've worked with have DMA engines, but they all sit outside the MMU and only work with physical addresses. It makes using them for something like "accelerated memcpy" kind of cumbersome and usually not even worth it unless it's moving HUGE chunks of memory (to overcome the page table walk that you have to do first).

dvdkhlng3 years ago

Well, I recently found the Cortex-M SoCs to be a blessing in that regard: no MMU, no need to run a fully fledged operating system, but still with LwIP, FreeRTOS&friends, they can handle surprisingly complex software tasks, while the lack of MMU and privilege-separation means that all the hardware: DMA-engines, communication interfaces and accelerator facilities (2-D GPU) are right at the tip of your hands.

monocasa3 years ago

Thankfully we're starting to get IO-MMUs on larger systems with DMA controllers like that. Much easier to pass around.

smallpipe3 years ago

This instruction doesn't prevent an implementation from doing that, it just gives a standard interface.

Aissen3 years ago

Actually, I think it does: you cannot be using the core while it's doing the memset or memcpy, so it's technically not what I'm describing. Even if it did, a cross-industry reference implementation would go a long way toward making this a reality.

smallpipe3 years ago

I'm willing to bet that within 5 years we'll see a CPU that effectively embeds a DMA engine used through this instruction. The way I'd implement it is a small FSM in the LLC that does the bulk copy while the CPU keeps running, while maintaining a list of addresses reads/writes to avoid (i.e. stall on) until the memcpy is finished.

truth_seeker3 years ago

How efficient would it be from a performance and security point of view?

ruslan3 years ago

A completely useless use of silicon. The fastest way of copying a memory block is to offload it to DMA or some other dedicated hardware. Using the CPU to copy blocks is just a stall. And please, do not call ARM a RISC!

branson23 years ago

DMA doesn't see the CPU caches, so you would have to clean/invalidate the affected cache lines around every memcpy. Very bad idea.