CISC jokes aside, this is an interesting turn of events.
Classic ARM had LDM/STM, which could load/store a list of registers. While very handy, it was a nightmare from a hardware POV. For example, it made error handling and rollback much more complex in out-of-order implementations.
ARMv8 removed those in aarch64 and introduced LDP/STP, which only handle two registers at a time (the P is for Pair, the M for Multiple). This made things much easier, but it seems the performance hit was not negligible.
Now with v8.8 and v9.3 we get this, which looks much nicer than Intel's ancient string instructions that have been around since the 8086. But I am curious how it affects other aspects of the CPU, especially those with very long and wide pipelines.
I'm concerned this is another patch on a very difficult problem. There are something like 16 different combinations of source alignment, destination alignment, partial starting word, and partial ending word for memory-move operations. What's needed is an efficient move that does the right thing at runtime, which is to fetch the largest bus-limited chunks and align as it goes.
This includes a pipeline to re-align from source to destination; partial fill of the pipeline at the start and partial dump at the end; and page-sensitive fault-and-restart logic throughout.
Multiple versions of memcpy is suspicious to start with: is the compiler expected to know the alignment statically at code-generation time? The pointers may be arbitrary. Alignment is best determined at runtime; each pass through the same memcpy code may have a different alignment, and so on.
Years ago I debugged the standard Linux copy on a RISC machine. It had a dozen bugs related to this. I remember thinking at the time that this should all be resolved at runtime by microcode inside the processor. It's been years now, and we get this. Sigh. It's a step, anyway.
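The "do the right thing at runtime" approach could look roughly like this in C (a hedged sketch; real implementations such as glibc's are far more elaborate, with overlap handling and wide SIMD paths):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of a runtime-aligning memcpy: byte-copy until the destination
   is word-aligned, move word-sized chunks, then mop up the tail. Real
   versions also handle source misalignment with shifts/merges. */
void *memcpy_runtime(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* head: byte copies until the destination is word-aligned */
    while (n && ((uintptr_t)d & (sizeof(size_t) - 1))) {
        *d++ = *s++;
        n--;
    }
    /* body: word-at-a-time; memcpy into a local is a safe unaligned load */
    while (n >= sizeof(size_t)) {
        size_t w;
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
        d += sizeof w;
        s += sizeof w;
        n -= sizeof w;
    }
    /* tail: remaining bytes */
    while (n--)
        *d++ = *s++;
    return dst;
}
```

Note that the decision which path to take depends only on the runtime values of the pointers and length, not on anything the compiler knew statically.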
Maybe someone at Arm has sympathy on my plight to get graphics cards running on a Pi—I've had to replace memcpy calls (and memset) to get many parts of the drivers to work at all on arm64.
Note that the Pi also has a not-fully-standard PCIe bus implementation, so that doesn't really help things either.
If it’s because of having to map the BARs as device memory, then it’s a problem that will be taken care of. That said, the details of the right approach to working around this aren’t quite ironed out yet.
On Apple M1, trying to map the BARs as normal memory directly causes an SError.
As a reminder, the Arm Base System Architecture spec mandates that PCIe BARs must be mappable as normal non-cacheable memory.
I have watched most of your videos on the subject but cannot recall - have you tried running an external GPU on a Nvidia Jetson? Perhaps that is a place to start? (or perhaps I am just letting my ignorance on the matter show)
You mean using the M.2 Key E Slot? Been trying to find a good adaptor for that, any recommendations? (Including nonstandard carrier boards with 8x pcie slots, etc)
If you want to DIY a bit and go cheap, you can buy the M.2 to SFF-8643 cards, Dremel a proper keying into them, and get a Linkreal LRFC6911 for the PCIe-slot side.
Sounds like a job for Red Shirt Jeff.
That is a possibility, but if you like to do things the easy way there is a full-size PCIe slot on the Xavier AGX dev kit.
I am equally disappointed. Would be surprised, but not verrry, if “people” use this for mining.
EDIT: would you be satisfied with a Jetson Nano combined with the hack mentioned by consp in another comment? (I would be)
Ah right. Was thinking of Nano. As alluded to in the sibling comment, Xaviers seem to be officially out of stock.
Just wanted to say I adore your videos -- and it will be interesting to see where ARM goes in the future to make projects like yours easier (even if only incidentally)
I'll take this opportunity to echo this person's words. Keep the great videos coming.
How come? What is special about how memcpy works compared to a regular load from BAR memory?
Extended Zawinski's Law: "Every instruction set attempts to expand until it can read mail. Those instruction sets which cannot so expand are replaced by ones which can." ;)
I feel like every instruction set eventually becomes VAX.
Is it true that VAX allowed customers to extend the ISA themselves?
I think I learned about that many years ago, but I couldn't find anything about it recently when skimming the 11/780 user manual.
Yes. Search for “writable control store”.
https://en.wikipedia.org/wiki/Control_store#Writable_stores
Here’s an example usage:
Was this about the KU780 option?
Absolutely correct, KU780 was the Writeable Control Store described here http://bitsavers.trailing-edge.com/pdf/dec/vax/handbook/VAX_...
Sophisticated customers hacking the instruction sets of their machines goes back pretty much to the beginning. The earliest I personally know of is Prof. Jack Dennis hacking MIT's PDP-1 to support timesharing, sometime in 1961. Commercial machines like the Burroughs B1700 had a WCS designed so that various compiled languages could be optimized - e.g. a FORTRAN instruction set, a COBOL instruction set, etc. (https://en.wikipedia.org/wiki/Burroughs_B1700). It was also in the IBM 360s, because they had to emulate IBM 1401 software (although I don't know if the capability was open to users to modify).
Today of course you have the various optional features of the RISC-V ecosystem --- easy to load up on an FPGA.
Perhaps we should remember that we are in the very very early days of Computers, and we should expect continued modification / experimentation.
Thus begins the slide from RISC to (what POWER/PowerPC ended up calling) FISC. It's not about reducing the instruction set, it's about designing a fast instruction set with easy-to-generate, generalizable instructions. Even more than PowerPC (which generally added interesting but less primitive register-to-register ops), this is going straight to richer memory-to-memory ops.
Begins? Where do SVE2's histogram instructions fit? Or even NEON's VLD3/VLD4, dating to armv7? (which can decode into over two dozen µops, depending on CPU)
RISC has been definitively dead since Dennard scaling ran out; complex instructions are nothing new for ARM.
>RISC has been definitively dead since Dennard scaling ran out
Except this is still not agreed upon on HN. In every single thread you see more than half of the replies being about RISC and RISC-V, and how ARMv8 / POWER are no longer RISC, hence RISC-V is going to win.
The RISC-V hype is crazy, but I feel like it must be a product of marketing. Or I'm missing something big. I've read the (unprivileged) instruction set spec and while it's a nice tidy ISA, it also feels like pretty much a textbook RISC with nothing to set it apart, no features to make it interesting in 2021. And it's not the first open ISA out there. Why is there so much hype surrounding it?
If anything, I got the vibe that they were more concerned about cost of implementation and "scaling it down" than about a future-looking, high-performance ISA. And I'd prefer an ISA designed for 2040s high-end PCs rather than one for 2000s microcontrollers.
> Everyone is jumping on it because no longer do they have to deal with a GCC/LLVM backend
That seems like why everyone in the low-end space would be jumping on it (like WD for their storage controllers). But that's not really an advantage over the existing ARM & X86 ISAs in the mid to high-end space since they already have that software tooling built up.
But that also seems rather narrowly scoped to those who are willing to design & fab custom SoCs, which seems to need both ultra-low margins and ultra-high volumes to justify. Anyone going off-the-shelf already has things like the Cortex-M with complete software tooling out of the box. And anyone going high-margin can always just take ARM's more advanced licenses to start with a better baseline & better existing software ecosystem (ex, graviton2, Apple Silicon, Nvidia's Denver, Carmel & Grace, etc..)
Esperanto Technologies already did (ET-SoC-1 has four OoO RV64GC cores), but I doubt they were first.
Their SIMD vectorized instructions are very neat and clean up the horrible mess of the x64 ISA (I am not familiar enough with Neon and SVE, so I don't know if ARM is a mess too).
I don't get the hype either. RISC-V is basically MIPS, and probably will replace the latter in all the miscellaneous places it's currently used.
It is both.
RISC-V has some big names promoting it heavily.
What open ISA would be a real competitor to RISC-V?
> Why is there so much hype surrounding it?
I Am Not A RISC Expert (IANARE), but I think it boils down to how reprogrammable each core is. My understanding is that each core has degrees of flexibility that can be used to easily hardware-accelerate workflows. As the other commenter mentioned, SIMD also works hand-in-hand with this technology to make a solid case for replacing x86 someday.
Here's a thought: the hype around ARM is crazy. In a decade or so, when x86 is being run out of datacenters en masse, we're going to need to pick a new server-side architecture. ARM is utterly terrible at these kinds of workloads, at least in my experience (and I've owned a Raspberry Pi since Rev 1). There's simply not enough gas in the ARM tank to get people where they want it to go, whereas RISC-V has enough fundamental differences and advantages that it's just overtly better than its contemporaries. The downside is that RISC-V is still in a heavy experimentation phase, and even the "physical" RISC-V boards you can buy are really just glorified FPGAs running someone's virtual machine.
RISC-V?
Just wait until they get finished with all ongoing extensions.
I think one less visible aspect of RISC is the more orthogonal instruction set.
Consider a CISC instruction set with all kinds of exceptional cases requiring specific registers. Humans writing assembler won't care much. When code was written in higher level languages and compilers did more advanced optimizations, instruction sets had to adapt to this style: A more regular instruction set, more registers, and simpler ways to use each register with each instruction. This was also part of the RISC movement.
Consider the 8086, eg with http://mlsite.net/8086/
* Temporary results belong in AX, so note in rows 9 and A how some common instructions have shorter encodings if you use AL/AX as the register.
* Counters belong in CX, so shifts and rotates by a register amount only work with CL. There is a dedicated JCXZ, jump if CX is zero, intended for loops.
* Memory is pointed at with BX, BP, SI, and DI; the mod r/m byte simply has no encoding for the other registers.
* There are instructions like XLAT or AAM that are almost impossible for a compiler to use.
* Multiplication and division use DX:AX as the implicit register pair for one operand.
* Conditional jumps had a short range of -128 to +127 bytes; jumping further required an unconditional jump.
Starting with the 80386's 32-bit mode, a lot of this was cleaned up and made more accessible to compilers: EAX, EBX, ECX, EDX, ESI, and EDI became more or less interchangeable. Multiplication, shifting, and memory access became possible with all these registers, and conditional jumps could reach the whole address space.
I heard people at the time describing the x86 instruction set as more RISC-like starting with the 80386.
I think this is specific to x86, which is not the only CISC CPU. Other CISC architectures are much more regular. I'm familiar with M68k which is both regular and CISC.
You then have others, like the PDP-10 and S/370, which are also regular but don't have the register-specific requirements that the Intel CPUs are stuck with.
True, the 8086 instruction set is ugly as hell. The 68000 was much better. I never saw PDP-10 or S/370 assembly, so I can't comment there.
AFAIK it was a quick-and-dirty stopgap processor to drop into the 8080-shaped hole until they could finish the iAPX 432. Intel wanted 8080 code to be almost auto-translatable to 8086 code and give their customers a way out of the 8-bit, 64K world. So they designed instructions and a memory layout to make this possible, at the cost of orthogonality.
Then IBM hacked together the quick and dirty PC based on the quick and dirty processor, and somehow one of the worst possible designs became the industry standard.
Thinking of it, the 80386 might be Intel coming to terms with the fact that everyone was stuck with this ugly design, and making the best of it. See also the 80186, a CPU incompatible with the PC. Maybe a sign Intel didn't believe in the future of the PC?
I think Intel and IBM didn't expect compatibility to become an issue. After all, when the IBM PC was built, turning on and getting a BASIC environment was generally considered good enough. IBM added DOS so that CP/M customers would feel comfortable, and it shows in PC-DOS 1.0, which is insanely bare-bones. So it was not unreasonable for both IBM and Intel to assume that things like the PCjr made sense, because backwards compatibility was the exception at that point, not the rule. IBM in particular didn't take the PC market seriously and paid for it by getting their lunch eaten by the clones.
It's the clones we have to thank for the situation we're in today. If Compaq hadn't made a viable clone and survived the lawsuit, we'd probably be using something else (Amiga?). But they did, and the rest is history: computing on IBM-PC-compatible hardware became affordable, and despite better alternatives (sometimes at near-equal cost) the PC won out.
> See also the 80186, a CPU incompatible with the PC. Maybe a sign Intel didn't believe in the future of the PC?
The 80186 was already well into its design phase when the PC was developed. And the PC wasn't even what Intel thought a personal computer should look like; they were pushing Multibus-based systems hard at the time with their iSBC line.
When transistor density is growing and clock speed isn't, specialized instructions make a lot of sense.
Some references for FISC.
The Medium article [1], the Ars Technica article it refers to [2], and the paper the Ars Technica article refers to [3].
[1] https://medium.com/macoclock/interesting-remarks-on-risc-vs-...
[2] https://archive.arstechnica.com/cpu/4q99/risc-cisc/rvc-5.htm...
[3] http://www.eng.ucy.ac.cy/theocharides/Courses/ECE656/beyond-...
> Thus begins the slide from RISC to (what POWER/PowerPC ended up calling) FISC.
You mean from RISC to CISC, right?
No, although one could make that argument. RISC (reduced instruction set) has a few characteristics besides just the number of instructions -- most "working" instructions are register-to-register, with load/store instructions being the main memory-touching instructions; instructions are of a fixed size with a handful of simple encodings; instructions tend to be of low and similar latency. CISC starts from the other side -- memory-to-memory and memory-to-register "working" instructions, variable-length encodings, instructions of arbitrary latency, etc.
FISC ("fast instruction set") was a term used for POWER/PowerPC to describe a philosophy that started very much with the RISC world, but considered the actual number of instructions to /not/ be a priority. Instructions were freely added when one instruction would take the place of several others, allowing higher code density and performance while staying more-or-less in line with the "core" RISC principles.
None of the RISC principles are widely held by ARM today -- this thread is an example of non-trivial memory operations, Thumb adds many additional instruction encodings of variable length, load/store multiple already had pretty arbitrary latency (not to mention things like division)... but ARM still feels more RISC-like than CISC-like. In my mind, the fundamental reason for this is that ARM feels like it's intended to be the target of a compiler, not the target of a programmer writing assembly code. And, of the many ways we've described instruction sets, in my mind FISC is the best fit for this philosophy.
> RISC (reduced instruction set) has a few characteristics besides just the number of instructions
Many (or all?) of the RISC pioneers have claimed that RISC was never about keeping the number of instructions low, but about the complexity of those instructions (uniform encoding, register-to-register operations, etc, as you list).
"Not a `reduced instruction set' but a `set of reduced instructions'" was the phrase I recall.
It's both.
Most of RISC falls out of the ability to assume the presence of dedicated I-caches. Once you have a pseudo-Harvard arch and your I-fetches don't fight your D-fetches for bandwidth in inner loops, most of the benefit of microcode is gone, and a simpler ISA that looks like vertical microcode makes a lot more sense. Why have a single instruction that can crank away for hundreds of cycles computing a polynomial, like the VAX did, if you can just write it yourself with the same perf?
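For instance, the VAX POLY instruction evaluated a polynomial by Horner's rule; in the RISC era the same thing is just a short loop the compiler can schedule freely (illustrative sketch, not the exact POLY semantics):

```c
/* Horner's rule: evaluate coef[0] + coef[1]*x + ... + coef[degree]*x^degree.
   This is the "write it yourself" loop replacing a VAX-style POLY opcode. */
double poly_eval(double x, const double *coef, int degree) {
    double r = coef[degree];
    for (int i = degree - 1; i >= 0; i--)
        r = r * x + coef[i];
    return r;
}
```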
> Thumb adds many additional instruction encodings of variable length, load/store multiple already had pretty arbitrary latency
Worth noting that, AFAIK, both of these were removed on aarch64 (and aarch64-only cores do exist, notably Amazon's Graviton2)
Fair, haha. But I think the distinction intended lies in that the old CISC ISAs were complex out of a desire to provide the assembly programmer ergonomic creature comforts, backwards compatibility, etc. Today's instruction sets are designed for a world where the vast majority of machine code is generated by an optimizing compiler, not hand crafted through an assembler, and I think that was part of what the RISC revolution was about.
Looks like FISC is Fast Instruction Set Computing (maybe - all I could find was a Medium article that says that).
ARM already has several instructions that aren’t exactly RISC. ldp and stp can load/store two registers at a given address and also update the value of the address register.
Both x86 and 68xxx started that way. Silicon used to come at a premium, which favored smarter cores that can do tricks like µOp fusion to compensate for smaller decoders.
RISC was originally about getting reasonably small cores that do what they are advertised to do, and nothing more; µOp fusing was certainly outside of that scope.
Now silicon is definitely cheaper, and both decoders and other front-end smarts are completely microscopic in comparison to the other parts of a modern SoC.
How/when will we ever be able to confidently tell GCC to generate these instructions, when we generally only know the code will be expected to run on some Aaargh64 or other?
It is the same problem as POPCNT on amd64, and practically everything on RISC-V. Checking some status flag at program start is OK for choosing computation kernels that will run for microseconds or longer, but for things that take only a few cycles anyway, checking first makes them take much longer.
I imagine monkey-patching at startup, the way link relocations used to get patched back in the days of ISAs that didn't support PIC. But that is miserable.
Good questions.
For compiler support, generally you would pass a -mcpu= flag (or maybe -mattr=, but that might be a compiler-internal flag, I forget). Obviously that's not portable and has implications for the ABI. I didn't read the article, but I suspect they might be in ARMv9.0, hopefully; otherwise, "better luck next major revision."
For monkey patching, the Linux kernel already does this aggressively since it generally has permission to read the relevant machine specific registers (MSRs). Doesn't help userspace, but userspace can do something similar with hwcaps and ifuncs.
I guess the first step could be to handle it in the C library using some capability check and function pointers, then perhaps later on in the compiler if some -mcpu flag or something is provided.
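That libc-level scheme might look roughly like this (a hedged sketch; the names and the capability probe are made up -- on Linux/aarch64 the real check would be a hwcap bit obtained via getauxval):

```c
#include <stddef.h>

/* Baseline implementation that works everywhere. */
static void slow_copy(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
}

/* Stand-in for a version built around the new copy instructions;
   here it just delegates so the sketch stays portable. */
static void fast_copy(void *dst, const void *src, size_t n) {
    slow_copy(dst, src, n);
}

/* Stand-in for the capability check (e.g. a hwcap bit). */
static int have_fast_copy(void) {
    return 0; /* pretend the feature is absent */
}

/* All callers go through this pointer after startup. */
static void (*copy_impl)(void *, const void *, size_t);

/* Run once at startup, e.g. from a constructor or by the loader. */
static void copy_init(void) {
    copy_impl = have_fast_copy() ? fast_copy : slow_copy;
}
```

The point is that the branch on the capability is paid once at startup, not on every call; ifuncs move the same selection into the dynamic loader.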
All implementation selection should be done with ifuncs. Sadly lots of programs still do it with just function pointers.
ifuncs are a non-standard compiler extension that only works on certain operating systems.
Developers that care about portability are obviously going to stay far away from such things.
Sorry, I am out of the loop here. What specifically is the problem with POPCNT on amd64?
POPCNT was only added to amd64 around 2007 (with AMD's ABM; Intel followed with SSE4.2). Because there are still amd64 machines from before then, compilers don't produce POPCNT instructions unless directed to target a later chip.
MSVC emits them to implement the __popcnt intrinsic, but does not use it in its stdlib. GCC, without a directive, expands __builtin_popcount to much slower code.
You can check for a "capability", but testing and branching before a POPCNT instruction adds latency and burns a precious branch prediction slot.
Most of the useful instructions on RISC-V are in optional extensions. Sometimes this is OK because you can put a whole loop behind a feature test. But some of these instructions would tend to be isolated and appear all over. That is the case for memcpy and memset, too, which often operate over very small blocks.
...so it only took them over three decades to realise the power of REP MOVS/STOS? ;-)
On x86, it's been there since the 8086, and can do cacheline-sized pieces at a time on newer CPUs. This behaviour is detectable in certain edge cases.
Except that for decades REP MOVS/STOS were avoided on x86 because they were much slower than hand written assembly. This only changed recently.
That was really only in the 286-486 era. On the 8086 it was the fastest, and since the Pentium II, which introduced cacheline-sized moves, it's been nearly the same as the huge unrolled SIMD implementations that are marginally faster in microbenchmarks.
Linus Torvalds has some good comments on that here: https://www.realworldtech.com/forum/?threadid=196054&curpost...
Linus seems to consider rep mov still too slow for small copies:
https://www.realworldtech.com/forum/?threadid=196054&curpost...
https://www.realworldtech.com/forum/?threadid=196054&curpost...
It seems to me that rep movs is so bad that you want to avoid it, but trying to write a fast generic memcpy results in so much bloat to handle edge cases that rep movs remains competitive in the generic case.
I remember implementing memcpy for a PS3 game. If you were doing a lot of copying (which we were, for some streaming systems) it was hugely beneficial to add some explicit memory prefetching with a handful of compiler intrinsics. I think the PPC processor on that machine lacked out-of-order execution, so you would stall a thread waiting for memory all too easily.
23-stage in-order pipeline according to Wikipedia
https://en.wikipedia.org/wiki/Cell_(microprocessor)#Power_Pr...
Well, the Cell CPU also had DMA engines that were fully integrated into the MMU memory-mapping, so you would have been able to asynchronously do a memcpy() while the CPU's execution resources were busy running computations in parallel.
A reminder that ARM is short for Advanced RISC Machines or previously Acorn RISC Machine[1].
These could be really great if they get optimized well in hardware - as single instructions, they’d be easy to inline, reducing both function call overhead and code size all at once. I do wish they’d included some documentation with this update so it’d be clearer how these instructions can be used, though.
Instructions like this need to be interruptible since they take longer than standard instructions. I assume the ARM designers have thought about this?
ARM has been managing interruptible instructions with partial execution state for a long time.
In ARM assembly syntax, the exclamation point in an addressing mode indicates writeback. It's difficult to be certain without seeing the architecture reference manual, but it would be consistent for the instruction to write back all three of the source pointer, destination pointer, and length registers.
A memcpy is interruptible without replaying the entire instruction (say, because it hit a page that needed to be faulted in by the operating system) if it wrote back a consistent view of all three registers prior to transferring control to an interrupt handler.
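The consequence can be sketched in C: if all progress lives in the three registers, the copy can stop after any byte and resume later with no hidden state (an illustration of the semantics, not the actual hardware mechanism):

```c
#include <stddef.h>

/* All progress lives in these three "architectural registers", so an
   interrupt between steps loses nothing -- restart with the same state. */
struct copy_state {
    unsigned char *dst;
    const unsigned char *src;
    size_t len;
};

/* Copy at most `budget` bytes, then return (simulating an interrupt). */
void copy_step(struct copy_state *st, size_t budget) {
    while (st->len && budget--) {
        *st->dst++ = *st->src++;
        st->len--;
    }
}
```

Calling copy_step repeatedly with small budgets completes the same copy as one big call, which is exactly the property the writeback of all three registers provides.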
The old ARM<=7 load multiple / store multiple instructions were interruptible on most implementations. My recollection is that some implementations checkpointed and resumed, but at least the smaller cores tended to do a full-restart (so no guarantee of forward progress when approaching livelock). I'd expect the same here, with perhaps more designs leaning towards checkpointing.
It's well known in the ARM world, and it's the reason we were complaining for years about impossibility of using DMA controller from userspace to do large memcpys.
More importantly today, using DMA to do large memcpy for non-latency-sensitive tasks allows cores to sleep more often, and it's a godsend for I/O intensive stuff like modern Java apps on Android which are full of giant bitmaps.
ARMv7-M squirrels away the progress of the ldm/stm in the program status register to avoid restarting it completely.
It's strange that such features seem to not be standard in CPUs. I wonder why? Copy-based APIs are not ideal but they seem to be hard to avoid.
In those ARM cores I've programmed, the core has a few extra DMA channels which can be used for such things. However, using them from userspace has always seemed a bit of a hassle.
I haven't done assembler for a long time, but if my memory serves me well, on x86 there's the rep movsb command that will effectively do a memcpy-like operation.
Correct. There is a whole rep family doing all kinds of fun stuff. You can add the rep/repnz/repz prefixes to at least:
movs[b|w|d]: move data in bytes/words/doublewords, aka memcpy
stos[b|w|d]: store a value in bytes/words/dwords, aka memset
cmps[b|w|d]: compare values, aka memcmp
scas[b|w|d]: scan for a value, aka memchr
ins[b|w|maybe d]: read from an IO port
outs[b|w|maybe d]: write to an IO port
lods[b|w|d]: read from memory; probably not meant to be combined with rep, as it would just throw everything but the last element away. I once saw a rep lodsb used to do repeated reads from EGA video RAM. The video card saw which bytes were touched and did something to them based on the plane mask. This way touching 1 bit changed the color of a whole 4-bit pixel, speeding things up by a factor of 4.
Then one day someone found that rep movs was not the fastest way to copy data on an x86, and they all went out of vogue. I think rep stos recently came back as the fastest memset, as it had a very specific CPU optimization applied.
Update: See https://stackoverflow.com/questions/33480999/how-can-the-rep...
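For the curious, here's roughly what using it looks like from C with GCC extended asm (x86 only; other targets fall back to a plain byte loop -- a sketch, not production code):

```c
#include <stddef.h>

/* rep movsb as a memcpy: rdi/esi take dst/src, rcx the byte count.
   Assumes the direction flag is clear, as the SysV ABI guarantees. */
static void copy_rep_movsb(void *dst, const void *src, size_t n) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif
}
```

Note how all the state (pointers and count) stays in registers, which is what makes the instruction interruptible and resumable with a plain iret.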
"Copy-based APIs are not ideal but they seem to be hard to avoid."
If everything resides in CPU L1 cache, it hardly matters at all. Other than L1 cache pressure, of course.
Another example is copying DMA-transferred data and then immediately consuming it. In this case too, the copy often effectively just brings the data into the CPU cache, and the consuming code reads from cache. It does increase overall memory write bandwidth use when the cache line(s) are eventually evicted, but the total performance degradation can be pretty minimal for anything that fits in L1.
I saw this the other day, wanted to read it and failed; and again today. Luckily Google has it in cache:
http://webcache.googleusercontent.com/search?q=cache%3Ahttps...
I'm wondering if this isn't solving the problem only at a local optimum. How much better would it be to have a standard (i.e., not device-specific) way to memzero (or memset) directly in the DRAM chips? Or to use DMA for memcpy while the CPU does other things? Of course, this could be a nightmare for cache coherency, but I've seen worse things done for performance.
In fact the Cell CPU [1] had a DMA facility accessible from the SPU cores by non-privileged software [2]. This worked cleanly, as all DMA operations were subject to normal virtual memory paging rules.
But then the SPU did not have direct RAM access (only 256 kB of local SRAM addressable from its instructions), so DMA was something that followed naturally from the general design. Also, not having any cache meant there were none of the usual cache-coherency problems (though you may run into coherency problems during concurrent DMA to shared memory from multiple SPUs).
[edit] Note also that the SPUs did not usually do any multitasking/multithreading, which also simplified the handling of DMA. Otherwise task switches would have to capture and restore the whole DMA unit's state (and potentially all 256 kB of local storage, as it cannot be paged).
[1] https://en.wikipedia.org/wiki/Cell_(microprocessor)
[2] https://arcb.csc.ncsu.edu/~mueller/cluster/ps3/SDK3.0/docs/a...
'Tis a real shame we did not see SPU-like cores in later generations. The problem, as I saw it, was that instead of embracing the power of a new architectural paradigm, people just considered it weird and difficult.
I think if they had provided a (very slow but) normal path for accessing memory, it would have made the situation much more acceptable to nominal developers.
The difficulty in adopting the PS3 basically killed the idea of Many-Core as the future for high performance gaming architecture.
> all DMA operations were subject to normal virtual memory paging rules.
That's the key right there. Many embedded SoCs I've worked with have DMA engines, but they all bypass the MMU and work only with physical addresses. It makes using them for something like an "accelerated memcpy" kind of cumbersome and usually not even worth it unless you're moving HUGE chunks of memory (to amortize the page-table walk that you have to do first).
Well, I recently found the Cortex-M SoCs to be a blessing in that regard: no MMU, no need to run a fully fledged operating system, but with LwIP, FreeRTOS and friends they can still handle surprisingly complex software tasks, while the lack of MMU and privilege separation means that all the hardware (DMA engines, communication interfaces, and accelerator facilities like a 2D GPU) is right at your fingertips.
Thankfully we're starting to get IO-MMUs on larger systems with DMA controllers like that. Much easier to pass around.
This instruction doesn’t prevent the implementation of doing that, it just gives a standard interface.
Actually, I think it does: you cannot use the core while it's doing the memset or memcpy, so it's technically not what I'm describing. Even if it did, a cross-industry reference implementation would go a long way toward making this a reality.
I'm willing to bet that within 5 years we'll see a CPU that effectively embeds a DMA engine used through this instruction. The way I'd implement it is a small FSM in the LLC that does the bulk copy while the CPU keeps running, while maintaining a list of addresses reads/writes to avoid (i.e. stall on) until the memcpy is finished.
How efficient would it be from a performance and security point of view?
A completely useless use of silicon. The fastest way of copying memory block is to offload it to DMA or some other dedicated hardware. Using CPU to copy blocks is just a stall. And please, do not call ARM a RISC!
DMA is cache-oblivious, so you would have to flush the affected cache lines before every memcpy. Very bad idea.
Note that in ARM-based microcontrollers, LDM/STM also have a non-negligible impact on interrupt latency. They are defined such that they cannot be interrupted mid-instruction, so worst-case interrupt latency is higher than would be expected from a RISC CPU (especially if the LDM/STM happens to target a somewhat slower memory region).
AFAICS x86 "rep" prefixed instructions are defined so that they can in fact be interrupted without problems. The remaining count is kept in (e)cx, so just doing an iret into "rep stosb" etc. will continue its operation.
I think VIA's hash/AES instruction-set extension also made use of the "rep" prefix and kept all encryption/hash state in the x86 register set, so that it could in fact hash large memory regions with a single opcode without hampering interrupts.
> These are defined in a way that they cannot be interrupted mid-instruction...
Usually. Cortex-M3 and M4 cores allow LDM/STM to be interrupted by default, and offer a flag to disable that (SCB->ACTLR.DISMCYCINT).
https://developer.arm.com/documentation/ddi0439/b/System-Con...
Yes, looks like they added a few bits to the PSR register to capture the internal state of LDM/STM. Like a small version of x86 keeping the string-op count in (e)CX.
> AFAICS x86 "rep" prefixed instructions are defined so that they can in fact be interrupted without problems. The remaining count is kept in (e)cx, so just doing an iret into "rep stosb" etc. will continue its operation.
The 8086/8088 have a bug (one of very few!) where segment override prefixes were lost after an interrupted string instruction:
https://www.pcjs.org/documents/manuals/intel/8086/
I believe it was fixed in later versions.