Welcome to this training module about the M-Profile Vector Extension, MVE. This extension is part of the Armv8.1-M extensions to the M profile architectures, and it adds single instruction, multiple data, or SIMD, support to the architecture. The key takeaways for this module are that by the end of it, you should be able to describe the background behind why the MVE extension was added to the architecture, summarize the key features that MVE adds, and explain how you would use software and compilation tools to build for MVE. This module is split into three sections: an overview, a detailed look at some of the features, and then software support.

Let's start with the overview. What is MVE? As the slide says, it is a SIMD data processing engine. SIMD is a specific style of instruction in the field of computer architecture that allows you to simultaneously put multiple pieces of data through the same operation using just one instruction. MVE brings some of the functionality of the Armv8-A Neon and SVE extensions to the M profile architectures. We'll look at this in more detail on the next slide.

The diagrams on this slide show a couple of examples of MVE instructions and how this single instruction, multiple data concept comes into play. On the top right, we have two input registers, Q0 and Q1. Each is split into four values, as you can see here. Pairs of these values, one from each register, are passed through some operation, and the result is stored in the corresponding location in the destination register, Q2. The operation in this case is an add instruction, as indicated by the opcode VADD, where V means vector and ADD indicates the type of the operation. The .S32 suffix indicates that each vector is comprised of four 32-bit elements. We'll look at more of these types and registers in the next few slides.

Some more background: MVE was added to the architecture primarily to address digital signal processing use cases.
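To make the lane-wise behaviour concrete, here is a small scalar model of what VADD.S32 Q2, Q0, Q1 does. This is just an illustrative C sketch, not real MVE code: the function name and the four-element arrays standing in for the 128-bit Q registers are my own.

```c
#include <stdint.h>

/* Illustrative scalar model of the MVE instruction VADD.S32 Qd, Qn, Qm:
 * each 128-bit Q register is viewed as four 32-bit lanes, and the add is
 * applied to each lane pair independently, all under one instruction. */
void vadd_s32(const int32_t qn[4], const int32_t qm[4], int32_t qd[4])
{
    for (int lane = 0; lane < 4; lane++)
        qd[lane] = qn[lane] + qm[lane]; /* one lane of the vector add */
}
```

On real hardware all four lane additions happen as a single vector operation; the loop here is only a way to write the per-lane semantics down in C.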
For example, an audio device can now do lots of performance-critical digital signal processing work directly on an M profile CPU, instead of having to issue that work to a dedicated DSP.

We discussed that MVE is a single instruction, multiple data extension of the M profile architecture. But what data types can it actually operate on? Well, MVE supports both integer types and floating point types. Specifically, it supports 8-bit, 16-bit, 32-bit, and 64-bit integers. For floating point, it supports both single precision and half precision. Half precision floating point is commonly used in machine learning applications, so this extension can be quite powerful for low-power machine learning devices.

Not all instructions in MVE are equal; they operate in different manners. For example, you don't always have to do operations based on pairs of registers. Some MVE instructions can operate across a vector. This is also known as operating across a vector horizontally. For example, this instruction here has just one source register, Qn. All it does is add together each of the four elements within Qn and store the sum in a scalar register, Rm. You can usually tell that an instruction operates across the vector horizontally if it has a V suffix immediately after the main opcode. So here, the leading V marks it as a SIMD instruction, ADD signifies the operation type, and the trailing V specifies that it operates across the vector horizontally. As we saw on the previous slide, this instruction is also operating on 32-bit values.

MVE has lots of other features, which we'll get into later in the slides. The main ones are to do with better handling of loops, in particular loops that can't be perfectly vectorized in standard or classic SIMD extensions in the Arm architectures. The second main feature is to do with how MVE allows you to easily and dynamically address memory without having to pre-program specific access patterns.
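The across-vector behaviour of VADDV.S32 can be sketched in the same scalar style. Again, this is an illustrative C model rather than real MVE code; the function name is my own, and the four-element array stands in for the 128-bit Qn register.

```c
#include <stdint.h>

/* Illustrative scalar model of VADDV.S32 Rm, Qn: sum all four 32-bit
 * lanes of the vector register horizontally, and place the result in
 * a 32-bit scalar register. */
int32_t vaddv_s32(const int32_t qn[4])
{
    int32_t sum = 0;
    for (int lane = 0; lane < 4; lane++)
        sum += qn[lane];          /* accumulate across the vector */
    return sum;
}
```

Note the shape of the operation: unlike the pairwise VADD, the result is a single scalar, which is why the destination is an R register rather than a Q register.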
This slide shows the evolution of the DSP features in the M profile architectures over time. Before the introduction of MVE, in the Armv7-M and Armv8-M architectures, Arm featured the digital signal processing, or DSP, extension. This offered limited support for DSP operations on M profile cores. MVE includes the DSP extension features but enhances them significantly. For example, previously, the DSP extension operated on the standard 32-bit general purpose registers, R0 through to R15. Now, you have a separate register bank of eight 128-bit wide registers. These are actually shared with the floating point unit, as we'll see later, but now you have much wider registers. As a consequence, you can split them into up to 16 values per register, whereas previously you could only have four. The biggest change, though, is that previously you could only do integer operations with the DSP extension. Now, you can do integer operations, single precision floating point operations, and half precision floating point operations.

Why MVE? Why would you use an M profile core with MVE instead of a dedicated digital signal processor? The first thing is that it reduces the design complexity of your system on chip. Because everything is being handled by one core, you can do everything through a single memory system. You don't have to shuffle data between your DSP co-processor and the M profile core, so everything can operate a lot faster as well. We've also observed through testing that for machine learning use cases operating on 8-bit integer types, you can get significant performance uplifts compared to Cortex-M33 when you use MVE: a 15 times improvement, for example, in matrix multiplication testing for machine learning, and a five times uplift compared to Cortex-M33 for complex fast Fourier transforms for audio processing. This means that at runtime, if you can finish your task sooner, you can put your core to sleep sooner. Therefore, you will save some power while the core sleeps.
As MVE is implemented by Arm, we're working directly with third parties and toolchain vendors to include support for MVE in their ecosystems, so it takes advantage of the Arm ecosystem at large. Finally, the instructions for MVE can be generated by the same compiler that's handling the rest of your code, and the design of the MVE instruction set is very parametric, which makes it a good target for automatic vectorization of code. We'll see more about this near the end of the module. But basically it means that, for the most part, you can rely on the compiler doing a pretty good job with your code to make the most of MVE.
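As a rough example of the kind of code that auto-vectorizes well, consider a simple element-wise loop with no cross-iteration dependencies. The compiler flags in the comment are illustrative of building for an MVE-capable core such as Cortex-M55; the function itself is plain, portable C.

```c
#include <stdint.h>

/* A loop shaped for auto-vectorization: every iteration is independent,
 * so an MVE-aware compiler (invoked with something like
 * -mcpu=cortex-m55 -O2) can turn the body into vector adds that process
 * several elements per instruction, with MVE's loop features handling
 * any leftover elements when n is not a multiple of the lane count. */
void add_arrays(const int32_t *a, const int32_t *b, int32_t *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

The point is that nothing MVE-specific appears in the source: the same function compiles for any target, and the vectorization happens entirely in the compiler.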