The following article is an adapted script from my YouTube video: "What the Hell is a Neural Engine?"


If you've purchased an iPhone or iPad released after 2017, or an Apple Silicon Mac, it has an Apple Neural Engine. The short answer to my rhetorical question is that the ANE debuted on the iPhone X with the A11 chipset and was initially designed for machine learning features like Face ID and Memoji on iOS.

Machine learning uses algorithms and statistical models to enable computers to perform tasks without explicit instructions. Instead, a model learns to make predictions or decisions from data, a process known as training. Training generally involves feeding large amounts of data into the algorithm so it can improve its accuracy over time, and it takes many forms: supervised learning on labeled data, unsupervised learning, neural networks, and so on. For example, large language models use a mixture of unsupervised pre-training, supervised fine-tuning and, later, human reinforcement when stealing the collective works of humanity.

Machine learning is used in mundane tasks like email filtering to catch spam, and in more exciting things like computer vision, such as identifying objects in photos. With the AI choo-choo express hype train, much of machine learning and neural networks is being rebranded as AI.

Machine learning requires a lot of computing power, and CPUs are not the most efficient at training or executing it. GPUs, by contrast, are massively parallel processors that can execute thousands of math operations in a single clock cycle, which makes them far better suited to the needs of machine learning.

Apple designed the Apple Neural Engine (ANE) to accelerate certain types of machine learning tasks, in both training and execution, accessed through Core ML.

It's essential to understand that Core ML, Apple's machine learning framework, doesn't exclusively use the ANE; it leverages the CPU, the GPU and, if present, the ANE. To quote Apple,

Apple's Cores for ML

"Core ML then seamlessly blends CPU, GPU, and ANE (if available) to create the most effective hybrid execution plan exploiting all available engines on a given device. It lets a wide range of implementations of the same model architecture benefit from the ANE even if the entire execution cannot take place there due to idiosyncrasies of different implementations." Apple.com - Deploying Transformers on the Apple Neural Engine

This means that when using Core ML, it will automagically use all the hardware it has available. The advantage of this approach is that developers don't have to worry about programming for various hardware configurations: if you use Core ML, you're likely getting the best performance regardless of the device the task is executed on.
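To make that concrete, here's a minimal Swift sketch of how a developer hands that decision over to Core ML. The model name is hypothetical; the computeUnits setting is essentially the only lever you get, and the ANE is never addressed directly.

```swift
import CoreML

// Load a (hypothetical) compiled model bundled with the app and let Core ML
// decide its own mix of CPU, GPU, and ANE.
let modelURL = Bundle.main.url(forResource: "SceneClassifier",
                               withExtension: "mlmodelc")!

let config = MLModelConfiguration()
config.computeUnits = .all                   // default: CPU + GPU + ANE (if present)
// config.computeUnits = .cpuAndNeuralEngine // keep the GPU free (iOS 16 / macOS 13+)
// config.computeUnits = .cpuOnly            // handy as a baseline when profiling

let model = try MLModel(contentsOf: modelURL, configuration: config)
// Predictions made through `model` now run on whichever engines Core ML
// chose; the developer never programs the ANE itself.
```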

Unlike, say, a GPU, there is no public framework for programming the ANE directly. There are some esoteric projects designed to measure Neural Engine performance, as well as not-so-esoteric ones like Geekbench ML, which does not seem to properly isolate the Neural Engine.

Apple has provided some graphs and has stated that the M1's Neural Engine could perform up to 11 trillion FP16 operations per second, the M2 and M3 Neural Engines process up to 15.8 trillion operations per second, and the M4's can do 38 trillion operations per second.

The ANE isn't just an accelerator for floating-point math; it's better thought of as a low-power optimizer that can be leveraged for certain types of ML tasks. For those tasks it's faster and uses far less memory and power, which is what makes on-device execution of machine learning practical.

NPUs

The ANE is not unique to Apple; it belongs to a broader category of chips known as neural processing units (NPUs) or AI accelerators. Neural processors can be found in the AI Engine of Qualcomm Snapdragons, the NPU of Samsung's Exynos, and the Da Vinci NPU of Huawei's Kirin. There's a common thread many readers probably noticed with the aforementioned chipsets: they are all ARM-based. The lack of NPUs for x86 has to do with several factors, the first of which is that x86 hasn't been found in extremely low-power devices like phones and wearables, where every watt counts. The second is the existence of exceptionally powerful dedicated GPUs in high-end computers. GPUs can perform the same operations as an NPU, and many more of them, making them more useful for both training and executing machine learning at the cost of a much higher TDP. The M4's ANE delivers 38 trillion operations per second, but a high-end Nvidia GPU can hit 1,300 trillion operations per second.

Another reason NPUs aren't typically found on x86 is that the tasks NPUs really excel at, like facial recognition and computational photography, don't really exist on desktop computers. Lastly, for serious AI tasks like model training, buying expensive GPUs or leasing compute time on cloud services with hardware acceleration is more effective than designing NPUs for x86.

However, we're seeing a shift in the role of machine learning on desktops with the rise of "AI" and ever more demand for the raw compute it requires. Windows 11's questionable Copilot+ branding requires an NPU capable of 40 trillion operations per second.

What exactly is an NPU used for?

Let's use a real-world example. Core ML is a foundation for Apple's computational photography. As everyone hopefully is aware by now, when you snap a photo there is no longer any such thing as "no filter": billions of operations are performed to process the image, everything from face detection to color balancing, noise reduction, Smart HDR, video stabilization, emulated depth of field in Cinematic mode, and scene analysis. All of this has to happen in real time, or near instantaneously. Rather than send all of those floating-point matrix operations to the CPU and GPU, the Neural Engine can take on much of the heavy lifting.
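As a rough sketch of one link in that chain, here's how an app might ask the Vision framework, which sits on top of Core ML, to find faces in a frame. Whether this particular request lands on the ANE, the GPU, or the CPU is decided by the system, not the developer.

```swift
import Vision
import CoreGraphics

// Run a face-detection request on a single image. The system routes the
// underlying model to whatever engine it deems best.
func detectFaces(in image: CGImage) throws -> [VNFaceObservation] {
    let request = VNDetectFaceRectanglesRequest()
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])
    return request.results ?? []
}
```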

These are incredibly dense operations. Scene analysis might sound simple, but Apple has developed an entire ecosystem around it called the Apple Neural Scene Analyzer, or ANSA. It's the backbone of features like the Photos app's Memories, where images are tagged, aesthetics are evaluated, duplicates and near-duplicates are detected, objects are identified, and locations are grouped. This all happens on-device, using a principle Apple calls differential privacy, where Photos learns about significant people, places, and events to create Memories while protecting users' anonymity. Exploring how Memories works could probably be an article in itself. While the feature makes extensive use of machine learning, it isn't dependent on the ANE alone; rather, the ANE assists in performing the analysis.

However, it's hard to evaluate how much of this chain actually runs on the ANE, due to how little information Apple has published; it's easy to find frustrated developers complaining about the lack of documentation. One of the main sources of information is "The Neural Engine — what do we know about it?"

The TLDR is that the Neural Engine is an on-device neural processing unit, part of Apple Silicon, that is leveraged for machine learning alongside the CPU and GPU. It's very good at certain math operations and is partly a power-saving mechanism, handling low-power workloads rather than waking a more power-hungry GPU.

Screenshot of Apple Watch Webpage

This is especially the case with the Apple Watch, which needs to be ultra-efficient. Since the Series 4, the Apple Watch line has included a stripped-down Neural Engine to assist with faster on-device processing of inputs. In its marketing material for the Series 9, Apple suggests the Neural Engine is even used for the double tap gesture.

It will be interesting to see how Apple leverages the ANE in the future. It seems increasingly likely that Apple will run some of its AI in the cloud. AI features are also very RAM-intensive: in a recent video, I demonstrated the limitations of 8 GB of RAM when an M1 Mac mini was bested by a 2013 Mac Pro. Apple may come to regret shipping low-RAM configurations.


This year's WWDC was heavily focused on Apple Intelligence, Apple's branding for AI, a term that gets more muddled by the day. Apple plans to bring AI to multiple fronts, running local models and escalating requests to the cloud when local isn't enough. There are a lot of questions about how well this strategy will work, and perhaps many of them will be answered by the time you read this. One minor reveal is that, as of recording, only M-series Macs and the A17 Pro are confirmed to support Apple's AI strategy.

There are plenty of posts and videos breaking down the features of Apple Intelligence. Still, as a refresher, they include generative text editing, generative AI for uninspired images and emoji (with one truly dystopian iPad example where a stylish sketch is turned into a soulless rendering), some very impressive natural-language interactions, and personalized notifications. It's unclear exactly which interactions happen on-device, but on-device services likely include dictation, personal context, and some of the text generation, by which I mean Siri responses. This, of course, will be revealed in the coming months. If executed well, it will be the most cohesive and useful AI strategy we've seen from any major company for everyday people, but I expect growing pains.

We should fully expect more emphasis on NPUs moving forward, but companies haven't managed to effectively communicate to consumers what NPUs are worth or what they do, and they are often cagey even towards developers. This certainly isn't the first time a coprocessor has been nebulous to its potential buyers, be it early GPUs, math coprocessors, or, if anyone remembers, the failed attempt at selling physics processing units for gaming.

Training and FP16

On Apple's AI page, the Neural Engine isn't mentioned as part of the chain used for on-device training. That's likely because the ANE is primarily optimized for executing (inferring with) machine learning models. One piece of evidence is that it only supports FP16, whereas GPUs and CPUs can work in FP32, the higher precision needed to apply the many small adjustments from the gradients calculated during backpropagation. CPUs and GPUs can also do mixed-precision training, where FP16 data is converted to FP32 when more precision is needed.
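Here's a toy Swift illustration of why that matters (not Apple's training pipeline, just arithmetic): a small gradient update survives in FP32 but is lost entirely to FP16 rounding. Float16 is available in Swift on Apple Silicon.

```swift
// A tiny weight update that FP32 can represent but FP16 cannot.
let weight32: Float = 1.0
let grad32:   Float = 0.0001
print(weight32 - grad32)        // 0.9999 — the update is applied

let weight16: Float16 = 1.0
let grad16:   Float16 = 0.0001
print(weight16 - grad16)        // 1.0 — the update vanishes in rounding
```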

To translate that back into human terms: NPUs in consumer devices are targeted at running existing models, not creating new ones. The ANE is not a tool for developers to create AI models with.

None of this should be a surprise. As I stated earlier in this article, anyone performing serious ML training typically has a very expensive GPU setup or leases cloud compute time.

Without going too deep into computer science: 1 bit can store two values, 2 bits can store four, 3 bits can store eight, and so on. 16 bits can store 65,536 values, and 32 bits can store 4,294,967,296.

For non-whole numbers, such as those with decimal points, one also needs to express where the point sits. For example, the digits 12345678 could represent 12.345678 or 123456.78. A floating-point format handles this by encoding the position of the point using a mantissa and an exponent; in essence, it allows the point to "float" to wherever it's needed.
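A quick Swift sketch of both ideas, counting values per bit width and pulling apart an FP16 number into its sign, exponent, and mantissa fields (1 + 5 + 10 bits):

```swift
// How many distinct values each bit width can hold.
print(1 << 16)              // 65536
print(UInt64(1) << 32)      // 4294967296

// Decompose an FP16 value into its three fields.
let x: Float16 = 12.34
let bits = x.bitPattern                 // raw 16-bit pattern (UInt16)
let sign     = bits >> 15               // 1 bit:  0 = positive
let exponent = (bits >> 10) & 0x1F      // 5 bits: where the point "floats"
let mantissa = bits & 0x3FF             // 10 bits: the significant digits
print(sign, exponent, mantissa)
```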

In machine learning, different bit depths are used, and 16-bit floating point (FP16) is popular because it offers a reasonable balance of accuracy, memory usage, and processing power. Models can be quantized from 32-bit to 16-bit, trading some accuracy for performance, a process closer to downsampling a 24-bit image to 8-bit than to simple rounding.
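A back-of-the-envelope Swift sketch of what that trade-off looks like: round a few made-up FP32 "weights" through FP16 and measure the error. Real Core ML quantization happens at model-conversion time, not like this.

```swift
// Push FP32 values through FP16 and back, then compare.
let weights: [Float] = [0.123456789, 1.9876543, -3.14159265, 42.4242]
let quantized = weights.map { Float(Float16($0)) }

for (w, q) in zip(weights, quantized) {
    print(w, q, "error:", abs(w - q))   // small but nonzero precision loss
}
```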

Apple now provides developers with the App Intents framework, which opens applications up to interactions performed by Siri using the personal-context awareness and action capabilities of Apple Intelligence. It lets developers integrate features built on Apple's pretrained models without having to create their own. How useful and widely adopted this will be remains to be seen.
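For the curious, here's a minimal sketch of what an App Intent looks like in Swift. The intent name and dialog are hypothetical; the point is simply that the app exposes an action for Siri and Apple Intelligence to invoke, while Apple's models handle the language understanding.

```swift
import AppIntents

// A hypothetical action an app might expose to Siri / Apple Intelligence.
struct OpenLatestInvoiceIntent: AppIntent {
    static var title: LocalizedStringResource = "Open Latest Invoice"

    func perform() async throws -> some IntentResult & ProvidesDialog {
        // App-specific work (fetching and showing the invoice) would go here.
        return .result(dialog: "Opening your latest invoice.")
    }
}
```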