It’s been a few years since machine learning and neural networks first started to be the hot new news topic. Ever since then, the market has transformed a lot and a lot of companies and the industry as a whole has shifted from a notion of “what can we do with this” to rather a narrative of “this is useful, we should really have it”. Although the market is very much far from being mature, it’s no longer in the early wild-west stages that we saw a few years ago.
A notable development in the industry is that there’s been a whole lot of silicon vendors who have chosen to develop their own IP instead of licensing things out – in a sense IP vendors were a bit behind the curve in terms of actually offering solutions, forcing in-house developments in order for their product not to fall behind in competitiveness.
Today, CEVA announces the new second generation of NeuPro neural networks accelerators, the new NeuPro-S. The new offering improves and advances the capabilities seen in the first generation, with CEVA also improving vendor flexibility and a new product offering that embraces the fact that a wide range of vendors now have their own in-house IP.
The NeuPro-S is a direct successor to last year’s first-generation NeuPro IP, improving on the architecture and microarchitecture. The core improvements of the new generation lie around the way the block now improves and handles memory, including new compression and decompression of data. CEVA quotes figures such as 40% reduces memory footprint and bandwidth savings, all while enabling energy efficiency savings of up to 30. Naturally this also enables for an increase in performance, claiming up to 50% higher peak performance in a similar hardware configuration versus the first generation.
Diving deeper into the microarchitectural changes, innovations of the new generation includes new weight compression as well as network sparsity optimisations. The weight data is retrained and compressed via CDNN via CEVA’s offline compiler and remains in a compressed form in the machine’s main memory – with the NeuPro-S decompressing in real time via hardware.
In essence, the new compression and sparsity optimisation sound similar to what Arm is doing in their ML Processor with zero-weight pruning in the models. CEVA further goes on to showcase the compression rate factors that can be achieved – with the factor depending on the % of zero-weights as well as the weight sharing bit-depth. Weight-sharing is a further optimisation of the offline compression of the model which reduces the actual footprint of the weight data by sharing finding commonalities and sharing them across each other. The compression factors here range from 1.3-2.7x in the worst cases with few sparsity improvements to up to 5.3-.7x in models with significant amount of zero weights.
Further optimisations on the memory subsystem level includes a doubling of the internal interfaces from 128-bit AXI to 256-bit interfaces, enabling for more raw bandwidth between the system, CEVA XM processor and the NeuPro-S processing engine. We’ve also seen an improvement of the internal caches, and CEVA describe the L2 memory utilisation to have been optimised by better software handling.
In terms of overall scaling of the architecture, the NeuPro-S doesn’t fundamentally change compared to its predecessor. CEVA doesn’t have any fundamental limit here in terms of the implementation of the product and they will build the RTL based on a customer’s needs. What is important here is that there’s a notion of clusters and processing units within the clusters. Clusters are independent of each other and cannot work on the same software task – customers would implement more clusters only if they have a lot of parallel workloads on their target system – for example this would make sense in an automotive implementation with many camera streams, but wouldn’t necessarily see a benefit in a mobile system. The cluster definition is a bit odd and wasn’t quite as clear whether it’s actually any kind of hardware delimitation, or the more likely definition of software operation of different coherent interconnect blocks (As it’s all still connected via AXI).
Within a cluster, the mandatory block is CEVA’s XM6 vision and general-purpose vector processor. This serves as the control processor of the system and takes care of tasks such as control flow and processing of fully-connected layers. CEVA notes that processing of ML models can be processed fully independently by the NeuPro-S system, whereas maybe other IPs need to still rely on maybe the CPU for some processing of some layers.
The NeuPro-S engines are naturally the MAC processing engines that add the raw horsepower for wider parallel processing and getting to the high TOPS figures. A vendor needs at minimum a ratio of 1:1 XM to NeuPro engines, however it may chose to employ more XM processors which may be doing separate computer visions tasks.
CEVA allows allow scaling of the MAC engine size inside a single NeuPro-S block, which ranges from 1024 8x8 MACs to up to 4096 MACs. The company also allows for different processing bit-depths, for example allowing 16x16 as it still sees the need for some use cases that take advantage of the higher precision 16-bit formats. There are also mixed format configurations like 16x8 or 8x16 where the data and weight precision can vary.
In total, a single NeuPro-S engine in its maximum configuration (NPS4000, 4096 MACs) is quoted as reaching up to 12.5 TOPS on a reference clock of 1.5GHz. Naturally the frequency will vary based on the implementation and process node that the customer will deploy.
As some will have noted in the block diagram earlier, CEVA also now allows the integration of third-party AI engines into their CDNN software stack and to interoperate with them. CEVA calls this “CDNN-Invite”, and essentially the company here is acknowledging the existence of a wide-range of custom AI accelerators that have been developed by various silicon vendors.
CEVA wants to make available their existing and comprehensive compiler and software to vendors and enable them to plug-in their own NN accelerators. Many vendors who chose to go their own route likely don’t have quite as extensive software experience or don’t have quite as much resources developing software, and CEVA wants to enable such clients with the new offering.
While the NeuPro-S would remain a fantastic choice for generic NN capabilitites, CEVA admits that there might be custom accelerators out there which are hyper-optimised for certain specific tasks, reaching either higher performance or efficiency. Vendor could thus have the best of both worlds by having a high degree of flexibility, both in software and hardware. One could choose to use the NeuPro-S as the accelerator engine, use just their own IP, or create a system with both units. The only requirement here is that a XM processor be implemented as a minimum.
CEVA claims the NeuPro-S is available today and has been licensed to lead customers in automotive camera applications. As always, silicon products are likely 2 years away.
Related Reading:
ncG1vNJzZmivp6x7orrAp5utnZOde6S7zGiqoaenZH51hJZrZpydppZ6orrNqKynm5Woeq%2Bx1KmpqKtdqLKku82dnp6mopbBqrvNZqWnZZml