Quadric Announces Llama2 LLM Support Immediately Available for Chimera GPNPUs
First Processor IP Core or Consumer Device Silicon Product Known to Run Llama2
Burlingame, CA – September 18, 2023 – Quadric® today announced that support for the Llama 2 large language model (LLM) is immediately available on its ChimeraTM general purpose neural processing unit (GPNPU) intellectual property (IP) core. Unlike other IP and semiconductor application processor suppliers, Quadric was able to add this support with a simple software port with no hardware changes, so existing designs can immediately run this model. Other suppliers have announced plans to change their hardware to offer support in 2024 or beyond.
Quadric Beats Other Timelines for Support
Meta introduced the Llama2 LLM for generative artificial intelligence (AI) on July 18 of this year. Coincident with the unveiling of Llama2, Meta and Qualcomm announced a partnership to port Llama2 to future Qualcomm Snapdragon chips expected in 2024 smartphones and laptops. Use of LLMs had previously been considered to only be viable in cloud datacenters. Meta’s announcement set off a flurry of activity from chip and IP providers to race to capture market attention – and investment dollars – for on-device LLM implementations.
In addition to Qualcomm’s announced timeline of greater than six months to port Llama2, fellow silicon company Mediatek announced four weeks later that it, too, was working on Llama2 support with an expectation that 2024 mobile phones powered by Mediatek flagship silicon would support Llama2. Also in August, IP licensor Ceva announced that a redesigned IP core would soon be delivered to its customers that could support LLMs. Cadence similarly unveiled new IP in September to support LLMs. Those Ceva and Cadence customers will need to license new cores and design new chips for delivery sometime in 2025 or 2026. Notably, all four chip and IP announcements directly or indirectly imply that silicon respins are required to gain this new capability.
“Why would these titans of the semiconductor and IP worlds need to wait until 2024 or 2025 or beyond to support today’s newest, hottest ML model?” noted Quadric CMO, Steve Roddy. “Why would a new machine learning model force a silicon respin – typically costing in excess of $100M – to be able to run new ML code? SoCs with Quadric’s Chimera GPNPU are ready to run Llama2 today!”
Why Not Just Port to the CPU?
Existing silicon chips for consumer devices all have leading-edge applications-class CPUs. Many have eight ostensibly high-performance CPUs. Why wouldn’t SoC vendors simply port Llama2 to the CPU, which can run any ML model? Clearly either the performance won’t meet consumer expectations, or the power consumed will kill device battery life. Hence the need to respin the ML accelerator to run LLMs. Or why not choose a fully programmable GPU from a leading ML vendor such as Nvidia? Perhaps the 10 Watt power dissipation and the need for a cooling fan prohibits use in a smartphone?
Like those easily programmable but power-hungry CPU and GPU solutions, Quadric’s GPNPU is also easily programmable. But only Quadric Chimera GPNPUs deliver programmability with the power-performance profile needed in today’s portable consumer devices.
Quadric’s processor architecture uniquely combines the best attributes of C++ programmability – the ability to run any ML model – with the performance efficiency of NPU accelerators found in many first-generation SoCs in the market today. But unlike inflexible accelerators that force silicon respins when complex new models such as Llama2 are invented, Chimera cores are fully programmable. Chimera GPNPUs run any model. All of the model – all of the layers. No removal of problematic layers. No partitioning. No forcing the data scientist to convert convolutions to adhere to the limited subset of conv types supported in hardware. Any model, any network, any operator.
Fast Porting, Low Effort
Quadric’s team invested a total of 13 engineer-weeks over a total of 4 elapsed weeks to port an INT8 quantized version of Llama2 to the Chimera platform and tune performance. Two new ML operator layers, and two variants of existing operator kernels, were coded in C++ by the Quadric applications team to get the model running. A further two engineer-weeks investment ironed out corner case performance and accuracy tweaks to ensure operation across all three sizes of the Chimera QB series processors (1 TOPs, 4 TOPs and 16 TOPs variants). Other machine learning inference solution providers with much larger teams are still struggling to meet six-month porting targets!
Impressive Performance
Quadric’s Chimera QB4 4 TOPs GPNPU running Llama2 15M delivers 225 Token/Sec/Watt efficiency in a 5nm technology, while occupying only 2.5 mm2.
For comparison, an M1 Pro laptop’s highest performance single CPU delivers only 11 Token/Sec/W running the same Int8 version of Llama2. (Int8 model running ONNX runtime). Quadric delivers 20X improvement in ML inference per Watt compared to a leading-edge CPU!
About Quadric
Quadric Inc. is the leading licensor of general-purpose neural processor IP (GPNPU) that runs both machine learning inference workloads and classic DSP and control algorithms. Quadric’s unified hardware and software architecture is optimized for on-device ML inference. Learn more at www.quadric.io.
|
Related News
- Quadric Announces Vision Transformer Support for Chimera GPNPUs
- Quadric's 3rd Generation Chimera GPNPU Product Family Expands to 864 TOPs, Adds Automotive-Grade Safety Enhanced Versions
- Quadric Presents and Demos AI+ML Chimera GPNPU at Embedded Vision Summit 2024
- Quadric's New Chimera GPNPU Processor IP Blends NPU and DSP into New Category of Hybrid SoC Processor
- Visit Quadric at CES to Discover the GPNPU that Solves The Biggest ML Inference Chip Design Challenges
Breaking News
- Ubitium Debuts First Universal RISC-V Processor to Enable AI at No Additional Cost, as It Raises $3.7M
- TSMC drives A16, 3D process technology
- Frontgrade Gaisler Unveils GR716B, a New Standard in Space-Grade Microcontrollers
- Blueshift Memory launches BlueFive processor, accelerating computation by up to 50 times and saving up to 65% energy
- Eliyan Ports Industry's Highest Performing PHY to Samsung Foundry SF4X Process Node, Achieving up to 40 Gbps Bandwidth at Unprecedented Power Levels with UCIe-Compliant Chiplet Interconnect Technology
Most Popular
- Cadence Unveils Arm-Based System Chiplet
- CXL Fabless Startup Panmnesia Secures Over $60M in Series A Funding, Aiming to Lead the CXL Switch Silicon Chip and CXL IP
- Esperanto Technologies and NEC Cooperate on Initiative to Advance Next Generation RISC-V Chips and Software Solutions for HPC
- Eliyan Ports Industry's Highest Performing PHY to Samsung Foundry SF4X Process Node, Achieving up to 40 Gbps Bandwidth at Unprecedented Power Levels with UCIe-Compliant Chiplet Interconnect Technology
- Arteris Selected by GigaDevice for Development in Next-Generation Automotive SoC With Enhanced FuSa Standards
E-mail This Article | Printer-Friendly Page |