IntuitionLabs
By Adrien Laurent

LLM Inference Hardware: An Enterprise Guide to Key Players

[Revised February 28, 2026]

Private LLM Inference: Key Hardware and Integrators for Enterprise

Executive Summary. The advent of large language models (LLMs) and generative AI has spurred massive demand for specialized inference hardware. Enterprises seeking to run powerful LLMs on-premises (for data control, latency, or cost reasons) turn to vendors offering high-performance servers, accelerators, and integrated systems. NVIDIA’s GPUs continue to dominate this space (~92% of the discrete GPU market in H1 2025 ([1])), but the competitive landscape is accelerating rapidly. Startups and established chipmakers alike have developed inference-focused accelerators—Cerebras, SambaNova, Tenstorrent, FuriosaAI, Positron, d-Matrix and others—to complement or challenge top GPUs. In a landmark move, NVIDIA acquired Groq’s LPU technology and engineering team for ~$20 billion in December 2025 ([2]), while hyperscalers are building custom silicon: Broadcom is co-designing chips for OpenAI ([3]), Meta completed its acquisition of RISC-V startup Rivos for ~$2 billion ([4]), Microsoft launched its Maia 200 inference accelerator ([5]), and Google made its Ironwood TPU generally available ([6]). Major system integrators (Dell, HPE, Lenovo, Super Micro, IBM, etc.) package these chips into turnkey AI servers. This report surveys the current landscape of private LLM inference hardware: leading companies, their products, performance and energy metrics, enterprise deployments, and emerging trends (like photonic computing). It details how NVIDIA/AMD/Intel GPUs are being supplemented or optimized, how specialized ASICs are tailored for inference, and how OEMs integrate them into systems. We present data on power, throughput, and market sizes (for example, the global AI inference market is projected to reach ~$255 billion by 2030 ([7])), along with case studies of real deployments (e.g. LG using FuriosaAI chips ([8]), SambaNova in government labs ([9]), and major server deals at Dell and HPE ([10]) ([11])). 
The report concludes with implications: the rising focus on energy efficiency (AI could consume 6.7–12% of US power by 2028 ([12])), data privacy drivers, and future directions (like photonic interconnects ([13]) ([14]) and RISC-V-based accelerators ([4])).

Introduction and Background

The explosive popularity of generative AI (e.g. ChatGPT) has created a market upheaval in enterprise IT. Companies want to leverage LLMs on their own data, often on-premises or close to the edge, for reasons of data privacy, latency, and predictable cost. However, running state-of-the-art models (ranging from dozens of billions to hundreds of billions of parameters) requires enormous compute resources. Traditional cloud solutions can be expensive or seen as insecure. As one analysis notes, AI workloads demand massive parallel processing and constant availability, which "do not align well" with conventional cloud pricing and scalability, thus pushing organizations toward hybrid or on-premises AI infrastructure ([15]).

To meet this, the hardware industry is rapidly evolving. Early AI development relied on GPUs (e.g. NVIDIA’s CUDA-accelerated cards) originally built for graphics or general compute. Today’s generation of GPUs (NVIDIA’s Hopper and upcoming Blackwell architectures; AMD’s Instinct series; Intel’s Arc and forthcoming data-center GPUs) remain a baseline solution for both training and inference. But the shift is clear: inference workloads (serving already-trained models) have different requirements than training. They often involve lower-precision math, need to maximize throughput for many parallel requests, and benefit greatly from specialized optimizations. Recognizing this, many players are developing inference-optimized chips and systems ([16]).

As AP News reports, experts already see a pivot: “the market is now shifting towards AI inference chips, which are more efficient for the everyday use of AI applications” ([17]). Startups and incumbents are targeting inference-specific efficiency gains to enable enterprises to run LLMs without huge power and cooling overhead. For example, Toronto’s Untether has launched an inference-focused chip (the “240 Slim”) for edge devices and data centers ([18]), while FuriosaAI (Korea) is unveiling an LLM-scale inference server (powered by its RNGD “Renegade” chips) that consumes only ~3 kW vs ~10 kW for a comparable NVIDIA DGX system ([19]). D-Matrix is introducing an innovative 3D memory design (Pavehawk) specifically to accelerate inference by co-locating compute and memory ([20]). Even giants like IBM and Intel explicitly emphasize inference: IBM’s new Power11 system is designed to “help businesses implement AI efficiently to improve operations,” rather than raw training power ([21]), while Intel is unveiling specialized inference GPUs (e.g. Crescent Island with 160 GB memory ([22])) and software stacks (Battlematrix with LLM-Scaler 1.0) aimed at real-time model serving ([23]).

The imperative for inference efficiency is underlined by energy concerns. A DOE-backed report warns that AI could drive U.S. data centers to consume ~12% of national power by 2028 ([12]), given that AI workloads (and particularly GPU-accelerated servers) have already doubled data-center energy use since 2017 ([24]). This makes specialized hardware not just a performance play, but also a sustainability one. In many cases, enterprises find that instead of huge LLMs requiring massive racks of GPUs, more compact, domain-specific models on efficient hardware suffice (so-called Small Language Models, SLMs ([25])). A McKinsey survey cited by analysts shows some companies shifting to smaller, task-specific models running on affordable hardware to control costs while maintaining quality ([25]).

Against this backdrop, the next sections detail the key hardware players and solutions:

  • GPUs and Mainstream Accelerators: NVIDIA, AMD, Intel, and other major vendors;
  • AI-Dedicated Accelerators: specialty chips from Graphcore, Cerebras, Groq, SambaNova, Tenstorrent, etc.;
  • Emerging Startups: new inference chip entrants like FuriosaAI, Positron, Untether, D-Matrix, Lightmatter, etc.;
  • System Integrators: OEMs and server vendors (Dell, HPE, Lenovo, Supermicro, IBM, etc.) who package these chips into deployable systems;
  • Networking & Storage: supplementary hardware (e.g. Cisco’s AI interconnect) needed for large-scale deployments;
  • Case Studies: concrete examples of enterprise/private AI deployments;
  • Market and Tech Trends: data on shipments, market size, investments, and forecasts;
  • Implications and Future Directions: analysis of energy, geopolitical, and technological implications (e.g. photonic chips, RISC-V adoption).

Throughout, we cite industry reports and news sources. Table summaries compare vendors and products, and we highlight performance, capacity, power, and cost where data are available.

GPU-Based Solutions

NVIDIA’s Dominance and Proliferation

NVIDIA’s GPUs remain the workhorse for AI inference in enterprise data centers. With a reported ~92% of the discrete GPU market (as of H1 2025) ([1]), NVIDIA’s hardware is ubiquitous. The company’s full-year FY2026 revenue reached a record $215.9 billion (up 65% YoY), with data center revenue alone at $193.7 billion ([26]). The Blackwell generation (B200) began shipping in Q4 FY2025 and has ramped to high volume, with the DGX B200 (8× B200 GPUs, 192 GB HBM3e each, 20 PFLOPS sparse FP4) now broadly available. The follow-on Blackwell Ultra (B300) shipped in H2 2025, delivering 288 GB HBM3e per GPU and 15 PFLOPS dense FP4 — a 55.6% uplift over the B200 ([27]). Dell Technologies has announced PowerEdge servers embedding up to 192 Blackwell Ultra GPUs in air- or liquid-cooled configurations, claiming up to 4× training throughput improvements over previous generations ([28]).

Looking further ahead, NVIDIA CEO Jensen Huang announced the Vera Rubin platform at CES 2026 and confirmed it is already in production ([29]). The Rubin GPU features 336 billion transistors, 288 GB HBM4, 22 TB/s bandwidth, and delivers 50 PFLOPS NVFP4 inference per GPU. The Vera Rubin NVL72 rack promises 3.6 EFLOPS dense FP4 inference — a 3.3× improvement over B300 NVL72 — with claims of 5× more inference performance and 10× lower cost per token compared to Blackwell. Cloud deployments from AWS, Azure, Google Cloud, and others are expected in early-to-mid 2026, with Rubin Ultra following in H2 2027 ([30]).

Large customers reinforce NVIDIA’s position. Bloomberg News reported that Dell is close to a $5 billion deal to sell NVIDIA GPU-powered AI servers to Elon Musk’s AI startup (xAI) to expand its “Colossus” supercomputer to over one million GPUs ([10]). Similarly, Hewlett Packard Enterprise (HPE) secured a $1 billion contract to provide NVIDIA-accelerated servers to Musk’s social network X ([11]). These mega-deals underscore the explosive demand for GPU servers in large-scale AI deployments.

NVIDIA’s ecosystem advantages (CUDA software stack, TensorRT, etc.) create high switching costs ([1]). Vendors continue to launch NVIDIA-centric solutions: for example, HPE’s recent product lineup includes new ProLiant servers (2U and 4U rack-mount) supporting the latest NVIDIA RTX PRO 6000 “Blackwell” GPUs ([31]), aimed specifically at generative AI and inference workloads. These servers integrate NVIDIA’s full AI software stack with HPE’s GreenLake private-cloud offerings to simplify on-prem AI deployments ([32]).

Performance and Power: NVIDIA’s top inference cards (H100/H200) deliver hundreds of teraflops of FP16/INT8 compute, paired with large memory capacity (the H200 carries ~141 GB of HBM3e ([33])). However, these units consume kilowatts when scaled; for example, a full NVIDIA DGX H100 rack draws >10 kW. By contrast, some specialized systems undercut this: FuriosaAI’s Renegade server with 4 PFLOPS FP8 reportedly consumes only 3 kW, allowing five such systems per rack versus only one DGX H100 ([19]). NVIDIA’s DGX claims ~180 tokens/sec on LLaMA 3.1 (8B) per DGX node, whereas Positron’s Atlas claims 280 tokens/sec on the same model for less than half the power ([34]). These comparisons illustrate why enterprises are examining alternatives to raw GPU scale.
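The efficiency gap can be made concrete with a back-of-envelope calculation using only the throughput and power figures quoted above (a rough sketch; real results vary with batch size and model configuration):

```python
# Tokens-per-watt from the system-level figures quoted above
# (DGX: ~180 tok/s at ~5,900 W; Positron Atlas: 280 tok/s at 2,000 W,
# both on Llama 3.1 8B per the cited comparison).

def tokens_per_watt(tokens_per_sec: float, watts: float) -> float:
    """Throughput normalized by total system power draw."""
    return tokens_per_sec / watts

dgx = tokens_per_watt(180, 5900)
atlas = tokens_per_watt(280, 2000)

print(f"DGX:   {dgx:.3f} tok/s per W")
print(f"Atlas: {atlas:.3f} tok/s per W")
print(f"Ratio: {atlas / dgx:.1f}x")  # ≈ 4.6x on these figures
```

On these particular numbers the Atlas comes out at roughly 4.6× the tokens per watt, which is why perf-per-watt, not raw TFLOPS, is increasingly the headline metric for inference hardware.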

AMD and Other GPU Options

AMD’s GPU lineup (Instinct MI series) has matured into a genuine competitor, especially in cloud/hyperscale contexts. At its Advancing AI 2025 event (June 2025), AMD launched the MI350X and MI355X on the CDNA 4 architecture, with up to 288 GB HBM3e, native FP4/FP6/FP8 support, and claims of up to 35× faster inference versus the prior generation ([35]). Pegatron previewed a rack with 128 MI350X GPUs delivering 1,177 PFLOPS (FP4) ([36]), and Oracle has committed to zettascale clusters of up to 131,072 MI355X GPUs ([37]). AMD’s prior-gen MI300 (CDNA 3) offers ~383 TFLOPS FP16 and 6.55 TB/s bandwidth ([38]), and Dell, HPE, Lenovo, and Supermicro all now offer AMD-based AI servers, often in liquid-cooled configurations.

A major validation: in December 2025, OpenAI took an approximately 10% stake in AMD to secure GPU supply, alongside production workloads already running on MI300X via Azure. AMD’s MI400 series (CDNA “Next”) is on the 2026 roadmap, promising up to 40 PFLOPS FP4, 432 GB HBM4, and 19.6 TB/s bandwidth per GPU, paired with the “Helios” AI rack integrating EPYC Venice CPUs. AMD projects 10× better inference vs MI355X for mixture-of-experts models, explicitly positioning it as an alternative to NVIDIA ([39]). AMD also acquired the entire engineering team from Untether AI (June 2025) to bolster its AI compiler and SoC capabilities ([40]).

Intel’s AI Accelerators

Intel, long a CPU giant, has renewed AI commitments via GPUs and CPUs. Its Arc Alchemist GPUs (and upcoming data-center variants) are being featured in HPC workstations and servers. Notably, Intel’s Project Battlematrix workstations pair multiple Arc Pro GPUs with Xeon CPUs and a Linux-based “LLM Scaler” software stack. Early reports claimed these enhancements can make inference 4.2× faster on certain LLMs ([41]). These workstations (up to 8× Arc Pro B60 GPUs, 24 GB each) target on-prem AI developers and content creators, with prices around $5–10k.

More significantly, Intel recently unveiled “Crescent Island” – a forthcoming inference-only data-center GPU with the new Xe3P architecture and a massive 160 GB onboard memory ([22]). This design (one 640-bit memory chip or dual 320-bit chips) is clearly aimed at inference: the large memory footprint and focus on power/cost efficiency in air-cooled servers suggest a shot at replacing smaller GPU clusters for large models. Product sampling of Crescent Island is expected by late 2026. If successful, this will put Intel squarely into the inference hardware arms race (Intel positions itself as complementary to NVIDIA rather than a GPU-within-servers competitor).

Intel also provides AI accelerators via its Gaudi 3 chips (via acquisition of Habana Labs). At the OCP Global Summit 2025, Intel unveiled a rack-scale reference design for Gaudi 3 with 64 accelerators, 8.2 TB HBM total, and liquid cooling. Intel positions Gaudi 3 as its current enterprise AI solution while next-gen products like Crescent Island are developed. In a notable partnership shift, Intel entered a multi-year collaboration with SambaNova (February 2026), combining Intel's Xeon CPUs and foundry capabilities with SambaNova's AI software stack, and Intel Capital participated in SambaNova's $350M funding round ([42]). Early customers include ISVs and OEMs building AI workstations – for example, alliances with PNY and Supermicro bundle Arc GPUs in high-end workstations and nodes.

AI Accelerator Companies

Beyond general GPUs, a wave of specialized AI chip companies has sprung up, building new architectures specifically for inference. These often use novel approaches (e.g. dataflow chips, language-processing units, wafer-scale integration) claiming higher efficiency or performance for LLM serving. Key players include:

  • Graphcore (UK / SoftBank) – Offers Intelligence Processing Units (IPUs). Their IPU chips (e.g. Colossus GC200) use massive on-chip SRAM (“scratchpad” memory) and support storing the entire model in the chip ([43]), enabling very high parallelism for inference tasks. SoftBank acquired Graphcore in July 2024 for approximately $500M ([44]). With the subsequent November 2025 acquisition of Ampere Computing, SoftBank now controls what it calls a “Silicon Trinity”: Arm Holdings (architecture), Ampere (server CPUs), and Graphcore (IPUs). The integration strategy aims to combine Ampere’s power efficiency with Graphcore’s parallel processing for deployment in SoftBank’s Stargate hyper-scale data centers in the US and Japan, expected from 2026. No new IPU chip generation has been released since the acquisition; the current product line remains the Colossus Mk2 GC200.

  • Cerebras Systems (USA) – Known for the Wafer-Scale Engine (WSE), currently the largest chip ever made (full wafer, ~46,000 mm², ~4 trillion transistors). Their WSE-3 processor (introduced 2024, 5nm, 900,000+ compute cores) doubles the performance of WSE-2 and consumes roughly the same power ([45]). Cerebras has pivoted strongly to inference, launching an inference tool in August 2024 that lets developers run large models cost-effectively ([46]). Major developments have accelerated since late 2025: in January 2026, OpenAI announced a partnership with Cerebras to provide 750 MW of compute through 2028, in an estimated $10 billion+ deal ([47]). Cerebras also announced six new data centers across North America and Europe, expanding aggregate inference capacity 20×. Perplexity and Mistral AI (Le Chat) are active inference customers. Cerebras raised $1B at a $23B valuation in late 2025 and is refiling for a Q2 2026 IPO after its initial September 2024 filing was delayed by CFIUS review of its relationship with UAE-based investor G42 ([48]). Earlier enterprise clients include G42, which purchased Cerebras supercomputers, and the UAE’s Stargate data center partnership ([49]). Thus, Cerebras offers one extreme — huge chips in specialized systems — as an alternative to many smaller GPUs, and is now one of the most valuable private AI chip companies.

  • SambaNova Systems (USA) – Provides a reconfigurable dataflow architecture. SambaNova’s DataScale platform integrates its RDU (Reconfigurable Dataflow Unit) accelerators, and the company has seen adoption at Los Alamos National Lab, SoftBank, and Accenture ([9]). In a major leap forward, SambaNova unveiled its SN50 chip on February 26, 2026, claiming it is the fastest chip for agentic AI workloads: 1.6 PFLOPS FP16, 3.2 PFLOPS FP8, with a three-tier memory architecture supporting models with up to 10 trillion parameters and 10-million-token context lengths ([42]). SoftBank signed on as the first SN50 customer. SambaNova raised $350M+ alongside the SN50 announcement, with investors including Intel Capital, Vista Equity, and Cambium Capital. Intel entered a multi-year hardware-software co-design collaboration. Other partnerships include Hume AI (multilingual speech-language models) and Argyll Data Development for the UK’s first renewable-powered AI inference cloud in Scotland. SambaNova’s pitch remains an end-to-end enterprise AI platform: hardware, software, and models all tuned together, and with the SN50 it is positioning directly against NVIDIA’s Blackwell on price-performance.

  • Groq (USA / now NVIDIA) – Founded by ex-Google engineers, Groq developed a simplified, deterministic “Language Processing Unit” (LPU) custom ASIC. Its architecture removes many layers (caches, multithreading) to maximize inference throughput and predictability, delivering inference at ~10× the speed of typical GPUs in some workloads while using roughly one-third the power ([50]). Groq saw massive funding throughout 2024–2025: $640M in Aug 2024 (valuation $2.8B ([51])), $750M in Sept 2025 (valuation $6.9B ([52])), plus a $1.5B commitment from Saudi Arabia, and a European data center launch in Helsinki ([53]). In a blockbuster move on December 24, 2025, NVIDIA finalized a ~$20 billion licensing-and-acqui-hire agreement with Groq — NVIDIA’s largest deal ever ([2]). Under the deal, NVIDIA gains a non-exclusive license to Groq’s LPU technology, and Groq co-founder/CEO Jonathan Ross and most senior engineers join NVIDIA to lead a new Real-Time Inference division. Groq continues as a nominally independent company under new CEO Simon Edwards, but the deal effectively brings Groq’s inference innovation under NVIDIA’s umbrella, aiming to “integrate Groq’s low-latency processors into the NVIDIA AI factory architecture.”

  • Tenstorrent (USA/Canada) – Led by CPU guru Jim Keller, Tenstorrent sells AI processors built on chiplets and RISC-V. At DevDay in April 2025, Tenstorrent launched the Blackhole product line: the Blackhole chip delivers 745 TFLOPS FP8, 372 TFLOPS FP16, with 32 GB GDDR6 and 1 TBps total Ethernet bandwidth ([54]). Notably, pricing is developer-friendly: the Blackhole p100 starts at $999, the p150 at $1,399, and the TT-QuietBox (4× Blackhole, liquid-cooled workstation) at $11,999. The enterprise-grade Blackhole Galaxy (32 Blackhole chips in a 4×8 mesh) delivers 23.8 PFLOPS FP8 with 1 TB memory. Tenstorrent’s strategy extends beyond hardware: it is licensing its Ascalon RISC-V CPU and Tensix AI cores as IP, with contracts secured from LG Electronics, Hyundai Motor Group, and Samsung Electronics totaling ~$150M. Partnerships also include Bosch (automotive AI) ([55]) and Rapidus (Japan). Funding reached $693M Series D in December 2024 (at ~$2.6B valuation), with reports of an $800M round at $3.2B in late 2025. Tenstorrent also announced a China expansion (December 2025) via a partnership with former Arm China CEO Allen Wu, targeting RISC-V AI and HPC markets ([56]).

  • Untether (Canada) — Acquired by AMD – Untether focused on RISC-V-based inference chips for edge and data-center use. Its 240 Slim chip (announced Oct 2024) was tailored for efficient execution of fixed models with high performance at reduced energy consumption ([57]), and Mercedes-Benz had been collaborating on autonomous vehicle applications. However, in June 2025, AMD acquired Untether AI’s entire engineering team, and the company ceased independent operations. The speedAI 240 Slim product and imAIgine SDK are no longer supplied or supported ([40]). The Untether team now focuses on AMD’s AI compiler, kernel development, and SoC design capabilities, effectively folding promising inference chip talent into a major GPU platform.

  • FuriosaAI (South Korea) – A startup (LG-backed) that designs dedicated inference chips. In Sept 2025, Furiosa unveiled the RNGD Server, powered by its RNGD “Renegade” chips (5nm, dual HBM3 memory) ([58]) ([59]). Remarkably, each 4-chip RNGD server delivers ~4 PFLOPS FP8 at only 3 kW power draw ([60]), enabling roughly five such servers in the rack space and power budget of a single DGX-class system. Performance claims include real-time inference of OpenAI’s GPT-OSS 120B model during an OpenAI event ([59]). LG has already adopted RNGD hardware for its own EXAONE LLM, reportedly achieving 2× inference performance per watt vs GPU-based setups ([8]). Furiosa has strong funding and is now seeking $300M–$500M in a Series D round (with Morgan Stanley and Mirae Asset Securities advising) ahead of a targeted 2027 IPO ([61]). The company notably rejected an $800M acquisition offer from Meta earlier in 2025. Monthly RNGD chip production stands at ~1,000 units, with plans to reach 2,000–3,000/month by end of 2026. The roadmap includes a Renegade+ Max variant (H2 2026) with HBM3E and ~144 GB total memory capacity. Its SDK 2025.3 release added multi-chip tensor parallelism and support for Qwen 2/2.5 models. Furiosa exemplifies the new wave: a company explicitly building sustainable, rack-optimized AI servers for enterprises, decoupling from GPU power limitations ([19]) ([59]).

  • Positron AI (USA) – Founded in 2023, Positron builds inference accelerators and has rapidly achieved unicorn status. Its first product, Atlas, packs eight custom “Archer” ASICs into a single system. Positron claims Atlas beats NVIDIA’s DGX H200 in efficiency: on Llama 3.1 8B, Atlas achieved 280 tokens/sec at 2000W, whereas a DGX (8× H200) delivered ~180 tokens/sec at 5900W ([34]). Atlas is in production and shipping to customers including Cloudflare ([62]). In February 2026, Positron raised a $230M Series B at a valuation exceeding $1 billion, with investors including QIA, Arm Holdings, Arena, and Jump Trading ([63]). The planned next-gen system, Asimov (tape-out targeted for end of 2026), will carry roughly 2 TB of memory per chip (up to 2,304 GB per device) and claims 5× more tokens per watt versus NVIDIA’s upcoming Rubin GPU. Positron highlights “made-in-USA” aspects and directly positions itself against NVIDIA’s high-end inference platforms.

  • d-Matrix (USA) – A startup tackling the memory bottleneck in AI. d-Matrix’s “Corsair” platform became broadly available in Q2 2025, shipping as a PCIe card with two custom digital in-memory compute (DIMC) chips, 2 GB SRAM, and 256 GB LPDDR5 memory pool ([20]). The Pavehawk test silicon (3D DRAM stacked under the logic die) has been validated in labs, claiming 10× the bandwidth and 10× the energy efficiency per stack compared to HBM4 ([64]). The next-gen Raptor product will be the first commercial deployment of d-Matrix’s 3DIMC technology, developed in collaboration with Alchip. d-Matrix raised $275M in November 2025 ([65]), bringing total funding past $435M, with Microsoft and others as key backers. Supermicro integrates d-Matrix chips into its servers ([66]). The strategy — keeping data “close” to compute — addresses inference’s memory-bound nature, and with shipping hardware d-Matrix has moved from concept to production.

Two hyperscaler-built accelerators also merit attention. Microsoft's Maia 200 (unveiled January 2026) is a 3nm inference accelerator with 216 GB HBM3e, 7 TB/s bandwidth, and over 10 PFLOPS at FP4, already deployed in Azure powering Microsoft 365 Copilot and OpenAI's GPT-5.2 models ([5]). Google's Ironwood TPU (seventh-generation, GA in November 2025) is an inference-first design scaling to 9,216 chips / 42.5 ExaFLOPS per pod, with 192 GB HBM3E per chip ([6]). While these are primarily available through their respective clouds, they signal that hyperscalers are designing chips specifically for the inference workload rather than relying solely on NVIDIA.

These companies (and several others) represent a paradigm shift: from general-purpose GPUs to domain-specific inference hardware. They claim that by tailoring dataflow, precision formats, or memory, they can achieve the same LLM throughput at a fraction of power or cost. For enterprises, this diversity means options: one can choose fast GPUs (NVIDIA/AMD) or experiment with these accelerators when they become available. In practice, many organizations test multiple architectures via cloud trials or small on-prem clusters. Over time, it is expected that inference-specific accelerators will capture a significant share of workloads, possibly eclipsing training-oriented GPUs in volume.
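One concrete reason precision formats matter: weight memory scales linearly with bytes per parameter, which determines how many accelerators a model needs just to hold its weights. A rough sketch (the parameter counts are illustrative; KV-cache and activation memory add further overhead on top):

```python
# Weight-memory footprint of an LLM at different precisions -- the
# arithmetic behind inference memory-capacity claims. Parameter counts
# are illustrative; KV-cache and activations add further overhead.

BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}

def weight_gb(n_params: float, precision: str) -> float:
    """Gigabytes needed just to hold the model weights."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

for name, params in [("8B", 8e9), ("70B", 70e9), ("405B", 405e9)]:
    row = ", ".join(f"{p}={weight_gb(params, p):g} GB" for p in BYTES_PER_PARAM)
    print(f"{name}: {row}")
```

At FP16, a 70B model’s weights alone (~140 GB) nearly fill an H200’s 141 GB; a 405B model needs sharding at any of these precisions, which is why large per-device memory and aggressive low-precision formats are pursued together.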

System Integrators and OEM Solutions

While chip designers create the silicon, system integrators and OEMs turn these chips into usable products for enterprises. These companies package accelerators into servers, manage cooling, networking, and provide software stacks. Key integrators include:

  • Dell Technologies – Offerings: PowerEdge AI servers with NVIDIA GPUs (Blackwell/Blackwell Ultra, up to 192–256 GPUs per rack), plus AMD MI325X/MI355X options and DGX-class systems; Q1 FY26 AI server orders hit $12.1B, with ~$20B in AI server shipments projected for FY2026 ([67]). Notable deployments: a ~$5B deal to equip Elon Musk’s xAI with NVIDIA GB200 GPU servers ([68]); Dell/HPE containerized systems for X’s (Twitter’s) AI supercomputers.

  • HPE (Hewlett Packard Enterprise) – Offerings: ProLiant servers (2U/4U) integrating up to 8× NVIDIA RTX PRO 6000 GPUs ([31]), ASIC options, GreenLake private-cloud AI offerings, and consulting/turnkey deployments. Notable deployments: confirmed $1B+ in HGX server sales to Elon Musk’s X (per Bloomberg) ([11]); a new “simple AI” lineup announced in mid-2024 for mainstream businesses ([69]).

  • Lenovo – Offerings: AI rack servers and GPU servers produced in India (50,000 annually) ([70]); RDHx-cooled and conventional systems with NVIDIA/AMD GPUs. Notable deployments: caters to enterprise and hyperscale; no specific publicized AI project beyond the manufacturing announcement.

  • Super Micro Computer – Offerings: a wide range of custom rack servers with NVIDIA/AMD accelerators; a pioneer in direct liquid cooling (DLC); quick-turn OEM partner with NVIDIA/AMD. Notable deployments: high-volume shipments of 100,000 GPUs per quarter in late 2024 ([71]); supplied server racks for Elon Musk’s xAI supercomputer ([72]); recently added to the S&P 500.

  • IBM – Offerings: Power11-based AI servers (using new Power11 CPUs and the Spyre AI coprocessor) ([73]), plus Telum-equipped mainframes; emphasizes inference-focused, secure, high-uptime architectures. Notable deployments: targets banking, healthcare, and government; IBM cites customers needing mission-critical hybrid AI with 99.9999% uptime ([74]); major cloud vendors also use IBM Power (NVIDIA on Power).

  • Cisco Systems – Offerings: networking hardware for AI, including the P200 AI data-center interconnect chip (replacing 92 chips at 65% less power) to link dispersed AI data centers ([75]); Nexus switches with SmartNICs, in partnership with NVIDIA, to accelerate AI traffic. Notable deployments: P200 customers include Microsoft and Alibaba ([76]), which need inter-datacenter AI networks; Cisco also integrates NVIDIA GPUs into its UCS servers for AI, though with less publicity.

  • Other OEM/ODMs – Foxconn, Quanta, Wistron, QCT, etc. build servers for cloud and enterprise using the above components; Chinese ODMs (Inspur, Wingtech) likewise produce AI servers, often with custom accelerators (e.g. Huawei Ascend). Notable deployments: some defense/government supercomputers (e.g. in China) use Huawei’s AI servers with Ascend chips; major telecoms and financial firms leverage these integrators.

Modern AI servers are essentially clustered GPU/accelerator nodes, often with specialized cooling (liquid or advanced air). For example, Dell’s new systems have options for liquid cooling up to 256 GPUs per rack ([77]), addressing the power density of LLMs. HPE’s designs focus on modular simplicity (e.g. 2U with 2 GPUs or 4U with 8 GPUs ([31])). Supermicro’s flexibility helped it outpace rivals (as noted when it entered the S&P 500) by rapidly launching AI servers as soon as new GPU generations release.

Beyond hardware specs, integrators offer software and services. NVIDIA’s own NVIDIA AI Enterprise software suite is often bundled, or similar stacks from AMD and Intel. Consulting (e.g. HPE GreenLake AI Advisory) is common for on-prem AI deployments. Many enterprises contract these vendors for integration, maintenance, and even co-management of on-prem AI platforms.

Example Deployments

  • Enterprise Labs and Research: Government labs and large research organizations have adopted these systems. For instance, the U.S. Department of Energy’s AI research centers use HPE and Dell servers with AMD/NVIDIA accelerators. Los Alamos National Lab is reported to have deployed SambaNova’s AI system (with Samba-1 model) to handle classified document analysis and simulation tasks ([9]). In Europe, CERN and others use powerhouses like HPE/Dell with NVIDIA GPUs. These showcase that on-prem inference is critical for sensitive R&D.

  • Corporate Data Centers: Tech companies building private AI clouds often choose these integrators. The xAI (formerly Twitter AI lab) supercomputer uses Dell/HPE servers with NVIDIA chips to host LLMs (cf. Dell’s $5B deal ([68]) and HPE’s $1B X deal ([11])). Financial firms (banks, insurance) quietly build similar clusters (e.g. Goldman Sachs and Morgan Stanley are reported to run large private GPU clusters, though the exact hardware is proprietary). Large manufacturers use HPE or IBM servers to run AI for design and voice assistants.

  • On-site AI Appliances: Some solutions are sold as appliances. For example, C3.ai (an enterprise AI software provider) announced in 2025 an “AI Workstation” appliance with NVIDIA GPUs (4× H100) for running enterprise LLM workloads on-prem (combining LLM inference with data management). Similarly, Cisco has integrated GPUs into its UCS X-series servers marketed for AI workloads. These appliances reflect demand from enterprises wanting box-and-cabinet solutions rather than custom builds.


Technology and Performance Analysis

To compare these options, we consider key metrics:

  • Throughput: In practice, inference throughput is often measured in tokens per second (TPS). For example, on Llama 3.1-8B (a common benchmark), NVIDIA’s DGX H200 system is quoted at ~180 TPS. Positron’s Atlas claims ~280 TPS ([34]), while Groq’s LPUs claim many times faster on similar tasks ([50]). In tests, Groq’s LPUs reportedly deliver 5–10× speedups vs GPUs on the same models, due to their streamlined pipelines ([50]). IBM Power11 (with the future Spyre AI core) has not been publicly benchmarked on LLMs yet, but IBM positions it to yield significant improvements in enterprise inferencing.

  • Latency & Batch Size: Many accelerators specialize in low latency (even at the cost of smaller-batch throughput). For enterprise chat or interactive tasks, response time is crucial. Companies like NVIDIA optimize their TensorRT engines; others like Groq reduce cycles by eliminating overhead. A large GPU cluster might achieve high TPS with large batches, but specialized chips can serve predictions faster on smaller batches.

  • Power Efficiency: This is a standout differentiator. For instance, Furiosa’s RNGD server (4 PFLOPS) uses only 3 kW ([60]), whereas an equivalent NVIDIA cluster would need ≈10 kW. Positron’s Atlas achieves ~3× the tokens per watt of an NVIDIA system ([34]). Groq’s LPUs draw about one-third the power of comparable GPUs ([50]). Even Cisco’s P200 network chip saves 65% power by collapsing multiple older chips into one ([75]). These gains translate directly to operational cost savings in enterprise data centers, which must often provision substantial backup power and cooling.

  • Capacity (Memory): LLM inference is often memory-constrained (the model weights and activations must fit somewhere). Solutions vary: NVIDIA’s H200 has 141 GB of HBM3e per GPU, and Intel’s Crescent Island promises 160 GB. Cerebras’s wafer-scale WSE-3, with on-wafer memory plus attached external memory, goes far beyond any single GPU. SambaNova and Graphcore each provide large on-chip or attached memory (over 100 GB per chip) to accommodate huge models. Positron’s future Asimov promises a remarkable 2 TB per chip ([62]). Specialized chips thus lead in memory capacity for single-model execution, whereas typical GPU clusters shard models across nodes. Enterprises must consider whether their workload requires one huge model per server (favoring chips like the WSE) or can be split across many GPUs.

  • Scalability and Software Ecosystem: GPU clusters benefit from mature stacks (NVIDIA Triton, ONNX, TensorRT, etc.). Many specialized chips ship their own SDKs (e.g. Graphcore’s Poplar, Groq’s compiler, SambaNova’s SambaFlow). Compatibility with industry frameworks is a key criterion. For instance, Positron emphasizes compatibility with OpenAI APIs so enterprises can migrate workflows. Groq even boasts of running Meta’s LLaMA model unchanged on its chips ([78]). Intel uses its xPU software stack for Arc GPUs and is building oneAPI extensions. Ultimately, enterprises often use a mix: GPUs for broad framework support, and accelerators for targeted inference gains where the ecosystem supports their model.

  • Cost: Public data on enterprise pricing is scarce, but estimates exist. Positron’s Atlas, for example, is projected to deliver 3× better cost-per-token than competing DGX systems ([34]) (though its purchase price is not published). Cerebras advertises inference pricing as low as 10 cents per million tokens ([79]). NVIDIA GPUs are expensive (each H100 retails for ~$40,000+), and a single fully loaded rack can cost $1–2M. In contrast, many inference accelerators aim to cut total cost (e.g. Groq’s LPUs trade higher per-chip cost for much less hardware overall). The Dell and HPE deals (multi-billion-dollar scale) imply custom discounted pricing for bulk corporate orders. Overall, any hardware selection is a capex-opex tradeoff: Dell’s new 192-GPU rack systems likely cost millions each ([77]), whereas an alternative like five RNGD servers (replacing one DGX) also tallies into the millions but saves hundreds of kW of electricity ([60]).

  • Market Forecasts and Investments: Investors clearly believe inference is big business. The global AI inference market is now estimated at $103–106 billion for 2025 and projected to reach ~$255 billion by 2030 ([7]). In part, this is due to the vast installed base of trained LLMs that need serving. The huge sums being invested — NVIDIA’s $20B Groq acquisition, Cerebras’s $23B valuation, OpenAI’s $10B+ Cerebras deal, Broadcom’s co-design partnership with OpenAI — indicate that owning efficient inference hardware is a strategic priority for leading AI organizations.
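The efficiency and capacity axes above lend themselves to back-of-envelope arithmetic. The sketch below plugs in figures quoted in this report (Atlas ~280 TPS at ~2,000 W vs. DGX H200 ~180 TPS at ~5,900 W on Llama 3.1-8B) together with a standard weights-only memory estimate; all values are illustrative, not vendor-verified.

```python
def tokens_per_watt(tps: float, watts: float) -> float:
    """Throughput per watt -- the efficiency metric used throughout."""
    return tps / watts

def weights_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Weights-only memory footprint (excludes KV cache and activations)."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# Efficiency comparison from the Llama 3.1-8B figures cited above.
atlas = tokens_per_watt(280, 2000)   # ~0.140 tokens/s/W
dgx = tokens_per_watt(180, 5900)     # ~0.031 tokens/s/W
print(f"Atlas advantage: {atlas / dgx:.1f}x tokens per watt")

# Why memory capacity matters: an 8B model fits almost anywhere, but a
# 405B model exceeds a single 141 GB H200 even at FP8 precision.
for bits in (16, 8, 4):
    print(f"405B @ FP{bits}: {weights_memory_gb(405, bits):.0f} GB")
```

Note that the raw 280/2,000 vs. 180/5,900 figures imply a larger tokens-per-watt gap than the ~3× headline claim, which may reflect a different workload mix in the vendor benchmark.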

In summary, performance (throughput, latency) and efficiency (power, cost) are the key axes where these systems are evaluated. In practice, many enterprises blend solutions: for instance, running small-to-medium LLMs on GPU clusters and reserving specialized accelerators for the largest, most latency-sensitive tasks. The next sections illustrate these deployments.

Case Studies and Deployments

FuriosaAI (South Korea) / LG: At an OpenAI Seoul event in Sept 2025, FuriosaAI demonstrated its hardware by running the GPT-OSS 120B model in real time on its RNGD chips ([59]). The results were striking: a single 4-core RNGD 5nm chip delivered performance comparable to one NVIDIA H100 GPU while using only one-third the power ([60]). These chips are now used by LG for the EXAONE model, doubling inference performance per watt vs GPUs ([8]). Furiosa’s approach – selling servers that let enterprises run large LLMs without building huge GPU farms – is a direct example of private LLM inference. LG’s case (a major manufacturer running an in-house LLM on exotic chips) shows this is no longer theoretical; it’s going into real products.

SambaNova Systems (USA) – Time Magazine reported that SambaNova’s AI platform (chips plus the Samba-1 model) has been adopted by institutions like Los Alamos National Lab, SoftBank (Japan), and Accenture ([9]). In one Los Alamos deployment, the system reportedly processes large-scale scientific data with an LLM to assist research, all on-premises. Although exact performance figures are proprietary, SambaNova claims advantages such as energy savings over GPUs and better handling of large conversational workloads. The SoftBank and Accenture partnerships highlight interest in using SambaNova’s inference hardware for advanced AI deployments in finance and manufacturing as well.

Cerebras Systems (USA) – Beyond its chip announcements, multiple real-world deployments illustrate Cerebras’s penetration. Abu Dhabi’s G42 (a major AI group) purchased Cerebras supercomputers for training and inference of its large language models ([80]). The CTO of G42 cited the ability to train some of the largest models faster on Cerebras than on a comparably sized GPU cluster. Moreover, Cerebras is strategically placing infrastructure globally: its announced expansion into the UAE’s mega “Stargate” data center will enable regional enterprises (across the Middle East and South Asia) to run huge LLMs. This shows how an inference-focused chip vendor can partner with national initiatives, moving beyond US tech hubs.

Groq Inc. (USA/Helsinki → NVIDIA) – Before its December 2025 acquisition by NVIDIA, Groq’s LPUs saw adoption by hyperscalers and defense contractors. In Helsinki, Groq opened a data center in partnership with Equinix to serve European AI customers ([53]), positioning itself as an “AI cloud” alternative for inference. Use cases included real-time analytics for automotive and industrial IoT. Groq’s story underlines a broader trend: inference-focused startups build co-located inference services to prove value, and if successful, may attract acquisition by larger players. NVIDIA’s $20B deal effectively brings Groq’s LPU technology in-house, potentially making it available as part of NVIDIA’s AI factory architecture rather than as a standalone competitor.

Positron AI (USA) – In early 2025, Positron shipped its first Atlas enclosures to select corporate customers. One publicized tester is Cloudflare, which is exploring Atlas for energy-efficient inference in its edge servers ([62]). Internal benchmarks (by Positron) show Atlas reducing power by ~67% for an 8B LLM compared to a DGX H200, at similar throughput ([34]). If verified, customers running many repeated inference tasks (e.g. high-volume chatbots at telecommunications companies) could cut electricity bills by tens of millions annually. Positron’s second-gen “Asimov” is already sampling, aimed at workloads requiring multi-trillion-parameter models. The emerging narrative: enterprises seeking to cut large-scale inference costs are trialing these new accelerators.

Untether (Canada → AMD) – Untether’s chips were sampled by forward-looking automotive and agricultural customers (Mercedes-Benz collaboration) ([81]), positioning the 240 Slim as an inference engine for on-vehicle use. However, in June 2025, AMD acquired Untether’s entire engineering team and the company ceased independent operations ([40]). The talent now contributes to AMD’s AI compiler and SoC capabilities. Untether’s story illustrates both the promise and the consolidation dynamics of the inference chip market: innovative startups build differentiated technology that ultimately gets absorbed by larger platforms.

Major OEM Deals: The broad industry move can be seen in mega-deals. As noted, Dell’s multi-billion-server deals with xAI (NVIDIA chips) ([10]) and HPE’s $1B+ deal with X (HPE reports) ([11]) indicate that enterprise AI hardware purchases have become multi-billion-dollar procurement items. These commitments are essentially private inference projects. For comparison, earlier cloud-centric AI was measured in tens to hundreds of millions, but now hits billions per customer. Even smaller enterprises are dedicating multi-million budgets to on-prem LLM servers.

OpenAI/Broadcom (Custom Chips): OpenAI and Broadcom formally announced a strategic collaboration in October 2025 to co-design and deploy 10 gigawatts of OpenAI-designed AI accelerators ([3]). The custom chip, codenamed “Titan,” is built on TSMC’s 3nm process with a systolic array architecture optimized for AI inference and HBM3E/HBM4 memory. Deployment is targeted for H2 2026, with a second generation planned on TSMC’s A16 process node. OpenAI has also separately contracted Cerebras to provide 750 MW of inference compute through 2028 ([47]). While OpenAI is not a typical enterprise client, these moves exemplify the trend: any serious AI organization may shift to owning large-scale inference hardware, and the chipmaker ecosystem is expanding to accommodate custom designs alongside GPU platforms.

Numerous reports quantify the rapid growth of AI inference hardware:

  • Market Size: Analysts now forecast the global AI inference market to reach ~$255 billion by 2030 (CAGR ~17–19% from 2025), with the 2025 inference market estimated at $103–106 billion ([7]). AI data center capex is projected at $1 trillion total by 2028, with AI chips alone representing over $400 billion in that year. Inference spending (in chips and systems) is projected to surpass training in dollar terms because inference happens constantly at scale. XPUs (ASICs and custom accelerators) are expected to lead growth at 22% in 2026, outpacing GPUs (19%) and CPUs (14%).

  • GPU Leadership: NVIDIA’s dominance continues: as of H1 2025, NVIDIA commanded ~92% of the discrete GPU market ([1]). Its FY2026 revenue reached a record $215.9 billion (up 65% YoY), with data center revenue at $193.7 billion ([26]). This shows enterprises continue to invest massively in NVIDIA-based inference. However, competition is intensifying: AMD has secured OpenAI as a major customer, Chinese chipmakers (Huawei, Cambricon) continue reacting to US export limits, and hyperscalers (Google, Microsoft) are deploying custom silicon.

  • Investment Surge: VC and corporate funding into AI hardware has reached extraordinary levels. NVIDIA’s $20B Groq acquisition was the landmark deal of late 2025 ([2]). Cerebras raised $1B at a $23B valuation and is refiling for an IPO. Positron raised $230M at a $1B+ valuation (February 2026). d-Matrix raised $275M (November 2025). Lightmatter (photonic) has raised $850M total ([82]). SambaNova raised $350M (February 2026). Marvell agreed to acquire Celestial AI for ~$5.5B. Hyperscaler capex commitments for 2026 are staggering: Amazon ($200B), Google ($175–185B), Meta ($115–135B). This avalanche of investment signals surging confidence in inference hardware needs.

  • Hardware Shipments: Super Micro reported shipping 100,000 GPUs per quarter by late 2024 ([71]), largely to AI customers — a pace far above a year earlier, implying annual shipments in the millions. Dell and other OEMs report similarly large order volumes. On the accelerator side, exact unit counts are private, but orders like Broadcom’s rumored 1–2 million AI chips for OpenAI ([83]) show the order-of-magnitude scale when hyperscalers commit.

  • Energy Trends: As noted, data-center power demand is skyrocketing; IEA analysis suggests AI’s share of world electricity consumption is surging. The key takeaway for enterprises: inference hardware must deliver efficiency gains to avoid unsustainable energy costs. Vendors’ claims of “10× efficiency” or “CUDA killers” are grounded in this imperative.

  • Standardization & Ecosystem: Industry alliances are forming around standards. Hosted APIs by themselves do not address on-prem use, but OpenAI-compatible interfaces and interoperability standards (like ONNX for models, and NVIDIA’s Megatron support) are enabling more architectures. AMD supports ROCm and TensorFlow, major startups pledge compatibility with PyTorch/ONNX, and Intel and others push oneAPI. The point: enterprises want hardware that works with their existing ML pipelines. This pushes integrated vendors (Dell, HPE) to emphasize compatibility in their stacks.
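As a sanity check on these forecasts, the growth rate implied by a $103–106 billion 2025 base reaching ~$255 billion in 2030 follows directly from the standard CAGR formula (a rough check, not an analyst model):

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values."""
    return (end / start) ** (1 / years) - 1

# 2025 -> 2030 is five compounding years; bracket with both base estimates.
low = cagr(106, 255, 5)    # from the high end of the 2025 base
high = cagr(103, 255, 5)   # from the low end of the 2025 base
print(f"Implied CAGR: {low:.1%} to {high:.1%}")
```

The implied ~19–20% lands at or just above the quoted 17–19% range, suggesting the published forecasts use slightly different base years or endpoints.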

Future Directions and Implications

Sustainability: The DOE report implies enterprises will demand more “green AI” solutions. Energy efficiency (and on-prem renewables) will influence hardware choices. Even beyond energy, heat and cooling constraints are becoming central; Furiosa’s low power design is a strong selling point.

Model and Software Evolution: The hardware landscape will evolve with AI models themselves. If models grow to trillions of parameters, wafer-scale or even multi-wafer solutions (like Cerebras’s WSE or next-gen photonic chips) could be needed. Conversely, as the small language model (SLM) trend suggests ([25]), not all inference will require the largest chips; many domain-specific models run comfortably on less hardware. Enterprises might purchase a mix: e.g. one converged DGX or Cerebras rig for big tasks, plus many cheaper NVIDIA/AMD servers or even laptop NPUs for smaller tasks.

Geopolitics & Supply Chain: The US and China are both pushing for AI hardware independence. The U.S. CHIPS Act and export controls motivate domestic options, and similar programs in Europe and Japan could boost homegrown chip efforts (e.g. Japan’s Rapidus partnering with Tenstorrent ([84])). The Broadcom-OpenAI and Meta-Rivos news highlights that big tech is moving away from relying solely on merchant GPUs. For enterprise customers, this may mean more vendor options but also fragmentation (e.g. if new RISC-V or Chinese chips enter the market).

Design Innovations: Beyond electronics, photonic computing is advancing rapidly. Lightmatter launched its Passage M1000 (3D Photonic Superchip for next-gen XPUs) in March 2025 and announced the Passage L200 — the world’s first 3D co-packaged optics (CPO) product — available in 2026 ([85]). While mainstream adoption remains years away, Lightmatter is working with unnamed GPU/XPU makers to integrate CPO directly onto chips, with products expected in late 2027–2028. Celestial AI raised $520M total (Series C1 closed August 2025) but in a significant validation, Marvell Technology agreed to acquire Celestial AI for ~$5.5 billion (deal expected to close by end of March 2026) ([86]). Celestial’s "Photonic Fabric" and Optical Multi-Chip Interconnect Bridge (OMIB) enable optical chip-to-chip and server-to-server connectivity across data center racks, promising meaningful revenue by late 2028. Similarly, new 3D packaging (2.5D chips with HBM) and chiplets may yield more powerful inference devices.

Case in Point – RISC-V and Open Architectures: The RISC-V trend has accelerated through consolidation. Meta completed its ~$2 billion acquisition of Rivos in October 2025 ([4]), fusing Rivos’s CUDA-compatible RISC-V processor with Meta’s MTIA custom AI chip family to reduce NVIDIA dependence. Untether’s RISC-V team was absorbed by AMD (June 2025). Tenstorrent is licensing its Ascalon RISC-V CPU and Tensix AI cores as IP to LG, Hyundai, and Samsung (~$150M in contracts), and expanded into China via a RISC-V partnership (December 2025). RISC-V allows custom AI features without licensing ARM fees, and enterprises may benefit from more affordable chip designs and reduced vendor lock-in. On the other hand, fragmentation remains a risk if each company uses a different ISA.

Impacts on Enterprise IT: With these shifts, IT leaders must rethink infrastructure. Running local clusters with new hardware requires new expertise (e.g. operating Graphcore IPUs is different from CUDA). Total cost of ownership (TCO) analyses must include electricity, space, and support staff. Many enterprises may start with hybrid cloud (on-prem for steady loads, cloud bursts for peaks), as suggested in industry surveys ([15]). Data governance and latency-critical apps will push more workloads toward on-prem inference long-term. Some organizations (e.g. banks, healthcare) may never allow core LLM workloads to run off-site due to privacy.

Conclusion: The ecosystem for private LLM inference is rapidly diversifying. Established GPU vendors (NVIDIA, AMD, Intel) continue innovating, while a new tier of companies (Graphcore, Groq, etc.) challenges their position. At the same time, integrators (Dell, HPE, etc.) adapt to package these technologies for enterprises. Enterprises now have unprecedented choice: they can rely on best-in-class NVIDIA DGX clusters, adopt more energy-efficient accelerators like Furiosa or Positron, or even build proprietary chips via partners. The performance and efficiency figures in this report are drawn from the cited sources (power and speed data included) to support evidence-based comparison.

Ultimately, running LLM inference in-house is becoming a complex, high-stakes endeavor. As one analysis notes: “companies are eager to leverage AI using their data, but often face complexity and risk in implementation” ([69]). By selecting and integrating the right hardware stack, enterprises can not only meet this challenge but also gain strategic advantages in speed, cost, and data security. The next few years will likely see continued innovation, integration of new technologies (photonic, RISC-V, specialized FPGAs), and broader deployment of inference-centric hardware – all culminating in a dynamic market where the “best” solution may vary by use case but never stops improving.

| Company | Country | Hardware Product(s) | Key Features | Clients / Partners |
|---|---|---|---|---|
| NVIDIA (GPU) | USA | B200/B300 (Blackwell/Ultra) GPUs; DGX B200/B300; GB200/GB300 NVL72; Vera Rubin (2026) | B300: 15 PFLOPS dense FP4, 288 GB HBM3e. Vera Rubin: 50 PFLOPS NVFP4, 288 GB HBM4, 22 TB/s bandwidth. FY2026 revenue: $215.9B (data center: $193.7B). Mature CUDA/AI software (4M+ developers). Acquired Groq for ~$20B ([26]). | Dell (192-GPU PowerEdge) ([28]); HPE GreenLake AI ([31]); AWS, Azure, Google Cloud, CoreWeave; xAI, OpenAI. |
| AMD (GPU) | USA | MI350X/MI355X (CDNA 4, shipping); MI400 (2026 roadmap, 40 PFLOPS FP4) | MI350X: 288 GB HBM3e, native FP4/FP6/FP8, up to 35× faster inference vs prior gen. MI400: 432 GB HBM4, 19.6 TB/s. OpenAI took ~10% stake (Dec 2025). Acquired Untether AI team. Oracle committed to 131K+ MI355X cluster ([35]). | Oracle, OpenAI, Meta, Microsoft, HPE, Dell, Lenovo, Supermicro. |
| Intel (GPU/Arc) | USA | Arc-based GPUs (B60…); Project Battlematrix | Arc Pro B60 (20 Xe cores, 24 GB GDDR6, 160 XMX engines ([87])). LLM Scaler 1.0 boosts Arc inference ~4.2× ([41]). Upcoming “Crescent Island”: 160 GB LPDDR5X for inference ([22]). | Data center OEMs, workstation builders; collaborating with Supermicro, Dell, etc. |
| IBM (CPU + AI) | USA | Power11 servers; IBM Telum II mainframe (+Spyre) | Power11 chips (7nm) with built-in AI instructions, focused on inference reliability. Claims ~55% core perf gain vs Power9 ([88]) and turnkey AI stacks. 99.9999% uptime, live patching ([89]). | Banks, telcos, govt (secure AI workloads). Power systems as inference servers. HPE-like deals in enterprise. |
| Graphcore (IPU) | UK | Colossus GC200 (IPU chips); IPU-Machine M2000 / IPU-POD64 systems | Dataflow parallelism; each IPU holds whole model in on-chip memory ([43]). Part of SoftBank’s “Silicon Trinity” (Arm + Ampere + Graphcore). No new chip generation post-acquisition; integration with Ampere CPUs expected for 2026 Stargate deployments. | SoftBank (parent, acquired July 2024 for ~$500M) ([44]); targeted at SoftBank’s hyper-scale data centers in US and Japan. |
| Cerebras (WSE) | USA | WSE-3 (5nm, 900K+ cores, 4T transistors); CS-3 systems | Largest chip ever; designed to run entire LLM on one die. $10B+ OpenAI deal (750 MW through 2028). 6 new data centers. $1B raised at $23B valuation (late 2025). IPO refiling targeting Q2 2026 ([47]). | OpenAI (major inference partner), G42 UAE, Perplexity, Mistral AI (Le Chat) ([80]); UAE Stargate; hyperscalers and research centers. |
| SambaNova | USA | SN50 chip (1.6 PFLOPS FP16, 3.2 PFLOPS FP8, Feb 2026); RDU-based DataScale platform | Reconfigurable dataflow architecture. SN50 supports 10T-parameter models, 10M context. Three-tier memory architecture. $350M raised (Feb 2026). Intel multi-year co-design collaboration. | SoftBank (first SN50 customer), Los Alamos Natl Lab ([9]), Accenture, Intel, Hume AI; enterprise/govt customers. |
| Groq (LPU) → NVIDIA | USA | GroqChip (ASIC LPU); LPUs in servers | Deterministic, ultra-pipelined architecture. Extremely low latency. Claims ~1/3 GPU power for similar throughput ([50]). Acquired by NVIDIA Dec 2025 for ~$20B ([2]). LPU IP and engineering now part of NVIDIA’s Real-Time Inference division. | Former investors: Cisco, Samsung, BlackRock; Helsinki data center; LPU technology now integrated into NVIDIA AI factory architecture. |
| Tenstorrent | Canada | Blackhole ASICs (745 TFLOPS FP8, from $999); Galaxy (32-chip, 23.8 PFLOPS); Quasar (4nm, upcoming) | Custom RISC-V-based AI processors. Chiplet design (multi-fab possible) ([56]). Also licensing Ascalon RISC-V CPU IP (~$150M in contracts with LG, Hyundai, Samsung). $693M Series D (Dec 2024). | Bosch, Hyundai, LG, Samsung ([55]); Rapidus partnership; China expansion via Allen Wu (Dec 2025). |
| FuriosaAI | S. Korea | RNGD “Renegade” AI inference chip; RNGD Server | 5nm chips with dual HBM3; 4 PFLOPS FP8 per 4-chip board at 3 kW ([58]). Specializes in LLM inference. Efficient MXFP4 precision. | LG AI Lab (EXAONE LLM) ([8]); OpenAI (GPT-OSS 120B test) ([59]); global OEM sampling in 2026. |
| Positron AI | USA | Atlas system (8× Archer ASICs); Asimov (next-gen, 2 TB/chip, tape-out end 2026) | Inference-optimized accelerators. Atlas: 280 TPS on Llama 3.1 8B at 2000W vs NVIDIA’s 180 TPS at 5900W ([34]) (~3× better efficiency). $230M Series B at $1B+ valuation (Feb 2026). | Cloudflare, QIA, Arm Holdings, Jump Trading ([63]); targeting web-scale inference clients. |
| Untether → AMD | Canada | Untether 240 Slim (ASIC), discontinued | RISC-V-based accelerators for inference. Entire engineering team acquired by AMD (June 2025); products and SDK no longer supported ([40]). | Former partners: Mercedes-Benz ([90]). Team now contributing to AMD AI compiler and SoC design. |
| d-Matrix (Memory) | USA | Corsair (shipping, PCIe card); Pavehawk (3DIMC, lab-validated); Raptor (next-gen, with Alchip) | In-memory-compute concept: 256 GB LPDDR5 + 2 GB SRAM + 3D-stacked DRAM. Claims ~10× HBM4 bandwidth & energy efficiency ([64]). $275M raised Nov 2025 ([65]). | Microsoft (venture-backed) ([91]); Supermicro (server integration) ([92]). |
| Qualcomm (AI) | USA | AI100 Ultra (inference ASIC) | Designed for base station/edge inference (video, 5G, etc.). Collaborates with Cerebras for next-gen inference ([93]). Positioned for low-latency tasks. | Verizon, AT&T trial networks; Cerebras partnership for datacenter AI ([93]). |
| Lightmatter | USA | Passage M1000 (3D Photonic Superchip, 2025); Passage L200 (3D co-packaged optics, 2026) | Uses light (photons) for chip-to-chip interconnects, avoiding transistor limits. Potentially huge energy savings. $850M total raised ($400M Series D at $4.4B valuation) ([82]). | T. Rowe Price, Fidelity, GV (investors); JPMorgan and DoE collaborating on photonic networks; working with GPU/XPU makers for 2027–2028 CPO integration. |
| Celestial AI → Marvell | USA | Photonic Fabric; Optical Multi-Chip Interconnect Bridge (OMIB) | Photonics-based chip-to-chip links (optical fabric) for latency/power reduction vs NVLink. $520M Series C1 (Aug 2025). Being acquired by Marvell for ~$5.5B (closing ~March 2026) ([86]). | AMD Ventures, Samsung, BlackRock, Tiger Global, Temasek (investors); Marvell (acquirer); meaningful revenue expected late 2028. |
| Huawei (Ascend) | China | Ascend 910/Ascend 950 AI accelerators | Up to 2 PFLOPS FP8 (Ascend 950) with 144 GB RAM ([94]). Designed for both training and inference (not sold internationally, used in CN). | Alibaba (cloud GPUs), Huawei’s own smartphone/datacenter products; domestic Chinese enterprises. |
| Baidu (Kunlun) | China | Kunlun AI chips | 14 nm chips primarily for AI inference/services. Deployed in Baidu’s cloud. (Note: limited availability outside China.) | Primarily internal to Baidu’s AI services; some Chinese telecom use. |

Table: Notable hardware vendors for enterprise LLM inference (with example products and usage) ([95]) ([19]) ([28]) ([69]).

The table groups vendors by type (GPU vs ASIC vs photonic, etc.), lists known products and their key attributes or customers, with citations. For example, NVIDIA’s Blackwell GPUs (not explicitly cited above, but implied by [33]) deliver enormous parallel compute, while Furiosa’s RNGD server achieves similar throughput at much lower power ([19]). The Clients column shows adopters – from cloud giants to national labs – illustrating actual on-prem inference usage.

Implications and Conclusions

Infrastructure Strategy: The diversity of hardware means enterprises must carefully architect their AI stacks. Many will adopt a hybrid approach: general-purpose GPUs for broad tasks and development, plus specialized accelerators for production inference. Companies may choose storage/compute designs so that certain racks are GPU-heavy while others use Groq or Cerebras nodes, depending on workload. Maintaining interoperability (via frameworks like ONNX and containerized runtimes) will be crucial.
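Because several accelerator vendors (e.g. Positron) advertise OpenAI-compatible endpoints, interoperability can often be preserved at the API layer: client code written against the OpenAI REST schema needs only a base-URL change to target an on-prem system. A stdlib-only sketch — the endpoint host, key, and model name below are hypothetical:

```python
import json
import urllib.request

def build_chat_request(base_url: str, api_key: str,
                       model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible /chat/completions request.

    Works against any server exposing the OpenAI REST schema; the
    request is constructed but not sent here."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

# Swapping a hosted API for an on-prem accelerator is a base-URL change:
req = build_chat_request("http://atlas.internal:8000", "local-key",
                         "llama-3.1-8b", "Summarize our Q3 results.")
```

Sending the request (e.g. via `urllib.request.urlopen(req)`) then hits the local inference server instead of a hosted API, keeping prompts and data inside the enterprise network.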

Supply and Pricing: The scramble for hardware has led to shortages of high-end chips and substantial price tags. Companies like Dell and HPE lock in large orders to secure supply (evidenced by their megadeals) ([10]) ([11]). Enterprises without bulk purchasing power may find prices high; yet some newer entrants (e.g. Positron, Cerebras) claim more “democratized” pricing models (e.g. cents per million tokens). The key point is that running cutting-edge LLMs will likely remain a significant capital investment.

Regulation and Security: Running LLMs in-house can help meet data compliance (e.g. GDPR, HIPAA) by keeping sensitive data off external APIs. However, self-hosted models introduce security concerns too (ensuring firewalls, preventing model theft). New hardware often has secure enclaves or encryption features; for example, IBM touts quantum-safe encryption in its Power11 ([74]). Enterprises must balance safety of data on cloud vs. securing local machines.

Wearables and Edge: Some inference hardware is small enough to run at the edge. Qualcomm’s AI100 chips and Apple’s local NPU (on M-series chips) already allow LLMs to run on phones and laptops (notably, Microsoft’s Copilot on Windows 11 uses Intel NPUs too). Dell’s new laptop with an “Intel AI Boost” NPU ([96]) is aimed at local content creation (game devs, designers). While not data-center scale, these developments hint at a future where even edge devices perform private LLM inference for on-device assistants (as has been demoed with mobile LLMs).

Emerging Technologies: We highlighted photonic chips and RISC-V. Others on the horizon:

  • Neuromorphic computing (IBM’s research chips, Intel’s Loihi) – unlikely to run current LLMs soon, but worth mentioning as a parallel track focused on brain-like inference (low power, always-on).
  • Quantum computing – not practical for LLMs yet, though there is some interest in applying quantum methods to optimization problems adjacent to AI workloads. Purely speculative for inference in the near future.
  • Optical computing – startups like Lightmatter demonstrate how optical circuits could eventually reduce the energy cost of matrix multiplication. Do note: Lightmatter’s CEO says mainstream use is ~10 years away ([97]), but research partnerships (with JPMorgan and Lawrence Berkeley Lab) are testing prototypes now.

Global Considerations: Enterprises will also factor in regional trade policies. Chinese companies rely on domestic hardware (e.g. Huawei) due to export curbs. Western companies are monitoring this: some are shifting fabrication among allied fabs (TSMC, Samsung, Rapidus) or pressing for export-rule changes to secure chips made domestically (e.g., Meta designing U.S. AI chips via Broadcom ([83])).

Conclusion: The rise of private LLM inference hardware is a major industrial theme in AI. Through extensive research, comparing performance claims and market moves, we see a fast-evolving and rapidly consolidating landscape. The late 2025 period marked a turning point: NVIDIA's $20B acquisition of Groq, Marvell's $5.5B purchase of Celestial AI, AMD's absorption of Untether, and OpenAI's $10B+ Cerebras partnership all signal that inference is now the central battleground in AI hardware. Key players range from trillion-dollar incumbents (NVIDIA with $215.9B in annual revenue, Intel) to rising heavyweights (Cerebras at $23B valuation, Positron at $1B+) to specialized challengers (FuriosaAI, SambaNova, Tenstorrent). OEMs like Dell (projecting $20B in AI server shipments) and HPE bridge these chips to enterprise use, while hyperscalers are investing hundreds of billions in AI infrastructure (Amazon $200B, Google $175–185B, Meta $115–135B in 2026 alone). Our citations have shown concrete figures: GPUs delivering petaflops at tens of kW ([19]), new chips claiming triple efficiency ([34]), and massive deals worth tens of billions ([10]).

In summary, any enterprise looking to “bring LLMs in-house” must evaluate both classical GPU solutions and the new breed of AI accelerators. They should consider use-case specifics (batch vs real-time, single large model vs many small ones) and metrics (power, cost, precision). This report has catalogued the current key companies and technologies in private LLM inference, providing a foundation for detailed strategic planning. The story is still unfolding: as more hardware (e.g. Intel’s Crescent Island, AWS Outposts for AI, global AI computing networks) comes online, enterprises will have even more options. By staying informed and evidence-based (as per the data cited here), organizations can harness in-house LLM capabilities effectively and sustainably.

External Sources (97)


DISCLAIMER

The information contained in this document is provided for educational and informational purposes only. We make no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability, or availability of the information contained herein. Any reliance you place on such information is strictly at your own risk. In no event will IntuitionLabs.ai or its representatives be liable for any loss or damage including without limitation, indirect or consequential loss or damage, or any loss or damage whatsoever arising from the use of information presented in this document. This document may contain content generated with the assistance of artificial intelligence technologies. AI-generated content may contain errors, omissions, or inaccuracies. Readers are advised to independently verify any critical information before acting upon it. All product names, logos, brands, trademarks, and registered trademarks mentioned in this document are the property of their respective owners. All company, product, and service names used in this document are for identification purposes only. Use of these names, logos, trademarks, and brands does not imply endorsement by the respective trademark holders. IntuitionLabs.ai is an AI software development company specializing in helping life-science companies implement and leverage artificial intelligence solutions. Founded in 2023 by Adrien Laurent and based in San Jose, California. This document does not constitute professional or legal advice. For specific guidance related to your business needs, please consult with appropriate qualified professionals.


© 2026 IntuitionLabs. All rights reserved.