

Perplexity's Tool Runs Massive AI Models on Older Hardware

Perplexity AI introduces an open-source tool, TransferEngine, enabling large language models to run efficiently across diverse cloud hardware, bypassing costly upgrades.

Nov 6, 2025

Perplexity AI has released TransferEngine, an open-source tool that addresses two significant challenges in enterprise AI deployment: vendor lock-in and the need for expensive hardware upgrades. The tool provides high-speed communication for large language models running across different cloud providers’ hardware, allowing trillion-parameter models to operate effectively on existing H100 and H200 GPU systems. TransferEngine leverages RDMA technology to create a universal interface, ensuring seamless data transfer between GPUs regardless of the underlying networking protocol. This development promises greater flexibility and cost savings for companies deploying advanced AI.

An illustration representing interconnected data streams, symbolizing the efficient data transfer facilitated by new AI tools across diverse cloud environments. Credit: Shutterstock

Bridging the AI Hardware Divide with TransferEngine

Perplexity AI has unveiled an open-source software solution designed to tackle two pervasive and costly issues for enterprises leveraging artificial intelligence systems. This innovative tool aims to free companies from dependence on a single cloud provider and eliminate the immediate need for expensive, cutting-edge hardware to operate massive AI models. The introduction of this technology marks a significant step towards democratizing access to high-performance AI.

The new tool, named TransferEngine, facilitates high-speed communication between large language models operating across distinct cloud providers’ hardware. This capability allows models at or near the trillion-parameter scale, such as DeepSeek V3 and Kimi K2, to run efficiently on existing H100 and H200 GPU systems. Consequently, companies can avoid waiting for and investing in next-generation hardware, as detailed in a research paper Perplexity published alongside the code it open-sourced on GitHub.

Existing implementations are often tied to specific Network Interface Controllers, a restriction that has historically hindered seamless integration into inference engines and portability across hardware platforms. Perplexity’s TransferEngine seeks to overcome these fundamental technical barriers, offering a more flexible and robust solution for modern AI deployments.

Addressing Vendor Lock-in and Hardware Dependency

The issue of vendor lock-in primarily stems from inherent technical incompatibilities between different cloud environments. Cloud providers employ diverse networking protocols for high-speed GPU communication, creating a fragmented ecosystem. For instance, Nvidia’s ConnectX chips utilize one standard, while Amazon Web Services’ (AWS) Elastic Fabric Adapter (EFA) operates on a proprietary protocol.

Previous solutions were typically optimized for one system or the other, failing to offer universal compatibility. This forced organizations to commit to a specific provider’s ecosystem or accept significantly reduced performance, impacting their operational flexibility and budget. The challenge is particularly acute with the latest Mixture-of-Experts (MoE) models, which demand substantial computational resources.

Models like DeepSeek V3, with 671 billion parameters, and Kimi K2, reaching a full trillion, are too large to be contained within single eight-GPU systems. While Nvidia’s new GB200 systems, essentially massive 72-GPU servers, offer a potential solution, they carry hefty price tags, face severe supply chain shortages, and are not universally available. In contrast, H100 and H200 systems are more readily available and comparatively less expensive.
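
A rough sizing calculation makes the constraint concrete. The sketch below assumes 16-bit (2-byte) weights and nominal HBM capacities of 80 GB per H100 and 141 GB per H200; real deployments also need room for KV cache and activations, so the gap is wider in practice.

```python
# Back-of-envelope sizing only; assumes 2 bytes per parameter and nominal
# HBM capacities (H100: 80 GB, H200: 141 GB). Quantization and serving
# overheads change the exact numbers, not the conclusion.
BYTES_PER_PARAM = 2

def weight_footprint_gb(params_billions: float) -> float:
    """Approximate memory needed just to hold the weights, in GB."""
    return params_billions * BYTES_PER_PARAM  # 1e9 params * 2 bytes = 2 GB per billion

for name, params_b in [("DeepSeek V3", 671), ("Kimi K2", 1000)]:
    need = weight_footprint_gb(params_b)
    print(f"{name}: ~{need:.0f} GB of weights vs "
          f"{8 * 80} GB on 8x H100 or {8 * 141} GB on 8x H200")
# DeepSeek V3: ~1342 GB of weights vs 640 GB on 8x H100 or 1128 GB on 8x H200
# Kimi K2: ~2000 GB of weights vs 640 GB on 8x H100 or 1128 GB on 8x H200
```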

However, distributing large models across multiple older systems has traditionally led to severe performance penalties. The research team highlighted the absence of viable cross-provider solutions for large language model inference, noting that existing libraries either lack AWS support or experience substantial performance degradation on Amazon’s infrastructure. TransferEngine aims to transform this landscape by enabling portable point-to-point communication for contemporary large language model architectures, avoiding vendor lock-in while complementing collective libraries for cloud-native deployments.

The Inner Workings of TransferEngine

TransferEngine functions as a universal translator for GPU-to-GPU communication, establishing a common interface compatible across various networking hardware. It achieves this by identifying the core functionalities shared across diverse systems, abstracting away the underlying proprietary protocols. This approach ensures consistent and efficient data transfer regardless of the specific cloud environment.
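
As an illustration of that design, the sketch below uses hypothetical names (it is not Perplexity’s actual API) to show the shape of such an abstraction: callers program against one generic operation, a one-sided write of a registered memory region to a peer, while per-fabric backends for Nvidia ConnectX and AWS EFA supply the protocol-specific plumbing underneath.

```python
# Hypothetical sketch of a "common interface over heterogeneous RDMA hardware";
# names and signatures are illustrative, not TransferEngine's real API.
from abc import ABC, abstractmethod
from typing import Callable

class RdmaBackend(ABC):
    @abstractmethod
    def register(self, buffer: memoryview) -> int:
        """Pin a memory region with the NIC and return a handle for transfers."""

    @abstractmethod
    def write(self, handle: int, peer: str, remote_addr: int,
              on_done: Callable[[], None]) -> None:
        """One-sided write of the registered region into a peer's memory."""

class ConnectXBackend(RdmaBackend):
    """Would wrap ibverbs-style RDMA writes on Nvidia ConnectX NICs."""

class EfaBackend(RdmaBackend):
    """Would wrap AWS EFA's SRD-based transport, which lacks some verbs features."""

def transfer(backend: RdmaBackend, buf: memoryview, peer: str, remote_addr: int) -> None:
    # The caller never sees which fabric is underneath; that is the portability claim.
    handle = backend.register(buf)
    backend.write(handle, peer, remote_addr, on_done=lambda: print("transfer complete"))
```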

At its core, TransferEngine utilizes Remote Direct Memory Access (RDMA) technology. RDMA facilitates direct data transfer between graphics cards without involving the main central processing unit, creating a dedicated, high-speed pathway between chips. This direct access significantly reduces latency and enhances throughput, crucial for the demanding nature of large language models.
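
For readers unfamiliar with RDMA, the sketch below shows, with hypothetical names rather than a real verbs binding, roughly what a one-sided RDMA write carries: the NIC executes the descriptor directly against registered memory on both ends, and the CPU only learns of completion afterward.

```python
# Conceptual sketch of a one-sided RDMA write; field names mirror the usual
# verbs vocabulary, but this is not a real binding.
from dataclasses import dataclass

@dataclass
class RdmaWriteDescriptor:
    local_addr: int    # source buffer (a registered memory region, e.g. GPU memory)
    lkey: int          # local memory-registration key
    remote_addr: int   # destination address in the peer's registered memory
    rkey: int          # remote key the peer granted, authorizing the write
    length: int        # bytes to move

def post_write(queue_pair, desc: RdmaWriteDescriptor) -> None:
    """Hand the descriptor to the NIC's send queue (hypothetical helper).

    The CPU's involvement ends here: the NIC moves the payload directly, and
    the host later polls a completion queue, which is why RDMA keeps the CPU
    off the data path and the latency low.
    """
    queue_pair.post_send(desc)  # placeholder for a real post-send call
```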

Perplexity’s implementation has demonstrated impressive performance, achieving 400 gigabits per second throughput on both Nvidia ConnectX-7 and AWS EFA. This performance matches that of existing single-platform solutions, validating TransferEngine’s effectiveness. Moreover, the tool supports the aggregation of bandwidth by using multiple network cards per GPU, further accelerating communication speeds for even more intensive workloads.
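
A minimal sketch of that aggregation idea, under the assumption that one logical transfer is striped across several NICs attached to the same GPU (the function and device names below are hypothetical):

```python
# Illustrative only: round-robin the chunks of one buffer across several NICs
# so their bandwidth adds up. Registration, ordering, and completion tracking
# are omitted; device names are made up.
def sharded_transfer(nics, buffer_len: int, chunk_bytes: int = 64 << 20):
    """Plan a transfer by striping fixed-size chunks over the available NICs."""
    plan, offset = [], 0
    while offset < buffer_len:
        length = min(chunk_bytes, buffer_len - offset)
        nic = nics[len(plan) % len(nics)]     # next NIC in round-robin order
        plan.append((nic, offset, length))    # in practice: post an RDMA write here
        offset += length
    return plan

plan = sharded_transfer(nics=["efa0", "efa1", "efa2", "efa3"], buffer_len=1 << 30)
print(f"{len(plan)} chunks spread over {len(set(n for n, _, _ in plan))} NICs")
```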

The researchers elaborated that portability is achieved by leveraging common functionalities across heterogeneous RDMA hardware. This method establishes a reliable abstraction layer over underlying protocols, ensuring compatibility without compromising performance. By providing a unified communication layer, TransferEngine allows for greater flexibility in deploying AI models across different cloud infrastructures, optimizing resource utilization and reducing operational complexities.

Impact and Future Outlook

The capabilities of TransferEngine are not merely theoretical; Perplexity has already integrated this technology into its production environment to power its advanced AI search engine. This real-world application demonstrates the tool’s robustness and efficiency in handling complex, high-demand AI tasks. Its deployment across critical systems showcases its versatility and immediate value.

For disaggregated inference, TransferEngine expertly manages high-speed data transfers between servers, enabling dynamic scaling of AI services. This allows companies to adjust their computational resources based on demand, optimizing costs and performance. Furthermore, the library underpins Perplexity’s reinforcement learning system, facilitating weight updates for trillion-parameter models in a remarkably short timeframe of just 1.3 seconds, highlighting its efficiency in managing large-scale model training and updates.
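
The 1.3-second figure implies a very high aggregate transfer rate. A back-of-envelope check, assuming 16-bit weights (roughly 2 TB of data for a trillion parameters) spread across every GPU and NIC in the cluster, looks like this:

```python
# Rough arithmetic on the reported 1.3 s weight update; assumes 2-byte weights.
# The resulting rate is aggregate across the whole cluster, not a single link.
params = 1e12                         # trillion-parameter model
bytes_total = params * 2              # ~2 TB of weight data
seconds = 1.3
print(f"~{bytes_total / seconds / 1e12:.1f} TB/s aggregate")   # ~1.5 TB/s
```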

Perhaps one of its most significant applications is in routing for Mixture-of-Experts models. These models direct different requests to specialized “experts” within the model, generating substantially more network traffic compared to traditional models. While DeepSeek developed its own DeepEP framework for this purpose, it was restricted to Nvidia ConnectX hardware. TransferEngine not only matched DeepEP’s performance on ConnectX-7 but also achieved state-of-the-art latency on Nvidia hardware while providing the first viable implementation compatible with AWS EFA.
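
The traffic pattern is easy to see in a toy model: each token’s hidden state must be dispatched to its top-k experts, and those experts are spread over many nodes, so a single batch fans out into a large number of small point-to-point transfers. The sketch below is purely illustrative (it is not DeepEP or Perplexity’s kernels) and assumes 256 experts routed top-8 over 16 nodes, roughly DeepSeek V3’s routing shape.

```python
# Toy illustration of why MoE routing generates heavy point-to-point traffic.
import random

NUM_EXPERTS, TOP_K, TOKENS, NUM_NODES = 256, 8, 1024, 16   # assumed layout
expert_to_node = {e: e % NUM_NODES for e in range(NUM_EXPERTS)}

sends = 0
for _ in range(TOKENS):                                     # tokens residing on node 0
    experts = random.sample(range(NUM_EXPERTS), TOP_K)      # stand-in for a learned router
    remote_nodes = {expert_to_node[e] for e in experts if expert_to_node[e] != 0}
    sends += len(remote_nodes)                               # one dispatch per remote node hit

print(f"~{sends} cross-node dispatches for {TOKENS} tokens from a single node")
```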

During extensive testing of DeepSeek V3 and Kimi K2 on AWS H200 instances, Perplexity observed considerable performance improvements when distributing models across multiple nodes, especially at medium batch sizes—a common configuration for production serving. This demonstrates TransferEngine’s ability to unlock the full potential of existing hardware for demanding AI applications. The open-source nature of this production infrastructure distinguishes Perplexity from many competitors, such as OpenAI and Anthropic, who typically maintain proprietary control over their technical implementations.

Perplexity’s decision to release the complete library, including code, Python bindings, and benchmarking tools, under an open license mirrors the successful strategy employed by Meta with PyTorch. This approach aims to establish an industry standard, fostering community contributions and accelerating innovation. The company continues to refine the technology, specifically optimizing it for AWS in response to recent updates in Amazon’s networking libraries, further reducing latency and enhancing overall performance. This commitment to open collaboration and continuous improvement positions TransferEngine as a pivotal tool in the evolving landscape of AI infrastructure.