SB Crypto Guru News- latest crypto news, NFTs, DEFI, Web3, Metaverse

NVIDIA Enhances Llama 3.1 405B Performance with TensorRT Model Optimizer

SB Crypto Guru News by SB Crypto Guru News
August 29, 2024
in Blockchain




Lawrence Jengar
Aug 29, 2024 16:10

NVIDIA’s TensorRT Model Optimizer significantly boosts performance of Meta’s Llama 3.1 405B large language model on H200 GPUs.




Meta’s Llama 3.1 405B large language model (LLM) is achieving new levels of performance thanks to NVIDIA’s TensorRT Model Optimizer, according to the NVIDIA Technical Blog. The enhancements have resulted in up to a 1.44x increase in throughput when running on NVIDIA H200 GPUs.

Outstanding Llama 3.1 405B Inference Throughput with TensorRT-LLM

TensorRT-LLM has delivered strong inference throughput for Llama 3.1 405B since the model’s release. This was achieved through optimizations including in-flight batching, KV caching, and optimized attention kernels, which accelerate inference while keeping computation in lower precision.

TensorRT-LLM added support for the official Llama FP8 quantization recipe, which calculates static and dynamic scaling factors to preserve maximum accuracy. Additionally, user-defined kernels such as matrix multiplications from FBGEMM are optimized via plug-ins inserted into the network graph at compile time.
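To make the difference between static and dynamic scaling factors concrete, here is a minimal Python sketch. It assumes per-tensor scaling and the FP8 E4M3 format, whose largest finite value is 448. The function names and sample values are hypothetical and are not taken from TensorRT-LLM or the Llama recipe.

```python
# Illustrative sketch only: per-tensor FP8 scaling factors.
# Assumes FP8 E4M3 (max finite value 448); all names and data are hypothetical.
FP8_E4M3_MAX = 448.0


def scale_from_amax(amax):
    """Scale that maps the range [-amax, amax] onto the FP8 range."""
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def static_scale(calibration_batches):
    """Static: computed once, offline, from calibration data."""
    amax = max(abs(v) for batch in calibration_batches for v in batch)
    return scale_from_amax(amax)


def dynamic_scale(tensor):
    """Dynamic: recomputed at runtime from the tensor being quantized."""
    return scale_from_amax(max(abs(v) for v in tensor))


def to_fp8_range(tensor, scale):
    """Divide by the scale and clamp to the FP8 range (mantissa rounding omitted)."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in tensor]


calibration = [[0.5, -2.0, 1.0], [3.5, -0.25, 0.75]]
runtime_tensor = [7.0, -1.0, 0.5]

s_static = static_scale(calibration)        # based on calibration amax 3.5
s_dynamic = dynamic_scale(runtime_tensor)   # based on runtime amax 7.0

clamped = to_fp8_range(runtime_tensor, s_static)
fitted = to_fp8_range(runtime_tensor, s_dynamic)
```

In this toy example the runtime value 7.0 lies outside the calibrated range, so the static scale saturates it at 448, while the dynamic scale maps it exactly onto the top of the range at the cost of an extra reduction at runtime.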

Boosting Performance Up to 1.44x with TensorRT Model Optimizer

NVIDIA’s custom FP8 post-training quantization (PTQ) recipe, available through the TensorRT Model Optimizer library, enhances Llama 3.1 405B throughput and reduces latency without sacrificing accuracy. This recipe incorporates FP8 KV cache quantization and self-attention static quantization, reducing inference compute overhead.

Table 1 demonstrates the maximum throughput performance, showing significant improvements across various input and output sequence lengths on an 8-GPU HGX H200 system. The system features eight NVIDIA H200 Tensor Core GPUs with 141 GB of HBM3e memory each and four NVLink Switches, providing 900 GB/s of GPU-to-GPU bandwidth.








Maximum Throughput Performance – Output Tokens/Second
8 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths    2,048 | 128    32,768 | 2,048    120,000 | 2,048
TensorRT Model Optimizer FP8       463.1          320.1             71.5
Official Llama FP8 Recipe          399.9          230.8             49.6
Speedup                            1.16x          1.39x             1.44x

Table 1. Maximum throughput performance of Llama 3.1 405B with NVIDIA internal measurements
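The speedup row can be checked directly from the two throughput rows. The short Python sketch below uses only the figures from Table 1; the variable names are illustrative.

```python
# Throughput figures (output tokens/second) copied from Table 1.
lengths = ["2,048 | 128", "32,768 | 2,048", "120,000 | 2,048"]
model_optimizer_fp8 = [463.1, 320.1, 71.5]
official_llama_fp8 = [399.9, 230.8, 49.6]

# Speedup = Model Optimizer throughput divided by the official recipe's throughput.
speedups = [
    round(optimized / baseline, 2)
    for optimized, baseline in zip(model_optimizer_fp8, official_llama_fp8)
]

for length, speedup in zip(lengths, speedups):
    print(f"{length}: {speedup:.2f}x")
```

The ratios grow with input length, which is where the headline figure of up to 1.44x comes from.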

Similarly, Table 2 presents the minimum latency performance using the same input and output sequence lengths.








Batch Size = 1 Performance – Output Tokens/Second
8 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths    2,048 | 128    32,768 | 2,048    120,000 | 2,048
TensorRT Model Optimizer FP8       49.6           44.2              27.2
Official Llama FP8 Recipe          37.4           33.1              22.8
Speedup                            1.33x          1.33x             1.19x

Table 2. Minimum latency performance of Llama 3.1 405B with NVIDIA internal measurements

These results indicate that H200 GPUs with TensorRT-LLM and TensorRT Model Optimizer deliver higher performance than the official Llama FP8 recipe in both latency-optimized and throughput-optimized scenarios. The TensorRT Model Optimizer FP8 recipe also achieved accuracy comparable to the official Llama 3.1 FP8 recipe on the Massive Multitask Language Understanding (MMLU) and MT-Bench benchmarks.

Fitting Llama 3.1 405B on Just Two H200 GPUs with INT4 AWQ

For developers with hardware resource constraints, the INT4 AWQ technique in TensorRT Model Optimizer compresses the model, allowing Llama 3.1 405B to fit on just two H200 GPUs. This method reduces the required memory footprint significantly by compressing the weights down to 4-bit integers while encoding activations using FP16.
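A rough back-of-the-envelope calculation shows why 4-bit weights make a two-GPU deployment plausible. The Python sketch below counts weight storage only, taking 405 billion parameters from the model name and 141 GB per H200 from the system description above. It ignores quantization scales, the KV cache, activations, and runtime buffers, so real memory use will be higher.

```python
# Weight-only memory estimate; an approximation, not a measurement.
PARAMS = 405e9           # parameter count implied by the model name
H200_MEMORY_GB = 141     # per-GPU HBM3e capacity quoted in the article


def weight_gb(bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return PARAMS * bits_per_weight / 8 / 1e9


fp16_gb = weight_gb(16)
fp8_gb = weight_gb(8)
int4_gb = weight_gb(4)
two_gpu_capacity_gb = 2 * H200_MEMORY_GB

for name, size in [("FP16", fp16_gb), ("FP8", fp8_gb), ("INT4", int4_gb)]:
    fits = "fits" if size < two_gpu_capacity_gb else "does not fit"
    print(f"{name}: {size:.1f} GB of weights, {fits} in {two_gpu_capacity_gb} GB")
```

Under these assumptions only the INT4 weights (about 202.5 GB) fit within the 282 GB of two H200 GPUs, leaving the remainder for the KV cache and activations.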

Tables 4 and 5 show the maximum throughput and minimum latency performance measurements. According to NVIDIA, the INT4 AWQ method also provides accuracy scores comparable to the official Llama 3.1 FP8 recipe from Meta.






Maximum Throughput Performance – Output Tokens/Second
2 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths      2,048 | 128    32,768 | 2,048    60,000 | 2,048
TensorRT Model Optimizer INT4 AWQ    75.6           28.7              16.2

Table 4. Maximum throughput performance of Llama 3.1 405B with NVIDIA internal measurements






Batch Size = 1 Performance – Output Tokens/Second
2 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths      2,048 | 128    32,768 | 2,048    60,000 | 2,048
TensorRT Model Optimizer INT4 AWQ    21.6           18.7              12.8

Table 5. Minimum latency performance of Llama 3.1 405B with NVIDIA internal measurements

NVIDIA’s advancements in TensorRT Model Optimizer and TensorRT-LLM are paving the way for enhanced performance and efficiency in running large language models like Llama 3.1 405B. These improvements offer developers more flexibility and cost-efficiency, whether they have extensive hardware resources or more constrained environments.
