Friday, January 9, 2026
SB Crypto Guru News- latest crypto news, NFTs, DEFI, Web3, Metaverse

NVIDIA Introduces Nemotron-CC: A Massive Dataset for LLM Pretraining

by SB Crypto Guru News
January 10, 2025
in Blockchain

Iris Coleman
Jan 10, 2025 14:13

NVIDIA debuts Nemotron-CC, a 6.3-trillion-token English dataset, enhancing pretraining for large language models with innovative data curation methods.



NVIDIA has announced the release of Nemotron-CC, a 6.3-trillion-token English-language dataset designed to advance the pretraining of large language models (LLMs). Derived from Common Crawl, the dataset aims to improve the accuracy and efficiency of LLMs through new data curation techniques and includes 1.9 trillion tokens of synthetically generated data, according to NVIDIA.

Enhancing LLM Pretraining

NVIDIA’s initiative addresses a critical need in LLM training, where the quality of pretraining datasets plays a pivotal role. While recent models like Meta’s Llama series have been based on datasets comprising up to 15 trillion tokens, the exact composition of these datasets remains largely undisclosed. Nemotron-CC seeks to fill this gap by providing the wider community with a high-quality dataset capable of supporting both short and long token horizon training.

Conventional curation pipelines often discard up to 90% of the raw data to improve benchmark accuracy, which limits the resulting datasets' usefulness for long training runs. Nemotron-CC shows how to turn Common Crawl data into a dataset that supports such runs: using methods such as classifier ensembling and synthetic data rephrasing, models trained on it surpass even Llama 3.1 8B.

Significant Results

Nemotron-CC’s efficacy is evidenced by its performance in various benchmarks. When training 8B parameter models for one trillion tokens, the high-quality subset Nemotron-CC-HQ outperforms leading datasets like DCLM, increasing MMLU scores by 5.6 points. Furthermore, the complete 6.3-trillion-token dataset matches DCLM on MMLU while offering four times more unique real tokens. This enables effective training over long token horizons, with Nemotron-CC-trained models surpassing Llama 3.1 8B in multiple metrics, including a 5-point increase in MMLU and a 3.1-point rise in ARC-Challenge scores.

Innovative Data Curation Techniques

The development of Nemotron-CC involved several key insights. By ensembling different model-based classifiers, NVIDIA was able to select a broader array of high-quality tokens. Rephrasing techniques reduced noise and errors, yielding diverse and valuable variants of the data. Disabling traditional heuristic filters for the high-quality data further increased the yield of high-quality tokens without compromising accuracy.
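To make the idea of classifier ensembling concrete, here is a minimal Python sketch. It is an illustration only, not NVIDIA's implementation: the two toy classifiers, the max-score combination rule, and the 0.5 threshold are all assumptions chosen for the example. It shows why an ensemble can keep documents that any single classifier would reject.

```python
from typing import Callable, List, Sequence

# A quality classifier maps a document to a score in [0.0, 1.0].
Classifier = Callable[[str], float]


def ensemble_score(doc: str, classifiers: Sequence[Classifier]) -> float:
    """Combine classifier scores by taking the maximum (an assumed rule)."""
    if not classifiers:
        raise ValueError("at least one classifier is required")
    return max(clf(doc) for clf in classifiers)


def select_high_quality(
    docs: Sequence[str],
    classifiers: Sequence[Classifier],
    threshold: float = 0.5,  # hypothetical cut-off
) -> List[str]:
    """Keep documents whose ensembled score reaches the threshold."""
    return [d for d in docs if ensemble_score(d, classifiers) >= threshold]


# Hypothetical stand-ins for real model-based classifiers.
def likes_explanations(doc: str) -> float:
    return 0.9 if "because" in doc.lower() else 0.2


def likes_numbers(doc: str) -> float:
    return 0.8 if any(ch.isdigit() for ch in doc) else 0.1


docs = [
    "Water boils because heat adds energy.",
    "Pi is about 3.14159.",
    "click here to subscribe",
]

single = select_high_quality(docs, [likes_explanations])
ensembled = select_high_quality(docs, [likes_explanations, likes_numbers])
```

In this toy run the single classifier keeps only the first document, while the ensemble also keeps the second, because each classifier favors a different kind of high-quality text.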

NVIDIA used its NeMo Curator tool to extract and refine data from Common Crawl, applying language filtering, deduplication, and quality classification. This process was complemented by synthetic data generation, which contributed approximately 1.9 trillion tokens to the dataset.
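The three stages named above (language filtering, deduplication, quality classification) can be sketched in plain Python. This does not use or reproduce the NeMo Curator API. Every function below is a hypothetical toy stand-in: the ASCII-ratio language check, the exact-match hashing, the distinct-word quality score, and the thresholds are assumptions made purely to show how the stages chain together.

```python
import hashlib
from typing import Iterable, Iterator, List


def looks_english(text: str) -> bool:
    """Toy language filter: share of ASCII letters and spaces (assumed heuristic)."""
    if not text:
        return False
    ok = sum(1 for ch in text if ch.isascii() and (ch.isalpha() or ch.isspace()))
    return ok / len(text) >= 0.8


def dedup_exact(docs: Iterable[str]) -> Iterator[str]:
    """Drop exact duplicates after lowercasing and whitespace normalization."""
    seen = set()
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def quality_score(text: str) -> float:
    """Toy quality score: fraction of distinct words (stands in for a model)."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def curate(docs: Iterable[str], min_quality: float = 0.6) -> List[str]:
    """Chain the three stages: language filter, dedup, quality filter."""
    english = (d for d in docs if looks_english(d))
    unique = dedup_exact(english)
    return [d for d in unique if quality_score(d) >= min_quality]


raw = [
    "The cat sat on the mat.",
    "the cat  sat on the mat.",   # duplicate after normalization
    "buy buy buy buy now",        # repetitive, low toy quality score
    "Это не английский текст.",   # non-English
]
curated = curate(raw)
```

Only the first document survives all three stages here. A production pipeline would replace each toy function with a trained language identifier, fuzzy deduplication, and model-based quality classifiers.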

Future Prospects

Nemotron-CC is positioned as a vital resource for pretraining state-of-the-art LLMs over varying token horizons. NVIDIA plans to expand its offerings by releasing more specialized datasets, including those focused on specific domains like mathematics, to further enhance LLM capabilities.




Copyright © 2022 - SB Crypto Guru News.
SB Crypto Guru News is not responsible for the content of external sites.
