Building a Code Dataset Pipeline from NVIDIA Nemotron-Pretraining-Code-v3 Metadata with Streaming, Pandas, and tiktoken

In this tutorial, we work with NVIDIA’s Nemotron-Pretraining-Code-v3 dataset as a large-scale metadata index for code pretraining research. Instead of downloading the full multi-gigabyte dataset, we stream it, inspect its schema, and build a manageable sample for analysis. We then explore the dataset by studying languages, file extensions, repository frequency, and directory depth, which helps us understand how the index is structured. After that, we reconstruct the raw GitHub URLs from the metadata, attempt to fetch the actual source files, and estimate the token scale of the fetched code. By the end of the workflow, we create a reusable filtered sample and save processed outputs for further experimentation.
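Before touching the real data, it helps to see what the URL reconstruction and path analysis amount to. The following is a minimal sketch, not the dataset's documented schema: the field names `repo`, `commit_id`, and `rel_path` are assumptions chosen for illustration, and the actual column names should be read from the streamed schema. The URL layout follows the usual `raw.githubusercontent.com/<owner>/<repo>/<ref>/<path>` pattern.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import quote

RAW_HOST = "https://raw.githubusercontent.com"


def raw_github_url(repo: str, commit: str, path: str) -> str:
    """Build a raw-file URL from 'owner/name', a commit SHA, and a repo-relative path."""
    # Percent-encode unsafe characters in the path but keep '/' separators.
    clean_path = quote(path.lstrip("/"), safe="/")
    return f"{RAW_HOST}/{repo}/{commit}/{clean_path}"


def path_stats(path: str) -> tuple[str, int]:
    """Return (lower-cased file extension, directory depth) for a repo-relative path."""
    p = PurePosixPath(path.lstrip("/"))
    # Depth counts the directories above the file, so a top-level file has depth 0.
    return p.suffix.lower(), len(p.parts) - 1


# Hypothetical metadata rows; the keys are assumed, not taken from the dataset card.
rows = [
    {"repo": "octo/demo", "commit_id": "abc123", "rel_path": "src/my utils/io.py"},
    {"repo": "octo/demo", "commit_id": "abc123", "rel_path": "README"},
    {"repo": "acme/tool", "commit_id": "def456", "rel_path": "lib/core/Main.PY"},
]

urls = [raw_github_url(r["repo"], r["commit_id"], r["rel_path"]) for r in rows]
ext_counts = Counter(path_stats(r["rel_path"])[0] for r in rows)
depths = [path_stats(r["rel_path"])[1] for r in rows]
repo_counts = Counter(r["repo"] for r in rows)
```

The same two helpers can be applied column-wise once the sample is in a pandas DataFrame; keeping them as plain functions makes them easy to check on a handful of rows first.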

Streaming the NVIDIA Nemotron-Pretraining-Code-v3 Dataset and Inspecting Its Schema

!pip -q install -U "datasets>=2.19" huggingface_hub tiktoken pyarrow 2>/dev/null

import os, io, time, itertools, collections, textwrap, math

import pandas as pd
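With the imports in place, the streaming step can be sketched as follows. This is an illustrative outline under stated assumptions: the repository id `nvidia/Nemotron-Pretraining-Code-v3` and the `train` split are assumed here and should be checked against the dataset card (a configuration name or an access token may also be required). The helpers `open_stream`, `take_sample`, and `schema_of` are hypothetical names introduced for this example.

```python
import itertools


def open_stream(repo_id="nvidia/Nemotron-Pretraining-Code-v3", split="train"):
    """Open the dataset lazily; repo_id and split are assumptions, not confirmed values."""
    # Imported here so the helpers below work even without `datasets` installed.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset and avoids a full download.
    return load_dataset(repo_id, split=split, streaming=True)


def take_sample(stream, n):
    """Pull at most n records from any iterable without consuming the rest."""
    return list(itertools.islice(stream, n))


def schema_of(rows):
    """Map each column name to the set of Python type names seen in the sample."""
    seen = {}
    for row in rows:
        for key, value in row.items():
            seen.setdefault(key, set()).add(type(value).__name__)
    return seen


# Usage (requires network access, so it is left commented out):
# rows = take_sample(open_stream(), 5000)
# df = pd.DataFrame(rows)
# print(schema_of(rows))
```

Because `take_sample` accepts any iterable, it can be exercised on a small generator before pointing it at the real stream, which keeps the sampling logic separate from network and authentication concerns.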
