In this tutorial, we work with NVIDIA’s Nemotron-Pretraining-Code-v3 dataset as a large-scale metadata index for code pretraining research. Instead of downloading the full multi-gigabyte dataset, we stream it, inspect its schema, and build a manageable sample for analysis. We then explore the dataset by studying languages, file extensions, repository frequency, and directory depth, which helps us understand how the index is structured. After that, we reconstruct the raw GitHub URLs from the metadata, attempt to fetch the actual source files, and estimate the token scale of the fetched code. By the end of the workflow, we create a reusable filtered sample and save processed outputs for further experimentation.
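As a rough preview of the URL reconstruction and directory-depth steps described above, here is a minimal sketch. The field names `repo`, `commit_id`, and `rel_path`, the helper names, and the sample record are hypothetical placeholders chosen for illustration; the actual column names should be read from the streamed schema. The sketch assumes the index stores a repository in `owner/name` form, a commit hash, and a file path relative to the repository root.

```python
from urllib.parse import quote

RAW_HOST = "https://raw.githubusercontent.com"

def build_raw_url(repo: str, commit: str, path: str) -> str:
    """Build a raw.githubusercontent.com URL from index metadata.

    `repo` is assumed to be "owner/name"; `path` is relative to the repo root.
    """
    # Percent-encode the path but keep "/" so the directory structure survives.
    safe_path = quote(path.lstrip("/"), safe="/")
    return f"{RAW_HOST}/{repo}/{quote(commit)}/{safe_path}"

def directory_depth(path: str) -> int:
    """Count how many directories sit above the file (0 = repository root)."""
    return path.strip("/").count("/")

# Hypothetical record; real field names come from the dataset schema.
record = {
    "repo": "octocat/Hello-World",
    "commit_id": "abc123",
    "rel_path": "src/utils/my file.py",
}

url = build_raw_url(record["repo"], record["commit_id"], record["rel_path"])
depth = directory_depth(record["rel_path"])
print(url)
print(depth)
```

Percent-encoding the path guards against spaces and other special characters in file names, which would otherwise produce invalid request URLs when we try to fetch the source files.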

Streaming the NVIDIA Nemotron-Pretraining-Code-v3 Dataset and Inspecting Its Schema

!pip -q install -U "datasets>=2.19" huggingface_hub tiktoken pyarrow 2>/dev/null

import os, io, time, itertools, collections, textwrap, math

import pandas as pd