Building Supervised Fine-Tuning Data from NVIDIA Open-SWE-Traces: Trajectory Parsing, Patch Analysis, Token Budgets, and Tool-Use Metrics

In this tutorial, we explore the Open-SWE-Traces dataset as a practical resource for studying and preparing agentic software-engineering trajectories for fine-tuning. We stream the dataset directly from Hugging Face, which lets us work with it efficiently in Google Colab without downloading everything locally. We inspect individual records, normalize multi-turn agent conversations, parse final code patches, extract useful metadata, and build an analysis DataFrame to understand trajectory length, tool usage, patch size, language distribution, and resolution outcomes. We then use these insights to create a curated supervised fine-tuning subset that keeps only trajectories passing four filters: success labels, token limits, language filters, and patch availability.
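To make the patch-parsing and curation steps concrete, here is a minimal sketch that runs on a hand-written record rather than the real dataset. The field names `resolved`, `num_tokens`, `language`, and `patch`, the default token limit, and both function names are assumptions for illustration; the actual Open-SWE-Traces schema may differ and should be checked against a streamed record first.

```python
from dataclasses import dataclass


@dataclass
class PatchStats:
    files: int
    added: int
    removed: int


def patch_stats(patch: str) -> PatchStats:
    """Count touched files and added/removed lines in a unified diff.

    Simplification: any line starting with '+++' or '---' is treated as a
    file header, so a changed line whose content itself begins with '--'
    or '++' would be miscounted.
    """
    files = added = removed = 0
    for line in patch.splitlines():
        if line.startswith("diff --git "):
            files += 1
        elif line.startswith("+++") or line.startswith("---"):
            continue  # file header lines, not content changes
        elif line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    return PatchStats(files, added, removed)


def keep_for_sft(record: dict, max_tokens: int = 32_000,
                 languages: tuple = ("python",)) -> bool:
    """Apply the four filters: success, token limit, language, patch present.

    All field names here are hypothetical placeholders.
    """
    patch = record.get("patch") or ""
    return (
        bool(record.get("resolved"))
        and record.get("num_tokens", 0) <= max_tokens
        and record.get("language") in languages
        and bool(patch.strip())
    )


# A tiny hand-written diff used only to exercise the functions above.
SAMPLE_PATCH = "\n".join([
    "diff --git a/a.py b/a.py",
    "--- a/a.py",
    "+++ b/a.py",
    "@@ -1,2 +1,2 @@",
    "-x = 1",
    "+x = 2",
    " y = 3",
])

sample_record = {
    "resolved": True,
    "num_tokens": 1200,
    "language": "python",
    "patch": SAMPLE_PATCH,
}

stats = patch_stats(SAMPLE_PATCH)
print(stats, keep_for_sft(sample_record))
```

Keeping the filter as a plain predicate over a dict means it can be applied lazily to a streamed dataset, one record at a time, without materializing everything in memory.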

Installing Dependencies and Configuration

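import subprocess, sys

def _pip(*pkgs):
    # Install packages quietly with the current interpreter's pip.
    # check=False means a failed install does not raise an exception.
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=False)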
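Because the helper runs pip with `check=False`, a failed install passes silently. One way to catch that early is to check which modules are importable before moving on. The sketch below is an assumption layered on top of the tutorial, not part of it: `missing_modules` is a hypothetical helper, and it expects top-level import names (which can differ from pip package names).

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# "json" ships with Python; the second name is deliberately nonexistent.
print(missing_modules(["json", "no_such_module_xyz_123"]))
```

An empty list means every listed module can be located; anything else is worth reinstalling before the notebook continues.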

