I've been using LLMs to classify GitHub pull requests into changelog categories. The goal: automatically decide whether a PR is a feature, bugfix, breaking change, or internal noise.

It took several iterations to get consistent output. Here's what actually worked.

The problem with direct classification

The naive approach:

Classify this PR: feature / bugfix / breaking / internal.
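
For concreteness, here is a rough sketch of how that one-line prompt might be wired up in Python. It is an illustration, not my actual script: `build_naive_prompt`, `classify_naive`, and the `Title:` / `Description:` layout are hypothetical names I made up for this example, and `complete` is a stand-in for whatever LLM client you use (any function that takes a prompt string and returns the reply text).

```python
from typing import Callable, Optional

# The four changelog categories from the prompt above.
CATEGORIES = ("feature", "bugfix", "breaking", "internal")


def build_naive_prompt(title: str, description: str) -> str:
    """Put the one-line instruction in front of the PR text.

    The Title/Description layout is an assumption made for this sketch.
    """
    return (
        "Classify this PR: feature / bugfix / breaking / internal.\n\n"
        f"Title: {title}\n"
        f"Description: {description}"
    )


def classify_naive(
    title: str,
    description: str,
    complete: Callable[[str], str],  # hypothetical stand-in for an LLM call
) -> Optional[str]:
    """Send the naive prompt and accept the reply only if it is a bare label."""
    reply = complete(build_naive_prompt(title, description))
    answer = reply.strip().lower()
    # Anything other than exactly one of the four labels is treated as unusable.
    return answer if answer in CATEGORIES else None
```

The strict check at the end is a deliberate choice for the sketch: it only accepts a reply that is exactly one of the four labels and returns `None` for everything else, so unusable replies are easy to count.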