Tagging my blog posts with BERTopic and LLMs

I recently added tags to my blog using BERTopic and a mix of LLMs. You can see the tags in the sidebar to the right (or in the footer on mobile). I’ve done this before in 2023, with GGUF Mistral using llama-cpp, but never finished the project. Now, because the models have been getting so good, and my project was small, relatively well-defined, and easy to evaluate, the project took me about 6-10 hours over a month, using BERTopic, Claude Code, and Pi with Deepseek.Why so many different AI tools? Mostly to evaluate their different ways of working. Much of that time was spent noodling on the UX experience of the tags rather than iterating on the tags themselves.One of the genuinely useful use-cases of LLMs these days is for finishing personal projects that don’t touch production and have a small surface area that’s personalized for you. In other words, as Robin Sloan wrote, an app can be a home-cooked meal.I love having a static site because of how easy it is to write and publish content and how fast it loads, but sometimes I wish it was slightly more fully-featured. LLMs have allowed me to add features like search. The theme I use, Hugo Bear Blog, already has support for tags, but I’d never added them to posts, and I also wanted a slightly different way to visualize them.We consumers generally use LLMs for text or image or, if we’re developers, for code generation. But, one of the most underrated features of LLMs is the ability to compress rather than generate. This is really unsurprising: LLMs are, after all, natural language models.Since they were trained and fine-tuned originally on language modeling tasks they also perform really well at all the tasks that language models are meant for, such as summarization, information retrieval, question answering. LLMs are really good at labelling things. That is, they’re good at topic modeling, the machine learning task behind tagging, especially in a zero-shot context (where they have no previous training data from you specifically).In the early days of online blogging, tags were important for facilitating content discovery. People initially started tagging their blog posts on their individual blogs. Eventually, site aggregators like Delicious, surfaced top links with tags by aggregating tags across top links shared by users.Pinboard was another prominent platform where finding content through tags and looking at aggregated tags was an important feature of the platform. Coincidentally, Delicious was later acquired by Pinboard.Early on, the best way to find things that you liked was to manually participate in the curation of these folksonomies. Twitter and Tumblr developed some of the most creative folksonomic tagging systems. On Tumblr, tags became a way to not only discover posts, but to have conversations with other people about your post. On Twitter, hashtags became a way to signal communal discovery of people with shared interests before the implementation of Twitter’s SimClusters algorithm.Tagging and hashtags served as an implicit contract of content discovery across platforms for over a decade, across Twitter, Instagram, TikTok, and many other services. However, big social has been in big decline for a while now. Group chats have arisen as a medium for exchange and discovery, as well as Discord groups that rely less on traditional tagging mechanisms. Bluesky, a social platform that was founded more recently, has the ability to add hashtags to posts, but most folks don’t do so. Discovery happens, like with many platforms today, through starter packs and custom algorithmic feeds.The rise of LLMs led to an even greater decrease in the power of individual websites to add signal. An increase in AI overview features in search results that offer either summarization or RAG-assisted source synthesis has meant that visits to actual websites are dropping faster than ever. With LLMs, the rise of semantic and blended agentic-style search as a discovery mechanism means tags are not as important.Within blogging, content surfaces like (oh God) LinkedIn native posts and X’s longform posts are contributing to platform-specific lock-in. All of our public blog content is being scraped as training data anyway, and agentic search and RAG mean people access content through an LLM’s interpretation of it rather than going directly to a page. RSS still exists, but who is going to syndicate a site when they could write an article on X or LinkedIn? (Me, sure.) You can subscribe to my feed!. Blog tags as a mechanism for understanding what my blog is about realistically still probably matter only to me. But I still want them!Historical approaches to tagging with LDAGenerally, synthesizing and detecting topics across a body of text is an unsupervised learning machine learning problem. We don’t know ahead of time what the categories are. We have to infer them from the corpus.Traditionally, this was a really hard statistical problem in machine learning solved with approaches like latent dirichlet allocation (LDA). I really recommend Ted’s post if you’re interested in more of the core implementation details. The key idea is that, given a corpus of text documents, each document is modeled as a collection of topics. Each topic is a distribution over words, which means that each word in our corpus has a stated probability for being a member of a topic.There were a number of unsolved problems with this approach, and with all statistical approaches. The biggest one was that the underlying assumption of topic modeling is that documents are bags of words. That is, words that are independent of each other. But, because words lose their context when they’re broken up from their constituent sentences (as we did when we created the tokenized input text for the model), the model wouldn’t recognize that the word “bank” in these two sentences, with “I went to the bank” and “The bank of the river overflowed with water” are different from one another.One of the most trite and overused quotes in NLP is “You shall know a word by the company it keeps”, and this turned out to be one of the core successes of evolving neural approaches for language processing.From embeddings to tagging with early LLMs in 2023The evolving use of embeddings made it much easier to represent items that were semantically close together mathematically, by projecting into the same multi-dimensional vector space. The following generation of contextual embeddings calculated vectors based on the position or representation of the word, which eventually led to models that fully modeled the context - or the relationship of every word to every other word in a given text at the same time, creating weights of how relevant a word is to any other given word, aka the attention mechanism.The first part of any topic modeling approach is picking your seed topics that the model can extrapolate from. You can do this in a number of ways: by bootstrapping from a couple of selected ones and creating more detailed ones, by picking plausible topics across your document set through tf-idf or LDA, by going through all the content and having an LLM zero-shot them.Seeding topics with MistralThis is what I tried to do the first time I tried to do this exercise. When local models like Mistral 7B-Instruct became available in 2023, I started playing around with them and realized that they did well at tagging. I decided I wanted to see if I could tag my blog with topics.The first part was figuring out a set of tags from the given documents. I did some searching and found a paper called TopicGPT which was already exploring ways to use LLMs to do tasks that were previously done by BERT encoder models, and prior to that, latent dirichlet allocation.I adapted the template in the paper with GGUF quants of Mistral 7B using llama-cpp-python (ancient history by now!).Prompt templates were really important, less so than now, and the code looked like this:system = f"""

Tagging my blog posts with BERTopic and LLMs

Tagging my blog posts with BERTopic and LLMs

Other newsrooms on this story

Related reading

How I use pluckmd to read blogs with an AI agent

Let's talk about LLM evaluation

A Visual Guide to Attention Variants in Modern LLMs

Source Score: Continuing Exploration of LLM Usage in Automated Workflows

Two months building an investment bot. What it taught me about LLMs

Programmatic SEO with Medium Tag Hubs (Done Ethically)

Other newsrooms on this story

Related reading

How I use pluckmd to read blogs with an AI agent

Let's talk about LLM evaluation

A Visual Guide to Attention Variants in Modern LLMs

Source Score: Continuing Exploration of LLM Usage in Automated Workflows

Two months building an investment bot. What it taught me about LLMs

Programmatic SEO with Medium Tag Hubs (Done Ethically)