Long Context Fine-Tuning: A Technical Deep Dive
The landscape of Large Language Models (LLMs) is rapidly evolving, with context lengths expanding from a few thousand tokens a year ago to millions of tokens now. This increase has very real implications for enterprise applications, particularly in Retrieval Augmented Generation (RAG), document analysis, and summarization systems. While prior models were limited to processing a few pages of text, modern models like Meta's Llama 3.2 series can handle 131K tokens, which is the equivalent of a 200-page novel.

This capability is very useful when working with enterprise data. Traditional RAG systems often require complex chunking and re-ranking strategies to work within context constraints. However, with extended context lengths, organizations can now process entire documents or multiple documents simultaneously, potentially simplifying architectures while improving accuracy. These advancements are particularly valuable when you're working with:

- Enterprise document RAG systems
- Multi-document question answering
- Code repository understanding and generation
- Financial report processing and summarization
- Complex tool and API interactions for agentic systems

The problem, however, lies in implementing reliable and performant long context systems, which isn't as straightforward as simply using LLMs that have a higher theoretical context limit. Recent work shows that most models degrade at context lengths much smaller than the maximum context length quoted for the model. For example, suppose you're using a model with a maximum context length of 131k tokens. Try passing it any random sequence of 90,000 tokens and ask it to repeat back to you the last 100 words in the sequence. You'll find that the LLM has problems with this simple regurgitation task!

This is a problem because extended context length capabilities are currently available primarily in frontier models that come with significant usage costs.
To enhance performance on long-context tasks, you need to teach the model how to use long sequences effectively. With the latest updates, the Together AI platform now supports fine-tuning on context lengths as large as 32k tokens, with longer sequence lengths to follow. By fine-tuning smaller models to handle longer contexts, organizations can achieve comparable performance at a fraction of the cost. This approach is particularly valuable for enterprise applications, where data privacy and ownership are crucial considerations.

Long context fine-tuning is quite different from regular fine-tuning and presents its own challenges, which we discuss below. In this technical deep dive, we'll explore and demonstrate:

- Problems that LLMs have working with long sequences
- The solution: fine-tuning on long sequences
- Practical problems of long context fine-tuning and how we solved them
- Real-world example + code: improving the summarization capabilities of Llama 3.1 8B

If you would like to dive into code directly, please refer to the notebooks below:

- Notebook 1: We show how LLMs have a problem with simple repetition tasks when it comes to long-context inputs and how we can solve it.
- Notebook 2: We show how you can improve the summarization capabilities of Llama 3.1 8B by fine-tuning.

Demonstrating the Long Context Problem

A recent paper showed diminishing returns when models are prompted with sequences longer than their optimal threshold. For example, for Llama 3.1 405B, performance began to degrade beyond 32K tokens. The graph below from this paper shows which models degrade after a certain optimal token length:

The authors identified the main problems these models face when dealing with long-context sequences:

1. The "Lost in the Middle" Problem

Models struggle with information in the middle sections of the context, and the performance degradation increases with the context length. Retrieval of information placed in the middle of the context becomes less reliable.

2. Effective Context Length Limitations

Research from the RULER paper shows that the usable context length often falls short of advertised maximums. In reality, the effective length varies by task type, and performance begins to decrease well before reaching the length limit. Interestingly, they also found that different tasks have different optimal context lengths.

These findings suggest an important consideration for practitioners: you can't just use the maximum available length. You need to experiment to find the optimal length for your task, which may vary from model to model!

Simple Repetition Task

To demonstrate this performance degradation, we conduct a toy experiment: we ask an LLM to repeat back to us the last `k` words in a provided sequence.

To solve this repetition task, an LLM should be able to use a simple induction head that just copies a specific part of the input. An induction head is a key component in Transformer models that identifies repeated sequences and uses previous occurrences to predict what comes next. This capability is fundamental to how Transformers process language, enabling them to learn from repetition and make predictions based on previously seen patterns. For more information, you can read the extensive research by Anthropic on visualizing induction heads.

For the detailed analysis, please refer to the complete notebook here. In short, we demonstrate that an untuned Llama 3.1 70B model performs suboptimally on this task (with a Levenshtein distance ratio of 0.37), and that by fine-tuning on just 2000 long-context examples, we can get an 8B model to perform this task at almost perfect accuracy (Levenshtein distance ratio of 0.81).
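To make the repetition task concrete, here is a minimal sketch of how such a test case could be built and scored. It is not the code from the notebook: the prompt wording, the toy vocabulary, and the helper names (`build_repetition_prompt`, `levenshtein_distance`, `similarity_ratio`) are all assumptions made for illustration. In particular, the ratio used here is `1 - distance / max(len)`, which may differ from the exact Levenshtein ratio definition behind the numbers quoted above. The sequence length is also counted in words rather than tokens, so you would need your model's tokenizer to hit a specific token budget.

```python
import random


def build_repetition_prompt(num_words: int, k: int, seed: int = 0) -> tuple[str, str]:
    """Return (prompt, expected_answer) for a 'repeat the last k words' test case.

    The vocabulary and instruction text are illustrative assumptions.
    """
    rng = random.Random(seed)
    vocab = ["apple", "river", "stone", "cloud", "tiger", "piano", "lemon", "orbit"]
    words = [rng.choice(vocab) for _ in range(num_words)]
    expected = " ".join(words[-k:])
    prompt = (
        f"Below is a sequence of words. Repeat the last {k} words exactly, "
        "separated by single spaces, and output nothing else.\n\n"
        + " ".join(words)
    )
    return prompt, expected


def levenshtein_distance(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def similarity_ratio(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means the strings are identical.

    Assumed definition: 1 - distance / length of the longer string.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


if __name__ == "__main__":
    prompt, expected = build_repetition_prompt(num_words=200, k=10, seed=42)
    # Stand-in for a real model response: here the "model" drops the final word.
    simulated_output = " ".join(expected.split()[:-1])
    print(f"expected:  {expected}")
    print(f"simulated: {simulated_output}")
    print(f"similarity ratio: {similarity_ratio(simulated_output, expected):.3f}")
```

In a real experiment you would send `prompt` to the model under test, pass its response in place of `simulated_output`, and average the ratio over many seeds and sequence lengths to see where performance starts to drop.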













