FAQ: Building LLMs with RedPajama-v2, a 30 trillion token web dataset

Over the past several months, we have been amazed by the community's engagement with the RedPajama-V2 dataset. With over 20,000 downloads per month, the 30 trillion tokens of deduplicated web data, along with their quality signals, have been used to train leading models like the recently released Snowflake Arctic LLM.

RedPajama-V2 is designed as a high-recall rather than a high-precision dataset. This approach enables researchers to employ different data selection techniques and experimentally discover recipes that produce downstream models with desired properties. Data selection from RedPajama-V2 can also be used to fine-tune pre-trained models on classes of documents that imbue the specific functionality or specialization required of those models.

To facilitate dataset use and help the community maximize its value, we've compiled a list of frequently asked questions.

Q: Should I use the RedPajama-V2 dataset out of the box?

RedPajama-V2 is conceptualized as a pool of data that serves as a foundation for creating high-quality datasets. The dataset is thus not intended to be used out of the box; depending on the application, data should be filtered using the quality signals that accompany it. With this dataset, we take the view that the optimal filtering of data depends on the intended use. Our goal is to provide all the signals and tooling that enable this.

Q: Is RedPajama-V2 deduplicated?

The raw dataset is not deduplicated. This is an intentional design choice to preserve as much information in the raw data as possible and to facilitate research into the role and best method of deduplication. Instead, we provide duplication as one of the signals: the ids of documents that are duplicated across the entire corpus. The deduplication was performed using a Bloom filter and the hashes of the web documents (i.e., of the raw .wet documents). These ids can be used to deduplicate the dataset.
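As a rough illustration of the mechanism described above, the following Python sketch flags exact duplicates with a small Bloom filter over content hashes and then drops the flagged ids. It is not the pipeline used to build RedPajama-V2: the `BloomFilter` class, the SHA-256 hashing, the filter size, the `(doc_id, text)` record shape, and the id strings are all assumptions made for the example. It also assumes that the first occurrence of a document is kept and only later copies are flagged.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: k salted hash positions over a fixed bit array."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: bytes):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(2, "little") + item).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add_and_check(self, item: bytes) -> bool:
        """Insert item; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def find_duplicate_ids(docs):
    """Return the ids of documents whose content hash was seen earlier."""
    bloom = BloomFilter()
    duplicate_ids = set()
    for doc_id, text in docs:
        content_hash = hashlib.sha256(text.encode("utf-8")).digest()
        if bloom.add_and_check(content_hash):
            duplicate_ids.add(doc_id)
    return duplicate_ids


def drop_duplicates(docs, duplicate_ids):
    """Keep only documents whose id is not in the duplicate-id set."""
    return [(doc_id, text) for doc_id, text in docs if doc_id not in duplicate_ids]


# Hypothetical records: (doc_id, text). The id format is made up.
docs = [
    ("a/0", "hello world"),
    ("a/1", "something else"),
    ("b/0", "hello world"),
]
duplicate_ids = find_duplicate_ids(docs)
kept = drop_duplicates(docs, duplicate_ids)
print(sorted(duplicate_ids))
print([doc_id for doc_id, _ in kept])
```

A Bloom filter can report false positives but never false negatives, so a sketch like this may occasionally drop a unique document while never retaining an exact duplicate. With the published duplicate ids, only the `drop_duplicates` step would be needed.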
In particular, the dataset loaded via the Hugging Face dataloader provides the raw data with the quality signals and duplication tags and is not deduplicated. We also provide minhash signatures for further fuzzy deduplication at different levels of similarity.

Q: What does the structure of the dataset look like?

The basic structure of the dataset largely follows the output logic of the CCNet pipeline, where data is partitioned into shards and grouped according to language and perplexity bucket. The documents (i.e., the text data) are organized according to the following structure:

documents///_.json.gz
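Returning to the minhash signatures mentioned in the deduplication answer above, the Python sketch below shows the general idea behind fuzzy deduplication: two documents whose signatures agree in a large fraction of positions are likely near-duplicates. This is a generic minhash illustration, not the signature format shipped with the dataset. The word 3-gram shingling, the SHA-256 based hash family, the 128 permutations, and the function names are all assumptions made for the example.

```python
import hashlib


def shingles(text, n=3):
    """Split text into a set of overlapping word n-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(text, num_perm=128):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    shingle_set = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(hashlib.sha256(salt + s.encode("utf-8")).digest()[:8], "little")
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


original = "the quick brown fox jumps over the lazy dog near the river bank"
near_copy = "the quick brown fox jumps over the lazy dog near the river shore"
unrelated = "completely unrelated sentence about training large language models today"

sig_original = minhash_signature(original)
sim_near = estimated_jaccard(sig_original, minhash_signature(near_copy))
sim_far = estimated_jaccard(sig_original, minhash_signature(unrelated))
print(f"near copy: {sim_near:.2f}, unrelated: {sim_far:.2f}")
```

Choosing a similarity threshold above which a document counts as a duplicate is what "different levels of similarity" amounts to in a sketch like this: a higher threshold removes only close copies, while a lower one also removes looser paraphrases.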

Source: together.ai, Wednesday, May 27, 2026