The Allen Institute for AI (AI2) created OLMo, the Open Language Model, a large open-source language model whose goal is to advance the science of language models through open research. It marks a major step in the evolution of large language models.
Unlike current large open language models such as Llama and Mistral, which can limit access to their training data, architectures, or evaluation methodologies, OLMo stands out by offering full access to its pre-training data, training code, model weights, and evaluation suite. This openness aims to enable academics and researchers to collectively study and advance the field of language modeling.
OLMo represents a collaborative effort to advance the science of language models. The developers behind the LLM are on a mission to empower academics and researchers by giving them access to the training code, models, and assessment code needed for open research.
OLMo is built on AI2's Dolma dataset, an open corpus of three trillion tokens. The release includes full model weights for four 7B-scale model variants, each trained on at least 2T tokens; a minimal loading sketch follows below. Innovative aspects of OLMo include its training approach, its size, and the diversity of the data it was trained on. The features that set it apart from its predecessors are its open-source nature and the comprehensive release of training and evaluation tools.
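The released weights can be loaded with standard tooling. The snippet below is a minimal sketch, assuming the "allenai/OLMo-7B" repository on the Hugging Face Hub and a transformers release with native OLMo support; older releases may instead require the hf_olmo package and trust_remote_code=True.

```python
# Minimal sketch: loading one of the released OLMo 7B-scale variants and
# generating a short continuation. Repository id and library support are
# assumptions noted above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B"  # one of the four released 7B-scale variants
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Language modeling is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```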
OLMo’s key differentiators include:
Complete pre-training data and code: OLMo is built on AI2's Dolma dataset, an open corpus of three trillion tokens covering a diverse mix of web content, academic publications, code, books, and encyclopedic documents. This dataset is publicly available, allowing researchers to understand and leverage the exact data used to train the models.
Full framework release: The framework includes not only model weights but also training code, inference code, training metrics, and logs for four 7B-scale model variants. It even provides more than 500 checkpoints per model for in-depth evaluation, all under the Apache 2.0 license (a sketch of loading an intermediate checkpoint follows this list).
Evaluation and benchmark tools: AI2 has also released Paloma, a benchmark for evaluating language models across diverse domains. This allows for standardized performance comparisons and deeper insight into the model's capabilities and limitations (a rough perplexity sketch appears after this list).
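The 500+ intermediate checkpoints mentioned above can, in principle, be pulled individually for training-dynamics studies. The sketch below assumes the checkpoints are published as revisions of the Hugging Face repository; the "stepN-tokensXB" naming scheme shown here is an assumption, so consult the model card for the actual branch names.

```python
# Minimal sketch: loading a single intermediate training checkpoint by revision.
# The revision string is a hypothetical example of the checkpoint naming scheme.
from transformers import AutoModelForCausalLM

checkpoint = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMo-7B",
    revision="step1000-tokens4B",  # hypothetical intermediate checkpoint name
)
```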
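Paloma is built around perplexity measured over many domains. As a rough, hedged illustration only (not Paloma's actual harness), the sketch below computes perplexity of an assumed OLMo checkpoint on a single stand-in snippet; a real Paloma run would aggregate this over the benchmark's domain-specific evaluation sets.

```python
# Minimal sketch of a perplexity measurement on one piece of text, standing in
# for a single Paloma domain. Model id is an assumption, as noted above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

text = "Open research on language models benefits from shared data and code."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean cross-entropy loss.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss)
print(f"perplexity: {perplexity.item():.2f}")
```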
Where contemporaries such as Llama and Mistral have made significant contributions to the AI landscape through their respective advancements and specializations, OLMo's commitment to openness and transparency sets a new precedent. It promotes a collective and transparent approach to ethically understanding, improving, and advancing the capabilities of language models.
AI2’s development of OLMo is a collaborative effort involving partnerships with several organizations and institutions. AI2 has partnered with AMD and CSC, using the GPU portion of the pre-exascale LUMI supercomputer, which is powered entirely by AMD processors. This collaboration covers the hardware and computing resources necessary for the development of OLMo.
AI2 has also partnered with organizations such as Surge AI and MosaicML for data and training code. These partnerships are crucial to providing the diverse datasets and sophisticated training methodologies that underpin OLMo’s capabilities. Collaboration with the Paul G. Allen School of Computer Science and Engineering at the University of Washington and Databricks, Inc. also played a central role in the realization of the OLMo project.
It’s important to note that OLMo in its current form is not the same kind of model that powers chatbots or AI assistants, which rely on instruction-tuned models. However, instruction tuning is on the roadmap. According to AI2, many improvements are planned: in the coming months, AI2 intends to iterate on OLMo by introducing different model sizes, modalities, datasets, and capabilities to the OLMo family. This iterative process aims to continually improve the model’s performance and usefulness to the research community.
OLMo’s open and transparent approach, along with its advanced capabilities and commitment to continuous improvement, makes it a major step in the evolution of LLMs.