
Fwd: Top Important LLM Papers for the Week from 29/01 to 04/02

Forwarded.

Marie Seshat Landry
CEO / Spymaster
Marie Landry's Spy Shop
www.marielandryceo.com


---------- Forwarded message ---------
From: Youssef Hosni from To Data & Beyond <youssefh@substack.com>
Date: Tue, Feb 6, 2024 at 6:06 PM
Subject: Top Important LLM Papers for the Week from 29/01 to 04/02
To: <marielandryx@gmail.com>



Top Important LLM Papers for the Week from 29/01 to 04/02

Stay Updated with Recent Large Language Models Research

Feb 6
 

Large language models (LLMs) have advanced rapidly in recent years. As new generations of models are developed, researchers and engineers need to stay informed on the latest progress. This article summarizes some of the most important LLM papers published during the First Week of February 2024.

The papers cover various topics shaping the next generation of language models, from model optimization and scaling to reasoning, benchmarking, and enhancing performance. Keeping up with novel LLM research across these domains will help guide continued progress toward models that are more capable, robust, and aligned with human values.

Table of Contents:

  1. LLM Progress & Benchmarking

  2. LLM Reasoning

  3. LLM Training & Evaluation

  4. Transformers & Attention-Based Models



1. LLM Progress & Benchmarking

1.1. OLMo: Accelerating the Science of Language Models

Language models (LMs) have become ubiquitous in both NLP research and commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. 

Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, this technical report details the first release of OLMo, a state-of-the-art, truly Open Language Model, and its framework to build and study the science of language modeling. 

Unlike most prior efforts that have only released model weights and inference code, we release OLMo and the whole framework, including training data and training and evaluation code. We hope this release will empower and strengthen the open research community and inspire a new wave of innovation.

1.2. Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

Language models have become a critical technology for tackling a wide range of natural language processing tasks, yet many details about how the best-performing language models were developed are not reported. 

In particular, information about their pretraining corpora is seldom discussed: commercial language models rarely provide any information about their data; even open models rarely release datasets they are trained on, or an exact recipe to reproduce them. 

As a result, it is challenging to conduct certain threads of language modeling research, such as understanding how training data impacts model capabilities and shapes their limitations. 

To facilitate open research on language model pretraining, we release Dolma, a three-trillion-token English corpus, built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. In addition, we open-source our data curation toolkit to enable further experimentation and reproduction of our work. 

In this report, we document Dolma, including its design principles, details about its construction, and a summary of its contents. We interleave this report with analyses and experimental results from training language models on intermediate states of Dolma to share what we have learned about important data curation practices, including the role of content or quality filters, deduplication, and multi-source mixing. 

Dolma has been used to train OLMo, a state-of-the-art open language model and framework designed to build and study the science of language modeling.
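The curation practices mentioned above (quality filtering, deduplication, and multi-source mixing) can be illustrated with a toy pipeline. This is a minimal sketch with invented heuristics, not Dolma's actual toolkit:

```python
import hashlib

def quality_filter(doc: str) -> bool:
    """Keep documents that look like prose (toy heuristic, not Dolma's)."""
    return len(doc.split()) >= 5

def dedupe(docs):
    """Drop exact duplicates via content hashing."""
    seen, out = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out

def mix(sources, weights):
    """Combine sources according to integer per-source document budgets."""
    out = []
    for src, w in zip(sources, weights):
        out.extend(src[:w])
    return out

web = ["the quick brown fox jumps over", "the quick brown fox jumps over", "spam"]
papers = ["we present a study of language model pretraining data"]
clean_web = dedupe([d for d in web if quality_filter(d)])
corpus = mix([clean_web, papers], weights=[2, 1])
```

The real pipeline operates at trillion-token scale with fuzzy deduplication and learned filters; this only shows the order of operations.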

1.3. CroissantLLM: A Truly Bilingual French-English Language Model

We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. 

To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. 

To assess performance outside of English, we craft a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks, covering various orthogonal aspects of model performance in the French language. 

Additionally, rooted in transparency and to foster further Large Language Model research, we release codebases, and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models, and strong translation models. 

We evaluate our model through the FMTI framework and validate 81% of the transparency criteria, far beyond the scores of even most open initiatives. This work enriches the NLP landscape, breaking away from previous English-centric work to strengthen our understanding of multilingualism in language models.

1.4. Efficient Exploration for LLMs

We present evidence of substantial benefit from efficient exploration in gathering human feedback to improve large language models. In our experiments, an agent sequentially generates queries while fitting a reward model to the feedback received. 

Our best-performing agent generates queries using double Thompson sampling, with uncertainty represented by an epistemic neural network. Our results demonstrate that efficient exploration enables high levels of performance with far fewer queries. Further, both uncertainty estimation and the choice of exploration scheme play critical roles.
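The query-selection rule described above can be sketched in miniature. In this hypothetical toy, the epistemic reward model is approximated by an ensemble of point estimates, and double Thompson sampling draws two independent posterior samples, each of which picks its own favourite response:

```python
import random

# Toy "epistemic" reward model: an ensemble of K independently drawn
# score tables over a fixed candidate set (stand-in for an epistemic
# neural network; all names here are illustrative assumptions).
random.seed(0)
K = 10
candidates = ["resp_a", "resp_b", "resp_c", "resp_d"]
ensemble = [{c: random.gauss(0.0, 1.0) for c in candidates} for _ in range(K)]

def double_thompson_sample(ensemble, candidates):
    """Pick a pair of responses to show the human annotator.

    Each of the two picks comes from an independent sample of the
    approximate posterior over reward functions; disagreement between
    samples reflects epistemic uncertainty, making the comparison
    informative.
    """
    first = max(candidates, key=lambda c: random.choice(ensemble)[c])
    member = random.choice(ensemble)
    # Constrain the second pick to differ so the comparison is non-trivial.
    second = max((c for c in candidates if c != first), key=lambda c: member[c])
    return first, second

pair = double_thompson_sample(ensemble, candidates)
```

In the paper's setting the ensemble would be replaced by an epistemic neural network and the feedback would update the reward model after each query.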

1.5. SliceGPT: Compress Large Language Models by Deleting Rows and Columns

Large language models have become the cornerstone of natural language processing, but their use comes with substantial costs in terms of computing and memory resources. Sparsification provides a solution to alleviate these resource constraints, and recent works have shown that trained models can be sparsified post-hoc. 

Existing sparsification techniques face challenges as they need additional data structures and offer constrained speedup with current hardware. In this paper we present SliceGPT, a new post-training sparsification scheme that replaces each weight matrix with a smaller (dense) matrix, reducing the embedding dimension of the network. 

Through extensive experimentation, we show that SliceGPT can remove up to 25% of the model parameters (including embeddings) for LLAMA2-70B, OPT 66B, and Phi-2 models while maintaining 99%, 99% and 90% zero-shot task performance of the dense model respectively. 

Our sliced models run on fewer GPUs and run faster without any additional code optimization: on 24GB consumer GPUs we reduce the total compute for inference on LLAMA2-70B to 64% of that of the dense model; on 40GB A100 GPUs we reduce it to 66%. 

We offer a new insight, computational invariance in transformer networks, which enables SliceGPT and we hope it will inspire and enable future avenues to reduce memory and computation demands for pre-trained models. 
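The core idea of replacing each weight matrix with a smaller dense one can be sketched with a rotate-then-truncate step. This is an illustrative simplification under assumed shapes, not the paper's exact procedure:

```python
import numpy as np

# Toy sketch: rotate a weight matrix into the principal basis of its
# typical input activations, then keep only the top directions, leaving
# a smaller dense matrix. Shapes and variable names are assumptions.
rng = np.random.default_rng(0)
d, d_small = 16, 12            # original and sliced embedding dimensions
W = rng.normal(size=(d, d))    # a weight matrix acting on d-dim activations
X = rng.normal(size=(256, d))  # sample activations used to pick directions

# Orthonormal basis from PCA of the activations.
_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
Q = Vt[:d_small].T             # d x d_small: keep the top d_small directions

# Slice: the rotated weight now acts on a d_small-dimensional embedding.
W_sliced = Q.T @ W @ Q         # d_small x d_small dense matrix
```

Because the truncated matrix stays dense, it runs on standard hardware with no sparse data structures, which is the practical advantage the abstract highlights.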

1.6. OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

Recent studies have advocated for fully open foundation models to promote transparency and open science. As an initial step, the Open Whisper-style Speech Model (OWSM) reproduced OpenAI's Whisper using publicly available data and open-source toolkits. 

Following Whisper, the previous OWSM v1 through v3 models were based on the Transformer architecture, which might lead to inferior performance compared to other state-of-the-art speech encoders. In this work, we aim to improve the performance and efficiency of OWSM without extra training data. 

We present E-Branchformer-based OWSM v3.1 models at two scales, i.e., 100M and 1B. The 1B model is the largest E-Branchformer-based speech model that has been made publicly available. 

It outperforms the previous OWSM v3 in a vast majority of evaluation benchmarks while demonstrating up to 25% faster inference speed. We publicly release the data preparation scripts, pre-trained models, and training logs.

1.7. Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such data requires an abundance of both compute and data, which grows with the size of the model being trained. 

This is infeasible both because of the large compute costs and duration associated with pre-training, and the impending scarcity of high-quality data on the web. In this work, we propose Web Rephrase Augmented Pre-training (WRAP), which uses an off-the-shelf instruction-tuned model prompted to paraphrase documents on the web in specific styles such as "like Wikipedia" or in "question-answer format" to jointly pre-train LLMs on real and synthetic rephrases. 

First, we show that using WRAP on the C4 dataset, which is naturally noisy, speeds up pre-training by ~3x. At the same pre-training compute budget, it improves perplexity by more than 10% on average across different subsets of the Pile and improves zero-shot question-answering accuracy across 13 tasks by more than 2%. 

Second, we investigate the impact of the re-phrasing style on the performance of the model, offering insights into how the composition of the training data can impact the performance of LLMs in OOD settings. 

Our gains are attributed to the fact that re-phrased synthetic data has higher utility than just real data because it (i) incorporates style diversity that closely reflects downstream evaluation style, and (ii) has higher 'quality' than web-scraped data.
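The rephrase-then-mix recipe described above can be sketched as follows. The prompt templates and the stub generator are invented for illustration; WRAP's actual prompts and mixing ratios are specified in the paper:

```python
# Hypothetical sketch of WRAP-style data generation: an off-the-shelf
# instruction-tuned model (represented by any text-completion callable)
# rephrases each web document in a chosen style; real and synthetic text
# are then mixed for pre-training.
STYLES = {
    "wikipedia": "Rewrite the following text in the clear, neutral style of Wikipedia:\n\n{doc}",
    "qa": "Convert the following text into a question-and-answer format:\n\n{doc}",
}

def rephrase(doc: str, style: str, generate) -> str:
    """`generate` is any text-completion callable (e.g. an API client)."""
    return generate(STYLES[style].format(doc=doc))

def wrap_mix(real_docs, generate, style="wikipedia"):
    """Pair each real document with its synthetic rephrase."""
    mixed = []
    for doc in real_docs:
        mixed.append(doc)                             # real web text
        mixed.append(rephrase(doc, style, generate))  # synthetic rephrase
    return mixed

# Usage with a trivial stand-in generator:
corpus = wrap_mix(["some noisy web text"], generate=lambda p: "[rephrased] " + p)
```

Training on both halves of the mix is what lets the model see the style diversity of the rephrases without discarding the real-data distribution.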

1.8. Weaver: Foundation Models for Creative Writing

This work introduces Weaver, our first family of large language models (LLMs) dedicated to content creation. Weaver is pre-trained on a carefully selected corpus that focuses on improving the writing capabilities of large language models. 

We then fine-tune Weaver for creative and professional writing purposes and align it to the preference of professional writers using a suite of novel methods for instruction data synthesis and LLM alignment, making it able to produce more human-like texts and follow more diverse instructions for content creation. 

The Weaver family consists of four model sizes, Weaver Mini (1.8B), Weaver Base (6B), Weaver Pro (14B), and Weaver Ultra (34B), suitable for different applications; queries can be dynamically dispatched to them by a routing agent according to query complexity to balance response quality and computation cost. 

Evaluation on a carefully curated benchmark for assessing the writing capabilities of LLMs shows that Weaver models of all sizes outperform generalist LLMs several times their size. Notably, our most capable Weaver Ultra model surpasses GPT-4, a state-of-the-art generalist LLM, in various writing scenarios, demonstrating the advantage of training specialized LLMs for writing purposes. 

Moreover, Weaver natively supports retrieval-augmented generation (RAG) and function calling (tool usage). We present various use cases of these abilities for improving AI-assisted writing systems, including integration of external knowledge bases, tools, or APIs, and providing personalized writing assistance. Furthermore, we discuss and summarize a guideline and best practices for pre-training and fine-tuning domain-specific LLMs.
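The complexity-based dispatch idea mentioned above can be sketched as a threshold router. The scorer and thresholds here are invented for illustration; the paper does not publish its routing rule:

```python
# Hypothetical routing agent: send each query to the smallest Weaver
# model judged sufficient for its complexity score in [0, 1].
MODELS = [  # (name, size, minimum complexity that warrants this model)
    ("Weaver Mini", "1.8B", 0.0),
    ("Weaver Base", "6B", 0.25),
    ("Weaver Pro", "14B", 0.5),
    ("Weaver Ultra", "34B", 0.75),
]

def route(query: str, score) -> str:
    """Pick the largest model whose complexity threshold the query reaches."""
    c = score(query)
    chosen = MODELS[0][0]
    for name, _size, threshold in MODELS:
        if c >= threshold:
            chosen = name
    return chosen

# Usage with a toy length-based complexity score:
model = route("Draft a short tagline.", score=lambda q: min(len(q) / 200, 1.0))
```

A production router would estimate complexity with a learned classifier rather than query length, but the quality/cost trade-off it implements is the same.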

1.9. SymbolicAI: A framework for logic-based approaches combining generative models and solvers

We introduce SymbolicAI, a versatile and modular framework employing a logic-based approach to concept learning and flow management in generative processes. 

SymbolicAI enables the seamless integration of generative models with a diverse range of solvers by treating large language models (LLMs) as semantic parsers that execute tasks based on both natural and formal language instructions, thus bridging the gap between symbolic reasoning and generative AI. 

We leverage probabilistic programming principles to tackle complex tasks and utilize differentiable and classical programming paradigms with their respective strengths. The framework introduces a set of polymorphic, compositional, and self-referential operations for data stream manipulation, aligning LLM outputs with user objectives. 

As a result, we can transition between the capabilities of various foundation models endowed with zero- and few-shot learning capabilities and specialized, fine-tuned models or solvers proficient in addressing specific problems. 

In turn, the framework facilitates the creation and evaluation of explainable computational graphs. We conclude by introducing a quality measure and its empirical score for evaluating these computational graphs and propose a benchmark that compares various state-of-the-art LLMs across a set of complex workflows. 

We refer to the empirical score as the "Vector Embedding for Relational Trajectory Evaluation through Cross-similarity", or VERTEX score for short. The framework codebase and benchmark are linked below.

1.10. Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

Are n-gram language models still relevant in this era of neural large language models (LLMs)? Our answer is yes, and we show their values in both text analysis and improving neural LLMs. Yet this necessitates modernizing n-gram models in two aspects. 

First, we train them at the same data scale as neural LLMs: 1.4 trillion tokens. This is the largest n-gram model ever built. Second, existing n-gram models use small n, which hinders their performance; we instead allow n to be arbitrarily large, by introducing a new ∞-gram LM with backoff. 

Instead of pre-computing n-gram count tables (which would be very expensive), we develop an engine named infini-gram, powered by suffix arrays, that can compute ∞-gram (as well as n-gram with arbitrary n) probabilities with millisecond-level latency. 

The ∞-gram framework and infini-gram engine enable us to conduct many novel and interesting analyses of human-written and machine-generated text: we find that the ∞-gram LM has fairly high accuracy for next-token prediction (47%), and can complement neural LLMs to greatly reduce their language modeling perplexities. 

When analyzing machine-generated text, we also observe irregularities in the agreement level between machine-generated text and the ∞-gram LM with respect to suffix length, which indicates deficiencies in neural LLM pretraining and the positional embeddings of Transformers. We open-source our infini-gram engine in the hopes of enabling more studies on how to best use verbatim information retrieved from large text corpora.
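The suffix-array trick behind the engine can be shown in miniature. The key property is that all occurrences of any n-gram occupy a contiguous run of the sorted suffixes, so counts for arbitrary n come from two binary searches instead of a precomputed table. This toy version is an illustration of the principle, not the engine's implementation:

```python
from bisect import bisect_left, bisect_right

tokens = ["the", "cat", "sat", "on", "the", "cat", "mat"]

# Suffix array: suffix start positions, sorted by the suffix they begin.
sa = sorted(range(len(tokens)), key=lambda i: tokens[i:])

def ngram_count(ngram):
    """Count occurrences of `ngram` by binary search over sorted suffixes.

    Truncating the sorted suffixes to their first len(ngram) tokens keeps
    them sorted, so the matching run is delimited by bisect_left and
    bisect_right, with no count table ever precomputed.
    """
    n = len(ngram)
    prefixes = [tokens[i:i + n] for i in sa]
    return bisect_right(prefixes, ngram) - bisect_left(prefixes, ngram)
```

At trillion-token scale the suffix array is built once on disk and the binary searches touch only a handful of pages, which is what makes millisecond-level latency plausible.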

1.11. From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness, and Causality through Four Modalities

Multi-modal Large Language Models (MLLMs) have shown impressive abilities in generating reasonable responses concerning multi-modal content. 

However, there is still a wide gap between the performance of recent MLLM-based applications and the expectation of the broad public, even though the most powerful models, OpenAI's GPT-4 and Google's Gemini, have been deployed. 

This paper strives to enhance understanding of the gap through the lens of a qualitative study on the generalizability, trustworthiness, and causal reasoning capabilities of recent proprietary and open-source MLLMs across four modalities (text, code, image, and video), ultimately aiming to improve the transparency of MLLMs. 

We believe these properties are several representative factors that define the reliability of MLLMs, in supporting various downstream applications. 

To be specific, we evaluate the closed-source GPT-4 and Gemini and six open-source LLMs and MLLMs. Overall, we evaluate 230 manually designed cases, and the qualitative results are summarized into 12 scores (i.e., 4 modalities × 3 properties). 

In total, we uncover 14 empirical findings that are useful to understand the capabilities and limitations of both proprietary and open-source MLLMs, towards more reliable downstream multi-modal applications.

1.12. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

Retrieval-augmented language models can better adapt to changes in world state and incorporate long-tail knowledge. However, most existing methods retrieve only short contiguous chunks from a retrieval corpus, limiting holistic understanding of the overall document context. 

We introduce the novel approach of recursively embedding, clustering, and summarizing chunks of text, constructing a tree with differing levels of summarization from the bottom up. At inference time, our RAPTOR model retrieves from this tree, integrating information across lengthy documents at different levels of abstraction. 

Controlled experiments show that retrieval with recursive summaries offers significant improvements over traditional retrieval-augmented LMs on several tasks. 

On question-answering tasks that involve complex, multi-step reasoning, we show state-of-the-art results; for example, by coupling RAPTOR retrieval with the use of GPT-4, we can improve the best performance on the QuALITY benchmark by 20% in absolute accuracy.
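The bottom-up embed, cluster, and summarize loop described above can be sketched with stand-in components. The embedding, clustering, and summarization functions here are toy placeholders; the paper uses learned embeddings, soft clustering, and an LLM summarizer:

```python
# Illustrative sketch of RAPTOR-style tree construction, bottom up.
def build_raptor_tree(chunks, embed, cluster, summarize, max_levels=3):
    """Recursively cluster and summarize chunks.

    Returns a list of levels: level 0 holds the raw chunks, and each
    higher level holds summaries of clusters from the level below.
    At inference time, retrieval can draw from any level of the tree.
    """
    levels = [chunks]
    while len(levels[-1]) > 1 and len(levels) <= max_levels:
        vectors = [embed(text) for text in levels[-1]]
        groups = cluster(levels[-1], vectors)  # list of lists of texts
        summaries = [summarize(" ".join(g)) for g in groups]
        levels.append(summaries)
    return levels

# Usage with trivial stand-ins: pair up adjacent chunks and "summarize"
# by truncation, just to show the shape of the resulting tree.
chunks = ["c1", "c2", "c3", "c4"]
tree = build_raptor_tree(
    chunks,
    embed=lambda t: [len(t)],
    cluster=lambda texts, vecs: [texts[i:i + 2] for i in range(0, len(texts), 2)],
    summarize=lambda t: t[:20],
)
```

Retrieving from every level at once is what lets the model mix fine-grained evidence with document-level abstraction on multi-step questions.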


1.13. Weak-to-Strong Jailbreaking on Large Language Models

Although significant efforts have been dedicated to aligning large language models (LLMs), red-teaming reports suggest that these carefully aligned LLMs could still be jailbroken through adversarial prompts, tuning, or decoding. 

Upon examining the jailbreaking vulnerability of aligned LLMs, we observe that the decoding distributions of jailbroken and aligned models differ only in the initial generations. 
