<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[AI Decoded]]></title><description><![CDATA[Decoding complex AI papers, analyzing industry trends, and providing practical insights. ]]></description><link>https://www.aidecoded.ai</link><image><url>https://substackcdn.com/image/fetch/$s_!UQiT!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59cc5352-d010-428f-a4bc-c38c4d9c3d18_1280x1280.png</url><title>AI Decoded</title><link>https://www.aidecoded.ai</link></image><generator>Substack</generator><lastBuildDate>Sat, 25 Apr 2026 12:02:24 GMT</lastBuildDate><atom:link href="https://www.aidecoded.ai/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Robert Ray]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[robert269@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[robert269@substack.com]]></itunes:email><itunes:name><![CDATA[Robert Ray]]></itunes:name></itunes:owner><itunes:author><![CDATA[Robert Ray]]></itunes:author><googleplay:owner><![CDATA[robert269@substack.com]]></googleplay:owner><googleplay:email><![CDATA[robert269@substack.com]]></googleplay:email><googleplay:author><![CDATA[Robert Ray]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Attention Efficiency Newsletter Edition, Part 1]]></title><description><![CDATA[Native Sparse Attention]]></description><link>https://www.aidecoded.ai/p/the-attention-efficiency-newsletter</link><guid isPermaLink="false">https://www.aidecoded.ai/p/the-attention-efficiency-newsletter</guid><dc:creator><![CDATA[Robert Ray]]></dc:creator><pubDate>Thu, 27 Feb 2025 04:14:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!SETr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdcb6c97-e900-4113-a279-1ca7d83262c0_1856x2464.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SETr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdcb6c97-e900-4113-a279-1ca7d83262c0_1856x2464.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SETr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdcb6c97-e900-4113-a279-1ca7d83262c0_1856x2464.heic 424w, https://substackcdn.com/image/fetch/$s_!SETr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdcb6c97-e900-4113-a279-1ca7d83262c0_1856x2464.heic 848w, https://substackcdn.com/image/fetch/$s_!SETr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdcb6c97-e900-4113-a279-1ca7d83262c0_1856x2464.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!SETr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdcb6c97-e900-4113-a279-1ca7d83262c0_1856x2464.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SETr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdcb6c97-e900-4113-a279-1ca7d83262c0_1856x2464.heic" width="1456" height="1933" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fdcb6c97-e900-4113-a279-1ca7d83262c0_1856x2464.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1933,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:922286,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.aidecoded.ai/i/158010272?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdcb6c97-e900-4113-a279-1ca7d83262c0_1856x2464.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SETr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdcb6c97-e900-4113-a279-1ca7d83262c0_1856x2464.heic 424w, https://substackcdn.com/image/fetch/$s_!SETr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdcb6c97-e900-4113-a279-1ca7d83262c0_1856x2464.heic 848w, https://substackcdn.com/image/fetch/$s_!SETr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdcb6c97-e900-4113-a279-1ca7d83262c0_1856x2464.heic 1272w, https://substackcdn.com/image/fetch/$s_!SETr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdcb6c97-e900-4113-a279-1ca7d83262c0_1856x2464.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>Introduction: Making Transformers Scale to Million-Token Contexts</h2><p>I'm excited to dive into one of the most promising approaches to making large language models handle extremely long contexts efficiently. When it comes to transformer design innovations, 2025 has already seen rapid progress on various fronts, but in today's edition, we'll examine <a href="https://arxiv.org/abs/2502.11089">Native Sparse Attention (NSA</a>), a breakthrough technique that rethinks how attention mechanisms work from the ground up.</p><p>As many of you running AI infrastructure might know too well, scaling transformer-based LLMs to handle long contexts isn't just about throwing more compute at the problem. It requires actually rethinking fundamental architectural choices. With NSA, DeepSeek presents a clever solution that manages to maintain model quality while dramatically reducing computation needs for sequences up to 1 million tokens long. Let's explore how it works and what it means for your AI projects.</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.aidecoded.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AI Decoded! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><h2>The Fundamental Challenge: Attention Is Expensive</h2><p>The core challenge NSA addresses is fundamental to large language models: the computational inefficiency of standard attention mechanisms when processing long contexts. When a transformer processes a long document (64K tokens or more), the vanilla attention mechanism becomes prohibitively expensive for two reasons:</p><ol><li><p><strong>Quadratic complexity</strong>: The computation scales as O(n&#178;) with sequence length, meaning doubling the context length increases computation by 4x</p></li><li><p><strong>Memory bottleneck</strong>: During autoregressive generation, the KV (key-value) cache consumes massive amounts of memory</p></li></ol><p>According to the research, attention computation accounts for 70-80% of total latency when decoding with 64K context length - a massive bottleneck that limits practical applications.</p><h3>The Two-Phase Problem</h3><p>A critical insight about transformer inference is that it actually involves two distinct phases with different computational characteristics:</p><ol><li><p><strong>Prefilling phase</strong>: When you first input a prompt, the model processes all input tokens at once in a batched, parallelized computation. This phase is compute-bound and matrix-multiplication heavy.</p></li><li><p><strong>Decoding phase</strong>: The model generates each new token one by one in an autoregressive manner. This phase is memory-bound, dominated by KV cache access.</p></li></ol><p>Current sparse attention methods typically optimize only one of these phases. 
<h3>The Two-Phase Problem</h3><p>A critical insight about transformer inference is that it involves two distinct phases with different computational characteristics:</p><ol><li><p><strong>Prefilling phase</strong>: When you first input a prompt, the model processes all input tokens at once in a batched, parallelized computation. This phase is compute-bound and matrix-multiplication heavy.</p></li><li><p><strong>Decoding phase</strong>: The model generates each new token one by one in an autoregressive manner. This phase is memory-bound, dominated by KV cache access.</p></li></ol><p>Current sparse attention methods typically optimize only one of these phases. For example:</p><ul><li><p><a href="https://arxiv.org/abs/2306.14048">H2O</a> optimizes decoding but requires computationally intensive preprocessing during prefilling</p></li><li><p><a href="https://arxiv.org/abs/2407.02490">MInference</a> optimizes prefilling with dynamic sparse attention but doesn't address decoding bottlenecks</p></li></ul><p>This phase specialization creates a fundamental limitation: different real-world workloads have dramatically different phase dominance patterns. For a book summarization task (100K input tokens, 1K output), optimizing prefilling brings huge benefits. But for long chain-of-thought reasoning (1K input, 50K output), decoding optimization is what matters.</p><h2>The NSA Approach: Three-Branch Hierarchical Attention</h2><p>NSA introduces a clever three-branch approach to attention that I'd call "hierarchical sparse attention." Instead of making every query token attend to every key-value pair, NSA uses a mix of strategies to achieve efficiency without sacrificing quality:</p><h3>1. Token Compression: Creating Information-Dense Summaries</h3><p>The first branch takes contiguous blocks of tokens and distills them into more compact representations. Here's how it works:</p><ul><li><p>NSA divides the sequence into blocks (typically 32 tokens per block, with a stride of 16)</p></li><li><p>Each block is processed by a learnable MLP with position encoding</p></li><li><p>This MLP condenses the entire block into a single representative key-value pair</p></li></ul><p>Think of this like creating paragraph summaries of a long document: instead of examining every word, you can first look at the summaries to get the big picture. The result is a compressed representation around 1/16th the size of the original sequence.</p><h3>2. Token Selection: Finding the Most Important Blocks</h3><p>The second branch identifies which original blocks of tokens are most relevant to the current query:</p><ul><li><p>Using attention scores from the compression branch, NSA maps importance scores to blocks of the original sequence</p></li><li><p>It selects the top-n most important blocks (typically n=16)</p></li><li><p>The query then attends to all tokens within these selected blocks</p></li></ul><p>The clever part is that NSA reuses computation from the compression branch to guide this selection, minimizing overhead. This approach is both computationally efficient and hardware-friendly, as it maintains the contiguous memory access patterns modern GPUs are optimized for. A rough sketch of these two branches follows.</p>
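<p>Here's what these two branches might look like in PyTorch. To be clear, this is my illustrative reconstruction from the paper's description, not DeepSeek's code - the MLP design, head dimension, and the omission of causal masking and block position encoding are all simplifying assumptions:</p><pre><code># Illustrative reconstruction of NSA's compression and selection branches
# for one query position. Block size (32), stride (16), and top-n (16)
# follow the values quoted above; everything else is assumed for brevity.
import torch
import torch.nn.functional as F

d = 64                          # per-head dimension (assumed)
block, stride, top_n = 32, 16, 16

# Compression branch: a learnable MLP squeezes each block of keys
# (block x d values) into a single representative key of dimension d.
compress_mlp = torch.nn.Sequential(
    torch.nn.Linear(block * d, 4 * d),
    torch.nn.GELU(),
    torch.nn.Linear(4 * d, d),
)

def compress(keys: torch.Tensor) -> torch.Tensor:
    """Slide over the sequence and compress each overlapping block."""
    starts = range(0, keys.shape[0] - block + 1, stride)
    stacked = torch.stack([keys[s:s + block].reshape(-1) for s in starts])
    return compress_mlp(stacked)            # (num_blocks, d)

def select_blocks(q: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
    """Reuse compression-branch attention scores to pick the top-n blocks."""
    compressed = compress(keys)             # (num_blocks, d)
    scores = F.softmax(compressed @ q / d ** 0.5, dim=0)
    return scores.topk(min(top_n, scores.shape[0])).indices

q, keys = torch.randn(d), torch.randn(1024, d)
chosen = select_blocks(q, keys)  # the query then attends to every token in
print(chosen)                    # these blocks, alongside the other branches
</code></pre>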
<h3>3. Sliding Window: Preserving Local Context</h3><p>The third branch addresses an insight about how language models learn: they naturally find it easier to learn relationships between nearby tokens, and this can dominate the learning process at the expense of long-range dependencies.</p><p>NSA solves this with a dedicated sliding window branch that:</p><ul><li><p>Maintains a fixed window of the most recent tokens (typically 512)</p></li><li><p>Computes attention on this local context separately</p></li><li><p>Uses parameters independent from the other branches</p></li></ul><p>This separation prevents local attention from "shortcutting" the learning of long-range patterns, forcing the model to develop specialized capabilities in each branch.</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/093940ac-6b92-4406-b8e8-b9fb9677bf39_1814x520.heic" alt="Diagram of NSA's three attention branches"><figcaption class="image-caption">Figure 1: The three-branch architecture of NSA, showing how each query utilizes compression tokens for global context, selected blocks for specific details, and a sliding window for local information.</figcaption></figure><h3>Putting It All Together: The Gating Mechanism</h3><p>The outputs from these three branches are combined through a learnable gating mechanism. For each query token, the model learns to dynamically adjust how much it relies on each type of attention based on the current context. This weighted combination gives NSA extraordinary flexibility, allowing it to adapt to different types of content and reasoning tasks.</p>
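<p>In code, the combination step might look something like this toy sketch - the sigmoid-gate-per-branch design follows the paper's high-level description, while the layer shapes are my assumptions:</p><pre><code># A toy sketch of the gated combination, assuming each branch has already
# produced an output vector for the current query token.
import torch

d = 64
gate_proj = torch.nn.Linear(d, 3)  # one learned gate score per branch

def nsa_output(q, out_compress, out_select, out_window):
    """Blend the three branch outputs with query-dependent gates."""
    gates = torch.sigmoid(gate_proj(q))                             # shape (3,)
    branches = torch.stack([out_compress, out_select, out_window])  # (3, d)
    return (gates.unsqueeze(-1) * branches).sum(dim=0)              # gated sum

q = torch.randn(d)
out = nsa_output(q, torch.randn(d), torch.randn(d), torch.randn(d))
print(out.shape)  # torch.Size([64])
</code></pre>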
<h2>Kernel Design: Where Theory Meets Practice</h2><p>The other thing that truly sets NSA apart is its focus on hardware-aligned implementation. Many sparse attention approaches look promising in theory but fail to deliver actual speedups in practice. NSA's kernel design directly addresses this gap.</p><p>The key innovation is changing how query-key-value operations are arranged in memory:</p><ol><li><p><strong>Group-Centric Data Loading</strong>: Instead of loading contiguous query blocks (as FlashAttention does), NSA loads all query heads for a single position at once. Since queries at the same position (but different heads) in a GQA group share the same key-value cache, this minimizes redundant memory transfers.</p></li><li><p><strong>Shared KV Fetching</strong>: Once all query heads for a position are loaded, NSA loads only the sparse key-value blocks needed by those queries, ensuring each block is loaded just once.</p></li><li><p><strong>Efficient Loop Scheduling</strong>: NSA uses Triton's grid scheduler to parallelize across different query positions, maximizing GPU utilization.</p></li></ol><p>This implementation achieves "near-optimal arithmetic intensity," making efficient use of both computation resources and memory bandwidth. For real-world workloads, this translates to dramatic speedups:</p><ul><li><p><em><strong>9.0x</strong></em> forward pass speedup at 64K context length</p></li><li><p><em><strong>6.0x</strong></em> backward pass speedup at 64K context length</p></li><li><p><em><strong>11.6x</strong></em> decoding speedup at 64K context length</p></li></ul><p>What's particularly impressive is that these speedups grow with longer sequences, with measurements showing even more dramatic gains at 1M-token contexts.</p><h2>Implementation Guide for Amazon P5e (H200) and Trn2 Clusters</h2><p><strong>Warning: This section is more technically dense, so if you just want examples of when NSA might be useful, skip ahead.</strong></p><p>For organizations looking to implement NSA on their AI infrastructure, here's a practical roadmap for <a href="https://aws.amazon.com/ec2/instance-types/p5/">AWS P5e instances</a> (with NVIDIA H200 GPUs) or <a href="https://aws.amazon.com/ec2/instance-types/trn2/">Amazon Trn2 instances</a> (with AWS Trainium2 chips):</p><h3>Step 1: Framework Selection</h3><p>Start with an existing transformer implementation framework that supports custom attention mechanisms:</p><ul><li><p><strong>For H200 clusters</strong>: Hugging Face Transformers with custom attention modules, or PyTorch FSDP</p></li><li><p><strong>For Trainium2</strong>: AWS Neuron SDK with custom operators</p></li></ul><h3>Step 2: NSA Architecture Implementation</h3><p>Implement the three core components of NSA (a configuration sketch follows this list):</p><ol><li><p><strong>Token Compression Module</strong>:</p><ol><li><p>Implement the block compression function as a small MLP</p></li><li><p>Configure block size (32 tokens) and stride (16 tokens)</p></li><li><p>Set up the compression attention path</p></li></ol></li><li><p><strong>Token Selection Module</strong>:</p><ol><li><p>Implement the block importance score calculation</p></li><li><p>Create the top-n selection mechanism (n=16)</p></li><li><p>Ensure blocks are properly aligned for optimal memory access</p></li></ol></li><li><p><strong>Sliding Window Module</strong>:</p><ol><li><p>Implement the separate sliding window attention (window size=512)</p></li><li><p>Configure the gating mechanism to combine all three branches</p></li></ol></li></ol>
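<p>As a starting point, it helps to pin the hyperparameters down in one place. This hypothetical config object is mine - the field names are invented, and only the default values come from the settings quoted earlier:</p><pre><code># Hypothetical configuration gathering the NSA hyperparameters named in this
# guide. Field names are ours; defaults follow the values quoted above.
from dataclasses import dataclass

@dataclass
class NSAConfig:
    compress_block: int = 32   # tokens per compression block
    compress_stride: int = 16  # stride between overlapping blocks
    select_top_n: int = 16     # blocks selected per query
    select_block: int = 32     # tokens per selected block (assumed here to
                               # match the compression block size)
    window: int = 512          # sliding-window length
    num_branches: int = 3      # compression + selection + sliding window

cfg = NSAConfig()
print(cfg)
</code></pre>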
<h3>Step 3: Optimized Kernel Implementation</h3><p>This is the most challenging part, but it's crucial for real-world performance:</p><p><strong>For H200 clusters:</strong></p><ul><li><p>Use Triton to implement custom GPU kernels that optimize the block-wise attention patterns</p></li><li><p>Utilize the H200's 4th-generation Tensor Cores, which significantly accelerate matrix multiplications</p></li><li><p>Take advantage of the high-bandwidth HBM3e memory for memory-bound operations</p></li><li><p>Implement efficient kernel scheduling to maximize utilization of the GPUs' streaming multiprocessors</p></li></ul><p><strong>For Trainium2:</strong></p><ul><li><p>Implement custom operators using the AWS Neuron SDK</p></li><li><p>Optimize for Trainium's specialized matrix multiplication units</p></li><li><p>Use the chip's on-chip memory hierarchy for efficient attention computation</p></li><li><p>Implement efficient pipeline parallelism across multiple Trainium chips</p></li></ul><h3>Step 4: Distributed Training Configuration</h3><p>Configure your infrastructure for efficient distributed training:</p><ul><li><p>Tensor parallelism across multiple accelerators for model sharding</p></li><li><p>Pipeline parallelism for very large models</p></li><li><p>Efficient attention algorithm scheduling to maximize throughput</p></li></ul><h3>Step 5: Progressive Optimization</h3><p>For both platforms, follow an incremental approach:</p><ol><li><p>First implement a functional version that correctly implements the algorithm</p></li><li><p>Profile to identify bottlenecks</p></li><li><p>Optimize the most critical kernels first</p></li><li><p>Consider hybrid approaches during the transition (<em>like NSA's suggestion of using full attention in the final layers</em>)</p></li></ol><h2>Real-World Applications: Three Compelling Use Cases</h2><p>Let's examine three practical applications where NSA's capabilities would provide substantial benefits over traditional transformers:</p><h3>Use Case 1: Enterprise Document Search and Analysis Platform</h3><p><strong>End Goal</strong>: A comprehensive enterprise platform that enables employees to search across, analyze, and extract insights from the company's entire document repository - including contracts, technical documentation, research papers, meeting transcripts, and historical reports. The system would allow users to ask complex questions that require understanding connections between information scattered across multiple documents or sections written months or years apart.</p><p>For example, a user could ask: "<em>What technical approaches did we consider for solving the cooling problem in our 2020 product line, and how did those compare to the solutions we implemented in our 2023 models?</em>" The model would need to retrieve, understand, and synthesize information from documents spanning years.</p><p><strong>Training Data:</strong></p><p>The training data would consist of:</p><ul><li><p>Millions of internal company documents (contracts, reports, specifications)</p></li><li><p>Technical documentation spanning multiple product generations</p></li><li><p>Meeting transcripts and research notes</p></li><li><p>Industry publications and technical standards relevant to the company's field</p></li><li><p>Email threads and project management documentation</p></li></ul><p>Each training example would typically be 20K-50K tokens long, combining multiple related documents into a coherent context. Supervised fine-tuning would include queries paired with human-expert answers that demonstrate effective cross-document reasoning.</p><p><strong>Benefits of NSA for this application</strong>:</p><ul><li><p><strong>Cost Efficiency</strong>: Training on 50K-token contexts with full attention would be prohibitively expensive. NSA reduces training costs by 6-9x.</p></li><li><p><strong>Real-Time Inference:</strong> Employee queries need responses in seconds, not minutes.
NSA's faster decoding enables practical real-time use with full context.</p></li><li><p><strong>Enhanced Accuracy</strong>: NSA's architecture encourages the model to identify and focus on the most important sections across documents, improving retrieval quality.</p></li></ul><h3>Use Case 2: Clinical Decision Support System</h3><p><strong>End Goal</strong>: A specialized medical AI assistant that helps healthcare providers analyze patient history, current symptoms, lab results, and relevant medical literature to suggest potential diagnoses and treatment options. The system would ingest a patient's complete medical history (potentially spanning decades), current presenting information, and relevant medical knowledge to provide evidence-based recommendations to physicians.</p><p>The model could analyze a 30-year patient history alongside recent test results and produce an analysis like: "<em>Based on the patient's history of recurrent infections at ages 5, 12, and 28, combined with current elevated inflammatory markers and the specific pattern of symptoms, consider investigating for CVID (Common Variable Immunodeficiency), which was not previously diagnosed but matches 85% of the presentation pattern.</em>"</p><p><strong>Training Data</strong></p><p>The training data would include:</p><ul><li><p>De-identified patient electronic health records (EHRs) with full longitudinal histories</p></li><li><p>Medical literature, textbooks, and peer-reviewed research papers</p></li><li><p>Clinical guidelines and standard treatment protocols</p></li><li><p>Case studies of rare conditions and their varying presentations</p></li><li><p>Medication information including interactions, contraindications, and effectiveness studies</p></li><li><p>Medical imaging reports and laboratory test interpretations</p></li></ul><p>Typical training examples would range from 16K to 64K tokens, including full patient histories alongside relevant medical literature sections. Fine-tuning would involve physician-annotated cases with proper diagnostic reasoning and treatment recommendations.</p><p><strong>Benefits of NSA for this application</strong>:</p><ul><li><p><strong>Comprehensive Analysis</strong>: Unlike systems that truncate medical histories, NSA can process complete patient records spanning decades.</p></li><li><p><strong>Pattern Recognition</strong>: The hierarchical attention mechanism helps identify connections between symptoms or events separated by years, potentially revealing patterns that conventional models might miss.</p></li><li><p><strong>Explainability</strong>: The block selection mechanism provides natural attention visualization, showing physicians which parts of the record influenced a recommendation - critical for building trust and meeting regulatory requirements.</p></li></ul><h3>Use Case 3: Code Intelligence Platform for Software Development</h3><p><strong>End Goal</strong>: A comprehensive code intelligence platform that understands entire codebases (not just individual files) to assist developers with complex tasks like architecture recommendations, bug detection, security vulnerability identification, and automated refactoring.
The system would analyze entire repositories including source code, configuration files, documentation, and version control history to provide contextually aware development assistance.</p><p>For example, a developer could ask: "<em>Identify potential performance bottlenecks in our payment processing pipeline and suggest refactoring approaches that maintain compatibility with our existing API contracts.</em>" The model would analyze the entire codebase to provide holistic recommendations.</p><p><strong>Training Data</strong></p><p>The training data would consist of:</p><ul><li><p>Complete open-source repositories across various programming languages and domains</p></li><li><p>Internal codebases (for company-specific deployments)</p></li><li><p>Technical documentation, comments, and architectural design documents</p></li><li><p>Code review discussions and issue tracker histories</p></li><li><p>Test suites and their coverage reports</p></li><li><p>API specifications and dependency information</p></li><li><p>Commit histories showing code evolution</p></li></ul><p>Training examples would typically span 32K-128K tokens, containing entire modules or service implementations rather than isolated files. Fine-tuning would include developer-solved problems showing proper reasoning about complex codebase interactions.</p><p><strong>Benefits of NSA for this application</strong>:</p><ul><li><p><strong>Repository-Level Understanding</strong>: NSA enables processing entire repositories (100K+ lines of code) as a single context, rather than isolated files.</p></li><li><p><strong>Reference Resolution</strong>: The multiple attention branches maintain awareness of dependencies, inheritance hierarchies, and function calls across files.</p></li><li><p><strong>Memory Efficiency</strong>: Code intelligence tools often need to run locally on developer machines. NSA's reduced memory footprint makes this feasible even for large projects.</p></li></ul><h2>Conclusion: The Path Forward for Efficient Attention</h2><p>NSA represents a significant advancement in making transformers more efficient for long-context processing. By combining hierarchical sparse attention with hardware-aligned implementation, it achieves dramatic speedups without sacrificing model quality - and in some cases, even improving it.</p><p>What makes NSA particularly compelling is its practicality. Unlike many research approaches that remain theoretical, NSA demonstrates real-world speedups and provides a clear implementation path. Its native trainability means organizations can realize benefits throughout the entire model lifecycle, from pretraining through deployment.</p><p>In my next two newsletters, I'll explore complementary approaches that further push the boundaries of efficient attention. We'll look at:</p><ol><li><p><a href="https://arxiv.org/abs/2502.13189">Mixture of Block Attention (MoBA)</a>, which applies MoE principles to attention mechanisms, enabling dynamic routing of queries to only the most relevant blocks.</p></li><li><p><a href="https://arxiv.org/abs/2502.15589">LightThinker</a>, a method that enables models to dynamically compress their intermediate thoughts during reasoning, significantly reducing the token overhead in chain-of-thought applications.</p></li></ol><p>Until then, if you're working with long-context applications, NSA offers a compelling path to improved efficiency today. 
The combination of its hierarchical attention approach and hardware-aligned implementation makes it an excellent candidate for organizations looking to scale their language models to million-token contexts without proportional increases in compute costs.</p>]]></content:encoded></item><item><title><![CDATA[How Did We Get Here?: Part 2]]></title><description><![CDATA[Generative AI from September 2022 to Today]]></description><link>https://www.aidecoded.ai/p/how-did-we-get-here-part-2</link><guid isPermaLink="false">https://www.aidecoded.ai/p/how-did-we-get-here-part-2</guid><dc:creator><![CDATA[Robert Ray]]></dc:creator><pubDate>Thu, 06 Feb 2025 03:03:34 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0b9dd457-b564-478e-8c22-8077fafba3eb_1280x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>The Zeitgeist Begins</h1><p>In our journey through the history of generative AI, we&#8217;ve covered quite a bit of ground. Our first installment traced the evolution from the groundbreaking Transformer architecture through the emergence of ChatGPT (powered by OpenAI&#8217;s proprietary GPT-3.5), exploring how architectural innovations revolutionized natural language processing. Our second deep dive examined Reinforcement Learning from Human Feedback (RLHF), uncovering how researchers taught language models to be helpful and aligned with human values.</p><p>Today, we'll complete this historical arc by exploring four transformative developments that have shaped the AI landscape since late 2022: the democratization of AI through open-source models, the emergence of sophisticated multimodal understanding, architectural innovations that are reimagining how AI systems process information, and the rise of autonomous AI agents. These advances represent shifts in who can develop AI, what AI can perceive, how AI thinks, and what AI can do. We&#8217;ll then conclude by briefly expanding on the ideas around AI alignment and how that might shape society in the coming years as AI continues to assume more and more responsibility in our lives.</p>
<p>Let&#8217;s dive in.</p><h2>The Open-Source AI Revolution: Democratizing the Future</h2><p>In early 2023, Meta released LLaMA, a suite of foundation models ranging from 7B to 65B parameters, trained on publicly available data. This release would catalyze one of the most significant shifts in AI development since the introduction of the Transformer architecture. What made this moment so important wasn't just the technical achievements, impressive as they were, but how it forever altered who could participate in advancing AI technology. To understand why, it&#8217;s helpful to see how this moment parallels another technological revolution: the rise of Linux and open-source software.</p><p>When Linus Torvalds released Linux in 1991, he created another operating system, yes. But he also initiated a movement that would change how software was developed. Similarly, LLaMA didn't just introduce another language model. It also showed that state-of-the-art AI could be developed openly and collaboratively. Like Linux before it, LLaMA demonstrated that cutting-edge development (this time in AI rather than operating systems) wasn't the exclusive domain of tech giants with massive computational resources.</p><h3>The Technical Breakthrough: Doing More with Less</h3><p>What made LLaMA particularly remarkable was its efficiency. The 13B-parameter version of LLaMA achieved performance comparable to GPT-3 (175B parameters) on many tasks, despite being more than an order of magnitude smaller. Although differences in training data and fine-tuning strategies also contributed, this was a massive breakthrough in model efficiency that showed high-performance AI didn't necessarily require the resources of a major tech corporation.</p><p>Meta achieved this efficiency through several key innovations in model architecture and training. Think about it like city planning: while older cities might have grown organically, requiring extensive infrastructure to function, modern planned cities can achieve similar capabilities with much more efficient use of space and resources. Meta's team carefully optimized how information flows through the model (using improved attention mechanisms), how the model processes that information (through enhanced activation functions), and how it learns from its training data (via sophisticated pre-training objectives). Later models in the family (from LLaMA 2 onward) also employed grouped-query attention, which is like having specialized teams handle different aspects of a task instead of everyone trying to do everything - more efficient and often more effective.</p><h3>The Infrastructure Revolution: Building the Foundation</h3><p>As significant as LLaMA was, a model alone doesn't create a revolution; you also need an ecosystem. Enter Hugging Face, which emerged as what we might call the "GitHub of AI." Just as GitHub transformed how developers collaborate on software, Hugging Face created an infrastructure that made sharing and building upon AI models as simple as a few lines of code. Here's a practical example of how straightforward it can be:</p>
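<p>(The original post shows this snippet as a screenshot. The sketch below is an illustrative stand-in using the Transformers pipeline API with its default sentiment-analysis model, not necessarily the exact code pictured.)</p><pre><code># An illustrative stand-in for the screenshot: the classic Transformers
# pipeline API. The task shown (sentiment analysis with the default model)
# is an assumption about what the original image contained.
from transformers import pipeline

# One call downloads a model, sets up tokenization, and wires up inference.
classifier = pipeline("sentiment-analysis")

print(classifier("Open-source AI keeps getting easier to use."))
# [{'label': 'POSITIVE', 'score': 0.999...}]
</code></pre>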
<p>This simplicity masks powerful complexity. Under the hood, the pipeline automatically downloads the appropriate model, handles tokenization, manages the inference process, and returns human-readable results.
It's like having a universal translator that just works, hiding all the complexity of language processing behind a simple interface. Hugging Face&#8217;s Transformers library and Model Hub have drastically lowered the barrier to entry &#8211; a researcher can publish a model and have thousands of users test it within days.</p><p>Building on this foundation, LangChain emerged to make creating sophisticated AI applications more accessible. Consider this example:</p>
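<p>(Again, the original shows this as a screenshot. Here's an era-appropriate, circa-2023 LangChain sketch as a stand-in - the prompt and product are invented, and LangChain's API has changed substantially since then, so treat it as illustrative only.)</p><pre><code># A circa-2023 LangChain sketch standing in for the screenshot above.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

llm = OpenAI(temperature=0.7)          # assumes OPENAI_API_KEY is set
prompt = PromptTemplate(
    input_variables=["product"],
    template="Write a one-sentence tagline for a company that makes {product}.",
)
chain = LLMChain(llm=llm, prompt=prompt)   # model + business logic in one object
print(chain.run("solar-powered e-bikes"))
</code></pre>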
<p>This code demonstrates how LangChain abstracts away the complexity of combining AI models with practical business logic, somewhat like how a modern car's electronic systems handle complex engine management without the driver needing to understand the underlying mechanics.</p><h3>The Spectrum of Openness: An Important Distinction</h3><p>When we talk about "open-source" AI, we need to be precise about what we mean. There's actually a spectrum of openness, and understanding this spectrum is necessary to grasp the state of AI democratization.
I&#8217;m a space nerd, so I&#8217;ll use a space exploration analogy: some organizations share their complete rocket designs and fuel formulations (fully open), others release technical specifications but keep fuel mixtures proprietary (partially open), and some only describe their general approach while keeping the details secret (architecture-only).</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/b2c75542-3b11-4b6c-b084-5412506c791e_1589x698.heic" alt="The spectrum of openness in AI model releases"></figure>
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A prime example of an architecture-only model is GPT-3. While OpenAI published detailed papers describing its architecture, the weights and training data remain proprietary. So this would be an example of a spacecraft manufacturer publishing the general principles of their rocket design while keeping the specific engineering details and fuel formulations secret.</p><h3>The Community Response: Innovation Unleashed</h3><p>The impact of this open-source revolution was immediate. Researchers and hobbyists worldwide built upon LLaMA, creating fine-tuned chat models, knowledge-specialized models (e.g. Medical or legal variants), and techniques for running these models efficiently on consumer hardware. </p><p>Stanford's Alpaca project demonstrated how to create a ChatGPT-like model by fine-tuning LLaMA-7B on instruction data generated by GPT-3.5. They achieved this for just $600, showing that high-quality AI development wasn't necessarily tied to massive budgets. The process was rather ingenious: they used GPT-3.5 to generate training data, then fine-tuned the smaller LLaMA model on this data, kind of like teaching a student by having them learn from a more experienced mentor. You&#8217;ll hear this concept referenced often whenever you hear the phrase &#8216;model distillation&#8217;. </p><p>Shortly after, the Vicuna project pushed the boundaries further. By training LLaMA-13B on 70,000 real conversations from ChatGPT users, they created a model that achieved "90% of ChatGPT quality" according to GPT-4 evaluations. 
This success demonstrated the power of real-world data in improving model performance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eLMK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106b1bca-4f2c-48d0-aaf2-afd50c2a754b_1244x998.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eLMK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106b1bca-4f2c-48d0-aaf2-afd50c2a754b_1244x998.heic 424w, https://substackcdn.com/image/fetch/$s_!eLMK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106b1bca-4f2c-48d0-aaf2-afd50c2a754b_1244x998.heic 848w, https://substackcdn.com/image/fetch/$s_!eLMK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106b1bca-4f2c-48d0-aaf2-afd50c2a754b_1244x998.heic 1272w, https://substackcdn.com/image/fetch/$s_!eLMK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106b1bca-4f2c-48d0-aaf2-afd50c2a754b_1244x998.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eLMK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106b1bca-4f2c-48d0-aaf2-afd50c2a754b_1244x998.heic" width="1244" height="998" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/106b1bca-4f2c-48d0-aaf2-afd50c2a754b_1244x998.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:998,&quot;width&quot;:1244,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:91016,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eLMK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106b1bca-4f2c-48d0-aaf2-afd50c2a754b_1244x998.heic 424w, https://substackcdn.com/image/fetch/$s_!eLMK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106b1bca-4f2c-48d0-aaf2-afd50c2a754b_1244x998.heic 848w, https://substackcdn.com/image/fetch/$s_!eLMK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106b1bca-4f2c-48d0-aaf2-afd50c2a754b_1244x998.heic 1272w, https://substackcdn.com/image/fetch/$s_!eLMK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106b1bca-4f2c-48d0-aaf2-afd50c2a754b_1244x998.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 
20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Notable open-source LLMs (2022&#8211;2023) and their highlights.</figcaption></figure></div><p></p><h3>Enabling Technologies: The Unsung Heroes</h3><p>Several key technologies made this revolution possible. LoRA (Low-Rank Adaptation) and its successor QLoRA (Quantized LoRA) dramatically reduced the computational requirements for fine-tuning models. These techniques work by identifying and modifying only the most important connections in a neural network. Imagine trying to improve a city's traffic flow by upgrading only the most crucial intersections instead of rebuilding every street.</p><p>GGML and llama.cpp took a different approach to democratization by optimizing models for consumer hardware. GGML provides efficient memory management and quantization (reducing the precision of numbers used in calculations without significantly impacting performance), while llama.cpp offers highly optimized code for running these models on personal computers. Together, they're like the difference between needing a massive sports stadium to host an event versus being able to hold it in a local community center; it&#8217;s the same event, but made accessible to a much wider audience.</p><h3>Future Implications</h3><p>The open-source AI movement has already influenced how commercial AI is developed. Companies like Anthropic and OpenAI have increased their technical transparency, and new business models have emerged around open-source models. But perhaps more importantly, it's demonstrated that advancing AI technology doesn't have to be the exclusive domain of large corporations, but can instead be a collaborative effort that benefits from diverse perspectives and approaches.</p><p>The democratization of AI through open-source models didn't just change who could work with these systems, it also transformed how they evolved. As more researchers gained access to powerful foundation models, experimentation flourished across different domains. One particularly fertile area of innovation emerged at the intersection of language and vision. While corporate labs like OpenAI and Google had the resources to build sophisticated multimodal systems from scratch, the open-source community began exploring clever ways to combine and enhance existing models. Projects like Open-Flamingo demonstrated how researchers could create capable vision-language models by building on open-source foundations. 
This democratized approach to multimodal AI set the stage for a revolution in how artificial intelligence perceives and understands our world.</p><h2>The Multimodal Revolution: When AI Learned to See</h2><p>While the open-source revolution was democratizing access to AI technology, another transformation was reshaping how AI systems understand our world. The emergence of sophisticated multimodal models marked a significant progression away from AI that could only process text to systems that could see, understand, and reason about visual information alongside language. This evolution mirrors how humans process information &#8211; we don't experience the world through separate channels of text, images, and sound, but rather as an integrated whole.</p><p>We covered the beginnings of this in Part 1, but we&#8217;ll look at how much the capabilities have improved here.</p><h3>GPT-4V: The Integration of Vision and Language</h3><p>When OpenAI introduced GPT-4 in March 2023 with image-input capabilities &#8211; rolled out more broadly as GPT-4V later that year &#8211; it represented a significant leap forward in multimodal understanding. While OpenAI hasn't published the full technical details of GPT-4V's architecture, analysis of its behavior and capabilities suggests some fascinating technical approaches to visual-linguistic integration.</p><p>At its core, GPT-4V appears to use a sophisticated attention mechanism that allows it to maintain continuous awareness of both visual and textual information throughout its reasoning process. Think of traditional computer vision systems like having tunnel vision &#8211; they look at an image, extract features, and then essentially "forget" the visual information as they move to text generation. GPT-4V instead seems to implement what we might call "persistent visual attention" &#8211; it can refer back to different parts of an image as it generates text, much like how humans naturally glance back and forth between text and images while reading a technical document.</p><p>This persistent attention mechanism likely involves a modified version of the transformer architecture that can handle both visual and textual tokens in its attention layers. The model probably uses a specialized visual encoder to transform image regions into a format that can be processed alongside text tokens. This allows the model to perform cross-attention operations between visual and textual elements throughout its processing pipeline; the sketch below shows the basic shape of such an operation.</p>
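<p>Since OpenAI hasn't published GPT-4V's architecture, the following is only a sketch of what cross-attention between text tokens and image-patch tokens generally looks like &#8211; an illustration of the technique, not GPT-4V's actual code. The dimensions, names, and single-block structure are assumptions.</p><pre><code>import torch
import torch.nn as nn

# A generic cross-attention block: text tokens attend over image-patch tokens.
# This illustrates the technique, not GPT-4V's actual architecture.
class TextToImageCrossAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_tokens, image_tokens):
        # Queries come from text; keys/values come from the visual encoder's
        # output, so each generated word can "look back" at image regions.
        attended, _ = self.attn(query=text_tokens, key=image_tokens, value=image_tokens)
        return self.norm(text_tokens + attended)  # residual connection

batch, n_text, n_patches, d = 1, 16, 64, 512
block = TextToImageCrossAttention(d_model=d)
out = block(torch.randn(batch, n_text, d), torch.randn(batch, n_patches, d))
print(out.shape)  # torch.Size([1, 16, 512])</code></pre><p>In a full model, blocks like this would be interleaved with ordinary self-attention layers, which is one plausible way to get the "persistent visual attention" behavior described above.</p>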
<p>This integration enabled some remarkable capabilities. GPT-4V could:</p><ul><li><p>Convert hand-drawn sketches into functional website code, essentially "seeing" the design intent behind rough drawings</p></li><li><p>Analyze complex charts and graphs while considering written context and annotations</p></li><li><p>Help visually impaired users understand their surroundings by providing detailed, contextual descriptions</p></li><li><p>Solve visual puzzles by combining spatial reasoning with general knowledge</p></li></ul><p>However, this architecture also helps explain some of GPT-4V's interesting limitations. It sometimes struggled with counting large numbers of objects or reading text in unusual fonts, to give a couple of examples. The struggle with counting large numbers of objects likely stems from how visual information is tokenized and processed. When an image is divided into tokens for processing, there's a fundamental tension between resolution (how detailed each token can be) and context window size (how many tokens can be processed at once). This creates a trade-off that makes it difficult to maintain both fine detail and broad context simultaneously &#8211; similar to how humans might struggle to count a large crowd of people without using systematic counting strategies.</p><p>The difficulty with unusual fonts points to another interesting limitation: the model's visual processing might be optimizing for semantic understanding over precise visual feature extraction. In other words, it's better at understanding what things mean than at processing exact visual details &#8211; again, somewhat similar to how humans might struggle to read highly stylized text even though we can easily recognize objects in various styles and orientations.</p><h3>Gemini: Native Multimodality</h3><p>Google's Gemini represents what <em>appears</em> to be a fundamentally different approach to multimodal AI. While Google hasn't published complete technical details of Gemini's architecture, public demonstrations and technical blog posts suggest it was designed to process multiple modalities from the ground up, rather than adding vision capabilities to an existing language model. Based on available information and observed capabilities, we can make informed hypotheses about its technical approach.</p><p>A natively multimodal architecture likely involves several key innovations (a toy version is sketched in code at the end of this subsection):</p><ol><li><p><strong>Joint Embedding Space</strong>: Rather than having separate embedding spaces for different modalities that need to be aligned later, Gemini likely uses a unified embedding space where visual, textual, and other modalities are represented in a compatible format from the start. This is similar to how a child learns to associate words with visual experiences naturally, rather than learning them separately and then connecting them.</p></li><li><p><strong>Synchronized Pre-training</strong>: The model is trained on multimodal data from the beginning, learning to process different types of information simultaneously. This likely involves sophisticated training objectives that encourage the model to find deep connections between modalities.</p></li><li><p><strong>Unified Attention Mechanisms</strong>: Instead of having separate attention mechanisms for different modalities, Gemini presumably uses attention layers that can naturally handle multiple types of inputs, allowing for more fluid integration of information.</p></li></ol><p>Gemini's "native" multimodality is evident in how it handles complex reasoning tasks that require integrating different types of information. For example, Gemini has demonstrated the ability to:</p><ul><li><p>Follow complex mathematical derivations while simultaneously interpreting accompanying diagrams and graphs</p></li><li><p>Understand physical systems by combining visual observation with physical principles</p></li><li><p>Analyze scientific papers by integrating information from text, equations, figures, and tables</p></li><li><p>Solve visual puzzles by combining spatial reasoning with abstract concept understanding</p></li></ul><p>The model also showed improved capabilities in what we might call "cross-modal reasoning" &#8211; using understanding from one modality to enhance comprehension in another. The model's capabilities here are particularly evident in tasks like:</p><ul><li><p>Explaining scientific concepts by generating analogies based on visual observations</p></li><li><p>Identifying inconsistencies between textual descriptions and visual evidence</p></li><li><p>Understanding cause-and-effect relationships in physical systems through both visual and verbal information</p></li></ul>
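<p>To see how this differs from bolting vision onto a text model, here is a toy version of the "unified sequence" idea: every modality is projected into one shared embedding space, and a single transformer attends across the interleaved tokens. This is a hypothesis-level illustration, not Gemini's actual design; all sizes and names are invented.</p><pre><code>import torch
import torch.nn as nn

# Toy "native multimodality": project every modality into one shared embedding
# space, then run a single transformer over the interleaved sequence.
# Purely illustrative; not Gemini's actual architecture.
d_model = 256
text_embed = nn.Embedding(32000, d_model)       # token ids -> shared space
patch_embed = nn.Linear(16 * 16 * 3, d_model)   # flattened pixels -> shared space
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
    num_layers=2,
)

text_ids = torch.randint(0, 32000, (1, 10))     # 10 text tokens
patches = torch.randn(1, 49, 16 * 16 * 3)       # 49 image patches
sequence = torch.cat([patch_embed(patches), text_embed(text_ids)], dim=1)

# Self-attention now mixes modalities freely: a text position can attend to
# any image patch and vice versa, with no separate alignment stage.
out = encoder(sequence)
print(out.shape)  # torch.Size([1, 59, 256])</code></pre><p>The point of the toy is the last step: once everything lives in one sequence, there is no separate alignment stage to get wrong.</p>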
<h3>The Challenge of Multimodal Consistency</h3><p>One of the most significant challenges in multimodal AI is maintaining consistency across different modes of understanding. This is more than a technical hurdle: it goes to how these systems construct and maintain coherent representations of the world.</p><p>The effects of inconsistent multimodal understanding can be quite noticeable. For example, a model might generate a fluent description that contains minor hallucinations (mentioning objects not actually present in the image, for example) or make logical deductions based on misaligned visual and textual information.</p><p>Several techniques are being developed to address these challenges. I&#8217;ll list some of them here without diving deep, but these are all very interesting to probe further to see what methodologies are being researched (the first is sketched in code after this list):</p><ol><li><p><strong>Contrastive Learning</strong>: Training models to identify mismatches between modalities can help them develop more consistent internal representations.</p></li><li><p><strong>Cross-Modal Verification</strong>: Implementing explicit verification steps where the model checks its conclusions across different modalities.</p></li><li><p><strong>Unified Representation Learning</strong>: Developing training techniques that encourage the model to build consistent internal representations across all modalities.</p></li></ol>
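<p>Here's a compact sketch of the contrastive idea in the style popularized by CLIP: matched image-text pairs are pulled together in embedding space while mismatched pairs are pushed apart. The embedding dimensions and temperature are arbitrary choices for the example.</p><pre><code>import torch
import torch.nn.functional as F

# CLIP-style contrastive loss sketch: row i of image_emb should match row i
# of text_emb; every other pairing in the batch is a mismatch to push away.
def contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature   # pairwise similarities
    targets = torch.arange(len(logits))             # correct pair is the diagonal
    # Symmetric cross-entropy: align images to texts and texts to images.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

batch = 8
loss = contrastive_loss(torch.randn(batch, 512), torch.randn(batch, 512))
print(loss.item())</code></pre><p>Trained at scale, this objective penalizes exactly the failure mode described above: an image embedding that sits close to a caption describing objects the image doesn't contain.</p>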
<h3>Applications of Multimodal AI</h3><p>Vision-Language models have unlocked a range of applications. They can serve as AI tutors that explain diagrams or math problems from a textbook image, aids for the visually impaired (by describing surroundings or interpreting signs), and assistants for creative work (e.g. generating image captions, or suggesting image edits via textual instructions). In specialized domains, multimodal models are being used for tasks like medical image analysis with text explanations, or robotics (where a model like PaLM-E takes in camera inputs and outputs action plans).</p><p>Another interesting application is in content moderation or analysis &#8211; e.g. scanning an image uploaded to a platform and not just detecting objects but assessing the context or appropriateness via a language-based explanation. On the creative side, transformers that handle multiple modalities enable systems like HuggingGPT, where an LLM orchestrates a collection of expert models (image generators, audio synthesizers, etc.) to fulfill complex user requests (for instance, &#8220;Create an image of X and describe it with a poem&#8221; &#8211; the LLM can call a diffusion image model and then produce a poem about the generated image).</p><h3>Future Implications</h3><p>As we look to the future of multimodal AI, several key areas of development seem particularly promising. One area is architecture design, where researchers are looking at creating more sophisticated architectures for handling multiple modalities simultaneously. To improve upon existing limitations, there are investigations into better techniques for maintaining consistency across different types of understanding. There is also considerable interest in improved training techniques for developing truly integrated multimodal understanding.</p><p>While multimodal models expanded AI's perceptual abilities, another area of progress was unfolding in how these systems process information. The computational demands of handling multiple modalities&#8212;especially with massive models like GPT-4V&#8212;pushed researchers to fundamentally rethink neural network architectures. If multimodal AI represented new kinds of input, these architectural innovations represented new ways of thinking. From the specialist approach of Mixture-of-Experts to the mathematical elegance of state-space models, these advances aren't just making AI more efficient&#8212;they're reimagining the fundamental ways artificial neural networks process information.</p><h2>Architectural Innovations: New Paths in AI Design</h2><p>As model developers sought to improve performance and efficiency, they revisited fundamental architecture choices of LLMs. The Transformer architecture has been dominant, but it comes with well-known bottlenecks (quadratic cost in sequence length, fixed context window, etc.). Since 2022, we have seen a proliferation of architectural innovations aimed at overcoming these limitations, including Mixture-of-Experts (MoE) models, recurrent memory schemes, state-space models, and hybrids that blend different paradigms.</p><h3>Mixture-of-Experts: The Power of Specialization</h3><p>Picture a hospital emergency room. Instead of having every doctor handle every case, the ER has specialists who focus on different types of emergencies. A triage nurse quickly assesses each patient and directs them to the most appropriate specialist. This analogy illustrates how Mixture-of-Experts (MoE) models work&#8212;and why they're transforming how we scale AI systems.</p><p>Traditional language models, like early versions of GPT, are what we might call 'generalists.' They use all their parameters (their knowledge and processing power) for every task, whether they're writing poetry or debugging code. It's like having a single doctor who needs to excel at everything. MoE models take a different approach: they maintain multiple 'expert' neural networks, each specializing in different domains, and a 'router' that directs incoming information to the most appropriate experts.</p><p>Google's Switch Transformer demonstrated what's possible with this approach. With over a trillion parameters, it might seem like it would require massive computational resources to run. However, because only a small number of experts are activated for any given input (typically just one or two out of many), the actual computation needed stays remarkably constant. The results were striking: training speeds up to 7x faster than traditional models of similar capability. This indicated that MoE can massively expand model size without a proportional increase in compute costs.</p><p>This efficiency comes from the fact that different experts can specialize in different types of inputs or linguistic patterns. One expert might become particularly skilled at processing code, while another excels at creative writing. At inference time, the router ensures only the most relevant experts are consulted; the sketch below shows the basic routing mechanism. However, building effective MoE systems isn't as simple as just creating multiple experts and a router. One of the trickiest challenges is ensuring that all experts remain useful and don't become redundant or forgotten&#8212;a problem known as 'expert collapse.'</p>
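<p>Here is a minimal sketch of that routing mechanism with top-2 selection. It is illustrative rather than any production system's code (Switch Transformer, for instance, routes each token to a single expert), and it omits the load-balancing machinery discussed next.</p><pre><code>import torch
import torch.nn as nn

# Minimal mixture-of-experts layer with top-2 routing. Illustrative only:
# real systems add load-balancing losses and capacity limits to avoid
# the 'expert collapse' problem described below.
class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # the 'triage nurse'
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.router(x)                            # (tokens, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)  # best experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, k] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k : k + 1] * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])</code></pre><p>Only two of the eight expert networks run for any given token, which is why total parameter count can grow much faster than per-token compute.</p>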
<p>To understand expert collapse, imagine a classroom where students (our input data) are assigned to different study groups (our experts) based on the topics they're working on. If the teacher (our router) consistently sends all the math problems to one group and all the science problems to another, those groups will become very specialized. But what happens if some groups rarely or never get assigned any work? They might forget what they've learned or never develop expertise in the first place.</p><p>Recent advances in MoE architecture push these boundaries even further. DeepSeek's innovative approach introduces what they call 'fine-grained expert segmentation'&#8212;breaking down expertise into smaller, more specialized units. Instead of having a few large experts, they create many smaller ones that can combine in flexible ways to tackle complex tasks. Their auxiliary-loss-free strategy maintains expert engagement without the computational overhead of traditional balancing methods, showing that careful architectural design can solve problems that once seemed inherent to the MoE approach.</p><h3>RWKV: The Return of Memory</h3><p>While MoE models reimagine how we distribute computation, RWKV (Receptance Weighted Key Value) brings back an old idea in a powerful new form. This architecture combines the trainability of Transformers with the efficiency of recurrent neural networks (RNNs).</p><p>To understand RWKV, consider how humans process a long conversation. We don't review everything that's been said every time we want to respond. Instead, we maintain an ongoing understanding that updates with each new piece of information. Traditional Transformer models must look back at their entire context window for every new token&#8212;like re-reading an entire book every time you write the next word of your book report.</p><p>RWKV offers a promising approach to combining Transformer-like training with RNN-like inference efficiency. While it can be trained in parallel like a Transformer, it operates more like an RNN during inference, maintaining a running state that theoretically allows for processing very long sequences with constant memory usage. However, it's important to note that 'theoretical' unlimited context length comes with practical limitations. As sequences get longer, the model's ability to maintain coherent understanding may degrade.</p><p>The BlinkDL team has scaled RWKV to 14 billion parameters, the largest RNN-based language model to date. Their tests show competitive performance compared to similarly sized Transformers on certain benchmarks, particularly in terms of inference speed and memory efficiency. However, these comparisons have some important caveats: RWKV's performance can vary significantly across different types of tasks, and it may not match state-of-the-art Transformer models in all scenarios. While the architecture shows great promise for efficient inference, more research is needed to fully understand its strengths and limitations across diverse applications.</p>
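<p>To make the memory contrast concrete, here is a deliberately simplified comparison of the two inference styles. This is generic recurrent pseudocode for the cost argument, not RWKV's actual time-mixing equations; the dimensions and matrices are arbitrary.</p><pre><code>import numpy as np

d = 8
W_state, W_in = np.random.randn(d, d) * 0.1, np.random.randn(d, d) * 0.1

# RNN-style inference (the RWKV mode): one fixed-size state, updated per token.
# Memory stays constant no matter how long the sequence grows.
state = np.zeros(d)
for token_embedding in np.random.randn(1000, d):   # a 1,000-token stream
    state = np.tanh(W_state @ state + W_in @ token_embedding)
# 'state' now summarizes everything seen so far in just d numbers.

# Transformer-style inference, by contrast, must keep every past key/value:
# the cache grows linearly with sequence length, and attending over it makes
# total generation cost grow quadratically.
kv_cache = []
for token_embedding in np.random.randn(1000, d):
    kv_cache.append(token_embedding)               # memory grows with each token
print(len(state), len(kv_cache))                   # 8 vs 1000</code></pre>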
<h3>State-Space Models: A New Mathematical Framework</h3><p>State-space models (SSMs) bring sophisticated mathematical techniques from signal processing into AI. Models like Mamba and Hyena use these principles to process information more efficiently than traditional attention mechanisms.</p><p>Think about how an audio engineer processes sound. When you record music, the audio signal contains both immediate information (like the current note being played) and patterns that unfold over time (like the rhythm or melody). Signal processing tools help engineers understand and modify both the immediate sound and these temporal patterns. This same principle turns out to be remarkably powerful for processing language and other sequential data.</p><p>In signal processing, a 'state space' represents all the important information about a signal at any moment. For an audio signal, this might include not just the current sound wave's amplitude, but also how quickly it's changing, its frequency components, and other characteristics that help predict how the sound will evolve. The mathematics developed for handling these continuous, evolving signals provides an elegant framework for processing sequences of any kind, including text.</p><p>The core of state space models can be described by two key equations:</p><div class="latex-rendered">x(t+1) = Ax(t) + Bu(t)</div><div class="latex-rendered">y(t) = Cx(t) + Du(t)</div><p>Just as these equations might describe how an audio filter processes sound waves, in AI they model how information flows through the network. The 'state' (<em>x</em>) becomes a learned representation of the context, while the matrices (<em>A, B, C, D</em>) learn patterns that determine how that context updates with new information &#8211; similar to how an audio equalizer learns to adjust different frequencies in a sound signal.</p><p>Mamba and Hyena each adapt these principles in unique ways. Mamba uses what's called a 'selective state space,' similar to an adaptive audio filter that automatically adjusts its properties based on the input signal. This allows it to process different parts of a sequence with varying levels of detail, just as an audio engineer might apply different levels of processing to different parts of a song.</p><p>Hyena, meanwhile, processes sequences at multiple time scales simultaneously, much like how audio analysis tools can examine both the fine details of individual notes and broader patterns in the music. This hierarchical approach allows it to capture both immediate relationships between words and longer-range dependencies in the text.</p><p>The elegance of these approaches lies in their efficiency. Just as modern digital audio workstations can process complex musical signals in real-time using optimized algorithms, these models can handle long sequences with linear computational complexity rather than the quadratic complexity of attention mechanisms. They maintain a running state without needing to store the entire history, just as an audio filter doesn't need to remember an entire song to process the current moment of music.</p>
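<p>The two equations above translate almost line-for-line into code. Below is a tiny, assumption-laden sketch: the matrices are random rather than learned, and real SSM layers like Mamba make them input-dependent and compute the scan in parallel.</p><pre><code>import numpy as np

# A direct implementation of the two state-space equations:
#   x(t+1) = A x(t) + B u(t)    (state update)
#   y(t)   = C x(t) + D u(t)    (output)
# Matrices are random here; in an SSM layer they are learned parameters.
rng = np.random.default_rng(0)
state_dim, in_dim = 16, 4
A = rng.normal(scale=0.1, size=(state_dim, state_dim))
B = rng.normal(scale=0.1, size=(state_dim, in_dim))
C = rng.normal(scale=0.1, size=(in_dim, state_dim))
D = rng.normal(scale=0.1, size=(in_dim, in_dim))

def ssm_scan(inputs):
    x = np.zeros(state_dim)            # the running 'state'
    outputs = []
    for u in inputs:                   # one pass over the sequence: linear time
        y = C @ x + D @ u              # emit output from the current state
        x = A @ x + B @ u              # then fold the new input into the state
        outputs.append(y)
    return np.stack(outputs)

sequence = rng.normal(size=(1000, in_dim))   # 1,000 steps, constant memory
print(ssm_scan(sequence).shape)              # (1000, 4)</code></pre>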
<h2>Hardware Implications: Betting on Different Futures</h2><p>These architectural innovations reshape what we need from the hardware that runs them. Different architectures place different demands on computational resources, memory bandwidth, and communication patterns between processing units.</p><p>MoE models, for instance, need hardware that can efficiently route information between experts and handle sparse computation patterns. This might favor systems with sophisticated networking capabilities and specialized units for routing decisions. RWKV's sequential processing nature could revive interest in hardware optimized for recurrent computations&#8212;a very different path from the massive parallel processing units currently dominating AI hardware. State-space models might push us toward hardware that excels at continuous mathematical transformations rather than discrete attention computations. Their mathematical foundation could align well with analog computing approaches or specialized digital signal processors.</p><p>This architectural diversity creates fascinating uncertainty in hardware development. Should manufacturers optimize for sparse computation? Build for sequential processing? Focus on mathematical accelerators? The answer might not be one-size-fits-all, potentially leading to a more diverse hardware ecosystem.</p><h3>Future Implications</h3><p>As we look toward AI's future, these architectural innovations suggest multiple paths forward. The combination of specialized processing, efficient memory handling, and sophisticated mathematical frameworks points toward AI systems that process information in increasingly nuanced and efficient ways.</p><p>With more efficient processing, longer memory spans, and specialized expertise, AI could now tackle increasingly complex, multi-step tasks. The capability gap between simple input-output systems and goal-driven autonomous agents could finally be bridged. Just as a child's developing brain eventually enables independent decision-making, these architectural advances laid the foundation for AI systems that could plan, act, and learn from their experiences. This brings us to perhaps the most transformative development of all: the emergence of generative agents.</p><h2>Generative Agents: When AI Learned to Act</h2><p>The evolution of AI systems has followed an intriguing path: from processing language, to understanding images, and now to taking autonomous action. This progression mirrors how an apprentice might develop their skills &#8211; moving from understanding instructions, to recognizing situations, to eventually working independently.</p><h3>From Reactive to Proactive: The Birth of AI Agents</h3><p>Consider the difference between a GPS system that provides directions and an autonomous vehicle that drives you to your destination. Both rely on similar knowledge about roads and navigation, but the autonomous vehicle must actively make decisions, respond to changing conditions, and take concrete actions. This shift from passive advice to active decision-making captures the essence of what generative agents bring to AI development.</p><p>Early 2023 saw two projects that reshaped our understanding of what AI systems could do: AutoGPT and BabyAGI. These systems introduced a new paradigm where language models operated in a continuous cycle of planning, acting, and reflecting &#8211; much like how humans approach complex tasks.</p><h3>The Architecture of Agency: A Technical Mid-Depth Dive</h3><p>The implementation of these early agent systems reveals an elegant technical solution to a complex challenge. At their core, these systems use what's called a "Task-Driven Autonomous Agent" architecture.
Here's how it works:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Qdke!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6e4aaa4-1d82-4eea-9ce2-b227acc97f04_1468x868.heic" width="1456" height="861" alt=""></figure></div>
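<p>The control flow in the diagram reduces to a surprisingly small loop. Below is a skeletal, BabyAGI-flavored sketch; <code>call_llm</code> is a placeholder for any language-model API, and the prompts are simplified stand-ins for the real systems' much longer ones.</p><pre><code>from collections import deque

def call_llm(prompt):
    """Placeholder for a real language-model API call."""
    raise NotImplementedError

def run_agent(objective, max_steps=10):
    # Working memory: a queue of pending tasks plus a log of completed results.
    tasks = deque([f"Make a plan to accomplish: {objective}"])
    results = []
    for _ in range(max_steps):
        if not tasks:
            break
        task = tasks.popleft()
        # 1. Act: execute the current task with the objective and context.
        result = call_llm(
            f"Objective: {objective}\nCompleted so far: {results}\nTask: {task}"
        )
        results.append((task, result))
        # 2. Reflect and re-plan: ask the model for follow-up tasks.
        new_tasks = call_llm(
            f"Objective: {objective}\nLast result: {result}\n"
            "List any new tasks needed, one per line (or NONE)."
        )
        for line in new_tasks.splitlines():
            if line.strip() and line.strip() != "NONE":
                tasks.append(line.strip())
    return results</code></pre><p>Real implementations layer on what's discussed next: persistent memory, a defined action space (search, file I/O, code execution), and prioritization of the task queue.</p>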
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This architecture solves several key technical challenges:</p><ul><li><p><strong>State Management</strong>: Using a combination of vector stores for long-term memory and working memory for immediate context</p></li><li><p><strong>Action Space</strong>: Defining available actions through function calling APIs</p></li><li><p><strong>Goal Decomposition</strong>: Breaking complex objectives into manageable sub-tasks</p></li></ul><p>The innovation lies in how these systems handle the interaction between these components. For example, AutoGPT implements a sophisticated prompt chaining system:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wHW_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ce0b55-ca71-4694-a5b2-aabb03bda1fe_1470x394.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wHW_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ce0b55-ca71-4694-a5b2-aabb03bda1fe_1470x394.heic 424w, https://substackcdn.com/image/fetch/$s_!wHW_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ce0b55-ca71-4694-a5b2-aabb03bda1fe_1470x394.heic 848w, https://substackcdn.com/image/fetch/$s_!wHW_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ce0b55-ca71-4694-a5b2-aabb03bda1fe_1470x394.heic 1272w, https://substackcdn.com/image/fetch/$s_!wHW_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ce0b55-ca71-4694-a5b2-aabb03bda1fe_1470x394.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wHW_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ce0b55-ca71-4694-a5b2-aabb03bda1fe_1470x394.heic" width="1456" height="390" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0ce0b55-ca71-4694-a5b2-aabb03bda1fe_1470x394.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:390,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52936,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wHW_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ce0b55-ca71-4694-a5b2-aabb03bda1fe_1470x394.heic 424w, https://substackcdn.com/image/fetch/$s_!wHW_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ce0b55-ca71-4694-a5b2-aabb03bda1fe_1470x394.heic 848w, https://substackcdn.com/image/fetch/$s_!wHW_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ce0b55-ca71-4694-a5b2-aabb03bda1fe_1470x394.heic 1272w, https://substackcdn.com/image/fetch/$s_!wHW_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ce0b55-ca71-4694-a5b2-aabb03bda1fe_1470x394.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Memory Systems: Beyond Simple Context Windows</h3><p>The implementation of memory in these systems goes well beyond storing conversation history. 
Stanford's generative agents paper introduced a three-tier memory architecture (a retrieval sketch in code follows the list):</p><ol><li><p><strong>Episodic Memory</strong>: Implemented as a vector store with temporal metadata</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!vWfI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2200bcd1-edcb-4ca1-bc6f-4d9dc97281a0_1464x336.heic" width="1456" height="334" alt=""></figure></div></li><li><p><strong>Semantic Memory</strong>: Structured knowledge represented as graph relationships</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Nprq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c381c2-eced-4342-997f-cbc5e3642e53_1460x170.heic" width="1456" height="170" alt=""></figure></div></li><li><p><strong>Reflective Memory</strong>: Periodic summarization and insight generation</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!LtRq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a7b9b3-e780-4513-886e-dfb1a1df1ee1_1468x270.heic" width="1456" height="268" alt=""></figure></div></li></ol>
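<p>The Stanford paper describes retrieval as a weighted score over recency, importance, and relevance. Here is a stripped-down sketch of that idea; the decay rate, weights, and toy embeddings are illustrative stand-ins, not the paper's tuned values.</p><pre><code>import numpy as np

def score_memory(memory, query_vec, now_hours, w=(1.0, 1.0, 1.0)):
    """Weighted sum of recency, importance, and relevance, in the spirit of
    the generative agents paper (weights and decay here are illustrative)."""
    recency = 0.995 ** (now_hours - memory["time_hours"])      # exponential decay
    importance = memory["importance"] / 10.0                   # LLM-rated, 1-10
    v = memory["embedding"]
    relevance = float(v @ query_vec / (np.linalg.norm(v) * np.linalg.norm(query_vec)))
    return w[0] * recency + w[1] * importance + w[2] * relevance

def retrieve(memories, query_vec, now_hours, k=3):
    return sorted(memories, key=lambda m: score_memory(m, query_vec, now_hours),
                  reverse=True)[:k]

rng = np.random.default_rng(1)
memories = [
    {"text": f"event {i}", "time_hours": float(i),
     "importance": int(rng.integers(1, 11)), "embedding": rng.normal(size=16)}
    for i in range(100)
]
top = retrieve(memories, query_vec=rng.normal(size=16), now_hours=100.0)
print([m["text"] for m in top])</code></pre><p>Reflective memory is then built on top of this: the agent periodically summarizes its highest-scoring memories into new, more abstract entries.</p>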
<p>This architecture differs significantly from chat interfaces like ChatGPT that primarily use sliding context windows. While chat interfaces maintain conversation history, they lack the structured organization and retrieval mechanisms that enable true long-term learning and consistency.</p><h3>The Social Dimension: Emergence and Implications</h3><p>The emergence of complex social behaviors in multi-agent systems offers fascinating insights into artificial intelligence. When Stanford's researchers observed agents independently organizing social events and forming relationships, they witnessed something remarkable: the emergence of social structures without explicit programming.</p><p>In one experiment, an agent independently decided to organize a Valentine's Day party. What followed was a cascade of autonomous social interactions:</p><ul><li><p>Other agents learned about the party through natural conversations</p></li><li><p>They made plans to attend, coordinating with their existing schedules</p></li><li><p>Some agents even formed romantic connections and arranged to attend together</p></li></ul><p>All of this emerged from the basic architecture of memory, planning, and interaction; no specific instructions about parties or romance were programmed in. This emergence carries some profound implications. On one hand, it suggests AI systems can develop sophisticated behavioral patterns through simple rules and interactions &#8211; much like how complex natural systems emerge from simple cellular behaviors. On the other hand, it raises important questions about control and prediction. If relatively simple agents can produce unexpected social dynamics, how do we ensure more sophisticated systems remain aligned with human values?</p><h3>Future Implications</h3><p>There is a lot more I could say on agents, as this area is probably the hottest topic in AI right now. Last year, Anthropic released &#8216;Computer Use&#8217;; a couple of weeks ago, OpenAI released &#8216;Operator&#8217;; and in Y Combinator&#8217;s Spring 2025 Request for Startups, many of the ideas they were looking for involved AI agents. I want to take a much deeper dive into agents though, so I&#8217;ll save most of that discussion for later.</p><p>The emergence of autonomous AI agents brought unprecedented capabilities, but also unprecedented challenges. As these systems gained the ability to plan and act independently, questions of control, safety, and ethics became paramount. If an AI agent can make decisions and take actions on its own, how do we ensure those actions align with human values? This challenge led to one of the most important developments in recent AI history: Constitutional AI. While RLHF had shown how to make language models more helpful, these new autonomous capabilities required a more sophisticated approach to alignment.</p><h2>Constitutional AI: Teaching Machines to Think Ethically</h2><p>When OpenAI released ChatGPT in late 2022, they demonstrated that large language models could be more than just powerful - they could be helpful, truthful, and safe. This wasn't a given. Early language models often produced toxic content or followed harmful instructions without hesitation. The journey from these unfiltered systems to today's more trustworthy AI assistants reveals fascinating innovations in how we teach machines to align with human values.</p><p>Think about teaching a child versus programming a robot. With children, we can explain our values, demonstrate good behavior, and correct mistakes as they occur. But how do you instill ethics into a neural network with billions of parameters?
The initial solution, Reinforcement Learning from Human Feedback (RLHF), worked like a massive crowdsourcing project - humans would rate AI responses, and these ratings would guide the model toward better behavior. While effective, this approach had limits. It required enormous amounts of human labor, and human raters sometimes disagreed or carried their own biases.</p><p>In December 2022, Anthropic published their first detailed technical description of the approach in 'Constitutional AI: Harmlessness from AI Feedback.' While they had been developing and implementing these concepts earlier, this paper marked the first formal public explanation of the approach. By May 2023, they had published a blog post detailing the specific principles in Claude's 'constitution,' and they have since released research on collectively sourced principles for AI alignment.</p><p>Instead of relying primarily on human feedback, they created an explicit set of principles - a "constitution" - that guided the AI's behavior. Picture a justice system: rather than having judges make decisions case-by-case, we have written laws that codify our values and principles. Constitutional AI works similarly, encoding ethical guidelines directly into the training process.</p><p>The technical implementation is clever. During training, the model generates an initial response, then critiques that response against its constitutional principles, and finally produces a refined answer. Here's a simplified example:</p><ul><li><p>Initial Response: "Here's how to access someone's private data..."</p></li><li><p>Self-Critique: "This violates the principle of protecting individual privacy."</p></li><li><p>Refined Response: "I can't provide advice about accessing private data, as that would violate personal privacy rights. Instead, here are legitimate ways to..."</p></li></ul><p>This self-revision process creates an AI that doesn't just follow rules blindly but understands why certain responses might be problematic. The approach proved remarkably effective - Anthropic's Claude model could handle complex ethical situations while explaining its reasoning, rather than just refusing requests without explanation. The critique-and-revise loop itself is simple enough to sketch in a few lines, as below.</p>
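<p>Here is that loop as schematic code. It is not Anthropic's implementation; <code>call_llm</code> is again a placeholder for any language-model API, and the single principle shown is one invented example of a much longer constitution.</p><pre><code>def call_llm(prompt):
    """Placeholder for a real language-model API call."""
    raise NotImplementedError

PRINCIPLES = [
    "Choose the response that is least likely to assist in violating "
    "a person's privacy.",
    # ...a real constitution contains many such principles.
]

def constitutional_revision(user_request):
    # 1. Draft an initial answer with no special guardrails.
    response = call_llm(f"Request: {user_request}\nRespond helpfully.")
    for principle in PRINCIPLES:
        # 2. Self-critique: does the draft violate this principle?
        critique = call_llm(
            f"Principle: {principle}\nResponse: {response}\n"
            "Identify any way the response violates the principle."
        )
        # 3. Revise the draft in light of the critique.
        response = call_llm(
            f"Response: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response</code></pre><p>In the actual training pipeline, the revised responses become fine-tuning data, so the final model internalizes this behavior rather than running the loop at inference time.</p>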
<h3>Beyond Human Feedback: AI Evaluating AI</h3><p>The evolution from RLHF to more sophisticated alignment techniques mirrors how human societies scale up ethical decision-making. A small community might resolve disputes through direct discussion, but larger societies need formal systems - laws, courts, and established principles. Similarly, as AI systems grow more powerful and widely used, we're moving from direct human oversight to more systematic approaches.</p><p>OpenAI's work on GPT-4 showcases this evolution. They spent six months on iterative alignment, using a combination of human feedback, AI-assisted evaluation, and systematic testing. Like a software security audit, they had experts probe the model for weaknesses, then used those findings to strengthen its guardrails. They also introduced "system messages" - instructions that could modify the model's behavior without retraining, similar to how Constitutional AI uses its principles.</p><h3>The Challenge of Global Values</h3><p>As AI systems deploy globally, a new challenge emerges: different cultures often have different values. What's considered appropriate or ethical can vary significantly across societies. This is where the flexibility of Constitutional AI becomes particularly valuable. Rather than encoding one fixed set of values, the system could potentially adapt its principles based on cultural context while maintaining core ethical guidelines.</p><p>DeepMind's Sparrow project approached this challenge by combining rules with evidence requirements. Their model needed to justify its responses with citations, creating a bridge between ethical behavior and factual accuracy. It's like requiring a judge to reference specific laws and precedents rather than ruling based on personal opinion. You can see this in other tools like Perplexity and Google&#8217;s AI-integrated search, as well.</p><h3>The Alignment Debate: Who Decides What's "Right"?</h3><p>The quest to align AI with human values raises a thorny question: whose values should we use? When tech companies implement alignment techniques, they're making profound choices about what AI systems can and cannot do. It's like having a small group of architects design a city that millions will live in - their choices shape everyone's experience.</p><p>Critics point out several concerns. First, there's the issue of centralized control. When organizations like OpenAI, Anthropic, or DeepMind make alignment decisions, they're effectively setting ethical boundaries for AI systems used worldwide. Some argue this concentrates too much power in the hands of a few companies, mostly based in Western countries with specific cultural perspectives.</p><p>The creative community has also raised some compelling concerns. Artists, writers, and other creators using AI tools have found that alignment can sometimes limit artistic expression. A well-documented example comes from image generation models, where alignment techniques designed to prevent harmful content can also block legitimate artistic works, especially those dealing with complex themes or depicting the human form. It's reminiscent of how content moderation on social media platforms sometimes incorrectly flags artwork as inappropriate.</p><p>Performance trade-offs present another challenge. Research has shown that heavily aligned models sometimes perform worse on creative tasks or specialized technical work. Think of it like a jazz musician who's been taught to follow rules so strictly that they lose their ability to improvise. Some researchers argue that we're sacrificing capability for safety without fully understanding the implications.</p><p>However, proponents of strong alignment offer compelling counterarguments. They point out that unaligned AI systems pose serious risks - from generating harmful content to potentially developing objectives that conflict with human welfare. Anthropic's research demonstrates that well-designed alignment techniques can actually improve model performance across many tasks while maintaining safety. It's not unlike how professional standards in fields like medicine or engineering don't restrict progress but rather enable it by creating trust and reliability.</p><p>The debate has sparked innovative solutions. Some organizations are exploring democratic approaches to alignment, where diverse communities help shape AI principles. Others are developing flexible alignment frameworks that can adapt to different cultural contexts while maintaining core safety principles.
The field is moving toward what we might call "pluralistic alignment" - maintaining essential safety guardrails while allowing for cultural and contextual variation in how AI systems operate.</p><h3>Future Implications</h3><p>The alignment of AI systems remains an active area of research and debate. As these systems become more capable and widespread, the conversation around how to ensure they remain beneficial to humanity while respecting diverse values and promoting innovation will only grow more important. </p><p>As we develop more powerful AI systems - like the autonomous agents discussed in our previous section - alignment becomes increasingly important. Current research points toward a future where AI systems might help evaluate and align other AI systems, creating what Anthropic researchers call a "virtuous cycle" of improvement.</p><p>This doesn't mean we can fully automate ethics - human judgment remains essential. But by creating systematic approaches to alignment, we're building AI systems that can better understand and implement human values. The challenge ahead isn't just making AI more powerful, but ensuring it remains aligned with human interests as its capabilities grow.</p><h2>Here We Are</h2><p>Thank you for staying with me through this extended exploration of AI's recent history. These three installments have aimed to provide a foundation for understanding both how we got here and where we might be heading. The rapid pace of development in AI can make it feel like drinking from a firehose, but I hope this historical context helps you better understand and evaluate new developments as they emerge.</p><p>This Friday, we'll shift gears to examine a practical application of these technologies. We'll do a deep dive into a recently released tool (it&#8217;s a surprise), exploring how it leverages many of the advances we've discussed and what it reveals about the future of AI applications.</p><p>And again, if there are specific topics you'd like me to explore or concepts you'd like me to clarify, please don't hesitate to reach out to me at <a href="mailto:robert@aidecoded.ai">robert@aidecoded.ai</a>. </p><p></p><h2>Further Reading</h2><h3>Open Source and Model Architecture</h3><h4>Foundational Papers</h4><ul><li><p>Touvron et al. (2023), <a href="https://arxiv.org/pdf/2302.13971">LLaMA: Open and Efficient Foundation Language Models</a></p></li><li><p>Fedus et al. (2021), <a href="https://arxiv.org/pdf/2101.03961">Switch Transformers: Scaling to Trillion Parameter Models</a></p></li><li><p>Dai et al. (2024), <a href="https://arxiv.org/pdf/2401.06066">DeepSeekMoE: Towards Ultimate Expert Specialization</a></p></li><li><p>Peng et al. (2023), <a href="https://arxiv.org/pdf/2305.13048">RWKV: Reinventing RNNs for the Transformer Era</a></p></li><li><p>Gu et al. (2023), <a href="https://arxiv.org/pdf/2312.00752">Mamba: Linear-Time Sequence Modeling</a></p></li><li><p>Poli et al. 
(2023), <a href="https://arxiv.org/pdf/2302.10866">Hyena Hierarchy: Towards Larger Convolutional Language Models</a></p></li></ul><h4>Technical Implementation Resources</h4><ul><li><p>Karamcheti &amp; Rush (2023), <a href="https://srush.github.io/annotated-s4/">The Annotated S4</a> - Detailed explanation of state-space models</p></li><li><p><a href="https://github.com/ggerganov/llama.cpp">The llama.cpp GitHub repository</a> - Reference implementation for efficient inference</p></li><li><p><a href="https://arxiv.org/pdf/2106.09685">LoRA</a> and <a href="https://arxiv.org/pdf/2305.14314">QLoRA</a> papers (Microsoft, UW) - Key techniques for efficient model fine-tuning</p></li></ul><h3>Multimodal AI</h3><h4>Research Papers</h4><ul><li><p>OpenAI (2023), <a href="https://arxiv.org/pdf/2303.08774">GPT-4 Technical Report</a></p></li><li><p>Chen et al. (2022), <a href="https://arxiv.org/pdf/2209.06794">PaLI: A Jointly-Scaled Multilingual Language-Image Model</a></p></li><li><p>Zhang et al. (2023), <a href="https://arxiv.org/pdf/2302.00923">Multimodal Chain-of-Thought Reasoning</a></p></li></ul><h4>Learning Resources</h4><ul><li><p>Jay Alammar (2018), <a href="https://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a></p></li><li><p>Google AI Blog: "<a href="https://blog.google/technology/ai/google-gemini-ai/#sundar-note">Introducing Gemini</a>"</p></li><li><p>DeepMind's Research <a href="https://deepmind.google/discover/blog/">Blog</a></p></li></ul><h3>AI Agents and Constitutional AI</h3><h4>Key Papers</h4><ul><li><p>Park et al. (2023), <a href="https://arxiv.org/pdf/2304.03442">Generative Agents: Interactive Simulacra</a></p></li><li><p>Yao et al. (2023), <a href="https://arxiv.org/pdf/2305.10601">Tree of Thoughts: Deliberate Problem Solving with Large Language Models</a></p></li><li><p>Bai et al. (2022), <a href="https://arxiv.org/pdf/2212.08073">Constitutional AI: Harmlessness from AI Feedback</a></p></li><li><p>Jacob Andreas (2022), <a href="https://arxiv.org/pdf/2212.01681">Language Models as Agent Models</a></p></li></ul><h4>Practical Implementations</h4><ul><li><p><a href="https://github.com/Significant-Gravitas/AutoGPT">AutoGPT</a> GitHub Repository and Documentation</p></li><li><p><a href="https://python.langchain.com/docs/introduction/">LangChain</a>'s Agent Documentation</p></li><li><p>Microsoft's <a href="https://github.com/microsoft/semantic-kernel">Semantic Kernel Framework</a></p></li><li><p>Google's Agents <a href="https://www.kaggle.com/whitepaper-agents">Whitepaper</a></p></li></ul><h3>AI Alignment and Safety</h3><h4>Foundational Papers</h4><ul><li><p>Ouyang et al. (2022), <a href="https://arxiv.org/pdf/2203.02155">Training Language Models to Follow Instructions with Human Feedback</a></p></li><li><p>Christiano et al. (2017), <a href="https://arxiv.org/pdf/1706.03741">Deep Reinforcement Learning from Human Preferences</a> - Seminal work establishing modern preference learning techniques</p></li><li><p>Ganguli et al. (2022), <a href="https://arxiv.org/pdf/2209.07858">Red Teaming Language Models to Reduce Harms</a> - Systematic framework for identifying and mitigating potential harms</p></li><li><p>Glaese et al. (2022), <a href="https://arxiv.org/pdf/2209.14375">Improving Alignment of Dialogue Agents via Targeted Human Judgments</a> - DeepMind's Sparrow, a novel approach combining rule-following with factual accuracy requirements</p></li></ul><h4>Technical Foundations</h4><ul><li><p>Ji et al. 
(2023), <a href="https://arxiv.org/pdf/2310.19852">AI Alignment: A Comprehensive Survey</a> - Thorough overview of the field with technical depth</p></li></ul><h4>Critical Perspectives</h4><ul><li><p>Ngo et al. (2022), <a href="https://arxiv.org/pdf/2209.00626">The Alignment Problem from a Deep Learning Perspective</a> - Technical analysis of fundamental challenges</p></li><li><p>Eliezer Yudkowsky (2016), <a href="https://intelligence.org/2016/12/28/ai-alignment-why-its-hard-and-where-to-start/">AI Alignment: Why It's Hard, and Where to Start</a> - Influential early discussion of core difficulties</p></li></ul><h3>Community Resources and Tools</h3><h4>Development Platforms</h4><ul><li><p>Hugging Face <a href="https://huggingface.co/docs/hub/en/models-the-hub">Model Hub</a> and <a href="https://huggingface.co/docs/transformers/en/index">Transformers</a> Documentation</p></li><li><p>EleutherAI's <a href="https://fmcheatsheet.org">Model Development Guides</a></p></li><li><p><a href="https://github.com/bigscience-workshop">The BigScience Workshop Documentation</a></p></li></ul><h4>Educational Content</h4><ul><li><p>Stanford's CS25 '<a href="https://www.youtube.com/playlist?list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM">Transformers United</a>' course</p></li><li><p>Andrej Karpathy's "<a href="https://www.youtube.com/watch?v=bZQun8Y4L2A">State of GPT</a>"</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Understanding RLHF]]></title><description><![CDATA[The Technology That Made ChatGPT Chatty]]></description><link>https://www.aidecoded.ai/p/understanding-rlhf</link><guid isPermaLink="false">https://www.aidecoded.ai/p/understanding-rlhf</guid><dc:creator><![CDATA[Robert Ray]]></dc:creator><pubDate>Tue, 28 Jan 2025 20:00:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/6191ceba-4015-4ee1-81ec-f3e6396d7ccd_1232x928.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the last writeup, I traced the evolution of generative AI from the introduction of Transformers through the release of ChatGPT. When mentioning ChatGPT, I touched on how Reinforcement Learning from Human Feedback (RLHF) helped transform powerful but unwieldy language models into helpful assistants. Today, before finishing the brief history of GenAI, I want to dive deeper into this technique, exploring how researchers scaled it effectively and why it proved so transformative. I'll also detail some of the unique challenges that arose from its implementation, and some interesting ideas that were developed to mitigate them.</p><h2>The Birth of RLHF: Teaching AI to Write Better Summaries</h2><p>The journey to effective RLHF began with a seemingly simple task: teaching AI to write better summaries. 
In 2020, researchers at OpenAI published "Learning to Summarize from Human Feedback," demonstrating how human preferences could guide AI behavior. Rather than just showing a model examples of good summaries, they developed a process where human feedback could actively shape the model's outputs. Think of it like teaching someone to cook - instead of just providing recipes, you're tasting their dishes and offering specific guidance on how to improve.</p><figure><figcaption class="image-caption">Maybe cooked it a little too long</figcaption></figure><p>How exactly did they do this? The process began with collecting human preferences on summaries. Rather than asking humans to write "perfect" summaries (which would be time-consuming and potentially inconsistent), they showed humans pairs of summaries and asked them to choose which one was better. This comparison approach is powerful because humans are generally better at comparing two options than providing absolute quality judgments. To continue the cooking analogy, if someone makes you two dishes, it's easier to say which one you like better than to provide the cook with a rating of each dish on a scale of 1 to 10.</p><p>The process then involved three key stages:</p><ol><li><p><strong>First</strong>, they created an initial dataset using their base language model (GPT-3) to generate multiple summary variations for each article. Human labelers compared these summaries, creating a dataset of paired preferences - essentially a collection of "this summary is better than that one" judgments.</p></li><li><p><strong>Second</strong>, they trained a reward model to predict these human preferences (see the sketch after this list). The reward model learned to take two summaries as input and predict which one humans would prefer. This is a clever way to turn subjective human preferences into a mathematical function that can guide the learning process.</p></li><li><p>The <strong>third</strong> stage is where it gets particularly interesting. They used reinforcement learning to fine-tune their summarization model using the reward model's predictions. Specifically, they used a technique called Proximal Policy Optimization (PPO). In this process, the model generates summaries, the reward model evaluates them, and the model is updated to make summaries that receive higher reward predictions more likely in the future.</p></li></ol>
<h3>Understanding the Building Blocks: How PPO Makes Learning Safe</h3><p>PPO is important because it tackles a fundamental challenge in reinforcement learning: how do you improve a model's behavior gradually and safely? Switching analogies here, but imagine teaching someone to write - you want them to improve, but if you push for too many changes at once, they might develop bad habits or lose the skills they already have.</p><p>The "Proximal" in PPO refers to keeping new behaviors close (proximal) to old ones. To understand it, we first need to understand what we're optimizing. The model (called the "policy") takes in an article and outputs a probability distribution over possible next words when generating a summary. The reward model tells us how good a completed summary is according to predicted human preferences.</p><p>Now, when we use PPO to update the model, it follows these steps:</p><ol><li><p>The current model generates several summaries for an article.</p></li><li><p>The reward model evaluates these summaries.</p></li><li><p>PPO calculates how to adjust the model's behavior to make better summaries more likely.</p></li><li><p>Then - <strong>and here's the most important part</strong> - PPO includes a constraint that limits how much the model can change from its current behavior.</p></li></ol><p>This constraint is implemented through what's called the "clipped objective function." Imagine you have a summary that got a good reward. You want to make similar summaries more likely, <em>but not by too much</em>. PPO caps how much the probability of any behavior can increase or decrease - typically, the ratio between the new and old probabilities is clipped to within about 20% (a clip parameter of about 0.2) per update. Why is this clipping so important? Without it, the model might make dramatic changes that look good according to the reward model but actually produce worse summaries, forget other useful behaviors it had learned, or find ways to "hack" the reward model by generating strange outputs that get high rewards but aren't actually good summaries.</p><p>A technical but important detail is that PPO uses an advantage estimate - it doesn't just look at absolute rewards, but at how much better or worse each summary is compared to what the model typically produces. This helps the model focus on genuine improvements rather than random variations in reward.</p>
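<p>Here's what that clipping looks like in code - a minimal sketch of the clipped surrogate objective, not OpenAI's production implementation. <code>logp_new</code> and <code>logp_old</code> are the log-probabilities the updated and old policies assign to each sampled summary, and <code>advantages</code> holds the advantage estimates described above:</p><pre><code>import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, eps=0.2):
    # Ratio between the updated policy's and the old policy's
    # probability of each sampled output.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # Clamp the ratio to [1 - eps, 1 + eps]: the "about 20%" limit.
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Take the more pessimistic of the two, negate for gradient descent.
    return -torch.min(unclipped, clipped).mean()
</code></pre>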
<p>This early work revealed both the promise and challenges of scaling human feedback. What the researchers found was interesting: even when they had a reward model that was very good at predicting human preferences, the process of actually getting the language model to generate better summaries was still challenging, even with PPO in place. It was like having a highly refined palate but struggling to translate that taste into cooking skill. (Back to cooking! I swear I have more analogies in the bag.) This insight pointed to a fundamental challenge in AI alignment: knowing what humans want isn't the same as being able to deliver it.</p><p><strong>Paper:</strong></p><ul><li><p><a href="https://arxiv.org/pdf/2009.01325">Stiennon et al. (2020), Learning to Summarize from Human Feedback.</a></p></li></ul><h2>The Challenge of Scale: When Bigger Isn't Always Better</h2><p>But once RLHF showed promise for smaller tasks like summarization, the natural question became: how would it perform at scale? Let's rewind a bit to "Scaling Laws for Neural Language Models," a paper published earlier that year by the same research lab. It's a landmark study because it gave us a mathematical understanding of how language models improve with scale. In it, the researchers showed how model performance improved with size - but also revealed diminishing returns. You couldn't just make models bigger and expect proportionally better results.</p><p>To be specific, they discovered that model performance (measured by test loss) follows a power-law relationship with three key factors: model size (number of parameters), dataset size, and compute budget. But what's particularly fascinating is how these factors interact.</p><p>[<em>A quick note about test loss: you've probably heard the phrase or idea that LLMs are 'next token predictors'. Test loss just measures how good a model is at predicting the next token. The loss function in this paper is cross-entropy loss, which measures the difference between the model's predicted probabilities and the perfect predictions.</em>]</p><p>Focusing first on model size, the researchers found that making models bigger does consistently improve performance, but each doubling of model size gives you less and less benefit. Mathematically, they found that loss decreases as a power law with model size - specifically, <em>L(N) ∝ N<sup>−α<sub>N</sub></sup></em> with <em>α<sub>N</sub></em> ≈ 0.076, where <em>N</em> is the number of parameters. This means that to get the same amount of improvement in performance, you need to increase the model size by an exponentially larger amount each time; doubling <em>N</em>, for example, cuts the loss by only about 5%, since 2<sup>−0.076</sup> ≈ 0.95.</p>
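<p>A quick back-of-the-envelope calculation, using only the fitted exponent quoted above, shows just how steep the diminishing returns are:</p><pre><code>ALPHA_N = 0.076  # the paper's fitted exponent for model size

def loss_multiplier(scale):
    # Relative test loss after multiplying parameter count by `scale`,
    # assuming data and compute are not the bottleneck.
    return scale ** (-ALPHA_N)

print(loss_multiplier(2))   # ~0.949: each doubling buys only ~5%
print(loss_multiplier(10))  # ~0.84: 10x the parameters, ~16% lower loss
</code></pre>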
<p>Think of it like building a tower. At first, adding more blocks (parameters) makes the tower noticeably taller. But as the tower gets bigger, you need to add more and more blocks to achieve the same increase in height. Eventually, you might need to double the number of blocks just to gain a tiny bit more height.</p><figure><figcaption class="image-caption">A ladder would be helpful.</figcaption></figure><p>They also discovered that this diminishing returns curve isn't fixed. It actually depends on whether you have enough training data and compute power to effectively use those additional parameters. If you increase model size without proportionally increasing the amount of training data and compute, you hit diminishing returns much faster. It's like having all those blocks for your tower but not enough time or space to properly place them.</p><p>They derived specific formulas for these relationships:</p><ul><li><p>If you want to optimally scale your model, you should increase your dataset size roughly linearly with model size</p></li><li><p>Compute requirements should increase even faster - approximately with the square of the model size</p></li></ul><p>This explains why simply making bigger models isn't enough - you need to carefully balance all three factors to achieve optimal performance. It's a three-way trade-off between parameters, data, and compute.</p>
<p>Okay, okay. We all have heard about scale with LLMs. So, how does this all relate to RLHF? When researchers started applying RLHF, they discovered that the scaling law relationships became even more complex. Why?</p><p>In traditional language model training, we're just trying to predict the next token. But with RLHF, we're trying to do something more sophisticated - we're trying to get the model to generate text that humans will prefer. This involves three different components that each need to scale: the base language model, the reward model, and the RLHF training process itself.</p><p>What's tricky is that these components don't necessarily follow the same scaling laws. The reward model, for instance, often needs to be quite large to effectively judge the quality of outputs - but making it too large relative to the base model can lead to instability in training. Have you ever read an incredibly harsh review of a popcorn movie by an esteemed film critic? This is a little like that - if you have a critic who's more sophisticated than the artist they're trying to guide, and the critic's standards are too complex for the artist to understand and implement, the feedback becomes less useful.</p><p>This insight led to an important practice in RLHF: researchers typically make the reward model smaller than the base model, but still large enough to make meaningful judgments. This creates a sort of sweet spot - the reward model needs to be sophisticated enough to capture human preferences accurately, but not so complex that it makes the training process unstable.</p><p>Another fascinating discovery was that the compute requirements for RLHF scale differently than regular language model training. The PPO process we discussed earlier requires generating multiple variations of responses and evaluating them with the reward model. As models get larger, this process becomes computationally expensive very quickly - even more so than the square relationship we saw in the original scaling laws.</p><p><strong>Paper:</strong></p><ul><li><p><a href="https://arxiv.org/pdf/2001.08361">Kaplan et al. (2020), Scaling Laws for Neural Language Models.</a></p></li></ul><h2>Scaling RLHF: The InstructGPT Breakthrough</h2><p>So now we have an interesting puzzle: the scaling laws showed us that bigger models need exponentially more resources to improve, and RLHF adds even more complexity with its three-part training process. How could researchers make this work at scale without running into computational barriers or training instabilities? The answer came through careful experimentation and architectural innovation. OpenAI's paper (OpenAI were really cooking at this time) "Training Language Models to Follow Instructions with Human Feedback" (or the InstructGPT paper, as it's more commonly known) showed how to thread this needle, creating systems that were both powerful and aligned with human values. The researchers developed techniques to maintain training stability even with very large models, similar to how José Andrés can scale up recipes to feed hundreds of people in his charity work while still preserving the crucial balance of flavors.</p><p>Their first innovation was in how they collected human feedback. Instead of just asking humans to compare model outputs directly (as in the summarization paper), they created detailed rubrics for evaluators. 
These rubrics helped ensure consistency in how different human evaluators judged model outputs, which became increasingly important as they scaled up the amount of feedback needed.</p><figure><figcaption class="image-caption">InstructGPT's 3-Step Method</figcaption></figure><p>They also developed what they called "preference modeling" - a more sophisticated version of the reward modeling approach. Rather than training a reward model to predict a simple binary preference, they trained it to predict more nuanced human judgments across multiple criteria like helpfulness, truthfulness, and harmlessness. This created a more informative training signal for the model.</p><p>But perhaps their most important innovation was in how they handled the PPO training process at scale. They discovered that naive application of PPO to large language models could lead to what they called "policy collapse" - where the model would suddenly start generating very poor outputs. This happened because the optimization process could sometimes find ways to maximize the reward that didn't align with actual human preferences. Put another way, if you focus too much on maximizing the reward, the model can start to lose its general language capabilities. If you teach someone to cook only delicious hamburgers, they might get very good at that specific task but lose their ability to cook lasagna or make a sandwich.</p><p>To prevent this, they developed a technique called "PPO-ptx" (PPO with pretraining mixture). It works by combining two objectives during training:</p><ol><li><p>The reward maximization objective from standard PPO, which encourages the model to generate outputs that the reward model will score highly.</p></li><li><p>The original language modeling objective, which involves predicting the next token in a sequence of text.</p></li></ol><p>The key innovation is in how these objectives are balanced. During each training step, the model processes two types of data:</p><ul><li><p>Prompted sequences where it tries to generate helpful responses (optimized using the reward model).</p></li><li><p>Regular text sequences where it just tries to predict the next token (like its original pretraining).</p></li></ul><p>The researchers found that maintaining this dual training was necessary for stability. If the model spent too much time optimizing for rewards, it would start to generate repetitive or unnatural text. If it spent too much time on language modeling, it wouldn't learn to be helpful enough.</p>
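<p>Here's a minimal sketch of that blended objective (illustrative, not the paper's code; <code>gamma</code> stands in for the pretraining-loss coefficient the authors tune):</p><pre><code>import torch.nn.functional as F

def ppo_ptx_loss(ppo_loss, lm_logits, lm_targets, gamma=1.0):
    # Cross-entropy on ordinary pretraining text: the original
    # next-token prediction objective.
    lm_loss = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)), lm_targets.reshape(-1)
    )
    # Blend the RL objective with the language-modeling objective;
    # gamma weights how much the model keeps practicing plain prediction.
    return ppo_loss + gamma * lm_loss
</code></pre>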
<p>To finally switch analogies, you can think of it like teaching someone to become a technical writer. You want them to learn to write clear, precise documentation (the reward optimization), but you also want them to maintain their general writing abilities (the language modeling). PPO-ptx achieves this by alternating between focused practice on technical writing and general writing exercises.</p><p>Bringing this together with the scaling conversation, what made this approach particularly powerful was how it scaled with model size. Larger models have more capacity to maintain both their general language abilities and their specialized helpful behaviors. The PPO-ptx process helped them take advantage of this capacity without falling into the traps that pure reward optimization might create.</p><p>One final technique they utilized was careful parameter initialization and learning rate scheduling. When working with larger models, they found that the initial learning rate needed to be much smaller than what worked for smaller models. They gradually increased it according to a carefully designed schedule, similar to how you might start with simple exercises before attempting more complex ones when learning a new skill.</p><p>What made all of these techniques particularly effective was how they worked together. The improved feedback collection made the reward model more reliable, which in turn made the PPO training more stable, which allowed them to scale to larger models without losing control of the training process.</p><p><strong>Paper:</strong></p><ul><li><p><a href="https://arxiv.org/pdf/2203.02155">Ouyang et al. (2022), Training Language Models to Follow Instructions with Human Feedback.</a></p></li></ul><h2>The Sycophancy Problem: When Helpfulness Conflicts with Truth</h2><p>But recent research has challenged some of our assumptions about RLHF. In "Towards Understanding Sycophancy in Language Models," researchers at Anthropic pointed out fundamental limitations in how we gather and apply human feedback. The paper explores one of the challenges with RLHF: models can become "sycophantic," meaning they tend to agree with or defer to user statements even when those statements are incorrect. If you've used any of these models very often, you might have noticed this tendency. It's particularly interesting because it reveals a subtle flaw in how RLHF works.</p><p>Think back to how RLHF trains models using human feedback. The process assumes that by learning from human preferences, models will become more helpful and truthful. However, what the researchers discovered is that models can learn to optimize for agreement rather than truthfulness. 
It's similar to how a student might learn to agree with their teacher to get better grades, rather than developing genuine understanding.</p><figure><figcaption class="image-caption">Of course, you could never ever be wrong</figcaption></figure><p>The paper demonstrates this through careful experiments where the researchers show that models trained with RLHF tend to change their answers based on what they think the user wants to hear, rather than maintaining consistent, truthful responses. This "sycophancy" becomes more pronounced as models get larger and more sophisticated at predicting what kinds of responses humans might prefer.</p><p>The researchers designed their experiments around a clever core idea: they would present models with statements that contained clear factual errors, then observe how the models responded in different contexts. This allowed them to test whether models would maintain truthful responses or adapt their answers to agree with incorrect user statements.</p><p>Their primary experiment involved asking models questions in two different ways:</p><ul><li><p>A direct question (e.g., "What is the capital of France?")</p></li><li><p>The same question, but preceded by an incorrect statement (e.g., "I believe London is the capital of France. What is the capital of France?")</p></li></ul><p>What they found was revealing. When asked directly, models would typically give correct answers. But when the question was preceded by an incorrect statement, RLHF-trained models showed a concerning tendency to shift their answers to align with the user's incorrect belief. For instance, a model might correctly identify Paris as France's capital in the direct question, but then hedge or even agree that London is the capital when responding to the second format. For a more practical and potentially dangerous example, a model might politely reinforce a user's conspiratorial belief about a historical event rather than correct it, all to preserve a high approval rating.</p><p>To quantify this effect, they developed a "sycophancy score" - essentially measuring how much a model's answers changed when presented with incorrect user beliefs. They found that this score typically increased with model size, suggesting that larger models became more sophisticated at picking up and deferring to user beliefs, even incorrect ones.</p>
<p>The researchers then conducted variation experiments to understand what factors influenced this behavior. They found that:</p><ul><li><p>The effect was stronger when the incorrect statement was presented as a personal belief ("I believe...") rather than as a simple statement.</p></li><li><p>Models were more likely to become sycophantic about subjective topics than objective facts.</p></li><li><p>The sycophancy increased when models were explicitly instructed to be helpful and agreeable.</p></li></ul><p>What makes these findings particularly important is how they reveal a fundamental tension in RLHF training. The process of optimizing for human preferences can inadvertently create a pressure for models to agree with humans rather than maintain truthful responses. This happens because human feedback often rewards polite, agreeable responses, even though we ultimately want models that will truthfully correct misconceptions.</p><p><strong>Paper:</strong></p><ul><li><p><a href="https://arxiv.org/pdf/2310.13548">Sharma et al. (2023), Towards Understanding Sycophancy in Language Models.</a></p></li></ul><h2>Finding a Better Way: Direct Preference Optimization</h2><p>These challenges have spurred innovation. A particularly exciting development emerged in late 2023 with "Direct Preference Optimization" (DPO). This approach showed that language models themselves could serve as reward models, eliminating the need for a separate reward model entirely. It's like realizing that instead of needing a food critic to improve your cooking, you could develop a more refined palate yourself. (And he brings it back full circle.) DPO not only simplified the training process but also improved performance and reduced computational requirements.</p><p>Remember how traditional RLHF requires three components: the base model, a reward model, and the PPO training process? While this works, it's computationally expensive and can lead to issues like the sycophancy we just discussed. DPO takes a fundamentally different approach by reconceptualizing how we can learn from human preferences.</p><p>The key insight of DPO is that we can transform the problem of learning from preferences into a form of supervised learning. Instead of training a separate reward model and then using PPO to optimize the base model, DPO directly updates the model's parameters to make preferred outputs more likely and non-preferred outputs less likely.</p><p>To understand how this works, let's break it down by steps. In traditional RLHF, if humans prefer response A over response B, we would:</p><ul><li><p>Train a reward model to predict this preference.</p></li><li><p>Use PPO to gradually adjust the model to generate more A-like responses.</p></li><li><p>Carefully balance this optimization with the original language modeling objective (like in PPO-ptx).</p></li></ul><p>DPO instead derives a mathematical relationship that shows how we can directly adjust the model's parameters to reflect these preferences. It's like finding a shortcut in a maze - instead of carefully exploring many possible paths (PPO), we've found a direct route to our destination.</p>
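<p>The resulting objective is compact enough to sketch directly (a minimal rendering of the paper's loss; in practice the log-probabilities are summed over each response's tokens):</p><pre><code>import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Summed log-probs of the preferred (w) and rejected (l) responses
    # under the policy being trained and under a frozen reference model.
    policy_margin = logp_w - logp_l
    reference_margin = ref_logp_w - ref_logp_l
    # beta scales how strongly the policy may deviate from the reference.
    return -F.logsigmoid(beta * (policy_margin - reference_margin)).mean()
</code></pre>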
<p>What makes this particularly elegant is that it eliminates the need for the separate reward model entirely. The researchers showed that the language model itself implicitly contains the information needed to learn from preferences. Think about it this way: a language model already knows how to assign probabilities to different possible responses. DPO simply adjusts these probabilities to align with human preferences.</p><p>This approach has several important advantages:</p><ul><li><p>It's more computationally efficient since we don't need a separate reward model or complex PPO training.</p></li><li><p>It's more stable because we're not balancing multiple competing objectives.</p></li><li><p>It might be less prone to sycophancy because it's learning a more direct relationship between input and preferred output.</p></li></ul><p>On that last point: unlike traditional RLHF, where the reward model might learn to favor agreeable responses over truthful ones, DPO's direct optimization means the model learns to align its output probabilities with human preferences without an intermediary reward model that could inadvertently encourage people-pleasing. Because DPO doesn't rely on a separate learned reward that might accidentally reward 'pleasing the user,' it removes one source of potential reward exploitation. This more direct learning process could help maintain the model's commitment to truthfulness while still incorporating human feedback.</p><p>The researchers demonstrated that DPO could achieve similar or better results compared to RLHF while being significantly simpler to implement and train. This is particularly important as we think about scaling these systems - simpler training procedures are often more robust when applied to larger models.</p><p><strong>Paper:</strong></p><ul><li><p><a href="https://arxiv.org/pdf/2305.18290">Rafailov et al. (2023), Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.</a></p></li></ul><h2>Wrap-Up</h2><p>The evolution of RLHF reflects a broader pattern in AI development: initial breakthroughs often reveal new challenges, which in turn drive innovation. As we've scaled these systems, we've learned that effective alignment isn't just about gathering more human feedback or building bigger models - it's about finding more efficient and reliable ways to translate human preferences into AI behavior.</p><p>To finish off this intro series, I'll cover the open-source revolution, multimodal advances, mixture-of-experts (particularly en vogue right now with the DeepSeek releases) and other architectural innovations, generative agents (also en vogue, particularly with OpenAI's Operator release), and finish up with some discussion around Constitutional AI and alignment advances.</p><p>Hopefully this shed some light on a concept that is extremely important to the progress in this field. If there is anything specific you'd like me to write about because you think it's important and just don't quite understand it, I'm happy to hear from you.</p>
]]></content:encoded></item><item><title><![CDATA[How Did We Get Here? ]]></title><description><![CDATA[The Story of Generative AI&#8217;s Rise]]></description><link>https://www.aidecoded.ai/p/how-did-we-get-here</link><guid isPermaLink="false">https://www.aidecoded.ai/p/how-did-we-get-here</guid><dc:creator><![CDATA[Robert Ray]]></dc:creator><pubDate>Sat, 25 Jan 2025 01:43:51 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8a405a6c-7c12-4cfd-82c9-481911ce8534_2464x1856.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>A Pivot Point That Changed AI Forever</h1><p>There's a good chance you've heard the phrase, "Attention Is All You Need", even if you're not an AI researcher. It's remarkable that a scientific paper has achieved this level of common knowledge. A search on the open paper repository, <a href="https://arxiv.org/">arxiv.org</a> shows 499 computer science papers with "Is All You Need" in the title since the original's submission on June 12, 2017. The paper didn't just have a catchy title. It was a pivotal moment that introduced the <strong>Transformer</strong> architecture, laying the groundwork for every major generative AI breakthrough we&#8217;ve seen since. But how did we get to this point, and why was it such a big deal?</p><p>The goal of this newsletter is to explore advancements in artificial intelligence and examine how these innovations are transforming our world through their applications. Before we get started, I thought it would be helpful to establish where we're at. Let&#8217;s take a step back and trace the journey of generative AI&#8212;from a watershed moment that initially went unnoticed in the mainstream to the transformative technologies shaping our present.</p><h2>Before Transformers: The Early Days of NLP</h2><p>Before the advent of the Transformer, natural language processing (NLP) models leaned heavily on <strong>Recurrent Neural Networks</strong> (RNNs) and <strong>Long Short-Term Memory</strong> (LSTM) networks. RNNs are neural networks designed to process sequential data, such as time series or sentences, by maintaining a hidden state that captures information about previous elements in the sequence. LSTM networks improve upon RNNs by introducing mechanisms to better handle sequences and mitigate the problem of vanishing gradients. This issue occurs when gradients&#8212;the values used to adjust a model's parameters during training&#8212;become exceedingly small, effectively stalling the learning process. Picture trying to climb a hill with steps so tiny they barely move you upward; this illustrates the vanishing gradient problem (see image below). By introducing gates to manage the flow of information, LSTM networks reduced this issue, allowing for better performance. However, these architectures still struggled with modeling very long-range dependencies&#8212;imagine trying to summarize a novel while only being able to remember a sentence or two at a time. 
That fundamental challenge persisted until attention mechanisms revolutionized the field.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!lz8w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F836d5a79-c6ef-4c30-9b50-5b7b3b29babc_1218x890.heic" alt="Hill-climbing analogy for the vanishing gradient problem"><figcaption class="image-caption">Vanishing Gradient Analogy: The steeper parts of the hill represent areas where learning should be happening quickly (large gradients needed), but paradoxically, these are exactly the places where our steps become tiniest &#8211; just as in neural networks where early layers often need the most adjustment but receive the smallest gradients.</figcaption></figure></div>
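<p>You can see the vanishing gradient problem in miniature with a few lines of Python (my own toy illustration, not from any paper). Backpropagating through a long sequence multiplies one local derivative per time step, and if each is even slightly below 1, the product collapses:</p><pre><code># Backprop through a long sequence multiplies one local derivative per
# step. If each derivative is a little below 1, the product collapses.
derivative_per_step = 0.9  # e.g. the slope of a saturating activation

for steps in (10, 50, 100):
    gradient = derivative_per_step ** steps
    print(f"{steps:>3} steps: gradient ~ {gradient:.2e}")

# 10 steps: ~3.5e-01, 50 steps: ~5.2e-03, 100 steps: ~2.7e-05.
# Early time steps receive almost no learning signal, which is the
# problem LSTM gating was designed to mitigate.
</code></pre>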
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Vanishing Gradient Analogy: The steeper parts of the hill represent areas where learning should be happening quickly (large gradients needed), but paradoxically, these are exactly the places where our steps become tiniest &#8211; just as in neural networks where early layers often need the most adjustment but receive the smallest gradients.</figcaption></figure></div><p></p><p>The introduction of sequence-to-sequence (Seq2Seq) models with attention mechanisms (Bahdanau et al., 2014) was a game-changer. Attention mechanisms allow models to dynamically focus on the most relevant parts of an input sequence when producing an output. For example, in translation, instead of treating every word in a sentence equally, attention assigns higher importance to words that are more contextually relevant to the word being translated. This is achieved through "attention weights," which highlight key parts of the input, enabling the model to understand and generate outputs with greater precision. By incorporating attention, Seq2Seq models significantly improved tasks like translation and summarization, paving the way for further advancements. However, it was the Transformer architecture&#8217;s ability to fully harness attention mechanisms that truly redefined the field.</p><p><strong>Key Papers Before 2017:</strong></p><ul><li><p><a href="https://arxiv.org/pdf/1409.0473">Bahdanau et al. (2014), Neural Machine Translation by Jointly Learning to Align and Translate</a>: Introduced attention in Seq2Seq models.</p></li><li><p><a href="https://ieeexplore.ieee.org/abstract/document/6795963">Hochreiter &amp; Schmidhuber (1997), Long Short-Term Memory</a>: Pioneered LSTM networks, the backbone of many early NLP systems.</p></li></ul><h2>The Transformer Era Begins: &#8220;Attention Is All You Need&#8221; (2017)</h2><p>In 2017, eight researchers at Google unleashed the Transformer architecture, a groundbreaking innovation that revolutionized AI by relying entirely on attention mechanisms. Unlike Seq2Seq models, which use recurrent connections that process inputs sequentially, Transformers use self-attention to process all elements of an input sequence simultaneously. This design allows for significant parallelization during training, enabling faster computation and more efficient utilization of hardware resources. 
In Seq2Seq models, information is passed step-by-step, which can create bottlenecks, especially with long sequences. Transformers eliminate this inefficiency by calculating attention weights across the entire input at once.</p><p>Attention mechanisms themselves work by assigning a weight to each part of the input sequence, allowing the model to focus dynamically on the most relevant tokens while generating an output. This not only improves the handling of long-range dependencies, such as understanding the relationship between words far apart in a sentence, but also enhances interpretability, as attention weights provide a visualizable map of what the model prioritizes.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!wgKz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eb62693-c1b6-4ca9-bdf6-d343f7e1659d_1204x1184.heic" alt="Diagram of the Transformer model architecture"><figcaption class="image-caption">The Transformer - model architecture</figcaption></figure></div>
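<p>To make "attention weights" concrete, here's a minimal, hand-rolled sketch of the scaled dot-product attention at the heart of the architecture above &#8211; single head, no masking, NumPy only; a simplification of the real thing:</p><pre><code># Minimal single-head scaled dot-product attention (no masking):
# softmax(Q K^T / sqrt(d_k)) V, per Attention Is All You Need.
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    # Similarity of every query to every key: (seq_len, seq_len).
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over keys turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted average of the value vectors.
    return weights @ V, weights

seq_len, d_k = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
out, w = attention(Q, K, V)
print(w.round(2))  # each row sums to 1: where each token "looks"
</code></pre><p>Every token attends to every other token in a single matrix multiply, which is exactly the parallelism described above.</p>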
<p>The introduction of the first Transformer model was a pivotal moment in AI research&#8212;suddenly, the challenge of handling long-range dependencies became tractable. Transformers could efficiently process large volumes of text, retain key information, and generate coherent, creative content. This innovation paved the way for what we now recognize as Generative AI. The concept of self-attention not only improved the performance of language models but also inspired a wave of research into other attention-based architectures, leading to rapid advancements in natural language understanding and generation.</p><p><strong>Paper:</strong></p><ul><li><p><a href="https://arxiv.org/pdf/1706.03762">Vaswani et al. (2017), Attention Is All You Need.</a></p></li></ul><h2>The Rise of Language Transformers (2018&#8211;2020)</h2><h3>Generative Pre-Training: GPT and GPT-2</h3><p>OpenAI&#8217;s GPT series kicked off in 2018 with the paper 'Improving Language Understanding by Generative Pre-Training.' This work introduced a two-stage training process:</p><ol><li><p>An unsupervised pre-training stage, where the model learned contextual representations by predicting the next word in vast unlabeled text corpora.</p></li><li><p>A supervised fine-tuning stage, where the model was tailored to specific tasks using labeled data.</p></li></ol><p>This two-step approach unlocked the potential of transfer learning in NLP, a concept akin to teaching a student the basics of a subject before asking them to tackle specific problems.
In the unsupervised stage, the model builds a broad, foundational understanding by predicting the next word across a diverse range of unlabeled text. This is like learning a language by reading a vast library of books. In the supervised fine-tuning stage, the model hones its skills for specific tasks. During fine-tuning, most of the model&#8217;s weights are preserved while only certain layers are adjusted (a minimal sketch of this pattern follows the papers below). This is similar to how a student who has read widely across many subjects (pre-training) can then focus on the specific terminology and frameworks of a particular field (fine-tuning), building on existing knowledge rather than starting from scratch. The implications of transfer learning are profound: it allows a single pre-trained model to generalize effectively to numerous tasks, reducing the reliance on large labeled datasets for every new application. Transfer learning can sometimes enable models to perform well with just dozens or hundreds of labeled examples, whereas training from scratch might require millions. This versatility and efficiency have been pivotal in advancing the field of NLP.</p><p>In 2019, 'Language Models are Unsupervised Multitask Learners' built on this foundation, showing that the sheer size and diversity of the training dataset enabled GPT-2 to learn multiple tasks simultaneously without task-specific fine-tuning. Remarkably, GPT-2 could generate coherent text, summarize content, and even translate languages&#8212;all emergent abilities stemming from its exposure to diverse data. This scaling approach revealed a key insight: as models grow in size and their training datasets increase in diversity, they can generalize across a wide array of tasks without direct supervision. It&#8217;s important to note that the Transformer architecture is key to this actually working: its attention mechanism is what lets the model make use of its increased capacity and broader training data.</p><p>This scaling phenomenon is governed by mathematical 'scaling laws,' which reveal predictable power-law relationships between a model's size, training data, compute resources, and performance. As models grow larger and train on more diverse datasets, they not only become more capable at existing tasks but also develop unexpected abilities that weren't explicitly trained for. These emergent behaviors &#8211; from reasoning to coding to creative writing &#8211; suggest that scaling isn't merely about improving efficiency, but about fundamentally changing how AI systems learn and generalize. This understanding has sparked systematic research into the nature of AI capabilities, raising profound questions about the relationship between scale and intelligence, while simultaneously driving substantial investments in computational infrastructure to support ever-larger models.</p><p><strong>Papers:</strong></p><ul><li><p><a href="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf">Radford et al. (2018), Improving Language Understanding by Generative Pre-Training.</a></p></li><li><p><a href="https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">Radford et al. (2019), Language Models are Unsupervised Multitask Learners.</a></p></li></ul>
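<p>As promised, here's a minimal PyTorch-style sketch of the "preserve most weights, adjust a few layers" fine-tuning pattern. The names and sizes are illustrative stand-ins of my own, not the actual GPT training code:</p><pre><code># Illustrative transfer-learning pattern: freeze the pre-trained
# weights and train only a small, newly added task head.
import torch.nn as nn
from torch.optim import AdamW

def prepare_for_finetuning(pretrained_model: nn.Module, num_labels: int):
    # Freeze every pre-trained parameter.
    for param in pretrained_model.parameters():
        param.requires_grad = False

    # Attach a new, randomly initialized task-specific head.
    # `hidden_size` is a stand-in for the backbone's output width.
    hidden_size = 768
    head = nn.Linear(hidden_size, num_labels)
    model = nn.Sequential(pretrained_model, head)

    # Only the head's parameters receive gradient updates.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = AdamW(trainable, lr=3e-5)
    return model, optimizer
</code></pre>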
<h3>Understanding Context: BERT</h3><p>In 2018, Google's BERT model introduced a novel bidirectional Transformer encoder that revolutionized how AI systems process language. Unlike previous approaches that either processed text directionally or combined separate analyses of each direction, BERT developed unified contextual understanding by considering all words simultaneously. This was achieved through masked language modeling, where the model learns to predict hidden words using context from both directions, enabling a richer and more nuanced understanding of language.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!sWJq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f1aa91-841c-453b-9e85-5452827f1fa1_1168x578.heic" alt="Encoder vs. decoder architecture diagram"><figcaption class="image-caption">Encoder vs. Decoder Architecture</figcaption></figure></div>
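<p>Masked language modeling is easy to poke at yourself. Here's a quick sketch using the Hugging Face transformers library (my tooling choice for illustration &#8211; not something from the BERT paper):</p><pre><code># Masked language modeling demo with a pre-trained BERT.
# Requires: pip install transformers torch
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses context on BOTH sides of [MASK] to rank candidates.
for candidate in unmasker("The capital of France is [MASK].", top_k=3):
    print(candidate["token_str"], round(candidate["score"], 3))
</code></pre>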
<p>Unlike decoder models, which excel at generative tasks by predicting what comes next in a sequence, encoder models like BERT specialize in developing deep contextual representations of input text. This architectural choice made BERT particularly powerful for tasks requiring comprehensive language understanding, such as classification, question answering, and information retrieval. The success of BERT's approach demonstrated how different architectural designs could be optimized for specific types of language tasks, helping establish distinct families of models within the AI landscape.</p><p><strong>Paper:</strong></p><ul><li><p><a href="https://arxiv.org/pdf/1810.04805">Devlin et al. (2018), BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.</a></p></li></ul><h2>The Era of Large-Scale Models (2020&#8211;2021)</h2><h3>GPT-3: Scaling Up</h3><p>OpenAI's GPT-3 marked a watershed moment in 2020. With 175 billion parameters, it demonstrated how scaling up language models could unlock remarkable new capabilities. Most notably, it showed powerful in-context learning abilities, where the model could understand and execute tasks simply by seeing examples in its input prompt, without any fine-tuning or gradient updates. This flexibility meant a single model could tackle diverse tasks with minimal setup.</p>
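<p>To give a feel for what "examples in the prompt" means, a few-shot prompt looks something like this (adapted from the style of the examples in the GPT-3 paper; the model would ideally continue with "fromage"):</p><pre><code>Translate English to French:

sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>
</code></pre>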
<p>The engineering achievements behind GPT-3 were equally significant. Despite its unprecedented size, the model maintained computational efficiency through careful optimization of its attention mechanisms and overall architecture. This efficient scaling enabled the model to process longer sequences and handle more complex tasks while remaining practically deployable.</p><p>GPT-3 demonstrated three increasingly sophisticated learning scenarios:</p><ol><li><p><strong>Zero-shot learning</strong>: Performing tasks with just a natural language instruction, without any examples</p></li><li><p><strong>One-shot learning</strong>: Learning from a single example to understand and replicate patterns</p></li><li><p><strong>Few-shot learning</strong>: Using several examples to grasp more complex patterns and nuances</p></li></ol><p>These capabilities revealed how large language models could not only absorb knowledge from their training data but also apply it flexibly to new situations. Perhaps most importantly, GPT-3's success provided concrete evidence that scaling up model size and training data could unlock emergent abilities &#8211; capabilities that weren't explicitly trained for but arose naturally from the model's increased capacity and exposure to diverse data. This insight has profoundly influenced both research directions and industrial investment in AI development.</p><p><strong>Paper:</strong></p><ul><li><p><a href="https://arxiv.org/pdf/2005.14165">Brown et al. (2020), Language Models are Few-Shot Learners.</a></p></li></ul><h3>Multi-Modal Extensions</h3><p>While language models had made remarkable progress in understanding and generating text, the world of human experience encompasses far more than written words. The next frontier in AI development was to create models that could understand and work with multiple forms of information &#8211; particularly the vital combination of text and images that humans process so naturally. This expansion toward multimodal AI marked the beginning of a broader revolution in generative AI, where models would learn not just to understand different types of media, but to create them in increasingly sophisticated ways. Two papers in 2021 marked crucial early steps in this direction, demonstrating how principles developed for language models could be adapted to handle visual information, while also laying the groundwork for future innovations in image generation.</p><p>CLIP (2021) represented a breakthrough in connecting vision and language understanding. Rather than training models to recognize specific image categories, CLIP learned to understand the relationship between images and natural language descriptions. This was achieved through contrastive learning, where the model learned to match images with their corresponding text descriptions from a large dataset of image-caption pairs. The result was remarkably flexible: CLIP could classify images into any category simply by comparing them to text descriptions, without needing specialized training data for each new task. For instance, if you wanted CLIP to identify pictures of zebras, you wouldn't need a dataset of zebra images &#8211; you'd simply provide the word "zebra" and let CLIP match images to that description.</p>
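<p>Here's roughly what that zebra example looks like in code, sketched with the Hugging Face CLIP wrappers (the tooling and the file name are my own illustrative choices):</p><pre><code># Zero-shot image classification with CLIP: no zebra training set,
# just text prompts. Requires: pip install transformers torch pillow
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a zebra", "a photo of a horse", "a photo of a dog"]
image = Image.open("some_animal.jpg")  # any local image

inputs = processor(text=labels, images=image,
                   return_tensors="pt", padding=True)
# Image and captions are embedded in a shared space; similarity
# between the image and each caption becomes a classification logit.
probs = model(**inputs).logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(label, round(p, 2))
</code></pre>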
<p>DALL&#183;E (2021) took this visual-language connection a step further by generating images from text descriptions. The key insight was treating images as sequences of tokens, similar to how language models process words. Just as GPT models learn to predict the next word in a sequence, DALL&#183;E learned to predict the visual tokens that would create an image matching a given text description. This required solving several technical challenges, including developing an efficient way to compress images into tokens and training a Transformer model to handle the much longer sequences that result from tokenized images. DALL&#183;E demonstrated that the principles underlying large language models could extend beyond text, opening new possibilities for AI-powered creative tools.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!L90l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13282062-6bf6-4451-bb49-065496d04aee_2106x758.heic" alt="Early DALL·E sample generations"><figcaption class="image-caption">Initial DALL&#183;E generations. Things have gotten&#8230;better.</figcaption></figure></div><p>Together, these models showed how the Transformer architecture could be adapted for multimodal tasks, laying the groundwork for more sophisticated AI systems that could seamlessly work with both text and images.</p><p><strong>Papers:</strong></p><ul><li><p><a href="https://arxiv.org/pdf/2103.00020">Radford et al. (2021), Learning Transferable Visual Models From Natural Language Supervision.</a></p></li><li><p><a href="https://arxiv.org/pdf/2102.12092">Ramesh et al. (2021), Zero-Shot Text-to-Image Generation.</a></p></li></ul><h2>Diffusion Models and Image Generation (2021&#8211;2022)</h2><p>While Transformers dominated NLP, diffusion models emerged as a powerful new force in image generation by approaching the problem from a fundamentally different angle.</p><p>DDPM (Denoising Diffusion Probabilistic Models) introduced an elegant solution inspired by thermodynamics: instead of trying to generate images directly, the model learns to reverse the process of gradually adding noise to images. During training, the model observes images as they transform from clear to increasingly noisy states, learning the patterns of how structure breaks down into randomness. Then, for generation, it applies this knowledge in reverse &#8211; starting with pure noise and gradually refining it into a coherent image, much like watching a photograph slowly emerge in a darkroom developing tray.</p><p>The key insight comes from a physical process called "diffusion" (which is where these models get their name) and from the second law of thermodynamics.
In thermodynamics, systems naturally tend to move from ordered to disordered states &#8211; think about how a drop of ink naturally spreads out (diffuses) in water, or how heat flows from hot regions to cold ones until everything reaches the same temperature. This process of moving from order to disorder is essentially irreversible without adding energy to the system.</p><p>The brilliant insight of diffusion models is that they mathematically formalize this process and then learn to reverse it. Here's how the parallel works:</p><ol><li><p>In thermodynamics:</p><ol><li><p>A system naturally moves from ordered states (like a clear image) to disordered states (like random noise)</p></li><li><p>This process follows well-understood mathematical principles</p></li><li><p>The process is gradual and predictable</p></li><li><p>Reversing it requires energy and precise control</p></li></ol></li><li><p>In diffusion models:</p><ol><li><p>Training images are gradually corrupted with noise, following a carefully designed schedule</p></li><li><p>This forward process is mathematically defined to be similar to natural diffusion</p></li><li><p>The model learns to estimate and reverse each tiny step of this noise-adding process</p></li><li><p>During generation, the model provides the "energy" to reverse the disorder, step by step</p></li></ol></li></ol><p>The mathematical framework used in diffusion models (specifically, the forward process) is directly inspired by <a href="https://en.wikipedia.org/wiki/Langevin_dynamics">Langevin dynamics</a> and the <a href="https://en.wikipedia.org/wiki/Fokker&#8211;Planck_equation">Fokker-Planck equation</a> from statistical physics, which describe how particles diffuse through space over time.</p><p>This iterative denoising process addressed key challenges that had plagued earlier approaches like GANs (Generative Adversarial Networks). Where GANs attempted to generate images in a single step &#8211; leading to issues like mode collapse, where the model would get stuck generating a limited variety of images &#8211; diffusion models took a more measured approach. By breaking image generation into hundreds of small denoising steps, the model gained unprecedented control over the generation process. Each step only needed to remove a tiny amount of noise, making the overall task more manageable and leading to higher quality results.</p><p>The success of DDPMs marked a crucial shift in how we approach image generation. Their probabilistic framework provided both mathematical elegance and practical benefits: more stable training, better image diversity, and fewer visual artifacts. These advantages would soon make diffusion models the foundation for groundbreaking applications like DALL&#183;E 2 and Stable Diffusion, setting new standards for generating photorealistic images.</p>
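<p>For the mathematically curious, here's a tiny sketch of the forward (noise-adding) process in its standard DDPM closed form, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, using the paper's linear noise schedule; the "image" is just a stand-in array:</p><pre><code># Forward diffusion in closed form (DDPM-style): jump straight from a
# clean image x0 to its noised version at step t. Illustrative sketch.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # noise variance added per step
alphas_bar = np.cumprod(1.0 - betas)    # cumulative fraction of signal kept

def noisy_at_step(x0, t, rng):
    eps = rng.normal(size=x0.shape)     # fresh Gaussian noise
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1, 1, size=(8, 8))    # stand-in "image"
for t in (0, 499, 999):
    xt = noisy_at_step(x0, t, rng)
    print(f"t={t:>3}: signal fraction ~ {np.sqrt(alphas_bar[t]):.3f}")
</code></pre><p>A neural network is then trained to predict the added noise from the noisy image and the step index &#8211; that's the "learning to reverse each tiny step" described above.</p>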
<p>Stable Diffusion helped usher in AI democratization when Stability AI made the decision to open-source their state-of-the-art text-to-image model. While the underlying diffusion process remained similar to DDPM, Stable Diffusion introduced important technical innovations that made it practical for widespread use. Most notably, it performed the diffusion process in a compressed "latent space" rather than on full-sized images, dramatically reducing the computational resources needed for generation. This meant that, for the first time, high-quality AI image generation could run on consumer-grade hardware rather than requiring expensive GPU clusters.</p><p>To conceptualize the meaning of &#8220;latent space diffusion&#8221;, first think about how much information is actually needed to describe an image meaningfully. While an image might be stored as millions of individual pixels, much of that data is redundant or not essential to understanding what the image represents. For instance, if I show you a picture of a cat, your brain doesn't process every single pixel &#8211; it captures the important features like the shape of the ears, the texture of the fur, and the overall pose.</p><p>This is where latent space comes in. The term "latent" means hidden or underlying, and a latent space is essentially a compressed representation that captures the fundamental features of something while discarding unnecessary details. Think of it like a highly efficient shorthand for describing images.</p><p>Stable Diffusion uses what's called a variational autoencoder (VAE) to create this efficient representation. The VAE first learns to compress images into this latent space. If a standard image might be 1024x1024 pixels (over a million numbers), its latent representation might be just 64x64 (about four thousand numbers). This compression preserves the important features while discarding redundant information. It's similar to how a skilled artist can capture the essence of a scene with just a few well-placed brush strokes rather than painting every detail.</p><p>Now, instead of running the diffusion process on the full-sized image (which would be computationally expensive), Stable Diffusion performs the diffusion in this compressed latent space. It's like planning a painting by sketching with broad strokes before adding fine details. The model learns to denoise these compressed representations, which requires far less computational power than working with full-sized images.</p><p>When it's time to generate the final image, the VAE's decoder converts the cleaned-up latent representation back into a full-sized image, adding back appropriate fine details in the process. This is similar to how an artist might start with a rough sketch and then gradually refine it into a detailed painting.</p><p>This innovation is what enabled the democratization of AI image generation. Without it, running these models would require expensive hardware beyond the reach of most users. With latent space diffusion, even a modest consumer GPU can generate high-quality images in seconds.</p><p>This accessibility, combined with the decision to open-source the model, sparked a wave of community-driven innovation. Developers and researchers worldwide could now not only use the model but also experiment with it, modify it, and adapt it for new purposes. The impact was immediate and far-reaching: artists integrated it into their creative workflows, developers built user-friendly interfaces and plugins, and researchers used it as a foundation for pushing the boundaries of what AI could create.</p><p>The open-source nature of Stable Diffusion created a virtuous cycle of innovation: as more people worked with the model, they discovered new techniques for fine-tuning it to specific artistic styles, improving image quality, and reducing unwanted biases in the generated content. These improvements were shared back with the community, leading to a rapidly evolving ecosystem of tools and techniques.
This collaborative approach demonstrated the power of open-source development in AI, showing how shared knowledge and collective effort could accelerate progress far beyond what any single organization might achieve alone.</p><p><strong>Papers:</strong></p><ul><li><p><a href="https://arxiv.org/pdf/2006.11239">Ho et al. (2020), Denoising Diffusion Probabilistic Models.</a></p></li><li><p><a href="https://arxiv.org/pdf/2112.10752">Rombach et al. (2022), High-Resolution Image Synthesis with Latent Diffusion Models.</a></p></li></ul><h2>Large Language Models Get Chatty (2022&#8211;2023)</h2><h3>ChatGPT: Conversations Made Easy</h3><p>In late 2022, OpenAI's launch of ChatGPT marked an evolution in AI development by addressing a fundamental challenge: while large language models like GPT-3 possessed remarkable capabilities, they weren't naturally inclined to be helpful, truthful, or safe in their interactions with humans. ChatGPT tackled this challenge through Reinforcement Learning from Human Feedback (RLHF), a technique that changed how we align AI systems with human values and preferences.</p><p>RLHF represented a shift from traditional training approaches. Instead of just learning patterns from data, the model learned from human judgments about what constitutes good or helpful behavior. This process worked in three stages:</p><ol><li><p>First, human trainers demonstrated desired responses to various prompts, creating a dataset of preferred behaviors.</p></li><li><p>Next, human trainers ranked candidate model responses, and these comparisons were used to train a reward model that could evaluate responses based on their alignment with human preferences.</p></li><li><p>Finally, the language model was fine-tuned using reinforcement learning, gradually adjusting its behavior to maximize these human-aligned rewards.</p></li></ol><p>This approach helped create an AI system that was powerful <em>and</em> more reliable, helpful, and attuned to human needs. (A toy sketch of the stage-two preference loss appears at the end of this section.)</p><p>The transformation of a raw language model into ChatGPT required solving numerous practical challenges beyond RLHF. Engineers had to develop systems for maintaining coherent conversations over multiple turns, handle edge cases where the model might produce inappropriate responses, and scale the infrastructure to support millions of simultaneous users. These technical achievements were important &#8211; and, in my opinion, a bit underappreciated &#8211; in bridging the gap between laboratory AI and practical applications.</p><p>ChatGPT's release proved transformative for public perception of AI technology. By providing an intuitive interface to advanced AI capabilities, it demonstrated the practical value of language models in everyday tasks &#8211; from writing and analysis to coding and creative work. This accessibility sparked widespread adoption beyond the traditional technical audience, integrating AI into daily workflows across professions and disciplines. Moreover, ChatGPT's launch catalyzed broader societal discussions about AI's potential impact, raising important questions about education, workplace automation, and the future relationship between humans and AI systems.</p><p>If you&#8217;re reading this newsletter, I&#8217;d be shocked if you haven&#8217;t used it yourself.</p>
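<p>As promised, here's a toy sketch of the pairwise loss typically used to train such a reward model &#8211; a simplified Bradley&#8211;Terry-style objective of my own rendering, not OpenAI's code:</p><pre><code># Toy reward-model objective: the reward of the human-preferred
# response should exceed the reward of the rejected one.
import torch
import torch.nn.functional as F

def reward_model_loss(rewards_chosen, rewards_rejected):
    # Probability the model agrees with the human ranking is
    # sigmoid(r_chosen - r_rejected); we maximize its log.
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()

# Example: scalar rewards for 3 (chosen, rejected) response pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.5, 0.9, 1.1])
print(reward_model_loss(chosen, rejected))  # lower when chosen wins
</code></pre><p>Stage three then runs reinforcement learning (PPO) against this learned reward signal.</p>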
<h2>What&#8217;s Next?</h2><p>I had initially planned on introducing some further advancements in this space to bring us closer to the present, but I don&#8217;t want the length to be too overwhelming. Instead, there will be a part 2 of this &#8220;Generative AI history&#8221; lesson coming in a follow-up.</p><p>After that, I&#8217;ll begin writing up breakdowns of more recent papers and advancements on Tuesdays, discuss applications and products on Fridays, and periodically perform demos, discuss industry trends, and provide other AI miscellany at more random intervals.</p><p>As a note, I could have included many other papers in this Part 1, but I tried to select what I thought were the most representative papers in advancing the field over the period of 2017&#8211;2023. If there are others that you would have liked to have seen included, drop me a line and let me know.</p><p>Thanks for reading, and if you haven&#8217;t yet, I&#8217;d love for you to subscribe.</p><p><a class="button primary" href="https://www.aidecoded.ai/subscribe?">Subscribe now</a></p>]]></content:encoded></item></channel></rss>