

Length: 58 minutes
Released: May 30, 2024
Format: Podcast episode

Description

<150 Early Bird tickets left for the AI Engineer World’s Fair in SF! Prices go up soon. Note that there are 4 tracks per day and dozens of workshops/expo sessions; the livestream will air <30% of the content this time. Basically, you should really come if you don’t want to miss out on the most stacked speaker list/AI expo floor of 2024. Apply for free/discounted Diversity Program and Scholarship tickets here. We hope to make this the definitive technical conference for ALL AI engineers.

Exactly a year ago, we declared the Beginning of Context=Infinity when Mosaic made their breakthrough training an 84k token context MPT-7B.

A Brief History of Long Context

Of course, right when we released that episode, Anthropic fired the starting gun proper with the first 100k context window model from a frontier lab, spawning smol-developer and other explorations. In the last 6 months, the fight (and the context lengths) has intensified by another order of magnitude, kicking off the "Context Extension Campaigns" chapter of the Four Wars:

* In October 2023, Claude's 100,000-token window was still SOTA (we still use it for Latent Space’s show notes to this day).
* On November 6th, OpenAI launched GPT-4 Turbo with 128k context.
* On November 21st, Anthropic fired back, extending Claude 2.1 to 200k tokens.
* Feb 15 (the day everyone launched everything) was Gemini's turn, announcing the first LLM with a 1 million token context window.
* In May 2024 at Google I/O, Gemini 1.5 Pro announced a 2m token context window.

In parallel, open source/academia had to fight its own battle to keep up with the industrial cutting edge. Nous Research famously turned a Reddit comment into YaRN, extending Llama 2 models to 128k context. So when Llama 3 dropped, the community was ready, and just weeks later, we had Llama 3 with 4M+ context!

A year ago we didn’t really have an industry-standard way of measuring context utilization either: it’s all well and good to technically make an LLM generate non-garbage text at 1m tokens, but can you prove that the LLM actually retrieves and attends to information inside that long context? Greg Kamradt popularized the Needle In A Haystack chart, which is now a necessary (if insufficient) benchmark, and it turns out we’ve solved that too in open source.

Today's guest, Mark Huang, is the co-founder of Gradient, where they are building a full-stack AI platform to power enterprise workflows and automations. They are also the team behind the first 1M+ and 4M+ context window finetunes of Llama 3.

Long Context Algorithms: RoPE, ALiBi, and Ring Attention

Positional encodings allow the model to understand the relative position of tokens in the input sequence, and are present in what (upcoming guest!) Yi Tay affectionately calls the OG “Noam architecture”. But if we want to increase a model’s context length, these encodings need to gracefully extrapolate to longer sequences.

ALiBi, used in models like MPT (see our "Context=Infinity" episode with the MPT leads, Jonathan Frankle and Abhinav), was one of the early approaches to this space. It lets the context window stretch as the model sees longer sequences, using a penalty on the attention score between two positions that grows linearly with their distance; the further apart two tokens are, the larger the penalty. Of course, this isn’t going to work for use cases that actually require global attention across a long context.
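To make that penalty concrete, here is a minimal PyTorch sketch of the ALiBi bias. The function name and the bidirectional (absolute-distance) form are our own illustration, not MPT's actual code; the per-head slope formula follows the common power-of-two convention from the ALiBi paper.

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Illustrative sketch: build the ALiBi additive bias, a penalty on
    attention logits that grows linearly with query-key distance."""
    # Per-head slopes, geometrically spaced (standard choice when num_heads
    # is a power of two); each head penalizes distance at a different rate.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # Relative distance between every query position i and key position j.
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs()  # (seq_len, seq_len)
    # Bias shape (num_heads, seq_len, seq_len); added to attention logits before softmax.
    return -slopes[:, None, None] * distance[None, :, :]

# Usage sketch: scores = q @ k.transpose(-2, -1) / d_k**0.5 + alibi_bias(n_heads, seq_len)
```

Because the bias depends only on relative distance, the same function applies at any sequence length, which is what gives ALiBi its (soft) ability to stretch beyond the training window.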
In more recent architectures and finetunes, RoPE (Rotary Position Embedding) is the more commonly used encoding, and it is also what Llama 3 is based on. RoPE uses a rotation matrix to encode positions, which empirically performs better for longer sequences.

The main innovation from Gradient was to focus on tuning the theta hyperparameter that governs the frequency of the rotational encoding.

Audio note: if you want the details, jump to 15:55 in the podcast (or scroll down to the transcript!)

By carefully increasing theta as the context length grew, they were able to scale Llama 3 up to 1 million tokens and potentially beyond. Once you've scaled positional embeddings,
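As a rough illustration of where theta enters the math (a sketch, not Gradient's actual training recipe), here is a minimal PyTorch version of the rotary encoding. The function names are ours, the channel layout is simplified relative to Llama's real implementation, and the 500,000 default mirrors Llama 3's reported base frequency, which long-context finetunes raise further.

```python
import torch

def rope_frequencies(head_dim: int, seq_len: int, theta: float = 500_000.0) -> torch.Tensor:
    """Precompute rotation angles. `theta` is the base frequency: raising it
    lengthens the wavelengths, so far-apart positions still get distinct,
    slowly varying rotations (the knob tuned in long-context finetunes)."""
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    return torch.outer(positions, inv_freq)  # (seq_len, head_dim // 2)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate channel pairs of q or k by a position-dependent angle.
    x: (..., seq_len, head_dim); angles: (seq_len, head_dim // 2).
    Layout is simplified; applying it identically to q and k preserves the
    relative-position property of the attention dot product."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Raising theta lowers the rotation frequencies, so positions hundreds of thousands of tokens apart still map to distinct angles instead of wrapping around; that is the lever the Gradient team describes tuning as they extended the window.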

Titles in the series (75)

The podcast by and for AI Engineers! We are the first place where over 50k developers hear news and interviews about Software 3.0 - Foundation Models changing every domain in Code Generation, Computer Vision, AI Agents, and more, directly from the founders, builders, and thinkers involved in pushing the cutting edge. Striving to give you everything from the definitive take on the Current Thing to the first introduction to the tech you'll be using in the next 3 months! We break news and exclusive interviews from tiny (George Hotz), Databricks, Glean, Replit, Roboflow, MosaicML, UC Berkeley, OpenAI, and more. www.latent.space