51 min listen
ICLR 2024 — Best Papers & Talks (Benchmarks, Reasoning & Agents) — ft. Graham Neubig, Aman Sanger, Moritz Hardt)
ICLR 2024 — Best Papers & Talks (Benchmarks, Reasoning & Agents) — ft. Graham Neubig, Aman Sanger, Moritz Hardt)
ratings:
Length:
269 minutes
Released:
Jun 10, 2024
Format:
Podcast episode
Description
Our second wave of speakers for AI Engineer World’s Fair were announced! The conference sold out of Platinum/Gold/Silver sponsors and Early Bird tickets! See our Microsoft episode for more info and buy now with code LATENTSPACE.This episode is straightforwardly a part 2 to our ICLR 2024 Part 1 episode, so without further ado, we’ll just get right on with it!Timestamps[00:03:43] Section A: Code Edits and Sandboxes, OpenDevin, and Academia vs Industry — ft. Graham Neubig and Aman Sanger* [00:07:44] WebArena* [00:18:45] Sotopia* [00:24:00] Performance Improving Code Edits* [00:29:39] OpenDevin* [00:47:40] Industry and Academia[01:05:29] Section B: Benchmarks* [01:05:52] SWEBench* [01:17:05] SWEBench/SWEAgent Interview* [01:27:40] Dataset Contamination Detection* [01:39:20] GAIA Benchmark* [01:49:18] Moritz Hart - Science of Benchmarks[02:36:32] Section C: Reasoning and Post-Training* [02:37:41] Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection* [02:51:00] Let’s Verify Step By Step* [02:57:04] Noam Brown* [03:07:43] Lilian Weng - Towards Safe AGI* [03:36:56] A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis* [03:48:43] MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework[04:00:51] Bonus: Notable Related Papers on LLM CapabilitiesSection A: Code Edits and Sandboxes, OpenDevin, and Academia vs Industry — ft. Graham Neubig and Aman Sanger* Guests* Graham Neubig* Aman Sanger - Previous guest and NeurIPS friend of the pod!* WebArena * * Sotopia (spotlight paper, website)* * Learning Performance-Improving Code Edits* OpenDevin* Junyang Opendevin* Morph Labs, Jesse Han* SWE-Bench* SWE-Agent* Aman tweet on swebench* LiteLLM* Livecodebench* the role of code in reasoning* Language Models of Code are Few-Shot Commonsense Learners* Industry vs academia* the matryoshka embeddings incident* other directions* UnlimiformerSection A timestamps* [00:00:00] Introduction to Guests and the Impromptu Nature of the Podcast* [00:00:45] Graham's Experience in Japan and Transition into Teaching NLP* [00:01:25] Discussion on What Constitutes a Good Experience for Students in NLP Courses* [00:02:22] The Relevance and Teaching of Older NLP Techniques Like Ngram Language Models* [00:03:38] Speculative Decoding and the Comeback of Ngram Models* [00:04:16] Introduction to WebArena and Zotopia Projects* [00:05:19] Deep Dive into the WebArena Project and Benchmarking* [00:08:17] Performance Improvements in WebArena Using GPT-4* [00:09:39] Human Performance on WebArena Tasks and Challenges in Evaluation* [00:11:04] Follow-up Work from WebArena and Focus on Web Browsing as a Benchmark* [00:12:11] Direct Interaction vs. Using APIs in Web-Based Tasks* [00:13:29] Challenges in Base Models for WebArena and the Potential of Visual Models* [00:15:33] Introduction to Zootopia and Exploring Social Interactions with Language Models* [00:16:29] Different Types of Social Situations Modeled in Zootopia* [00:17:34] Evaluation of Language Models in Social Simulations* [00:20:41] Introduction to Performance-Improving Code Edits Project* [00:26:28] Discussion on DevIn and the Future of Coding Agents* [00:32:01] Planning in Coding Agents and the Development of OpenDevon* [00:38:34] The Changing Role of Academia in the Context of Large Language Models* [00:44:44] The Changing Nature of Industry and Academia Collaboration* [00:54:07] Update on NLP Course Syllabus and Teaching about Large Language Models* [01:00:40] Call to Action: Contributions to OpenDevon and Open Source AI Projects* [01:01:56] Hiring at Cursor for Roles in Code Generation and Assistive Coding* [01:02:12] Promotion of the AI Engineer ConferenceSection B: Benchmarks * Carlos Jimenez & John Yang (Princeton) et al: SWE-bench: Can Language Models Resolve Real-world Github Issues? (ICLR Oral, Paper, website)* “We introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and c
Released:
Jun 10, 2024
Format:
Podcast episode
Titles in the series (75)
97% Cheaper, Faster, Better, Correct AI — with Varun Mohan of Codeium by Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0